URL Optimizations

Everyday, innocent-sounding words like “order” and “cgi” often appear in the body of messages I receive from legitimate mailing lists. Seeing them appear in a URL, however, is much more suspicious. URLs are the spammers’ preferred means of contact. It’s much easier to run a scam using a website as your point of contact than it is to pay for the overhead of a phone system or mail processing department. Spammers also like their privacy, since the rest of the free world hates them, and they prefer that even customers not know how to contact them or the companies they spam for.

Whether it’s a link to click to visit a site or the URL of an image inside the message, URLs provide a lot of useful information specific to their own kind. Even nonsensical numbers will frequently stand out in URLs. This makes for really good data for identifying not only spam but also some legitimate mailing lists that use URLs in their unsubscribe taglines. Users who subscribe to mailing lists that frequently include embedded advertisements (such as Yahoo Groups) will notice specific characteristics of the URLs used in these advertisements that help the filter distinguish between advertising and real spam.

URLs are frequently tokenized differently from the rest of a message. The only delimiters usually used when tokenizing a URL are the slash, question mark, equal sign, period, and colon, although some filter authors perform the same basic type of token separation as they do in the rest of the message body. URL-specific delimiters are used because individual tokens tend to recur according to their place in the URL’s path rather than any specific context inside the URL. Regardless of how they are tokenized, URLs can yield a lot of useful information when analyzed. They can be categorized as places you want to go and places you don’t want to go. A spam containing places you don’t want to go is just as informative as a legitimate message containing places you do.
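The URL-specific tokenization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tokenize_url`, the lowercasing step, and the sample URL are hypothetical, and the `Url*` prefix simply mirrors the token names shown in this section. A real filter may normalize or prune tokens differently.

```python
import re

# The five delimiters named in the text: slash, question mark,
# equal sign, period, and colon.
URL_DELIMITERS = re.compile(r"[/?=.:]")


def tokenize_url(url, prefix="Url*"):
    """Split a URL on URL-specific delimiters and tag each piece.

    The prefix keeps URL tokens separate from body tokens, so "order"
    in a URL is tracked independently of "order" in the message text.
    Lowercasing is an assumption made for this sketch.
    """
    pieces = URL_DELIMITERS.split(url)
    # Adjacent delimiters (such as "://") produce empty strings; drop them.
    return [prefix + piece.lower() for piece in pieces if piece]


# Hypothetical URL, made up for illustration only.
for token in tokenize_url("http://www.thesedealzwontlast.biz/img/order.cgi?id=00026"):
    print(token)
```

Because the tokens carry their own prefix, the filter can learn that “order” and “cgi” are suspicious inside a URL while leaving the same words in the message body alone.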

Url*getitrightnowwholesale   S: 00026   I: 00000   P: 0.9999
Url*thesedealzwontlast       S: 00026   I: 00000   P: 0.9999
Url*biz                      S: 00008   I: 00000   P: 0.9998
Url*us                       S: 00000   I: 00050   P: 0.0001
Url*java                     S: 00018   I: 00000   P: 0.9999
Url*www                      S: 00000   I: 00030   P: 0.0001
Url*com                      S: 00000   I: 00033   P: 0.0001
Url*img                      S: 00066   I: 00000   P: 0.9999

Ironically, legitimate URLs seem to be rare among spammers, while the wild and obnoxious names always pop up, with the exception of “java,” of course, which appeared as spammy only because this user doesn’t use Java (not because Java programmers were spamming). The appearance of certain naming conventions, such as the extensive use of “img,” makes the task of identifying malicious URLs pretty easy. If we wanted to, we could probably determine the disposition of the message based on the URL information alone.
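To illustrate how URL information alone might decide a message, the sketch below combines the P values from the token listing above using the common naive Bayes product rule. This is an assumption made for illustration, not necessarily the combination method the author’s filter uses. The names `URL_TOKEN_P`, `NEUTRAL_P`, and `url_spam_score` are hypothetical, as is the choice to treat unseen tokens as neutral.

```python
from math import prod

# P values copied from the token listing above.
URL_TOKEN_P = {
    "Url*getitrightnowwholesale": 0.9999,
    "Url*thesedealzwontlast": 0.9999,
    "Url*biz": 0.9998,
    "Url*us": 0.0001,
    "Url*java": 0.9999,
    "Url*www": 0.0001,
    "Url*com": 0.0001,
    "Url*img": 0.9999,
}

# Assumption: a token never seen before carries no evidence either way.
NEUTRAL_P = 0.5


def url_spam_score(tokens):
    """Combine per-token probabilities into a single score in (0, 1).

    Uses the naive Bayes product rule:
        score = prod(p) / (prod(p) + prod(1 - p))
    """
    probs = [URL_TOKEN_P.get(token, NEUTRAL_P) for token in tokens]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


spammy = url_spam_score(["Url*thesedealzwontlast", "Url*biz", "Url*img"])
innocent = url_spam_score(["Url*www", "Url*com"])
print(f"spammy URL tokens:   {spammy:.6f}")
print(f"innocent URL tokens: {innocent:.6f}")
```

With per-token probabilities this close to 0 or 1, a handful of URL tokens pushes the combined score almost all the way to one extreme, which is consistent with the observation that URL data by itself is often enough to classify a message.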

URLs containing well-known web addresses, on the other hand, are likely to appear as innocent tokens or as hapaxes (tokens seen only once). Not a single URL token containing the following words has ever appeared in my corpus as spammy:

Url*microsoft

Url*whitehouse

Url*sco

Url*linux

Url*quicken

Url*intuit

Url*amazon

Url*fbi

URL Optimizations - Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification [Electronic resources] (text version)

Jonathan A. Zdziarski
