URL Optimizations
Innocent-sounding everyday words like “order” and “cgi” often appear in the body of messages I receive from legitimate mailing lists. Seeing them in a URL, however, is much more suspicious. URLs are the spammers’ preferred means of contact: it’s much easier to run a scam using a website as your point of contact than it is to pay for the overhead of a phone system or mail processing department. Spammers also like their privacy, since the rest of the free world hates them, and they prefer that even customers not know how to contact them or the companies they spam for.
Whether it’s a link to click to visit a site or the URL of an image inside the message, a URL provides a lot of useful information of its own. Even nonsensical numbers will frequently stand out in URLs. This makes really good data for identifying not only spam but also some legitimate mailing lists that use URLs in their unsubscribe tag lines. Users subscribed to mailing lists that frequently include embedded advertisements (such as Yahoo Groups) will notice specific characteristics of the URLs used in these advertisements that help the filter distinguish between advertising and real spam.
URLs are frequently tokenized differently from the rest of a message. The only delimiters usually used when tokenizing a URL are the slash, question mark, equal sign, period, and colon, although some filter authors perform the same basic type of token separation as they do in the rest of the message body. URL-specific delimiters are used because the individual tokens are more often identified by their position in the URL’s path than by any specific context inside the URL.
Regardless of how they are tokenized, URLs can yield a lot of useful information when analyzed. They can be categorized as places you want to go and places you don’t want to go. A spam containing places you don’t want to go is just as informative as a legitimate message containing places you do.
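The URL-specific tokenization described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code of any particular filter: the function name tokenize_url, the lowercasing step, and the example.com address are assumptions made for the example, while the delimiter set and the “Url*” prefix follow the text and the token listing.

```python
import re

# Delimiters named in the text: slash, question mark, equal sign,
# period, and colon. Runs of delimiters (such as "://") count as one.
URL_DELIMITERS = re.compile(r"[/?=.:]+")


def tokenize_url(url, prefix="Url*"):
    """Split a URL on URL-specific delimiters and tag each piece.

    Lowercasing is an assumption of this sketch; a real filter may
    preserve case.
    """
    pieces = URL_DELIMITERS.split(url.lower())
    # Drop empty strings left by leading or trailing delimiters.
    return [prefix + piece for piece in pieces if piece]


if __name__ == "__main__":
    # Hypothetical URL used only for illustration.
    for token in tokenize_url("http://www.example.com/img/order.cgi?id=48213"):
        print(token)
```

Because each token carries the “Url*” prefix, a word such as “order” found in a URL is tracked separately from the same word found in the message body, which is what lets the two acquire different values.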
Url*getitrightnowwholesale | S: 00026 | I: 00000 | P: 0.9999 |
Url*thesedealzwontlast | S: 00026 | I: 00000 | P: 0.9999 |
Url*biz | S: 00008 | I: 00000 | P: 0.9998 |
Url*us | S: 00000 | I: 00050 | P: 0.0001 |
Url*java | S: 00018 | I: 00000 | P: 0.9999 |
Url*www | S: 00000 | I: 00030 | P: 0.0001 |
Url*com | S: 00000 | I: 00033 | P: 0.0001 |
Url*img | S: 00066 | I: 00000 | P: 0.9999 |
Ironically, legitimate-looking URLs seem to be rare among spammers, while the wild and obnoxious names always pop up. The exception is “java,” of course, which appeared as spammy only because this user doesn’t use Java (not because Java programmers were spamming). The appearance of certain naming conventions, such as the extensive use of “img,” makes the task of identifying malicious URLs pretty easy. If we wanted to, we could probably determine the disposition of the message based on the URL information alone. URLs containing well-known web addresses, on the other hand, are likely to appear as innocent tokens or as hapaxes. Not a single URL token containing the following words has ever appeared in my corpus as spammy:
Url*microsoft
Url*whitehouse
Url*sco
Url*linux
Url*quicken
Url*intuit
Url*amazon
Url*fbi
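The earlier remark that the disposition of a message could probably be determined from URL information alone can be illustrated with a rough sketch. The token values below are copied from the listing; everything else is an assumption made for the example, including the naive Bayesian combination rule, the neutral value of 0.5 for unknown tokens, and the names combine and url_score. A real filter may combine values differently.

```python
import math

# Token values copied from the listing earlier in this section.
URL_TOKEN_P = {
    "Url*getitrightnowwholesale": 0.9999,
    "Url*thesedealzwontlast": 0.9999,
    "Url*biz": 0.9998,
    "Url*us": 0.0001,
    "Url*java": 0.9999,
    "Url*www": 0.0001,
    "Url*com": 0.0001,
    "Url*img": 0.9999,
}

NEUTRAL_P = 0.5  # assumed value for tokens the filter has never seen


def combine(probabilities):
    """Naive Bayesian combination of token values, done in log space."""
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    # Clamp the difference so math.exp cannot overflow on long inputs.
    diff = max(-700.0, min(700.0, log_ham - log_spam))
    return 1.0 / (1.0 + math.exp(diff))


def url_score(tokens):
    """Score a message using only its URL tokens (hypothetical helper)."""
    return combine([URL_TOKEN_P.get(t, NEUTRAL_P) for t in tokens])


if __name__ == "__main__":
    spammy = ["Url*www", "Url*thesedealzwontlast", "Url*biz", "Url*img"]
    innocent = ["Url*www", "Url*com"]
    print(url_score(spammy) > 0.99)    # several strong spam tokens dominate
    print(url_score(innocent) < 0.01)  # only innocent tokens present
```

Under these assumptions, a single innocent token such as “Url*www” is easily outweighed by several strongly spammy URL tokens in the same message.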