Attacks on Tokenizers
The most common attack in the past has been on tokenizers, the heuristic components of a filter that break down a message’s content and convert it into meaningful data. The most common approach spammers use is to obfuscate the message and try to confuse the heuristic functions so that they misinterpret the data. This approach can target not only tokenizers but also the heuristic rule sets of the previous generation of spam filters, as well as any other supporting algorithms used, such as data-polishing algorithms and even collaborative algorithms.
Encoding Abuses
As we discussed in Chapter 5, present-day Internet messages support a set of Multipurpose Internet Mail Extensions, or MIME. Part of MIME includes a series of encodings that can be used to convert non-ASCII data into ASCII data to support the existing mail transport infrastructure. MIME is frequently abused, however. Although these encodings were originally intended for legitimate conversion of non-ASCII data, spammers have been using them for years to hide their message content.
The Problem
Obviously, if the filter can’t read the message, it can’t classify it as spam. The encoding most commonly used by spammers is Base64. This encoding is completely unintelligible to a human and doesn’t even contain any delimiters to make machine analysis useful. A Base64-encoded message will be transported, and will appear to the spam filter, as a bunch of binary junk. The message will be decoded by the user’s mail client back into its human-readable form, having never been read or interpreted by a Base64-unaware spam filter.
The Solution
Fortunately, this problem has an easy solution: just decode all Base64-encoded portions of the message. There are many public domain decoding tools that can be freely used to perform this decoding, or you can even write your own based on the encoding specification.
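As a minimal sketch, Python’s standard library can perform this decoding before tokenization. The helper name and the quoted-printable handling are illustrative assumptions, not part of any particular filter:

```python
import base64
import quopri

def decode_mime_part(payload: bytes, encoding: str) -> bytes:
    """Decode a MIME body part before tokenizing.

    A Base64-unaware filter sees only binary junk; decoding first
    restores the human-readable text the recipient will see.
    """
    encoding = encoding.lower().strip()
    if encoding == "base64":
        return base64.b64decode(payload)
    if encoding == "quoted-printable":
        return quopri.decodestring(payload)
    return payload  # 7bit/8bit/binary need no decoding

encoded = b"VmlhZ3JhIHNoaXBwZWQgb3Zlcm5pZ2h0IQ=="
print(decode_mime_part(encoded, "base64"))  # b'Viagra shipped overnight!'
```

The decoded bytes, rather than the raw Base64 text, are then handed to the tokenizer.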
The Importance
This approach is still used today by many spammers as a last-ditch effort to push their spam through filters. It works only on the most primitive filters that are not Base64-aware, but that doesn’t stop spammers from spending the extra few processor cycles to do it. Some believe this is done to evade outgoing-mail spam detectors, which may be encountered when spamming through an open relay. Most of these filters are Base64-aware as well, but some may not be.
Header Encodings
Header encodings are just as easily abused by spammers as message body encodings. In Chapter 5, we discussed header encodings as a means of using alternate character sets and also provided a few examples of how spammers are using these encodings to conceal guilty text in the message headers.
The Problem
Filters that are unaware of header-based encoding are likely to miss key information embedded in the message headers. While the filter may be able to compensate for this omission by analyzing the rest of the message, some email that is starved for data (such as image spam) could be missed if this information is ignored. Consider the following headers:
Subject: =?iso-8859-1?B?U2lsZGVuYWZpbCBDaXRyYXRlICBTaGlwcGVkIFF1aWNrbHk=?=
From: "Cruz Maldonado" <cmaldonadosv@ccsg.tau.ac.il>
Date: Sat, 17 Apr 2004 22:13:28 +0000
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 8bit
The human-readable data in these headers looks pretty innocent. When we decode the headers, however, we see that the subject is as guilty as they come.
Subject: Sildenafil Citrate Shipped Quickly
The Solution
The solution is easy—detect and decode portions of the message headers that are encoded. The encoding most commonly used by spammers is Base64, discussed in Chapter 5. Spam filters should be looking for these types of encodings in any of the message headers, not just in the subject. Spams use encoding in the From: headers, in the Subject:, and sometimes even in the To: headers. Multiple encoded blocks may be present in any one header, so it’s important to be thorough.
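For illustration, Python’s standard library already handles this decoding; the helper below is a sketch (the function name is my own), using email.header.decode_header, which handles multiple encoded blocks within a single header:

```python
from email.header import decode_header

def decode_mime_header(raw: str) -> str:
    """Decode RFC 2047 encoded-words (=?charset?B/Q?...?=) in a header."""
    parts = []
    for text, charset in decode_header(raw):
        if isinstance(text, bytes):
            text = text.decode(charset or "ascii", errors="replace")
        parts.append(text)
    return "".join(parts)

subject = "=?iso-8859-1?B?U2lsZGVuYWZpbCBDaXRyYXRlICBTaGlwcGVkIFF1aWNrbHk=?="
print(decode_mime_header(subject))
```

The same call should be applied to every header the filter tokenizes, not just Subject:.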
The Importance
The importance of decoding headers is in finding the hidden content. Data that is encoded is likely to be some of the most guilty data in the spam. A majority of spam can still be classified without this information, but adding support for decoding will help whittle away at that annoying little 0.10 percent of messages that make it through most filters.
Hypertextus Interruptus
The term “hypertextus interruptus” was originally coined by Bill Yerazunis and cited in Dr. John Graham-Cumming’s “The Spammers’ Compendium” at http://jgc.org/tsc. HTML comments are a part of the original HTML specification and are designed to allow web page authors to insert miscellaneous comments about whatever it is they’re coding in HTML. While these comments are present in the HTML source code of a document, they are not visible to the end user.
The Problem
HTML comments are abused by spammers to break up guilty-sounding words in spam. Since HTML comments are invisible to the end user, the original words will appear intact while the spam filter will see only a myriad of junk text.
Yes you he<!lansing>ard about th<!crossbill>ese weird <!cottony>little
pil<!domesday>ls that are suppo<!=anabel>sed to make you bigger and of
cou<!chord>rse you think they’re b<!soften>ogus snake potion. Well, let’s look
at the facts: <strong>G<!eigenspace>RX2 has be<!waldron>en sold over 1.9
Mill<!audacity>ion times within the last 18 months</strong>...
With awe<!tapestry>some results for hun<!wield>dreds of thous<!locale>ands of
men all over the planet! They all enjoy a seriously enhanced version of their
manh<!rescind>ood and <b>why shou<!seoul>ldn’t you</b>?
At first glance, the text above looks like a bunch of junk with a few words here and there. That is how a spam filter would view the message, and it would likely end up delivering it. The end user would see the message entirely differently once the hypertext comments were removed by their email client.
Yes you heard about these weird little pills
that are supposed to make you bigger and of course you think
they’re bogus snake potion. Well, let’s look at the facts: GRX2
has been sold over 1.9 Million times within the last 18 months...
With awesome results for hundreds of thousands of men all over the planet!
They all enjoy a seriously enhanced version of their manhood and why shouldn’t
you?
The Solution
The solution to hypertext comments is to remove them, reassembling any words that they might have broken up. HTML comments can be identified by a leading <! symbol. Comments that span multiple lines usually end with -->, while short HTML comments end with just a single bracket >. Another thing spammers do to try to confuse filters is add legitimate-sounding text inside the HTML comments. Since the text inside HTML comments will never be visible to the end user, the best practice is simply to ignore any data in comments. Finally, the hypertext used to separate tokens doesn’t have to be in the form of HTML comments. Nonexistent HTML tags containing random words can be used in the same way to split up guilty tokens. A tokenizer should be suspicious of HTML tags without spaces, or with a length too long to possibly be a legitimate HTML tag.
Spam filters using multiword tokenizers may end up identifying the individual tokens even without HTML comment filtering. For example, if the window size of the tokenizer is three, the tokenizer might recognize “manh<!rescind>ood” as three separate tokens: “manh,” the comment, and “ood.” Tokenizers such as SBPH, which we’ll discuss in Chapter 11, will reassemble this using word skipping, but this approach is still imperfect. Filtering out the HTML comments seems to be the easiest and best practice, as it both eliminates the fake legitimate data and puts an emphasis on the guilty data.
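A regex-based sketch of this stripping step follows; the 16-character threshold for “implausibly long” space-free tags is an arbitrary assumption:

```python
import re

# Strips well-formed comments (<!-- ... -->), degenerate short comments
# (<!junk>), and implausibly long space-free pseudo-tags, which real
# HTML never produces.
FAKE_MARKUP = re.compile(
    r"<!--.*?-->"        # standard comments, possibly multi-line
    r"|<![^>]*>"         # short-form comments like <!lansing>
    r"|<[^>\s]{16,}>",   # absurdly long pseudo-tags with no spaces
    re.DOTALL,
)

def strip_hypertext_interruptus(html: str) -> str:
    return FAKE_MARKUP.sub("", html)

sample = "Yes you he<!lansing>ard about th<!crossbill>ese pills"
print(strip_hypertext_interruptus(sample))  # Yes you heard about these pills
```

Removing the markup, rather than tokenizing it, reassembles the broken words before the filter ever sees them.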
The Importance
Data loss could occur if some type of filtering isn’t used to remove these HTML comments. Since these comments can be used anywhere in the message body, an entire message could be lost if the filter fails to assemble these tokens. On the other hand, an unaware filter will begin to learn these token fragments, such as “manh” and “ood” instead of “manhood,” and eventually these will be clear identifiers of spam. This is a somewhat slower learning process, as spammers can break up guilty words in a number of different ways: “m,” “ma,” “man,” “manh,” “manho,” and “manhoo.” This increases the size of the dataset and takes longer, but once these tokens are trained they are usually very clear markers of spam.
ASCII Spam
Believe it or not, the first ASCII spam was actually detected about six months before this book was written. The idea behind ASCII spam is to provide nothing but junk data to the tokenizer so that it will pass the message through. ASCII spam uses ASCII art to draw pictures of concepts rather than spell them out or use images, which have a consistent framework. Fortunately, ASCII spams don’t appear correctly in every mail client, and they are also not very good at providing the much-needed teasers spammers rely on to attract middle-aged men to click a pornographic link (as you can see in Figure 7-1).

Figure 7-1: An ASCII spam
The Problem
The tokenizer approaches we discuss in this book don’t include any provisions for managing ASCII spam and will end up providing only the intelligible part of the message body and the message headers to the filter. ASCII spam generally consists of many different types of text, commonly a large percentage of characters that a filter would consider delimiters.
The Solution
There are many potential solutions for dealing with this kind of spam. Fortunately, many of the characters in this type of spam are normally treated as delimiters. This results in much of the message content being ignored, which helps whatever guilty data there is to stick out. The rest will be tokenized as garble, but it’s useful garble. I’ve seen only a few different ASCII spams, but many of them use the same types of garble to construct words and objects (primarily naked ladies, which generally have the same artistic detail as the mud-flap girls on semi-trailers—mama sita!). One thing spammers just can’t get around is that they have to provide a link to click on or some other way to contact them. It’s possible for this link also to be an illustration in ASCII art, but then the spammer would lose the instant-click or copy-paste feature. The URL itself, which is the spammer’s only means of contact, becomes more prominent in the absence of any other message body data, and that helps in detecting these messages.
Since these URLs are important, the tokenizer should include URL-specific tokens as part of its implementation. Header tokens are also important. Finally, if all else fails, collaborative algorithms such as message inoculation are very effective at inoculating the other users in a group against certain types of ASCII spam. This usually isn’t necessary, though. My filter doesn’t seem to have any trouble identifying the few ASCII spams I did receive, based on all of the other information in the message. I suspect that if things get worse and ASCII spams start to pop up everywhere, filters will easily adapt to identify them. ASCII spams are also cheesy, and spammers appear to realize this. The “teaser” effect is pretty much lost when using ASCII, so it behooves the spammer not to use it. Still, should a spam get through, it’s much better than letting a raunchy picture spam through, and it might even give you a little giggle at what a corny spammer you’re dealing with.
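As a sketch, a tokenizer might emit URL-specific tokens along these lines; the “Url*” prefix convention here is an illustrative assumption, not a standard:

```python
import re

# Matches http/https links embedded anywhere in the message body.
URL_RE = re.compile(r"https?://[^\s<>\"]+", re.IGNORECASE)

def url_tokens(body: str) -> list:
    """Emit URL-specific tokens so a lone link in an otherwise
    garbled ASCII spam still stands out to the classifier."""
    tokens = []
    for url in URL_RE.findall(body):
        host = url.split("://", 1)[1].split("/")[0]
        tokens.append("Url*" + host)  # "Url*" prefix: illustrative convention
    return tokens

print(url_tokens("junk art junk http://pills.example.com/buy junk"))
```

Because the hostname token is namespaced apart from body text, even a message that is otherwise all ASCII art contributes strong evidence through its one clickable link.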
The Importance
ASCII spam hasn’t been used enough to determine whether or not it will affect the accuracy of filters, but it is believed that it will do so to some small degree. Messages like this are commonly evaluated based on their message headers and any plain text included in the message body. Some commonly used constructive text may also help not only to identify the message as spam but also to categorize these types of spams by artist. This is an area on which filter developers will definitely be focusing in the future if the approach becomes more popular.
Text-Splitting
We’ve already discussed how spammers use HTML comments to break up guilty-sounding tokens. Another approach they are using is text-splitting. Text-splitting is designed to degenerate guilty tokens into mere characters, making them indistinguishable from legitimate uses of the letters.
The Problem
Text-splitting uses a series of known delimiters to break up guilty-sounding tokens into single characters. For example,
From: Alicia Johnson <Alicia_Johnson____r-vtgzcjtkgcakvb@wholebargain.com>
Reply-To: Alicia Johnson <Alicia_Johnson____r-vtgzcjtkgcakvb@wholebargain.com>
Subject: Get your F/R/E/E 10 Day Supply N/O/W!
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="_----------=_2656431139258145356951"
List-Unsubscribe: <mailto:unsub-vtgzcjtkgcakvb@wholebargain.com>
In this message, the subject includes guilty text that was split up using delimiter characters. This approach isn’t limited to a message’s headers; it can be used in the message body as well. If the filter doesn’t compensate for this, the data could become lost or wind up as degenerate in the dataset.
The idea behind this type of attack is to prevent the filter from seeing the words “FREE” and “NOW!” and instead to make it see only letters, which may leave it uncertain about the disposition of the message.
The Solution
Fortunately, this approach backfires most of the time. There is plenty of other guilty data for the filter to work with, and most filters have no problem identifying spams that use this technique. Even heuristic-based filters have added rule sets to identify the extensive use of obfuscation. Statistical filters may or may not use the resulting individual characters in the dataset. Some data is just as guilty in its single-character form as it is in whole-word form. For example, the letters “S” and “X” by themselves are extremely guilty in my dataset.
To compensate for text-splitting, some filters have applied a form of token reassembly that will search for single-character tokens adjacent to other single-character tokens and attempt to group them together. For example,
F-R-E-E V/I/A/G/R/A
can easily be reassembled into “FREE” and “VIAGRA” just by looking for adjacent single characters. Since there are a lot of different ways to split up text in this fashion, tokens don’t always reassemble perfectly. They are usually accurate enough for the filter to identify anyway, such as the token “AGRA.” A multiword-capable tokenizer can also identify the different components of partly reassembled tokens and form joined tokens such as “VI+AGRA.” There are only a finite number of ways to split up individual words. By the time they’ve seen a few spams, most filters have learned what they need to know to successfully identify the different permutations in future messages.
Other ways that spammers split up text include the use of noncommenting HTML tags. For example,
V<FONT SIZE=0> </FONT>
I<FONT SIZE=0> </FONT>
A<FONT SIZE=0> </FONT>
G<FONT SIZE=0> </FONT>
R<FONT SIZE=0> </FONT>
A
These types of approaches are futile, because they provide even more interesting data than the original guilty word. The different font tags and other types of HTML junk inserted between characters are an easy identifier of spam. Implementing multiword-capable tokenizers, such as chained tokens (which we’ll discuss in Chapter 11), can greatly improve the ability to identify these types of attacks, although it is rarely necessary to do so. One advantage these advanced tokenizers have is the ability to associate each letter with an HTML tag to generate a very guilty token used exclusively by spammers.
And that’s the catch—spammers can use whatever obfuscation techniques they want to obscure text, but they leave plenty of trace evidence that the filter is capable of learning. Filters have a way of performing their own “email forensics” to detect these subtle attempts.
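The single-character reassembly described earlier can be sketched with a regular expression; the set of delimiter characters shown is an assumption, and as noted, imperfect reassembly is often good enough:

```python
import re

# Runs of single letters separated by one delimiter each
# (F-R-E-E, V/I/A/G/R/A) are collapsed back into one token.
SPLIT_RUN = re.compile(r"\b(?:[A-Za-z][-/\\.|*_+]){2,}[A-Za-z]\b")

def reassemble(text: str) -> str:
    # Drop the delimiters inside each matched run, keeping the letters.
    return SPLIT_RUN.sub(lambda m: re.sub(r"[^A-Za-z]", "", m.group()), text)

print(reassemble("Get your F/R/E/E 10 Day Supply N-O-W"))
```

Requiring at least three split characters per run keeps ordinary hyphenated words and abbreviations from being mangled.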
The Importance
Just like other approaches to obfuscation, text-splitting can potentially cause data loss. Also like other approaches, the degenerated data may be more interesting than the original data. Text reassembly is one of those areas where a self-evaluating algorithm could be used to determine if certain texts should be reassembled.
Table-Based Obfuscation
A rarely used approach to breaking up guilty text involves using a table to break words up into individual characters for each line. When the message is displayed to the user, the characters are reassembled, making the message appear as if it were a whole text. John Graham-Cumming first identified this approach in “The Spammers’ Compendium.”
The Problem
Text that is split up using tables is difficult to reassemble. As a result, much of the data could possibly be lost, leaving the filter with a limited amount of data to classify the message.
<table cellpadding=0 cellspacing=0 border=0><tr>
<td><table cellspacing=0 cellpadding=0 border=0><tr><td>
<font face="Courier New, Courier, mono" size=2>
<br>U<br> <br>O<br>a<br> <br>D<br>u<br>a
<br> <br>N<br> <br>B<br>d<br> <br>N<br>
<br>C<br> <br>C<br>w<br> <br>1<br> <br>
<br> <br>1<br> <br>C<br>S<br></font></td></tr></table></td>
<td><table cellspacing=0 cellpadding=0 border=0><tr><td><font
face="Courier New, Courier, mono" size=2>
<br> N <br> <br>bta
<br>nd <br> <br>ipl
<br>niv<br>nd <br>
<br>o r<br> <br>ach<br>ipl
<br> <br>o o<br> <br>onf<br> <br>ALL<br>
ith<br> <br> <br> <br> <br> <br>
<br> <br>all<br>und<br></font></td></tr></table></td>
<td><table cellspacing=0 cellpadding=0
border=0><tr><td><font face="Courier New, Courier, mono" size=2>
<br>I V<br> <br>in <br>the <br> <br>oma<br>ers<br>lif<br>
<br>equ <br> <br>elo<br>oma<br> <br>ne <br>
<br>ide<br> <br> NO<br>in <br> <br>3 1<br> <br>
<br> <br>2 1<br> <br> 24<br>ays <br></font></td></tr></table></td>
<td><table cellspacing=0 cellpadding=0 border=0><tr><td><font
face="Courier New, Courier, mono" size=2>
<br> E<br> <br>a <br> a<br> <br>s <br>it<br>e <br> <br>ir<br> <br>rs<br>s
<br> <br>is<br> <br>nt<br> <br>W <br>da<br> <br> 2<br> <br> <br>
<br> 2<br> <br> h<br> a<br></font></td></tr></table></td>
What looks like a garble of text actually appears in the recipient’s mail client looking like a legitimate whole message (shown in Figure 7-2). This approach is particularly devious but requires a bit of work and isn’t guaranteed to succeed.

Figure 7-2: A mail client’s rendering of a table-sliced message (courtesy of “The Spammers’ Compendium,” http://jgc.org/tsc)
The Solution
As is the case with many approaches like this, the amount of HTML code it takes to generate this kind of attack is itself a very detectable marker of spam. The use of many different combinations of HTML tags, such as table identifiers, often gives away the message as spam. The degenerated data itself can even provide useful evidence for the filter. Message headers can also be dead giveaways. All of the typical bogus information used in headers can’t be avoided using this approach, and ultimately most filters can still effectively identify these types of messages. Filter authors could code a rather complex “pre-filter” to parse this information. The parsing would first break up the table column by column and assemble individual column cells together. In most cases, however, this is too much work. Since this approach is used only rarely and is easily detectable without parsing the table, most filters don’t need to do anything special to detect these messages.
The Importance
Most filter authors prefer to allow the tokenizer to grab all of the useful HTML that this spamming approach uses, which then becomes a clear identifier of spam. It’s not necessary for a filter to directly implement any type of table disassembly or token reassembly, and many believe that doing so even hurts accuracy. As long as approaches like this continue to leave an HTML signature of what they’re doing, it won’t be necessary to counter these types of attacks.
URL Encodings
Previously, we discussed the use of encodings to translate different parts of a message. Another type of encoding that is frequently abused is URL encoding. It’s still unclear why web browsers incorporated some types of URL encoding, as they appear only to promote misdirection. For example, some older versions of browsers support the 32-bit decimal representation of an IP address, or even a hexadecimal representation. This support has been removed from some newer browsers to prevent malicious use but still exists in many mainstream web browsers. The more important encodings, however, are genuinely necessary for representing special characters, such as spaces, in URLs.
The Problem
URL encodings that use hexadecimal codes can be used to break up guilty tokens inside a URL. For example,
http://127.0.0.1/%69%6e%64%65%78%2e%68%74%6d%6c
is actually an encoded representation of
http://127.0.0.1/index.html
Other types of encoding can include the use of HTML ASCII values to create characters. The same URL could be represented as follows:
http://&#49;&#50;&#55;&#46;&#48;&#46;&#48;&#46;&#49;/index.html
The actual URL isn’t clickable but could easily be embedded in an HTML message for show.
The Solution
Filters should be aware of these encodings in URLs and perform decoding on them. Although there is most likely a significant amount of other information in the email to qualify it as spam, it’s possible that microspams (spams containing very little data) containing only a single encoded URL could pass through the filter if they aren’t first decoded. URL-specific tokens can also make the identification of these microspams easier.
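A sketch of this normalization using Python’s standard library (the helper name is my own):

```python
from urllib.parse import unquote
import html

def normalize_url(url: str) -> str:
    """Undo the encodings spammers hide behind before tokenizing a URL."""
    url = html.unescape(url)  # &#105; style HTML character references
    url = unquote(url)        # %69 style percent-encoding
    return url

print(normalize_url("http://127.0.0.1/%69%6e%64%65%78%2e%68%74%6d%6c"))
```

Tokenizing the normalized form exposes guilty words like “index.html” (or worse) that the encoded form would have hidden.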
The Importance
The encoded portions of the URL are usually the guiltiest. If a tokenizer were to leave the URLs encoded, it would generate tokens such as “105,” “110,” and so on, which correspond to the decimal codes for individual characters. This by itself may be useful information, but the guilty words in URLs are often believed to be even more revealing. Some filters also ignore whole numbers unless they are mixed with some other punctuation, such as a dollar sign. If the filter doesn’t decode these types of URLs, it should at the very least use a pound sign in the token to set them apart from other types of numeric tokens.
Symbolic Text
The use of special characters and numbers to replace certain characters in guilty tokens has been seen in the wild since around the end of 2002.
The Problem
Guilty tokens are obfuscated with numbers and special characters to prevent spam filters from detecting them.
wõrk fr[]m h0me v1agrá
The Solution
These types of attacks are actually beneficial to identifying spam, because they set the text apart from legitimate text. Since these tokens are different from their plain-text counterparts, filters usually look at them as a completely different token—one that no legitimate user would ever use. Tokenizers can be made aware of these types of tricks, although in most cases no action is required to cause this attack to backfire on spammers.
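If a tokenizer does want to map such tokens back to their plain-text forms, a lookalike-character table is one way to sketch it. The table below is purely illustrative; as noted, many filters skip this step entirely because “v1agrá” is itself a strong spam marker:

```python
# Hypothetical lookalike table mapping digits and accented
# characters to the letters they visually imitate.
LOOKALIKES = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s",
    "õ": "o", "á": "a", "@": "a", "$": "s",
})

def deobfuscate(token: str) -> str:
    return token.translate(LOOKALIKES)

print(deobfuscate("wõrk fr0m h0me v1agrá"))  # work from home viagra
```

A filter could even record both the raw and the deobfuscated token, keeping the uniquely spammy original while also reinforcing the plain-text counterpart.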
The Importance
No special dispensation needs to be made for these types of tricks. In fact, spammers are doing us a favor by using them, as they are clear indicators of spam. No legitimate message would use the word “v1agrá.” One of the biggest mistakes spammers can make is using approaches that generate unique identifiers of spam.
Just Plain Dumb
Plenty of approaches to obfuscating the message are just plain dumb and not worth even considering when coding a filter. In spite of popular myth, the approaches in this section are generally ineffective and lack imagination. Some of these include the following.
Breaking Up URLs (Dumb)
Spammers will break up the URL to their website so that the spam filter doesn’t consider it to be a URL-specific token. For example,
type http://www then the following URL into your web browser:
.somewebsite.com/somepage.html
It’s true that this will prevent somewebsite.com from ever being treated as a URL-specific token, but the spammer will be able to get away with this only once before the token is learned. On top of this, the spammer generates more guilty-sounding text by giving the user instructions for typing in the URL! It’s unlikely that such a message would ever get past a trained filter, even the first time.
Embedded JavaScript (Really Dumb)
Some spam has been cleverly devised to include the entire message contents in JavaScript. The JavaScript will then populate the message window when the user opens the document. This type of trick may fool some unaware heuristic filters but is ultimately useless against statistical filters because it leaves so much trace evidence. Any email with embedded JavaScript is suspicious, and the fingerprint many spams provide is easy to identify. On top of this, many mail clients have moved away from allowing JavaScript to run inside an email, so the email would appear blank to the end user.
Removal of Whitespace (Stupid)
One of the more idiotic ideas spammers have had involves removing AllTheWhiteSpaceFromAMessage. The idea is to trick spam filters by making the text unintelligible without delimiters. This approach works all too well and makes the message undecipherable—even to humans! Since this affects the spammer’s message itself (not to mention making the spammer look stupid), this approach is very rare and is found only among the dumbest spammers.