Chained Tokens
Chained tokens are also known as multiword tokens, word pairs, phrases, and bigrams. The concept involves a simple data processing algorithm designed to provide more specific (and much better) data for the existing statistical algorithms. As the name implies, this algorithm is based on the concept of chaining adjacent tokens together to make new tokens. This approach creates somewhere in the vicinity of n - 1 additional data points to work with, where n is the number of tokens generated by normal single-word tokenization, providing a total of approximately 2n - 1 tokens, depending on any special optimizations for headers and such.

Chained tokens are individual tokens combined with their adjacent partners to form new tokens, which eventually leads to the identification of specific concepts in a message. Chained tokens aren’t a replacement for individual tokens but a complement, used in tandem with them for better analysis. If the filter finds that certain concepts stand out more than some of the individual tokens in the message, these concepts will be given a slot in the decision matrix. This process involves merging different sets of tokens, some of which describe concepts. For example, if we are analyzing an email with the following phrase,
CALL NOW, IT'S FREE!
four tokens will be created using standard primitive tokenization:
CALL
NOW
IT’S
FREE!
With chained tokens, we also create the following additional tokens:
CALL NOW
NOW IT’S
IT’S FREE!
Thus, the familiar concepts of “call now” and “it’s free!” are created. However, just because the phrase “now it’s” doesn’t make sense to us as a concept doesn’t mean it isn’t one. The filter will eventually tell us. A few basic tokenizing rules apply to the implementation of chained tokens:
No chains are created between the message header and the message body; individual message headers from other parts of the message are treated as separate components from their message part bodies.
When processing the message header, no chains are created between individual headers. This means that n - 1 additional tokens will be created for each header, where n is the number of tokens in that header.
Any kind of token can be combined with any other kind of token, so long as the two tokens are adjacent to one another. For example, words can be combined with nonwords such as dollar amounts.
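To make these rules concrete, here is a minimal sketch of the chaining pass in Python, assuming a simple whitespace-style tokenizer and the "+" joining convention used in the examples above; the function names are illustrative and are not taken from any particular filter's source code.

```python
import re

def tokenize(text):
    # Primitive single-word pass; dropping commas is a simplification so the
    # output matches the "CALL NOW, IT'S FREE!" example above.
    return re.findall(r"[^\s,]+", text)

def chain(tokens):
    # Combine each token with its adjacent partner to form a chained token.
    return [f"{a}+{b}" for a, b in zip(tokens, tokens[1:])]

body = "CALL NOW, IT'S FREE!"
singles = tokenize(body)      # ['CALL', 'NOW', "IT'S", 'FREE!']
chained = chain(singles)      # ['CALL+NOW', "NOW+IT'S", "IT'S+FREE!"]
tokens = singles + chained    # n singles plus n - 1 chains: roughly 2n - 1 tokens
```

Headers would be tokenized the same way, but chained only within each individual header and never across the boundary between the header and the body.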
Chained tokens are very easy to implement and provide many different types of intelligence to statistical filtering, as we’ll discuss later. The biggest advantage is that we now have more descriptive data with which to identify concepts in messages. The data generated by chained tokens isn’t too specific, however, and so we’ll be able to identify these concepts among many different messages, even ones written by different spammers.
Case Study Analysis
Now that we have an understanding of chained tokens, let’s explore their purpose in statistical filtering. The philosophy behind chained tokens is this: given two tokens, A and B, chained tokens enable a filter to calculate not only the interest of A and B individually, but also the interest of AB, which may or may not be greater than the individual tokens. If the two tokens combined are indeed more interesting, we may have created a concept that can be identified in other messages and that may have a context-specific probability that can help the filter better identify both spams and legitimate emails. This eventually leads to better spam identification and fewer false positives. In most spams, any identifying language patterns or concepts can be ascertained with only two adjacent tokens. For example,
Play Lotto
Hot Girls
Cheap Viagra
Remove From
It takes only two tokens together to get a very specific idea of the concept behind an email. If someone were reading your email out loud using only one or two interesting pairs of adjacent words, it would be easy to ascertain whether or not the message was spam. Therefore, a statistical filter should be able to perform this function with 15 pairs of words. Also, since spams frequently differ from one another, it makes sense to keep the data in chunks that are as small as possible while still being descriptive enough for an accurate comparison. Otherwise, you run the risk of having data so specific that you miss variations. With all these things considered, chained tokens strike a very good balance between descriptiveness and verbosity in the building of concepts.
Pattern Identification
Let’s take a look at some examples of chained tokens used in a real-life environment. Below are some individual tokens and their chained counterpart, together with their associated probabilities.
all | 0.5513 |
the | 0.5699 |
all+the | 0.2541 |
This example shows that the words “all” and “the” are fairly uninteresting, in that their probability is not far from a neutral 0.5. Since they appear in a significant number of messages, both spam and innocent, the filter doesn’t see much value in them, and they are likely to be omitted from any decision matrix, even if it were starved for data. While these words alone may not be interesting enough to play a role in the analysis of a message, we see that together the phrase “all the” is a much more interesting piece of data when it shows up. This chained token has a much greater likelihood of showing up in the decision matrix of a message, improving its chances of identifying an innocent message. We’ll explain why shortly. Now let’s look at a set of tokens with the opposite effect on a mail corpus.
color | 0.3282 |
#000000 | 0.5794 |
color+#000000 | 0.9684 |
The word “color” is fairly innocent in this user’s dictionary, as it can be used to describe a number of things—the color of a font or the color of a shirt. We really have no idea what’s being described here, which is part of the problem—there are words, but no accompanying concept. The hexadecimal code for black (#000000), on the other hand, is a very neutral token for different reasons; obviously, it is used in some innocent HTML-based emails the user receives as well as spam. If these two individual tokens were used in analyzing a message, you would end up with a rather mediocre result. To make matters a bit more agonizing, the word “color” is a little more likely to show up in a Bayesian calculation, while the less interesting hexadecimal code is not, which could skew the result toward identifying the message as a legitimate email. However, the combination of the two individual tokens creates a concept—the idea that we’re talking about the color of an HTML component (a font or background color). We thus end up with a very guilty chained token that would dramatically change the result of the calculation.
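To see how a chained token can end up so guilty while its two halves stay near neutral, here is a hedged sketch of a Graham-style per-token probability estimate, clamped to the 0.01 and 0.99 extremes visible in the values above. The counts below are invented purely for illustration, and real filters vary in the exact formula they use.

```python
def token_probability(spam_hits, ham_hits, spam_msgs, ham_msgs):
    # Tokens the filter has never seen stay neutral.
    if spam_hits + ham_hits == 0:
        return 0.5
    s = spam_hits / max(spam_msgs, 1)   # relative frequency in the spam corpus
    h = ham_hits / max(ham_msgs, 1)     # relative frequency in the innocent corpus
    p = s / (s + h)
    return min(max(p, 0.01), 0.99)      # clamp, matching the 0.0100/0.9901 extremes

# "color" and "#000000" are counted on their own, but "color+#000000" gets its
# own record in the dictionary; with these invented counts it lands near 0.97
# even though each half of the chain is close to neutral.
print(token_probability(spam_hits=300, ham_hits=10, spam_msgs=500, ham_msgs=500))
```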
Why wasn’t the chained token also neutral? After all, a font color can be part of an innocent email as well as a spam. The answer is that the HTML generators used to author spams are slightly different from those used for legitimate messages. Outlook, Eudora, and Thunderbird write their HTML code differently (and in a different case) than bulk mail senders do, which will result in slightly different tokens being created. For example, a legitimate email client may use one of the following:
COLOR+#000000
COLOR+BLACK
Color+#000000
Color+Black
Also keep in mind that the chained token is going to appear most often when setting the font color. Black is a very popular color for big, bold text in spams. Since the hexadecimal code for black was by itself neutral, we can draw the conclusion that both legitimate messages and spams have used the color in a background setting or in some other part of their HTML code that did not use the color tag. For example,
bgColor | 0.5076 |
bgcolor | 0.4758 |
As shown here, the background color is set several times in both legitimate messages and spams, so the bgcolor tag by itself means nothing. A lot of people aren’t very heavy users of HTML-based email, and so differentiating HTML code from actual conversation may not need to be especially granular. What we want to avoid is turning our spam filter into a mere HTML recognizer—the presence of HTML tags alone doesn’t necessarily make a message spam (even for a user who doesn’t use HTML), so we want to be very careful about which HTML tags we parse to avoid filling up the decision matrix with HTML tokens.
Differentiation
We’ve learned so far that chained tokens can be used to improve the accurate identification of both legitimate messages and spams, and can play a significant role in the identification of what we’re calling concepts. Let’s take a look at another use for chained tokens.
Content-Type*mixed | 0.4334 |
Content-Type*multipart | 0.5222 |
Content-Type*multipart+mixed | 0.0100 |
The Content-Type headers define the presentation of data in an email. A content type of multipart/mixed is used by many legitimate email clients, and so it’s generally considered a legitimate content header. Most bulk-mail senders prefer to use multipart/alternative, and because “multipart” is used in both forms of email, the word “multipart” becomes a very neutral token. If we were using single words only, without any concept-based implementation, we would have discarded all of this useful information. With chaining, however, we’re able to catch the multipart/mixed and multipart/alternative combinations as distinct tokens, so that our filter becomes capable of identifying the concept of multipart/mixed email—a concept that it would otherwise be ignoring. Another good example of this is found in mailing lists:
unsubscribe | 0.5510 |
unsubscribe+from | 0.0833 |
Spams and mailing list emails alike use the word “unsubscribe”; however, only the mailing lists this user subscribes to use the phrase “unsubscribe from.” All this time, spammers have been trying to obfuscate their bogus unsubscribe language to hide it from spam filters. They were so successful that it’s become very unlikely that you’ll find a spam with this phrase—at least until the spammers get a little brighter. When this happens, other unique concepts will simply take over. That’s one of the great things about using a Bayesian window—it allows a certain amount of buffering for new types of spammer tricks. When spammers start using “unsubscribe from,” the decision matrix will buffer against an immediately erroneous decision and also take into account the mention of “free porn.”

Eventually, we’ll get to the point in filter development where multiple concepts can be correlated with one another. The presence of an “unsubscribe from” token in the absence of other mailing-list-like properties could be used to eliminate the token as noise, or even to invert the probability of the token in future AI logic. We’ll look at the possibilities of doing this later in this chapter, when we discuss Karnaugh mapping.
HTML Classification
We’ve already looked at how chained tokens can be used to identify HTML tags as opposed to text used in conversation. Yet another use for chained tokens is in the example that follows, which exhibits how well chained tokens can function to identify the different patterns of HTML code used by both spammers and legitimate users.
FONT | 0.4573 |
face | 0.5506 |
FONT+face | 0.2084 |
The “FONT face” token was found to have a much more innocent disposition than the two neutral tokens alone. This is due in part to the many other font tags used in spams (such as font size) that separate these two tokens. As we’ve discussed, many spam generators use their own type of formatting. Generators that spammers use might prefer “FONT COLOR” first, then “FACE” after, while Microsoft Outlook prefers “FONT face.” The case used can also be very different between generators. “Font,” “FONT,” and “font” are all different data to most tokenizers. If you’re looking at individual tokens, you’re going to miss all this, because just about every HTML message uses the FONT, FACE, and COLOR tags. What primitive tokenizers haven’t seemed to figure out is that many clients have a unique method of ordering, which chaining can capitalize on.
Contextual Analysis
Now we’ll move on to one of my favorite uses for chained tokens—contextual analysis. Yes, Virginia, chained tokens can be used to identify language patterns! As normal everyday people, we frequently engage in many different types of discussions. Some are informal, some are hot and heavy, and if you’re a hacker, some will most likely involve a lot of flaming. Spammers also use their own unique style of conversation, including canned messages, raunchy and perverse advertising, and obnoxious, used-car-salesman-type pitches. Chained tokens do a great job of performing natural language analysis of these individual types of contexts. You can’t tell the context of a message simply by looking at the word “that,” but when it is combined with other words, you can tell that the discussion is informal. The phrase “that sent” is informal and grammatically sloppy. This is a great indicator of legitimate mail for many users, and our filter will now be able to identify such types of concepts:
that | 0.4233 |
sent | 0.4042 |
that+sent | 0.0100 |
Here are some more examples, selected at random. Some are innocent; others are very guilty. It’s easy to figure out which ones come from what type of text just by looking at them.
always+very | 0.0100 |
all+the | 0.2541 |
all+you | 0.6072 |
brought+you | 0.7541 |
email+because | 0.9901 |
that’s+have | 0.0100 |
that+sent | 0.0100 |
that+the | 0.0099 |
that+wouldn’t | 0.0100 |
that+you | 0.8362 |
Click+Here | 0.9901 |
Spammers have tried sending emails with “everyday common” text, and filters don’t seem to have any trouble identifying the difference between their text and legitimate text. Even when a spammer succeeds in crafting a message with enough “common text,” there is almost always an overabundance of guilty text to offset it. Further enhancements we’ll discuss later, such as Bayesian noise reduction, can make “common-text” spams completely ineffective.
Other Uses
What else are chained tokens good for? They enable the filter to more accurately identify legitimate email addresses in the From: and To: headers. This helps to create a better “whitelist” for innocent messages. For example, our primitive approach to tokenizing would give us only these tokens:
From*president
From*whitehouse
From*gov
all of which could easily be used by many other senders than the president, including the forged headers sent with spams. Implementing chained tokens allows us to create more sender-specific tokens from these headers:
From*president+whitehouse
From*whitehouse+gov
These are arguably concepts as well: the concept of a specific email address in the header. Attempts are frequently made by spammers to “fool” primitive tokenizers into thinking that their spam is a legitimate email from McAfee, Microsoft, or some other well-known domain. They frequently fail to fool even the most basic implementations of primitive filters, but they especially don’t fool multiword-capable filters. Manual whitelisting is cumbersome, but statistical whitelisting can be extremely powerful (and cool!).

The same approach can also be used to identify the x-mailer more accurately (for example, Microsoft Outlook was considered a more legitimate mailer in my dataset than Microsoft Outlook Express), and the chained token allows us to realize this with one token instead of two. This brings us to our next useful function for chained tokens, and that is de-duplication.
Traditional Bayesian combination will incorporate only the 15 most interesting tokens into a decision window. If you eliminate the somewhat interesting words by replacing them with very interesting words, you have a more accurate window that ends up playing a significant role in that very small percentage of emails (such as list digests with embedded advertisements) for which limiting a calculation to 15 tokens results in a decision based only on the randomness of the decision matrix.
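A minimal sketch of this de-duplication effect follows, assuming that a token's interest is simply the distance of its probability from the neutral 0.5. The window size of 15 and most of the probabilities come from the examples in this chapter; the value for "from" is invented to round out the example, and the selection logic is illustrative rather than any specific filter's implementation.

```python
def decision_window(token_probs, size=15):
    # Rank tokens by how far their probability sits from neutral (0.5) and
    # keep only the most interesting ones for the final calculation.
    ranked = sorted(token_probs.items(),
                    key=lambda kv: abs(kv[1] - 0.5),
                    reverse=True)
    return ranked[:size]

probs = {
    "unsubscribe": 0.5510, "from": 0.4800,   # mildly interesting singles
    "unsubscribe+from": 0.0833,              # the chained concept
    "all": 0.5513, "the": 0.5699, "all+the": 0.2541,
}
for token, p in decision_window(probs, size=3):
    print(token, p)   # the chained tokens displace their near-neutral parts
```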
Administrative Concerns
The only administrative concern that can arise when using chained tokens is the amount of extra disk space consumed by the additional token records in each user’s database. The average user’s dictionary, without using chained tokens, seems to average around 500 KB to 1 MB in size. This means that an ISP with 30,000 customers will require 15 to 30 GB of additional disk space to accommodate an implementation without chained tokens (a very acceptable requirement). The average size of a dictionary using chained tokens, however, was originally around 10 to 20 MB, which would require several additional disks for a very large-scale implementation. This is still acceptable, but with a little less enthusiasm. The increased disk space requirement has inspired some alternative solutions to help manage the space concern.

Additional purge tools can be implemented to purge ineffective chained tokens from the dictionary that have been stale for a given period of time or that don’t yield a particularly interesting value. DSPAM purges any tokens that fall within certain bands of probability—for example, between 0.30 and 0.70—and have remained stale for more than a week or two. It’s also advantageous to remove any tokens that have remained stale for a long period of time regardless of their probability—say around 60 days. This will usually remove about half to two-thirds of all chained tokens, leaving a very tight core of useful learned concepts. In one case study, a dictionary containing a total of 70,000 tokens was purged to 34,000 tokens, 20,000 of which were effective chained tokens. These 20,000 chained tokens played a significant role in accuracy.

Another way to control the amount of disk space used by chained tokens is to convert the tokens into numeric values. Numeric values can generally be represented with fewer bytes and can also provide a faster implementation of indexing. Many filters convert tokens into a 64-bit checksum (an “unsigned long long” integer value). This changes the key length of each token in the database from variable length to only 8 bytes. A majority of chained tokens are well over 8 bytes, and in light of the 100,000 to 500,000 or so records in a single user’s dataset, this could save several megabytes of space.
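Both of these space-control ideas, purging stale or neutral tokens and storing fixed-width 64-bit keys instead of variable-length text, can be sketched briefly. The purge thresholds below mirror the bands and ages described above; the particular hash used (the first 8 bytes of a SHA-1 digest) is an assumption made for illustration, since implementations differ in which 64-bit checksum they choose.

```python
import hashlib

def should_purge(probability, days_stale):
    # Drop tokens that have sat in the neutral band for more than a week or
    # two, or any token untouched for around 60 days, per the policy above.
    return (0.30 <= probability <= 0.70 and days_stale > 14) or days_stale > 60

def token_key(token):
    # Reduce variable-length token text to a fixed 8-byte (64-bit) key.
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

print(should_purge(0.52, days_stale=20))    # True: neutral and stale
print(hex(token_key("unsubscribe+from")))   # 16 characters of text become an 8-byte key
```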
A final approach to implementing chained tokens while controlling disk space is to use a merged group, as we discussed in the previous chapter. By creating a “parent” group containing systemwide data and merging it with individual users’ “diffs,” we’re able to shrink each user’s dataset down to only what is necessary to customize the global dataset.

Given that disk space is dirt cheap, and that an email system capable of supporting 30,000 users is already significant in cost, the additional disk space required for a chained token implementation is still well within the acceptable range for a serious Internet service provider that wants to provide accurate filtering services. A majority of filter users aren’t huge ISPs but small- to medium-sized companies that are looking to filter spam for their thousand or so employees. It should be very comforting for these administrators to know that they will need only a few gigabytes of disk space to accommodate a chained token implementation for the whole company. Finally, much optimization is possible in the type of back-end storage used. Stateless, random-access file implementations such as Berkeley DB and GDBM are very cumbersome in comparison to a SQL-based back end such as MySQL or PostgreSQL. SQL-based solutions seem to have a much better system for recovering unused space, which is important when purging data. Overall, it’s possible to decrease the size of a typical chained dataset to between 5 and 10 MB, or even as small as 1 MB, using a merged group and a train-on-error (TOE) approach.
Supporting Data
Statistical filters are already very good at identifying predictable email, so it makes no sense to run the standard accuracy test to prove the worthiness of chained tokens. Instead, chained tokens were evaluated with a feature-comparison test in which a large sample of email from the SpamAssassin corpus—email that had never been seen before by the test users—was sent through the filter. The results were very good, as shown in Table 11-1.
User | Corpus | Chained? | Correct | Incorrect |
---|---|---|---|---|
USER 1 | SA Spam Corpus | NO | 353 | 147 |
USER 1 | SA Spam Corpus | YES | 431 | 69 |
USER 1 | SA Ham Corpus | NO | 2,274 | 226 |
USER 1 | SA Ham Corpus | YES | 2,433 | 67 |
USER 2 | SA Spam Corpus | NO | 393 | 107 |
USER 2 | SA Spam Corpus | YES | 390 | 110 |
USER 2 | SA Ham Corpus | NO | 1,820 | 680 |
USER 2 | SA Ham Corpus | YES | 2,007 | 493 |
The database was, of course, cleared between all tests and then retrained with or without chained tokens. As the table illustrates, the chained token approach was overall twice as accurate at identifying never-before-seen email as primitive tokenization alone. In one case, chained tokens actually hurt spam-identification accuracy for one user by three messages, but they more than made up for it by correctly classifying innocent messages that otherwise would have become false positives.
Summary
In summary, chained tokens have proven their worth, both in the test lab and in actual production implementations. One of the primary reasons to use chained tokens is the added buffering they provide, making it much more difficult for spammers to evade statistical filters with a bunch of junk text. As we’ll see in Chapter 13, chained tokens also make it much more difficult to poison algorithms such as Bayesian noise reduction; by attempting to eliminate single-token noise, spammers are actually creating chained-token noise, and vice versa. The key features of chained tokens provide more accurate classification by
Establishing concepts of messages, rather than individual components
Maintaining the importance of certain words together
Providing more usable data to work with
Classifying HTML based on its use rather than its presence
Classifying message headers more accurately
Identifying differences in language patterns between spams and legitimate mail
Chained tokens have been implemented in the DSPAM project and in SpamProbe, and they are presently being worked into Dr. John Graham-Cumming’s POPFile filter. They have dramatically improved accuracy over other filters that don’t have such an implementation, and you can be confident that any project using chained tokens is likely to be a successful one.