Building a Historical Dataset



The historical dataset (discussed in Chapter 3) holds the filter's memory of previous messages: the meaningful words, phrases, and combinations to watch for in future email. Before the filter can reach a decision about a particular message, the historical dataset must first be built.

The characteristics used in statistical filtering are generated based on the content of the message; therefore, no matter how much spam changes over time, it will always have some unique content we can use to identify it. As new words begin to appear in spam, they’re automatically added to the filter’s memory.

Building a historical dataset can be the responsibility of the systems administrator or the end user. The systems administrator is responsible for building and maintaining any dataset used globally and for providing a way for end users to train their own datasets (if they will have them). It is the developer's responsibility to provide the tools necessary for building and maintaining the dataset, either by feeding the filter a collection of saved messages (a corpus) or by building it from scratch.

Corpus Feeding


Corpus feeding involves building a historical dataset by analyzing a collection (corpus) of saved mail. We start off with a small collection of both spam and legitimate mail that has been presorted and classified (by a human). The filter processes each message and builds a table recording the occurrence of every token (how many spam and how many legitimate messages it has appeared in). When we're finished, we end up with a very large collection of tokens and their corresponding counters, representing their appearance in the corpus.
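
To make the mechanics concrete, here is a minimal sketch of corpus feeding in Python. The tokenizer, data structures, and function names are illustrative assumptions, not any particular filter's implementation:

    import re
    from collections import defaultdict

    def tokenize(message):
        # Naive tokenizer: lowercase word tokens. Returning a set means each
        # token is counted only once per message (see the Note that follows).
        return set(re.findall(r"[a-z0-9'$!-]+", message.lower()))

    def feed_corpus(spam_messages, ham_messages):
        # token -> [appearances in spam, appearances in ham]
        counts = defaultdict(lambda: [0, 0])
        for message in spam_messages:
            for token in tokenize(message):
                counts[token][0] += 1
        for message in ham_messages:
            for token in tokenize(message):
                counts[token][1] += 1
        totals = {"spam": len(spam_messages), "ham": len(ham_messages)}
        return counts, totals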

The following example shows a small slice of preliminary data from such an exercise.

Feature        Appearances/Spam    Appearances/Ham
fun                   19                  9
girlfriend             4                  0
mariners               0                  7
tell                   8                 30
the                   96                 48
vehicle               11                  3
viagra                20                  1

Note

The occurrence of a token counts only once for each message analyzed.


As you can see, the word “the” has appeared in 96 messages that were considered spam and in 48 legitimate messages. In a perfectly balanced email corpus, this would suggest that the word is more likely to be present in spam, but most corpora are unbalanced. In order to determine the true disposition of each token, we’ll also need to know how much spam and nonspam is in the corpus.

Two additional counters are created in the dataset for this purpose. For simplicity's sake, our sample corpus will comprise twice as much spam as nonspam (which is usually about right in the real world):

Total Spam                224
Total Legitimate Mail     112


Anywhere from several hundred to a few thousand messages are processed in this fashion, building one large table by the end of the process. Once complete, the table is stored on disk as our dataset, so that it can be accessed every time an email arrives.
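
How the table is stored is implementation-specific; as one hypothetical example, the counters and totals from the earlier sketch could be serialized with Python's json module:

    import json

    def save_dataset(counts, totals, path="dataset.json"):
        # Persist the token counters and message totals to disk.
        with open(path, "w") as f:
            json.dump({"counts": dict(counts), "totals": totals}, f)

    def load_dataset(path="dataset.json"):
        with open(path) as f:
            data = json.load(f)
        return data["counts"], data["totals"]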





Note

Analyzing a corpus of mail collected over three to six months will allow the filter to see more of the user’s personal behavior than one collected over a few days. It will also seed the filter with more information useful for identifying spam, since the characteristics of spam change occasionally. For this reason, 10,000 messages collected over a period of a week are not as useful as 1,000 messages collected over a month’s time. If you are building a historical dataset for use globally, it’s also important to use mail from several diverse users on the system. The more diverse the data, the less likely it will be to generate erroneous classifications.


Starting from Scratch


It’s not entirely necessary to build the dataset from a historical corpus of mail. In fact, many spam filters prefer to start from scratch and build the dataset as email is received. The filters CRM114, POPFile, and DSPAM, for example, all function best when building a dataset from scratch. The philosophy is that building from an empty corpus prevents the filter from getting hung up on stale data, which can hurt long-term accuracy.

Starting from scratch means that the filter knows nothing until a message is trained. Once a message is trained, either the tokens are added to the dataset or their counters are incremented if they already exist. For example, if the word “calculus” is already present in the dataset, either its innocent counter or its spam counter will be incremented whenever a message is trained containing that word. If the word is not present in an email, it’s left alone in the dataset.
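
In code, training a single message is just the incremental form of corpus feeding; this sketch reuses the hypothetical tokenize() helper from the corpus-feeding example:

    def train(counts, totals, message, is_spam):
        # Learn one message: add each token, or increment its counter
        # if it already exists in the dataset.
        index = 0 if is_spam else 1
        for token in tokenize(message):
            counts.setdefault(token, [0, 0])[index] += 1
        # Keep the message totals current so we always know the dataset size.
        totals["spam" if is_spam else "ham"] += 1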

Depending on the training mode used (as discussed in Chapter 3), the filter will add information to the dataset at different times. The training mode can also dictate which specific tokens are incremented. Because we keep track of message totals, the total number of messages is updated whenever a message is trained, so that we know the size of the dataset at any point in time.

Building from an empty dataset generally takes time and messages. Even with many messages over a short period of time, certain areas of a user’s email behavior may not be well represented, and infrequent emails such as monthly newsletters may end up in the bit bucket unintentionally. However, accuracy rises over time using this approach, eventually reaching a plateau once enough training has taken place.

Correcting Errors


As the historical dataset is built, training errors are likely to occur while the filter learns. When this happens, it’s important to reverse whatever information has been previously recorded about the message.

If the message has been previously learned, it must first be unlearned. This is done by decrementing the counters under which the message was originally classified. It can then be learned by incrementing the correct counters. This is referred to as retraining the message.
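
Under the same assumptions as the earlier training sketch, unlearning and retraining might look like this:

    def untrain(counts, totals, message, was_spam):
        # Reverse a previous training pass by decrementing its counters.
        index = 0 if was_spam else 1
        for token in tokenize(message):
            if token in counts and counts[token][index] > 0:
                counts[token][index] -= 1
        totals["spam" if was_spam else "ham"] -= 1

    def retrain(counts, totals, message, correct_is_spam):
        # Unlearn the erroneous classification, then learn the correct one.
        untrain(counts, totals, message, was_spam=not correct_is_spam)
        train(counts, totals, message, is_spam=correct_is_spam)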

For example, if the spam filter originally thought a message containing the word "free" was legitimate, the data may initially look like the following.

Token        Nonspam Hits    Spam Hits
free              10             32

Total Spam:       65
Total Nonspam:    20


Once the message is submitted for retraining, the data will be reversed.

Token        Nonspam Hits    Spam Hits
free               9             33

Total Spam:       66
Total Nonspam:    19






Note

In some environments (such as train-on-error, or TOE), the original message was never learned, so the message needs only to be learned (and not unlearned).


When relearning a message, it's important to replay all of the tokens used in the original (erroneous) classification. This includes the message's original full headers, its body, and all other content.

This can be done in several ways. Some spam filters allow the user to bounce the message back to the server, where it is processed. Bouncing is a feature supported by many email clients and involves sending the original message (with all headers intact) back into the filter. Another way to ensure the preservation of all tokens is to store a copy of the message on the server for a short period of time, so that when it is reclassified, all of the data will be available in its original form. This is relatively easy to do with web-based and IMAP mail systems, since the messages are already on the server. Finally, some tools store a "signature," a snapshot of the original training data, instead of retaining the original message.
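
As a rough illustration of the signature approach, a filter might snapshot the tokens used at classification time, keyed by a hash; the storage scheme here is purely an assumption:

    import hashlib

    signatures = {}  # signature -> tokens used in the original classification

    def snapshot_training_data(message):
        # Record the exact tokens used to classify a message, keyed by a
        # hash that can later be presented to retrain without the original.
        tokens = sorted(tokenize(message))
        sig = hashlib.sha1("\n".join(tokens).encode("utf-8")).hexdigest()
        signatures[sig] = tokens
        return sig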

Regardless of how the original data is retained, the retraining process will usually decrement one counter and increment the other. If a spam was erroneously classified as legitimate, the innocent counters of all trained tokens pertaining to the message are decremented and the spam counters are incremented. This is essentially how the filter learns: as mistakes are corrected, the dataset reflects the new values, which in turn directly affect the disposition of each token.





Note

Depending on the features available in your chosen filter, there may be many different ways to implement error correction. The approach that is simplest for your users to understand while still providing pristine data for the filter may differ for each implementation.


The Tokenizer and Calculating Token Values


When a message is processed, the tokenizer breaks it down into very small components (tokens) and assigns each token a value. A token value is a numeric representation of the token’s disposition (good or evil), based on its appearance in previous emails. For example, if a word has appeared primarily in spam, it would be assigned a value reflecting a very strong probability of being spam (such as 99 percent).

Consider the following test corpus with exactly twice the amount of spam as legitimate mail.

Total Spam:               224
Total Legitimate Mail:    112


If we look at the number of occurrences of the word "the" in the corpus, we see that its appearance in spam is proportionally identical to its appearance in legitimate mail; that is, relative to the size of each corpus, it appears just as often in spam as in legitimate messages. In our test corpus, the word "the" has appeared in 96 spams and 48 nonspams.

Feature    Appearances/Spam    Appearances/Ham
the               96                  48


Common sense tells us that this word should be assigned a neutral value of 50 percent, since it appears in spam and nonspam equally. Now all we need is a mathematical formula that will give us the same reasonable results.

There are two popular mathematical formulas to choose from: Graham’s and Gary Robinson’s.





Note

Some filters let you choose which of these two techniques (discussed next) to use. Selecting the correct option may be an exercise in trial and error, and may depend on the aggressiveness of your anti-spam policy. The first technique (Graham’s) has been shown to be more aggressive in preventing spam, while Robinson’s is more aggressive in preventing false positives.



Graham’s Technique for Computing Word Values


Graham provides a very simple formula for determining the probability that a given token is an identifier of spam. His approach takes into account the fact that many users will have an unbalanced collection of spam and legitimate email and computes a probability that factors in the total number of messages in each corpus.

In the following formula, SH and IH represent the total number of appearances in spam and innocent email for the token being computed, and TS and TI represent the total number of spam and innocent messages in the user's corpus:

P = \frac{SH/TS}{(SH/TS) + (IH/TI)}

For example, computing the word "the" from our test corpus:

P = \frac{96/224}{(96/224) + (48/112)} = \frac{0.4286}{0.4286 + 0.4286} = 0.5000

This calculation leaves us with P, a number between 0.0 and 1.0, where 0.5 is neutral (and the result of calculating "the" in our example). Tokens with a probability higher than a neutral 0.5 are considered guilty (or "spammy"), while tokens with a lower probability are considered hammy (or innocent). Tokens with a stronger disposition will have a computed value closer to 0.0 or 1.0.
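
Expressed in Python, the formula is a direct transcription (the function name is ours; tokens appearing in only one corpus, which would drive P to exactly 0.0 or 1.0, are handled in the "Single-Corpus Tokens" section below):

    def graham_probability(spam_hits, innocent_hits, total_spam, total_innocent):
        # P = (SH/TS) / ((SH/TS) + (IH/TI))
        spam_ratio = spam_hits / total_spam
        innocent_ratio = innocent_hits / total_innocent
        return spam_ratio / (spam_ratio + innocent_ratio)

    # graham_probability(96, 48, 224, 112) -> 0.5000 ("the", neutral)
    # graham_probability(8, 30, 224, 112)  -> 0.1176 ("tell", hammy)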


Robinson’s Technique for Combining Word Values


Robinson improved on Graham’s technique by factoring in a way to calibrate the value of a token probability based on the amount of historical data available for each token. Robinson believes that tokens with very few data points (historical occurrences) should have a weaker confidence than ones with many data points, and suggested a new way of calculating individual token probabilities.

Robinson adds onto Graham’s existing calculation by applying a confidence factor. The resulting formula weakens the value of a token if it has very few data points. His function uses three variables:














N


The total number of times a token has appeared in the dataset—in either spam or legitimate mail


X


The assumed value of a token when N=0. A good starting value is 0.5000


S


A constant that is tuned for performance, with a reasonable default value of 1



“In cases where there is very little information to work with,” Robinson states, “we have some data, but we don’t really have enough to be confident. In reality we should do something when N is any low number—0 being the most extreme case.”

We first calculate the probability of a word using Graham's technique; call this p(w) for word w. We then calculate as follows:

f(w) = \frac{(S \cdot X) + (N \cdot p(w))}{S + N}

Robinson’s approach yields results similar to Graham’s, except that it is calibrated based only on the information available for a specific token. By taking into account cases in which very little data has been collected for a particular token, Robinson’s technique yields a slight improvement in accuracy for many types of hard-to-classify messages. It is particularly useful in improving the identification of legitimate mail, thereby reducing false positives.

Single-Corpus Tokens


Some tokens are such strong indicators that they will appear in only one particular corpus of mail (spam or nonspam). These are referred to as single-corpus tokens. One caveat when assigning values to tokens is that no token should ever have a probability of 100 percent or 0 percent, because these are absolute probabilities (and more important, break our math). For example, because email evolves, there is no guarantee that messages containing the word “offer” have a 100 percent chance of being spam.

To solve this problem, a static value is assigned to single-corpus tokens. If a token appears in only one corpus, we assign it a hard-coded value. For example, most filters will assign tokens that appear only in our spam corpus a probability of 0.9900, and they will assign a probability of 0.0100 to tokens appearing only in legitimate mail. If Robinson’s technique is being used, this value is assigned to the p(w) value used in his formula.
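
Folded into code, the single-corpus rule sits in front of the Graham calculation; a sketch (the function name and structure are ours):

    def token_probability(spam_hits, innocent_hits, total_spam, total_innocent):
        # Static values for single-corpus tokens keep probabilities off
        # the absolute extremes of 0.0 and 1.0.
        if spam_hits > 0 and innocent_hits == 0:
            return 0.9900   # appears only in the spam corpus
        if innocent_hits > 0 and spam_hits == 0:
            return 0.0100   # appears only in legitimate mail
        spam_ratio = spam_hits / total_spam
        innocent_ratio = innocent_hits / total_innocent
        return spam_ratio / (spam_ratio + innocent_ratio)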

When we apply these calculations to our test corpus from earlier in this chapter, we end up with a list of probabilities for each token.

Feature       Spam    Ham    Probability
fun            19      9       0.5135
girlfriend      4      0       0.9900
mariners        0      7       0.0100
tell            8     30       0.1176
the            96     48       0.5000
vehicle        11      3       0.6470
viagra         20      1       0.9090



A Biased Filter


Graham notes in his research that by doubling the occurrence of tokens in legitimate messages, we can help to prevent false positives. Since spam filters typically err on the side of caution, Graham chooses to classify a message as spam only if there is an overwhelming amount of supporting data. The following is a modified version of Graham's formula that multiplies the number of innocent occurrences for a token by 2:

P = \frac{SH/TS}{(SH/TS) + (2 \cdot IH/TI)}
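
In code, the bias amounts to doubling the innocent hit count before the ratio is taken; a sketch under the same assumptions as the earlier examples:

    def biased_probability(spam_hits, innocent_hits, total_spam, total_innocent):
        # Each innocent occurrence counts twice, tilting tokens toward ham.
        spam_ratio = spam_hits / total_spam
        innocent_ratio = (2 * innocent_hits) / total_innocent
        return spam_ratio / (spam_ratio + innocent_ratio)

    # biased_probability(19, 9, 224, 112) -> 0.3454, versus 0.5135 unbiased ("fun")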





Note

Many spam filters allow bias to be turned on or off. In many scenarios, applying bias has resulted in up to 50 percent fewer false positives, with only slight drops in the identification of spam.


Hapaxes


Another thing to consider when assigning values to tokens is how we handle new tokens that we’ve never seen before, or for which we haven’t collected enough data. These tokens are referred to as hapaxes.

Filters based on Graham’s original outline in “A Plan for Spam” implement a threshold of occurrence—a minimum number of appearances in historical email before the filter will assign a calculated probability to the token. If the token doesn’t meet the threshold, it will be assigned a relatively neutral value, such as 0.4000 or 0.5000, known as the hapaxial value, which is applied to the Graham-derived probability of the token.

A hapaxial value serves two purposes. First, it prevents unknown tokens from directing the outcome of the classification. Since spammers frequently flood their spam with random junk words, treating words we’ve never seen before as innocent would have a dreadful effect on accuracy. It would also be a mistake to treat unknown tokens as spam, because unknown words will always be present in email as a user’s email behavior changes. The only option, in fact, is to assign a neutral value to hapaxes. This prevents the token from being used in our calculation at all—at least, until enough data has been collected about it.





Note

This is why Bayesian content filtering cannot be compromised with unknown random words or jumbles of random letters, in spite of popular myth.


The second reason a threshold is necessary is to ensure honesty in our statistics and to avoid a situation in which the lack of data skews the overall results. For example, if a token has appeared only twice, both times in spam, we don’t want our filter to treat it with a spammy probability of 0.9900 just yet, because we still don’t have enough data. We need at least a handful of messages using that word to provide a good indicator of its real disposition.

In keeping with the philosophy of erring on the side of caution, one occurrence in a legitimate message is treated as two occurrences. Therefore, if we were to use a threshold of five occurrences, a token would need to appear in either three legitimate emails (for a total of six weighted occurrences), five spams, or some combination of the two in order to be assigned a real token value.
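
A sketch of the threshold test, with the doubled innocent count and the values used in the text (the constant names are ours):

    HAPAXIAL_VALUE = 0.4000
    OCCURRENCE_THRESHOLD = 5

    def meets_threshold(spam_hits, innocent_hits):
        # Innocent occurrences count double, per the bias rule above.
        return spam_hits + 2 * innocent_hits >= OCCURRENCE_THRESHOLD

    # meets_threshold(4, 0) -> False: "girlfriend" keeps the hapaxial value
    # meets_threshold(0, 7) -> True:  "mariners" earns a calculated value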

Final Product


Once we apply these concepts, including a bias toward innocent email and an occurrence threshold of five, the result is a table of data that is slightly different from our original example. Here are the probabilities for our tokens after applying all of our design rules.

Feature       Spam    Ham    Probability
fun            19      9       0.3454
girlfriend      4      0       0.4000
mariners        0      7       0.0100
tell            8     30       0.0625
the            96     48       0.3333
vehicle        11      3       0.4782
viagra         20      1       0.8333


Although it may not be obvious, our final table identifies not only which characteristics are present in email, but also which are the most useful to us. If we look at the values in the table, we can see which tokens have the strongest disposition of spam (closest to 1.0) and nonspam (closest to 0.0).
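
To tie these rules together, here is one hypothetical routine that combines the biased Graham calculation, the single-corpus values, and the occurrence threshold; run over our test corpus, it reproduces the table above (values truncated to four places):

    def final_token_value(spam_hits, innocent_hits, total_spam, total_innocent,
                          threshold=5, hapaxial_value=0.4000):
        # Hapax rule: too little data (ham counts double) -> neutral value.
        if spam_hits + 2 * innocent_hits < threshold:
            return hapaxial_value
        # Single-corpus rules: avoid absolute probabilities of 0.0 and 1.0.
        if innocent_hits == 0:
            return 0.9900
        if spam_hits == 0:
            return 0.0100
        # Biased Graham probability.
        spam_ratio = spam_hits / total_spam
        innocent_ratio = (2 * innocent_hits) / total_innocent
        return spam_ratio / (spam_ratio + innocent_ratio)

    # Examples, matching the final table:
    #   final_token_value(19, 9, 224, 112)   -> 0.3454  ("fun")
    #   final_token_value(4, 0, 224, 112)    -> 0.4000  ("girlfriend", below threshold)
    #   final_token_value(96, 48, 224, 112)  -> 0.3333  ("the")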
