Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

Jonathan A. Zdziarski

Improvements to Statistical Analysis


So far, we’ve learned only the basics of statistical language analysis. Many other refinements have been layered on top of these fundamentals to further improve our ability to correctly classify messages.

Improving the Decision Matrix


We’ve learned so far that the decision matrix used in most popular Bayesian implementations can be limited to a finite number of elements, with only the most interesting tokens. However, since some messages contain more than 15 interesting tokens, it would be beneficial to improve the way we deal with tokens of similar interest.

For example, a single-corpus token that has appeared in only 10 spams is given the same value as one that has appeared in 100 spams. This doesn’t really make sense. If a token has appeared in 100 spams, it should probably be considered even spammier than a token that’s appeared in only 10.

One way to solve this problem might be to adopt a more dynamic approach to assigning token values that uses a base value. For example, we might take our original base of 0.9900 for spam-corpus tokens and add from 0 to 99 ten-thousandths to the token’s probability, depending on its representation as a percentage of total spam. The idea is to give preference to more popular tokens without dramatically affecting the probabilities themselves. For example, if the token has appeared in 5 percent of all spam, assigning a probability of 0.9905 will move the token up slightly in the decision matrix, above one that has appeared in only 2 percent of all spam (which would have a probability of 0.9902).
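A minimal sketch of this scheme follows; the function and variable names are illustrative, not taken from any particular filter’s source:

```python
def spam_token_probability(spam_hits, total_spam_learned, base=0.9900):
    # Share of all learned spam this token has appeared in, as a whole
    # percentage capped at 99 so the result never reaches 1.0
    pct = min(int(100 * spam_hits / total_spam_learned), 99)
    # Add pct ten-thousandths: a token seen in 5% of spam gets 0.9905,
    # one seen in 2% gets 0.9902, nudging popular tokens up the matrix
    return base + pct / 10000.0
```

Because the bump is tiny, the ordering of the decision matrix changes, but the probabilities fed into the combination formula are barely disturbed.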

Another way to enhance the quality of the final result is to give preference to tokens based on how frequently they appear in the message being classified. For example, if two tokens share a probability of 0.9900 and one of them has appeared twice in the message, giving preference to that token in the decision matrix ordering allows the more heavily used tokens in the message to be considered first. We can do this either by tweaking the probability itself or by assigning a sort priority when the tokens are ordered by interestingness, leaving the token’s actual probability untouched.

Improvements to Tokenization


Filtering is like NASCAR. All racecars go fast—it’s squeezing the last 5 mph out that makes it an artful science. The quality of the data is almost always more important than any of the algorithms used to combine it. Unless the filter is using an entirely primitive set of algorithms, it will most likely do a good job of performing some acceptable level of filtering. What sets good filters apart from great ones is the quality of data presented to these algorithms.

Finding ways to extract better data from a message will improve the quality of the final result. Graham’s original tokenization plans were very vanilla—all tokens were case insensitive, certain types of punctuation (such as exclamation points) were considered delimiters, and header and URL-specific associations didn’t even exist. The data was so common that Graham’s second paper on spam, “Better Bayesian Filtering,” introduced many changes to make the data stand out more, which quickly improved accuracy.
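The flavor of those changes can be shown with a toy tokenizer. The `Header*token` tagging style mirrors the kind of context marking Graham described, but the regular expression and prefix format here are illustrative assumptions:

```python
import re

def tokenize(text, header=None):
    # Preserve case and keep trailing exclamation points as part of the
    # token, rather than treating them as delimiters
    tokens = re.findall(r"[A-Za-z0-9$'.-]+!*", text)
    if header:
        # Tag tokens with the header they came from, so "FREE" in the
        # Subject line accumulates statistics separately from "FREE"
        # in the body
        return [f"{header}*{t}" for t in tokens]
    return tokens
```

Each refinement makes otherwise identical words yield distinct, more telling statistics.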

Further approaches to enhance tokenization have been implemented in many filters; these are discussed in Chapter 11.

Statistical Sedation


We’ve discussed establishing a dataset by either training an existing corpus of email or starting from scratch. The initial results for users with a high volume of spam are usually less than impressive. Because of the high volume of spam being trained and a lack of legitimate mail, some users may experience a certain number of classification errors during the first few weeks of training. Statistical sedation is an algorithm originally designed for DSPAM that is used to dampen statistical data in the absence of adequate training.

If the user doesn’t have much (or any) training data, or has a very high rate of spam, any tokens in their dataset will be relatively concentrated—that is, the word “free” might have a significant number of hits in spam, but the user may not have received enough innocent mail yet to adequately represent this token. Dampening is designed to curb this phenomenon by watering down the concentrated data from their dataset.

Statistical sedation doesn’t affect the probabilities of the individual tokens in the dataset, but rather raises the minimum number of occurrences required in the dataset for the token to be given a calculated probability in place of its neutral hapaxial value. Two thresholds are defined within the algorithm itself. The first sets a very aggressive form of sedation, indicating that the user’s dataset hasn’t received very much training data at all. An initial threshold of 500 to 1,000 (loThresh) innocent messages is a good default. The second threshold is much more passive and identifies that the user has had some training but may still experience false positives on occasion. This value normally hovers around 1,500 to 2,500 (hiThresh) innocent messages. Both thresholds take the total number of spams and legitimate mail in a user’s dataset into account, and the algorithm itself does not engage unless the user has received more spam than legitimate mail. Four variables are used in the sedation algorithm:

minHits: the minimum number of token occurrences required
TI: the total number of innocent messages learned
TS: the total number of spams learned
S: a number between 0 and 10 specifying the level of sedation


if TI < loThresh and TI < TS then
    minHits = minHits + (S/2) + S((TS - TI)/200)
else if TI < hiThresh and TI < TS then
    minHits = minHits + (S/2) + S(5(TS/(TS + TI)))





Note

The value 5 used in this formula originates from first multiplying the calculation by 100 to yield a percentage value, and then dividing by 20 to sedate the statistic into fifths.


In the example above, the minHits value is computed based on the number of legitimate mails and spams in a corpus and the level of sedation specified by the implementer. The level of sedation is like a knob allowing the implementer to decide just how much caution to use when applying this algorithm. A good value to start out with is 5, and acceptable values usually range from 0 to 10.
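Putting the formula into runnable form makes the behavior easier to see. The threshold defaults below are taken from the ranges suggested above; this is a sketch of the algorithm, not DSPAM’s actual source:

```python
def sedated_min_hits(min_hits, TI, TS, S=5, lo_thresh=1000, hi_thresh=2500):
    # Raise the minimum token-occurrence threshold when the dataset is
    # spam-heavy and undertrained; leave it alone otherwise. The
    # algorithm engages only when more spam than innocent mail has
    # been learned (TI < TS).
    if TI < lo_thresh and TI < TS:
        # Aggressive sedation: almost no innocent mail learned yet
        min_hits = min_hits + (S / 2) + S * ((TS - TI) / 200)
    elif TI < hi_thresh and TI < TS:
        # Passive sedation: some training, occasional false positives
        min_hits = min_hits + (S / 2) + S * (5 * (TS / (TS + TI)))
    return min_hits
```

For instance, with a base minHits of 5, 500 innocent messages, 1,500 spams, and S = 5, the aggressive branch raises the threshold to 32.5, so sparsely seen tokens fall back to their neutral hapaxial value.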





Note

The level of statistical sedation can be tuned in DSPAM v3.x using the tb=N feature. For example, --feature=tb=0 will disable statistical sedation, and --feature=tb=10 will set the maximum level of sedation. The default value of 5 is generally acceptable; however, some systems may call for a more aggressive approach to filtering.


Iterative Training


Finally, another attempt to improve on basic statistical filtering is to incorporate iterative training (or retraining). Iterative training is an approach used by many spam filters, including SpamProbe and DSPAM, in an attempt to learn faster and ensure that the original mistake isn’t made a second time. This is often referred to as test-conditional training, as the message that was erroneously classified is learned and then relearned until the original erroneous condition is no longer met.

For example, if the filter misjudges a message as spam, the user will present the false positive to the filter for relearning. Iterative training will cause the message to be relearned until it is no longer classified as a false positive. Many filter authors impose a maximum loop of five or ten iterations to retrain a single message, in order to prevent too many changes being made to a user’s dataset, which could ultimately generate even more errors in the future.

Since most authors still program their filters to err on the side of caution, this is a reasonable algorithm to implement. It does result in more than one small change being made to the data, but the data being changed remains the same throughout the entire iterative training process (that is, only the tokens present in the message are being changed). While this approach increases the likelihood of the data’s polarity (overall disposition) being changed, it also limits the amount of data in the dataset that could potentially be altered by training. This is much safer than training ten different messages in which ten different sets of data are changed.
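The retraining loop itself is simple. In this sketch, `classify` and `train` are placeholder hooks standing in for the filter’s own routines, and the iteration cap follows the five-to-ten range mentioned above:

```python
MAX_ITERATIONS = 5  # cap retraining so the dataset isn't overcorrected

def iterative_retrain(message, correct_label, classify, train):
    # Relearn a misclassified message until the filter returns the
    # correct label, or until the iteration cap is reached. Only the
    # tokens present in this one message are touched on each pass.
    for _ in range(MAX_ITERATIONS):
        train(message, correct_label)
        if classify(message) == correct_label:
            break
```

Because every iteration adjusts the same set of tokens, the damage a bad retraining run can do is bounded to that message’s vocabulary.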





Note

If this feature is available in the filter you are using, enable it. Iterative training has proven to speed up the training cycle in many cases and will prevent the same mistakes from being made repeatedly.


Learning New Tricks


Spammers have been trying to evade statistical filters for a few years, to no avail. As spammers change their messages to evade the filters, the filters always seem to have an eerie way of detecting their new tricks—usually without the end user even noticing. For example, using “v1agra” in place of “Viagra” in an email does nothing more than train the filter—and then “v1agra” becomes an even better indicator of spam.

There is so much information embedded within an email that it’s extremely difficult for a spammer to craft a message with a completely unknown vocabulary, and even if they manage to find a way to get a message past the user’s filter, it won’t get through a second time. Most spammers don’t have the resources to evolve every spam distribution for each individual, so filters efficiently detect the known patterns while catching any new permutations that happen to be present in the message. To be virtually undetectable as spam, a message would have to be the first to use a completely unique combination of words throughout, including a set of normal-looking, average-Joe headers, and at best the spammer would get away with it only once or twice before the filters caught on. The message would also most likely be so obfuscated that it would be difficult to read.
