Chapter 10: Testing Theory
Statistical filtering is unlike any other spam-filtering approach we’ve seen, and because of this, it must be tested differently from any other beast. The testing approaches used to measure heuristic spam filters are frequently, and erroneously, applied to statistical filters, which produces poor results. The intricacies of machine learning require a more scientific approach than simply throwing mail at the filter, and even the most detailed approaches to testing such a tool only barely succeed in accomplishing a real-world simulation. In this chapter, we’ll discuss some of the best practices filter authors have put into place for testing the efficiency of language classifiers. We’ll also take a look at several common tests used to measure different areas of filtering and address some common mistakes testers make.
The Challenge of Testing
Modern-day language classifiers face a unique situation—they learn based on the environment around them. The challenge in testing them, therefore, is to create an extremely controlled environment. When heuristic filtering was popular, there were many different ways to test it. Since the filter didn’t base its decisions on previous results, an accurate set of results could be obtained from just about any type of testing approach.

The state of a statistical language classifier is similar to that of a sequential circuit, in that the output is a combination of both the inputs and the previous state of the filter. The previous state of the filter is based on a previous set of inputs, which are based on a previous set of results, and so on. Think of it in terms of going to the supermarket every week: what you buy from visit to visit is based on what you have in your refrigerator. A single change in the environment (such as an impending snowstorm) can easily affect the contents of your refrigerator by many Twinkies. Similarly, a change in the filter’s environment can affect the filter by many messages.

With this in mind, the challenge of testing is to create an environment that simulates real-world behavior as closely as possible—after all, what we are trying to measure is how accurate the filter will be in the real world. This means that testing a statistical filter is no longer a matter of testing but of simulation. Simulating real-world behavior takes into consideration many factors that obsolete heuristic testing doesn’t.
Of course, when you’re not testing to measure accuracy, this type of simulation isn’t always necessary. Chaos in message ordering and content may be appropriate when testing to compare features for a particular filter or for any other kind of blind test for which accuracy isn’t as important as deviation.
Message Continuity
Message continuity refers to the existence of uninterrupted threads and their message content, specifically in the set of test messages used. Many test sets are based on older testing approaches and consist of nothing more than a random selection of emails from several users. This is ideal for static filter tests, such as those of heuristic filters or blacklists, or for tests in which the accuracy of a filter isn’t being measured (such as the feature comparison test we’ll discuss later). However, such test messages don’t take into consideration the importance of supplying complete threads or continuity of headers, and they are therefore not very useful sources of data for conducting tests involving accuracy. Unfortunately, many individuals make the mistake of using these types of message corpora to test statistical filters, which results in extremely unreliable conclusions.

Because the results of a statistical filter depend in part on the state of the filter, every single message that is learned plays a role in the results. If a message is from an unknown individual, the filter will begin to learn that individual’s distinct signature. The sender’s message headers, grammatical pretense, and other characteristics of the learned message all play a role in identifying the sender as either a legitimate user or a spammer. When future messages are received from the same sender, the information learned earlier will play a role in determining the outcome of the classification. Senders who have sent several legitimate messages to a recipient and are well known to the filter are trusted more than unknown users; as a result, they have slightly more flexibility in the content of their messages.

Another common problem in maintaining message continuity is the order in which the messages are arranged in the corpus. In some cases, the original thread ordering will be preserved. Tests that do not preserve this ordering will generally produce poorer results than ordered tests.
A test corpus that doesn’t take message continuity into consideration will likely have several types of hard ham messages for testing but won’t provide a historical thread of data from the original senders. Statistical filters will therefore see more of these messages as spam, never able to take into consideration whether the topics discussed in the message are familiar to the user. The same is true of spam, in that the same spammers will generally be invading a user’s inbox. Presenting a filter with a random set of spams may not provide an accurate representation of real-life results, because the same spammers generally spam a user over and over again, giving themselves away by, among other things, their message headers (and, ironically, a lot of the junk text they use to try to fool spam filters).

In designing a simulation for a language classifier, it’s generally acceptable to use several corpora of actual users’ mail and spam. Establishing a test group of about 10 to 20 users who are willing to build a corpus of mail will provide the most accurate simulation, because both the message contents and the ordering will be preserved. An ideal training corpus should capture between three and six months’ worth of messages. One additional month should be captured to produce a set of testing messages. More advanced tests may even extend this to nine or ten months, measuring the accuracy of each month throughout the testing process. This will ensure both message continuity and the contextual differences between the types of spam.

Although the continuity of the messages is important, many users make the mistake of presenting the same messages they used to train as candidates for classification. This usually results in wonderful but terribly dishonest results. The test corpus should be contiguous with the training corpus, rather than a repetition of it.

Training corpora aren’t necessarily trained directly into the filter. Depending on the recommended training mode, messages may be trained only upon misclassification, as in the sketch below. This is especially true with filters using train-on-error (TOE) mode, in which training every single message would result in poor levels of accuracy.
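As a rough illustration, a TOE-style pass over an ordered corpus might look like the following sketch. The classify() and train() callables are hypothetical stand-ins for whatever filter is under test; the point here is only that messages are presented in their original order and that training happens solely on errors.

```python
# A minimal sketch of a train-on-error (TOE) pass over a chronologically
# ordered corpus. classify() and train() are placeholders for the filter
# under test; labels are "spam" or "ham".

def toe_pass(messages, classify, train):
    """Feed messages in their original order, training only on mistakes."""
    errors = 0
    for msg, true_label in messages:      # messages must stay in received order
        predicted = classify(msg)
        if predicted != true_label:
            errors += 1
            train(msg, true_label)        # corrective training on error only
    return errors / len(messages) if messages else 0.0
```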
Archive Window
The archive window is a frequently overlooked area of test simulation. As we’ve learned through Terry Sullivan’s research, spam evolves over the course of several months; a sea change occurs every four to six months on average. Many users overlook this when building a test corpus. Some will archive several thousand messages captured over a few weeks’ time. The resulting simulation provides only short-term accuracy, because the filter has not yet learned the different permutations that gradually take place in spam over a longer period of time.
Using the corpus exclusively as an archive and using fresh spam for the test will further impair results. The quantity of messages isn’t nearly as important as the time period over which the messages were captured. The difference is between having six months of experience with spam and having one month of experience six times. Similarly, the window for building an archive shouldn’t be much older than four to six months if the testing will be performed on a more recent archive of mail, as spam (and likely the user’s own legitimate mail) will have evolved beyond the characteristics learned by the filter. The spam of six months ago is dissimilar to the spam of today, and the spam of a year ago is virtually useless for classifying the spam of today. Only by learning recent messages and their permutations can a filter accurately learn the patterns it will need to classify new text. Ideally, one real-world corpus of mail should be captured for training and another for testing, and the two corpora should cover similar, sequential time periods.

The exception to this is when measuring several consecutive time periods. If a year of messages has been archived, the first six months can be used as training data and the remaining six months can be treated as six different tests. If a corrective training approach is implemented during the testing, the filter can learn from its mistakes during each period and maintain enough learning continuity to adequately classify each message. As long as there is no gap between the training corpus and the test corpus, the messages will permute gradually enough to be measured adequately.
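A sketch of this kind of split is shown below: a year-long archive is divided into a six-month training window followed by monthly test windows, keyed on each message’s Date header. The corpus layout (raw RFC 2822 text paired with a label) is an assumption made for illustration.

```python
# A sketch of splitting an archive into a training window and consecutive
# monthly test windows, using the Date header as the virtual clock.

from datetime import datetime
from email import message_from_string
from email.utils import parsedate_tz, mktime_tz

def split_by_month(raw_messages, train_months=6):
    """raw_messages: list of (raw_rfc2822_text, label) pairs."""
    dated = []
    for raw, label in raw_messages:
        parsed = parsedate_tz(message_from_string(raw).get("Date", "") or "")
        if parsed:
            when = datetime.fromtimestamp(mktime_tz(parsed))
            dated.append((when, raw, label))
    dated.sort(key=lambda t: t[0])                 # restore chronological order
    months = sorted({(d.year, d.month) for d, _, _ in dated})
    train_keys = set(months[:train_months])
    training = [(r, l) for d, r, l in dated if (d.year, d.month) in train_keys]
    tests = {m: [(r, l) for d, r, l in dated if (d.year, d.month) == m]
             for m in months[train_months:]}
    return training, tests   # train on `training`, then evaluate month by month
```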
Purge Simulation
Another area that is frequently disregarded in a statistical learning simulation is the purging of stale data. When the training corpus is learned, each message is trained within the same short period of time (usually a period of several minutes or hours). The usual purging a filter might employ doesn’t take place, because all of the data trained is considered new. The purging of stale data can very frequently affect the polarity of many tokens in the user’s database. In most cases, the conventional purge tools can’t be used because they fail to see the data as stale. If the data that would normally be considered stale is not purged, the less volatile data will fail to reflect the same results as a real-world scenario, because the leftover legacy data will still affect the tokens’ polarity. In extreme cases, this could cause guilty tokens to take on an innocent probability and uninteresting tokens to become erroneously interesting.

To establish a true purge simulation, it may be necessary to make actual changes to the software so that it uses the time stamps in the message headers to set the actual time period for data. It will also be necessary to simulate the purge tool running at its standard intervals—usually nightly or weekly. This can be done by parsing the time period from the message headers and treating it as a virtual time period. The purge should then run and remove any stale data it finds in the dataset, based on the virtual time stamps extracted from the messages.
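The sketch below shows one way such a virtual-clock purge might be driven. The staleness threshold, purge interval, and token bookkeeping are all assumptions for illustration, not any particular filter’s purge tool.

```python
# A sketch of a purge simulation driven by a virtual clock taken from the
# message headers rather than the wall clock.

from datetime import timedelta

STALE_AFTER = timedelta(days=90)   # assumed staleness threshold
PURGE_EVERY = timedelta(days=7)    # assumed "weekly" purge interval

def run_with_virtual_purge(messages, train, token_last_seen):
    """messages: list of (msg_datetime, raw, label), already in date order.
    token_last_seen: dict mapping token -> datetime it was last observed;
    the filter's train() call is assumed to keep this dict up to date."""
    last_purge = None
    for now, raw, label in messages:
        if last_purge is None:
            last_purge = now
        # simulate the nightly/weekly purge tool using the virtual clock
        if now - last_purge >= PURGE_EVERY:
            stale = [t for t, seen in token_last_seen.items()
                     if now - seen > STALE_AFTER]
            for t in stale:
                del token_last_seen[t]     # drop stale tokens from the dataset
            last_purge = now
        train(raw, label, now)             # train with the virtual timestamp
```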
Interleave
The interleave at which messages from the corpus are trained, corrected, and classified can play a dramatic role in the results of the test. Many people erroneously perform tests by feeding in two separate corpora—one of legitimate mail and one of spam. Some tests use a 1-to-1 interleave, while others try their best to simulate a real-world scenario. The original ordering of the messages in the corpus will generally yield the most realistic results. The interleave should, if possible, include both legitimate mail and spams in the order they were received, and the original interleave should be recorded.
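If the legitimate mail and spam were captured as two separate corpora, one rough way to approximate the original interleave is to merge them on their Date headers, as in the sketch below. This assumes the Date headers are intact and roughly reflect delivery order; a recorded interleave from the original mailbox is always preferable.

```python
# A sketch that merges separately captured ham and spam corpora back into
# an approximate received order using their Date headers.

from email import message_from_string
from email.utils import parsedate_tz, mktime_tz

def merge_by_date(ham, spam):
    """ham, spam: lists of raw RFC 2822 messages. Returns (label, raw)
    pairs interleaved by date, approximating the original ordering."""
    def dated(raw_list, label):
        out = []
        for raw in raw_list:
            parsed = parsedate_tz(message_from_string(raw).get("Date", "") or "")
            if parsed:
                out.append((mktime_tz(parsed), label, raw))
        return out
    merged = sorted(dated(ham, "ham") + dated(spam, "spam"))
    return [(label, raw) for _, label, raw in merged]
```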
Corrective Training Delay
The delay in retraining classification errors is probably one of the most difficult characteristics to simulate. When a misclassification occurs, the user doesn’t report it immediately—several other messages are likely to come in before the user checks their email and corrects the error. If a user receives 20 spams overnight and the first of these is erroneously classified, the other 19 messages may have been affected by the error. In most cases this plays against the filter and risks generating additional false positives or spam misses, but sometimes it can also err in favor of the filter. A corrective training delay plays a role in the decisions on all the messages between the time the error was made and the time it was corrected. Without simulating this delay, the snowball effect frequently experienced in real-world scenarios doesn’t occur, leaving the results somewhat skewed.

Establishing an average message count between errors occurring and being corrected can help to simulate such a delay. Log files from the filter reporting on the number of messages processed during this period can help to establish a reasonable message count to use as the corrective delay. Filters that perform their function at MTA (mail transfer agent, or mail server) time (that is, when the message is received by the MTA) are likely to experience a much longer corrective training delay than ones that perform their function at MUA (mail user agent) time (that is, when the user downloads their email).
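One way to approximate this delay is to queue each misclassification and apply the corrective training only after a fixed number of further messages have been processed. The sketch below assumes an average delay of 20 messages and reuses the hypothetical classify()/train() interface from earlier; the right delay value should come from real log data, as described above.

```python
# A sketch of simulating a corrective training delay: misclassifications
# are queued and only retrained after `delay` further messages have been
# processed, approximating a user who checks mail periodically.

from collections import deque

def run_with_delay(messages, classify, train, delay=20):
    """messages: (raw, label) pairs in received order; delay is the assumed
    average number of messages processed before the user corrects an error."""
    pending = deque()            # (due_index, raw, label) awaiting correction
    errors = 0
    for i, (raw, label) in enumerate(messages):
        # apply any corrections that have "come due" before classifying
        while pending and pending[0][0] <= i:
            _, bad_raw, bad_label = pending.popleft()
            train(bad_raw, bad_label)
        if classify(raw) != label:
            errors += 1
            pending.append((i + delay, raw, label))   # corrected later, not now
    for _, bad_raw, bad_label in pending:             # flush remaining corrections
        train(bad_raw, bad_label)
    return errors
```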