Testing Caveats
Even the most well-planned testing can turn to mush if the testing process isn’t conducted properly. In this section we’ll discuss some caveats regarding the testing of statistical filters. In general, if you are experiencing accuracy far below a filter’s advertised levels, there is most likely a problem in the implementation of the test. These problems are often simple oversights that, when fixed, produce a noticeable improvement.
Corrective Training
The corrective training process is unique to each filter, and two common mistakes made during corrective training can lead to poor accuracy. The first is using the wrong arguments. In most cases the filter will have a special argument specifically for corrective training; it differs from the standard training arguments because corrective training involves not only learning the message under the correct classification but also unlearning it from the old classification. Tools like DSPAM use a source argument, for example --class=spam --source=corpus for initial training or --class=spam --source=error for retraining. Be sure to use the specific arguments that the filter requires for corrective training, and not the same arguments that were used for corpus training.
The second mistake made in corrective training is retraining with the wrong copy of the message: depending on the filter, retraining may require either the original message or the filter’s output message. Some filters store a serial number or other identifier in the message itself for managing error correction. If the original message is used to retrain, this identifier will not be present, and the message may not actually get retrained. Check the filter’s documentation to identify the correct approach to retraining; if the filter implements any type of serial number or identifier, it may be necessary to use the output message for retraining.
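As a rough sketch of how a test harness might keep both of these requirements straight, the following Python fragment separates corpus training from error correction and retrains from the filter’s saved output message rather than the original. The --class and --source arguments are the DSPAM examples cited above; the user name, the --user and --stdout options, the file paths, and the function names are illustrative assumptions, and the exact invocation will differ from filter to filter.

    import subprocess
    from pathlib import Path

    DSPAM = "dspam"                      # assumed to be on the PATH
    USER = "testuser"                    # hypothetical test mailbox
    OUTPUT_DIR = Path("filter-output")   # harness-side store for output messages

    def classify(msg_id, raw_message):
        # Run the filter and keep its output message, which may carry the
        # serial number or identifier needed later for error correction.
        result = subprocess.run([DSPAM, "--user", USER, "--stdout"],
                                input=raw_message, capture_output=True, check=True)
        OUTPUT_DIR.mkdir(exist_ok=True)
        (OUTPUT_DIR / (msg_id + ".eml")).write_bytes(result.stdout)
        return result.stdout

    def corpus_train_spam(raw_message):
        # Initial training from a corpus uses the corpus source argument.
        subprocess.run([DSPAM, "--user", USER, "--class=spam", "--source=corpus"],
                       input=raw_message, check=True)

    def retrain_as_spam(msg_id):
        # Error correction uses the error source argument and is fed the
        # saved output message, not the original, so that any embedded
        # identifier is still present.
        saved = (OUTPUT_DIR / (msg_id + ".eml")).read_bytes()
        subprocess.run([DSPAM, "--user", USER, "--class=spam", "--source=error"],
                       input=saved, check=True)

The harness, not the filter, is responsible for holding onto the output message between classification and correction; a real test would key that storage by whatever identifier the filter under test actually provides.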
Purge Simulations
Many tests fail to create a true purge simulation, which can ultimately affect the accuracy of the test. During testing, it is crucial to use the time stamps from the message headers to bind the data to a particular age; the purge cycle should then perform its functions based on the virtual age assigned to the data. Many testers make the mistake of performing a single purge at the end of the training cycle, which doesn’t adequately simulate a true purge cycle.
An example of this problem can be illustrated by assigning a time period of one month to every 250 messages in a corpus. The first 250 messages in a training corpus may leave a token with 10 spam hits and 0 innocent hits. If this token is not referenced in the next 250 messages, it will in most cases have become stale, and in a real-world environment a token that has been stale for more than a month would be purged from the system. If, in the third month, the same token is referenced with 0 spam hits and 10 innocent hits, real-world purging would leave the token with only these results, as the earlier results would have been purged. In an automated testing environment, if the purge isn’t simulated correctly, the result may be a neutral token with 10 spam hits and 10 innocent hits. This not only potentially affects the outcome of many classifications, but also increases the likelihood of classification errors in the third month.
Purging stale data based on the dates that the messages were actually received will prevent this condition from occurring and will help ensure the reliability of the test.
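The following Python sketch shows one way to drive such a simulation. It assumes messages are available as parsed email objects with valid Date headers, a simple in-memory token store, and a hypothetical train() hook that updates hit counts and each token’s last-seen month; the point is only that the purge runs at every virtual-month boundary instead of once at the end of the run.

    from email.utils import parsedate_to_datetime

    STALE_MONTHS = 1   # tokens unreferenced for longer than this are purged

    def virtual_month(msg):
        # Bind each message to a virtual age using its Date header.
        date = parsedate_to_datetime(msg["Date"])
        return date.year * 12 + date.month

    def purge(tokens, current_month):
        # Drop tokens that have gone stale, just as a live installation would.
        stale = [t for t, record in tokens.items()
                 if current_month - record["last_seen"] > STALE_MONTHS]
        for t in stale:
            del tokens[t]

    def run_simulation(messages, tokens, train):
        # Train in chronological order, purging at every virtual-month
        # boundary rather than once at the end of the run.
        last_month = None
        for msg in sorted(messages, key=virtual_month):
            month = virtual_month(msg)
            if last_month is not None and month != last_month:
                purge(tokens, month)
            train(msg, tokens, month)   # hypothetical hook: updates hit counts
                                        # and each token's "last_seen" month
            last_month = month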
Test Messages
In tests measuring a filter’s accuracy, many testers make the mistake of using the same set of messages for both training and testing. Others make the mistake of using a completely different set of messages. In both cases, the accuracy of the filter is misrepresented because it doesn’t reflect a real-world scenario. Using the same set of messages for testing and training will cause the filter to appear more accurate than it really is, because it has already learned the messages that are being presented. Using an entirely different set of messages will cause the filter to appear less accurate, because the messages being presented have no continuity with the messages that have already been learned.
The goal of testing a filter’s accuracy is to measure its ability to adapt to permutations of the present training set. Real-world users experience diverse email behavior, but with a continuity of content and message senders. When this continuity is broken, the purpose for which the filter was designed is interrupted. To effectively measure accuracy, the test corpus should be continuous with the training corpus and should represent the next period of time in which messages were received.
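One way to enforce this continuity is to split the corpus chronologically, so that the test set is simply the next slice of mail received after the training set. A rough Python sketch, assuming parsed messages with valid Date headers and a hypothetical 80/20 split:

    from email.utils import parsedate_to_datetime

    def chronological_split(messages, train_fraction=0.8):
        # Sort by the date each message was received and hold out the most
        # recent slice for testing, so the test corpus is continuous with the
        # training corpus rather than a reuse of it or an unrelated set.
        ordered = sorted(messages, key=lambda m: parsedate_to_datetime(m["Date"]))
        cut = int(len(ordered) * train_fraction)
        return ordered[:cut], ordered[cut:]    # (training corpus, test corpus)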
Presuppositions
Probably the biggest reason that tests fail is that the tester never tries out the filters in a real-world environment to compare results. Automated, scientific testing is important, but it is all too often performed by individuals who have never run a statistical filter on their own for several months. As a result, the tester brings many presuppositions into what is supposed to be an unbiased testing process, which can lead to unreliable results. Testing a statistical filter is much the same as test-driving a vehicle. Having a focus group of 10 or 20 individuals drive the vehicle and report back provides useful information, but no sensible person would buy a car without first getting behind the wheel and taking it for a spin. A tester who has no feel for how statistical filters behave in a real-world environment is especially prone to this kind of bias, simply through lack of experience. Before testing any statistical filters, it’s important for the tester to install and run one on their own email for several months. This will have the benefit of building the tester’s experience with the filter, identifying caveats, and giving credibility to the test results when published.