Testing the Effectiveness of Multiple Filters
Testing to compare the effectiveness of multiple filters is a complicated task. Not only does the data need to simulate real-world behavior as closely as possible, but the amount of data and how the data is trained, classified, and corrected will change with every filter. The goal of the filter comparison test is to simulate each filter's real-world level of accuracy using the preferences prescribed in its documentation or recommended by its author. It's generally a good idea to ask the user community or even the author of a particular filter for an opinion about training thresholds, as the recommended values in the documentation are sometimes tailored to be more conservative or liberal to make users happy.
Test Criteria
The test criteria are similar to those of the accuracy range test, in that the data should be consistent and well preserved. The same data must be used to measure each filter; otherwise, the results will be meaningless. It is not uncommon to perform several different tests for each filter, using a different corpus of mail for each test.

There are two ways to determine how much data should be trained into the filter, depending on the results the tester is looking for. To achieve the fairest results, it may be appropriate to use a static amount of training data for all filters. Additional training usually helps only statistical filters, so taking the threshold of the filter that requires the most training and applying that threshold to all filters prevents the test from being skewed by the values recommended in each filter's documentation. Stricter tests, which seek to measure each filter's performance based not only on the data but also on these recommended values, should use the training level recommended in the documentation or provided by the filter author, possibly resulting in a different training level for each filter. In either case, it's important to take into consideration not only the number of messages being trained, but also the time period that the messages cover. A minimum of three months of training data should be considered for this test; six months is better.

Another thing to consider between filters is the training mode. Most filters support a default training mode, and so two individual tests may need to be run: one using the default training mode for each filter and one using the same training mode for all filters. The first type of test is generally a more accurate representation of how filters perform together, but it may be discovered that simply altering the training mode for a particular filter improves its performance. This should be reflected in the test results.

As with the accuracy range test, it is not a good idea to use 10 or 20 test subjects with the same background, but rather to find one or two individuals from each of a diverse set of backgrounds.

To measure the range of accuracy for each filter, it will be necessary to perform multiple tests on different sets of high-quality corpora. Each corpus of mail should follow the same guidelines outlined in the accuracy range test, summarized here; a configuration sketch capturing these criteria follows the list.

Message continuity
The original threads for all messages must be intact. The ordering of all messages must be preserved in each corpus, to ensure accurate results.

Archive window
Each corpus should cover a period of three to six months, regardless of the quantity of messages. If a separate test corpus will be used, it should be an extension of the original training corpus, representing the seventh month.
Purge simulation
Stale data should be purged to ensure accurate results.

Interleave
The original interleave of the messages (legitimate mail versus spam) should be used. If the interleave is not available, a best estimation may be used, or multiple tests should be run for each test simulation, and the three best and worst results should be averaged.

Corrective training delay
The best estimation for each user should be used. This may be different from filter to filter if some filters perform their filtering at different times or are integrated into different parts of the mail system.
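Captured as data, these criteria might look something like the following sketch. The structure, field names, and default values are hypothetical illustrations for a test harness, not settings taken from any particular filter's documentation.

    from dataclasses import dataclass

    @dataclass
    class FilterTestConfig:
        # Hypothetical per-filter test configuration; every default is an assumption.
        name: str                            # filter under test
        training_mode: str                   # default mode, or a mode shared across filters
        training_window_months: int = 3      # archive window: three to six months recommended
        purge_interval_days: int = 7         # how often the purge simulation runs
        preserve_interleave: bool = True     # keep the original ham/spam ordering
        correction_delay_messages: int = 50  # corrective training delay checkpoint

    # One entry per filter; the training level may differ per filter when the
    # documented threshold is used instead of a static amount.
    configs = [
        FilterTestConfig(name="filter-a", training_mode="default"),
        FilterTestConfig(name="filter-b", training_mode="default",
                         training_window_months=6, purge_interval_days=1),
    ]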
Performing the Test
The filter comparison test consists of the following events during each run of each filter.

Training period
The period during which an initial corpus of messages is trained into the filter.
Classification period
The period during which messages are presented for classification, rather than training.

Corrective period
The period during which misclassified messages are presented for retraining.

Purge period
The periods during which purging of stale data is simulated.

Listing 10-3 outlines the entire process.

Listing 10-3: Process flow of a comparison test
foreach filter
    foreach corpus
        reset all counters
        while messageCount < minCount or timePeriod < minPeriod
        do
            present next message for training
            if timeElapsed > nextPurgeInterval
            then
                perform purge simulation
        while more messages in corpus
        do
            present next message for classification
            if classification is wrong
            then
                determine nextInsertionPoint for correction
                increment incorrect classification counter
            else
                increment correct classification counter
            if timeElapsed > nextInsertionPoint
            then
                submit erroneous message for retraining
        calculate test accuracy AN = 100 - (100(TE/TM))
    optionally drop best and worst accuracy
    calculate average accuracy for filter
compare accuracy levels for each filter
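The same flow can be expressed as a minimal Python sketch. The filter interface (train, classify, retrain, purge), the Message structure, and the default intervals are hypothetical assumptions standing in for whatever mechanism each filter actually provides; only the control flow and the accuracy formula follow Listing 10-3.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Message:
        text: str
        date: datetime      # simulated clock, taken from the message's Date: header
        is_spam: bool       # the known true classification

    def run_corpus(spam_filter, training_corpus, test_corpus,
                   purge_interval=timedelta(days=7),
                   correction_delay=timedelta(hours=4)):
        """One corpus pass for one filter. spam_filter is a hypothetical wrapper
        assumed to expose train(), classify(), retrain(), and purge()."""
        errors = total = 0
        pending = []                                   # corrections awaiting their insertion point
        next_purge = training_corpus[0].date + purge_interval

        # Training period: seed the predetermined corpus into the filter.
        for msg in training_corpus:
            spam_filter.train(msg.text, is_spam=msg.is_spam)
            if msg.date > next_purge:                  # purge simulation checkpoint
                spam_filter.purge()
                next_purge = msg.date + purge_interval

        # Classification and corrective periods.
        for msg in test_corpus:
            total += 1
            if spam_filter.classify(msg.text) != msg.is_spam:
                errors += 1                            # incorrect classification counter
                pending.append((msg.date + correction_delay, msg))
            while pending and pending[0][0] <= msg.date:
                _, wrong = pending.pop(0)              # corrective training delay reached
                spam_filter.retrain(wrong.text, is_spam=wrong.is_spam)
            if msg.date > next_purge:
                spam_filter.purge()
                next_purge = msg.date + purge_interval

        return 100 - (100 * errors / total)            # AN = 100 - (100(TE/TM))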
Each filter test begins with a training cycle for each test corpus. During this training cycle, the predetermined set of test messages is seeded into the filter, using whatever mechanism the filter provides. Depending on whether the training threshold is static or dynamic, the number of messages and the time period they cover may differ from filter to filter. As in the accuracy range test, the time stamps used to record the data should be based on the time stamps in the message headers. If a filter does not provide a mechanism to apply this modification, it may be necessary to make minor coding changes to the filter. This simulation of time stamping is necessary in order to support a purge simulation for each filter.

The purge simulation recommended for each filter should be used, and the purge intervals for each filter should be respected. Because filters manage data very differently, using a common purge value is not recommended, as it may skew the data. Unlike the training window, the purge process differs greatly between filters, based on many factors such as training mode.

The decision as to whether to perform a purge simulation is evaluated every time a message is processed, based on the time stamp of the current and/or next message. A purge simulation should be run at the intervals recommended in the filter's documentation or by its author. This is generally on a nightly or weekly basis, and therefore it will be necessary to calculate this delta from the time stamps in the message headers. When the purge simulation runs, all other processing should pause until it completes. When the simulation completes, the training loop may continue.

Once the training corpus has been trained into the filter being tested, the next step is to present the test corpus. The test corpus should be a different set of messages from the training corpus, and should represent the period of messages occurring immediately after the training corpus. As each message is presented for testing, the result of the classification should be compared with the actual classification of the message. If the result is incorrect, the message should be resubmitted for error correction at the correction delay checkpoint determined by the tester. This checkpoint may be based on the number of messages processed, or it may be based on time; it's up to the tester to determine the most reasonable delay period for retraining.

For every corpus that is processed, the accuracy of the test is calculated. When every corpus for a filter has been completed, enough data is present to determine the peak, floor, and average levels of accuracy for that filter.
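A brief sketch of that final aggregation step, assuming the run_corpus() sketch above returns the per-corpus accuracy for a filter; dropping the best and worst runs corresponds to the optional step in Listing 10-3.

    def summarize_filter(per_corpus_accuracies, drop_extremes=False):
        """Reduce per-corpus accuracies to peak, floor, and average for one filter."""
        scores = sorted(per_corpus_accuracies)
        if drop_extremes and len(scores) > 2:
            scores = scores[1:-1]            # optionally drop best and worst accuracy
        return {
            "peak": max(scores),
            "floor": min(scores),
            "average": sum(scores) / len(scores),
        }

Comparing filters is then a matter of computing one such summary per filter over the same corpora and comparing the resulting peak, floor, and average values.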