Measuring the Accuracy of a Specific Filter
This test is designed to measure a range of accuracy for a specific filter, using a set of users’ email as the test medium. It generally involves testing messages for 10 to 20 different users and dropping the top and bottom results from the mix. This test isn’t designed to measure peak accuracy but rather average results. It is therefore in the interest of the tester to use test subjects with very diverse email behavior.
Test Criteria
It’s not a good idea to use ten programmer friends for this test; instead, find one or two individuals in each of a range of categories. Here are some sample categories to consider:
Computer programmers
Online merchants
Musicians
Corporate executives
Members of the clergy
Low-volume users
Medical practitioners
Blue-collar workers
Soccer moms
Government employees
Teenagers
Attorneys
To measure the range of accuracy for a specific filter, it will be necessary to perform several tests on different sets of high-quality corpora. Each corpus of mail should follow all of the guidelines outlined previously in this chapter:

Message continuity
The original threads for all messages must be intact, and the ordering of all messages must be preserved in each corpus to ensure accurate results.

Archive window
Each corpus should cover a period of three to six months, regardless of the quantity of messages. If a separate test corpus is used, it should be an extension of the original training corpus, representing the seventh month. Several tests may be run on each user, using a different training period for each test.
Purge simulation
Stale data should be purged to ensure accurate results.

Interleave
The original interleave of the messages (legitimate mail versus spam) should be used. If the interleave is not available, a best estimate may be used, or multiple tests should be run for each test simulation, with the three best and worst results averaged.

Corrective training delay
The best estimate for each user should be used.

Finally, determine the number of training messages and the time period required for optimal performance of the filter being tested. The software’s documentation will sometimes recommend values, but it may be best to consult the filter’s author as well; occasionally the documented values are too conservative or otherwise unreasonable and require tweaking. If it is uncertain where a good training count lies, it may be necessary to perform a few different tests to identify the optimal range.
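Sweeping for a good training count can be sketched as a simple loop over candidate counts. This is a minimal sketch under stated assumptions: run_accuracy_test is a hypothetical stand-in for the full test procedure, stubbed here with a fake accuracy curve purely so the sketch runs on its own.

```python
# Hypothetical sweep to locate a good training count. The stub below
# fakes an accuracy curve that peaks near 1,000 training messages;
# a real harness would run the full accuracy range test instead.
def run_accuracy_test(corpus, training_count):
    # Stub: pretend accuracy peaks at 1,000 messages and falls off.
    return 99.0 - abs(training_count - 1000) / 1000.0

def find_training_range(corpus, candidates):
    """Run one test per candidate count and report the best."""
    results = {n: run_accuracy_test(corpus, n) for n in candidates}
    best = max(results, key=results.get)
    return best, results

best, results = find_training_range([], [250, 500, 1000, 2000, 4000])
```

In practice each candidate count means a full, separate test run, so a handful of well-spaced candidates is usually more realistic than a fine-grained sweep.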
Performing the Test
The accuracy range test consists of the following events:

Training period
The period during which an initial corpus of messages is trained into the filter.

Classification period
The period during which messages are presented for classification, rather than training.

Corrective period
The period during which misclassified messages are presented for retraining.

Purge period
The periods during which purging of stale data is simulated.

Listing 10-1 outlines the entire process.

Listing 10-1: Process flow of an accuracy range test
while messageCount < minCount or timePeriod < minPeriod
do
    present next message for training
    if timeElapsed > nextPurgeInterval
    then
        perform purge simulation
done

while more messages in corpus
do
    present next message for classification
    if classification is wrong
    then
        determine nextInsertionPoint for correction
        increment incorrect classification counter
    else
        increment correct classification counter
    if timeElapsed > nextInsertionPoint
    then
        submit erroneous message for retraining
done
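The flow in Listing 10-1 can be sketched in Python. This is a minimal sketch, not a definitive harness: Message, StubFilter, and the train/classify/correct/purge method names are assumptions standing in for whatever interface the filter under test actually exposes.

```python
from dataclasses import dataclass

@dataclass
class Message:
    timestamp: int   # taken from the message's own headers
    is_spam: bool    # the known, true classification

class StubFilter:
    """Placeholder filter: counts trained messages per class and
    predicts 'spam' whenever more spam than ham has been seen."""
    def __init__(self):
        self.spam_seen = 0
        self.ham_seen = 0
    def train(self, msg):
        if msg.is_spam:
            self.spam_seen += 1
        else:
            self.ham_seen += 1
    def correct(self, msg):
        self.train(msg)  # a real filter must also decrement the wrong class
    def classify(self, msg):
        return self.spam_seen > self.ham_seen
    def purge(self, now):
        pass             # purge-simulation hook

def accuracy_range_test(flt, corpus, min_count, purge_interval, correction_delay):
    """Run the training, classification, corrective, and purge periods
    over one time-ordered corpus; returns (right, wrong) counts."""
    right = wrong = 0
    next_purge = corpus[0].timestamp + purge_interval
    pending = []  # (due_time, message) corrections awaiting retraining
    for i, msg in enumerate(corpus):
        if msg.timestamp >= next_purge:       # purge period
            flt.purge(msg.timestamp)
            next_purge += purge_interval
        if i < min_count:                     # training period
            flt.train(msg)
            continue
        while pending and pending[0][0] <= msg.timestamp:
            flt.correct(pending.pop(0)[1])    # corrective period
        if flt.classify(msg) == msg.is_spam:  # classification period
            right += 1
        else:
            wrong += 1
            pending.append((msg.timestamp + correction_delay, msg))
    return right, wrong
```

The training loop here is driven by message count alone; the listing's time-period condition and the interleave and delay guidelines from earlier in the chapter would be layered on in a real harness.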
The process begins with a training cycle. During this cycle, the recommended set of test messages is seeded into the filter, using whatever mechanism the filter provides for corpus training. The number of messages and the time period the messages should cover are determined from the recommended values in the documentation or from the author. The time stamps used to record the data should be taken from the time stamps in each message’s headers; if the filter does not provide a mechanism for this, it may be necessary to make minor coding changes. This simulation of a time stamp is done in order to support the purge simulation.

The decision as to whether to perform a purge simulation is made every time a message is processed, based on the time stamp of the current and/or next message. A purge simulation should be run at the intervals recommended in the filter’s documentation or by its author. This is generally on a nightly or weekly basis, so it will be necessary to calculate this delta from the time stamps in the message headers. When the purge simulation runs, all other processing should pause until it completes. Once the simulation is complete, the training loop may continue.

When the training corpus has been correctly trained into the filter, the next step is to present the test corpus. The test corpus should be a different set of messages from the training corpus, and should represent the next period of messages occurring immediately after the training corpus. The test corpus serves two purposes: it provides the data that is actually tested, and it is used to schedule whatever corrections are necessary. As each message is presented for testing, the result of the classification should be compared with the actual classification of the message. If the result was incorrect, the message should be resubmitted for error correction at whatever correction delay checkpoint is determined by the tester. This may be based either on the number of messages processed or on the time in the time stamps. It’s up to the tester to determine the most reasonable delay period for retraining.

When the message is retrained, it’s important to use whatever mechanism the filter supports to correct errors. Many testers make the mistake of feeding the message back in with the inverse classification, but this doesn’t necessarily correct the error: in most cases, feeding the message through a second time will increment the relevant tokens’ correct-classification counts but will not decrement the incorrect counts. Be sure to use whatever error-correction arguments are necessary.

Once the process has completed, the tester will be left with a number of correct and incorrect classifications. The accuracy of the filter can be calculated using the following formula:
100 - (100 * (totalErrors / totalTestMessages))

If three errors were made in 1,000 test messages, the formula evaluates as follows:

100 - (100 * (3 / 1000)) = 99.7, or 99.7% accuracy
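The error-correction pitfall described above can be made concrete with token counts. This is an illustrative sketch, not any particular filter’s API: it assumes a simple per-token counter model, in which proper correction must decrement the class the message was mistakenly counted under as well as increment the true class.

```python
from collections import defaultdict

spam_counts = defaultdict(int)  # token -> times seen in spam
ham_counts = defaultdict(int)   # token -> times seen in ham

def train(tokens, as_spam):
    counts = spam_counts if as_spam else ham_counts
    for t in tokens:
        counts[t] += 1

def correct_error(tokens, true_is_spam):
    """Proper correction: increment the true class AND decrement the
    class the message was mistakenly counted under. Merely calling
    train() again would leave the wrong-class counts inflated."""
    add = spam_counts if true_is_spam else ham_counts
    sub = ham_counts if true_is_spam else spam_counts
    for t in tokens:
        add[t] += 1
        sub[t] = max(0, sub[t] - 1)

tokens = ["cheap", "pills"]
train(tokens, as_spam=False)              # spam mistakenly counted as ham
correct_error(tokens, true_is_spam=True)  # undo ham counts, add spam counts
```

After the correction, the ham counts for these tokens return to zero and the spam counts reflect the true class, which is exactly what a second plain training pass would fail to achieve.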
Each test will undoubtedly yield a different number of errors. An average accuracy can be calculated by averaging the results of the tests. Omitting one or two results at each end of the range may provide a more reliable average, while the top and bottom results provide a useful best- and worst-case scenario.
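The averaging scheme above can be sketched as a trimmed mean. The function name and the trim parameter are illustrative, not part of any test tool:

```python
def trimmed_average(accuracies, trim=1):
    """Average per-user accuracy results, dropping the top and bottom
    `trim` results; also report the worst and best case observed."""
    ordered = sorted(accuracies)
    core = ordered[trim:len(ordered) - trim] if len(ordered) > 2 * trim else ordered
    average = sum(core) / len(core)
    return average, ordered[0], ordered[-1]  # (average, worst, best)
```

For example, trimmed_average([90.0, 99.7, 99.0, 99.5, 40.0]) drops the 40.0 and 99.7 outliers before averaging, while still reporting them as the worst and best cases.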