Comparing Features in a Single Filter


Comparing features in a single filter can require an entirely different approach to testing, depending on the features being measured. The feature comparison test is designed to measure differences in effectiveness among features that are independent of the filter’s core learning algorithms; that is, it is useful for testing features that do not directly affect real-world machine learning but that still play a role in accuracy. This test is frequently performed to measure the different error levels between tokenizers or tokenizer philosophies. It can be used for some learning features as well, but the results it provides may not be as accurate as those of an accuracy range test; to perform a true real-world comparison between two features, the tester should run an accuracy range test instead.
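For instance, the "features" under comparison are often two tokenizer philosophies. The Python sketch below shows one hypothetical pair of tokenizers such a test might compare; the specific rules (lowercase word splitting versus header-tagged tokens) are illustrative assumptions, not prescriptions from the test itself.

import re

def tokenize_simple(text):
    # Baseline philosophy: lowercase words split on non-token characters.
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def tokenize_with_headers(headers, body):
    # Alternative philosophy: prefix header tokens with the header name, so
    # that "sale" in the Subject line is learned separately from "sale" in
    # the message body.
    tokens = []
    for name, value in headers.items():
        tokens.extend(f"{name}:{tok}" for tok in tokenize_simple(value))
    tokens.extend(tokenize_simple(body))
    return tokens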

The idea behind the feature comparison test is to create an environment that reflects not real-world behavior but discontinuous, diverse behavior. This allows the features to be measured against many different types of messages. An ideal corpus for this type of test is the SpamAssassin public corpus, available at http://www.spamassassin.org/publiccorpus. While the environment should be diverse, it should not be overly chaotic; some contextual similarity is necessary for training, but complete message continuity should be avoided.
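As a rough sketch, the test environment can be assembled by pooling the SpamAssassin archives and shuffling them into one discontinuous stream. The directory names below (easy_ham, hard_ham, spam) follow the public corpus layout but are assumptions here; adjust them to whatever archives are actually downloaded.

import os, random

def load_corpus(root):
    # Pool ham and spam from the unpacked SpamAssassin archives and shuffle
    # them into a single diverse, discontinuous message stream.
    pool = []
    for label, subdir in (("ham", "easy_ham"), ("ham", "hard_ham"), ("spam", "spam")):
        path = os.path.join(root, subdir)
        if not os.path.isdir(path):
            continue
        for name in sorted(os.listdir(path)):
            with open(os.path.join(path, name), "rb") as fh:
                pool.append((label, fh.read()))
    random.shuffle(pool)
    return pool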

Test Criteria


The test criteria for the feature comparison test call for a more heuristic approach to testing. What is being measured is not the absolute level of accuracy, but rather the distance between the two levels of accuracy obtained with different features enabled (or disabled) in the filter. In most cases this test yields very poor accuracy figures, because it relies on sparse, discontinuous training data. The goal is to train the filter initially with a set of training text from a corpus, then shuffle the remaining messages several times, using the results to establish the peak, floor, and average delta in accuracy or error rate.
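In other words, the statistic of interest is the gap between the two error rates, shuffle by shuffle. A minimal sketch of that calculation follows; the function and variable names are hypothetical.

def delta_stats(errors_a, errors_b):
    # errors_a and errors_b hold the error rate of each feature on the same
    # series of shuffles; the deltas between them are what the test reports.
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    return {
        "peak": max(deltas),
        "floor": min(deltas),
        "average": sum(deltas) / len(deltas),
    }

# Example: delta_stats([0.052, 0.048, 0.061], [0.044, 0.046, 0.050])
# yields a peak of 0.011, a floor of 0.002, and an average of about 0.007.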

The training set used is very different from that used for the other tests, in that a somewhat chaotic environment is created to measure the effectiveness of each feature. The entire training and test corpus combined may share the same context of messages if the tester’s goal is to compare the features in a diverse environment, or it may use an entirely different set of messages if the goal is to measure the filter’s ability to detect unpredictable and unlearned types of messages.

Message continuity

The messages should not be continuous but should be diverse in nature. Entire threads need not be preserved, although a few messages from a thread may appear in the corpus. The ordering of all messages should be mixed: randomly shuffling most of the messages in the test corpus is useful for this test, although some messages may be kept in order.

Archive window

Since no long-term learning is being measured, the archive window can be any length. It is acceptable to use a static number of training and test messages.

Purge simulation

In this chaotic environment, in which there is no long-term learning, a purge simulation is generally not necessary.

Interleave

A sparse interleave of the messages should be used. The original interleave should not be preserved, but rather messages should be somewhat randomized. The test corpus should be completely randomized.
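One way to realize this, sketched below assuming a Python harness, is to shuffle the training messages only within small windows (a sparse interleave) while shuffling the test corpus outright. The window-based partial shuffle is just one possible interpretation of "somewhat randomized."

import random

def sparse_interleave(messages, window=10):
    # Shuffle locally within fixed-size windows: the coarse ordering survives,
    # but the original interleave does not.
    out = list(messages)
    for start in range(0, len(out), window):
        chunk = out[start:start + window]
        random.shuffle(chunk)
        out[start:start + window] = chunk
    return out

def full_shuffle(messages):
    # The test corpus is completely randomized.
    out = list(messages)
    random.shuffle(out)
    return out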

Corrective training delay

Since real-world behavior is not being simulated, corrective training may be performed immediately on error.
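In code, this simply means retraining inside the classification loop; classify() and train() below are hypothetical hooks into whatever filter is under test.

def present_for_classification(filter_, message, actual_label, counters):
    predicted = filter_.classify(message)
    if predicted != actual_label:
        filter_.train(message, actual_label)   # corrective training, immediately on error
        counters["incorrect"] += 1
    else:
        counters["correct"] += 1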

Performing the Test


The feature comparison test consists of the following events during each run of each filter.

Training period

The period during which an initial corpus of messages is trained into the filter. This may be a percentage of a single corpus, or it may be an entirely different corpus if the tester is measuring behavior in an unpredictable environment.

Classification period

The period during which messages are presented for classification rather than training. In this case, the messages will also be randomized.

Corrective training period

The period during which erroneously classified messages are presented for retraining.

Listing 10-4 outlines the entire process.

Listing 10-4: Process flow in a feature comparison test

foreach feature
do
    while messageCount < minCount
    do
        present next message for training
    foreach test shuffle
    do
        randomize message order in test corpus
        while more messages in corpus
        do
            present next message for classification
            if classification is wrong
            then
                perform corrective training
                increment incorrect classification counter
            else
                increment correct classification counter
    combine results for all test shuffles
compare results between features

The test process involves an initial training cycle similar to that of all other tests. Once the initial training cycle has been completed, the remaining messages to be used for testing are shuffled into N decks. The number of decks is determined by the tester, depending on how thorough the testing should be. This number is usually between 5 and 10.
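A sketch of the deck-building step follows, under the assumption that each deck is simply the same remaining test corpus in a different random order; the function name and defaults are hypothetical.

import random

def make_decks(test_messages, n_decks=7, seed=None):
    # Build N independently shuffled decks from the remaining test messages;
    # the text suggests a deck count between 5 and 10.
    rng = random.Random(seed)
    decks = []
    for _ in range(n_decks):
        deck = list(test_messages)
        rng.shuffle(deck)
        decks.append(deck)
    return decks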

Each shuffled deck is trained and classified. The floor and peak results are generally truncated, and the results from the remaining decks are averaged. Once the test process has been performed with each feature enabled in turn, the results are compared to determine which features delivered the best overall performance (lowest error rate) and which delivered the worst overall performance (highest error rate).
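The aggregation step might look like the following sketch: drop the floor and peak deck results, average the remainder, and rank the features by the resulting error rate. All names here are hypothetical.

def trimmed_mean_error(deck_error_rates):
    # Truncate the floor and peak results, then average what remains.
    rates = sorted(deck_error_rates)
    trimmed = rates[1:-1] if len(rates) > 2 else rates
    return sum(trimmed) / len(trimmed)

def rank_features(results):
    # results maps a feature name to its per-deck error rates; the feature
    # with the lowest trimmed-mean error rate ranks first.
    return sorted(results, key=lambda feature: trimmed_mean_error(results[feature]))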
