Chapter 4: Statistical Filtering Fundamentals
Overview
This chapter builds on the last chapter and illustrates the clockwork inside statistical filters. If you’re a developer, this chapter will answer most of the questions you may have from Chapter 3 and give you the procedural information necessary to develop a typical Bayesian filter. If you’re a systems administrator, this chapter will explain specifically how the statistical filter you are using functions and what some of its configuration options may do, so that you can better understand what’s going on behind the scenes.

Statistical filtering involves measuring probability. One of the great benefits of statistical language classification is that you’re actually measuring something, as opposed to guessing scores. Heuristic-based filters, on the other hand, assign a “score” to each feature, which is really just an arbitrary number without mathematical significance. As Paul Graham puts it, “The user doesn’t know what it means, but worse still, neither does the developer of the filter.” It’s very difficult for a systems administrator to interpret the meaning of a score, since it doesn’t relate to any real measurement. It’s even more difficult for a developer, who is supposed to be coding new rules, to do the job without being able to measure the effects of each change.

Statistical classifiers, by contrast, measure something very specific: mathematical probability. The idea that “there is an X percent chance that this message is spam” is a lot easier to comprehend than “this message scored a 3.52.” It also gives filter authors and systems administrators a look into the thought process of the filter, so they can better ensure its correct operation.
By mathematically weighing even the simplest characteristics of an email, such as the probability that the word “Viagra” indicates spam, we end up with more reliable information and more reliable spam filters.

Statistical language classification has become very popular for many different types of solutions, in addition to spam filtering. While it is used in fighting spam, it has also become a mainstream approach to solving problems once considered to be solvable only by a human. In fact, many employees have been hired in the past to spend hours doing what statistical language classifiers do today in fractions of a second: making decisions about documents. Thanks to bright mathematicians like Thomas Bayes, computers are now making our more trivial decisions for us.
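To make the idea of weighing a single word such as “Viagra” concrete, here is a minimal sketch in Python. It assumes a simple ratio of how often the word appears in spam versus legitimate mail. The function name, the message counts, and the neutral 0.5 fallback for unseen words are all illustrative assumptions, not necessarily the formula a given filter uses.

```python
def token_spam_probability(spam_hits, ham_hits, total_spam, total_ham):
    """Estimate the probability that a message containing a token is spam.

    Illustrative assumption: compare the fraction of spam messages that
    contain the token with the fraction of legitimate ("ham") messages
    that contain it.
    """
    if total_spam <= 0 or total_ham <= 0:
        raise ValueError("both corpora must contain at least one message")

    spam_rate = spam_hits / total_spam  # how often the token shows up in spam
    ham_rate = ham_hits / total_ham     # how often it shows up in legitimate mail

    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way (assumed neutral value)

    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts: the word appears in 90 of 1,000 spam messages
# and in 2 of 1,000 legitimate messages.
p = token_spam_probability(spam_hits=90, ham_hits=2, total_spam=1000, total_ham=1000)
print(f"{p:.1%}")  # prints 97.8%
```

Unlike a heuristic score such as 3.52, the result reads directly as a measurement: with these made-up counts, roughly a 98 percent chance that a message containing the word is spam.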