Using word pairs, or nGrams, has recently become very popular among authors of statistical filters, and it offers significant benefits over standard single-token filtering. Pairing words together creates more specialized tokens. For example, the word “play” by itself is fairly neutral, since it can be used to describe many different things, but pairing it with the word adjacent to it yields a token that stands out far more when it occurs, such as “play lotto.” This approach also improves the processing of HTML components by helping to identify the different types of generators used to create HTML messages. Each generator, whether it’s a legitimate mail client or a spam tool, has its own unique signature, which joining tokens together can help to highlight. Tokenizers that implement these types of approaches are referred to as concept-based tokenizers, because they identify concepts in addition to content. We’ll discuss the different implementations of nGrams in Chapter 11.
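The word-pairing idea can be illustrated with a minimal sketch in Python. This is not any particular filter’s implementation, just an assumed tokenizer that splits a message into words and then joins each word with its neighbor to produce the more specialized bigram tokens described above:

```python
import re

def tokenize(text):
    """Split a message into lowercase word tokens (a simplifying assumption;
    real filters often handle punctuation and HTML markup more carefully)."""
    return re.findall(r"[a-z']+", text.lower())

def bigrams(tokens):
    """Pair each token with the token adjacent to it, producing word-pair
    tokens that are more specific than the individual words."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

message = "Play lotto today and win big"
print(bigrams(tokenize(message)))
# → ['play lotto', 'lotto today', 'today and', 'and win', 'win big']
```

Note how the neutral token “play” disappears into the much more distinctive pair “play lotto,” which is far likelier to carry a strong spam probability on its own.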