Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

Jonathan A. Zdziarski


Karnaugh Mapping

Karnaugh maps are primarily an electrical engineering and low-level computer science concept, originally used to minimize Boolean logic. A Karnaugh map provides a graphical approach to grouping together expressions with common factors and eliminating unwanted variables. It can be used in reverse to create a verbose set of tokens similar to those created by SBPH, except that ordering and even adjacency are ignored. The logic minimization features of a Karnaugh map come into play in that the map can treat the absence of a specific token in a phrase as a meaningful piece of data. Until now, all of the tokenization approaches introduced have dealt only with data that is present, or have made exceptions for wildcard data that isn't present. But what about evaluating the presence of the word "free" next to another word that is not "software"? This too could be considered an important piece of the answer to the question of what content is.

Karnaugh maps can be used to illustrate logic for a range of different numbers of inputs. The ideal Karnaugh map in the setting of language classification would have a window size of four inputs. A 4-input Karnaugh mapping tokenizer would take in four tokens at a time:

CALL NOW, IT'S FREE!

and generate a mapping of all the different possible combinations of the tokens, with each token either present or absent in each expression, producing a total of 2^N combinations (2^4 = 16 for a four-token window). Literals are assigned to the individual components of a Karnaugh map, and logic is then produced for the literals.

Token    Literal
CALL     A
NOW      B
IT'S     C
FREE!    D
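The enumeration itself is mechanical. As a minimal sketch in Python (the function name and the letter-per-token literal assignment are illustrative, not the book's implementation), a tokenizer could generate all 2^N expressions for a window like this:

from itertools import product

def karnaugh_combinations(window):
    # Assign one literal (A, B, C, ...) per token in the window.
    literals = [chr(ord('A') + i) for i in range(len(window))]
    # Walk every present/absent assignment; absent tokens appear
    # in complemented form (A', B', ...).
    for bits in product((False, True), repeat=len(window)):
        yield ''.join(lit if present else lit + "'"
                      for lit, present in zip(literals, bits))

# A four-token window yields 2^4 = 16 expressions,
# from A'B'C'D' through ABCD.
for expr in karnaugh_combinations(["CALL", "NOW", "IT'S", "FREE!"]):
    print(expr)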

For the four-token window, the 2^4 = 16 expressions are:

A'B'C'D'   A'B'C'D   A'B'CD'   A'B'CD
A'BC'D'    A'BC'D    A'BCD'    A'BCD
AB'C'D'    AB'C'D    AB'CD'    AB'CD
ABC'D'     ABC'D     ABCD'     ABCD

Tokens in the expression that are not present are expressed in their complemented form; for example, A' is pronounced “not A.”

As shown in Figure 11-1, there are 16 different combinations of logic in a 4-variable Karnaugh map. For example, A'B and C'D can be stored as two separate tokens. At the same time, A'BC', BC'D, and A'BC'D can also be stored as tokens. Finally, even the individual single tokens can be stored: A, B, C, and D. The literal representations themselves aren't stored, only the individual tokens that make up each expression. In order to create a storage solution in which token phrases can be recalled in any order, two things need to be stored: a token key, built from the alphabetic ordering of the tokens included in the expression, and the ordering of the expression. For example,

CALL+FREE!+NOW   (alphabetically ordered key)
ADB              (literal ordering)

Figure 11-1: A 4-variable Karnaugh map
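A rough sketch of how the two stored pieces might be derived (the function storage_record and its layout are assumptions for illustration, matching the example above):

def storage_record(tokens, window):
    # Alphabetically ordered key, e.g. CALL+FREE!+NOW.
    key = '+'.join(sorted(tokens))
    # Map each window token to its literal: CALL->A, NOW->B, ...
    literal_of = {tok: chr(ord('A') + i) for i, tok in enumerate(window)}
    # The literal ordering stored alongside the key: the window
    # literal of each token in the key's order,
    # e.g. CALL, FREE!, NOW -> ADB.
    order = ''.join(literal_of[tok] for tok in sorted(tokens))
    return key, order

window = ["CALL", "NOW", "IT'S", "FREE!"]
print(storage_record(["CALL", "NOW", "FREE!"], window))
# ('CALL+FREE!+NOW', 'ADB')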

The literal ordering is important because even though we're not preserving order, we want to give a higher weight to tokens that appear in their original order, as well as to more complex combinations. This brings us to the subject of weighting. Among the mass quantities of data generated by Karnaugh mapping, how do we know which tokens are more important than others? To address this, we can apply a 2^N weighting scheme, similar to the way Markovian-based filters do it, as we'll discuss in the next chapter. First, we give preference to more complex phrases, that is, phrases that contain more tokens. Second, we give preference to phrases that appear in their original order. Finally, out-of-order phrases are weighted by their complexity alone.

Before we can assign weights to individual tokens, each Karnaugh map must first be evaluated. For example, if we have the phrase

CALL NOW, IT'S COOL!

the Karnaugh map against our original set of tokens will evaluate to ABCD', as D is not present. Once the Karnaugh map is resolved, we can load all relevant maps and perform our weighting:

Literal/Order            Weight
Any one literal          2
Any two literals         4
AB + BC + CD             8
Any three literals       16
ABC + BCD + ACD + ABD    32
Any four literals        64
ABCD                     128
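Putting the two steps together, here is a minimal sketch (assuming the helper names evaluate_map and weight, and treating the table's rows as a literal lookup rather than a general rule):

# Base weights by literal count, doubled for the table's in-order rows.
BASE_WEIGHT = {1: 2, 2: 4, 3: 16, 4: 64}
IN_ORDER = {"AB", "BC", "CD", "ABC", "BCD", "ACD", "ABD", "ABCD"}

def evaluate_map(window, phrase_tokens):
    # Resolve a Karnaugh map: mark each of the original window's
    # tokens as present (A) or absent (A') in the phrase at hand.
    present = set(phrase_tokens)
    return ''.join(chr(ord('A') + i) + ('' if tok in present else "'")
                   for i, tok in enumerate(window))

def weight(order):
    # Weight a stored combination by its literal-order string:
    # complexity first, with the doubled bonus for in-order rows.
    base = BASE_WEIGHT[len(order.replace("'", ""))]
    return base * 2 if order in IN_ORDER else base

window = ["CALL", "NOW", "IT'S", "FREE!"]
print(evaluate_map(window, ["CALL", "NOW", "IT'S", "COOL!"]))  # ABCD'
print(weight("ADB"))   # three literals, out of order -> 16
print(weight("ABD"))   # three literals, in order    -> 32
print(weight("ABCD"))  # the full window, in order   -> 128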

All that’s left now is to factor in the absence of specific tokens. For every combination in the corpus, the dataset will also hold a record for each absence. This makes for a significant amount of data (again, this approach isn’t for the faint of heart, or the faint of wallet), so we recall all the different combinations of the present Karnaugh mapping together with its absent tokens and measure the specific absence of the word FREE!, the word NOW, and so forth.
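One way to picture that recall step (a hypothetical sketch; the book does not specify the storage layout) is to enumerate every sub-combination of the resolved map, complemented literals included:

from itertools import combinations

def relevant_records(resolved):
    # Split a resolved map such as ABCD' into its literals,
    # keeping each complement attached to its letter.
    literals = []
    for ch in resolved:
        if ch == "'":
            literals[-1] += "'"
        else:
            literals.append(ch)
    # Yield every sub-combination, so records encoding a specific
    # absence (e.g. AD': CALL next to not-FREE!) can be recalled.
    for n in range(1, len(literals) + 1):
        for combo in combinations(literals, n):
            yield ''.join(combo)

print(list(relevant_records("ABCD'")))
# 15 combinations, from A and D' up through ABCD'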

Theoretically, once all of the Karnaugh maps have been evaluated and every possible combination of logic has been exhausted, enough data should be present to make amazingly accurate decisions about the data. Since this approach is theoretical, it has been tested only on very small corpora, and for generic language classification at that. It has proven, however, to provide an extremely accurate analysis of the samples presented. Technology like this is more useful in applications where other kinds of spam are being explored, such as massive search engines trying to filter page-ranking spam out of pages on the Internet. Generally, the amount of resources needed for this type of language classification makes it useful only when there is a good reason to spend a significant amount of money on hardware. This approach could potentially even be used to discover trends in DNA sequencing and other types of knowledge discovery. As Karnaugh mapping continues to be tinkered with, spam fighters will inevitably find ways to implement the approach using fewer resources. Training modes and purging could make Karnaugh mapping an extremely accurate approach to tokenization for single-user implementations.