Sparse Binary Polynomial Hashing
We discussed sparse binary polynomial hashing, or SBPH, briefly in Chapter 6. Now we’ll take a closer look at SBPH’s tokenizing approach to see what has made it so successful in full-featured language classifiers. Its most basic definition, as outlined at the MIT Spam Conference in 2003, is as follows:
Sparse Binary Polynomial Hashing (SBPH) is a way to create a lot of distinctive features from an incoming text. The goal is to create a LOT of features, many of which will be invariant over a large body of spam (or nonspam).
SBPH is similar to chained tokens in that it identifies different concepts within the text, but it goes beyond chained tokens by also identifying sparse phrases and lexical data. The general concept of SBPH, according to its author, Bill Yerazunis, is to “break the incoming text into short phrases from one to five words each.” As further explained in the SBPH white paper,
A phrase can have words in the middle of the phrase skipped (e.g. “BUY <skip_word> ONLINE NOW!!!” is always a bad sign.), and more than one phrase can use the same word. You can’t change the order of the words, but you _can_ bridge across newlines, punctuation, etc. Make all the phrases you can make.
For example, the phrase
An uncommon offer for an
would be tokenized by SBPH into the following set of tokens:
An
An uncommon
An <skip> offer
An uncommon offer
An <skip> <skip> for
An uncommon <skip> for
An <skip> offer for
An uncommon offer for
An <skip> <skip> <skip> an
An uncommon <skip> <skip> an
An <skip> offer <skip> an
An uncommon offer <skip> an
An <skip> <skip> for an
An uncommon <skip> for an
An <skip> offer for an
An uncommon offer for an
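To make this expansion concrete, here is a minimal Python sketch, not CRM114’s actual code, that reproduces the feature set above using the convention shown in this example (each subphrase is anchored on the first word of a five-word window); the function name sbph_window_features is ours:

```python
from itertools import product

SKIP = "<skip>"

def sbph_window_features(words, window=5):
    """Emit every order-preserving subphrase for one window of up to
    `window` words, keeping the first word and either keeping or skipping
    each later word (2^(window-1) features for a full window)."""
    if not words:
        return []
    n = min(window, len(words))
    features = [words[0]]                       # the single-word feature
    for length in range(2, n + 1):
        middle = words[1:length - 1]            # words that may be skipped
        last = words[length - 1]                # the last word is always kept
        for mask in product((False, True), repeat=len(middle)):
            body = [w if keep else SKIP for w, keep in zip(middle, mask)]
            features.append(" ".join([words[0]] + body + [last]))
    return features

if __name__ == "__main__":
    for feature in sbph_window_features("An uncommon offer for an".split()):
        print(feature)
```

Running the sketch on “An uncommon offer for an” prints the same set of subphrases listed above, though in a slightly different order.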
Once tokenized, these phrases are inserted into the dataset in a fashion similar to that used in standard Bayesian analysis: the total number of times each phrase has appeared in each corpus is tallied, and the most interesting phrases are used in the final calculation. CRM114 converts each plain-text token into a numeric value using a 64-bit hash. The complete process can be summed up in three steps:
1. Slide a window N words long over the incoming text.
2. For each window position, generate a set of order-preserving subphrases containing combinations of the windowed words.
3. Calculate 32-bit hashes of these order-preserved subphrases.
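As a rough illustration of the third step, the helper below maps one order-preserved subphrase to a 32-bit value; zlib.crc32 is only a stand-in here, since CRM114 defines its own hash function:

```python
import zlib

def subphrase_hash(phrase: str) -> int:
    """Map an order-preserved subphrase to a 32-bit value. zlib.crc32 is a
    stand-in; CRM114 uses its own hash function, not CRC32."""
    return zlib.crc32(phrase.encode("utf-8")) & 0xFFFFFFFF

print(hex(subphrase_hash("An <skip> offer for an")))
```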
Each word in a text affects 2^(n-1) features, where n is the window size (5 in our case). Needless to say, SBPH generates a whole lot more tokens than primitive tokenization! The present-day implementation of SBPH does have areas where some improvements could be made:
The current implementation uses 8-bit characters instead of wide characters. To improve support for multiple languages, the wide-character datatype should be used instead.
SBPH currently preserves ordering, but it may be interesting to see what happens when ordering is not preserved.
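As a purely hypothetical illustration of the second point, one way to discard ordering would be to canonicalize each subphrase before hashing, for instance by dropping the skip markers and sorting the remaining words:

```python
SKIP = "<skip>"

def unordered_feature(subphrase: str) -> str:
    """Canonicalize a subphrase so word order no longer matters: drop the
    skip markers and sort the remaining words before hashing."""
    return " ".join(sorted(w for w in subphrase.split() if w != SKIP))

print(unordered_feature("An <skip> offer for an"))   # -> "An an for offer"
```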
CRM114 originally totaled up the number of spammy phrases and the number of innocent phrases and compared the two totals to see which category had more matching tokens. In November 2002, CRM114 was reworked around Graham’s Bayesian approach, and now a similar decision matrix is formed using what is called the Bayesian chain rule, or BCR. The BCR half of what is now known as SBPH/BCR is not a tokenizer but a decision-matrix and processing approach. Yerazunis has also made some performance optimizations to keep SBPH/BCR acceptably fast; for example, files are mapped directly into virtual memory to reduce overhead. As described by Yerazunis,
Each of the css files is mapped into virtual memory. The default size of a css file is one megabyte plus one byte, and each byte of a css file is used as a single 8-bit unsigned integer. Using the length of the css file as a modulus, each superhash value maps into a particular byte of the css file. Each css file also has a “score,” initialized to zero.
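A small sketch of that mapping follows; the hash and the saturating counter update are simplified stand-ins for CRM114’s actual superhash and training logic:

```python
import zlib

CSS_SIZE = 1_048_577          # one megabyte plus one byte, per the description above
css = bytearray(CSS_SIZE)     # each byte acts as an 8-bit unsigned counter

def bucket_for(phrase: str) -> int:
    """Use the file length as a modulus to map a phrase hash to one byte.
    zlib.crc32 stands in for CRM114's superhash."""
    return zlib.crc32(phrase.encode("utf-8")) % CSS_SIZE

def bump(phrase: str) -> None:
    """Increment the counter for a phrase, saturating at 255 (an assumption
    made here simply to keep the 8-bit counter from wrapping)."""
    i = bucket_for(phrase)
    if css[i] < 255:
        css[i] += 1

bump("An <skip> offer for an")
```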
Another approach used to increase performance and decrease disk space is TOE-mode training. Tokens are trained only when a classification error has occurred, so instead of having to learn an entire corpus of phrases (which would consume an inordinate amount of disk space), the filter learns only what is necessary for proper classification.
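The idea fits in a few lines; classify and learn below are hypothetical callables standing in for the filter’s real classification and training routines:

```python
def train_on_error(message, true_label, classify, learn):
    """TOE-mode training: update the database only when the classifier
    misclassifies this message, instead of learning every message."""
    predicted = classify(message)
    if predicted != true_label:
        learn(message, true_label)
    return predicted
```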
Supporting Data
Yerazunis performed some of his own testing with SBPH and presented his results to the MIT Spam Conference in 2004. His results compared standard SBPH with other tokenizing approaches, measured in errors per 5,000. We’ll revisit these totals in Chapter 12 when we discuss Markovian discrimination.
Evaluator | Errors per 5,000
---|---
Pure Bayesian matching | 92
Peak window value | 80
Token sequence sensitive | 78
Token grab bag | 71
Sparse binary polynomial hash | 69
One of the things Yerazunis noticed in these results is that the token grab bag approach, in which token ordering isn’t preserved, scored a lower error rate than standard token sequencing. Combining a token grab bag approach with SBPH may further improve accuracy and is, in fact, part of the approach used in Karnaugh mapping, discussed next. The SBPH approach showed a significant improvement in accuracy, 11 fewer errors per 5,000, compared with Graham’s approach, referenced as the “peak window value” in the table.
Summary
SBPH is extremely effective at identifying both concepts and advanced lexical patterns. Its overhead, while greater than that of chained tokens, is still very manageable and well suited to small- to medium-sized classification projects. Overall, SBPH is a highly effective tokenization approach that will likely find its way into many language classifiers in the future.