Sparse Binary Polynomial Hashing
Bill Yerazunis originally introduced the concept known as SBPH, or sparse binary polynomial hashing. SBPH is an approach to tokenization using word pairs and phrases. If it wasn’t so effective at what it does, it would probably be a terrible idea—but Yerazunis has repeatedly astonished the spam-filtering community with the leaps in accuracy made by SBPH tokenization. Graham refers to SBPH with the same mixed feelings regarding its ingenuity and need for medication:
Another project I heard about . . . was Bill Yerazunis’ CRM114. This is the counterexample to the design principle I just mentioned. It’s a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that’s what it’s doing.
SBPH tokenizes entire phrases, up to five tokens across, and allows for word skipping in between. It led the way in accuracy for a long period of time, but it also generates an enormous amount of data, which is one reason it presently functions only in a train-on-error environment. SBPH provides the benefit of using the simplest, most colloquial tokens while also giving special notice to more complex tokens, which are usually much stronger indicators of spam when they appear.

A few filters, such as CRM114, perform this type of word skipping, which will tokenize an obfuscation such as "manh+<!rescind>+ood" literally and also help the filter "see" the original token by skipping over the inserted material: "manh+ood." Since tokenization is an imperfect process, approaches like this generally provide more machine-readable tokens to work with, without requiring much additional effort. The more permutations of machine-readable tokens are created, however, the larger and more spread out the dataset becomes, possibly affecting accuracy. The amount of data generated by SBPH turns many filter authors away from it in favor of simpler functions such as HTML comment filtering. We'll cover SBPH in more depth in Chapter 11.
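The windowing described above can be sketched as follows. This is an illustrative simplification, not CRM114's actual implementation: it slides a five-token window over the token stream and, for each window, emits every subsequence that keeps the newest token, marking skipped positions with a placeholder (the `<skip>` marker and function name are my own, not from the source).

```python
def sbph_features(tokens, window=5, skip_mark="<skip>"):
    """Sketch of SBPH-style feature generation: for each position,
    take the trailing window of up to `window` tokens and emit all
    subsequences that include the newest token, with skipped words
    replaced by a placeholder."""
    features = []
    for end in range(len(tokens)):
        win = tokens[max(0, end - window + 1):end + 1]
        n = len(win) - 1  # positions before the newest token
        # Each bitmask selects which earlier tokens to keep (2^n variants).
        for mask in range(2 ** n):
            phrase = [win[i] if mask & (1 << i) else skip_mark
                      for i in range(n)]
            phrase.append(win[-1])  # newest token is always kept
            features.append(" ".join(phrase))
    return features
```

For the two-token input `["click", "here"]` this yields `"click"`, `"<skip> here"`, and `"click here"`; with a full five-token window, each position contributes sixteen phrase variants, which is exactly why the dataset grows so quickly.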