Final Thoughts
We’ve run the gamut of approaches to tokenizing in this chapter. We’ll learn more about tokenizing phrases in Chapter 11, and in Chapter 13 we’ll cover another type of prefilter, one that despeckles the noise inherent in tokens. Tokenizing strives to define content by identifying its structure and, more importantly, its root components. This is a noble quest, but, as with other areas of machine learning, it is a task that may eventually be better left to the computer. As new types of neural decision-making algorithms surface, the analysis of unformatted text may become one of the next frontiers of AI. Until then, tokenizing remains one of the few heuristic components of a statistical spam filter. It should therefore be respected and kept somewhat simple, so that it requires little or no maintenance in the years to come.
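To make the point about simplicity concrete, here is a minimal sketch of the kind of tokenizer a statistical filter might use. It is an illustration under our own assumptions, not the implementation of any particular filter; the tokenize function and the DELIMITERS set are hypothetical choices made for this example.

import re

# A deliberately small, conservative delimiter set: whitespace plus a
# few punctuation characters. Chosen here purely for illustration.
DELIMITERS = r"[\s<>()\[\]{}\"',;]+"

def tokenize(message: str) -> list[str]:
    """Split a message into its root components (tokens).

    Kept intentionally simple: split on the delimiter set, drop
    empty strings, and lowercase for consistent counting.
    """
    return [tok.lower() for tok in re.split(DELIMITERS, message) if tok]

# Each token would then be handed to the filter's statistical engine.
print(tokenize('Buy <b>cheap</b> "meds" now!!!'))
# ['buy', 'b', 'cheap', '/b', 'meds', 'now!!!']

A tokenizer this plain will never be optimal, but because it has so few moving parts, there is correspondingly little to tune, break, or revisit later, which is exactly the maintenance argument made above.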