Final Thoughts
In this chapter, we’ve looked at three very different approaches to advanced tokenization. All three attempt to identify lexical patterns and the specific characteristics we know as concepts. Each approach adds a different amount of data to the dataset and requires a different amount of resources. Depending on the complexity of the filter and the other algorithms used to perform language classification, one type of tokenization may be more appropriate than another.

All filters should consider implementing at least one form of concept identification. Primitive tokenizers have proven very effective and resource friendly, but as new types of messages are crafted specifically to evade filters, it is becoming increasingly necessary to identify not only the components of a message but also the concepts within it. The advanced features this type of tokenization provides (such as HTML classification and grammatical analysis) bring an entirely new level of accuracy to modern-day filters, without necessarily requiring a significant increase in resources.

Concept learning is the foundation of true machine learning; the philosophical question, “What is content?” requires that concept identification be an important part of the answer.
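To make the distinction concrete, here is a minimal sketch (not any particular filter’s implementation; the function names and concept labels are invented for illustration) of the difference between a primitive tokenizer, which emits only the literal words of a message, and a concept-aware tokenizer, which additionally emits synthetic tokens for lexical patterns such as HTML markup:

```python
import re

def primitive_tokenize(message):
    """A primitive tokenizer: split a message into lowercase word tokens only."""
    return re.findall(r"[a-z0-9']+", message.lower())

def concept_tokenize(message):
    """A concept-aware tokenizer: emit the word tokens plus synthetic
    'concept' tokens for notable lexical patterns in the raw message."""
    tokens = primitive_tokenize(message)
    # Concept: HTML classification -- record each tag name that appears
    for tag in re.findall(r"<\s*/?\s*([a-z]+)", message.lower()):
        tokens.append("HTML:" + tag)
    # Concept: shouting -- a run of three or more capital letters
    if re.search(r"\b[A-Z]{3,}\b", message):
        tokens.append("CONCEPT:shouting")
    return tokens

msg = 'Click <a href="http://example.com">HERE</a> now'
print(concept_tokenize(msg))
```

The synthetic tokens (`HTML:a`, `CONCEPT:shouting`) cost little to compute yet give the classifier features that survive even when the spammer rewrites the literal words, which is precisely why concept identification pays off against evasion-crafted messages.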