Chapter 11: Concept Identification—Advanced Tokenization
Overview
In Chapter 6, we discussed basic tokenization theory. Primitive tokenization allows us to identify messages based on the individual characteristics of each word, number, or other small component of a message. Content, however, is more than just individual words like “free”; it is also a collection of concepts, like “free call” or “free software.” Statistical filters that perform the advanced tokenization discussed in this chapter can identify not only raw content but also individual concepts, the true goal of concept learning. Instead of identifying a spam as a message that contains the word “free,” implementing some of the concepts in this chapter allows us to identify spam content based on the concept of a “new exciting offer” or “playing lotto.” This approach to tokenization flows through to the rest of the filter and allows Bayesian content filters to act more like Bayesian concept filters—filters that eliminate spam based not just on the content of a message, but on its concepts. Concepts don’t necessarily have to be humanly understandable; they can be lexical concepts, such as grammatical tense, word construction, and HTML generation.

Let’s go back to the notion of concept learning for a moment. Suppose that we have some children sitting at a table with someone who is showing them pictures of sports cars and sedans. They’re not only learning the concepts of what a sports car and a sedan are; they’re also learning the individual concepts that make up the larger concepts. A Bayesian content filter using primitive tokenization learns only specific characteristics, such as “four round objects” and “rectangle extending out of top subsection,” whereas true concept learning would identify concepts such as “fat tires” and “engine coming out of hood.” Identifying concepts is truly where content filtering approaches a new level of AI, and that is what this chapter is all about.
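To make the difference concrete, here is a minimal sketch (in Python, with a hypothetical `tokenize_concepts` function) of how a tokenizer might emit adjacent word pairs as "concept" tokens in addition to single-word tokens; the exact pairing and normalization rules are an assumption for illustration, not the specific implementation discussed later in this chapter:

```python
import re

def tokenize_concepts(message):
    """Emit single-word tokens plus adjacent word pairs.

    A primitive tokenizer stops at the single words; pairing
    neighboring words lets the filter also learn concepts like
    "free call" rather than just "free" and "call".
    """
    # Lowercasing and this word pattern are simplifying assumptions.
    words = re.findall(r"[a-z0-9$]+", message.lower())
    tokens = list(words)
    for i in range(len(words) - 1):
        tokens.append(words[i] + " " + words[i + 1])
    return tokens

# tokenize_concepts("Claim your FREE call today") yields the five
# single words plus pairs such as "free call" and "call today".
```

Each paired token then receives its own probability in the filter's dataset, so "free call" can accumulate spam evidence independently of the innocent uses of "free" or "call" on their own.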