Basic Delimiters
Besides deciding how best to break apart a message, there are many other issues to consider when tokenizing. For example, we need to determine what constitutes a delimiter (token separator) and what constitutes a constituent character (part of the token). Do we break apart some pieces of a message differently than others? What data do we ignore (if any)? The fundamental goal of tokenization is to separate and identify specific features of a text sample. This starts with separating the message into smaller components, which are usually plain old words. So our first delimiter would be a space, since spaces commonly separate words in most languages. This makes it very easy to tokenize a phrase like the following:
For A Confidential Phone Interview, Please Complete Form & Submit.
which can be broken up into the following words:
For
A
Confidential
Phone
Interview,
Please
Complete
Form
&
Submit.
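A whitespace-only split produces exactly this list. Here is a minimal sketch in Python; the sample message is just the phrase above, and nothing else is assumed:

# Minimal sketch: whitespace-only tokenization.
# Splitting on spaces leaves punctuation attached to its word.
message = "For A Confidential Phone Interview, Please Complete Form & Submit."
tokens = message.split()
print(tokens)
# ['For', 'A', 'Confidential', 'Phone', 'Interview,', 'Please',
#  'Complete', 'Form', '&', 'Submit.']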
As we’ve learned, each word is typically assigned one of two primary dispositions: spam or nonspam. This simple approach covers a lot of text, but it leaves us with a few punctuation issues. For example, is the word “submit” on its own likely to have a different disposition than “submit.” with a period after it? How about “interview” versus “interview,” with a trailing comma? In these cases, it makes sense to add some types of punctuation to the set of delimiters, as punctuation suggests a break in most languages. The following are some widely accepted punctuation delimiters (a short tokenizer sketch follows the list):
period (.)
comma (,)
semicolon (;)
quotation marks (“)
colon (:)
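One way to fold these delimiters into the tokenizer is a regular-expression split, sketched below; the character class simply mirrors the list above and is not meant as a definitive rule set:

import re

# Sketch: split on whitespace plus the punctuation delimiters listed above.
# The delimiter set here is illustrative, not definitive.
DELIMITERS = r'[\s.,;:"]+'

def tokenize(message):
    # Split on any run of delimiter characters and drop empty pieces.
    return [tok for tok in re.split(DELIMITERS, message) if tok]

print(tokenize('For A Confidential Phone Interview, Please Complete Form & Submit.'))
# ['For', 'A', 'Confidential', 'Phone', 'Interview', 'Please',
#  'Complete', 'Form', '&', 'Submit']

Notice that “Interview,” and “Submit.” now collapse into the same tokens as their bare forms, which is exactly the behavior the punctuation questions above were probing.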
Some other punctuation, such as the question mark, is a bit more controversial. Some authors believe that “warts” and “warts?” should be treated the same, in most cases as spammy tokens. Including too much punctuation in the makeup of tokens could result in five or ten different permutations of a single word in the database, which rapidly dilutes the usefulness of each one. On the other hand, stripping away too much detail can leave tokens so common to both classes of email that they become uninteresting. The trick is to end up with tokens that stick out in one particular corpus. If there were 100 spams about warts in the user’s corpus but only one posing a question in which “warts?” was used, the filter would likely overlook this feature in that one message.
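To make the dilution concrete, here is a small illustration; the token counts are invented purely for this example, and treating “?” as a delimiter is modeled by simply stripping it:

from collections import Counter

# Hypothetical spam-corpus tokens: '?' kept as a constituent character,
# so the questioning form becomes its own rarely seen token.
spam_tokens = ["warts"] * 100 + ["warts?"]
print(Counter(spam_tokens))   # Counter({'warts': 100, 'warts?': 1})

# Treating '?' as a delimiter merges the variants, so the lone
# questioning message still contributes to one strong feature.
merged = Counter(tok.rstrip("?") for tok in spam_tokens)
print(merged)                 # Counter({'warts': 101})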
Note: I’ve found that treating a question mark as a delimiter results in slightly better accuracy (on the order of a few messages) in my corpus testing than treating it as a constituent character. This could well change in the future, however.