The tokenization methods discussed thus far have covered only standard character sets, but foreign languages will eventually require a solution. Most spam filters simply replace wide characters with a placeholder, such as the letter “z” or an asterisk. This lets the filter catch just about any message written in a wide character set. Some users, however, expect to receive email from people who write in such a language, and for them this approach won’t work well at all: the filter can classify only on header data, because the rest of the body will look (to the filter) like this:
ZZZZZ, ZZ ZZZZ ZZZ ZZZZZZZ ZZZ ZZZ Z ZZZZZZ Z ZZZZZZ ZZZZ Z ZZZ ZZZZ ZZZZZZZ ZZ ZZZZZZ ZZ ZZZ ZZZZZZZ ZZ, ZZZZZZZZ
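The placeholder substitution that produces output like the line above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular filter: the function name `mask_wide_characters` is hypothetical, and treating every character outside the ASCII range as "wide" is a simplifying assumption.

```python
def mask_wide_characters(text: str, placeholder: str = "Z") -> str:
    """Replace every non-ASCII character with a placeholder.

    Assumption: "wide" is approximated as "code point above 127".
    ASCII spaces and punctuation pass through unchanged.
    """
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


# Each Cyrillic letter collapses to the same placeholder,
# so only word lengths and ASCII punctuation survive.
masked = mask_wide_characters("Привет, мир")
tokens = masked.split()
```

After masking, two different words of the same length become the same token, which is why the body carries almost no information for users who legitimately receive mail in that language.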
Some filters implement internationalization (i18n), which lets them support some additional languages. To make matters more complicated, however, some languages don’t use whitespace, which makes it very difficult to identify words at all. This commonly calls for more advanced solutions such as variable-length nGrams, which we’ll discuss in Chapter 11.
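To give a rough sense of the idea ahead of that discussion, here is a minimal sketch of variable-length character n-grams applied to text with no whitespace. The function name `char_ngrams` and the default length range are assumptions made for illustration; the approach covered in Chapter 11 may differ in its details.

```python
def char_ngrams(text: str, min_n: int = 1, max_n: int = 3) -> list[str]:
    """Return every substring of length min_n through max_n.

    Illustrative only: slides a window of each length across the text,
    so no word boundaries are needed.
    """
    tokens = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            tokens.append(text[i:i + n])
    return tokens


# A five-character string with no spaces yields four 2-grams
# and three 3-grams.
grams = char_ngrams("迷惑メール", min_n=2, max_n=3)
```

Because the window slides one character at a time, any real word in the text is guaranteed to appear among the tokens as long as its length falls within the chosen range, at the cost of generating many tokens that are not words at all.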