HTML Tokenization
One decision that has plagued many filter authors is which HTML to include and which parts of the message to ignore. For example, should we ignore JavaScript? What about font tags? Most filters pay attention to all HTML tags except those on an exclusionary list, namely, a specific set of tokens that are common to all types of email. This approach works quite well, but there’s still room for improvement. Ignoring data is always something to be concerned about, and you shouldn’t do it unless you have good reason. The justification for ignoring some HTML data is that many people normally converse only with senders who do not use HTML. For such a user, a filter could learn to treat HTML itself as a sign of spam and reject any type of message with embedded HTML, which could be bad for the recipient if their boss suddenly started using an HTML-enabled mail client.
The tags most filters ignore include the following:
td
!doctype
blockquote
table
tr
div
p
body
short tags, with fewer than N characters of content
tags whose content contains no spaces
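The exclusion rules above can be sketched in a few lines of Python. This is only an illustration, not any particular filter's implementation: the names IGNORED_TAGS, MIN_TAG_LENGTH, keep_tag, and kept_tags are hypothetical, the value 8 is an arbitrary stand-in for the unspecified N, and "content" is assumed to mean the text between the angle brackets.

```python
import re

# Tag names taken from the list above; matching is case-insensitive.
IGNORED_TAGS = {"td", "!doctype", "blockquote", "table", "tr", "div", "p", "body"}

# Hypothetical threshold standing in for the unspecified "N".
MIN_TAG_LENGTH = 8

# Captures whatever sits between "<" and ">".
TAG_RE = re.compile(r"<([^<>]+)>")


def keep_tag(content: str, min_length: int = MIN_TAG_LENGTH) -> bool:
    """Return True if a tag's content should be passed on to the tokenizer."""
    content = content.strip()
    if not content:
        return False
    parts = content.split()
    # Treat closing tags ("/td") the same as their opening form.
    name = parts[0].lstrip("/").lower()
    if name in IGNORED_TAGS:
        return False  # on the exclusionary list
    if len(content) < min_length:
        return False  # short tag
    if len(parts) < 2:
        return False  # no spaces, so no attributes to learn from
    return True


def kept_tags(html: str) -> list[str]:
    """Return the content of every tag that survives the exclusion rules."""
    return [m.group(1).strip() for m in TAG_RE.finditer(html) if keep_tag(m.group(1))]


sample = '<table><tr><td width="100%"><font color="#FF0000">Hi</font></td></tr></table>'
print(kept_tags(sample))  # ['font color="#FF0000"']
```

Because the list is exclusionary, a tag the author never thought of (an unfamiliar tag with attributes, say) passes through by default and can be learned from.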
It is probably better to use an exclusionary list rather than an inclusionary one. With either kind of list, you’re likely to miss a few tags or fail to name certain tags you never thought could be used in spam (for example, the object tag has recently become popular). If this happens with an exclusionary list, at worst the tag will sit and collect dust in the dataset with some neutral value or will fill up a decision matrix slot in error. If you fail to add a tag to an inclusive list, though, you’re bound to ignore an important data point and may not even realize it.
Some of the HTML tags commonly used by spammers (which a filter should definitely be looking at) include the following:
APPLET | BGSOUND | FRAME | IFRAME
ILAYER | IMG | INPUT | LAYER
LINK | SCRIPT | A | AREA
BASE | DIV | SPAN | OBJECT
FONT | BODY | META
Some filters like to mark the tokens generated from HTML tags with an “HTML” identifier, while others go so far as to mark the particular tag the text belonged to (for example, “BODY:BGCOLOR=#FFFFFF”).
Regardless of which tags the filter keeps and which it discards, it’s very important to handle HTML comments correctly. Spammers use many tricks to obfuscate their text so that it’s human readable but not very machine readable. For example, the following message looks like a complete mess in its machine-readable form:
Received: from 64.202.131.2 (h0007e9075130.ne.client2.attbi.com
[24.218.222.43])
Message-ID: <cp6-mh-rn-w$4pa2o965rl84@jn4y0hq1bcy>
From: "patsy stamm" <arthropathology71255@earthlink.net>
Reply-To: "patsy stamm" <arthropathology71255@earthlink.net>
Subject: Giving this to you
Date: Fri, 08 Aug 03 07:29:02 GMT
X-Mailer: MIME-tools 5.503 (Entity 5.501)
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="AD0E55.76_15.C"
X-Priority: 3
X-MSMail-Priority: Normal
--AD0E55.76_15.C
Content-Type: text/html;
Content-Transfer-Encoding: quoted-printable
Yes you he<!lansing>ard about th<!crossbill>ese weird <!cottony>little
pil<!domesday>ls
that are suppo<!=anabel>sed to make you bigger and of cou<!chord>rse you think
they're b<!soften>ogus snake potion. Well, let's look at the facts:
<strong>G<!eigenspace>RX2
has be<!waldron>en sold over 1.9 Mill<!audacity>ion times within the last 18
months</strong>...
With awe<!tapestry>some results for hun<!wield>dreds of thous<!locale>ands of
men all over the planet! They all enjoy a seriously enhanced version of their
manh<!rescind>ood and <b>why shou<!seoul>ldn't you</b>?
But when the user clicks the message to read it, the HTML comments won’t be visible, and the user will see this:
Yes you heard about these weird little pills
that are supposed to make you bigger and of course you think
they're bogus snake potion. Well, let's look at the facts: GRX2
has been sold over 1.9 Million times within the last 18 months...
With awesome results for hundreds of thousands of men all over the planet!
They all enjoy a seriously enhanced version of their manhood and why shouldn't
you?
A simple way to ensure that the message is tokenized correctly is to remove the HTML comments and reassemble the message.
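A minimal sketch of that step in Python follows. It assumes the body has already been decoded from its transfer encoding, and the names COMMENT_RE and strip_html_comments are hypothetical. The pattern removes standard comments (<!-- ... -->) as well as the nonstandard "<!word>" form used in the sample message, since mail clients hide both. It would also remove a <!doctype ...> declaration, which is on the ignore list anyway.

```python
import re

# Matches standard comments (<!-- ... -->) and the nonstandard "<!word>"
# form seen in the sample message. DOTALL lets a comment span lines.
COMMENT_RE = re.compile(r"<!--.*?-->|<![^>]*>", re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Delete comments so that the split words are joined back together."""
    return COMMENT_RE.sub("", html)


raw = "Yes you he<!lansing>ard about th<!crossbill>ese weird <!cottony>little\npil<!domesday>ls"
print(strip_html_comments(raw))
# Yes you heard about these weird little
# pills
```

The comments are deleted rather than replaced with a space; replacing them would leave "he ard" and "pil ls" as separate tokens, which is exactly the outcome the spammer wants.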