Message Body Encodings
Message body encodings convert the body of a message into a more transport-friendly format. They are normally used to encode messages with alternate character sets or binary attachments, which contain nonprintable characters. Spammers commonly use these types of encodings to hide their messages from now-obsolete spam filters. Let’s take a look at an example of an encoded message. The filter sees:
Reply-To: <health204580m43@mail.com>
From: <health204580m43@mail.com>
Subject: Penile enlargement method - guaranteed !
Date: Thu, 22 Aug 0102 12:07:35 +0800
MIME-Version: 1.0
X-Priority: 3 (Normal)
X-Msmail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
Importance: Normal
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: base64
PGh0bWw+PGJvZHk+PGRpdiBpZD0ibWVzc2FnZUJvZHkiPjxkaXY+PGZvbnQg
ZmFjZT0iQXJpYWwiIHNpemU9IjIiPlRoaXMgbWVzc2FnZSBpcyBzZW50IHRv
IG91ciBzdWJzY3JpYmVycyBvbmx5LiBGdXJ0aGVyIGVtYWlscyB0byB5b3Ug
YnkgdGhlIHNlbmRlciB0aGlzIG9uZSB3aWxsIGJlIHN1c3BlbmRlZCBhdCBu
byBjb3N0IHRvIHlvdS4gU2NyZWVuaW5nIG9mIGFkZHJlc3NlcyBoYXMgYmVl
biBkb25lIHRvIHRoZSBiZXN0IG9mIG91ciBhYmlsaXR5LCB1bmZvcnR1bmF0
ZWx5IGl0IGlzIGltcG9zc2libGUgdG8gYmUgMTAwJSBhY2N1cmF0ZSwgc28g
aWYgeW91IGRpZCBub3QgYXNrIGZvciB0aGlzLCBvciB3aXNoIHRvIGJlIGV4
Y2x1ZGVkIG9mIHRoaXMgbGlzdCwgcGxlYXNlIGNsaWNrIDxhIGhyZWY9Im1h
aWx0bzpoZWFsdGgxMDVAbWFpbC5ydT9zdWJqZWN0PXJlbW92ZSIgdGFyZ2V0
PSJuZXdfd2luIj5oZXJlPC9hPjwvZm9udD48L2Rpdj4gIDxwPjxiPjxmb250
and so on.
Not very much to look at, is it? If we fed this message directly to our tokenizer, we’d end up with a few headers’ worth of information and gibberish for the rest.
This type of encoding is known as Base64 encoding. Its legitimate purposes include encoding binary attachments, such as pictures and files, being sent via email. When the user’s mail client receives the message, it is decoded, and the recipient sees the following.
This message is sent to our subscribers only. Further emails to you by the
sender this one will be suspended at no cost to you. Screening of addresses
has been done to the best of our ability, unfortunately it is impossible to be
100% accurate, so if you did not ask for this, or wish to be excluded of this
list, please click here
THIS IS FOR ADULT MEN ONLY ! IF YOU ARE NOT AN ADULT, DELETE NOW !
We are a serious company, offering a program that will enhance your sex life,
and enlarge your penis in a totally natural way.
and so on.
Had we processed this message without first decoding it, we would have missed every bit of the message’s content, which turned out to be spam. That would have left only the message headers to base our filter’s decision on, which, of course, would result in a high error rate.
The message header “Content-Transfer-Encoding” identifies the encoding used for a particular message (or part of a message). This field can be present in an email’s top-level headers or in a particular part of a multipart document.
Six different encodings can be specified with this field, and all are case insensitive. The first three, 7bit, 8bit, and binary, aren’t nearly as complicated as the remaining encodings, as they don’t adversely affect the format of the data.
7bit encoding means that the data is all represented as short lines of ASCII data. Only characters from the US-ASCII character set will be present in the message.
8bit encoding means that the lines are short, but they may contain non-ASCII (8-bit) characters, such as those from extended or Unicode character sets.
Binary encoding means that not only may non-ASCII characters be present, but also that the lines are not necessarily short enough for SMTP transport.
The big difference between 8bit and binary encoding is that binary doesn’t require the lines to be trimmed to a specific length. If the message is going to contain any type of data outside of the standard US-ASCII character set, one of these encodings should be used. That is not to say, however, that spammers will conform to RFC standards.
The remaining three types of encoding (quoted-printable, Base64, and custom encoding) can dramatically alter the data for transport. When a message is received in one of these encodings, the filter usually needs to perform some level of decoding before the data will be useful enough for the tokenizer.
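As a rough illustration of that decoding step, the sketch below uses Python's standard email package rather than any particular filter's own parser. The helper names transfer_encoding and decoded_text and the sample message are hypothetical, chosen only for this example; the Base64 line is the first body line of the message shown earlier.

```python
import email

# Hypothetical sample message. The encoding name is upper-cased on purpose
# to show that the header value must be compared case-insensitively.
RAW = (
    "Subject: example\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: BASE64\n"
    "\n"
    "PGh0bWw+PGJvZHk+PGRpdiBpZD0ibWVzc2FnZUJvZHkiPjxkaXY+PGZvbnQg\n"
)


def transfer_encoding(part):
    """Return the declared transfer encoding, lower-cased (default: 7bit)."""
    return str(part.get("Content-Transfer-Encoding", "7bit")).strip().lower()


def decoded_text(part):
    """Undo Base64 or quoted-printable and return text for a tokenizer."""
    raw_bytes = part.get_payload(decode=True)  # None for multipart containers
    if raw_bytes is None:
        return ""
    charset = part.get_content_charset() or "us-ascii"
    try:
        return raw_bytes.decode(charset, errors="replace")
    except LookupError:
        # Spam often declares a bogus charset; latin-1 can decode any byte.
        return raw_bytes.decode("latin-1")


msg = email.message_from_string(RAW)
print(transfer_encoding(msg))
print(decoded_text(msg))
```

For a multipart message, the same two helpers would be applied to each part returned by msg.walk(), since every part can carry its own Content-Transfer-Encoding header.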
Quoted-Printable Encoding
Quoted-printable encoding was designed for encoding message components that consist primarily of human-readable ASCII characters. The encoding algorithm encodes certain nonprintable characters, such as carriage returns and special characters, leaving most of the readable portion of the message intact (but possibly broken up). The RFC specification for this encoding lists five basic rules that are used to encode the message. The full specification can be found in RFC 2045.
Note: The RFC contains a few other guidelines and plenty of other useful information about this type of encoding. You can read RFC 2045 in its entirety online at http://www.ietf.org/rfc/rfc2045.txt.
The typical quoted-printable encoded message looks very similar to a plain ASCII message, but the differences are apparent after a closer examination of some of the characters.
From: "MR.DOUGLAS
AND PRINCESS M." <douglassmith2004@yahoo.co.uk>
Reply-To: princessmar001@yahoo.com
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Subject: [SA] URGENT HELP..............
Date: Mon, 5 Apr 1999 20:38:02 +0100
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
DEAR SIR=2C
URGENT AND CONFIDENTIAL=3A
Re=3ATransfer of $50=2C000=2E000=2E00 USD=5BFIFTY MILLION UNITED STATES
DOLLARS=5D=2E
WE WANT TO TRANSFER TO OVERSEAS=5B$50=2C000=2E000=2E00=5BFIFTY =
MILLION UNITED STATES DOLLARS=5DFROM A SECURITY COMPANY =
IN SPAIN=2CI WANT TO ASK YOU TO QUIETLY LOOK FOR A =
RELIABLE AND HONEST PERSON WHO WILL BE CAPABLE AND FIT =
. . . . .
and so on.
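The two features visible in this sample are the =XX escapes (an equals sign followed by the character's two-digit hexadecimal code) and the soft line break (an equals sign at the very end of a line, which joins it to the next line). A minimal sketch of decoding a few of the lines above with Python's standard quopri module follows; the variable names are illustrative only.

```python
import quopri

# A few lines from the sample message, with "=" soft line breaks at line ends.
encoded = (
    b"DEAR SIR=2C\n"
    b"URGENT AND CONFIDENTIAL=3A\n"
    b"STATES DOLLARS=5DFROM A SECURITY COMPANY =\n"
    b"IN SPAIN=2CI WANT TO ASK YOU\n"
)

# =2C becomes ",", =3A becomes ":", =5D becomes "]";
# a trailing "=" is dropped and the following line is joined to this one.
decoded = quopri.decodestring(encoded)
print(decoded.decode("us-ascii"))
```

After decoding, the comma, colon, and bracket tokens are restored, and the sentence split by the soft line break is whole again, which is what a tokenizer needs to see.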
Base64 Encoding
Base64 encoding was originally intended as a means of encoding binary data such as music and graphics. In fact, it’s used for any type of data that isn’t in human-readable form. Because the encoding represents every 3 bytes of input (each of which can take any of the 256 possible values) as 4 printable ASCII characters, the resulting encoded data is around 33 percent larger than the original file, and slightly more once line breaks are added.
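The size overhead can be checked with a few lines of Python using the standard base64 module. This is only a sketch; the test data (every possible byte value, repeated three times) is an arbitrary choice made for the example.

```python
import base64

# 768 bytes covering every one of the 256 possible byte values.
raw = bytes(range(256)) * 3

encoded = base64.b64encode(raw)        # one long line, no line breaks
wrapped = base64.encodebytes(raw)      # MIME-style: wrapped into short lines

# Every 3 input bytes become 4 output characters.
print(len(raw), len(encoded), len(encoded) / len(raw))

# The output contains only printable ASCII characters.
print(all(0x20 < byte < 0x7F for byte in encoded))

# Line wrapping adds a little more on top of the 4/3 ratio.
print(len(wrapped) > len(encoded))
```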
The Base64 algorithm is quite simple but detailed. You can read the complete description in RFC 2045. We’ve already seen an example of a Base64-encoded message earlier in this chapter. Note how little useful information can be drawn directly from the message without decoding it. This is one of the many dirty spammer tricks we’ll cover in Chapter 7. Base64 encoding is generally used to encode attachments, and so an email with a Base64-encoded body is generally very suspicious—most likely spam in fact (but don’t count on that). Many heuristic filters will even go so far as to automatically drop messages with a Base64-encoded body. Eventually, mail servers will get smart enough to reject any mail with a Base64-encoded body, forcing the few legitimate users of this practice to conform. Until that happens, it’s necessary to spend the processor cycles decoding the message body just to be certain it’s spam.
Custom Encodings
Custom encodings are those determined by the implementer. They are generally reserved for future expansion or for situations in which two proprietary applications require a different type of encoding to communicate. The use of custom encodings is usually frowned upon, as no other mail client would know how to decode the message unless it had been explicitly written to handle the custom encoding type. Nevertheless, one can use a token prefixed with “x-” to specify a custom type, for example, x-myencoding.
If you’re developing applications that use Internet mail to communicate, consider one of the existing transfer encodings first, and create a custom encoding only as a last resort for extremely sensitive uses.
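A filter has to decide what to do when it meets a token it cannot decode. One possible approach, sketched below, is to sort the header value into one of four groups before choosing a decoder. The function name classify_transfer_encoding and the four labels are hypothetical, invented for this illustration.

```python
# Encodings that leave the data as-is versus those that transform it.
IDENTITY_ENCODINGS = {"7bit", "8bit", "binary"}
TRANSFORM_ENCODINGS = {"quoted-printable", "base64"}


def classify_transfer_encoding(token: str) -> str:
    """Sort a Content-Transfer-Encoding value into a coarse category."""
    name = token.strip().lower()  # encoding names are case insensitive
    if name in IDENTITY_ENCODINGS:
        return "identity"    # body can go straight to the tokenizer
    if name in TRANSFORM_ENCODINGS:
        return "transform"   # body must be decoded first
    if name.startswith("x-") and len(name) > 2:
        return "custom"      # private encoding; the filter cannot decode it
    return "unknown"         # not a standard name and not an x- token


for value in ("7BIT", "Base64", "x-myencoding", "rot13"):
    print(value, classify_transfer_encoding(value))
```

What the filter does with a "custom" or "unknown" result is a policy choice; it could tokenize the raw body anyway, or treat the unusual header itself as a feature.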