HTML Encodings - Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

Jonathan A. Zdziarski


HTML Encodings


Until now, we’ve been discussing content encodings. HTML isn’t considered a content encoding, but there are components within HTML that should be decoded by the filter. Some types of HTML encodings make it more difficult to read text, and some make it easier to identify spam. (We’ll discuss the latter in Chapter 7, as we learn more about spammers’ tricks.) This section will cover the basic components in HTML that should be decoded by a filter.

There are a lot of ways to hide text in HTML. One of the most basic is to use HTML character encoding. Character encoding allows the author to write a character as its numeric code (for example, its ASCII value), which then displays as the actual character when viewed by an HTML-capable mail client. For example, the filter might see:

CALL NOW, IT'S &#70;REE!

but the recipient will see:

CALL NOW, IT'S FREE!

This encoding isn’t particularly useful to the legitimate email sender, but it works wonders for spammers. And because it’s supported in the web browser world, email clients recognize it (at least until mail clients become smart enough to detect this type of abuse). Fortunately, the encoded characters can easily be decoded, after which the message can be tokenized.
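The decoding step described above can be sketched in a few lines of Python. This is only an illustration, not the book's own implementation: it assumes the standard library's `html.unescape` is an acceptable decoder, and the function name `decode_character_references` is hypothetical.

```python
import html


def decode_character_references(text: str) -> str:
    """Replace HTML character references (e.g. &#70;) with the characters they stand for.

    Assumption: html.unescape from the Python standard library does the work;
    a real filter might need stricter or more lenient handling of malformed input.
    """
    return html.unescape(text)


# The text the filter sees, taken from the example above.
encoded = "CALL NOW, IT'S &#70;REE!"

# The text the recipient sees, now ready for tokenizing.
print(decode_character_references(encoded))
```

`html.unescape` also handles hexadecimal references such as `&#x46;` and named references such as `&amp;`, so one pass covers the common variants.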

URL encoding is another type of encoding frequently used in HTML. It allows characters to be written as hexadecimal values in URLs to maintain continuity, that is, to prevent URLs from having spaces and other weird characters. For example, the filter sees:

http%3A//www.somedomain.com/%69%6e%64%65%78%2e%68%74%6d%6c

but when the recipient clicks it, they go to:

http://www.somedomain.com/index.html

Again, not a very complicated encoding, but nevertheless the filter could miss some very guilty data if it doesn’t properly decode it. Encoded chunks of text like this usually contain the guiltiest tokens in an email, which is why they’re hidden in the first place.
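As a rough sketch of how a filter might undo URL encoding before tokenizing, the following uses `urllib.parse.unquote` from the Python standard library. The function name `decode_url` is hypothetical, and the sketch assumes the URL has already been extracted from the message.

```python
from urllib.parse import unquote


def decode_url(url: str) -> str:
    """Convert %XX hexadecimal escapes in a URL back into plain characters.

    Assumption: urllib.parse.unquote is a suitable decoder; it leaves
    characters that are not percent-escaped untouched.
    """
    return unquote(url)


# The URL as it appears to the filter, taken from the example above.
encoded = "http%3A//www.somedomain.com/%69%6e%64%65%78%2e%68%74%6d%6c"

# The URL the recipient actually visits.
print(decode_url(encoded))
```

Decoding before tokenizing means the hidden file name and the domain become ordinary tokens that the filter can score.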

One of the tricky things about HTML encoding is that it can easily be layered underneath a Base64-encoded message part, so not only do you have to first decode the Base64 component, but you must then perform the necessary HTML decoding.
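The two-stage decoding described above might look like the following sketch. It is written under several assumptions: the message part has already been isolated, its character set is known (UTF-8 is used here as a default), and the function name `decode_base64_html_part` is hypothetical rather than taken from any particular filter.

```python
import base64
import html


def decode_base64_html_part(payload: bytes, charset: str = "utf-8") -> str:
    """Decode a Base64 message part, then decode the HTML character references inside it.

    Assumptions: `payload` holds only the Base64 body of one message part,
    and `charset` matches the part's declared character set.
    """
    raw = base64.b64decode(payload)                 # step 1: undo the content encoding
    text = raw.decode(charset, errors="replace")    # bytes to text, tolerating bad bytes
    return html.unescape(text)                      # step 2: undo the HTML encoding


# Build a sample part by Base64-encoding the earlier HTML-encoded example.
sample_part = base64.b64encode(b"CALL NOW, IT'S &#70;REE!")

print(decode_base64_html_part(sample_part))
```

The order matters: the HTML references are not visible until the Base64 layer has been removed, so the content decoding has to run first.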

As you’ll see in Chapter 7, a number of additional optimizations can be made to thwart obfuscation attempts by spammers. The good news is that encodings and obfuscation techniques have a finite number of variations. Mail clients support only the standard types of message encoding, and therefore a spammer can’t simply make up a new encoding.

HTML gets a bit trickier, however, because there are many creative ways to hide text. Fortunately, as of this writing, spammers haven’t been able to find too many new ways to hide text on top of the approaches that have already been counter-programmed by filter authors.
