Header Optimizations - Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification [Electronic resources], text version

Jonathan A. Zdziarski

Header Optimizations


Most filter authors agree that a token in the subject header is very different from a token in the message body, and that the same token appearing in two different headers is distinct enough to warrant tracking each occurrence separately. Header tokens are usually processed differently from body tokens in order to preserve the origin of each token. Let’s look at an example of an email with a lot of useful header information.

From: bazz@xum2.xumx.com 
To: bazz@xum2.xumx.com
Reply-To: mort239o@xum2.xumx.com
Subject: ADV: FREE Mortgage Rate Quote - Save THOUSANDS! kplxl
X-Keywords:

Save thousands by refinancing now. Apply from the privacy of your home and
receive a FREE no-obligation loan quote.
http://211.78.96.11/acct/morquote/
Rates are Down. YOU Win!
Self-Employed or Poor Credit is OK!
Get CASH out or money for Home Improvements, Debt Consolidation and more.
Interest rates are at the lowest point in years-right now! This is the perfect
time for you to get a FREE quote and find out how much you can save!

In the spam shown here, several different tokens stand out. First, if my email address happened to be bazz@xum2.xumx.com, I wouldn’t expect to see it in the From: header, but it would be very normal in the To: header. Seeing my own email address in the From: header would be a clear indicator of spam, since most people don’t send email to themselves unless they’ve had too much to drink.

Second, the word “Save” appears in both the subject line and the message body. I would expect to see it in the message body more frequently in legitimate mail—for example, “Save your files in the blue folder” or “Save me from this dreaded cubicle.” Seeing the word “Save” in the subject header is much more suspicious, though, and it makes sense for me to have a different entry in the dataset for each of them.

The word “FREE” also shows up in both the subject line and message body, but in this case, they’re both very guilty indicators of spam. The filter still benefits here because the tokens “FREE” and “Subject*FREE” now have the ability to take up two slots in my decision matrix, further condemning the spam. Header tokens are extremely useful for identifying both spam and legitimate mail.
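To make the “Subject*FREE” idea concrete, here is a minimal sketch of a tokenizer that prefixes each header token with the name of the header it came from. The function name `tokenize`, the list of tracked headers, and the purely alphanumeric token pattern are assumptions made for illustration; they are not the delimiter set of any particular filter, and the sketch only handles single-part, plain-text messages.

```python
import re
from email import message_from_string

# Hypothetical token pattern: runs of letters and digits only.
WORD = re.compile(r"[A-Za-z0-9]+")

def tokenize(raw, tracked_headers=("From", "To", "Reply-To", "Subject")):
    """Return header tokens prefixed with their header name, then body tokens."""
    msg = message_from_string(raw)
    tokens = []
    for name in tracked_headers:
        # get_all returns every occurrence of the header, or [] if absent.
        for value in msg.get_all(name, []):
            tokens.extend(f"{name}*{word}" for word in WORD.findall(value))
    # Assumes a single-part message, so the payload is one string.
    tokens.extend(WORD.findall(msg.get_payload()))
    return tokens

sample = (
    "From: bazz@xum2.xumx.com\n"
    "Subject: ADV: FREE Mortgage Rate Quote - Save THOUSANDS! kplxl\n"
    "\n"
    "Save thousands by refinancing now and receive a FREE loan quote.\n"
)
tokens = tokenize(sample)
```

With this scheme, “FREE” and “Subject*FREE” are separate entries, as are “Save” and “Subject*Save”, so each can carry its own statistics in the dataset.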

Other types of header tokens are frequently found to be useful, and the set of delimiters used in the headers is usually slightly different from the set used in the message body. For example, if I wanted to catch all of the IP addresses in the Received: headers, I would treat a period as a constituent character (part of the token) instead of a separator. If I wanted to tokenize the message-id, I’d also include the @ sign as a delimiter, as it is used to separate some pieces of the message-id.
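One way to express per-header delimiter sets is a lookup table of token patterns, sketched below. The table `HEADER_PATTERNS`, the function `header_tokens`, and the exact character classes are hypothetical choices for illustration, as are the Received: and Message-Id values in the usage lines; a real filter would tune these against its own mail.

```python
import re

# Hypothetical per-header token patterns.
HEADER_PATTERNS = {
    # Period and hyphen are constituents, so IP addresses and hostnames survive whole.
    "Received": re.compile(r"[A-Za-z0-9.-]+"),
    # '@', angle brackets and whitespace act as delimiters.
    "Message-Id": re.compile(r"[^\s<>@]+"),
}
# Everything else: period is a separator, as in ordinary prose.
DEFAULT_PATTERN = re.compile(r"[A-Za-z0-9]+")

def header_tokens(name, value):
    """Split a header value using the delimiter rules chosen for that header."""
    pattern = HEADER_PATTERNS.get(name, DEFAULT_PATTERN)
    # Trim stray leading or trailing punctuation left on a token.
    pieces = (piece.strip(".-") for piece in pattern.findall(value))
    return [f"{name}*{piece}" for piece in pieces if piece]

received = header_tokens("Received", "from unknown (211.78.96.11) by mx.example.com.")
message_id = header_tokens("Message-Id", "<20040101.abc123@xum2.xumx.com>")
subject = header_tokens("Subject", "Rates are Down.")
```

Here the Received: header yields “Received*211.78.96.11” as a single token, while the message-id is split at the @ sign into its left and right parts.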

Another advantage of including the header as part of the token is that it helps to create a virtual “whitelist” of users you trust. If I exchange a lot of correspondence with bobsmith@somedomain.com, tokens like “From*bobsmith” and “From*somedomain.com” will start to appear in the dataset, usually with very innocent values. This works equally well in identifying the hostnames of trusted mail servers in the Received: header too.
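The whitelist effect can be sketched with a toy dataset that counts how often each token shows up in spam and in legitimate mail. The `Dataset` class, the `from_tokens` helper, and the plain spam-to-total ratio are assumptions for illustration only; the ratio is a simplified stand-in, not the probability calculation a production filter would use.

```python
from collections import defaultdict

def from_tokens(address):
    """Split a sender address into a user token and a domain token."""
    local, _, domain = address.partition("@")
    return [f"From*{local}", f"From*{domain}"]

class Dataset:
    def __init__(self):
        # token -> [spam hits, innocent hits]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, tokens, is_spam):
        # Count each token once per message.
        for token in set(tokens):
            self.counts[token][0 if is_spam else 1] += 1

    def spamminess(self, token):
        """Naive ratio: 0.0 is fully innocent, 1.0 is fully guilty, 0.5 is unknown."""
        spam, innocent = self.counts.get(token, (0, 0))
        total = spam + innocent
        return 0.5 if total == 0 else spam / total

dataset = Dataset()
# Twenty legitimate messages from a trusted correspondent.
for _ in range(20):
    dataset.train(from_tokens("bobsmith@somedomain.com"), is_spam=False)
```

After this training, `dataset.spamminess("From*bobsmith")` is 0.0, while a sender token that has never been seen stays at the neutral 0.5, so mail from the trusted correspondent starts with evidence in its favor.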
