Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

Jonathan A. Zdziarski








Attacks on the Decision Matrix


The decision matrix is another area that spammers frequently attack. The goal is either to flood the matrix with a significant amount of meaningless or legitimate-sounding data, so that it overlooks the more guilty data in the message, or to starve it of the information it needs to make a useful decision.

Image Spams


Image spams are one of the ways in which spammers attempt to starve the decision matrix of information. This type of spam is also used to try to circumvent heuristic-based filters by providing so little data that it's difficult to determine whether the message is spam or not. Unfortunately, a heuristic filter can't simply filter out any message containing images, as many legitimate senders include them. Many legitimate senders also have the annoying habit of sending messages whose entire content is a single image.


The Problem


Image spams frequently provide very little data to work with, but they still contain more than enough for a statistical filter to accurately identify the message. These types of spam usually consist of a small amount of HTML to include a remote image in the message.

X-Sender: offer888.net 
X-Mailid: 3444431.13
Errors-To: report@offer888.net
Complain-To: abuse@offer888.net
From: Trimlife <trimlife@offer888.net>
Subject: Lose Inches and Look Great This Summer!
X-Keywords:
Content-Type: text/html;
<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

<body bgcolor="#FFFFFF" text="#000000" leftmargin="0" topmargin="0"
marginwidth="0" marginheight="0">
<a > <img
src="images/Creative--003.gif"
width="450" height="300" border="0"></a>
</body>
</html>

One of the side effects of this approach is that it gives the spammers an opportunity to create what Graham-Cumming refers to as a feedback loop. Since the image has to be loaded from one of the spammer’s servers, it’s easy to embed a unique identifier into the URL of the image so that the spammer can keep track of who views (or previews) the message. When the user’s email client receives the spam and either highlights it for preview or opens it at the user’s request, the image will be loaded from the server and will most likely confirm to the spammer that the recipient’s address was valid. It also confirms to the spammer that whatever lexical data was used in the spam was successful in evading whatever filter may have been protecting the user’s mailbox, helping them devise a directed attack (discussed later).
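As a sketch of how a filter can turn this against the spammer, the fragment below flags image URLs that carry long unique identifiers and emits an extra token for them. The regular expressions, token names, thresholds, and the example URL are illustrative assumptions, not code from this book.

import re

# Hypothetical heuristic: a long hex or numeric blob inside an image URL is
# treated as a tracking identifier (a web bug).
TRACKING_ID = re.compile(r'[0-9a-f]{16,}|[0-9]{10,}', re.IGNORECASE)
IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)

def web_bug_tokens(html):
    tokens = []
    for url in IMG_SRC.findall(html):
        tokens.append('Url*' + url)           # the URL itself becomes a token
        if TRACKING_ID.search(url):
            tokens.append('Url*tracking-id')  # synthetic marker for the web bug
    return tokens

# A hypothetical per-recipient identifier embedded in the image path:
print(web_bug_tokens('<img src="http://offer888.net/img/3444431f13ab99c2d7.gif">'))
# -> ['Url*http://offer888.net/img/3444431f13ab99c2d7.gif', 'Url*tracking-id']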


The Solution


Many recent versions of email clients have been outfitted with an option to disable the loading of images embedded in HTML. This is a wise move, as it both prevents the spammer's web bug from working and prevents the user from having to look at what could potentially be objectionable material, especially if you happen to be sitting down with one of your kids.

Apart from client-side filtering, spam filters can easily identify these types of messages based on content. It takes a certain amount of HTML to construct a message like this, and many of the frequently guilty tokens include the content headers and different combinations of HTML tags.




















Token          SH       IH       Probability
img+src        00050    00000    0.9999
Url*http       00087    00003    0.8530
img+border     00050    00000    0.9999
Url*image      00051    00001    0.9137

On top of this, web bugs and other types of markers work to our advantage. For example, the presence of my email address in a URL yields a probability of 0.9999. The same goes for unique identifiers, such as long numbers, that can be used to track you over a long period of time. Even the simplest HTML conventions used by the spammers’ generators are a clear indicator of spam. The phrase “Untitled+Document” also yields a probability of 0.9999 in my dataset.
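For reference, the SH and IH columns above are the token's spam hits and innocent hits, from which the per-token probability is derived. A minimal Graham-style calculation is sketched below; the corpus totals and clamping bounds are assumptions for illustration, so it reproduces the flavor of the table rather than its exact values.

def token_probability(spam_hits, innocent_hits, total_spam, total_innocent):
    # Sketch of a Graham-style per-token probability. Tokens with no hits
    # at all fall back to a neutral value.
    if spam_hits + innocent_hits == 0:
        return 0.4
    b = spam_hits / total_spam
    g = (2 * innocent_hits) / total_innocent   # innocent hits counted double
    p = min(1.0, b) / (min(1.0, g) + min(1.0, b))
    return max(0.0001, min(0.9999, p))         # clamp, as in the table above

# "img+src": 50 spam hits, 0 innocent hits in an assumed 1,000/1,000 corpus
print(token_probability(50, 0, total_spam=1000, total_innocent=1000))  # 0.9999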

Essentially, the solution to catching these types of spam is to implement machine learning in your spam filter, which is what this entire book is designed to teach you to do. Additional accuracy can be achieved by using a concept-based tokenizer, which we’ll discuss in Chapter 11. Finally, tokenizing header and URL-specific tokens will provide additional data points to analyze.
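As a rough illustration of that last point, the sketch below prefixes header-derived and URL-derived tokens so they are tracked separately from ordinary body words. The "Url*" naming mirrors the table above; everything else, including the regular expressions and header list, is an assumption rather than this book's actual tokenizer.

import re
from email import message_from_string

def tokenize_with_prefixes(raw_message):
    # Sketch: header and URL tokens get a prefix, so "free" in a Subject line
    # and "free" in the body are learned as separate data points.
    msg = message_from_string(raw_message)
    tokens = []
    for name in ('From', 'Subject', 'Content-Type'):
        for word in re.findall(r'[A-Za-z0-9.@/-]+', msg.get(name, '')):
            tokens.append(f'{name}*{word}')            # e.g. "Subject*Lose"
    body = msg.get_payload() if isinstance(msg.get_payload(), str) else ''
    for url in re.findall(r'(?:href|src)="([^"]+)"', body, flags=re.IGNORECASE):
        for piece in re.split(r'[/:.?&=]+', url):
            if piece:
                tokens.append(f'Url*{piece}')          # e.g. "Url*http", "Url*images"
    tokens += re.findall(r'[A-Za-z$!-]+', body)        # plain body tokens
    return tokens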


The Importance


Image spams starve the language classifier of information, and although there is generally still enough information to classify the message, this technique, combined with additional approaches, could potentially hurt accuracy in some filters that use primitive tokenizers. Implementing concept tokenizers, or taking other steps to counter these approaches, is important to ensure that the filter has enough data to work with. If it doesn't, it could degenerate into an image tag classifier, which could create false positives in some legitimate newsletters and such.

Random Strings of Text


Nobody’s quite certain who initially came up with the concept of injecting random strings of text into emails to confuse filters, but the myth that this approach helps to circumvent statistical filters is about as credible as the lone gunman theory. A typical random-string spam will contain the spam payload and then a series of random pieces of junk text somewhere in the message, usually in an unreadable portion. It was originally thought that these unknown words would help messages score a more innocent probability, since they would include several words the filter knew nothing about. Possibly some of the earliest naive Bayesian filters could have been vulnerable to this approach, but any window of opportunity for it has long since passed.


The Problem


Spammers will embed a long series of random junk text into spams in an attempt to evade filters.

From: "alvin" <swbeetp08@hotmail.com> 
Sender: swbeetp08@hotmail.com
Subject: Never have a hangover again!
bernie isaacPill that cures hangovers!
Content-Type: text/plain;
Come and play at the world’s PREMIPill that cures hangovers!ERE ONLINCure your
hangover with just a pill!E CASCure your hangover with just a pill!INO!
We are happy to offer you, in an elegant atmosphere, a 50% BONUHangover pills
are finally here!S for YOUR FIRST DEPOCure your hangover with just a pill!SIT
as a New Player.
Sign up now! Don’t wait!
http://www.virtualcasinoes.net/_e4faa55afa1972493c43ac8a3f66f869/
wlefk lkwejf 2l3fj 2l3klew fhewlf jhewlk jewflk jfelkfew j
lfewjnbwlkfejlewkjfwelfkjh lkjhfelkj hewflkjhlk kljw hfelkjew lf hlkj
alflo2lkjl qweoijwfe0923 oiv09juwflk32 fjnhoijn fewkjn43 wfeoi2j329f8hj 29f8 h29f
hwfiu hfew98wh fohjnkjnld nbzpwefwef poewf


The Solution


The only necessary solution, which is implemented by just about all modern-day filters, is to assign unknown words a fairly neutral value so that they won’t influence the decision of the filter. The standard among most filters is generally 0.4000 or 0.5000. Since these unknown words have never appeared before in the user’s dataset, they will be assigned this neutral value, with the result that the more guilty-sounding text will rise to the top of the decision matrix.
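A minimal sketch of that behavior is shown below; the 0.4 hapax value and the 15-token window are representative choices, not fixed requirements.

UNKNOWN_TOKEN_VALUE = 0.4   # neutral value assigned to never-before-seen tokens
MATRIX_SIZE = 15            # only the most interesting tokens get scored

def build_decision_matrix(tokens, dataset):
    # 'dataset' maps token -> learned probability. Unknown junk lands at the
    # neutral value, is never "interesting," and so never displaces the
    # guilty evidence from the matrix.
    scored = [(dataset.get(t, UNKNOWN_TOKEN_VALUE), t) for t in set(tokens)]
    scored.sort(key=lambda pair: abs(pair[0] - 0.5), reverse=True)
    return scored[:MATRIX_SIZE]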

Quite ironically, these blocks of random text are sometimes very clear indicators of spam. Spammers are usually too lazy to rotate the random text often enough to keep it new to the filter, and as a result, the filter learns it as a clear marker of spam. Some believe that the random text doesn’t always change because it is the tail end of a Bayesian poisoning approach. If this is indeed the case, the spammers have done us a favor by leaving these identifying markers for our filters to see.


The Importance


Being one of the dumber ideas spammers have come up with, this approach isn’t particularly important, except for the need to understand that all the media hype about it is false.

Word Salad


Injecting random words into a message is a little different from injecting random text. The goal of the spammer in launching word salad attacks is to make the filter think that the content is legitimate by hiding the more guilty-sounding tokens. In this case, the spammer will proceed to pick several different words out of a dictionary, from a list of last names, or from other lists, with the goal of hitting on words that may have an innocent disposition in the recipients’ datasets.


The Problem


Spammers embed hundreds of dictionary words in their spam in an attempt to flood the decision matrix with innocent tokens. The tokens used will range from common household words to more specific words that may be more likely to generate hits on some users’ datasets (for example, “Quebec” or “Marvin”).

From: "Frank Mansfield" <rynzoten@oddpost.com> 
Reply-To: "Frank Mansfield" <rynzoten@oddpost.com>
To: jonathan@networkdweebs.com
Subject: Please her like never before
Date: Sat, 10 Aug 2002 02:57:07 –0400

14613317844455311
Did you know That the normal cost for Super vi@*gr@ is $20, per dose?
We are running a hot special!! T0DAY Its only an amazing $3.00
Sh2pped world wide!
http://conserve.dryydd.com/py/a
repetitious stoke hartford floodlight can’t nab capitulate millenia resusc=
itate bela eden drone countywide pi cerebral toccata siemens blackberry co=
ntusion bedtime deflater cambridge phenomena teresa syntax bum astraddle c=
onvoke decor argive guesswork menelaus litigant andromeda intent trigonome=
try dixon polygonal=20 abetted discrete franz scripture amplitude boeing t=
ype eeoc belt crafty warhead bosch transcendent earnest africa protrude ed=
wardine crest crossway carey saturnalia warwick plug aerospace marksmen cr=
aftsmen show matinal hexameter advisable kodiak horatio infight jaime=20

These tokens can be anywhere in the message. They are sometimes hidden in one of the many ways spammers use to hide text, such as inside a separate part of the message, in form variables, or even inside HTML tags. The text can also be right out in the open, usually appended to the end of the message, as in this example. Since the goal is to flood a decision matrix, the spammer knows they have to have at least 15 to 20 solid hits, and so several different words are used. Some messages have used up to a thousand or more random words.


The Solution


This approach rarely works in modern-day statistical filters, because the filter is smarter than the spammer. Graham-Cumming provided an excellent presentation about word salad spams at the MIT Spam Conference in 2004. To prove that these types of spams had no effect on spam filters, he took several microspams (also known as picospams), which contain very little data, and added hundreds of random words from his dictionary, websites, and even Wikipedia (an online encyclopedia; http://www.wikipedia.org). He then sent the messages through his filter, which in the very worst case allowed only 0.04 percent of the messages through. Achieving an error rate this high required that the recipient be sent over 10,000 spams and that each message increase in size by 300 percent from the last one! Since spammers aren’t doing this, the error rate of spams like these is very small, well below 0.01 percent.

The reason these types of spams rarely work is that they use too many obscure tokens, which end up being learned as spam. As spammers continue to use many obscure words in such emails, the known tokens become more likely to be considered guilty than innocent. For example, if I never use the word “disassociation,” but a spammer does, my filter’s much more likely to can a message containing the word. While the spammer may hit two or three innocent tokens, they’ve also hit several guilty ones. The more this approach is used, the less effective it becomes, since there are only so many different types of random words to use.

In order for this approach to work, the spammer must use primarily words that the recipient is using in their correspondence and must use fewer unknown or spammy words. Any words that the user isn’t using—no matter how innocent sounding—are likely to become guilty tokens in the user’s database.


The Importance


Word salad is not something to be concerned about, since the approach doesn’t work very often, but it is important to consider the effects it may have on a user’s dataset. It generates a great deal of data, which will need to be purged from the system at some point to avoid having large amounts of junk data taking up disk space. This approach may be ineffective at evading spam filters, but it does create a very slow denial of service attack if the data isn’t kept in check.
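One way to keep that junk in check is a periodic purge of tokens that were seen once or twice and then never again. The sketch below assumes each database record carries hit counts and a last-seen timestamp; the 90-day window and record layout are illustrative assumptions, not a prescribed policy.

import time

STALE_AGE = 90 * 24 * 3600   # assumed policy: drop tokens untouched for ~90 days

def purge_stale_tokens(dataset, now=None):
    # 'dataset' maps token -> {'spam_hits', 'innocent_hits', 'last_seen'}.
    # Word-salad junk tends to be seen once and never again, so it ages out.
    now = now or time.time()
    for token, record in list(dataset.items()):
        rarely_seen = record['spam_hits'] + record['innocent_hits'] <= 1
        if rarely_seen and now - record['last_seen'] > STALE_AGE:
            del dataset[token]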

Directed Attacks


Directed attacks can, in theory, work, but they require so much effort on the spammer’s part that the spam-filtering community hasn’t seen a true attempt in the wild. The idea behind a directed attack is for the spammer to create a profile for each recipient of the message. This profile contains a list of the most innocent words in the user’s database, so that they can be used in a spam distribution specifically for that user or group. This requires inside knowledge of the user’s spam filter data, knowledge that is difficult to come by and that constantly changes.


The Problem


The spammer will use one of several different approaches to attempt to build a profile on the targeted recipient. In most cases this involves sending several thousand messages to the target. The spammer will include a web bug, perhaps through the use of embedded images, to create a feedback loop. This feedback loop will tell the spammer which messages made it through the spam filter. The spammer can then take the messages that have gotten through and feed them into a similar Bayesian classifier to generate tokens that are most likely to be considered legitimate by the target, just like the old card trick where you pick the card you’re thinking of out of five decks. Over time, the spammer will be able to develop a short list of very innocent tokens.

Now comes the spam. The spammer sends their spam to the user, seeded with the list of very innocent tokens deduced from the earlier profiling of the target. The innocent tokens used in the spams will then make their way into the decision matrix and trick the filter into thinking that the message is legitimate.


Why This Doesn’t Work


This approach can work, and work quite effectively. Fortunately, it requires a significant amount of resources. Spammers don’t have the time or resources to perform massive analysis of millions of users, at least not yet. It is much more lucrative just to send out millions of blind emails a day. This approach could work its way into the standard operating procedure of spammers in the future, since what is too expensive today can become feasible as spammers learn to adapt. At present, many are simply counting on spammers never adopting this approach, even though the infrastructure to support this type of profiling already exists and computing power and available bandwidth continue to increase.

This approach has two inherent flaws, which filters with the appropriate logic can take advantage of. First, more complex tokenizers generate more data to fill up the decision matrix with guilty data, forcing the spammers to come up with more innocent data to counter it. Legitimate email will usually generate innocent individual tokens as well as innocent token pairs, but since the spammer is sending a hodgepodge of innocent tokens, the token pairs will be, for the most part, unknown to the filter. While a primitive tokenizer would identify only some of the spammer’s tokens, such as “Free” and “Viagra,” a more complex tokenizer would also be able to come up with tokens like “Free+Viagra,” which will most definitely show up in the decision matrix. Not only is it more difficult to circumvent filters that are using these types of tokenizers, but once the first distribution referencing these innocent tokens gets sent, the innocent tokens will no longer have their extremely innocent disposition and will be outweighed by the guilty tokens.
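A chained tokenizer of the kind described above can be sketched in a few lines; the word list in the usage example is only illustrative.

def chained_tokens(words):
    # Emit every word plus every adjacent pair, so a phrase like
    # "Free Viagra" also produces the token "Free+Viagra".
    return list(words) + [f'{a}+{b}' for a, b in zip(words, words[1:])]

print(chained_tokens(['Free', 'Viagra', 'today']))
# -> ['Free', 'Viagra', 'today', 'Free+Viagra', 'Viagra+today']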

The second flaw inherent in this type of attack is that the patterns of words used aren’t consistent with the patterns used in legitimate mail. This creates a type of contextual anomaly which can be detected using advanced algorithms such as Bayesian noise reduction (discussed in Chapter 13).

There’s been a lot of speculation about these types of attacks becoming mainstream, which is why it’s important to incorporate some of these additional functions into a filter. The most basic concept tokenizer can avoid much of the heartache that will be created when these types of attacks begin to be launched.

This approach was not invented by a spammer, mind you—it was invented by a very gifted spam filter author who also happens to have a doctorate and is vice president of a technology company, the author of many books, and the holder of two U.S. patents. These types of individuals don’t hang around spam slums working for the spammers. Most spammers’ staffs are generally lacking in skills and education. It will therefore take a significant amount of time for spammers to figure out ways to even understand these types of attacks and incorporate them into their spam tools. Once they do, however, it’s possible that primitive tokenizers will become obsolete or will at least cause a drop in the accuracy of the filter that doesn’t employ conceptual filtering.


The Importance


Directed attacks are an important type of attack to consider when developing spam filters. They can be very effective, and if the spammers devise a system to perform real-time user profiling, this approach could be detrimental to filters that don’t use more advanced analysis approaches. The more complex tokenizers discussed in Chapter 11, Bayesian noise reduction, and smarter mail clients are all ways to fight a directed attack successfully. This approach will certainly smash naive filters that don’t at least take the possibility of an attack of this type into consideration.
