
Bayesian Noise Reduction (BNR)


Present-day filters lack true lexical intelligence. When we think of spam filtering, most of us think about the word “Viagra” mixed in with a bunch of junk text, or perhaps just a single image followed by a story.

Today’s generation of statistical filters simply scans through an email and looks for the most interesting buzzwords to weigh its disposition by. Spam filters are faced, however, with a unique problem: a malicious villain is behind the scenes trying to misfeed filters with bogus data. While spammers aren’t the brightest collection of minds, they seem to have a general idea of how typical content-based filters work. At the very least, they know enough to hide the “bad” words and try to add some “good” words.

With some time and a little trial and error, spammers have recently come to the point of injecting anything from non-sensible text to a target group’s web page into an email, knowing that, at the end of the day, our filters are really just dumb token recognizers. Most of the time this doesn’t work, because a user’s data is far too specialized to crack with words from obscure books or website lingo the user has never seen. Once in a while, however, spammers get lucky: they include just the right text in the right quantity to pass their message into our inbox. A bigger concern is that spammers may find ways to mine even more specific data from users through the use of Internet worms, web bugs, or similar means. If a spammer can find the perfect set of words to inject, they can get their spam past our filter.

Bayesian noise reduction (BNR for short) is one of the technologies I’ve designed in an attempt to give the filter a “lexical brain,” or rather an intelligence that allows the filter to look at language in a way similar to human understanding. In order to form concepts, we must have something spam filters don’t presently have: a context. Think of Bayesian noise reduction as a way to identify “out-of-context” words. If we’re having a face-to-face conversation and I say something out of context, it could assume a completely different meaning and easily lead to a misunderstanding. This problem flows into language classification, in that a token can resolve to one disposition when in context (say, the word “righteous” in the phrase “righteous code”) but have a completely opposite disposition when presented out of context (such as in a list of non-sensible text injected by a spammer). Filters today have no real way to deal with out-of-context words or phrases, because they don’t know what a context is.

BNR identifies out-of-context data by making its own contexts. It creates a series of machine-generated contexts around a sample of text (namely, the message body) and then identifies data that contradicts itself within the context it has created. The process is illustrated in the sections that follow, using three basic steps that any statistical filter should be able to implement.

Instantiation Phase


Let’s take a look at some text your filter might happen across while reading your email:

Mom Would Be Proud Try Viagra Now!

When your filter reads “Mom Would Be Proud Try Viagra Now!” it’s going to assign a series of probabilities to each word because, well, that’s what statistical filters do.

Text:             Mom    Would   Be     Proud   Try    Viagra   Now!
Assigned Values:  0.60   0.34    0.71   0.20    0.91   0.99     0.99


The first step in the noise reduction process involves instantiating a series of artificial contexts, or patterns, around this text. To create an artificial context, we first pigeonhole each of the values assigned by the filter into a band, rounded to the nearest 0.05. This helps to limit the total number of patterns we’re likely to come up with, as you’ll see shortly.

Text:             Mom    Would   Be     Proud   Try    Viagra   Now!
Assigned Values:  0.60   0.34    0.71   0.20    0.91   0.99     0.99
Assigned Bands:   0.60   0.35    0.70   0.20    0.90   1.00     1.00


Next, we simply chain the bands together three by three to create patterns:

0.60_0.35_0.70 0.35_0.70_0.20 0.70_0.20_0.90 0.20_0.90_1.00 0.90_1.00_1.00

Each pattern represents the bands for three adjacent tokens in our sample text. We instantiate patterns for the entire body of our message, which leaves us with a series of what we call artificial contexts.

If you have a multiword-capable filter (for example, SpamProbe, DSPAM, or CRM114), you will want to instantiate a separate set of patterns for each layer of depth. For example, use the prefixes “S” and “M” to distinguish between single-word token patterns and multiword token patterns, or use 1, 2, 3, 4, and 5 to distinguish the different depths of SBPH.
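As a concrete sketch of the instantiation phase, here is how the banding and chaining might look in Python. The token_probability lookup is a hypothetical stand-in for however your filter retrieves a token’s current value; nothing here is specific to DSPAM or any other implementation.

def band(p, width=0.05):
    # Pigeonhole a probability into the nearest 0.05 band (e.g., 0.34 -> 0.35).
    return round(int(p / width + 0.5) * width, 2)

def instantiate_contexts(tokens, token_probability, window=3):
    # Band each token's value, then chain the bands together, window by
    # window, to form the artificial context patterns described above.
    bands = [band(token_probability(t)) for t in tokens]
    return ["_".join("%.2f" % b for b in bands[i:i + window])
            for i in range(len(bands) - window + 1)]

Fed the seven sample values above, instantiate_contexts produces exactly the five patterns shown earlier, from 0.60_0.35_0.70 through 0.90_1.00_1.00.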

Training Phase


Once we have instantiated a series of artificial contexts for an email, we take some time learning them, in a fashion very similar to the way we learn the rest of the tokens in our database. Each pattern is given a spam counter and a nonspam counter, just like any other token, and we calculate a probability for each pattern. For example, I use Paul Graham’s approach to assigning token values:

P = (spamHits / totalSpam) / (spamHits / totalSpam + innocentHits / totalInnocent)

Unlike Paul Graham’s standard calculation, however, no bias is used when calculating these pattern values, as we’re not interested in whether the value is guilty or innocent—only whether it’s consistent with the context around it. After a handful of email is processed in this fashion, our contexts are going to take on a disposition just like any other token, as illustrated in Table 13-4.
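As a sketch of how this training might look in code, assuming a simple in-memory store (the PatternStore class and its names are illustrative, not an actual filter’s API): each pattern carries its own counters and resolves to the unbiased value above.

from collections import defaultdict

class PatternStore:
    # Every pattern gets a spam counter and a nonspam counter, just like
    # any other token in the database.
    def __init__(self):
        self.spam_hits = defaultdict(int)
        self.innocent_hits = defaultdict(int)
        self.total_spam = 0        # spam messages trained so far
        self.total_innocent = 0    # innocent messages trained so far

    def train(self, patterns, is_spam):
        hits = self.spam_hits if is_spam else self.innocent_hits
        for pattern in patterns:
            hits[pattern] += 1
        if is_spam:
            self.total_spam += 1
        else:
            self.total_innocent += 1

    def probability(self, pattern):
        # Graham-style value with no innocent bias: we care only about
        # consistency with the surrounding context, not guilt or innocence.
        if not self.total_spam or not self.total_innocent:
            return 0.45            # nothing learned yet; stay neutral
        s = self.spam_hits[pattern] / self.total_spam
        i = self.innocent_hits[pattern] / self.total_innocent
        return 0.45 if s + i == 0 else s / (s + i)

Calling store.train(patterns, is_spam=True) with the patterns instantiated from a spam message, and the same with is_spam=False for legitimate mail, eventually yields pattern values like those shown in Table 13-4.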


Table 13-4: Learned Pattern Contexts

Guilty                        Innocent
1.00_0.00_0.45  [0.91990]     0.95_0.25_0.65  [0.02000]
1.00_0.40_1.00  [0.81868]     0.00_0.25_0.90  [0.00900]
0.25_1.00_1.00  [0.99990]     0.65_0.20_0.00  [0.00900]
0.35_1.00_1.00  [0.99990]     1.00_0.60_0.15  [0.21000]
1.00_1.00_0.20  [0.99990]     0.00_0.80_0.55  [0.00900]
1.00_1.00_0.25  [0.99990]     0.15_0.05_1.00  [0.00900]
0.55_1.00_1.00  [0.99990]     0.60_0.85_0.25  [0.12900]
1.00_1.00_0.35  [0.99990]     0.00_0.60_0.90  [0.02000]
0.25_1.00_1.00  [0.99990]     0.70_0.05_1.00  [0.17000]
1.00_1.00_0.15  [0.99990]     0.85_0.95_0.10  [0.00900]
0.15_1.00_1.00  [0.99990]     0.75_0.90_0.50  [0.00600]
0.10_1.00_1.00  [0.99990]     0.65_0.65_0.75  [0.00600]
0.20_1.00_1.00  [0.99990]     0.40_0.95_0.10  [0.16699]



Some contexts will resolve to a very innocent or very guilty disposition, and others will be less interesting. We want to identify contexts that are both very extreme in their value and self-contradictory. A pattern context must meet two basic criteria to be interesting enough for us to use:



The pattern’s value must exceed an exclusionary radius of 0.25 from neutral; that is, its distance from a neutral 0.5, ABS(0.5 - P), must be greater than 0.25. For a typical Bayesian filter, this means the pattern’s value must resolve to between 0.00 and 0.25 or between 0.75 and 1.00.



The pattern must contain at least one data point with a value at least 0.30 away from the pattern’s value; that is, ABS(PP - PW) must be at least 0.30, where PP is the value of the pattern and PW is the value of the word, or token. For example, if the pattern’s value is 0.90, it must have at least one data point with a value below 0.60.



Looking at Table 13-4, for example, we see that the pattern 1.00_0.00_0.45 has an extremely guilty value (a 91 percent likelihood of being spam). Not only is this very interesting, but the fact that the pattern includes an extremely innocent token (0.00) is a good sign that this is a token we want to look at. The sketch below shows these two tests in code.
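In code, the two criteria reduce to a pair of distance tests. A minimal sketch (the 0.30 token radius follows the prose above; Listing 13-1, later in this section, uses a slightly wider 0.33):

WINDOW_RADIUS = 0.25    # how far from a neutral 0.5 a pattern must stray
TOKEN_RADIUS = 0.30     # how far from the pattern's value a token must stray

def interesting(pattern_value, token_values):
    # Criterion 1: the pattern's value is extreme (outside the radius).
    if abs(0.5 - pattern_value) <= WINDOW_RADIUS:
        return False
    # Criterion 2: at least one data point contradicts the pattern.
    return any(abs(pattern_value - v) >= TOKEN_RADIUS for v in token_values)

# The first pattern in Table 13-4: extreme (0.92) and self-contradictory,
# since it contains an extremely innocent 0.00 data point.
assert interesting(0.91990, [1.00, 0.00, 0.45])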

Dubbing Phase


Now let’s take a logical look at what we’ve accomplished. Given the pattern 1.00_0.00_0.45 (the first item in Table 13-4), which our filter trained to a value of 91 percent, our filter has discovered that the presence of an extremely guilty token (1.00) next to an extremely innocent token (0.00), next to a token we’ve not seen before (0.45 is the neutral value used by many filters), is collectively guilty. That is, this pattern of token values (regardless of the actual words used) is specifically guilty. And if this pattern is guilty, we must come to the logical conclusion that the token that the filter previously learned as 0.00 is completely contradictory in its present context—so it must be out of its normal context!

The dubbing phase, quite simply, involves omitting these anomalies. Given the following:

bnr.s.0.35.0.05.0.80 [0.99990] 
bnr.s.0.05.0.80.1.00 [0.99990]
Your Terminal TRY VIAGRA!
0.34 0.04 0.81 0.99 (Token Values)
0.35 0.05 0.80 1.00 (Corresponding Band)

we dub out the inconsistencies—any token in the pattern whose value is farther than 0.30 from the pattern’s value. So instead of seeing “Your Terminal TRY VIAGRA!” we now see


Blah Blah TRY VIAGRA!
-.-- -.-- 0.81 0.99

Or we could get even more creative and change the out-of-context tokens’ polarity to match that of the context, which provides a bit of moral satisfaction (and possibly a more accurate result):

Your Terminal TRY VIAGRA!
0.99 0.99 0.81 0.99

Listing 13-1 summarizes the entire process.

Listing 13-1: The Bayesian noise reduction process

let windowSize = 3
let windowRadius = 0.25
let tokenRadius = 0.33

program start
begin loop[cursor] (each token in text sample)
    instantiate context[context] (from windowSize tokens at cursor)
    load probability windowValue for context
    let interestingWindow = (ABS(0.5 - windowValue) > windowRadius)
    if (interestingWindow)
        begin loop[token] (each token in context)
            load probability tokenValue at token
            let inconsistentToken = (ABS(windowValue - tokenValue) > tokenRadius)
            if (inconsistentToken)
                eliminate token occurrence from classification
        end loop
end loop
program end
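For filter authors who prefer working code, the listing translates into Python as follows. This is a sketch, not the DSPAM implementation: token_probability and context_probability stand in for your filter’s own database lookups (the latter could be the hypothetical PatternStore.probability shown earlier).

WINDOW_SIZE = 3
WINDOW_RADIUS = 0.25
TOKEN_RADIUS = 0.33

def band(p, width=0.05):
    # Same banding as in the instantiation sketch.
    return round(int(p / width + 0.5) * width, 2)

def reduce_noise(tokens, token_probability, context_probability):
    # Returns the set of token positions to eliminate from classification.
    values = [token_probability(t) for t in tokens]
    bands = [band(v) for v in values]
    dubbed = set()
    for cursor in range(len(tokens) - WINDOW_SIZE + 1):
        # Instantiate the artificial context at this cursor position.
        pattern = "_".join("%.2f" % b for b in bands[cursor:cursor + WINDOW_SIZE])
        window_value = context_probability(pattern)
        # Skip contexts that aren't extreme enough to be interesting.
        if abs(0.5 - window_value) <= WINDOW_RADIUS:
            continue
        for offset in range(WINDOW_SIZE):
            # Dub out any token inconsistent with its context's value.
            if abs(window_value - values[cursor + offset]) > TOKEN_RADIUS:
                dubbed.add(cursor + offset)
    return dubbed

The returned positions are simply omitted from classification, exactly as the dubbing phase describes.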

Examples


Now let’s take a look at some samples of actual mail before and after the noise reduction process. It’s important to note that the BNR algorithm performs the processing shown without any knowledge of the sample’s disposition.


Bayesian Noise Reduction and Spam Classification


We see in the example in Listing 13-2 that much of the message is simply irrelevant text. While some of the text is useful for identifying spam against the test user, there is also an abundance of innocent and neutral text. These types of spam often provide only a minute amount of useful information to work with and flood the message with conversational text in an attempt to fool spam filters.

Listing 13-2: A real-world microspam with embedded word list attack

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<HTML><HEAD><TITLE>Message</TITLE>
<META content="MSHTML 6.00.2800.1276" name=GENERATOR></HEAD>
<BODY> <DIV> </DIV> <DIV></DIV>
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left><FONT
face=Tahoma
size=2>-----Original Message-----<BR><B>From:</B> cierra myers
[mailto:sangglenna@techemail.com] <BR><B>Sent:</B> Tuesday, February 03, 2004
10:09 AM<BR><B>To:</B> Penny Kelly<BR><B>Subject:</B> >>Attract your
mate<BR><BR></FONT></DIV><FONT face="Courier New" size=1>My Saturdays nights
are no longer spent by the fire "reading a good book". four seconds until
picture isdownloaded <BR><A wjyf.com
mnhoavekpkjnimoixmomhioecwshvvl="http://
emjwxbholwirhkwwaoufvtcfamrydfnqx"><IMG src="http://www.netstarsite.com/
argo.net" fnqusjjsqenrqgwn.com
rqmjrqpkoccxavkbfhfrilkctbameurvfuepwjd="http://joydnxvnslffgngrejhustv"
NOSEND="1"> </A><BR><BR>No? drawled the dragonette; it seems to me very
babyish
<BR>How old is your mother? asked the girl <BR>Oh! I really think, continued
the
boy, nodding sagely, that it wouldn’t be well to have these Records scattered
around <BR>Mother’s about two thousand years old; but she carelessly lost
track
of her age a few centuries ago and skipped several hundreds Their use would
givesome folks unfair advantage over others, you know

Using the Bayesian noise reduction algorithm against a real-world subject with a sufficiently trained database, the filter eliminates the following text (and underlying values) from the sample:

HTML PUBLIC HTML HTML TITLE TITLE OutlookMessageHeader us dir ltr left Tahoma  
cierra myers sangglenna techemail February To Attract Courier Saturdays nights
spent good book four seconds isdownloaded href wjyf fnqusjjsqenrqgwn NOSEND No
drawled dragonette babyish Oh! nodding sagely well these Records Mother’s
about two but she carelessly lost track her age few centuries skipped several
Their use givesome folks advantage know HTML
0.16 1.00 0.16 0.16 0.43 0.43 0.40 0.27 0.31 0.40 0.44 0.24 0.40 0.40 0.40
0.40 1.00 0.23 0.40 0.32 0.40 0.30 0.22 0.09 0.25 0.39 0.09 0.40 1.00 0.40
0.40 0.40 0.28 0.40 0.40 0.40 0.40 0.40 0.40 0.17 0.26 0.40 0.40 0.24 0.22
0.49 0.61 0.40 0.43 0.16 0.76 1.00 0.41 0.40 0.16 0.39 0.24 0.19 0.40 0.11
0.61 0.38 0.16

This leaves the following text and values for the classifier:

W3C DTD Transitional EN HEAD Message META content MSHTML name GENERATOR HEAD  
BODY DIV DIV DIV DIV DIV class lang en align FONT face size Original Message
BR From mailto com BR Sent Tuesday AM BR Penny Kelly BR Subject your mate BR
BR FONT DIV FONT face New size My are no longer by the fire reading until
picture BR com IMG src com BR BR the it seems to me very BR How old is your
mother asked the girl BR really think continued the boy that it wouldn’t be to
have scattered around BR thousand years old of ago and hundreds would unfair
over others you FONT BODY
1.00 1.00 1.00 1.00 1.00 0.02 1.00 1.00 1.00 1.00 1.00 1.00 0.79 0.16 0.16
0.16 0.16 0.16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.02 0.02 0.76 0.17 0.08
1.00 0.76 0.00 0.02 0.01 0.76 1.00 0.03 0.76 0.03 0.91 1.00 0.76 0.76 1.00
0.16 1.00 1.00 1.00 1.00 0.77 0.90 1.00 1.00 0.87 0.87 0.47 0.06 1.00 1.00
0.76 1.00 1.00 1.00 1.00 0.76 0.76 0.87 0.85 0.02 0.89 1.00 0.22 0.76 0.12
0.10 0.89 0.91 1.00 1.00 0.87 1.00 0.76 0.07 0.10 1.00 0.87 1.00 0.89 0.85
0.05 0.88 0.89 0.73 1.00 0.57 0.76 1.00 1.00 0.10 0.94 1.00 0.92 1.00 0.11
1.00 1.00 0.11 0.86 1.00 0.79

At first glance, it may appear that much would-be junk text remains. However, a closer look at the tokens’ underlying values shows that they are very useful to the filter for evaluating the message. What’s left is a much cleaner, more consistent set of data for processing.


Bayesian Noise Reduction and Legitimate Message Classification


To illustrate how statistically unbiased the noise reduction algorithm is, Listing 13-3 shows the same function performed on a sample of legitimate conversation. This particular message includes noise from the mailing list’s embedded advertisements and the informal conversation itself.

Listing 13-3: Legitimate email with common noise and embedded noise from list advertisements

<html><body> 
<tt><BR>
-hey sassy canadian..I'll do it for ya..just email me.<BR>
I'm at mom's. We got caught in a snowstorm coming home from <BR>
Susanville..I'm exhausted! lol<BR>
In clovergirls@yahoogroups.com, &quot;Chris &amp; Heather Nish&quot; <BR>
&lt;hcnish@t...&gt; wrote:<BR>
&gt; Hey guys, <BR>
&gt; I need one of you to email someone for me...<BR>
&gt; My emails aren't getting to a potential customer and<BR>
&gt; now she's starting to get pissy with me...lol<BR>
&gt; any volunteers?<BR>
<BR><BR><BR></tt>
<!-- |**|begin egp html banner|**| -->
<br><tt><hr width="500">
<b>Yahoo! Groups Links</b><br>
<ul><li>To visit your group on the web, go to:<br><a >http://groups.yahoo.com/group/
clovergirls/</a><br>&nbsp;
<li>To unsubscribe from this group, send an email to:<br><a
>clovergirls
unsubscribe@yahoogroups.com</a><br>&nbsp;
<li>Your use of Yahoo! Groups is subject to the <a >Yahoo! Terms of Service</a>.
</ul> </tt> </br>
<!-- |**|end egp html banner|**| -->
</body></html>

As you can see in the following results, this time the noise reduction algorithm perceived the inconsistencies to be primarily patterns with guilty features, which might otherwise have led to a misclassification of the message.

Unlike the first example, in which innocent tokens were eliminated, the patterns found in this legitimate message lend themselves to preventing false positives.

Eliminations:

I'm In com amp gt gt gt need one email for gt emails getting gt now get with  
gt any your on the from this send an email mailto com subject com nbsp Your
use is subject the
0.82 1.00 1.00 1.00 0.94 0.84 0.35 1.00 1.00 0.58 1.00 0.64 0.54 1.00 0.35
0.98 0.95 0.43 0.63 0.63 0.63 0.56 1.00 0.84 1.00 0.63 0.64 0.54 0.63 0.98
0.84 1.00 0.63 1.00 1.00 1.00 1.00 1.00 1.00 0.73 0.55 0.84 0.70 0.95 0.69
0.95 0.94 0.69 0.59 1.00 0.69 1.00

Remaining text:

hey sassy I'll ya me mom's caught coming Susanville exhausted! lol clovergirls  
yahoogroups quot Chris Heather Nish quot lt hcnish wrote Hey guys of you to
someone me My aren't to potential customer and she's starting to pissy me lol
volunteers Yahoo! Groups Links To visit group web go to href To unsubscribe
group to href clovergirls unsubscribe yahoogroups Unsubscribe clovergirls
unsubscribe yahoogroups of Yahoo! Groups to href Yahoo! Terms of Service
0.07 0.00 0.10 0.04 0.90 0.00 0.10 0.17 0.00 0.00 0.00 0.00 0.01 0.28 0.07
0.00 0.00 0.28 0.17 0.00 0.07 0.17 0.10 1.00 1.00 1.00 0.20 0.90 0.36 0.13
1.00 1.00 0.93 1.00 0.04 0.17 1.00 0.00 0.90 0.00 0.00 0.01 0.01 0.02 0.97
0.31 0.01 0.04 0.29 1.00 1.00 0.97 0.83 0.01 1.00 1.00 0.00 0.83 0.01 0.06
0.00 0.83 0.01 1.00 0.01 0.01 1.00 1.00 0.01 0.07 1.00 0.08

End Result


The results provided by this algorithm are quite impressive. In legitimate messages, the number of guilty identifiers that could lead to a false positive is substantially reduced. In spam messages, there is a significant reduction in innocent identifiers that could lead to a spam misclassification. And this all takes place without the noise reduction algorithm having any knowledge of the true disposition of the message. After performing tests on random system users, I’ve found that the BNR algorithm appears to improve confidence by an average of 20 percent and, in a few isolated cases (which could be considered false positives), to reduce confidence by only about 5 percent. Table 13-5 shows two such users on my system and their results.

Table 13-5: Bayesian Noise Reduction, Illustrating Improved Confidence

                                   User 1    User 2
Total Samples Analyzed             3,948     2,280
Total Improved Confidence*         2,523     1,522
Total Decreased Confidence*        26        16
Total N/C in Confidence*           1,399     742
Average Increase in Confidence*    21.51%    20.80%
Average Decrease in Confidence*    5.26%     4.00%

* Confidence calculated using Robinson’s Geometric Mean Test Inverted


Efficacy


I’ve found that BNR’s ultimate effectiveness (and long-term efficacy) depends on how its pattern contexts are trained by the user’s filter. At present, I am training the contexts on every message processed (and retraining on errors), but I’ve found that training only on hard-to-classify messages makes BNR a little more sensitive to different types of noise. I believe there’s also a threshold for purging that should be developed through trial and error. Perhaps dividing all of the counter totals by 2 at certain milestones could help keep the pattern contexts dynamic enough to adapt to new types of context, as sketched below. Training philosophy will most definitely affect BNR’s performance, so it’s a good idea to find a happy medium through a little testing.
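Purely as a speculative sketch of that halving experiment, and nothing the text prescribes (it assumes the hypothetical PatternStore from the training sketch), a milestone purge might look like this:

def decay_patterns(store, milestone=10000):
    # Halve every pattern counter once enough messages have been trained,
    # so stale contexts fade and the database stays adaptive to new noise.
    # The milestone value is an arbitrary placeholder to be tuned by testing.
    if store.total_spam + store.total_innocent < milestone:
        return
    for counts in (store.spam_hits, store.innocent_hits):
        for pattern in counts:
            counts[pattern] //= 2
    store.total_spam //= 2
    store.total_innocent //= 2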

Since BNR behaves according to the context values it has learned for a specific user, your actual mileage may vary. I’m confident, however, that this approach will come in handy as spammers continue to grow their word-mining databases. At some point, spammers will be able to generate enough accurate junk to increase their success rate against typical content-based statistical filters. The great thing about this algorithm is that its function is abstracted from the actual words. To circumvent this type of approach, a spammer not only needs to mine words likely to be innocent, but also needs to mine guilty and neutral words accurately to the nearest 0.05. The spammer would also need to somehow mine the learned patterns and values from a user’s filter (which may not even be possible) and then put it all together to create artificially “in-context” junk text. This is, at the very least, computationally infeasible today.

The BNR algorithm appears to be very useful at identifying out-of-context data within any type of message (good or spam), and it does its job remarkably well, considering that it has no suppositions or knowledge about the true disposition of the message. I tested this algorithm against my own corpus of mail and was very surprised to see my accuracy jump from 99.96 percent to an astonishing 99.985 percent (from 1 error in 2,500 to 1 error in 7,500).

This implementation of Bayesian noise reduction has been included in DSPAM versions 3.4 and later (older versions sported a more heuristic approach) and is also available in a GPL library for other filter authors at http://bnr.nuclearelephant.com.
