Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification [Electronic resources]

Jonathan A. Zdziarski


Chapter 7: The Low-Down Dirty Tricks Of Spammers

word salad (inserting arbitrary words into a message), the majority of people using modern-day statistical filters aren’t seeing their filters crack under pressure. In fact, the more salad a spammer doles out, the better these filters become at identifying the messages as spam. Many of the algorithms discussed in this book have been implemented to perform advanced decoding of messages, enhancement of concept identification, and even data polishing to further enhance the effectiveness of filters. As spammers have invented new tricks to evade these types of filters, all that statistical filters have needed is the occasional tweaking of code or turning of a knob by the filter author. Plenty of new tricks are being devised even today by spammers’ programmers, but so far none has been found that works effectively enough to keep them making money.
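One reason word salad tends not to help the spammer can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular filter: it assumes a filter that scores a message using only its most "interesting" tokens (those whose spam probability lies furthest from a neutral 0.5) and combines them with Bayes' rule. The token table, the value given to unknown tokens, and the window size are all hypothetical values chosen for the example.

```python
from math import prod

# Hypothetical per-token spam probabilities, as a trained filter might hold them.
TOKEN_SPAM_PROB = {
    "viagra": 0.99,
    "click": 0.90,
    "unsubscribe": 0.85,
    "meeting": 0.10,
    "walrus": 0.55,   # salad words: seen about equally in spam and legitimate mail
    "teapot": 0.48,
    "lantern": 0.52,
}
UNKNOWN_PROB = 0.4  # assumed value for tokens never seen in training


def spam_score(tokens, window=3):
    """Combine the `window` most interesting tokens into one spam probability."""
    probs = [TOKEN_SPAM_PROB.get(t, UNKNOWN_PROB) for t in set(tokens)]
    # Most interesting first: largest distance from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:window]
    spam = prod(top)                 # evidence that the message is spam
    ham = prod(1 - p for p in top)   # evidence that it is legitimate
    return spam / (spam + ham)


plain = ["viagra", "click", "unsubscribe"]
salad = plain + ["walrus", "teapot", "lantern", "zeppelin"]

print(spam_score(plain))
print(spam_score(salad))
```

Under these assumptions both messages receive exactly the same score, because the near-neutral salad words never make it into the decision window. The sketch shows only that neutral filler is ignored; in a filter that keeps learning, salad words that turn up repeatedly in spam would also drift toward spammy probabilities of their own.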

Successful Filtering

Being the anal-retentive hackers we are, we start to think that if a spammer has managed to get a single spam past our filter, the filter is a failure. But one spam slipping through is not what keeps a spammer in business. It’s a lot more work to get a message through to a hundred people (let alone a million) than it is to get it to one user, especially if each of those individuals has a different idea about what is and isn’t spam. It’s not going to be enough for a spammer to push their message past one user’s filter; in order to make money, they have to get it through to tens of thousands of users, and get at least a handful of not-so-bright greenhorns to click their link. Many savvy filter users receive only one or two spams per month, or even fewer. This suggests that even though filters aren’t 100 percent accurate, they are succeeding. Fortunately, there are dozens of different spam filters out there to choose from, and so spammers can’t spend much time trying to find the weaknesses in any one filter. Because spam filtering isn’t a monoculture, and because statistical filters are able to learn from their mistakes, spammers are delivering far fewer spams than they once did.

No More Headaches

During the time heuristic filters were popular, an arms race began between spammers and filter authors, in which spammers tried to evolve the features of spam faster than the filter authors could revise the rule sets. When statistical filtering came on the scene, things quickly changed, and spammers realized that they could no longer succeed in spamming simply by giving the developer a new headache. The playground quickly transformed, and statistical filters became the bully. Spammers are currently shifting their tactics to specifically target these new filters, and although there’s a lot of media hype, the spam-filtering community now has the winning hand. We know we’re winning, not only by the accuracy of our filters, but also by the fact that spammers are trying so many different tricks to target them.

The best way for a spammer to beat a statistical filter is to make sure the recipient isn’t running one. It is to the spammers’ benefit to continue the spread of misinformation about statistical filters, namely the misconception that they are ineffective. Any negative media hype or urban legends that can be propagated could turn people away from these types of filters, which is the only way spammers have found to evade them. Educating engineers about this misinformation is now a responsibility left up to, well, someone. Nobody wants to keep the spammers in business, and so convincing people that the spammers’ evasion tricks are futile is one way to help raise the overall accuracy of spam filtering (by getting more people involved in running the appropriate tools).

The nearsightedness of spammers gives us an advantage in our programming efforts. As we look at all of the common tricks spammers have incorporated, we’ll see that they have been directed only at the present problem of the day, with absolutely no imagination invested in finding real ways to potentially beat a spam filter. Spammers still make basic, uneducated attempts to evade filters. Spamming is a slimeball business. You’d hardly find the best-of-breed hackers working for spammers; more often they’re mediocre programmers looking for work. This fact has prevented statistical filtering from coming up against any intelligent counterattack. With the popularization of interpreted languages, CGI, and simple layout languages like HTML, a subculture has developed proclaiming the philosophy that reading a book about Perl or PHP, or learning how to do up a website in Microsoft FrontPage can make someone a real programmer. Talent is truly on our side in the fight against spam. No self-respecting, talented software developer wants anything to do with it, apart from using the Delete button, if necessary.