The Future of Language Classification - Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification [Electronic resources]


Jonathan A. Zdziarski

The Future of Language Classification



Although language classification has proven quite effective at fighting spam, it is not yet as popular as other approaches have been in the past. Migrating filters to use this next-generation technology is not a step that will happen overnight. Many are concerned about the long-term efficacy of language classifiers, and there is even some resistance to this new approach on the part of a few developers and manufacturers, who insist that their older tools are better designed (even if they experience poor levels of accuracy).

For example, SpamAssassin still incorporates more than 900 heuristic rules, although it has recently implemented a Bayesian filter “rule” to quiet the masses. Many people still believe that it hasn’t been implemented well enough to compete with next-generation filters, as the effectiveness of the software is still dropping. In the words of one SpamAssassin user, “SpamAssassin doesn’t learn, it just tolerates you to make you feel like you’re actually accomplishing something.” The Bayesian learning mechanism does improve accuracy, but holding on to the heuristic portions of the program keeps it from becoming a filter that will outperform statistical products.

Some commercial appliances still refuse to adopt language classification technology altogether. As a result, many systems on the market today promise a “Bayesian element” but aren’t really dedicated to statistical analysis. This Bayesian element is given a slice in the filtering chain (usually at the bottom), front-ended by heuristic rules, blackhole lookups, whitelists, and a hodgepodge of other solutions that have been worked into the product. These upstream components end up starving the Bayesian filter of data or causing it to degenerate to the point that it reflects the same results as the higher-level components. Many commercial anti-spam companies simply misunderstand Bayesian filtering or make incorrect statements about it (such as its level of accuracy) to push their own products.
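The starvation effect can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the stage functions, the `BayesianStage` class, the addresses, and the four sample messages are hypothetical and do not come from any real product. The sketch only counts what the statistical stage gets to observe when it sits at the bottom of a chain versus when it runs as the first line of defense.

```python
from collections import Counter

# Hypothetical upstream stages. Each returns a verdict ("spam" or "ham")
# when it decides a message, or None to pass the message down the chain.
def whitelist(msg):
    return "ham" if msg["sender"] in {"boss@example.com"} else None

def blackhole_lookup(msg):
    return "spam" if msg["sender_ip"] in {"203.0.113.7"} else None

def heuristic_rules(msg):
    return "spam" if "FREE!!!" in msg["body"] else None

class BayesianStage:
    """Stand-in for a statistical filter; it only records what it observes."""
    def __init__(self):
        self.seen = 0
        self.tokens = Counter()

    def observe(self, msg):
        self.seen += 1
        self.tokens.update(msg["body"].lower().split())

def run_chain(messages, upstream, bayes):
    """Feed messages through the upstream stages; only undecided ones reach bayes."""
    for msg in messages:
        if any(stage(msg) is not None for stage in upstream):
            continue  # decided upstream, so the statistical stage never sees it
        bayes.observe(msg)
    return bayes

# Illustrative sample data (assumed, not taken from the text).
MESSAGES = [
    {"sender": "boss@example.com", "sender_ip": "198.51.100.1", "body": "meeting at noon"},
    {"sender": "a@example.net", "sender_ip": "203.0.113.7", "body": "cheap pills online"},
    {"sender": "b@example.net", "sender_ip": "198.51.100.2", "body": "FREE!!! prize inside"},
    {"sender": "c@example.net", "sender_ip": "198.51.100.3", "body": "lunch tomorrow?"},
]

chained = run_chain(MESSAGES, [whitelist, blackhole_lookup, heuristic_rules], BayesianStage())
first_line = run_chain(MESSAGES, [], BayesianStage())

print("messages seen at bottom of chain:", chained.seen)
print("messages seen as first line:", first_line.seen)
```

In this toy run the bottom-of-chain stage observes one message out of four, while the first-line stage observes all four. A real filter would also train on the verdicts, which this sketch deliberately leaves out.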

The Sovereignty of Statistical Filtering


The primary concern that commercial appliance manufacturers have with language classification is that to be most effective it should be implemented as a first-line defense against spam. Statistical filtering is its own sovereign state and functions best with its own militia. Since it is more accurate than any other spam-fighting technology to date, placing any less accurate tool in front of it only hurts the accuracy of the filter. Most manufacturers are a bit concerned with the idea of deploying a box that learns on its own. Their customers would no longer need annual contracts for nightly updates (of rule sets) or as many software upgrades, which certainly puts the manufacturers in a precarious financial position. By dumbing down the filtering appliance with heuristic tools (relying on annual subscriptions), manufacturers are able to continue generating residual revenue while implementing a lesser technology. Most people think in terms of an arsenal, where more weapons seem to mean better protection, and so this approach makes sense to them.

Eventually, a new business model will be sorted out, but until then, some corporations have marketed Bayesian content filtering as an incomplete solution, for financial reasons. They have sought to convince the public that statistical filtering is ineffective. However, Bayesian content filtering hasn’t turned out to be ineffective, but rather too effective. Due to the financial risk of deploying technology that improves itself, many open source language classifiers have been left outperforming commercial solutions.[2]

Many other types of appliances tout five 9s (99.999 percent) accuracy. Brightmail makes this claim, but some who have implemented its solution have reported accuracy as low as 83 percent, unless they use their glorified whitelists. Other solutions require the use of challenge/response or other annoying technologies in order to achieve these inflated levels of accuracy. In nonstatistical filters, the price of accuracy is human effort. Paradoxically, saving human effort is one of the primary reasons most companies choose to filter spam.
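To put these percentages in perspective, the arithmetic below converts an accuracy figure into a count of misclassified messages. It is a rough sketch that assumes "accuracy" means the share of all messages classified correctly. The volume of 100,000 messages and the function name `misclassified` are illustrative assumptions, and the 99.9 percent figure is the one quoted in footnote [2].

```python
from decimal import Decimal

def misclassified(volume, accuracy_percent):
    """Messages classified wrongly out of `volume`, given accuracy in percent.

    Decimal is used so that figures such as 99.999 do not pick up
    binary floating-point rounding error.
    """
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return int(volume * error_rate)

VOLUME = 100_000  # assumed message volume, chosen only for illustration

for label, accuracy in [
    ("claimed five 9s", "99.999"),
    ("reported low", "83"),
    ("footnote [2] figure", "99.9"),
]:
    print(label, accuracy, "percent ->", misclassified(VOLUME, accuracy), "errors")
```

Under these assumptions, the gap between the claimed and the reported figure is the difference between 1 and 17,000 misclassified messages per 100,000.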

[2] With the exception of one purely statistical commercial solution, Death2Spam. Death2Spam achieves accuracy levels of up to 99.9 percent, higher than most other commercial solutions on the market that implement a hybrid solution.
