Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification


Jonathan A. Zdziarski


Final Thoughts



In this chapter, we’ve looked at three very different approaches to advanced tokenization. All three attempt to identify lexical patterns and the specific characteristics we know as concepts, but each adds a different amount of data to the dataset and requires a different amount of resources. Depending on the complexity of the filter and the other algorithms used to perform language classification, one type of tokenization may be more appropriate than another.

All filters should consider implementing at least one form of concept identification. Primitive tokenizers have proven very effective and resource friendly, but as new types of messages are crafted specifically to evade filters, it is becoming increasingly necessary to identify not only the components of a message but also the concepts within it. The advanced features this type of tokenization provides (such as HTML classification and grammatical analysis) bring an entirely new level of accuracy to modern-day filters, without necessarily requiring a significant increase in resources.
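To make the contrast concrete, here is a minimal sketch of the difference between a primitive tokenizer and one that adds a simple form of concept identification. The `tokenize` function and its pairing scheme are hypothetical illustrations, not the specific algorithms discussed in this chapter: it emits ordinary word tokens, then optionally adds adjacent word pairs as "concept" features, so that a phrase like "free offer" becomes a feature in its own right rather than two unrelated words.

```python
import re

def tokenize(text, with_concepts=True):
    """Split a message into primitive tokens and, optionally,
    simple concept features (hypothetical adjacent-pair scheme)."""
    # Primitive tokenization: lowercase runs of word-like characters.
    tokens = re.findall(r"[a-z0-9$'-]+", text.lower())
    features = list(tokens)
    if with_concepts:
        # Concept identification sketch: each adjacent word pair
        # becomes one feature, letting the filter weigh a phrase
        # as a whole instead of only its individual words.
        features += [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return features

print(tokenize("Claim your FREE offer now"))
```

Note that the concept pass roughly doubles the number of features per message, which illustrates the trade-off above: richer features for the classifier at the cost of a larger dataset.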

Concept learning is the foundation of true machine learning; the philosophical question, “What is content?” requires that concept identification be an important part of the answer.
