Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

Jonathan A. Zdziarski

Word Pairs


Using word pairs, or nGrams, has recently become very popular among authors of statistical filters and offers several benefits over standard single-token filtering. Pairing words together creates more specialized tokens. For example, the word “play” could be considered a very neutral word, as it could be used to describe many different things. But pairing it with the word adjacent to it gives us a token that will inevitably stick out more when it occurs—for example, “play lotto.” This approach also helps improve the processing of HTML components by identifying the different types of generators used to create HTML messages. Each generator, whether it’s a legitimate mail client or a spam tool, has its own unique signature, which joining tokens together can help to highlight. Tokenizers that implement these types of approaches are referred to as concept-based tokenizers, because they identify concepts in addition to content. We’ll discuss the different implementations of nGrams in Chapter 11.
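As a rough illustration only, and not the tokenizer described later in Chapter 11, a word-pair tokenizer can be sketched in a few lines of Python. The function name tokenize_pairs and the word-splitting pattern below are hypothetical simplifications; a production filter would also deal with message headers, HTML markup, and punctuation rules.

    import re

    def tokenize_pairs(text):
        # Split the message into lowercase word tokens. This pattern is a
        # simplification; a real tokenizer also handles headers and HTML tags.
        words = re.findall(r"[a-z0-9$'!-]+", text.lower())
        tokens = list(words)  # the standard single-token stream
        # Join each word with the word adjacent to it to form pair tokens.
        tokens += [a + " " + b for a, b in zip(words, words[1:])]
        return tokens

    print(tokenize_pairs("Play lotto today and win"))
    # ['play', 'lotto', 'today', 'and', 'win',
    #  'play lotto', 'lotto today', 'today and', 'and win']

The point of the sketch is that the pairs are simply appended to the token stream: training and scoring treat “play lotto” as just another token, so the same statistical tables and probability calculations apply without modification.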
