Sparse Binary Polynomial Hashing
Bill Yerazunis originally introduced the concept known as SBPH, or sparse binary polynomial hashing. SBPH is an approach to tokenization using word pairs and phrases. If it wasn’t so effective at what it does, it would probably be a terrible idea—but Yerazunis has repeatedly astonished the spam-filtering community with the leaps in accuracy made by SBPH tokenization. Graham refers to SBPH with the same mixed feelings regarding its ingenuity and need for medication:
Another project I heard about . . . was Bill Yerazunis’ CRM114. This is the counterexample to the design principle I just mentioned. It’s a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that’s what it’s doing.
SBPH tokenizes entire phrases, up to five tokens across, and allows for word skipping in between. It led the way in accuracy for a long period of time, but it also generates an enormous amount of data, which is one reason it presently functions only in a train-on-error environment. SBPH provides the benefit of using the simplest, most colloquial tokens while also giving special notice to more complex tokens, which are usually much stronger indicators of spam when they appear.

A few filters, such as CRM114, perform this type of word skipping, which will tokenize an obfuscation such as "manh+<!rescind>+ood" literally and also help the filter "see" the original token by skipping over the inserted material: "manh+ood." Since tokenization is an imperfect process, approaches like this generally provide more machine-readable tokens to work with, without requiring much additional effort. The more permutations of machine-readable tokens are created, however, the larger and more spread out the dataset becomes, possibly affecting accuracy. The amount of data generated by SBPH turns many filter authors away from it in favor of simpler functions such as HTML comment filtering. We'll cover SBPH in more depth in Chapter 11.
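The windowing described above can be sketched as follows. This is an illustrative simplification, not CRM114's actual implementation: it slides a five-token window over the token stream and, for each window, emits every subsequence that keeps the newest token, marking skipped positions with a placeholder (the `<skip>` marker and function name are my own, not from the source).

```python
def sbph_features(tokens, window=5, skip_mark="<skip>"):
    """Sketch of SBPH-style feature generation: for each position,
    take the trailing window of up to `window` tokens and emit all
    subsequences that include the newest token, with skipped words
    replaced by a placeholder."""
    features = []
    for end in range(len(tokens)):
        win = tokens[max(0, end - window + 1):end + 1]
        n = len(win) - 1  # positions before the newest token
        # Each bitmask selects which earlier tokens to keep (2^n variants).
        for mask in range(2 ** n):
            phrase = [win[i] if mask & (1 << i) else skip_mark
                      for i in range(n)]
            phrase.append(win[-1])  # newest token is always kept
            features.append(" ".join(phrase))
    return features
```

For the two-token input `["click", "here"]` this yields `"click"`, `"<skip> here"`, and `"click here"`; with a full five-token window, each position contributes sixteen phrase variants, which is exactly why the dataset grows so quickly.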