Floating-Point Renormalization and Underflow


Bayesian classifiers designed for the Graham model typically keep only the 20 to 50 most exemplary local probabilities and evaluate those through the Bayesian chain rule. This makes the math easy, because 0.1 to the 50th power is 10^-50, which is still well within the range of the IEEE floating-point specification. However, 1.0 minus 10^-50 is exactly equal to 1.0, even in 80-bit floating point. (For those readers with a vague familiarity with computer arithmetic, this is called loss of precision due to floating-point normalization.) This is bad; it means that once the Bayesian chain rule hits a Pout of 1.0, it can never recover from that value.
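
A few lines of C make the rounding behavior concrete. This is only an illustrative sketch assuming x86 80-bit long doubles; the variable names and values are mine, not any particular filter's:

    #include <stdio.h>

    int main(void)
    {
        long double tiny  = 1e-50L;        /* one very confident local probability */
        long double p_out = 1.0L - tiny;   /* rounds to exactly 1.0: the 64-bit
                                              significand cannot hold 1 - 1e-50 */

        printf("1.0 - 1e-50 == 1.0?  %s\n", (p_out == 1.0L) ? "yes" : "no");

        /* Once the running value has collapsed to 1.0, further factors of the
           chain rule cannot pull it back down by any meaningful amount. */
        p_out *= (1.0L - tiny);
        printf("after another factor: %.20Lg\n", p_out);
        return 0;
    }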

For classifiers like CRM114 that don’t throw away any of the features, this problem of loss of precision becomes much worse. Unless special steps are taken, the useful dynamic range of an 80-bit IEEE number can be exhausted even before the headers of an email message are fully processed.
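
As a rough illustration (using 64-bit doubles for brevity; the 80-bit case behaves the same way, only with more headroom), the following sketch counts how many features it takes for a naive running product to underflow to zero. The per-feature probability of 0.01 is an assumption chosen for the example, not a figure from CRM114:

    #include <stdio.h>

    int main(void)
    {
        double product  = 1.0;
        int    features = 0;

        /* Multiply in a hypothetical local probability of 0.01 per feature
           until the running product underflows to zero. */
        while (product > 0.0) {
            product *= 0.01;
            features++;
        }

        /* With doubles (smallest subnormal around 5e-324), this happens after
           only a couple hundred features -- far fewer than a message supplies
           when no features are thrown away. */
        printf("underflowed to zero after %d features\n", features);
        return 0;
    }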

To prevent this, CRM114 uses a unique two-range system. Both P(spam) and P(nonspam) are calculated separately, and the smaller is used to recalculate the larger. Thus, even though the larger probability may become numerically indistinguishable from 1.0, the smaller retains full accuracy down to 10^-300, which is quite useful for classifying large documents.
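
The sketch below shows one way such a two-range scheme can work in practice. It is an illustration of the idea under my own assumptions (a single hypothetical local probability of 0.9 repeated over 300 features), not CRM114's actual code:

    #include <stdio.h>

    int main(void)
    {
        long double p_spam = 0.5L, p_nonspam = 0.5L;   /* neutral starting point */

        for (int i = 0; i < 300; i++) {
            long double local = 0.9L;                  /* hypothetical spammy feature */
            p_spam    *= local;
            p_nonspam *= (1.0L - local);

            /* Renormalize so the two values always sum to 1.0.  The larger one
               saturates at 1.0, but the smaller keeps full relative accuracy. */
            long double sum = p_spam + p_nonspam;
            p_spam    /= sum;
            p_nonspam /= sum;
        }

        /* Read the verdict off the smaller value; computing 1.0 - p_spam here
           would just print 0, because p_spam is indistinguishable from 1.0. */
        printf("P(spam) reads as %Lg, but P(nonspam) is still %Lg\n",
               p_spam, p_nonspam);
        return 0;
    }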
