Proprietary Implementations - Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

Jonathan A. Zdziarski

Proprietary Implementations



Of course, if you don’t feel like using a third-party product, you can always build one yourself. That’s what Bill Yerazunis did to support the large amounts of data generated by his SBPH (Sparse Binary Polynomial Hashing) tokenizer. His proprietary implementation proves that home-brew solutions can get the job done right.

Yerazunis’s implementation consists of a series of files ending in .css (not to be confused with cascading style sheets). Each .css file is exactly 1 MB in size plus 1 byte. CRM114 has been modified to use 64-bit hash values as keys. Each numeric key is reduced modulo a value based on the file size, and the result determines the exact spot in the file where the token belongs, so no traversal or complex indexing is necessary. Each record is exactly 12 bytes in size, and the average database is approximately 25 MB.
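The direct-addressing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CRM114's actual file format: the 12-byte record layout (an 8-byte key followed by a 4-byte count), the BLAKE2b hash, the tiny table size, and the function names `token_key`, `slot_offset`, `increment`, and `lookup` are all hypothetical choices made for the example.

```python
import hashlib
import struct

# Hypothetical record layout: 8-byte key + 4-byte count = 12 bytes.
# This is an assumption for illustration, not CRM114's on-disk format.
RECORD = struct.Struct("<QI")
NUM_SLOTS = 1024  # illustrative table size, far smaller than a real file


def token_key(token: str) -> int:
    """Hash a token to a 64-bit integer key (BLAKE2b is a stand-in hash)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def slot_offset(key: int, num_slots: int = NUM_SLOTS) -> int:
    """Byte offset of the one record slot that this key maps to."""
    return (key % num_slots) * RECORD.size


def increment(table: bytearray, token: str) -> int:
    """Add one to the token's count and return the new count."""
    key = token_key(token)
    offset = slot_offset(key, len(table) // RECORD.size)
    stored_key, count = RECORD.unpack_from(table, offset)
    if stored_key != key:
        # Empty slot or collision: this sketch simply takes over the slot.
        count = 0
    count += 1
    RECORD.pack_into(table, offset, key, count)
    return count


def lookup(table: bytearray, token: str) -> int:
    """Return the stored count, or 0 if the slot holds a different key."""
    key = token_key(token)
    offset = slot_offset(key, len(table) // RECORD.size)
    stored_key, count = RECORD.unpack_from(table, offset)
    return count if stored_key == key else 0


# Usage: a zero-filled buffer stands in for the fixed-size file.
table = bytearray(NUM_SLOTS * RECORD.size)
increment(table, "free money")
increment(table, "free money")
```

Because the offset is computed directly from the key, every read or update touches exactly one record, which is why no index or traversal is needed. The price is that two tokens can map to the same slot; the sketch handles that crudely by letting the newer token overwrite the older one.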

Yerazunis’s latest implementation of this storage solution runs very fast, a marked improvement over previous versions of the software. The average classification cycle uses approximately 0.10 seconds of real time, 0.03 seconds of user time, and 0.02 seconds of system time, figures that are much more impressive than those of many other filters available today. On top of speed, CRM114 is considered one of the most accurate language classifiers freely available today. Disk space is cheap, so supporting 10,000 users won’t cost too much: if each user had a separate database of about 25 MB, the total would come to roughly 250 GB. CRM114 is presently being used on some large-scale implementations in Spain, and many optimizations are available for the software. The storage implementation continues to improve, and with the software under an open source license, it’s bound to get a significant amount of attention.
