Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

Jonathan A. Zdziarski

Classification Groups

Classification groups are another type of collaborative filtering algorithm, found almost exclusively in server-side spam filters. The concept behind classification groups is to seek assistance from a group of other trusted users when a user's filter is uncertain as to whether a given email is spam. Clearly, we don't want to ask the users themselves for input, which is why this is performed on the server side. Classification groups are a small-scale version of neural mesh networking, which we'll discuss later in this chapter, but they don't require the additional overhead of parallel processing or even the need to process every node in the group. Instead, a groupwide spam hunt is started, and iteration stops as soon as enough qualifying results come back from the query.

A series of nodes is defined, with one node representing one user on the system. When a node in the classification network receives a message, the node's filter instance determines its own confidence level in the result. If the confidence level is determined to be "uncertain," the user's filter instance then queries several other nodes in the network sequentially. Depending on the size of the network, the final classification may be driven by the first positive response, a majority decision, or a fixed percentage of decisions. For example, in a network of ten nodes, a user's filter instance may require two confirmations that a message is spam in order to classify it as such. All nodes are considered equally accurate, so once the minimum threshold has been met, iteration can stop. Other implementations iterate through the entire classification group and then tabulate which classification a majority of the nodes arrived at. Either way, this results in smaller queries and faster execution than a large-scale neural networking algorithm, and it can yield results that are just as good on smaller systems or in groups whose users have specifically opted into membership.
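As a rough sketch, the two strategies described above (stop early at a confirmation threshold, or poll the whole group and take the majority) might look like the following. The `classifiers` callables and the confirmation count are illustrative stand-ins for real peer-node queries, which the text does not specify at the code level.

```python
# Illustrative sketch of a classification-group query, not an actual
# implementation from the book. Each peer node is modeled as a callable
# that takes a message and returns "spam" or "ham".
from collections import Counter

def classify_by_threshold(classifiers, message, needed=2):
    """Query peers sequentially; stop early once `needed` agree on spam."""
    spam_votes = 0
    for classify in classifiers:
        if classify(message) == "spam":
            spam_votes += 1
            if spam_votes >= needed:
                return "spam"  # threshold met; no need to query the rest
    return "ham"  # not enough confirmations came back

def classify_by_majority(classifiers, message):
    """Alternative strategy: poll the whole group, then take the majority."""
    votes = Counter(classify(message) for classify in classifiers)
    return votes.most_common(1)[0][0]
```

Because all nodes are weighted equally, the early-stopping variant never contacts more peers than the threshold requires, which is what keeps the queries small.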

Classification groups can also be used to establish a global training user that provides out-of-the-box filtering to new users on the system. A global dataset can be generated either as a composite of several users' training data on the system or, more likely, trained directly by the systems administrator to provide mediocre but generalized filtering for new users during their initial training period. This global dataset can provide an acceptable level of filtering for new users until they are able to build their own training data. If a new user on the system has fewer than X legitimate messages and Y spams in their corpus, the filter may assume that the user isn't ready to perform their own classification, at which point the global user's dataset is queried instead. While the global dataset is in use, the user's filter instance can be trained on the results of the global decisions. Once the user's dataset has matured, the global dataset is disengaged and consulted only when the user's filter instance is uncertain about the classification of a particular message.
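A minimal sketch of that fallback logic, under stated assumptions: the thresholds `MIN_HAM` and `MIN_SPAM` stand in for the unspecified X and Y above, `user_classify` is a hypothetical callable returning a verdict plus a confidence flag, and `global_classify` queries the global dataset.

```python
# Hypothetical sketch of global-dataset fallback for new users. The
# thresholds are illustrative values for the unspecified X and Y;
# user_classify returns (verdict, confident) and global_classify
# returns a verdict directly.

MIN_HAM = 100   # illustrative value for "X legitimate messages"
MIN_SPAM = 100  # illustrative value for "Y spams"

def corpus_is_mature(ham_count, spam_count):
    """A corpus is mature once it holds enough of both message classes."""
    return ham_count >= MIN_HAM and spam_count >= MIN_SPAM

def classify_with_fallback(user_classify, global_classify,
                           ham_count, spam_count, message):
    if not corpus_is_mature(ham_count, spam_count):
        # Immature corpus: defer entirely to the global dataset (and, in a
        # real system, train the user's own filter on this decision).
        return global_classify(message)
    verdict, confident = user_classify(message)
    if confident:
        return verdict
    # Mature corpus but an uncertain result: consult the global
    # dataset only as a tiebreaker.
    return global_classify(message)
```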

Classification groups work very well in well-maintained environments but can cause problems among users with different email behavior. Since one person's spam is another person's legitimate email, classification groups can generate some false positives early on for messages that require additional classification. Classification groups are also designed for users with mature datasets; new users should never be placed in a classification group until they have built up enough data to filter spam accurately on their own. One nice safeguard of classification groups is that errors are retrained, making the user's dataset more confident about the particular types of messages that user receives, so the classification group may not even need to be consulted the next time a similar message arrives.
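The retraining safeguard can be sketched very naively as follows: once the group resolves an uncertain message, that verdict is fed back into the user's own dataset so similar messages gain local confidence. The token counting here is a deliberately crude illustration, not how a real statistical filter scores tokens.

```python
# Naive, illustrative sketch of the retraining safeguard. Real filters
# keep proper per-token statistics; this only counts token sightings
# under each verdict to show why the group stops being needed.
from collections import defaultdict

def train(dataset, message, verdict):
    """Record each token of the message under the group-resolved verdict."""
    for token in message.lower().split():
        dataset[(token, verdict)] += 1

def local_confidence(dataset, message):
    """Crude confidence: fraction of tokens seen before in either class."""
    tokens = message.lower().split()
    if not tokens:
        return 0.0
    seen = sum(1 for t in tokens
               if dataset[(t, "spam")] or dataset[(t, "ham")])
    return seen / len(tokens)
```

After the group resolves, say, "buy cheap pills now" as spam, training on that verdict means the user's own dataset recognizes those tokens the next time they appear, and the classification group need not be consulted again.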