Chapter 14: Collaborative Algorithms
Statistical filtering has a reputation for taking place in isolation, since filtering is better when it’s more personalized. The problem with this isolation, however, is that there’s a lot of value in sharing information too. Collaborative algorithms can help improve the accuracy of a filter by providing information about important new email so that the recipient’s filter doesn’t have to learn it from a misclassification. Several different collaborative algorithms are used today by popular filters, and all attempt to serve the same purpose, which is to make two heads better than one. Collaborative algorithms help us to further define content. Content is not only my content; it’s also the other guy’s content or, more specifically, the content we might have in common.
Collaborative algorithms can be used correctly and incorrectly. The correct way to use them is as supporting algorithms to present information about other users in a network and their results. We’ll discuss some of the more popular collaborative algorithms in this chapter.
Message Inoculation
Message inoculation is a concept originally invented by Bill Yerazunis of CRM114. It is best explained by Yerazunis himself:
Part of the problem is that spam isn’t stationary, it evolves. That pesky .1 percent error rate is in some part due to the base mutation rate of spam itself. Maybe the answer is “vaccination.” Vaccination is allowing _one_ person’s misery to be used to generate some protective agent that protects the rest of the population; only the first person to get the spam actually has to read it. My expectation is this: Say you have ten friends, and you all agree to share your training errors. Each of you will (statistically) expect to be the first to see a new mutation of spam about 9 percent of the time; the other ten friends in this group will have their Bayesian filter trained preemptively to prevent this. Net result: you get a tenfold decrease in error rate—down to 99.99 percent accuracy. With a hundred such (trusted) friends, you may be down to 99.999 percent accuracy.
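The arithmetic behind this estimate can be sketched in a few lines. This is only an illustration, assuming that every new spam mutation is equally likely to reach any member of the group first and that training errors are shared before the mutation reaches anyone else; the function name accuracy_with_group is hypothetical.

```python
def accuracy_with_group(base_error_rate: float, friends: int) -> float:
    """Expected accuracy when `friends` trusted peers share training errors.

    Assumption: each member is the first to see a new mutation
    1 / (friends + 1) of the time, and only "first sightings" cause errors.
    """
    first_seen_share = 1.0 / (friends + 1)
    return 1.0 - base_error_rate * first_seen_share


# A 0.1 percent base error rate, alone and with 10 and 100 trusted friends.
for friends in (0, 10, 100):
    print(friends, f"{accuracy_with_group(0.001, friends):.5%}")
```

With ten friends, each member is first roughly 1 time in 11 (about 9 percent), which is where the roughly tenfold reduction in the error rate comes from.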
A message inoculation is an antigen that is introduced into each participating member’s filter. The antigen inoculates the user not only from receiving the original spam, but also from receiving any other spam that is contextually similar. The antigen trains the right amount of lexical data in the user’s dataset to seed it against any permutations or similar messages that happen to share some contextual similarities.
Inoculation takes place only when a training error has occurred. This is important for two reasons. As accurate as language classifiers are today, they are still capable of making errors, and we don’t want to accidentally inoculate users with misclassified mail (this will cause a snowball of errors to occur). Also, one of the goals in fighting spam on a large scale is to reduce the number of resources that spam is allowed to use. If we were to fire off an inoculation every time a spam was caught, we would be doubling the amount of bandwidth spam uses already. As erroneous classifications are made, inoculations get sent out as the user attempts to correct the error. We end up sending only information that is most likely to be missed by someone else’s filter.
It’s also important to note that inoculation groups comprise trusted users; that is, every member in the group has chosen to trust inoculations from every other member. This is very important, since message inoculation is protected with an authentication mechanism to prevent malicious injection. A list of shared secrets and/or public keys is exchanged between users in a group, so that they can authenticate one another’s inoculations. Presently supported mechanisms include an MD5 checksum with shared secret and public key signatures.
Message inoculation was drafted into a message format outlined in the Internet-Draft available on the Internet Engineering Task Force (IETF) website at http://www.ietf.org/internet-drafts/draft-yerazunis-spamfiltinoculation-04.txt.
The Internet-Draft describes a specific MIME encoding for sending message inoculations through email. The benefit of using this encoding is that inoculation groups may consist of different users on different systems, even using different spam filters. The message inoculation is generally sent by the user’s spam-filtering application, which could be run on the server side or embedded in the mail client itself.
According to the Internet-Draft, the message inoculation encoding consists of six different components:
1. The inoculation subtype, which identifies that the message being received is an inoculation and should be treated accordingly.
The inoculation subtype is explained further as a subtype used in conjunction with a standard content-type header. Three primary media types are currently supported: message, text, and multipart. The inoculation subtype is used to define an inoculation either in the top-level headers or in one or more parts of a multipart message.
Content-Type: message/inoculation
A multipart inoculation is capable of delivering multiple inoculations in a single message. Each part of the message contains its own additional inoculation headers, such as authentication information.
2. An Inoculation-Sender field, which identifies the sender of the inoculation and provides an identity the recipient can query locally for authentication information (such as a shared secret, public key, et cetera).
The sender of the inoculation must be in the recipient’s trusted group in order for the inoculation to be accepted. Since an authentication mechanism is used, the inoculation sender’s identity is looked up in a table where it will exist in conjunction with either a shared secret or a key. The Inoculation-Sender field is also a header field.
Inoculation-Sender: bill_yerazunis
The sender’s identity may be a username or an email address but should be specific enough so that it doesn’t risk confusion with any other users in the group. Since it’s possible for some users to have many different email addresses, using the naming convention firstname_lastname is usually less ambiguous.
3. An Inoculation-Type field, which specifies the type of inoculation payload being sent (spam or nonspam), to instruct the filter how to proceed with importing the inoculation payload.
There are essentially two different types of inoculations a user can send. The most common type is a spam inoculation, to protect all other members of the group. Depending on the group’s makeup, it may also be appropriate to send a message inoculation for a particular nonspam to the members of the group. If the group consists of employees at a company, for example, a nonspam inoculation could be used to retrain a company message the user received as a false positive.
Inoculation-Type: spam
4. An Inoculation-Authentication field, which specifies the method of authentication provided (if any) to verify that the inoculation is from a trusted user.
Depending on the type of authentication being used, the Inoculation-Authentication field will contain the authentication method and any additional data necessary to authenticate the message correctly. For example, if the authentication mechanism being used is an MD5 checksum with a shared secret, this field will include the checksum sent by the other user.
Inoculation-Authentication: md5; checksum="c3a47b29744062288cbd5c305897eaa9"
The Internet-Draft does make provision for an authentication type of “none,” but it strongly recommends against implementing this authentication type, as it would be relatively easy for a spammer to send bogus inoculations to a large group of users if no concept of authentication is being used.
5. Extended authentication message components, such as a public key signature, may be present depending on the authentication mechanism used.
If the authentication mechanism being used involves a public key, the signature for the inoculation payload will be present in a different part of the message, identified by a “signed” subtype.
6. The inoculation payload, which is the actual information provided to seed the filter tool.
There are a few different types of inoculation payloads, depending on the actual inoculation type. For example, a message inoculation will contain an Internet message in RFC 822 format, complete with headers and a message body. If the inoculation is a standard text inoculation, unformatted text will be sent as an inoculation payload. Unformatted text is generally processed differently than an Internet message, as the tokenizer will generally process message headers differently from the message body.
Once the inoculation has been received, it is up to the filter to determine whether or not the user requires the inoculation, and it will apply the inoculation using whatever training the filter sees fit.
A complete example of the process of receiving and processing an inoculation using MD5 with a shared secret authentication mechanism follows.
1. The recipient’s inoculation-aware spam tool notes that this is an inoculation-type message.
2. The recipient’s spam tool parses the headers to find that the claimed sender is a trusted user and the claimed inoculation type is spam.
3. The recipient’s spam tool checks the local set of authorized inoculators and finds that the identified user is permitted to inoculate spam.
4. The recipient’s spam tool looks up the identified user in its configuration and finds that the corresponding authentication shared secret is a particular string of text.
5. The recipient’s spam tool tests to confirm that this is not a multipart inoculation and that the payload is the entire data text area.
6. The recipient’s spam tool forms the authentication text by concatenating the authentication shared secret, a newline, and the full data text area (omitting the obligatory newline-newline after the last header line) and continuing to end-of-file on the email text or the length of the content, specified in the content-length field, if present.
7. The recipient’s spam tool calculates the MD5 checksum of this authentication text.
8. The recipient’s spam tool compares the calculated checksum (from step 7) with the claimed checksum found in the message header. If the checksum does not match, no automatic inoculation is done, and the mail server may either notify the user of the failure of an attempted inoculation or may simply drop the message and exit with nonerror status. It is recommended that this behavior be user configurable.
9. Having validated the authenticity of the sender/checksum/payload, the spam tool forwards the payload (and only the payload) to the learning interface of the proper user-configured spam-filtering program, including the type of payload presented.
10. The filter then determines whether the inoculation is useful (for example, it won’t inoculate if you already have the “disease”) and applies the inoculation if appropriate.
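The receiving process above can be sketched in Python. This is a minimal illustration, not the reference implementation from the Internet-Draft: the TRUSTED table, the example secret, and the function names are hypothetical, only the MD5 shared-secret method and single-part inoculations are handled, and the final decision about whether the filter actually needs the inoculation is left to the caller.

```python
import hashlib
import hmac
import re
from email.parser import HeaderParser

# Hypothetical trust table: sender identity -> (shared secret, permitted types).
TRUSTED = {"jonathan_zdziarski": ("example-secret", {"spam", "nonspam"})}


def auth_checksum(secret: str, payload: bytes) -> str:
    """MD5 over the shared secret, a newline, and the payload."""
    return hashlib.md5(secret.encode("utf-8") + b"\n" + payload).hexdigest()


def verify_inoculation(raw: str, trusted=TRUSTED):
    """Return (inoculation_type, payload) if authentic, otherwise None."""
    # HeaderParser keeps everything after the blank line as one raw string.
    msg = HeaderParser().parsestr(raw)

    # Only a single-part inoculation message is accepted in this sketch.
    if msg.get_content_type() != "message/inoculation":
        return None

    # The claimed sender must be trusted and permitted to send this type.
    sender = (msg["Inoculation-Sender"] or "").strip()
    kind = (msg["Inoculation-Type"] or "").strip().lower()
    if sender not in trusted:
        return None
    secret, permitted = trusted[sender]
    if kind not in permitted:
        return None

    # Extract the claimed checksum; other authentication methods are rejected.
    auth = msg["Inoculation-Authentication"] or ""
    claimed = re.search(r'checksum="([0-9a-fA-F]{32})"', auth)
    if auth.split(";")[0].strip().lower() != "md5" or claimed is None:
        return None

    # The payload runs to end of text, or to Content-Length if present.
    payload = msg.get_payload().encode("utf-8")
    if msg["Content-Length"] is not None:
        payload = payload[: int(msg["Content-Length"])]

    # Recompute the checksum and compare it with the claimed one.
    expected = auth_checksum(secret, payload)
    if not hmac.compare_digest(expected, claimed.group(1).lower()):
        return None

    # Hand back only the payload and its type for the learning interface.
    return kind, payload.decode("utf-8", errors="replace")
```

A caller would pass the returned payload and type to its filter's training routine, or notify the user or silently drop the message when None comes back, depending on configuration.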
An example of a message inoculation is provided in Listing 14-1.
Listing 14-1: Example of a message inoculation
To: Everyone on my list <spamsucks@myhouse.com>
From: Jonathan A. Zdziarski <mymailbox@mydomain.com>
Subject: This is a test inoculation
Inoculation-Authentication: md5;
    checksum="dcdac94fab6ded79f33b0134d665d02f"
Inoculation-Type: spam
Inoculation-Sender: jonathan_zdziarski
Content-Type: message/inoculation
Content-Length: 169

From: Bob Denver <bob@dead.com>
Subject: This is a spam
To: You <you@youremail.com>

This is a test innoculation. The checksum is correct, however.
-Bill Yerazunis
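On the sending side, the inoculation headers shown in Listing 14-1 could be assembled roughly as follows. This is a sketch under stated assumptions: build_inoculation and its parameters are hypothetical names, the checksum is taken over the shared secret, a newline, and the payload as the receiving procedure describes, and ordinary transport headers (To, From, Subject) are assumed to be added by whatever mailer sends the message.

```python
import hashlib


def build_inoculation(sender: str, secret: str, kind: str, payload: str) -> str:
    """Wrap `payload` in message/inoculation headers with an MD5 checksum."""
    if kind not in ("spam", "nonspam"):
        raise ValueError("inoculation type must be 'spam' or 'nonspam'")

    data = payload.encode("utf-8")
    # Authentication text: shared secret, a newline, then the payload.
    checksum = hashlib.md5(secret.encode("utf-8") + b"\n" + data).hexdigest()

    headers = [
        "Inoculation-Authentication: md5;",
        f'\tchecksum="{checksum}"',  # folded continuation of the header above
        f"Inoculation-Type: {kind}",
        f"Inoculation-Sender: {sender}",
        "Content-Type: message/inoculation",
        f"Content-Length: {len(data)}",
    ]
    # A blank line separates the inoculation headers from the payload.
    return "\n".join(headers) + "\n\n" + payload


if __name__ == "__main__":
    spam = "From: Bob Denver <bob@dead.com>\nSubject: This is a spam\n\nBuy now\n"
    print(build_inoculation("jonathan_zdziarski", "example-secret", "spam", spam))
```

Building the message by hand rather than through a MIME library keeps the payload byte-for-byte identical to what was checksummed, which matters because any re-encoding would invalidate the checksum.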
Supporting Data
As a trial test of message inoculation’s ability to adequately protect a group of users, ten live users were selected and grouped together in a single message inoculation group. The users’ mailboxes were mirrored, so that an uninoculated mailbox and an inoculated mailbox were created. The users then continued along with their daily lives for 30 days. At the end of the 30-day period, each user’s inoculated mailbox had an average of 20 fewer spams than the uninoculated mailbox. Each user experienced a total of about 3,000 spams during this period; the accuracy level dramatically improved from an average of around 99.3 percent to 99.96 percent. If this were implemented at a medium-sized Internet provider of 10,000 accounts, it would decrease the total amount of spam received by approximately 2.4 million messages annually.
This test proved that message inoculation works, at least for the test group. Larger groups of mature users should be able to count on even higher levels of accuracy from a larger base of inoculation.
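The figures from the trial can be cross-checked with a short calculation. The variable names are illustrative, and the annual projection assumes twelve 30-day periods per year with every account behaving like the average test user.

```python
spams_per_user = 3000          # spams per user over the 30-day trial
baseline_accuracy = 0.993      # uninoculated mailbox
inoculated_accuracy = 0.9996   # inoculated mailbox

# Spams that slip through in each mailbox over 30 days.
missed_before = spams_per_user * (1 - baseline_accuracy)
missed_after = spams_per_user * (1 - inoculated_accuracy)
saved_per_user = round(missed_before - missed_after)

# Projection to a provider with 10,000 accounts.
accounts = 10_000
periods_per_year = 12  # assumption: twelve 30-day periods per year
annual_savings = saved_per_user * accounts * periods_per_year

print(saved_per_user, annual_savings)
```

The two accuracy figures imply about 20 fewer missed spams per user per period, which matches the reported difference and scales to the 2.4 million messages quoted for a 10,000-account provider.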
External Inoculation
Real accounts don’t necessarily have to be the only source of inoculation. External inoculations involve the use of honey pots (mailboxes set up specifically to receive only spam) and have become quite popular recently to capture spam in the wild. The theory behind external inoculation is this: why put anyone through the misery of being the first to receive a new spam when you can have the spammers themselves send it directly to you? On top of this, you can combine external and internal inoculation by taking spam you receive externally and inoculating your friends with it internally. This is all accomplished by establishing one or more honey pots.
The email address of a honey pot is frequently circulated in invisible text and other types of places where harvest bots are likely to pick it up. The satisfying thing about using honey pots to perform message inoculation is that the spammers themselves are really inoculating you from their new distributions of spam without even knowing it!
Message inoculation has two primary uses. First, it provides a way for users to collaborate with other trusted users and learn important lexical data for new types of spams before they arrive. Second, message inoculations provide a way for users to collaborate directly with the spammers and to inoculate themselves and other members of their group from new types of spam.