Requirements Assessment
In Chapter 8, we discussed the key considerations for a storage driver. These considerations were primarily focused on the internal design of the filter:
Disk space utilization
Performance
Locking
Portability
Recovery
I/O contention
Random-access features
Ease of use
However, these properties extend into the topics many institutions consider when evaluating different solutions for their network. These concerns need to be taken into account in the development of the entire filter, as they will play a role in whether or not the filter is usable on networks of a particular size. We’ll discuss the basic questions that get asked when looking at a filter and provide a little foresight into the different answers that may be suitable for a particular network based on its size, budget, and methodology. These issues can be broken into the following categories:
Total disk space requirements
Total processing power
Parallelization
Operating system requirements
High availability requirements
I/O bandwidth requirements
Feature requirements
End-user support
Large-scale spam-filtering projects usually have considerable budgets, but these budgets are unlikely to exceed what it already costs to manage spam without filtering. The goal in developing a scalable spam filter is to strike a balance between meeting the requirements just listed and keeping the cost within the confines of a typical budget. Return on investment is the biggest requirement managers have (or should have) when considering whether to deploy a new tool. Everyone agrees that spam is annoying, and everyone would like to filter it; unfortunately, if management is unable to justify an actual return on the investment, it may be cheaper to let spam run rampant on the network.
Total Disk Space Requirements
The amount of disk space needed to service an individual user versus that needed for a network of 100,000 users is usually seen as a linear problem. In global environments, in which a single dataset is shared, this generally isn’t an issue, and since a lot of institutions prioritize cost over accuracy, many are willing to settle for the mediocre 99 percent accuracy levels a global dataset would offer. Providing greater accuracy for a large number of users requires more disk space, however, and while many companies are happy with mediocre levels of accuracy, there are some interested enough in the problem to bite at a better solution. Customer support and even customer satisfaction are sometimes the driving reasons to deploy a “spam filter that learns” on the network. Users are going to receive spam from time to time (and a lot of it if the system is filtering only 99 percent), but whether they call up the tech support department or simply click a button labeled “Report Spam” may determine how much money the company is willing to invest in managing inaccuracy. Politics may also drive this decision, as many universities and ISPs face privacy and user-control issues.
Greater accuracy requires personalized learning, and usually additional data. While it may be trivial to dedicate 25 MB of disk space to each user on a 10-user system, 25 MB of disk space per user on a 100,000-user system would require roughly 2.5 TB of storage! Clearly, 25 MB per user is far too much for such a large-scale implementation. There are a few ways of reducing this amount, the most effective of which involve converting storage from a linear problem into a nonlinear one.
Establishing a Purge Policy
An aggressive purge policy is your first defense in keeping the size of the dataset down. A majority of the data created by a tokenizer is never seen again in other messages—especially if you’re using a complex tokenizer that recognizes word pairs and lexical data. Nightly purges can help to
Remove data that has been stale for a period of time
Remove hapaxes that have been stale for a period of time
Remove uninteresting data that has matured
Data that has been stale for a period of time is likely not to appear in future messages. Since spam evolves, it’s possible that much of the data from two or three months ago will not be seen again in future distributions. It’s also likely that as users change their email behavior, much of the data from their old behavior will become obsolete. All that meaty data collected from mailing list headers stagnates indefinitely when a user unsubscribes from a list.
Hapaxes are likely to be present in significant numbers in any dataset, representing data that has been seen only once (or too few times for a real value to be assigned to it). Data that has appeared only a few times over the course of a month is unlikely to play any significant role in effectively filtering messages. Finally, uninteresting data is data that is unlikely to show up in a decision matrix. As we discussed in Chapter 4, most decision matrices used today are sorted based on a token’s distance from a neutral 0.5. This results in many tokens whose probability hovers around this neutral value never showing up in a decision matrix. Purging data with probabilities between 0.3 and 0.7 is a generally acceptable method of reducing the size of the dataset.

The important thing to consider in purging all of these types of data is that we want to allow enough time for the data to actually establish some type of value. On very large systems with high-volume users, this may be only a few days. The general philosophy is that if the data isn’t interesting after a certain period of time, not enough messages are coming in for the data to be useful, and it will appear in only a very small number of decisions.
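To make these criteria concrete, here is a minimal sketch of the kind of test a nightly purge job might apply to each token record. The structure fields, time windows, and thresholds are illustrative assumptions, not values taken from any particular filter.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical token record; field names are illustrative only. */
struct token_rec {
    unsigned long spam_hits;      /* times seen in spam            */
    unsigned long innocent_hits;  /* times seen in legitimate mail */
    double probability;           /* spam probability, 0.0 - 1.0   */
    time_t last_hit;              /* last time the token was seen  */
};

#define STALE_SECS       (90L * 24 * 3600)  /* ~3 months with no hits    */
#define HAPAX_STALE_SECS (14L * 24 * 3600)  /* stale window for hapaxes  */

/* Return true if the nightly purge should drop this token. */
static bool should_purge(const struct token_rec *t, time_t now)
{
    unsigned long total = t->spam_hits + t->innocent_hits;
    time_t idle = now - t->last_hit;

    /* Hapaxes (seen only once or twice) that have gone stale. */
    if (total <= 2 && idle > HAPAX_STALE_SECS)
        return true;

    /* Any data unseen for a long period. */
    if (idle > STALE_SECS)
        return true;

    /* Mature but uninteresting data: probability hovering near 0.5. */
    if (total >= 10 && t->probability > 0.3 && t->probability < 0.7)
        return true;

    return false;
}
```

The maturity check (at least ten hits before the "uninteresting" rule applies) is how the sketch allows data enough time to establish some type of value before it is judged.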
Using Merged Global Datasets
The use of an exclusive global dataset is generally frowned upon because it diminishes accuracy. A merged global dataset, however, not only can provide the out-of-the-box functionality most users appreciate, but can also help keep the size of individual datasets down.

A merged global dataset is a feature used in some spam filters to provide a common set of training data to all users on the system while still allowing them to correct learning mistakes and maintain their own personal set of important tokens. The approach involves the creation of a global dataset in much the same way that one would be created for an exclusive global system. This global dataset should be populated enough to cover a wide range of data for many users, but should also be generic enough to avoid a significant number of false positives. The global dataset then provides a training base for the user and is merged with the user’s personalized training data at run time. As the filter begins to learn the specifics of a user’s behavior, the user’s data reflects only the necessary corrections (diffs) to the global data. When the data is loaded into memory, the individual user’s data adjusts the value of the global data and yields a customized dataset for that user. The amount of training required by the system’s users in a merged group is directly related to the quality of the global database: the better the global database, the less disk space each user will need to correct it.

In a situation like this, all users benefit. Users who experience almost no classification errors continue to depend more on the global dataset without building much of their own training data. Users who experience a significant number of errors build more data to offset the global data, but this is still trivial compared to the amount of training that would be necessary had the user been required to build a dataset from scratch. Merged groups don’t usually leave the user with long-term levels of accuracy as high as those of users who have built their own dataset from scratch, but they are usually close enough to make it a feasible solution in light of the disk space savings.

The ultimate goal of a merged global approach is to place as much common data as possible in the global space, so that it needs to take up disk space only once. A 100 MB global dataset with 5 MB training sets for 100,000 users will require only about 488 GB of storage, which is much more feasible than 2.5 TB and will still yield very accurate results in the range of around 99.9 percent.
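The run-time merge itself can be very simple. The sketch below shows one way the user's diff record might adjust the global record when the data is loaded into memory; the structure and field names are hypothetical.

```c
/* Illustrative counts for a single token. */
struct counts {
    long spam_hits;
    long innocent_hits;
};

/* The user's record stores only corrections (diffs) to the global data,
 * so its fields may be negative.  The merged view is what the classifier
 * actually uses for this user. */
static struct counts merge_counts(const struct counts *global,
                                  const struct counts *user_diff)
{
    struct counts merged;

    merged.spam_hits     = global->spam_hits     + user_diff->spam_hits;
    merged.innocent_hits = global->innocent_hits + user_diff->innocent_hits;

    /* A correction can cancel global data but never go negative. */
    if (merged.spam_hits < 0)     merged.spam_hits = 0;
    if (merged.innocent_hits < 0) merged.innocent_hits = 0;

    return merged;
}
```

Because most users' diffs stay small, the per-user storage grows only with the number of corrections, not with the full vocabulary of their mail.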
Implementing Global Ingress
Another approach, which can be layered on top of merged groups to improve their accuracy, is to maintain a tentative global ingress. A global ingress keeps track of the number of unique users whose filters discover new and useful data. This data can be used to season a global database with the new and interesting data that is found. If a token remains interesting after being merged with the data from several other users, it is clearly a very specific and useful piece of data. After inserting this data into the global dataset, we can then delete it from all of the users’ individual datasets, freeing up space. This converts the storage problem from a linear one into one based on the contextual similarities among users.
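A minimal sketch of the promotion decision follows. The threshold, structure, and function names are illustrative, and the bookkeeping needed to make sure each user is counted only once is omitted.

```c
#include <stdbool.h>

/* Hypothetical ingress record tracking how many distinct users have
 * reported a given token as new and interesting.  Deduplicating users
 * (so the same user isn't counted twice) is left out of this sketch. */
struct ingress_rec {
    unsigned long unique_users;
};

#define INGRESS_THRESHOLD 25   /* illustrative promotion threshold */

/* Called when a user's filter reports an interesting token.  Returns
 * true when the token should be promoted into the global dataset,
 * after which it can be deleted from the individual user datasets. */
static bool ingress_report(struct ingress_rec *rec)
{
    rec->unique_users++;
    return rec->unique_users >= INGRESS_THRESHOLD;
}
```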
Total Processing Power
Processing power, usually measured in terms of CPU and memory resources, is generally thought to be much more expensive than disk space.

Many different variables play a role in the amount of processing power required to support a large number of users, and there are always ways to further optimize the software to run with a lighter footprint. Some factors to consider include the following:
Choice of programming language
Sorting and other algorithms used
Design paradigm and process instantiation
Choice of Language
The language that the filter is written in can play a significant role in how fast the application starts up and executes, and could ultimately mean the difference between supporting 5,000 users per machine and supporting 20,000.

There are two general types of languages to choose from: compiled languages and interpreted languages. Compiled languages such as C and C++ compile the source code into assembly language, from which it is assembled directly into machine code. Compiled programs are run as executables by the operating system and generally run much faster than interpreted programs, with a much lighter footprint as well. Interpreted languages, instead of compiling the source into machine code, compile it into a proprietary bytecode that is then interpreted at execution time. This category also includes just-in-time compiled languages such as Java, as well as scripting languages such as Perl.
Interpreted languages aren’t actually executed by the operating system but rather are interpreted, evaluated, and then executed by an interpreter. Because interpreters are charged with the responsibility of evaluating the program and executing it themselves, the target program will usually have a slower execution time and require additional resources. The interpreters themselves carry a considerable amount of bootstrap, in that they must load their own language libraries and other objects at startup. If each process requires its own instance of an interpreter, each process will use up a certain amount of memory and processor resources over and above what the program itself requires. This could range from a few hundred kilobytes of memory and a few hundredths of a second of execution time to megabytes of memory and full seconds of execution time.

Both types of languages have their strengths. Interpreted languages provide additional safeguards that prevent one’s code from being compromised through buffer overflow vulnerabilities and other human programming errors, so development with an interpreted language can be faster and require less care. The opposing point of view is that programming in interpreted languages takes less discipline, creating bad habits along the way that will ultimately lead to poor code anyway. Compiled languages run faster because they are executable machine code, and they generally provide more granular control over the software and access to more low-level functions. Without the additional overhead of bootstrap or interpretation, the execution runs at processor speed.

It’s very possible to write a successful, scalable spam filter in either type of language. Some high-end spam filters today, such as Death2Spam, are written in interpreted languages and can handle reasonable loads. There are plenty of poorly written spam filters in both types of language, so the bottom line is that much depends on the skills of the programmer. Given two functionally identical programs, one compiled and the other interpreted, the compiled version will have a clear advantage over the interpreted one when implemented in a setting of 100,000 users. How much of an advantage this provides depends on the implementation; on very large-scale implementations (such as those for several hundred thousand users), it may mean the difference between purchasing 15 machines and purchasing 50.

If the goal of your project is to make the filter scalable enough for many large implementations, consider a compiled language for the added performance, but emphasize writing good code.
Sorting and Other Algorithms
It is entirely possible to write an application in BASIC that can outperform an assembly language program, depending on the techniques used in writing the program. Sorting algorithms and other types of looping and recursion can make the difference between an execution time of one-tenth of a second and one of several seconds.

Asymptotic analysis involves the study of the growth rate of algorithms. Two goals of this type of analysis are to determine the worst-case running time for an algorithm and to identify the threshold for optimal performance. Depending on the type of tokenizer being used, there will be a certain threshold in the number of data points for the best- and worst-case scenarios. These values can be used to help determine the best algorithms for the particular implementation.
All algorithms have their worst-case scenarios. An algorithm that performs exceptionally well with 50,000 data points may run much slower than another when there are only a few thousand. Asymptotic analysis helps identify the best sorting algorithm for a given situation. Since the number of tokens could potentially reach into the millions, it’s difficult to ascertain the worst-case scenario.

There are plenty of other algorithms to consider as well. The parsing and tokenizing of a message can itself take a fraction of a second, or several seconds. Avoiding unnecessary loops and finding alternative methods of parsing can greatly improve performance. Using profiling tools such as gprof can help to identify problem areas of code. Don’t guess what code is slow; know what code is slow and fix it.

The book Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein (MIT Press) is an excellent introduction to the basics of asymptotic analysis and also provides a volume full of existing algorithms to use in your applications. There is also a series by Donald Knuth titled The Art of Computer Programming, which consists of several books on understanding basic and advanced algorithms, with many examples. Either of these choices should provide excellent insight into the use and development of algorithms that perform optimally for a particular project.
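As a small illustration of how algorithm choice shows up in practice, the sketch below keeps an in-memory token table sorted so that each lookup is a binary search rather than a linear scan. The structure and function names are illustrative only.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative in-memory token table entry. */
struct token_entry {
    char token[64];
    double probability;
};

static int cmp_token(const void *a, const void *b)
{
    return strcmp(((const struct token_entry *)a)->token,
                  ((const struct token_entry *)b)->token);
}

/* Sort once after loading the user's data... */
void index_tokens(struct token_entry *table, size_t n)
{
    qsort(table, n, sizeof(*table), cmp_token);
}

/* ...then each lookup is O(log n) instead of an O(n) scan. */
struct token_entry *find_token(struct token_entry *table, size_t n,
                               const char *token)
{
    struct token_entry key;

    strncpy(key.token, token, sizeof(key.token) - 1);
    key.token[sizeof(key.token) - 1] = '\0';
    return bsearch(&key, table, n, sizeof(*table), cmp_token);
}
```

With a compiled program, profiling this kind of code path is straightforward: build with gcc -pg, run the filter against a representative sample of mail, and examine the resulting gmon.out with gprof to see where the time actually goes.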
Design Paradigm and Process Instantiation
The number of processes used in the solution, and how these processes communicate with other processes (such as a mail server), can affect performance as well as the licensing of the software. Most filters fall into one of two general paradigms. The first, a client/server model, involves running the filter as a daemon process and allowing clients (usually the mail server or a lightweight piece of middleware) to communicate via TCP or a domain socket. The other involves starting a new filter process for each message and passing data across a pipe.

Client/server designs running as a dedicated server process can generally improve efficiency by reducing the amount of bootstrap code that’s executed, caching configuration data and database connections, and performing housekeeping during idle periods to keep the data optimized for longer periods of time. These designs also allow for more seamless integration with mail servers using protocols such as the Local Mail Transfer Protocol (LMTP), which is designed for local mail system communication.
Instantiating a new filter process for each message directly from the mail server is another approach that yields acceptable results but may not scale quite as well as the client/server model. This approach involves the mail server starting a new filter process and then passing data to the process via a pipe. For large systems processing hundreds of messages per second, this can mean that over one hundred filter processes are running simultaneously.
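The following is a minimal sketch of the client/server approach: a daemon that performs its expensive bootstrap once and then accepts classification requests over a UNIX domain socket. The socket path is an assumption, and the request/verdict protocol is left out.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define SOCK_PATH "/var/run/filter.sock"   /* illustrative path */

int main(void)
{
    int listener, client;
    struct sockaddr_un addr;

    /* Expensive bootstrap (configuration, database connections, global
     * training data) happens once here, not once per message. */

    listener = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener < 0) { perror("socket"); return 1; }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
    unlink(SOCK_PATH);

    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 128) < 0) {
        perror("bind/listen");
        return 1;
    }

    for (;;) {
        client = accept(listener, NULL, NULL);
        if (client < 0)
            continue;
        /* Read the message from the client (mail server or middleware),
         * classify it, and write the verdict back -- omitted here. */
        close(client);
    }
}
```

The per-message approach needs no listener at all: the mail server simply forks the filter and writes the message to its standard input, at the cost of repeating the bootstrap for every message.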
Parallelization versus Serialization
Whether messages are processed serially or in parallel will make a noticeable difference in the performance of the application. Many filter implementations use some form of locking in order to prevent data from being corrupted by two different filter instances changing it at the same time. If the filter uses extensive locking, lock contention may not only result in processes having to queue, but could create additional delays in overhead while waiting on locks or executing a back-off protocol. Designing the filter to use as little locking as possible while still maintaining data integrity will help avoid these situations. Parallelizing processes so that multiple messages can be processed simultaneously will greatly improve performance. This can be done by partitioning the data so that each user has their own set of locks, or by implementing storage mechanisms, such as a SQL-based server, that perform their own row-level locking (locking only the data that is being changed).

One of the common problems with serialization is the amount of delay in process execution, which can lead to a significant number of processes waiting on the system. If the number of processes exceeds the system’s capabilities, the result may be unpredictable. If each message takes half a second of real time to process, a serialized system will not be able to sustain more than two inbound messages per second. Wherever possible, implement efficient locking to allow multiple filter instances to run simultaneously.
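One simple way to partition locks by user, sketched below, is to hash the user ID into a fixed pool of mutexes so that two filter threads contend only when they touch data in the same partition. The partition count and function names are illustrative.

```c
#include <pthread.h>
#include <stdint.h>

#define LOCK_PARTITIONS 256   /* illustrative partition count */

static pthread_mutex_t partition_lock[LOCK_PARTITIONS];

/* Initialize the lock pool once at startup. */
void locks_init(void)
{
    for (int i = 0; i < LOCK_PARTITIONS; i++)
        pthread_mutex_init(&partition_lock[i], NULL);
}

/* Map a user ID to its partition's lock. */
static pthread_mutex_t *user_lock(uint32_t uid)
{
    return &partition_lock[uid % LOCK_PARTITIONS];
}

void update_user_data(uint32_t uid /* , token deltas, etc. */)
{
    pthread_mutex_lock(user_lock(uid));
    /* ... read, modify, and write this user's dataset ... */
    pthread_mutex_unlock(user_lock(uid));
}
```

Row-level locking in a SQL-based server achieves the same effect without any application-level lock management, which is one reason such back ends scale well here.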
Operating System Requirements
Companies have different requirements for the operating systems they support. As we touched on in the last chapter, the choice of storage driver implementation and other variables can directly affect which operating systems the software will compile on. Other considerations to take into account are any optimizations in the code for a particular operating system and the use of any additional dependencies that may be available only for specific operating systems. It’s generally considered good practice to write software that follows a particular set of portability standards, for example, the POSIX interfaces. Taking advantage of operating-system-specific functions, such as low-level kernel calls and proprietary libraries (such as the Solaris thread library), can provide added optimization but can also make the software more proprietary. If you plan on adding these features to your software, it’s a good idea to code them in such a way that they can be disabled or replaced with more generic optimizations when a different operating system is being used.
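One common pattern for this, sketched below, is to guard the operating-system-specific code behind preprocessor conditionals so that a portable fallback is compiled everywhere else. The USE_SOLARIS_THREADS macro is a hypothetical build option; __sun is the standard predefined macro on Solaris.

```c
/* Wrap OS-specific optimizations so they can be disabled or replaced
 * with a portable fallback at compile time. */
#if defined(__sun) && defined(USE_SOLARIS_THREADS)
#include <thread.h>              /* Solaris native thread library */
typedef thread_t filter_thread_t;
#else
#include <pthread.h>             /* POSIX fallback for other systems */
typedef pthread_t filter_thread_t;
#endif
```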
High Availability
Disaster recovery depends largely on the operating system, but several routines can be coded into the software to ensure that it is tolerant of interruptions and failures.

Creating a diagram of the process flow for your project can help identify all of the different components that could potentially fail. In the case of a server-side filter, there are many potential points of failure, including:
The mail server
Delivery agents
Back-end database components (if any)
The file system (full, read-only, or corrupt)
Remote sockets (if using LMTP or similar protocols)
A good server-side filter will have a fail-over mode that attempts to deliver the message or otherwise recover if any internal component fails. If an external component fails, it’s important to return the correct error codes so that the mail server can requeue the message for later processing and delivery; a sketch of this fail-over logic follows the list of layers below. Transactional processing is also important in a large-scale environment. If your filter processes part of a message and writes changes to disk, but a failure then causes the process to fail as a whole, those changes must be rolled back so that they aren’t applied again when the system attempts to redeliver the message. SQL-based storage solutions do a great job of providing transactional support.

Mapping out a worst-case scenario via a flowchart is one of the best ways to establish all of the possible points of failure for the project; a subroutine can then be written to handle potential failures, roll back transactions, and perform any other cleanup that may be necessary. The following layers cover the process flow for most implementations:

Internal layer
The internal components of the filter software itself.
Dependency layer
The external components that the filter software directly relies upon for processing, such as a MySQL database.
Service layer
The different services communicating with the filter, including the mail server, delivery agents, virus scanner, and any other pieces that directly touch the software.
Device layer
The different devices communicating with the filtering server and other components related directly to filtering.
Network layer
The potentially failing network paths, which could result in timeouts or connection failures.
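Below is a minimal sketch of the fail-over and requeue behavior described above, assuming the filter runs in the delivery path and returns the conventional exit codes from sysexits.h so the mail server can requeue on temporary failure. The helper functions are placeholders.

```c
#include <stdio.h>
#include <sysexits.h>   /* EX_OK, EX_TEMPFAIL: conventional exit codes */

/* Placeholder stubs; a real filter would classify and deliver here. */
static int classify_and_deliver(const char *msgfile) { (void)msgfile; return -1; }
static int deliver_unfiltered(const char *msgfile)   { (void)msgfile; return -1; }

int main(int argc, char **argv)
{
    const char *msgfile = (argc > 1) ? argv[1] : "-";

    if (classify_and_deliver(msgfile) == 0)
        return EX_OK;

    /* Internal failure: fail over and try to deliver the message
     * unfiltered rather than lose it. */
    if (deliver_unfiltered(msgfile) == 0)
        return EX_OK;

    /* External component still failing: return a temporary-failure
     * code so the mail server requeues the message and retries later. */
    fprintf(stderr, "filter: delivery failed, requesting requeue\n");
    return EX_TEMPFAIL;
}
```

Any partial writes made before the failure would also need to be rolled back at this point, which is where transactional storage earns its keep.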
I/O Bandwidth Requirements
The amount of bandwidth needed to communicate with the storage mechanism or other parts of the system (such as any pass-through components) can affect the decision as to whether a distributed cluster of servers can be configured to manage spam or whether it will be necessary to design a storage area network to handle the load of the software. Many factors should be taken into consideration, such as the I/O needed to load and store training data. Using SQL-based solutions again appears to be the best way to optimize this type of I/O. Designing the data model and writing queries so that only one or two queries are necessary to perform certain tasks can help reduce the amount of bandwidth required.

Additionally, the training mode used will play a key role in the amount of bandwidth required, as well as in I/O contention issues. The train-everything (TEFT) mode of training requires the greatest number of simultaneous reads and writes to a system, while the train-on-error (TOE) mode requires the fewest. A majority of SQL-based servers provide support for caching commonly used information so that it is immediately available for future queries. Most large-scale implementations use the TOE or train-until-mature (TUM) training modes to compensate for I/O issues.
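As an example of the one-query-instead-of-many approach, the sketch below fetches the counts for every token in a message with a single SELECT using the MySQL C API. The table layout, column names, and the assumption that the caller supplies a pre-escaped token list are all illustrative.

```c
#include <mysql/mysql.h>   /* MySQL C API; link with -lmysqlclient */
#include <stdio.h>

/* Fetch the counts for every token of a message in one round trip
 * instead of issuing one query per token. */
static int load_counts(MYSQL *db, const char *uid, const char *token_list)
{
    char query[4096];
    MYSQL_RES *res;
    MYSQL_ROW row;

    /* token_list is assumed to be a pre-escaped, comma-separated list
     * of quoted tokens, e.g. "'viagra','mortgage','unsubscribe'". */
    snprintf(query, sizeof(query),
             "SELECT token, spam_hits, innocent_hits "
             "FROM tokens WHERE uid = '%s' AND token IN (%s)",
             uid, token_list);

    if (mysql_query(db, query) != 0) {
        fprintf(stderr, "query failed: %s\n", mysql_error(db));
        return -1;
    }

    res = mysql_store_result(db);
    if (res == NULL)
        return -1;

    while ((row = mysql_fetch_row(res)) != NULL) {
        /* row[0] = token, row[1] = spam_hits, row[2] = innocent_hits */
    }

    mysql_free_result(res);
    return 0;
}
```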
Note: Implementing RAID devices and Fibre Channel controllers can help increase the amount of bandwidth available to an application, as well as help avoid I/O contention by adding more disk heads.
Finally, the operating system can play a large role in the amount of I/O contention on the system. For example, the maintainers of the Linux 2.6 kernel recently overhauled many of the disk access components, resulting in less back-and-forth movement of the drive head and better performance. Finding a good systems administrator to implement a large-scale solution can make it much easier to support hundreds of thousands of users without having to make further coding optimizations or leave out features that would otherwise be useful. Identifying the right operating system to deploy the solution on is equally valuable to the fine-tuned performance of a large-scale filter.
Features
In order for an application to be truly scalable, it must be able to support a diverse set of feature requirements. Support for multiple training modes, storage interfaces, and other types of features can make the difference between a company choosing one solution over another. When we first code our projects, we usually design them to solve our own problems. To counteract this urge, visit some customer sites or discuss your project with others in the field to analyze what areas your project should address.
Required features may end up having less to do with individual training modes and more to do with the different bells and whistles available on a particular piece of software. Some companies want the ability to customize their software to meet new needs as they arise, while others require notifications and reports to be generated automatically and certain types of intelligent handling of particular events to be configurable.
End-User Support
The amount of support that the company implementing the software will need to provide to its own customers or end users also plays a role in just how scalable the application is. There’s no point in investing to save $50,000 of lost bandwidth every month if the company is also going to have to spend $50,000 to provide help to the filter’s users.

A filter should be designed so that grandma can use it and understand it fairly quickly. Chances are, a large corporation will develop its own documentation to teach its employees and customers how to use the software, but having some prewritten generic documentation can expedite this process. Smaller businesses may have to depend on the documentation you provide. The software should be well documented but should also be easy enough to explain in a two-minute conversation. Ultimately, the software should be simple enough for even the most computer-illiterate person to use.

In addition, the filter should work with as many mail clients as possible. Not all mail clients support features such as bouncing or redirecting a message. Other mail clients may not support the client-side plug-ins that may be required to use the software. The design of the software must be universal enough for everyone to use but should have enough features to do its job well and require little support.