Building a Distributed Model
In many large-scale implementations, it makes sense to build a distributed cluster of machines capable of performing independently of one another in the event of fail-over. It’s also sometimes necessary to run the spam-filtering software on a machine other than the mail server itself. There are countless hodgepodges of distributed networks powering the Internet on eggshells, but only a few that can provide reliable, incremental scaling. Depending on how the target network is already configured, some implementations may work better than others. This section covers some of the most common distributed networking approaches for spam-filtering installations.
Round-Robin Distributed Networking
A round-robin distributed network comprises a series of nodes capable of servicing any user on the system (see Figure 9-1). The term “round-robin” is used to denote a DNS round-robin or another type of implementation (possibly using a layer-4 switch) that is capable of directing successive requests to different machines in rotation.
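The rotation itself can be sketched in a few lines. This is only an illustration of the idea, not how a DNS server or layer-4 switch is implemented; the host names and the `make_dispatcher` helper are hypothetical.

```python
from itertools import cycle

# Hypothetical filter nodes; any node can service any user.
FILTER_NODES = [
    "filter1.example.com",
    "filter2.example.com",
    "filter3.example.com",
]

def make_dispatcher(nodes):
    """Return a function that hands out nodes in strict rotation."""
    rotation = cycle(nodes)

    def next_node():
        return next(rotation)

    return next_node

next_node = make_dispatcher(FILTER_NODES)

# Five inbound messages are spread across the three nodes in turn,
# wrapping back to the first node after the last one.
assigned = [next_node() for _ in range(5)]
```

Because consecutive messages for the same user land on different nodes, any per-user data has to live somewhere all of the nodes can reach.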

Figure 9-1: Round-robin configuration with single storage node
The round-robin approach is commonly used to balance other types of network traffic, such as web traffic. In a spam-filtering environment, this approach requires that one or more static systems reside behind the round-robin servers to store user data. Otherwise, users would experience a different result depending on which machine the inbound message hit. Systems employing a static global dataset don’t need to worry about this extra layer.
In configurations like this, the bottleneck is usually the per-user storage system. The more nodes and users there are on the network, the higher the load on the storage facility. For this reason, some like to take a modified approach to distributed networking by installing the primary filter software on each node and clustering an independent set of storage devices together. The storage devices are generally kept independent of one another to reduce the amount of disk space and load for each device.
One way to do this is to split up the storage devices based on user-id or username. In Figure 9-2, we see the same series of nodes in a round-robin filter cluster, but the storage is now distributed based on the first letter of the username. This approach works reasonably well in a setting where the number of users grows slowly, but it has a drawback in systems that grow faster. In order to add more storage nodes, the naming scheme for the distributed storage must be changed, which results in a great deal of work.
A much better alternative for systems that grow rapidly is to use the numeric user-id to distribute the storage cluster. As more users are added to a network, new user-ids either extend the range upward or reuse the ids of older users who are no longer on the system. As new ranges of user-ids come into use, it takes much less work to configure a fourth or fifth storage machine and make a minor update to the filtering software installed on each node.

Figure 9-2: Round-robin configuration with distributed storage
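The user-id scheme can be sketched as a small range table that each filter node consults. This is a minimal sketch under assumed values: the range boundaries, host names, and the `storage_for_uid` helper are hypothetical and would differ in a real installation.

```python
from bisect import bisect_right

# Hypothetical layout: each entry is (first user-id served, storage host).
# Entries must be sorted by the starting user-id.
STORAGE_RANGES = [
    (0, "storage1.example.com"),
    (10000, "storage2.example.com"),
    (20000, "storage3.example.com"),
]

def storage_for_uid(uid, ranges=STORAGE_RANGES):
    """Return the storage host whose range contains the numeric user-id."""
    if uid < ranges[0][0]:
        raise ValueError(f"user-id {uid} is below the first configured range")
    starts = [start for start, _host in ranges]
    # bisect_right finds the first range that starts after uid;
    # the range just before it is the one that contains uid.
    index = bisect_right(starts, uid) - 1
    return ranges[index][1]

# Adding a fourth storage machine is one new entry at the end of the table.
# Every existing user-id still resolves to the same host as before.
GROWN_RANGES = STORAGE_RANGES + [(30000, "storage4.example.com")]
```

With the original table, user-id 31007 falls into the last open-ended range on the third host; with the grown table it resolves to the fourth host, while ids below 30000 are unaffected. A first-letter scheme offers no such open-ended range to extend, which is why changing it means more work.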
Distributed BGP Networking
Nationwide networks have the burden of managing not only server resources but also internal bandwidth from leased lines or their own physical networks of dark fiber buried in cow pastures. It’s inefficient to retrieve data from a server in Virginia for a user who is primarily connected to Palo Alto, California.
A distributed border gateway protocol (BGP) network provides for an inverse-like filtering (and mail distribution) approach for networks with many points of presence (see Figure 9-3).

Figure 9-3: Distributed BGP networking
By placing a machine at every major point of presence (POP) and having that machine handle a specific subset of users who are connected to that location, systems administrators can have inbound mail routed to the local mail server at the POP and stored in the vicinity of the destination user. When the user connects to receive email, they point their mail client at an IP address that the network has ghosted across every major point of presence. The server that is closest to the user will, by default, answer the request and should have the user’s data located on the local storage facility at that POP.
If the user travels out of town, the individual storage servers (or the mail servers themselves) must be configured to communicate with one another for retrieval of the data. This is done by configuring each machine with an internal IP address of its own, while using the same public-facing IP address at each point of presence so that the user can move around without having to change mail settings. This commonly requires two network interfaces in each machine. BGP takes care of most of the issues with regard to ghosting IP addresses on a large network, so the user will end up hitting the server closest to them.
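The lookup a server performs when a request arrives can be sketched as follows. This is an illustration only, assuming a simple table that maps each user to a home POP; the addresses, POP names, user names, and the `locate_mailbox` helper are all hypothetical.

```python
# Hypothetical addressing: every POP answers on the same public-facing
# address, while each machine also has an internal address of its own.
SHARED_PUBLIC_IP = "192.0.2.10"
POP_INTERNAL_IP = {
    "palo-alto": "10.1.0.10",
    "virginia": "10.2.0.10",
}

# Hypothetical assignment of users to the POP that stores their mail.
USER_HOME_POP = {
    "alice": "palo-alto",
    "bob": "virginia",
}

def locate_mailbox(username, answering_pop):
    """Decide where the answering POP should read the user's data from.

    Returns ("local", None) when the data is stored at the answering POP,
    or ("remote", internal_ip) when it must be fetched from the home POP
    over the internal network.
    """
    home_pop = USER_HOME_POP[username]
    if home_pop == answering_pop:
        return ("local", None)
    return ("remote", POP_INTERNAL_IP[home_pop])
```

For example, `locate_mailbox("alice", "palo-alto")` is served locally, while the same user connecting through Virginia causes that server to reach Palo Alto at its internal address. The mail client only ever sees the shared public address.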
This approach has two obvious advantages: it keeps extraneous traffic from traveling over the network, and it distributes the filtering (and mail server) network based on location, providing a level of redundancy. If a main trunk from California to Virginia goes down, it should affect only users who want to check their mail from outside their local service area. If storage is being mirrored at the two closest locations to a user’s service area, then even the primary location failing wouldn’t result in a complete loss of service. Much of the success of this approach depends on the appropriate sizing of the machines to provide a level of fail-over capacity.
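The fail-over behavior and the sizing concern can be sketched together. This is a simplified sketch, not a monitoring or capacity-planning tool; the location names, the load figures, and both helper functions are hypothetical, and loads are assumed to be expressed in a single comparable unit such as messages per minute.

```python
def pick_storage(preferred_locations, is_up):
    """Return the first reachable location in preference order, or None."""
    for location in preferred_locations:
        if is_up(location):
            return location
    return None

def has_failover_headroom(capacity, own_load, mirrored_load):
    """Check that a location can absorb a failed neighbor's load as well
    as its own without exceeding its capacity."""
    return own_load + mirrored_load <= capacity

# Hypothetical mirror list for one user, nearest location first.
mirrors = ["palo-alto", "sacramento"]
down = {"palo-alto"}

# With the primary location down, the mirror is chosen instead.
chosen = pick_storage(mirrors, lambda location: location not in down)
```

A mirror only helps if it was sized for the extra work: a location with capacity 1000 carrying 600 of its own load can take over a neighbor carrying 300, but not one carrying 500.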