8.2 Configuration Issues
To route the connection to a server, the load balancer must select a target server. To do this, it takes two pieces of information into account. First, it needs to know which servers are available. At any time, one or more of the backend servers can be offline (for maintenance, as the result of a crash, etc.). To keep track of the servers, the load balancer periodically checks each one's health.

Once the load balancer has a list of candidate servers, it must decide which should get the next connection. This process can take a number of factors into account, including past performance, load, client address, and so on. Let's look at both issues in more detail.
8.2.1 Health Checks
Load balancers need to perform a health check for each real server to ensure that it's still alive, well, and willing to accept new connections. When load-balancing a web server, this is often a trivial matter. The load balancer is configured to connect to TCP port 80 and request a status file such as /health. If the load balancer gets a 2xx response code back, it assumes the server is fine. If not, it may stop sending new requests to the server until it becomes healthy again.

A nice side benefit of asking for a specific file, rather than simply looking for any response on port 80, is that a server can be removed from the cluster without taking it offline: simply remove or rename the file.

Most load balancers provide a great deal of control over the parameters used when testing cluster hosts. Options may include the frequency of checks, the duration of check timeouts, and the number of unhealthy responses required to remove a server from the cluster. See your load balancer's documentation for details.
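To show how these parameters typically interact, here is a minimal sketch of a checker loop, written in Python; it isn't any particular product's implementation, and the /health path, interval, timeout, and failure threshold are illustrative values you would replace with your own:

    import time
    import urllib.request
    from urllib.error import URLError

    # Illustrative values -- real load balancers expose these as settings.
    CHECK_INTERVAL = 5   # seconds between checks
    CHECK_TIMEOUT = 2    # seconds to wait for a response
    FALL_COUNT = 3       # consecutive failures before removing a server

    def is_healthy(host):
        """One check: request the status file and require a 2xx response."""
        try:
            resp = urllib.request.urlopen("http://%s/health" % host,
                                          timeout=CHECK_TIMEOUT)
            return 200 <= resp.status < 300
        except (URLError, OSError):
            return False

    def monitor(host):
        failures = 0
        in_pool = True
        while True:
            if is_healthy(host):
                failures, in_pool = 0, True
            else:
                failures += 1
                if failures >= FALL_COUNT:
                    # A real load balancer would stop routing new
                    # connections to this server at this point.
                    in_pool = False
            time.sleep(CHECK_INTERVAL)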
8.2.1.1 Determining health
So what constitutes a good health check for MySQL? Unfortunately, there's no single answer to that question.

It depends on how sophisticated your load balancer is. Some load balancers can verify only that each server is responding on the necessary TCP port. They'll generally connect to TCP port 3306 (or whichever port you're using) and assume the server is unhealthy if the connection is refused or if it has to wait too long for a response.

Some load balancers are more flexible. They might give you the option of scripting a complicated health check or of running the health check against a different port than normal. This provides a lot of flexibility and control. For example, you can run a web server (such as Apache) on the server and configure the load balancer to check a status file, just as you would for standard HTTP load balancing. You can exploit this indirect kind of check by making the status file a script (PHP, Perl, etc.) or Java servlet that performs arbitrarily complex logic to decide whether the server is really healthy.[2] The arbitrarily complex logic can be as simple as running a SELECT 1 query, or as complicated as parsing the output of SHOW SLAVE STATUS to verify that the slave is reasonably up to date.

[2] Provided, of course, that the arbitrarily complex logic doesn't take arbitrarily long to execute. The load balancer won't wait forever.

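As an illustration, here is a minimal sketch of such a status script, written in Python rather than the PHP, Perl, or servlet examples just mentioned. The PyMySQL client library, the connection credentials, and the 30-second lag threshold are assumptions you would replace with your own:

    #!/usr/bin/env python3
    # Minimal CGI-style health check: reports 200 if the local MySQL server
    # answers a trivial query and replication lag is acceptable, 503 otherwise.
    # PyMySQL, the credentials, and the threshold are illustrative assumptions.
    import sys
    import pymysql

    MAX_LAG_SECONDS = 30  # assumed threshold; tune it for your application

    def mysql_is_healthy():
        try:
            conn = pymysql.connect(host="127.0.0.1", user="healthcheck",
                                   password="secret", connect_timeout=2,
                                   cursorclass=pymysql.cursors.DictCursor)
            with conn.cursor() as cur:
                cur.execute("SELECT 1")           # is the server answering at all?
                cur.execute("SHOW SLAVE STATUS")  # empty result set on a master
                row = cur.fetchone()
                if row is not None:
                    lag = row.get("Seconds_Behind_Master")
                    if lag is None or lag > MAX_LAG_SECONDS:
                        return False              # replication stopped or too far behind
            conn.close()
            return True
        except pymysql.MySQLError:
            return False

    if __name__ == "__main__":
        if mysql_is_healthy():
            sys.stdout.write("Status: 200 OK\r\nContent-Type: text/plain\r\n\r\nhealthy\n")
        else:
            sys.stdout.write("Status: 503 Service Unavailable\r\n"
                             "Content-Type: text/plain\r\n\r\nunhealthy\n")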
If your load balancer offers this degree of flexibility, we highly
recommend taking advantage of it. By taking control over the
decision-making process, you'll have a better idea
of how your cluster will respond in various situations. And after
testing, if you're not happy with the results,
simply adjust the logic and try again.

What types of things might you check for? This goes back to the
question we're trying to answer: what makes a
healthy MySQL server, from the load balancer's point
of view?

A good health check also depends on your application needs and
what's most important. For example, on a nearly
real-time dynamic web site like Yahoo! News, you might put more
emphasis on replication. If a slave gets busy enough handling regular
queries that it becomes sluggish and ends up more than 30 seconds
behind on replication, your code can return an unhealthy status code.
The load balancer then removes the slave from the list of available
servers until the health check passes again. Presumably the reduced
demand on the server will allow it to quickly catch up and rejoin the
cluster. (See the "Monitoring"
section in Chapter 7 for ideas about detecting
slow slaves.)

Of course, the success of this algorithm depends on how smart your
scripts are. What if the slow server doesn't get
caught up? And what if the additional demand that the remaining
servers must bear causes them to fall behind?
There's a very real chance that one by one,
they'll start deciding they too are unhealthy.
Before long, the problem cascades until you're left
with a cluster of unhealthy servers sitting behind a load balancer
that doesn't know where to send connections anymore.

At Yahoo! Finance, we've seen individual servers
that try to be smart and end up creating even bigger problems because
they didn't have the whole picture. Anticipating the
problem mentioned in the previous paragraph, the code that performed
health checks introduced yet another level of checking. Each server
knew all the other members of the cluster. The health check included
code to make sure that there were enough servers left. If a server
determined that too many other servers were already down, it would
elect to keep handling requests. After all, a slow site is better
than no site at all.

But our implementation still wasn't smart enough;
the servers still went down in a cascade. The reason turned out to be
a simple race condition. The code performed a series of checks, but
it did them in the wrong order. The code first checked to see that a
sufficient number of other servers were healthy. It then went on to
make sure MySQL wasn't too far behind on
replication. The problem was that several servers could be doing the
health check at exactly the same time. If that happened, it was
possible for all servers to believe that all other servers were
healthy and proceed to declare themselves unhealthy.

There are numerous solutions to the problem. One is to add a simple
sanity check. Each server can, after declaring itself unhealthy,
check to make sure that the situation hasn't
radically changed. Another option is to appoint a single server in
each cluster as the authority for determining who is and
isn't healthy. While it introduces a single point of
failure (what if this server dies?), it means there are fewer chances
for race conditions and similar problems.

To summarize, some load balancers provide you with a lot of
flexibility and power. Be careful how you use it. If you elect to
take control of the decision-making process (and add complexity to
it), be sure that the code is well tested. Ask a few peers to review
it for you. Consider what will happen in unusual situations.
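For illustration only, here is one way the safeguards described above might fit together, with the server's own replication state examined before the cluster-wide sanity check; the helper arguments and thresholds are hypothetical stand-ins for whatever your environment provides:

    # Hypothetical sketch of the safeguards described above.
    MAX_LAG_SECONDS = 30   # assumed replication-lag threshold
    MIN_IN_POOL = 3        # assumed minimum number of servers that must keep serving

    def health_check(my_replication_lag, healthy_peer_count):
        # Check this server's own state first.
        if my_replication_lag <= MAX_LAG_SECONDS:
            return "healthy"
        # We'd like to withdraw, but only if enough other servers remain
        # to carry the load; otherwise a slow site beats no site at all.
        if healthy_peer_count < MIN_IN_POOL:
            return "healthy"
        return "unhealthy"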
8.2.1.2 Connection limits
In normal operations, the load balancer should distribute connections relatively evenly among your servers. If you have eight backend servers, any one of them will handle roughly one eighth of the connections at a given time. But what happens when several backend servers go down at the same time? Because the rest of the cluster must bear the load, you need to ensure that those servers are configured to handle it.

The most important setting to check is max_connections. If max_connections is set too low, otherwise healthy MySQL servers start refusing connections even if they're powerful enough to handle the load. Many installations don't set the max_connections option, so MySQL uses its built-in default of 100. To be safe, set max_connections high enough that this problem can't happen. For example, if you find that each server typically handles 75 connections, a reasonable value for max_connections might be 150 or more. That way, even if half the backend servers fail, your application will still be able to connect.
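For example, the setting normally lives in the [mysqld] section of the server's option file; the value shown here is only the sort of headroom just described, not a recommendation:

    # my.cnf excerpt -- 200 is an illustrative value; size it from the
    # connection counts you actually observe per server.
    [mysqld]
    max_connections = 200

On servers that support dynamic variables, you can also raise it at runtime with SET GLOBAL max_connections, which is handy when you can't restart MySQL in the middle of an incident.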
8.2.2 Next-Connection Algorithms
Different load balancers implement
different algorithms to decide which server should receive the next
connection. Some call these scheduling algorithms. Each vendor has
different terminology, but this list should provide an idea of
what's available:

Random
Each request is directed to a backend server selected at random from
the pool of available servers.
Round-robin
Requests are sent to servers in a repeating sequence: A, B, C, A, B,
C, etc.
Least connections
The next connection goes to the server with the fewest active
connections.
Fastest response
The server that has been handling requests the fastest receives the
next connection. This tends to work well when the backend servers are
a mix of fast and slow machines.
Hashed
The source IP address of the connection is hashed, thereby mapping it
to one of the backend servers. Each time a connection request comes
from the same client IP address, it is sent to the same backend
server. The bindings change only when the number of machines in the cluster does. (A brief sketch of this mapping follows the list.)
Weighted
Several of the other algorithms can be weighted. For example, you may
have four single-CPU machines and four dual-CPU machines. The
dual-CPU machines are roughly twice as powerful as the single-CPU
machines, so you tell the load balancer to send them twice as many
requests, on average.
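To make the hashed algorithm concrete, here is a minimal sketch in Python; the backend list and the choice of MD5 are arbitrary stand-ins for whatever your load balancer actually uses:

    import hashlib

    # Illustrative backend list; in a real load balancer this would be the
    # pool of servers currently passing their health checks.
    BACKENDS = ["db1", "db2", "db3", "db4"]

    def pick_backend(client_ip):
        # Hash the source address and map it onto the list of available servers.
        # The same client reaches the same backend until the list changes.
        digest = hashlib.md5(client_ip.encode()).hexdigest()
        return BACKENDS[int(digest, 16) % len(BACKENDS)]

    print(pick_backend("10.1.2.3"))   # always the same server for this client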
Which algorithm is right for MySQL? Again, it depends. There are
several factors to consider and some pitfalls to avoid. One of the
pitfalls is best illustrated with an example.
8.2.2.1 The consequences of poor algorithm choice
In September 2002, Yahoo! launched a one-week memorial site for those
affected by the September 11, 2001 terrorist attacks. This site was
described in Chapter 6. The
remember.yahoo.com site was heavily promoted on
the Yahoo! home page and elsewhere. The entire site was built by a
small group of Yahoo! employees in the two weeks before the
site's launch on September 9.

Needless to say, the site got a lot of traffic. So much, in fact,
that Jeremy spent a couple of sleepless nights working to optimize
the SQL queries and bring new MySQL servers online to handle the
load. During that time the MySQL servers were running red hot. They
weren't handling many queries per second (because
they were poorly optimized), but they were either disk-bound,
CPU-bound, or both. A server was slowest when it first came online
because MySQL's key buffer hadn't
yet been populated, and the operating system's disk
cache didn't have any of the relevant disk blocks
cached. They needed several minutes to warm up before taking their
full query load.

The situation was made worse by the fact that the load balancer
hadn't been configured with this in mind, and nobody
realized it until very late in the process. When a server was
reconfigured and brought back online, it was immediately pounded with
30-50 new queries. The machine became completely saturated and needed
several minutes to recover. During the recovery time, it was nearly
unresponsive, with the CPU at 100%, a load average over 25, and the
disk nearly maxed out.

After quite a bit of theorizing and poking around, someone thought to
question the load-balancer configuration. It turned out that it was
set on a least-connections scheduling algorithm. That clearly
explained why a new machine was bombarded with new connections and
rendered useless for several minutes. Once the load balancer was switched to a random scheduling algorithm, it became much easier to bring down a slave, adjust the configuration, and put it back online without it becoming completely overwhelmed.

The moral of the story is that the connection algorithm you select
may come back to bite you when you least expect it (and can least
afford it). Consider how your algorithm will work in day-to-day
operations as well as when you're under an unusually
high load or have a higher than normal number of backend servers
offline for some reason.

We can't recommend the right configuration for your
needs. You need to think about what will work best for your hardware,
network, and applications. Furthermore, your algorithm choices are
limited by the load balancing hardware or software
you're using. When in doubt, test.