Performance Tuning for Linux Servers

Sandra K. Johnson

The Performance Analysis Exercise


The remainder of this chapter is devoted to the actual tuning exercise of the Trade3 application on our four-tier system configuration. Because the goal is to make sure that Trade3 can utilize the most precious resource, the CPU, in the three boxes (especially the application server box), the first thing we did was capture the initial performance data by performing a User Ramp Up test run. Table 22-1 shows our initial performance data, and a corresponding graph for throughput and response time is shown in Figure 22-6. CPU utilization has two numbers separated by a slash: the first is the portion for the user and the second is the portion for the system. We used the sar tool to get this information. A heuristic that we commonly use is that system usage should not be more than 20% of the total CPU utilization. We want to minimize system calls so that user code can do more work and thus increase the throughput; too much time spent in the system can indicate inefficiencies or problems in the kernel.

The Little's Law column is a convenient way to verify the validity of the response time and throughput on each run. The law says that the product of the throughput and the response time is equal to the number of concurrent users. The numbers are rounded off to the nearest whole number. In some instances, when the number of concurrent users is large, Little's Law might not give the exact number, but it should be very close.
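To make this concrete, the Little's Law column can be reproduced from the other two columns. The small example below (ours, not part of the original data) uses the 10-user row of Table 22-2, which appears later in this section:

// Sanity-check a test run against Little's Law: users = throughput x response time.
// The sample figures are taken from the 10-user row of Table 22-2.
public class LittlesLawCheck {
    public static void main(String[] args) {
        double throughputReqPerSec = 68.17;   // X: requests per second
        double avgResponseTimeSec  = 0.145;   // R: seconds per request
        long impliedUsers = Math.round(throughputReqPerSec * avgResponseTimeSec);
        System.out.println("Implied concurrent users: " + impliedUsers);  // prints 10
    }
}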

Figure 22-6. Initial User Ramp Up Graph

These numbers show that there is a serious performance problem. The throughput leveled off at 10 concurrent users (which also gave the highest throughput), at a response time of about 0.3 seconds, and the CPU utilization of the application server leveled off at the same point. The highest CPU utilization was only about 28% to 29%, which is a horrible number by anyone's standard. At 80 users, although the throughput is on par with the previous runs, the response time already exceeded the requirement of no more than 2 seconds. Thus, the acceptable number of concurrent users that this application server can handle is between 40 and 80, more likely between 65 and 70.

Figure 22-6 is a visual representation of the performance data. The figure shows where the saturation point is and at what number of users the performance remains acceptable. The immediate goal now is to find out why the system's saturation point is at a very low CPU utilization of the application server.

The Web Server


Although not shown in our table, we noticed that the CPU utilization on the web server tier is very low, about 3%. There are times when the system usage is even higher than the user usage, but this is really not much of an issue because the numbers are very low to begin with. So we hypothesize that the web server itself is the bottleneck and that it stays in a wait state too long or too many times (which would explain the low CPU utilization). This could cause a lower rate of requests being passed on to the application server.

This assumption is easy to verify. The plug-in passes the original HTTP request "as-is" to the application server. The application server has an HTTP transport layer that can accept full-fledged HTTP requests through port 9080. This means that we could skip the web server altogether and make the workload driver send the request to tethys:9080.
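As a minimal sketch of that verification (our illustration, not the workload driver itself), a single request can be sent straight to the application server's HTTP transport. The /trade context path is an assumption; substitute whatever URL the workload driver actually exercises:

// Send one request directly to the application server's HTTP transport on
// port 9080, bypassing the web server and plug-in. The path "/trade" is an
// assumed context root; adjust it to the URL the workload driver uses.
import java.net.HttpURLConnection;
import java.net.URL;

public class DirectTransportCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://tethys:9080/trade");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        System.out.println("HTTP " + conn.getResponseCode() + " "
                + conn.getResponseMessage());
        conn.disconnect();
    }
}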

The test was performed starting with 10 users to see if there was any improvement. The second set of performance data is shown in Table 22-2. Unfortunately, there is no significant difference from the original results: the throughput and response time are still within the same range, and the same is true for the application server CPU utilization. Thus, we can eliminate the web server as the source of the problem.

Table 22-2. Performance Data for Trade3 User Ramp Up Test Without the Web Server

Concurrent   Throughput   Avg Response   Application Server   Database Server   Little's
Users        (req/sec)    Time (sec)     CPU% (usr/sys)       CPU% (usr/sys)    Law
    10          68.17         0.145          52.34/7.11           12.62/2.22       10
    20          69.75         0.285          52.87/7.29           14.51/2.45       20
    40          69.23         0.576          54.75/7.92           15.85/2.29       40
    80          68.52         1.167          54.15/6.45           18.64/2.5        80

The Database Back End: A Bottleneck?


One way to view this problem is to investigate what causes the application server to "wait." From our knowledge of the application, a significant portion of the code calls the database back end, and the application uses container-managed entity beans extensively. It is possible that the application server is being blocked somehow by contention in the database, physical disks, or buffers, which then causes it to spend too much time waiting on the database.

There are two approaches to determine whether this is the case. The first is to get to the database server and collect some traces, either through the database debugging facilities or with Linux tools. We recommend the database debugging tools because they immediately provide information about locks and waits on the database; Linux tools require correlating system-level information back to the database, which might not be that straightforward. The second, and easier, approach is to look into the application server and check whether many of its threads are waiting on the database to respond. If so, we know that the database is the bottleneck.

Thread Dump


Examining the thread dump from the Java Virtual Machine is a useful way to detect where possible bottlenecks might be. The dump provides a snapshot of information on the current state of each thread in the JVM and the history of the threads' execution through stack traces. Possible states are as follows:

R for running, which means that the thread is actively executing (it does not necessarily mean that it is currently on a CPU, but it is at least runnable in the kernel scheduler's runqueue).

CW for conditional wait, which means that the Java thread has explicitly invoked the wait() method on an object and that it is waiting to be notified. Thus, if the notification does not arrive soon, the thread will wait a long time and might cause a bottleneck if other threads are depending on this thread to run before they can proceed. Otherwise, it is not that harmful.

MW for monitor wait, which means that the thread is queued on the monitor of a Java object. Remember that every Java object has its own monitor. A thread gets into this state if it tries to enter a synchronized method or block while another thread already holds the "right" to execute that synchronized method or block. Until the owning thread leaves the synchronized method or block, no waiting thread can proceed. Thus, this state is the most likely cause of bottlenecks (see the sketch after this list).
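The following small program (ours, not taken from Trade3) illustrates how threads end up in these states: the monitor-waiter thread would appear as MW in a thread dump because it is queued on a monitor held by another thread, and the condition-waiter would appear as CW because it has called wait():

// Illustrates the MW and CW states described above (not from the case study).
// A thread blocked entering a synchronized block appears as MW in a thread
// dump; a thread parked in Object.wait() appears as CW.
public class WaitStatesDemo {
    private static final Object lock = new Object();
    private static final Object condition = new Object();

    public static void main(String[] args) throws Exception {
        // Holder thread grabs the monitor and keeps it for a minute.
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                sleepQuietly(60_000);
            }
        }, "holder");

        // Queued on the monitor held by "holder" -> MW in a thread dump.
        Thread monitorWaiter = new Thread(() -> {
            synchronized (lock) {
                // Not reached until "holder" releases the monitor.
            }
        }, "monitor-waiter");

        // Parked in wait() until notified -> CW in a thread dump.
        Thread conditionWaiter = new Thread(() -> {
            synchronized (condition) {
                try {
                    condition.wait();
                } catch (InterruptedException ignored) {
                }
            }
        }, "condition-waiter");

        holder.start();
        Thread.sleep(100);      // let "holder" acquire the monitor first
        monitorWaiter.start();
        conditionWaiter.start();
        // Take a thread dump now (kill -3 <pid>) to observe the states.
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ignored) {
        }
    }
}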

We will perform one test run with 10 concurrent users. During that time, we will take note of the process ID of the Java process (which corresponds to the application server). During the run, we will issue the following command:


# kill -3 <process id>

You can check the process ID of the application server by looking for Java processes with the ps command; some application servers also log their process ID when they are started. Although the command says "kill," it does not terminate the process. Instead, it produces a thread dump and saves it in a file of the form javacorexxxxxxx.txt. Note that thread dumps are not part of the JVM or Java specifications, so different JVM implementations might use different file-naming schemes. We typically issue this command several times in succession (three to five times) to get more than one thread dump and to monitor the progress of the threads by checking whether they ever change state at all.


Never take a thread dump during a test run in which you are collecting performance data. Thread dumps slow down the JVM and thus affect its performance.

We took a thread dump of the application server while running a test with 10 concurrent users. Thread dumps can be huge files, depending on the number of threads in the JVM. Thread dump output also provides information about the system and the JVM.

Although this chapter does not cover how to interpret thread dump output, for the purposes of this case study we are only interested in finding out whether any threads are in the MW state, specifically in the database routines.

Figure 22-7 shows a snippet of thread information from the thread dump that we took. Note that we show only the topmost part of the stack trace because the stack traces are really deep. The thread in Figure 22-7 is just one of the Servlet.Engine.Transports threads (threads are pooled in this application server). This thread is basically a worker thread that receives the request from the plug-in and executes it. As you can see, the state of the thread is R. The top of the stack trace tells us that this thread is in the JDBC code, which communicates with the local database client. The database client communicates with the remote database server.

Figure 22-7. A snippet of a thread stack trace taken from a thread dump.


We examined all of the threads, and none of them are in the MW state. In fact, all threads that are in the JDBC code are in the R state. We can conclude that the database server is not a bottleneck because none of the threads is blocked by it.

Database I/O Activity


The next step is to check the I/O activity at the database server. We ran the same test and used iostat on the database server:


iostat -d 2

This command monitors the disk activity every 2 seconds. Following is a segment of the trace:


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           39.00         0.00       400.00          0        800
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           32.50         0.00       480.00          0        960
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           49.00         0.00       504.00          0       1008
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           33.00         0.00       356.00          0        712
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           33.50         0.00       380.00          0        760
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           34.50         0.00       364.00          0        728
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           32.00         0.00       464.00          0        928
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           39.50         0.00       400.00          0        800
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           25.50         0.00       280.00          0        560
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0           20.00         0.00       216.00          0        432

We removed the other devices from this trace because there was no activity on them. The trace shows that not much I/O activity (all of it writes) is going on, which leads to one possibility. Because the database is actually doing some work, and none of the application server threads is blocked on the database, communication or transfer of data is clearly occurring. Even when the workload is increased, the I/O activity remains the same, and yet none of the threads is blocked. This might mean that the communication between the application server and the database server is the bottleneck: the rate of data transfer over the wire is the same regardless of the load, and the threads are taking a long time to send and receive data. Note that sending and receiving data is active work, which is why the threads are in the R state.

Network Activity Between the Application Server and Database Server


We suspect that the network speed between the application server and the database server is slow. The quickest way to tell is with the ping command. When we ping the workload driver, which is the host atlas, the following is returned:


PING atlas.raleigh.ibm.com (9.27.175.20) from 9.27.175.25 : 56(84) bytes of data.
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=1 ttl=128 time=0.220 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=2 ttl=128 time=0.127 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=3 ttl=128 time=0.127 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=4 ttl=128 time=0.125 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=5 ttl=128 time=0.121 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=6 ttl=128 time=0.124 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=7 ttl=128 time=0.127 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=8 ttl=128 time=0.119 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=9 ttl=128 time=0.128 ms
64 bytes from atlas.raleigh.ibm.com (9.27.175.20): icmp_seq=10 ttl=128 time=0.126 ms

When we ping the database server, which is the host telesto, we get the following output:


PING telesto.raleigh.ibm.com (9.27.175.24) from 9.27.175.25 : 56(84) bytes of data.
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=1 ttl=64 time=1.96 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=2 ttl=64 time=0.384 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=3 ttl=64 time=0.448 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=4 ttl=64 time=5.61 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=5 ttl=64 time=0.382 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=6 ttl=64 time=0.389 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=7 ttl=64 time=0.677 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=8 ttl=64 time=0.356 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=9 ttl=64 time=1.99 ms
64 bytes from telesto.raleigh.ibm.com (9.27.175.24): icmp_seq=10 ttl=64 time=1.19 ms

Comparing the round-trip times in the two outputs, it is clear that something is not right with the connection to the database server. To investigate the cause of this problem, the recommended approach is to start at the lowest level (hardware, connections, NIC, and network drivers) before even going into tuning the TCP/IP stack. An important question to ask is, "What speed and duplex are being used by the database server's network card?"

To determine the speed and duplex of the database server's network card, we log on to the database server and issue the command mii-tool. Unfortunately, the mii-tool on the database server fails with the following message:


SIOCGMIIPHY on 'eth0' failed: Operation not supported

A useful tool when mii-tool does not work is ethtool. We issue ethtool on the database server (we know that eth0 is the interface we are using on this server) as follows:


# ethtool eth0

Output from ethtool follows:


Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 10Mb/s
        Duplex: Half
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: d
        Link detected: yes

The output indicates that the current setting for eth0 is 10Mbps using half-duplex. This is a very big discovery that could help improve performance. To change the setting, issue the following command:


# ethtool -s eth0 speed 100 duplex full

Note that instead of forcing 100Mb/s you could also re-enable auto-negotiation (ethtool -s eth0 autoneg on). We ran the same tests again after fixing the network card settings; Table 22-3 summarizes the new results. The improvements were phenomenal. We raised the maximum CPU utilization on the application server from ~59% to ~94%, and we increased the maximum throughput by 56%. More importantly, we can now support up to 200 users and still get a response time of less than 2 seconds.

Table 22-3. Resulting Performance Data After Fixing the Network Problem

Concurrent   Throughput   Avg Response   Application Server   Database Server   Little's
Users        (req/sec)    Time (sec)     CPU% (usr/sys)       CPU% (usr/sys)    Law
     1          25.48         0.038          16.19/1.7             1.87/0.85        1
     2          46.04         0.042          33.72/3.52            4.19/1.79        2
     5          77.38         0.063          61.06/6.72            9.18/3.15        5
    10          89.43         0.110          81.45/8.27           11.2/3.46        10
    20         106.12         0.187          89.25/9.94           14.09/3.94       20
    40         104.76         0.380          86.47/10.77          14.27/4.46       40
    80         102.15         0.782          85.48/10.02          14.55/4.38       80
   160         101.86         1.564          83.74/10.23          14.28/4.20      159
   200         101.17         1.971          84.77/10.01          14.03/4.07      199
   220         102.02         2.15           84.25/10.10          14.47/4.03      219

We challenged ourselves to do better because about 6% of the CPU is still not being used.

The Java Virtual Machine


The next step is to tune the application server and the JVM. Because IBM WebSphere Application Server does not use HotSpot technology in its JVM, there is only one thing that we need to investigate: the heap size settings. The default values are a minimum of 50MB and a maximum of 256MB. To determine whether the defaults are the right values and whether we have enough heap, we need to examine the garbage collection (GC) activity in the JVM. We know that garbage collection is overhead that must be minimized to get more throughput from the application. So we ran one test with 20 concurrent users and enabled the -verbosegc flag in the JVM parameters. The garbage collection output is emitted on standard error, so we need to redirect it to a file.

Analyzing the raw verbosegc output from the file is very tedious and time consuming. We developed a tool called GCAnalyzer to capture this information and summarize the results. The tool graphs each garbage collection event and shows how much memory was collected and how much is still live. We ran the tool against gc.output, the file where the raw GC output was saved, using the following command:


GCAnalyzer -mb -chartdata gc.txt gc.output

This command specifies that all units be expressed in megabytes and that, in addition to the summary, a text file, gc.txt, be produced that can be imported into Microsoft Excel to create a graph. The summary results are shown here:


Total number of GC: 222
Total runtime=464541 ms.
Total time in GC: 21997 ms.
% of time spent on GC: 4%
Avg GC time: 99 ms.
Longest GC time: 146 ms.
Shortest GC time: 14 ms.
Avg. time between GC: 1993 ms.
Avg. GC/Work ratio: 0.047361907
Avg. bytes needed per GC: 1380 bytes
Total garbage collected: 7,400.87 MB
Avg garbage collected per GC: 33.34 MB
Total live memory: 12,559.26 MB
Avg live memory per GC: 56.57 MB
Avg. % heap of free bytes before GC: 0.9 %
Avg. % heap of free bytes after GC: 38.12 %
Avg. % heap as garbage per GC: 37.03 %
Avg. marking time: 83 ms.
Avg. sweeping time: 12 ms.
Number of compaction(s): 6 (2.7%)
Avg. compaction time: 1 ms.
Number of heap expansion(s): 4 (1.8%)
Avg. expansion time: 34 ms.
Avg size of additional bytes per expansion: 11632 KB.
Avg. garbage created per second (global): 16.74 MB per second
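As a small sketch of our arithmetic, the "% of time spent on GC" line is simply the total GC pause time divided by the total runtime reported above:

// Derives the GC overhead percentage from the summary totals above:
// 21,997 ms of GC pauses over a 464,541 ms run is roughly 4.7%, which the
// summary reports as 4%.
public class GcOverhead {
    public static void main(String[] args) {
        double totalRuntimeMs = 464_541;   // "Total runtime" from the summary
        double totalGcTimeMs  = 21_997;    // "Total time in GC"
        double overheadPct = 100.0 * totalGcTimeMs / totalRuntimeMs;
        System.out.printf("GC overhead: %.1f%%%n", overheadPct);
    }
}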

Looking at these numbers, we cannot see a major problem with garbage collection. In fact, the time spent on garbage collection is only about 4% of the 5-minute test run; in our experience, 5% is the upper limit for acceptable garbage collection overhead. The individual collections are also very quick: 83ms for marking, 12ms for sweeping, and only 1ms for compaction. We believe that the garbage collector improved greatly in JVM 1.4.1, which is the version that comes with IBM WebSphere Application Server 5.1. Figure 22-8 further shows that there is no problem with garbage collection.

Figure 22-8. The resulting graph taken from the GCAnalyzer output.

We can infer from the graph that the heap size settled at around 98MB, which is a lot less than the 256MB maximum that was set. In other words, there is no need to increase the heap size. Note that the first few garbage collections took place during the startup of the application server.

Another JVM flag, -Xnoclassgc, might help improve performance. -Xnoclassgc tells the garbage collector not to collect class objects that are no longer used when garbage collection occurs. Given the good garbage collection profile, we did not foresee a major improvement, but the flag was worth testing in case it yielded additional throughput. As expected, the results showed a minuscule improvement, but we decided to retain the flag anyway.

The Application Server


A layer in the software performance stack that might need tuning is the application server itself. Different application servers might have different tuning knobs to offer, but they are very similar in many respects. This is partly due to the fact that they are all J2EE-based.

Thread Pools


The first thing to address within the application server is the thread pool. Pooling threads is typical in all application servers. This is a good approach because providing threads is the responsibility of the application server (enterprise applications are not supposed to create their own threads). Thread pooling is a performance-enhancing mechanism that avoids the cost of creating and destroying a thread every time one is needed.
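The idea can be sketched with a plain java.util.concurrent fixed-size pool. This is an illustration only, not the application server's implementation; the pool size of 20 mirrors the web container setting we choose below:

// Minimal illustration of thread pooling: a fixed set of worker threads is
// created once and reused for every request, so no thread is created or
// destroyed per request. Not the application server's actual implementation.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPoolSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (int i = 0; i < 100; i++) {
            final int requestId = i;
            // Requests beyond the pool size simply queue for a free worker.
            pool.submit(() -> handleRequest(requestId));
        }
        pool.shutdown();
    }

    private static void handleRequest(int id) {
        System.out.println(Thread.currentThread().getName()
                + " handled request " + id);
    }
}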

Linux kernels 2.4 and earlier are very sensitive to threads, especially when a large number of active threads are in the system, because these kernels still use the original threading model and scheduler. Linux was initially designed for small desktops with one processor, so the original scheduler design was not very concerned with scaling. These kernels use a single runqueue for all threads, so processors that are ready to take a queued runnable process must synchronize their access to that queue. Worse, the scheduler evaluates each process in the queue to choose which one to schedule next, so the time complexity is a function of the number of threads in the queue. This explains why a larger thread pool may perform worse: it can be faster to make additional requests wait for an available thread than to give each of them a thread and add to the cost of scheduling. Linux kernel 2.6, however, introduced the O(1) scheduler and a new threading model that conforms to the Native POSIX Thread Library. In the O(1) scheduler, each processor has its own runqueue instead of sharing a single queue.

For now, we need to look at our thread pool size because we are using kernel 2.4. The default thread pool size of the web container is 50. By setting it to 20, we get the results shown in Table 22-4.

Table 22-4. Results of Setting the Web Container Thread Pool Size to 20

Concurrent   Throughput   Avg Response   Application Server   Database Server   Little's
Users        (req/sec)    Time (sec)     CPU% (usr/sys)       CPU% (usr/sys)    Law
    10          96.04         0.103          79.33/8.68           13.91/4.18       10
    20         106.81         0.186          87.92/11.15          14.05/3.95       20
    40         109.01         0.336          89.23/9.72           14.03/4.42       40
    80         107.19         0.745          88.56/10.59          13.88/4.33       80

Compared to the results shown in Table 22-3, we see considerable improvement. We also expected the benefit to show up most clearly at 20 users and above, because that is where fewer threads must serve more incoming requests.

Pass-by-Reference


The next parameter to set is the pass-by-reference option in the ORB component of the application server. The J2EE specification mandates that all EJB method calls pass parameters by value, not by reference. This is needed because, for remote method calls, references cannot be maintained across two remote machines; therefore, the actual values must be passed.

Here is where knowledge of the application helps. We know that all requests to Trade3 come through the HTTP transport and that all EJB calls are made locally in the same JVM. It also happens that Trade3 does not change the values of the parameters passed to EJB method calls. Pass-by-reference is a performance-enhancing mechanism in which, instead of deeply copying objects, only references to the objects are passed. Because all calls are local and the parameters are never modified, we can safely enable pass-by-reference for this workload.
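The difference can be sketched as follows (an illustration of the call semantics only, not Trade3 or WebSphere code): pass-by-value hands the callee a deep copy of the argument, typically made by serializing and deserializing it, whereas pass-by-reference hands over the caller's object itself, which is cheap but safe only when the callee never mutates its parameters:

// Sketch of pass-by-value vs. pass-by-reference call semantics
// (illustrative only; the container/ORB does this work in practice).
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CallSemanticsSketch {
    // Pass-by-value: the callee receives a deep copy of the argument,
    // made here by serializing and deserializing it. Safe, but costly.
    static Object byValue(Serializable arg) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buf);
        out.writeObject(arg);
        out.flush();
        return new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray())).readObject();
    }

    // Pass-by-reference: the callee receives the caller's object itself.
    // Cheap, but correct only if the callee never mutates its parameters,
    // which is the case for Trade3.
    static Object byReference(Object arg) {
        return arg;
    }

    public static void main(String[] args) throws Exception {
        String symbol = "IBM";
        System.out.println(byValue(symbol) == symbol);      // false: a copy
        System.out.println(byReference(symbol) == symbol);  // true: same object
    }
}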

Data Source Prepared Statements and Connection Pool


The data source's prepared statement cache size and connection pool size are two knobs that you can tune if you are experiencing database bottlenecks. However, based on our thread dump analysis, we are not experiencing a performance problem on the database side. Thus, the default values of 60 cached prepared statements and 50 pooled connections appear to work well for the Trade3 workload.
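For reference, the reason a prepared statement cache matters can be sketched with plain JDBC. The JDBC URL, table, and column names below are illustrative assumptions, not Trade3's actual schema, and in the application server the connection would come from the data source's pool rather than from DriverManager:

// Generic illustration of prepared statement reuse: the SQL is prepared once
// and re-executed with different parameters, which is the cost the data
// source's prepared statement cache amortizes across requests.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PreparedStatementSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL; a pooled data source would be used in practice.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:db2://telesto:50000/tradedb");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT price FROM quote WHERE symbol = ?")) {
            for (String symbol : new String[] {"SYM1", "SYM2", "SYM3"}) {
                ps.setString(1, symbol);          // only the parameter changes
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(symbol + " = " + rs.getBigDecimal(1));
                    }
                }
            }
        }
    }
}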

Finally, we reran our tests with pass-by-reference enabled and set the ORB Thread Pool size down to 10. The results are shown in Table 22-5. The performance graph is also shown in Figure 22-9. Compare these results with the initial performance data, and you will see that a big difference has been made. At this point, we have used ~99% of the application server's CPU and raised the number of users to 250.

Table 22-5. Final Results of a Tuned System for the Trade3 Workload

Concurrent   Throughput   Avg Response   Application Server   Database Server   Little's
Users        (req/sec)    Time (sec)     CPU% (usr/sys)       CPU% (usr/sys)    Law
     1          27.96         0.036          14.65/2.0             2.25/1.12        1
     2          50.99         0.038          29.63/3.78            4.61/1.73        2
     5          87.61         0.056          53.69/6.54            9.66/3.18        5
    10         112.71         0.088          76.40/10.21          14.11/4.17       10
    20         124.36         0.159          86.86/11.97          15.94/4.58       20
    40         125.89         0.317          87.28/11.78          16.67/4.38       40
    80         126.02         0.633          86.94/12.25          17.12/4.94       80
   160         126.15         1.263          86.81/12.35          16.85/4.53      159
   250         128.49         1.929          87.37/11.94          16.71/4.85      250

Figure 22-9. Performance graph of a tuned system running Trade3.

Hyperthreading


Our final attempt to improve performance was to take advantage of the hyperthreading feature of the Intel Xeon processor. Fortunately, our Linux kernel supports hyperthreading. Using the same system in its tuned state, we enabled hyperthreading through the hardware system configuration and rebooted the application server box. The results of rerunning the tests are shown in Table 22-6.

Table 22-6. Performance Data with Hyperthreading

Concurrent   Throughput   Avg Response   Web Server       Application Server   Database Server   Little's
Users        (req/sec)    Time (sec)     CPU% (usr/sys)   CPU% (usr/sys)       CPU% (usr/sys)    Law
     1          29.82         0.033         0.44/0.64          8.46/1.53            2.34/1.11        1
     2          53.46         0.037         0.82/1.48         17.64/3.27            5.05/1.99        2
     5          90.46         0.055         1.26/1.87         41.58/7.47           10.28/3.26        5
    10         112.07         0.089         1.67/2.40         67.83/11.20          14.10/4.04       10
    20         119.18         0.167         1.92/2.58         81.28/14.08          16.36/4.62       20
    40         117.71         0.339         1.53/2.52         79.71/14.44          15.63/4.42       40
    80         117.34         0.679         1.54/2.65         79.36/14.06          15.59/4.86       80
   160         117.10         1.357         1.51/2.81         79.84/14.21          15.71/4.70      159
   250         116.10         2.134         1.60/2.52         81.11/12.66          15.54/4.64      250

As the results clearly show, the Trade3 application and the settings we have used so far do not lend themselves to the hyperthreading feature of the processors. This just goes to show that the promise of hyperthreading should be studied carefully and applied only in the right situations.

