Case Study

This section presents the results of the benchmarks used for the case study and shows the cumulative improvements of all the performance enhancement features discussed so far. Some of the benchmarks capture the cumulative gain of all the features, whereas others capture selected gains for specific workloads.

NetBench

Figure 21-4 summarizes the improvements in NetBench throughput obtained through various Linux kernel and Samba enhancements and tuning. These tests were conducted on a Pentium 4 system with four 1.5GHz P4 processors, four gigabit Ethernet adapters, 2GB of memory, and fourteen 15k RPM SCSI disks. SUSE version 8.0 was used for all tests, and each subsequent test added one new kernel, tuning, or Samba change. The NetBench Enterprise Disk Suite by Ziff-Davis was used throughout.

Figure 21-4. NetBench case study.

- Represents a clean installation of SUSE Enterprise Linux Server 8.0 (SUSE SLES 8) with no performance configuration changes.
- data=writeback. The default Ext3 mount option for the /data file system (where the Samba shares reside) was changed from ordered to writeback. This greatly improved file system performance on metadata-intensive workloads such as this one.
- smblog=1. The Samba logging level was changed from 2 to 1 to reduce disk I/O to the Samba log files. A level of 1 is still verbose enough to log critical errors.
- SendFile/Zerocopy. A patch that makes Samba use SendFile for client read requests. Combined with Linux Zerocopy support (first available in 2.4.4), this eliminates two very costly memory copies.
- O(1) scheduler. A small improvement that will facilitate other performance improvements in the future. The O(1) scheduler is the multiqueue scheduler that improves performance on symmetric multiprocessor systems; it is the default scheduler in the Linux 2.5 and 2.6 kernels.
- evenly affined IRQs. Each of the four network adapters' interrupts was handled by a unique processor. SUSE SLES 8, for the P4 architecture, defaults to a round-robin assignment (destination = irq_num % num_cpus) for IRQ-to-CPU mappings; in this particular case, all of the network adapters' IRQs were routed to CPU0. That can be good for performance because cache warmth in the interrupt-handling code improves, but one CPU may not be able to handle the entire network load as more NICs are added to the system. The better approach is to affine these IRQs evenly so that each processor handles the interrupts from one NIC. Combined with process affinity, this keeps the process associated with a particular NIC on one CPU for maximum performance.
- process affinity. This technique ensures that for each network interrupt that is processed, the corresponding smbd process is scheduled on the same CPU, further improving cache warmth.
- Increases the number of buffers held in the network stack code so that the stack does not have to call into the memory system to get or free a buffer. This tuning is not available in the 2.6 kernel.
- case sensitivity enforced. When case sensitivity is not enforced, Samba might have to search for several variations of a filename before it can stat that file, because many combinations of filenames can exist for the same file. Enforcing case sensitivity eliminates those guesses.
- spinlocks. Samba uses fcntl() locking for its database, which can be costly. Using spinlocks avoids the fcntl() call and the use of the Big Kernel Lock in posix_lock_file(), reducing contention and wait times on the Big Kernel Lock. To use this feature, configure Samba with --use-spin-locks (see the sketch following this list).
- dcache read copy update. Directory entry lookup times are reduced with a new implementation of dlookup() that uses the read-copy update technique. Read-copy update is a two-phase update method for mutual exclusion in Linux that avoids the overhead of spin-waiting locks. For more information, see the locking section of the Linux Scalability Effort project.
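As a rough illustration of how several of these changes are applied, the following sketch shows the corresponding commands and smb.conf settings. It is a sketch under assumptions, not the study's actual configuration: the /data mount point and option names come from the discussion above, but the IRQ numbers, CPU masks, and file paths are invented for the example.

```bash
# Illustrative sketch of the NetBench tunings above; IRQ numbers (24-27)
# and paths are assumptions for the example.

# Ext3 journal mode for the file system holding the Samba shares:
# ordered -> writeback.
mount -o remount,data=writeback /data

# smb.conf settings corresponding to "smblog=1" and "case sensitivity
# enforced":
#   [global]
#       log level = 1
#       case sensitive = yes

# Rebuild Samba with spinlocks instead of fcntl() locking (configure flag
# as given in the text):
#   ./configure --use-spin-locks && make && make install

# Evenly affine the four NICs' IRQs, one per CPU; the value written is a
# hexadecimal CPU bitmask.
echo 1 > /proc/irq/24/smp_affinity
echo 2 > /proc/irq/25/smp_affinity
echo 4 > /proc/irq/26/smp_affinity
echo 8 > /proc/irq/27/smp_affinity

# Process affinity in the study came from a patch that keeps each smbd on
# the CPU handling its NIC's interrupts; a crude manual approximation is
#   taskset -p 0x1 <smbd pid>
```

Note that data=writeback trades some crash-consistency guarantees for metadata performance, which is why it is not the Ext3 default.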
Netperf3 (Gigabit Ethernet Tuning Case Study)

Gigabit Ethernet NICs are becoming cheaper and are quickly replacing 100Mb Ethernet cards. System manufacturers are including Gigabit Ethernet on motherboards, and system suppliers and integrators are choosing Gigabit Ethernet network cards and switches for connecting disk servers, PC computer farms, and departmental backbones. This section looks at gigabit network performance on Linux and at how tuning the Gigabit Ethernet NICs improves network performance.

Gigabit Ethernet network cards (Intel Gigabit Ethernet and Acenic Gigabit Ethernet) have a few additional features that help them handle high throughput. These features include support for jumbo frame (MTU) sizes, interrupt delay, and TX/RX descriptors (a brief tuning sketch follows the list):

- Jumbo frame size. With Gigabit Ethernet NICs, the MTU size can be greater than 1500 bytes; the 1500-byte MTU limit of 100Mb Ethernet no longer applies. Increasing the MTU usually improves network throughput, but make sure that the network routers support the jumbo frame size. Otherwise, when the system is connected to a 100Mb Ethernet network, the Gigabit Ethernet NICs drop to 100Mb capacity.
- Interrupt delay/interrupt coalescence. Interrupt coalescence can be set for receive and transmit interrupts, letting the NIC delay generating an interrupt for the set time period. For example, when RxInt is set to 1.024us (the default value on the Intel Gigabit Ethernet), the NIC places received frames in memory and generates an interrupt only after 1.024us has elapsed. This can improve CPU efficiency because it reduces context switches, but it also increases receive packet latency. If properly tuned for the network traffic, interrupt coalescence can improve both CPU efficiency and network throughput.
- Transmit and receive descriptors. This value is used by the Gigabit Ethernet driver to allocate buffers for sending and receiving data, and increasing it allows the driver to buffer more incoming packets. Each descriptor includes a transmit or receive descriptor buffer along with a data buffer; the data buffer size depends on the MTU size, and the maximum MTU size is 16110 for this driver.
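The sketch below shows one way to adjust these three settings. It is a hedged example rather than the configuration used in the study: the interface name (eth1) and the values are placeholders, and whether the settings are exposed through ethtool or through driver module parameters depends on the NIC driver and its version.

```bash
# Sketch only: eth1 and the values are placeholders, and not every driver
# exposes these settings through ethtool (older drivers used module
# parameters instead).

# Jumbo frames: raise the MTU above the 1500-byte default. The switches
# and routers in the path must also support the larger frame size.
ip link set dev eth1 mtu 9000

# Interrupt coalescence: delay RX/TX interrupts to reduce the interrupt
# rate (at the cost of some added receive latency).
ethtool -C eth1 rx-usecs 64 tx-usecs 64

# Transmit/receive descriptor rings: let the driver buffer more packets.
ethtool -G eth1 rx 4096 tx 4096
```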
Other Gigabit Ethernet studies indicate that for every gigabit of network traffic a system processes, approximately 1GHz of CPU processing power is needed to perform the work. Our experiments confirmed this, but adding more processors and more Gigabit Ethernet NICs did not scale the network throughput, even when the number of GHz-class processors equaled the number of Gigabit Ethernet NICs. Other bottlenecks, such as the system buses, limit the scalability of Gigabit Ethernet NICs on SMP systems; in the four-NIC test, media speed was achieved on only three of the four NICs.

These tests were run between a four-way 1.6GHz Pentium 4 machine and four client machines (1.0GHz Pentium 3), each capable of driving one gigabit NIC at media speed. All machines ran the vanilla Linux 2.4.17 SMP kernel, and the e1000 driver was version 4.1.7. The runs are Netperf3 PACKET_STREAM and PACKET_MAERTS tests, all with an MTU of 1500 bytes. The Pentium 4 machine had 16GB of RAM and four CPUs, with the four NICs distributed equally between 100MHz and 133MHz PCI-X slots. Hyperthreading was disabled on the Pentium 4 system.

The PACKET_STREAM test transfers raw data without any TCP/IP headers; none of the packets transmitted or received go through the TCP/IP layers. The test exercises only the CPU, memory, PCI bus, and the NIC's driver, which makes it useful for finding bottlenecks in those areas. The interrupt delay and the transmit and receive descriptor counts for the gigabit driver were tuned with different values to determine what works best for this environment, and different socket buffer sizes were tried as a further tuning step.

Table 21-5 shows that the maximum throughput achieved was 2808Mbps across the four NICs. The tuning that produced this result was 4096 transmit and receive descriptors, an interrupt delay of 64 on both the receive and transmit sides, and a 132KB socket buffer size.
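For reference, the best-performing settings just described could be applied roughly as shown below. This is a hedged sketch: the module parameter names and the comma-separated per-adapter syntax follow the e1000 driver documentation of that era, but the exact names and limits should be checked against the driver version actually in use (4.1.7 in these runs).

```bash
# Sketch only: verify parameter names and limits against the installed
# e1000 driver's documentation.

# Load the driver with 4096 TX/RX descriptors and an interrupt delay of 64
# on both sides; one comma-separated value per installed adapter.
modprobe e1000 \
    TxDescriptors=4096,4096,4096,4096 \
    RxDescriptors=4096,4096,4096,4096 \
    TxIntDelay=64,64,64,64 \
    RxIntDelay=64,64,64,64

# Raise the socket buffer ceiling to about 132KB (132 * 1024 = 135168
# bytes) so the benchmark can request that size with setsockopt().
sysctl -w net.core.rmem_max=135168
sysctl -w net.core.wmem_max=135168
```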
VolanoMark

The VolanoMark benchmark creates 10 chat rooms of 20 clients each. Each room echoes the messages from one client to the other 19 clients in the room. The benchmark, which is not open source, consists of the VolanoChat server and a second program that simulates the clients in the chat rooms. It is used to measure raw server performance and network scalability. VolanoMark can be run in two modes: loopback and network. Loopback mode tests raw server performance, and network mode tests network scalability. VolanoMark uses two parameters to control the size and number of chat rooms.

The benchmark creates client connections in groups of 20 and measures how long the server takes to broadcast each client's messages to the rest of the group. At the end of a loopback test, it reports the average number of messages transferred per second as the score. In network mode, the metric is the number of connections between the clients and the server. The Linux kernel components stressed by this benchmark include TCP/IP, the scheduler, and signals.

Figure 21-5 shows the results of the VolanoMark benchmark run in loopback mode. The improvements shown resulted from network tuning, kernel enhancements, and two prototype patches. The Performance team at the IBM Linux Technology Center created the prototype patches, but they have not been submitted to the upstream kernel. The first is the priority preemption patch, which enables a process to run longer without being preempted by a higher-priority process. Because turning off priority preemption is not acceptable for all workloads, the patch is enabled through a new scheduler tuning parameter. The other patch, the TCP soft affinity patch, is related to TCP/IP, so a detailed discussion of it is not appropriate for this chapter.

Figure 21-5. VolanoMark case study.

SPECWeb99

The SPECWeb99 benchmark presents a demanding workload to a web server. The workload requests 70% static pages and 30% simple dynamic pages, with page sizes ranging from 102 bytes to 921,000 bytes. The dynamic content models GIF advertisement rotation; there is no SSL content. SPECWeb99 is relevant because web serving, especially with Apache, is one of the most common uses of Linux servers. Apache is rich in functionality but is not designed for high performance; it was chosen as the web server for this benchmark because it currently hosts more web sites than any other web server on the Internet. SPECWeb99 is the accepted standard benchmark for web serving. SPECWeb99 stresses the following kernel components:

- Scheduler
- TCP/IP
- Various threading models
- SendFile
- Zerocopy
- Network drivers

Figures 21-6 and 21-7 show the results of SPECWeb99 with static and dynamic web content, respectively, along with a description of the hardware and software configurations used for each test. (A short sketch of enabling and verifying the sendfile path in Apache follows the figures.)

Figure 21-6. SPECWeb99 static content case study.

Figure 21-7. SPECWeb99 dynamic content case study.
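Because SendFile/Zerocopy is one of the paths SPECWeb99 exercises, it is worth confirming that the web server actually serves static files through sendfile(). The following is a minimal sketch, not the configuration used in these runs: it assumes Apache 2.x (which provides the EnableSendfile directive), the apachectl, strace, pgrep, and curl tools, and a document root containing index.html.

```bash
# Sketch only: assumes Apache 2.x; the SPECWeb99 runs described above may
# have used a different Apache setup.

# In httpd.conf, make sure sendfile-based delivery of static files is on:
#   EnableSendfile On
apachectl configtest && apachectl graceful

# Spot-check that a static fetch goes through sendfile() rather than
# read()/write() copies by tracing an Apache process while fetching a page
# (with prefork, any idle worker may take the request, so repeat the fetch
# if no sendfile() call appears).
strace -f -e trace=sendfile -p "$(pgrep -n httpd)" &
strace_pid=$!
sleep 1
curl -s http://localhost/index.html > /dev/null
sleep 1
kill "$strace_pid"
```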