Performance Measurements
Each application was run five times on the host system. Physical memory was cleared before each run so that performance figures would not be improved as the files and programs became cached in memory or the swap file. The averages are shown in the tables in the following sections. Times are in seconds.Comments are listed after the tables. Needless to say, generalizations about performance can be perilous because numerous factors, including test program characteristics, contribute to a program's time performance. These tests do, however, show some of the possibilities and show the potential impacts of various file and operating systems and different programming techniques. Also bear in mind that the tests measure the time from program start to end but do not measure the time that the system might take to flush buffers to the disk. Finally, there was no attempt to exploit specific system features or parameters, such as stripped disks, disk block sizes, multiple disk partitions, and so on.The Windows performance monitor, available under the control panel's Administrative Tools, displays CPU, kernel, user, and other activities graphically. This tool is invaluable in gaining insight into program behavior beyond the measurements given here.The results show that performance varies widely based on the CPU, file system, disk configuration, program design, and many other factors. The timing programs are all on the book's Web site so that you can perform these tests on your own system.
File Copying
Five file copy implementations were used to copy a 25.6MB file (400,000 64-byte records, generated with the RandFile program from Chapter 5). The first two columns of Program 1-1) uses the C library. This test measures the effect of an implementation layered on top of Windows, although the library has the opportunity to perform efficient buffering and other techniques.Program 1-2) is the straightforward Windows implementation with a small buffer (256 bytes).Program 1-3) uses the Windows CopyFile function to determine whether the implementation within a single system call is more efficient than what can be achieved with other techniques.
CPU | Pentium III | Pentium III | Pentium LT | Celeron LT | Xeon | 4 x Xeon | |
OS | W2000 | W2000 | W2000 | XP | W2000 | W2000 | |
File System | FAT | NTFS | NTFS | NTFS | NTFS | NTFS | |
cpC | Real | 8.62 | 14.69 | 12.75 | 7.23 | 6.03 | 2.67 |
User | 0.12 | 0.12 | 0.10 | 0.10 | 0.09 | 0.06 | |
System | 0.24 | 0.52 | 1.39 | 0.39 | 0.25 | 0.36 | |
cpW | Real | 8.49 | 13.35 | 25.48 | 7.10 | 8.94 | 2.95 |
User | 0.13 | 0.12 | 0.06 | 0.04 | 0.04 | 0.13 | |
System | 0.88 | 1.37 | 4.61 | 0.62 | 0.56 | 0.13 | |
cpwFA | Real | 8.35 | 12.59 | 7.35 | 8.25 | 9.10 | 2.36 |
User | 0.01 | 0.02 | 0.03 | 0.01 | 0.01 | 0.02 | |
System | 0.40 | 0.50 | 0.82 | 0.29 | 0.20 | 0.19 | |
cpCF | Real | 8.00 | 11.69 | 2.57 | 6.50 | 7.62 | 2.97 |
User | 0.02 | 0.01 | 0.02 | 0.02 | 0.01 | 0.02 | |
System | 0.19 | 0.25 | 0.53 | 0.19 | 0.12 | 0.17 | |
cpUC | Real | 7.84 | 13.14 | 21.01 | 9.98 | 7.77 | 3.53 |
User | 0.72 | 0.66 | 0.47 | 0.34 | 0.34 | 0.42 | |
System | 0.40 | 0.67 | 3.12 | 0.34 | 0.36 | 0.41 |
Comments
- The NTFS does not necessarily give better performance than the FAT file system. On the contrary, the FAT can be faster, as can be seen by comparing columns 1 and 2.
- The C and UNIX compatibility libraries give competitive performance that is superior to the simplest Windows implementation in many cases.
- CPU time, both kernel ("System") and user, are minimal. Consequently, processor speed has very little impact on the elapsed time performance.
- High-end SMP server systems do produce fast file processing compared to laptops and PCs, as would be expected. Informal tests on a faster W2003 system produced even better results (not shown here), with elapsed times about half those of the right-most column.
- There are no consistent or significant elapsed time performance advantages obtained by using large buffers, sequential scan flags, or a function such as CopyFile. However, the very small user times for cpwFA and cpCF are interesting and potentially helpful in some situations, even if they do not improve elapsed time. One system, the Pentium laptop, is an exception to this generalization. As mentioned earlier, CPU time is a small part of the elapsed time.
- Elapsed time results are highly variable, with as much as a 10:1 difference between identical tests run under identical circumstances.
ASCII to Unicode Conversion
Eight programs were measured, all converting the same 12.8MB file to a 25.6MB file. Program 2-4 and is comparable to cpW using a small buffer.Program 5-3.Chapter 14's multiple buffer scheme without asynchronous I/O.
CPU | Pentium III | Pentium III | Pentium LT | Celeron LT | Xeon | 4 x Xeon | |
OS | W2000 | W2000 | W2000 | XP | W2000 | W2000 | |
File System | FAT | NTFS | NTFS | NTFS | NTFS | NTFS | |
atou | Real | 3.24 | 7.16 | 33.53 | 6.27 | 5.77 | 2.77 |
User | 0.31 | 0.33 | 0.01 | 0.06 | 0.06 | 0.08 | |
System | 0.46 | 0.72 | 3.55 | 0.54 | 0.63 | 0.63 | |
atouSS | Real | 3.77 | 6.21 | 43.53 | 10.12 | 5.68 | 2.48 |
User | 0.20 | 0.23 | 0.11 | 0.07 | 0.04 | 0.14 | |
System | 0.52 | 0.81 | 3.17 | 0.04 | 0.35 | 0.81 | |
atouLB | Real | 4.38 | 6.41 | 28.51 | 5.95 | 4.75 | 2.47 |
User | 0.10 | 0.07 | 0.05 | 0.03 | 0.03 | 0.08 | |
System | 0.26 | 0.34 | 0.63 | 0.19 | 0.21 | 0.187 | |
atouLSFP | Real | N/A | N/A | 5.17 | 1.38 | 1.28 | 2.03 |
User | N/A | N/A | 0.07 | 0.05 | 0.09 | 0.06 | |
System | N/A | N/A | 0.61 | 0.16 | 0.10 | 0.11 | |
atouMM | Real | 4.35 | 2.75 | 3.46 | 3.90 | 3.74 | 0.77 |
User | 0.27 | 0.29 | 0.09 | 0.07 | 0.05 | 0.14 | |
System | 0.19 | 0.19 | 0.16 | 0.14 | 0.10 | 0.09 | |
atouMT | Real | 4.84 | 6.18 | 5.83 | 6.61 | 5.99 | 3.55 |
User | 0.14 | 0.15 | 0.26 | 0.04 | 0.06 | 0.02 | |
System | 0.45 | 0.46 | 0.66 | 0.33 | 0.15 | 0.31 | |
atouOV | Real | 9.54 | 8.85 | 32.42 | 6.84 | 5.63 | 3.17 |
User | 0.14 | 0.12 | 0.21 | 0.06 | 0.06 | 0.06 | |
System | 0.24 | 0.23 | 0.42 | 0.18 | 0.21 | 0.17 | |
atouEX | Real | 5.67 | 5.92 | 30.65 | 6.50 | 5.19 | 2.64 |
User | 1.10 | 1.50 | 0.29 | 0.35 | 0.41 | 0.64 | |
System | 1.19 | 1.74 | 0.77 | 0.69 | 0.59 | 1.91 |
Comments
- These results show that there is a small advantage to using large buffers and the sequential scan flags, possibly in conjunction.
- Presizing the output file (atouLSFP) is very effective, giving dramatic performance improvements on all the single-processor systems. The SMP benefits, however, were marginal. This technique could also be used with the previous file copying examples.
- CPU time is insignificant in these tests.
- Overlapped I/O, in addition to being limited to Windows NT and very difficult to program, gives poor performance. Notice that the time is predominantly real time and not user or system time. Using NT4, it appears that the system has difficulty scheduling the disk access, and experiments with different buffer sizes (larger and smaller) did not help until 65K buffers were used. NT5 has eliminated this problem.
- Extended I/O and multiple threads do not provide any significant benefit.
- Memory-mapped I/O can give very good performance, usually about 30 percent faster than the other versions. The SMP server results were even better.
Pattern Searching
Three pattern searching methods were tested to compare the efficiencies of multiple threads and processes as well as sequential processing (see Program 6-1, searches with parallel processes, each processing a separate file. The system and user times are not given, because timep measures only the parent process.Program 7-1, uses parallel threads.
CPU | Pentium LT | Celeron LT | Xeon | 4 x Xeon | |
OS | W2000 | XP | W2000 | W2000 | |
File System | NTFS | NTFS | NTFS | NTFS | |
grepMP | Real | 14.72 | 3.95 | 10.58 | 0.63 |
User | N/A | N/A | N/A | N/A | |
System | N/A | N/A | N/A | N/A | |
grepMT | Real | 7.08 | 3.61 | 8.09 | 0.73 |
User | 0.30 | 0.41 | 0.27 | 2.23 | |
System | 0.09 | 0.47 | 0.13 | 0.28 | |
grepSQ | Real | 6.71 | 3.86 | 6.71 | 0.97 |
User | N/A | N/A | N/A | N/A | |
System | N/A | N/A | N/A | N/A |
Comments
- In most cases, all three techniques provide similar results on single-processor systems. The Pentium laptop is an exception, where grepMP version was consistently slow.
- Multithreading offers a slight advantage over multiple processes, even on single-processor systems.
- User and system times are only meaningful in the multithreaded version.
- SMP systems show the performance gains that are possible using threads or multiple single-threaded processes. Notice that the total user time exceeds the real time because the user time represents all four processors.
- The fact that the sequential processing gave such similar results on single-processor systems indicates that the simplest solution is sometimes the best.
File Sorting
A target sort file of 100,000 64-byte records (6.4MB) was used to test four sort implementations from Chapter 5, as shown in the first four rows in Program 7-2, of a 25MB file with 400,000 64-byte records was tested with one, two, and four threads. Each individual run used a different file, created by the RandFile program that is in the Chapter 5 directory. There was considerable variation from one run to the next.
- sortBT is Program 5-1, which creates a binary search tree, requiring a memory allocation for each record. This program is CPU-intensive.Program 5-4, which maps the file before using qsort. sortFLSR (heap access was serialized) was also tested but showed no measurable difference.Program 5-5, which creates a permanent index file.Program 7-2, the multithreaded sort-merge. The results are shown as sortMT1, sortMT2, and sortMT4, according to the number of parallel threads. Results can differ significantly depending on the nature of the data in the file to be sorted, although the size and randomness of the data minimizes this variation, which, in general, is a characteristic of the underlying quicksort algorithm used to implement the qsort C library function.
CPU | Pentium LT | Celeron LT | Xeon | 4 x Xeon | |
OS | W2000 | XP | W2000 | W2000 | |
File System | NTFS | NTFS | NTFS | NTFS | |
sortBT | Real | N/A | 9.61 | N/A | N/A |
User | N/A | 1.84 | N/A | N/A | |
System | N/A | 7.44 | N/A | N/A | |
sortFL | Real | 11.15 | 0.78 | 1.74 | 5.38 |
User | 4.81 | 0.41 | 0.26 | 5.19 | |
System | 0.15 | 0.09 | 0.09 | 0.02 | |
sortHP | Real | 1.76 | 0.34 | 1.52 | 1.30 |
User | 1.62 | 0.22 | 0.15 | 1.28 | |
System | 0.11 | 0.05 | 0.03 | 0.04 | |
sortMM | Real | 0.99 | 1.44 | 1.92 | 1.39 |
User | 0.31 | 0.18 | 0.15 | 0.38 | |
System | 0.68 | 0.47 | 0.36 | 1.03 | |
sortMT1 | Real | 3.18 | 3.58 | 6.80 | 0.14 |
User | 0.01 | 0.95 | 0.01 | 0.05 | |
System | 0.46 | 0.16 | 0.16 | 0.11 | |
sortMT2 | Real | 2.10 | 1.22 | 6.70 | 0.13 |
User | 0.01 | 1.05 | 0.01 | 0.02 | |
System | 0.40 | 0.16 | 0.16 | 0.13 | |
sortMT4 | Real | 2.20 | 1.49 | 6.22 | 0.13 |
User | 0.01 | 1.18 | 0.01 | 0.12 | |
System | 0.16 | 0.15 | 0.16 | 0.09 |
Comments
- The binary tree implementation, sortBT, is CPU-intensive; it must allocate storage for each record one at a time.
- Memory mapping and reading the file into a preallocated buffer yield similar performance, but the memory mapping was not as good in these tests, and, in several cases, was considerably worse. However, sortFL and sortHP gave some excellent results.
- The total of user and system times sometimes exceeds the elapsed time, even when using a single thread.
- sortMT demonstrates the potential of SMP systems. Additional threads also helped, in some cases, on single-processor systems.
Multiple Threads Contending for a Single Resource
This test sequence compares different strategies for implementing the queue management functions of Program 10-4, using Program 10-5 (the three-stage pipeline) as a test application. The tests were run on a four-processor (1GHz) Intel Xeon Windows 2000 Server using 1, 2, 4, 8, 16, 32, and 64 threads, but in all seven cases each thread was asked to perform 1,000 units of work. Ideally, we would then expect real time to increase linearly with the number of threads, but contention for a single mutex (or CS) can cause nonlinear degradation as the number of threads increases. Note that these tests do not exercise the file system.Six different implementation strategies were used, and the results are shown in separate columns in Program 10-4 discuss the results and explain the merits of the different implementations, but notice that the signal model does scale with the number of threads, while the broadcast model does not scale, especially with 32 and 64 threads. Also notice how the broadcast model results in large amounts of system CPU time as multiple threads run, test the predicate, and immediately return to the wait state.
- Broadcast model, mutex, event, separate release and wait calls. The tunable time-out was set to 5 milliseconds, which optimized the 16-thread case.
- Broadcast model, CRITICAL_SECTION, event, separate release and wait calls. The tunable time-out was set to 25 milliseconds, which optimized the 16-thread case.
- Broadcast model, mutex, event, atomic SignalObjectAndWait call.
- Signal model, mutex, event, separate release and wait calls.
- Signal model, CRITICAL_SECTION, event, separate release and wait calls.
- Signal model, mutex, event, atomic SignalObjectAndWait call.
Number of Threads | Broadcast Model | Broadcast Model | Broadcast Model | Signal Model | Signal Model | Signal Model | |
---|---|---|---|---|---|---|---|
Mtx, Evt | Crit Sec, Evt | Mtx, Evt | Mtx, Evt | Crit Sec, Evt | Mtx, Evt | ||
5-ms T/O | 25-ms T/O | SigObjWait | Time-out N/A | Time-out N/A | SigObjWait | ||
1 | Real | 0.03 | 0.03 | 0.05 | 0.05 | 0.03 | 0.05 |
User | 0.03 | 0.06 | 0.03 | 0.05 | 0.08 | 0.05 | |
System | 0.06 | 0.02 | 0.09 | 0.08 | 0.02 | 0.06 | |
2 | Real | 0.14 | 0.27 | 0.09 | 0.08 | 0.06 | 0.08 |
User | 0.13 | 0.05 | 0.14 | 0.17 | 0.11 | 0.08 | |
System | 0.11 | 0.06 | 0.16 | 0.09 | 0.11 | 0.17 | |
4 | Real | 0.39 | 0.59 | 0.23 | 0.19 | 0.16 | 0.20 |
User | 0.18 | 0.17 | 0.22 | 0.26 | 0.17 | 0.19 | |
System | 0.30 | 0.22 | 0.41 | 0.31 | 0.22 | 0.31 | |
8 | Real | 0.83 | 0.92 | 0.73 | 0.36 | 0.34 | 0.36 |
User | 0.34 | 0.36 | 0.55 | 0.52 | 0.45 | 0.45 | |
System | 0.98 | 1.00 | 1.00 | 0.69 | 0.39 | 0.75 | |
16 | Real | 2.42 | 2.30 | 2.38 | 0.75 | 0.69 | 0.75 |
User | 1.17 | 1.31 | 1.22 | 0.81 | 0.81 | 0.88 | |
System | 3.69 | 3.05 | 3.39 | 1.45 | 1.08 | 1.33 | |
32 | Real | 7.56 | 7.50 | 7.98 | 1.50 | 1.50 | 1.50 |
User | 3.33 | 3.73 | 2.56 | 1.75 | 1.69 | 1.78 | |
System | 12.52 | 10.72 | 11.03 | 3.13 | 2.00 | 2.69 | |
64 | Real | 27.72 | 26.23 | 29.31 | 3.14 | 2.95 | 3.20 |
User | 7.89 | 10.75 | 7.22 | 3.73 | 3.69 | 3.47 | |
System | 46.70 | 40.33 | 36.67 | 6.28 | 3.89 | 5.47 |