Synchronization Performance Impact
Synchronization can and will impact the performance of your program, and you need to be especially careful when running on SMP systems. This may seem counterintuitive, as we might expect that SMP would generally improve performance and certainly never slow down a program. However, it turns out that internal implementation mechanisms and processor contention for memory access can produce unexpected effects, including dramatic performance degradation.
CRITICAL_SECTION and Mutex Trade-offs
The first step is to assess the performance impact of synchronization and to compare CRITICAL_SECTIONs with mutexes. Program 9-1 shows statsMX.c, which uses a mutex to synchronize access to a thread-specific data structure. statsCS.c, not shown but included on the book's Web site, does exactly the same thing using a CRITICAL_SECTION, and statsIN.c uses the interlocked functions. Finally, statsNS.c, also not shown, uses no synchronization at all; it turns out that synchronization is not required in this example because each worker accesses its own unique storage. See the cautionary note after the bulleted list following the program. The actual programs allow any number of worker threads, although, for simplicity, Program 9-1 as listed can support only 64 (the limit imposed by WaitForMultipleObjects). This set of examples not only illustrates the relative performance impact of the three types of synchronization but also shows the following concepts.
- Synchronization can sometimes be avoided with careful program design.
- The interlocked functions can be used in simple situations, such as incrementing a shared variable (a short sketch follows this list).
- CSs are measurably faster than mutexes in most situations.
- A common technique is to specify the thread argument data structure so that it contains state data to be maintained by the thread along with a pointer to a mutex or other synchronization object.
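To illustrate the second point, here is a minimal, self-contained sketch contrasting a mutex-guarded increment with the interlocked equivalent. It is not the statsIN.c source (which is on the book's Web site); the counter and function names are illustrative only.

#include <windows.h>

static LONG volatile task_count = 0;  /* Illustrative shared counter */

/* Mutex version: every increment costs a kernel wait and release. */
void increment_with_mutex (HANDLE hMutex)
{
    WaitForSingleObject (hMutex, INFINITE);
    task_count++;
    ReleaseMutex (hMutex);
}

/* Interlocked version: a single atomic operation, no kernel object. */
void increment_interlocked (void)
{
    InterlockedIncrement (&task_count);
}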
Program 9-1. statsMX: Maintaining Thread Statistics
/* Chapter 9. statsMX.c */
/* Simple boss/worker system, where each worker reports */
/* its work output back to the boss for display. */
/* MUTEX VERSION. */
#include "EvryThng.h"
#define DELAY_COUNT 20
/* Usage: statsMX nthread ntasks */
/* Start up nthread worker threads, each assigned to perform */
/* "ntasks" work units. Each thread reports its progress */
/* in its own unshared slot in a work-performed array. */
DWORD WINAPI worker (void *);
typedef struct _THARG {
    int thread_number;
    HANDLE *phMutex;
    unsigned int tasks_to_complete;
    unsigned int *tasks_complete;
} THARG;
int _tmain (int argc, LPTSTR argv [])
{
    INT nthread, ithread;
    HANDLE *worker_t, hMutex;
    unsigned int *task_count, tasks_per_thread;
    THARG *thread_arg;
    DWORD ThId;

    /* Create the mutex. */
    hMutex = CreateMutex (NULL, FALSE, NULL);
    nthread = _ttoi (argv [1]);
    tasks_per_thread = _ttoi (argv [2]);
    worker_t = malloc (nthread * sizeof (HANDLE));
    task_count = calloc (nthread, sizeof (unsigned int));
    thread_arg = calloc (nthread, sizeof (THARG));

    for (ithread = 0; ithread < nthread; ithread++) {
        /* Fill in the thread arg. */
        thread_arg [ithread].thread_number = ithread;
        thread_arg [ithread].tasks_to_complete = tasks_per_thread;
        thread_arg [ithread].tasks_complete = &task_count [ithread];
        thread_arg [ithread].phMutex = &hMutex;
        worker_t [ithread] = (HANDLE)_beginthreadex (NULL, 0, worker,
            &thread_arg [ithread], 0, &ThId);
    }

    /* Wait for the threads to complete. */
    WaitForMultipleObjects (nthread, worker_t, TRUE, INFINITE);
    _tprintf (_T ("Worker threads have terminated\n"));
    for (ithread = 0; ithread < nthread; ithread++) {
        _tprintf (_T ("Tasks completed by thread %5d: %6d\n"),
            ithread, task_count [ithread]);
        CloseHandle (worker_t [ithread]);
    }
    CloseHandle (hMutex);
    free (worker_t);
    free (task_count);
    free (thread_arg);
    return 0;
}
DWORD WINAPI worker (void *arg)
{
    THARG *thread_arg;
    int ithread;

    thread_arg = (THARG *) arg;
    ithread = thread_arg->thread_number;
    while (*thread_arg->tasks_complete <
            thread_arg->tasks_to_complete) {
        delay_cpu (DELAY_COUNT);
        WaitForSingleObject (*(thread_arg->phMutex), INFINITE);
        (*thread_arg->tasks_complete)++;
        ReleaseMutex (*(thread_arg->phMutex));
    }
    return 0;
}
You can use the timep program from Chapter 6 (Program 6-2) to examine the behavior of the different implementations. Tests performed on otherwise idle systems with 256,000 work units and 1, 2, 4, 8, 16, 32, 64, and 128 worker threads show the results summarized in the list below.
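For example, the 64-thread case with 256,000 total work units (4,000 per thread) corresponds to a command such as the following, assuming timep and statsMX are built and on the path:

timep statsMX 64 4000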
- For a small number of threads (4 or fewer), the NS (no synchronization), IN (interlocked functions), and CS (CRITICAL_SECTION) programs all require about the same amount of time. The CS version can be marginally (10 to 20 percent) slower, showing a typical synchronization slowdown. The MX (mutex) version, however, requires two to three times as long to execute.
- CS performance does not always scale with the number of threads on a single-processor system once the thread count reaches 5 or more. Results vary from one NT5 system to another but appear to be consistent on a given system. On some systems, elapsed time doubles at each step through 1, 2, 4, . . . threads, but in one case (Windows 2000 on a 1 GHz Pentium laptop) the times, in seconds, were 0.5, 1.0, 2.0, 4.0, 14.9, 16.0, 32.1, and 363.4; in another (Windows 2000 on a 50 MHz Pentium desktop), they were 1.2, 2.3, 4.7, 9.3, 42.7, 101.3, 207.8, and 1212.5. The breaks usually occur after 4 or 8 threads, but performance is acceptable until the thread count reaches 128.
- MX performance is slower than CS on a single processor, with ratios varying from 2:1 to 10:1, depending on the system.
- Performance on an SMP system can be very poor, slower by a factor of 10 or even 100. Intuitively, we would expect better performance from multiple processors, but the internal implementation means that the processors contend for locks and for memory access, which explains why the MX and CS results are nearly equivalent. Tuning the CS spin count can help, as described in a later section; a minimal sketch of the calls involved follows this list.
- A semaphore can be used to limit the number of ready worker threads without changing the basic programming model. This technique is the subject of a later section in this chapter; a brief sketch also follows this list.
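First, the spin count. This is a minimal sketch, not the tuned implementation described later; it shows only the two API calls involved, and the CRITICAL_SECTION name cs and the count values are illustrative starting points that must be validated by measurement.

#include <windows.h>

CRITICAL_SECTION cs;   /* Would guard the shared counters */

void init_cs_with_spin (void)
{
    /* On an SMP system, a waiting thread spins up to 4000 times
       before blocking on the associated kernel object; on a
       single processor, the spin count is ignored. */
    InitializeCriticalSectionAndSpinCount (&cs, 4000);

    /* The count can also be adjusted after initialization. */
    SetCriticalSectionSpinCount (&cs, 8000);
}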
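Second, the semaphore throttle. Again, this is a minimal sketch rather than the implementation from the later section; the names hThrottle, init_throttle, and worker_body are illustrative. A semaphore whose count equals the number of processors bounds how many workers are ready at any one time.

#include <windows.h>

static HANDLE hThrottle;   /* Illustrative throttling semaphore */

void init_throttle (void)
{
    SYSTEM_INFO si;
    GetSystemInfo (&si);
    /* Initial and maximum counts equal the processor count. */
    hThrottle = CreateSemaphore (NULL, (LONG) si.dwNumberOfProcessors,
        (LONG) si.dwNumberOfProcessors, NULL);
}

void worker_body (void)
{
    WaitForSingleObject (hThrottle, INFINITE);   /* Acquire a slot. */
    /* ... perform a batch of work units ... */
    ReleaseSemaphore (hThrottle, 1, NULL);       /* Release the slot. */
}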
Cautionary note:
The task_count array deliberately uses 32-bit integers, both to allow for a large task count and to avoid the potential for "word tearing" or a "cache line conflict" on SMP systems. If the counts occupied less than a full word, two separate processors, running adjacent worker threads, could concurrently modify adjacent task counts packed into the same word, each making the modification in its own cache (32 bits at a time on Intel x86 systems). Only one cache, however, would actually be written back to memory, producing invalid results. Prevention requires that each thread's working storage be properly separated and aligned according to the word and cache line size. In this example, the task count could also be grouped with the thread argument, and there is no good reason not to use a 32-bit count. Exercise 9-6 explores this subject.
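One way to achieve that separation, shown here as a sketch that assumes a 64-byte cache line (older x86 processors used 32 bytes, so the size should be verified for the target system), is to pad and align each counter to a cache line of its own. PADDED_COUNT and CACHE_LINE are illustrative names, and __declspec(align) is Microsoft-specific.

#define CACHE_LINE 64   /* Assumed line size; verify for the target system */

/* Each counter occupies a full cache line, so adjacent worker
   threads never modify the same line. */
typedef __declspec(align(64)) struct _PADDED_COUNT {
    unsigned int tasks_complete;
    char pad [CACHE_LINE - sizeof (unsigned int)];
} PADDED_COUNT;

PADDED_COUNT task_count [64];   /* One padded slot per worker */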