Example: Merge-SortDivide and Conquer to Exploit SMP
This example shows how to use threads to get significant performance gains, especially on an SMP system. The basic idea is to divide the problem into component tasks, give each task to a separate thread, and then combine the results to get the complete solution. The Windows executive will automatically assign the threads to separate processors, so the tasks will be performed in parallel, reducing elapsed time.This strategy, often called the divide and conquer strategy or the work crew model, is useful both for performance and as an algorithm design method. The implementation of grepMT, Program 7-1, could be considered one example; it creates a thread for each file or pattern matching task. Appendix C shows that there are performance gains on SMP systems because the executive can schedule the threads on different processors.Next, consider another example in which a single task, sorting a file, is divided into subtasks delegated to separate threads.Merge-sort, in which the array to be sorted is divided into smaller arrays, is a classic divide and conquer algorithm. Each small array is sorted individually, and the individual sorted arrays are merged in pairs to yield larger sorted arrays. The pairwise merging continues until completion. Generally, merge-sort starts with arrays of size 1, which need no sorting. This example starts with larger arrays so that there is one array for each processor. Figure 7-2 is a sketch of the algorithm.Appendix C shows the timing results. Exercise 79 suggests that sortMT use GetSystemInfo to find the number of processors and then create one thread per processor.Notice that the program runs efficiently on single-processor systems with sufficient memory and gains a significant performance improvement on SMP systems. Caution: The algorithm as shown will work only if the number of records in the sort file is divisible by the number of threads and if the number of threads is a power of 2. Exercise 78 removes these limitations.Note:
In understanding this program, it is important to concentrate on the thread management logic separately from the logic that determines which portion of the array a thread is to sort. Notice too that the C library qsort function is used, so there is no need to be concerned with developing an efficient sort function.
Program 7-2. sortMT: Merge-Sort with Multiple Threads
/* Chapter 7. SortMT.
File sorting with multiple threads (a work crew).
sortMT [options] nt file */
#include "EvryThng.h"
#define DATALEN 56 /* Key: 8 bytes; Data: 56 bytes. */
#define KEYLEN 8
typedef struct _RECORD {
CHAR Key [KEYLEN]; TCHAR Data [DATALEN];
} RECORD;
#define RECSIZE sizeof (RECORD)
typedef RECORD * LPRECORD;
typedef struct _THREADARG { /* Thread argument */
DWORD iTh; /* Thread number: 0, 1, 2, ... */
LPRECORD LowRec; /* Low record */
LPRECORD HighRec; /* High record */
} THREADARG, *PTHREADARG;
static int KeyCompare (LPCTSTR, LPCTSTR);
static DWORD WINAPI ThSort (PTHREADARG pThArg);
static DWORD nRec; /* Total number of records to be sorted. */
static HANDLE * ThreadHandle;
int _tmain (int argc, LPTSTR argv [])
{
HANDLE hFile;
LPRECORD pRecords = NULL;
DWORD FsLow, nRead, LowRecNo, nRecTh, NPr, ThId, iTh;
BOOL NoPrint;
int iFF, iNP;
PTHREADARG ThArg;
LPTSTR StringEnd;
iNP = Options (argc, argv, _T ("n"), &NoPrint, NULL);
iFF = iNP + 1;
NPr = _ttoi (argv [iNP]); /* Number of threads. */
hFile = CreateFile (argv [iFF], GENERIC_READ | GENERIC_WRITE,
0, NULL, OPEN_EXISTING, 0, NULL);
FsLow = GetFileSize (hFile, NULL);
nRec = FsLow / RECSIZE; /* Total number of records. */
nRecTh = nRec / NPr; /* Records per thread. */
/* Allocate thread args and handle array
and space for the file. Read the complete file. */
ThArg = malloc (NPr * sizeof (THREADARG)); /* Thread args. */
ThreadHandle = malloc (NPr * sizeof (HANDLE));
pRecords = malloc (FsLow + sizeof (TCHAR));
ReadFile (hFile, pRecords, FsLow, &nRead, NULL);
CloseHandle (hFile);
LowRecNo = 0; /* Create the sorting threads. */
for (iTh = 0; iTh < NPr; iTh++) {
ThArg [iTh].iTh = iTh;
ThArg [iTh].LowRec = pRecords + LowRecNo;
ThArg [iTh].HighRec = pRecords + (LowRecNo + nRecTh);
LowRecNo += nRecTh;
ThreadHandle [iTh] = (HANDLE) _beginthreadex (NULL, 0,
ThSort, &ThArg [iTh], CREATE_SUSPENDED, &ThId);
}
for (iTh = 0; iTh < NPr; iTh++) /* Run all sort threads. */
ResumeThread (ThreadHandle [iTh]);
WaitForSingleObject (ThreadHandle [0], INFINITE);
for (iTh = 0; iTh < NPr; iTh++) CloseHandle (ThreadHandle [iTh]);
StringEnd = (LPTSTR) pRecords + FsLow;
*StringEnd = '\0';
if (!NoPrint) printf ("\n%s", (LPCTSTR) pRecords);
free (pRecords);
free (ThArg);
free (ThreadHandle);
return 0;
} /* End of _tmain. */
static VOID MergeArrays (LPRECORD, LPRECORD);
DWORD WINAPI ThSort (PTHREADARG pThArg)
{
DWORD GrpSize = 2, RecsInGrp, MyNumber, TwoToI = 1;
LPRECORD First;
MyNumber = pThArg->iTh;
First = pThArg->LowRec;
RecsInGrp = pThArg->HighRec - First;
qsort (First, RecsInGrp, RECSIZE, KeyCompare);
while ((MyNumber % GrpSize) == 0 && RecsInGrp < nRec) {
/* Merge with the adjacent sorted array. */
WaitForSingleObject (
ThreadHandle [MyNumber + TwoToI], INFINITE);
MergeArrays (First, First + RecsInGrp);
RecsInGrp *= 2;
GrpSize *= 2;
TwoToI *= 2;
}
_endthreadex (0);
return 0; /* Suppress a warning message. */
}
static VOID MergeArrays (LPRECORD p1, LPRECORD p2)
{
DWORD iRec = 0, nRecs, i1 = 0, i2 = 0;
LPRECORD pDest, p1Hold, pDestHold;
nRecs = p2 - p1;
pDest = pDestHold = malloc (2 * nRecs * RECSIZE);
p1Hold = p1;
while (i1 < nRecs && i2 < nRecs) {
if (KeyCompare ((LPCTSTR) p1, (LPCTSTR) p2) <= 0) {
memcpy (pDest, p1, RECSIZE);
i1++; p1++; pDest++;
}
else {
memcpy (pDest, p2, RECSIZE);
i2++; p2++; pDest++;
}
}
if (i1 >= nRecs) memcpy (pDest, p2, RECSIZE * (nRecs - i2));
else memcpy (pDest, p1, RECSIZE * (nRecs - i1));
memcpy (p1Hold, pDestHold, 2 * nRecs * RECSIZE);
free (pDestHold);
return;
}
Performance
Appendix C includes the results of sorting large files of 64-byte records using one, two, and four threads. SMP systems give significantly better results. Divide and conquer is more than just a strategy for algorithm design; it can also be the key to exploiting threads and SMP. The single-processor results can vary. On a system with limited memory (that is, insufficient physical memory to hold the entire file, in addition to the OS and other active processes), the use of multiple threads increases the sort time because the threads contend for available physical memory. On the other hand, multiple threads can improve performance with a single processor when there is sufficient memory. The results are also heavily dependent on the initial data arrangement, as discussed in Appendix C.
