Beowulf Cluster Computing with Linux, Second Edition [Electronic resources]

William Gropp; Ewing Lusk; Thomas Sterling

7.3 Parameter Studies


One straightforward application of task parallelism is the "parameter study", in which the same sequential program is run multiple times with different sets of input parameters. Since the program to be run is sequential, there is no communication among the workers, and the manager can be a simple script that communicates with the workers by means of the arguments to the sequential program and its standard output. We can start the workers with ssh and collect the output by using the popen system call, which returns a file descriptor we can select on and from which we can read the remote process's stdout.
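This popen/select pattern can be sketched locally before bringing ssh into the picture. The sketch below uses Python's subprocess and select modules; the two echo commands are placeholders standing in for the ssh-launched remote workers:

```python
import select
import subprocess

# Two local commands stand in for the ssh-launched remote workers.
cmds = [["echo", "worker 1 done"], ["echo", "worker 2 done"]]
procs = [subprocess.Popen(c, stdout=subprocess.PIPE, text=True) for c in cmds]
fds = [p.stdout for p in procs]

results = []
while fds:
    # Block until at least one worker has produced output (30 s timeout).
    ready, _, _ = select.select(fds, [], [], 30)
    for fd in ready:
        line = fd.readline()
        if line:                  # a line of output from a worker
            results.append(line.strip())
        else:                     # empty read means EOF: the worker exited
            fds.remove(fd)
            fd.close()
for p in procs:
    p.wait()
```

Replacing each echo command with an ssh invocation gives exactly the manager structure used later in this section.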

Although both the algorithm we use and its implementation are general, we present here a concrete example. We explore the parameter space of compiler options for the default Linux C compiler gcc. The man page for gcc conveniently lists in one place all the options that can be passed to gcc to cause it to produce faster machine code. Here is an excerpt from the man page:


Optimization Options
-fcaller-saves -fcse-follow-jumps -fcse-skip-blocks
-fdelayed-branch -felide-constructors
-fexpensive-optimizations -ffast-math -ffloat-store
-fforce-addr -fforce-mem -finline-functions
-fkeep-inline-functions -fmemoize-lookups
-fno-default-inline -fno-defer-pop
-fno-function-cse -fno-inline -fno-peephole
-fomit-frame-pointer -frerun-cse-after-loop
-fschedule-insns -fschedule-insns2
-fstrength-reduce -fthread-jumps -funroll-all-loops
-funroll-loops -O -O2 -O3

For the matrix-matrix multiply program we are going to test with, which has no function calls, only some of these look useful. Here is a subset of the above list containing optimization flags that might have an effect on the speed of the program:


-fexpensive-optimizations
-ffast-math
-ffloat-store
-fno-peephole
-fschedule-insns
-fschedule-insns2
-fstrength-reduce
-funroll-all-loops
-funroll-loops
-O
-O2
-O3

Since there are twelve switches that can be either present or absent, there are 2^12 possible combinations. These are not completely independent, since some switch settings imply others, especially the three -O flags, but we will ignore this fact for the sake of simplifying our example and just try all 4096 combinations. Indeed, which switch settings are redundant in the presence of others should be deducible from our results!

Our plan will be to take a simple test program and compile it with all possible switch combinations and run it, reporting back the times. Since we have 4096 jobs to run, the use of a cluster will make a big difference, even if the individual tasks are short.

For our test program, we will use a straightforward matrix-matrix multiply program, shown in Figure 7.10. It multiplies two 300 x 300 matrices, timing the calculation; this may not be the highest-performing way to do this, but it will do for our purposes. The program echoes its command-line arguments, which it does not otherwise use; we will use them to help us record the arguments used to compile the program.








#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#define SIZE 300
int main(int argc, char *argv[])
{
double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];
int i, j, k;
struct timeval tv;
double starttime, endtime;
for (i = 0; i < SIZE; i++)
for (j = 0; j < SIZE; j++)
a[i][j] = (double) (i + j);
for (i = 0; i < SIZE; i++)
for (j = 0; j < SIZE; j++)
b[i][j] = (double) (i + j);
for (i = 0; i < SIZE; i++)
for (j = 0; j < SIZE; j++)
c[i][j] = 0.0;
gettimeofday( &tv, ( struct timezone * ) 0 );
starttime = tv.tv_sec + (tv.tv_usec / 1000000.0);
for (i = 0; i < SIZE; i++) {
for (j = 0; j < SIZE; j++) {
for (k = 0; k < SIZE; k++) {
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
}
}
gettimeofday( &tv, ( struct timezone * ) 0 );
endtime = tv.tv_sec + (tv.tv_usec / 1000000.0);
printf("%f seconds for", endtime - starttime);
for (i = 1; i < argc; i++)
printf(" %s", argv[i]);
printf("\n");
return 0;
}






Figure 7.10: Matrix-matrix multiply program

Our worker programs will just be the sequence

gcc <flags> -o matmult matmult.c
matmult

and the manager will start them with ssh, on hosts whose names are in a file. The other argument to our manager is a file of possible arguments. It contains exactly the twelve lines listed above. The manager just steps through the numbers from 0 up to the total number of runs (in our case 4096), treating each number as a binary number in which a 1 bit represents the presence of the compiler switch corresponding to that position in the binary number. Thus we will run through all possible combinations.
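This bit-to-flag mapping can be sketched in isolation. The function name flagsForTask and the three-flag list below are illustrative stand-ins (the real manager reads the full twelve-line argument file):

```python
def flagsForTask(pcnt, parms):
    """Interpret bit i of pcnt as 'include the flag at position len(parms)-i-1'."""
    parmSet = []
    for i in range(len(parms)):
        if pcnt & (1 << i):
            parmSet.append(parms[len(parms) - i - 1])
    return parmSet

# Abbreviated stand-in for the twelve-line argument file.
parms = ["-ffast-math", "-funroll-loops", "-O2"]
# Task 0 compiles with no flags; task 2**len(parms) - 1 uses all of them.
```

Counting pcnt from 0 to 2**len(parms) - 1 therefore enumerates every subset of the flag list exactly once.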

The overall plan is to loop through the parameter space represented by the binary numbers from 0 to 2^12 - 1. If there is a free host (no worker is working there), we assign it the next task; if not, we select on the sockets that are open to currently working workers. When one of them reports back, we add it back to the list of free hosts. At the end, after all the work has been assigned, we still have to wait for the last tasks to complete.

Let us step through the code in Figure 7.11 in detail. First we read in the list of hosts (the initial value of the list freeHosts) and the list of possible arguments (parmList). We initialize the set of sockets to select on to empty, since there are no workers yet, and create an empty Python dictionary (fd2host) where we will keep track of busy hosts and the connections to them. We set numParmSets to the number of subtasks, which we can calculate from the number of possible compiler flags in the input file. Then we enter the main loop, which runs until we have assigned all the work and there are no outstanding workers working. If there is a subtask still to do and a free host to do it on, we construct the parameter list corresponding to the next task (in parmSet) and pick the first host from the list of free hosts, temporarily removing it from the list. We then build a string containing the specification of the subtask. The Popen3 command forks a process that runs the ssh program locally, which runs the gcc-matmult sequence remotely. The ssh's, and therefore the remote processes, run in parallel.








#!/usr/bin/python
from sys import argv
from popen2 import Popen3
from select import select, error

hostFile = open(argv[1])
parmsFile = open(argv[2])
freeHosts = [ line.strip() for line in hostFile.readlines() ]
parmList = [ line.strip() for line in parmsFile.readlines() ]
lenParmList = len(parmList)
socketsToSelect = []
fd2host = {}
numParmSets = 2 ** lenParmList
pcnt = 0
while pcnt < numParmSets or socketsToSelect:
    if pcnt < numParmSets and freeHosts:
        parmSet = []
        for i in range(0, lenParmList):
            bit = 1 << i
            if bit & pcnt:
                parmSet.append(parmList[lenParmList-i-1])
        host = freeHosts[0]
        freeHosts.remove(host)
        cmd = ("ssh -l lusk -x -n %s 'gcc %s -o matmult matmult.c; " +
               "matmult %s'") % (host, ' '.join(parmSet), ' '.join(parmSet))
        runner = Popen3(cmd)
        runfd = runner.fromchild
        socketsToSelect.append(runfd)
        fd2host[runfd] = host
        pcnt += 1
    else:
        (readyFDs, writeFDs, exceptFDs) = select(socketsToSelect, [], [], 30)
        for fd in readyFDs:
            line = fd.readline()
            if line:
                print '%s on %s' % (line.strip(), fd2host[fd])
            else:
                freeHosts.append(fd2host[fd])
                socketsToSelect.remove(fd)
                fd.close()






Figure 7.11: Manager for parameter study

We set runfd to the stdout of the ssh, which collects the stdout from matmult. Each line of stdout will contain the time followed by the compiler flags used. Then we add this fd to the list of sockets available for selecting on and enter into the dictionary fd2host the host attached to that fd.

If there are no free hosts or we have already assigned all the subtasks, then we select on the sockets connected to busy workers. When one of those sockets becomes active, it means that the associated worker has sent us a line of output. We read it and print it; if the read fails (the worker exited, which sends an EOF on the socket), we close that socket, take it out of the list of sockets to select on, and add the corresponding host to the list of free hosts, since it can now be assigned another subtask.


The manager exits once all the subtasks have been done and all the workers have completed. If we run this with


parmstudy.py hostfile parmfile | sort -n

then we will get the best combinations at the top of the list. Try it!
