High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI

Joseph D. Sloan


15.2 Problem Decomposition


When decomposing a program, we
will talk in terms of
tasks. The meaning of this word may vary
slightly depending upon context. Typically, a task is a portion of a
program that can be executed as a unit. It may be used to mean that
part of a program that can become an independent process, or it may
be used to mean a piece of the work that that process will execute.
It should be clear from context which meaning is intended.

Let's begin by looking at some of the issues
involved in decomposing a problem into parallelizable parts. The
first issue we must face is task granularity.
Depending on the problem, it may be broken into very small pieces
(fine granularity), into relatively large pieces (coarse
granularity), or into a mixture of pieces of varying sizes.

Granularity, in
one sense, establishes a limit on how many compute nodes or
processors you may be able to use effectively. For example, if you
are multiplying two 10 by 10 matrices, the result has 100 entries, each
the dot product of one row with one column. If computing a single entry
is treated as the smallest indivisible task, you won't be able to
divide this problem into more than 100 pieces. Consequently, having
more than 100 processors won't allow you to compute the
product any faster. In practice, the number of processors you
can effectively use will be lower. It is essential to realize that
there are a number of trade-offs that must be balanced when dividing
a problem. In particular, coarse granularity tends to limit
communication overhead but may result in increased idle time and poor
processor utilization. We will discuss each of these concerns in
detail in this chapter.

We can also speak of the degree of concurrency, i.e., the number of
tasks that can execute at the same time. Realize that this will vary
during program execution depending on the point you are at in the
program. Thus, it is often more meaningful to talk about the maximum
or the average degree of concurrency of a program. Generally, both
the maximum and average concurrency are larger with fine-grained than
with coarse-grained decompositions.

A data (or task) dependency graph (or diagram) is one way of visually
representing a
program. This can be helpful when investigating and describing
potential concurrency. The idea is to break the algorithm into pieces
of code or tasks based on the data required by that task. A graph is
then drawn for the algorithm that shows the set of tasks as nodes
connected by arrows indicating the flow of data between connected
pairs of tasks.

Figure 15-1 is a data dependency graph for the
numerical integration program developed in Chapter 13. The amount of detail will vary in these
graphs depending on your purpose. In this case, I've
kept things very simple. If you desire, you can increase the detail
to the point of having a single node for each instruction and arrows
for each variable.[1]

[1] Some authors distinguish between
data and task dependency graphs and between dependencies and
interactions. Feel free to adjust your graphs as you see fit.


The idea is that graphs such as these help you think about and locate
potential concurrencies in your code. If you have two blocks of code
or tasks that don't depend on each other and have
everything they need to execute, these are potentially parallelizable
tasks. Data flow graphs can be used for both data and task
partitioning. Data flow graphs should also provide you with some idea
of the critical paths through code, i.e., those paths that will
likely take the longest amount of time to complete. You
won't be able to shorten the runtime of a program to
less time than it takes to complete the critical path. In other
words, if you want to shorten the runtime of a program, you must
shorten the critical path.


Figure 15-1. Data flow for numerical integration

There are some limitations to this approach. You'll
need to give loops some thought when drawing these graphs, since the
body of a loop is a potentially parallelizable piece of code. The
essential step in parallelizing the numerical integration problem in
Chapter 13 was packaging, as individual tasks, the pieces of the loop
that calculate the areas of the individual rectangles. You should also
realize that the graph provides no information about the relative
execution time for each task. Finally, and perhaps most important, the
graph doesn't clearly indicate how idle time might show up. Depending
on how we code the Consolidate Results task in Figure 15-1, most of the
Calculate Chunk blocks may be idle waiting for an opportunity to report
their results. (Moreover, depending on how they are coded, the
individual Calculate Chunk tasks may not all be of the same length.)


15.2.1 Decomposition Strategies


There are several different
decomposition strategies worth considering. Roughly speaking,
decomposition strategies fall into two different
categories: data decomposition, sometimes called data
partitioning, and control decomposition or task
partitioning. With data decomposition, the data
is broken into individual chunks and distributed among processes that
are essentially similar. With control decomposition, the problem is
divided in such a way that each process is doing a different
calculation. In practice, many algorithms show characteristics of
both strategies.


15.2.1.1 Data decomposition

Data decomposition is generally
much easier to program than control decomposition and is usually the
best approach when trying to adapt serial algorithms for parallel
use. Data decomposition also tends to scale very well, a crucial
consideration when dealing with problems that may grow.

The numerical integration program from the last chapter used data
decomposition. Each process had a different set of bounds, so the
area that each calculated was different, but the procedure was the
same.
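
As a reminder of what that looks like in code, here is a minimal MPI
sketch of the same idea. It is not the program from Chapter 13; the
integrand f, the interval, and the number of rectangles are placeholder
choices, and it assumes the number of rectangles divides evenly among
the processes.

#include <stdio.h>
#include <mpi.h>

/* Placeholder integrand; Chapter 13 used a specific function. */
static double f(double x) { return x * x; }

int main(int argc, char *argv[])
{
    int rank, nproc, i;
    double a = 0.0, b = 1.0;        /* interval to integrate over */
    int    n = 1000000;             /* total number of rectangles */
    double width, local_a, local_sum = 0.0, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    width = (b - a) / n;
    /* Data decomposition: every process runs the same loop, but over
       a different chunk of rectangles determined by its rank. */
    local_a = a + rank * (n / nproc) * width;
    for (i = 0; i < n / nproc; i++)
        local_sum += f(local_a + (i + 0.5) * width) * width;

    /* Combine the partial areas on process 0. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Estimated area: %f\n", total);

    MPI_Finalize();
    return 0;
}

Each process executes identical code; only the bounds it works on
differ, which is the hallmark of data decomposition.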

One of the most common approaches to data decomposition is a
divide-and-conquer strategy. This works
particularly well with recursive algorithms. If a problem can be
treated as a set of independent subproblems, it is an ideal candidate
for data decomposition. Consider the problem of finding the largest
value in a large collection of data. The data could be divided into
different sets, the largest in each set could be found, and finally,
this collection of largest values could be examined. Finding the
largest value in each of the smaller sets could be handled by a
different processor. Finding the final answer is an ideal use of
MPI_Reduce. This is a pretty trivial example of
how divide and conquer works.
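
A minimal sketch of that final step, assuming each process has already
scanned its own chunk of the data and stored the largest value it found
in local_max (faked here so the example runs):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double local_max, global_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Stand-in for the per-process maximum found in the local chunk. */
    local_max = (double) rank;

    /* MPI_Reduce with MPI_MAX combines the per-process maxima into
       a single global maximum on process 0. */
    MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Largest value: %f\n", global_max);

    MPI_Finalize();
    return 0;
}

Every process must make the MPI_Reduce call; the combined result is
meaningful only on the root process.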

For a more involved example, consider the merge sort
algorithm.[2] The serial algorithm
takes a set of data, divides it into smaller sets of data, sorts
these smaller individual data sets, and then merges the sorted sets
back together. To sort a smaller set of data, merge sort uses the
same strategy recursively. Eventually, the smaller sets of data are
reduced to sets of single items that are obviously sorted. Merging
sorted data is straightforward since you only have to compare the
first item in each group and select accordingly until
you've worked your way through the smaller sets.

[2] The sorting algorithm described here is
just one possible approach, not necessarily the best. Sorting in a
parallel environment is particularly difficult and is an area of
active, ongoing research.
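
For reference, here is a compact serial merge sort. It is not taken
from the book, just a sketch of the divide, sort, and merge steps
described above:

#include <stdio.h>
#include <string.h>

/* Recursive merge sort: split, sort each half, merge the halves. */
static void merge_sort(int *a, int n)
{
    int mid, i, j, k;

    if (n < 2)                      /* a single item is already sorted */
        return;

    mid = n / 2;
    merge_sort(a, mid);             /* sort the left half  */
    merge_sort(a + mid, n - mid);   /* sort the right half */

    /* Merge the two sorted halves into a temporary buffer. */
    int tmp[n];
    i = 0; j = mid; k = 0;
    while (i < mid && j < n)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < n)   tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof(int));
}

int main(void)
{
    int data[] = { 42, 7, 19, 3, 88, 51, 23, 11 };
    int n = sizeof(data) / sizeof(data[0]);
    int i;

    merge_sort(data, n);
    for (i = 0; i < n; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}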


In a parallel environment, you'll want to divide the
data equally among the available processors, but you probably
won't want to continue dividing up the data beyond
that point because of the communications overhead. Once you have the
data distributed, you'll need to sort it locally on
each individual processor. You could use the serial version of merge
sort or some other serial sorting algorithm.

Merging the data back together will be more problematic. Not only
will you need code to merge two data sets, but
you'll need to develop a communications strategy to
do this efficiently. If you use a single process to collect and merge
data, you will have a large amount of idle time. A more appropriate
strategy is to have pairs of processes merge their data, i.e., one
sends its data and dies while the other receives the sent data and
merges that data with its own. Repeat this strategy with the
remaining processes until only a single process remains. It will have
all the data sorted.

For example, if you have eight processes, processes 0 and 1,
processes 2 and 3, processes 4 and 5, and processes 6 and 7 could all
merge their data at the same time. Next, processes 0 and 2 and
processes 4 and 6 could merge their data simultaneously. Finally,
processes 0 and 4 could merge their data. This strategy, shown in
Figure 15-2, has three sets of parallel merges or
stages. This is much more efficient than having process 0 merge its
data repeatedly with each of the other seven processes sequentially,
a seven-stage procedure.


Figure 15-2. Merging data

With this strategy, for instance, 1,024 processes could merge their
data in 10 stages. It would take 1,023 stages with a single receiving
process, roughly 100 times as long.
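
The communication pattern in Figure 15-2 can be expressed with a little
rank arithmetic. The sketch below is one possible implementation, not
code from the book: each process starts with a tiny, already-sorted
chunk, and at each stage half of the remaining processes send their
data to a partner and drop out.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Merge two sorted int arrays into a newly allocated array. */
static int *merge_sorted(const int *a, int na, const int *b, int nb)
{
    int *out = malloc((na + nb) * sizeof(int));
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return out;
}

int main(int argc, char *argv[])
{
    int rank, nproc, step, count, i;
    int *data;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Stand-in for this process's locally sorted share of the data. */
    count = 2;
    data = malloc(count * sizeof(int));
    data[0] = rank;
    data[1] = rank + nproc;

    /* Tree-structured merge: stages correspond to Figure 15-2. */
    for (step = 1; step < nproc; step *= 2) {
        if (rank % (2 * step) == 0) {
            int partner = rank + step;
            if (partner < nproc) {
                MPI_Status st;
                int incoming;
                int *buf, *merged;
                MPI_Probe(partner, 0, MPI_COMM_WORLD, &st);
                MPI_Get_count(&st, MPI_INT, &incoming);
                buf = malloc(incoming * sizeof(int));
                MPI_Recv(buf, incoming, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                merged = merge_sorted(data, count, buf, incoming);
                free(data);
                free(buf);
                data = merged;
                count += incoming;
            }
        } else {
            /* Send everything to the partner and stop participating. */
            MPI_Send(data, count, MPI_INT, rank - step, 0, MPI_COMM_WORLD);
            break;
        }
    }

    if (rank == 0) {
        printf("Merged %d items:", count);
        for (i = 0; i < count; i++)
            printf(" %d", data[i]);
        printf("\n");
    }
    free(data);
    MPI_Finalize();
    return 0;
}

Using MPI_Probe and MPI_Get_count lets the receiver size its buffer
without knowing in advance how much data its partner holds.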


15.2.1.2 Control decomposition

With control decomposition, each
processor has different tasks. One common model for control
decomposition is
pipelining or
stream
parallelism. With pipelining, each task, except the first and last,
plays the role of both producer and consumer. A task receives or
consumes data, processes that data, and then sends the results on to
the next consumer. For example, consider a set of processes designed
to manipulate a video stream. The first process might crop a frame,
the second might adjust brightness within the frame, the third might
adjust color levels, etc. Each process does something different and
will require radically different code.
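
A minimal sketch of that structure using MPI, assuming one process per
stage; the per-frame work is reduced to placeholder arithmetic,
standing in for the crop, brightness, and color operations described
above:

#include <stdio.h>
#include <mpi.h>

#define NFRAMES 8   /* number of "frames" to push through the pipeline */

int main(int argc, char *argv[])
{
    int rank, nproc, i;
    double frame;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    for (i = 0; i < NFRAMES; i++) {
        if (rank == 0) {
            /* First stage: produce a frame and pass it on. */
            frame = (double) i;
        } else {
            /* Consume a frame from the previous stage. */
            MPI_Recv(&frame, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Each stage does something different to the frame; here the
           work is just placeholder arithmetic. */
        frame = frame * 10.0 + rank;

        if (rank < nproc - 1) {
            /* Pass the processed frame to the next consumer. */
            MPI_Send(&frame, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        } else {
            /* Last stage: the finished frame leaves the pipeline. */
            printf("Frame %d finished: %f\n", i, frame);
        }
    }

    MPI_Finalize();
    return 0;
}

Once the pipeline is primed, each stage works on a different frame at
the same time, which is where the concurrency comes from.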

Note that the second process must wait for the first process to
finish before it can begin since the second process consumes and
processes the data produced by the first. Similarly, the third
process can't begin until the second sends its data,
and so on. Getting enough data into the system so that all processes
are active is referred to as priming the pipeline. Figure 15-3 shows how processes overlap.


Figure 15-3. Ideal process overlap

You must have a lot more data than processes for this approach to be
efficient. Otherwise, the idle time at both the beginning and at the
end will render this approach pointless. Granularity is a key
consideration here. If the granularity is coarse, priming the
pipeline is particularly costly.

A second issue with pipelining is balance. Each of the processes must
run for about the same amount of time. If one process takes much
longer than the other processes, they will be idle much of the time
and overall efficiency will be lost. (This is a likely problem with
video processing, for example, as described.) Figure 15-4 shows the effect of having one process take
longer. Note the idle time.


Figure 15-4. Process overlap with idle time

However, even though task 2 takes twice as long as the other tasks in
this four-task example, there is still a speedup using the pipeline.

A number of algorithms fall between these two extremes. That is, they
appear to have elements of both strategies. For example, a common
approach in artificial intelligence is to describe algorithms in
terms of a search space. A fundamental part of a chess-playing
program is to examine a number of different moves to see which
appears to be the best. Since it evaluates each move in the same
manner, it is reasonable to approach this as a data decomposition
problem. Each process will be given a different board configuration,
i.e., a different set of data. But once the data has been
distributed, the different processes go their different ways. One
process may terminate quickly having determined the board position is
abysmal (a process known as pruning), while another may be following
a hot move recursively through several levels.

