High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI

Joseph D. Sloan


17.5 Profilers


Thus
far we have been looking at timing code manually. While this provides
a lot of control, it is labor intensive. The alternative to manual
timing is to use a profiler. A profiler attempts to capture the
profile of a program in execution; that is, a set of timings for an
application that maps where time is spent within the program. While
with manual timing you'll want to focus on part
of a program, a profiler typically provides information for the
entire application in one fell swoop. While a profiler may give more
results than you actually need, you are less likely to overlook a
hotspot in your code using a profiler, particularly when working with
very complicated programs. Most profilers are easy to use and may
give you some control over how much information is collected. And
most profilers not only collect information, but provide a mechanism
for analyzing the results. Graphical output is common, a big help
with large, complicated programs.

There are a number of profilers available, particularly if you
include commercial products. They differ in several ways, but it
usually comes down to a question of how fine a granularity you want
and how much detail you need. Choices include information on a
line-by-line basis, information based on the basic blocks, or
information based on function calls or modules. Profilers may provide
timing information or simply count the number of times a statement is
executed.

There are two basic categories of profilers: active and passive.
(Some profilers, such as gprof, have features
that span both categories.) A passive profiler gathers information
without modifying the code. For example, a passive profiler might
collect information by repeatedly sampling the program counter while
the program is running. By installing an interrupt service routine
that wakes up periodically and examines the program counter, the
profiler can construct a statistical profile for the program.

While passive profiles are less intrusive, they have a couple of
problems. First, they tend to provide a flat view of functions within
an application. For example, if you have a function that is called by
several different functions, a passive profiler will give you an idea
of how often the function is called, but no information about what
functions are making the call. Second, passive profilers are
inherently statistical. Key performance issues are how often you
sample and how many samples are taken. The quality of your results
will depend on getting these parameters right.

Active profilers add code to the program to collect profile
information. Code can be added either directly to the source code or
to the object code generated by the compiler. While active profilers
avoid some of the problems passive profilers face, they introduce
their own set of problems. Perhaps the most significant of these is
the confounding effect on timing measurements caused by the execution
of the added instructions.

In this section, we'll look at two profilers,
gprof and gcov. These
profilers modify the application so that it will count how often
functions are called and individual instructions are executed,
respectively. (Additionally, gprof uses passive
profiling to estimate execution times.) The two tools
we'll examine use compiler options to add the
profiling code to the object code generated by the compiler. While
both of these tools were designed for serial programming and were not
intended for use with parallel programs, they can, with a little
added effort, be used with parallel programs. We'll
examine their use with serial code first and then look at how they
can be used with parallel code.


17.5.1 gprof


gprof
generates profiles and call graphs (a report of which routines call
which). There is both a Berkeley Unix gprof and
a GNU gprof that provide essentially the same
information. gprof is available on most systems
today. (If your system doesn't have
gprof, you might look to see if
prof
is installed. While it's not as versatile as
gprof, you'll still be able to
generate the flat profiles described in this section.)
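Before relying on either tool, it may help to check what is actually installed. Here is a small shell sketch:

```shell
# Check which of the profilers mentioned above is on this system.
for tool in gprof prof; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: available"
    else
        echo "$tool: not found"
    fi
done
```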

Using gprof is remarkably straightforward.
You'll need to compile and link the program you want
to profile using the -pg option. (Compiler
optimizations shouldn't cause much trouble with
gprof.) Next, run the program. This will create
a file gmon.out with the profile information. If the file
already exists from a previous run, it will be overwritten. The
gprof program is then run to convert the binary
output in gmon.out into a readable format.

Here is an example:

[sloanjd@amy PROFILE]$ gcc demo.c -pg -o demo
[sloanjd@amy PROFILE]$ ./demo
[sloanjd@amy PROFILE]$ gprof -b demo > demo.gprof

This example uses the
-b option to
suppress the annotations that gprof includes by
default. You may want to leave this off the first time you run
gprof, but if you run gprof
often, you'll want to use this option.
gprof has other options that you might want to
investigate if you find you are using it a lot, such as excluding
functions from the reports. In this example, I've
also redirected the output to a file to make it easier to examine.

Here is the code that I profiled.

main( ) 
{ int i;
for (i=0; i<200; i++)
{ foo( );
bar( );
baz( ); }
}
foo( )
{ int j;
for (j=0; j<250; j++) bar( );
}
bar( )
{ int k;
for (k=0; k<50; k++);
}
baz( )
{ int m;
for (m=0; m<100; m++) bang( );
}
bang( )
{ int n;
for (n=0; n<200; n++) bar( );
}

As you can see, it doesn't do anything worthwhile,
but it provides a reasonable demonstration of
gprof's capabilities.

The output from gprof consists of three parts: a
flat profile, a call graph, and a function index. Here is the first
part of the output, the flat profile:

Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
96.97 6.09 6.09 4050200 0.00 0.00 bar
1.27 6.17 0.08 20000 0.00 0.30 bang
1.27 6.25 0.08 200 0.40 30.87 baz
0.48 6.28 0.03 200 0.15 0.53 foo

This is not difficult to interpret. The routines are profiled, one
per line, with their names given in the last column
name. The column % time gives
the amount of time spent in each routine. self
seconds
gives the time spent in an individual routine while
cumulative seconds provides the total for the
named routine along with the ones above it. calls
reports the number of times the function is called while
self ms/call gives the average time spent in the
routine per call. total ms/call gives the average
time spent in the function and its descendants. The per-call times are
given in milliseconds.
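Since the hot spot is often all you want from a flat profile, you can pull it out mechanically. This is just a sketch: it assumes the report was saved as demo.gprof, as in the earlier example (a sample of that output is embedded here), and relies on the heuristic that data rows have seven fields and begin with a percentage.

```shell
# Embed a sample flat profile (from the run above) for illustration.
cat > demo.gprof <<'EOF'
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 96.97      6.09     6.09  4050200     0.00     0.00  bar
  1.27      6.17     0.08    20000     0.00     0.30  bang
EOF
# Print the first data row's function name: rows start with a
# percentage (a number > 0) and have seven fields.
awk '$1+0 > 0 && NF==7 {print "hotspot:", $NF; exit}' demo.gprof
# prints: hotspot: bar
```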

It should be clear that optimizing or eliminating calls to
bar would provide the greatest improvement.
What's not clear is the interrelationship among the
calls. Since bar is called by several functions,
improvement to these will help as well. To see this you need more
information. This is provided by a call graph.

Here is the call graph for this code:

                        Call graph
granularity: each sample hit covers 4 byte(s) for 0.16% of 6.28 seconds
index % time self children called name
<spontaneous>
[1] 100.0 0.00 6.28 main [1]
0.08 6.09 200/200 baz [2]
0.03 0.08 200/200 foo [5]
0.00 0.00 200/4050200 bar [4]
-----------------------------------------------
0.08 6.09 200/200 main [1]
[2] 98.3 0.08 6.09 200 baz [2]
0.08 6.01 20000/20000 bang [3]
-----------------------------------------------
0.08 6.01 20000/20000 baz [2]
[3] 97.0 0.08 6.01 20000 bang [3]
6.01 0.00 4000000/4050200 bar [4]
-----------------------------------------------
0.00 0.00 200/4050200 main [1]
0.08 0.00 50000/4050200 foo [5]
6.01 0.00 4000000/4050200 bang [3]
[4] 97.0 6.09 0.00 4050200 bar [4]
-----------------------------------------------
0.03 0.08 200/200 main [1]
[5] 1.7 0.03 0.08 200 foo [5]
0.08 0.00 50000/4050200 bar [4]
-----------------------------------------------

The first thing you'll notice is that the numbers in
the % time column don't add up to
100 percent. The reason is that each timing includes the total time
spent in the function and its children. The next two columns give the
actual time spent in the function (self) and in
the functions it calls (children). The
called column gives the number of times the
function was called. When two numbers are given in this column, the
top number is the number of calls by the parent to the child and the
bottom number is the total number of calls to the child. For example,
consider the next to last line in section [4].

                6.01    0.00 4000000/4050200     bang [3]

Since section [4] describes calls to bar, this
line tells us that bang makes 4,000,000 of the
4,050,200 total calls made to bar.
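You can check the arithmetic directly; this one-liner simply recomputes the ratio from the numbers in that line:

```shell
# bang's share of the calls to bar, from the 4000000/4050200 entry above
awk 'BEGIN { printf "bang makes %.1f%% of the calls to bar\n", 100 * 4000000 / 4050200 }'
# prints: bang makes 98.8% of the calls to bar
```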

The entries in the table are numbered, with a unique number assigned to
each function. This number is printed next to each function to make it
easier to look up individual functions. A function index is also
included at the end of the report.

Index by function name
[3] bang [2] baz
[4] bar [5] foo

This is relatively useless for a small program but can be helpful
with larger programs.

As noted above, the number of function calls is determined actively,
so this number should be totally accurate. The percentage of time in
each is determined passively with a sampling rate that may be too
slow for short programs to capture enough data to be statistically
reliable. You will have observed that the basic granularity of
gprof is a routine. Depending on the code, this
may not provide enough information. For example, it may not be
immediately obvious whether the poor performance you are seeing is
the result of the computational complexity of your code or of poor
memory utilization if you are looking at a large routine. To improve
resolution and gain more information, you could break a program into
more routines. Or you could use gcov.


17.5.2 gcov


gcov
is a test-coverage program that is often used as a profiler. When a
test suite is designed for a program, one of the objectives is to
exercise all parts of the code. A test-coverage program records the
number of times each line of code is executed. The idea is that you
can use this coverage data to ensure that your test suite is
adequate. Of course, this is also the sort of data you might want
when profiling code.

gcov is part of the gcc
development packages, so if you have gcc on your
system, you should have gcov. It does not work
with other compilers, but similar programs may be available for them
(e.g., tcov with Solaris).

To use
gcov, you need to compile
code with two options, -fprofile-arcs and
-ftest-coverage, which tell the compiler to add,
respectively, the additional code needed to generate a flow graph and
extra profiling information.

[sloanjd@amy PROFILE]$ gcc -fprofile-arcs -ftest-coverage demo.c -o demo

You should avoid optimizations when using gcov,
since optimizations that rewrite code will make the results difficult
to interpret. When the code is compiled, two new files will be
created with the same name as your program but with the extensions
.bb and .bbg. The first
contains a list of source files with line numbers corresponding to
basic blocks. The second is used to reconstruct a flow graph for the
program.

Once you have compiled the code, you'll need to run
it.

[sloanjd@amy PROFILE]$ ./demo

This will create yet another file with the run data. It will have the
same name as your program but with a .da
extension. This file is created in the directory where the original
program was compiled.

Finally, you can run gcov to generate your
report.

[sloanjd@amy PROFILE]$ gcov demo
100.00% of 13 source lines executed in file demo.c
Creating demo.c.gcov.

You'll notice that the command reports what
percentage of the source is actually executed, something you would
want to know if you are using it as a coverage tool rather than a
profiler. The actual report is created in a file with the extension
.gcov.

Here is an example for the demonstration program that we looked at
earlier.

                main( ) 
2 { int i;
201 for (i=0; i<200; i++)
200 { foo( );
200 bar( );
200 baz( ); }
}
foo( )
400 { int j;
200 for (j=0; j<250; j++) bar( );
}
bar( )
8100400 { int k;
4050200 for (k=0; k<50; k++);
}
baz( )
400 { int m;
200 for (m=0; m<100; m++) bang( );
}
bang( )
40000 { int n;
20000 for (n=0; n<200; n++) bar( );
}

As you can see, a count giving the number of times each line was
executed is appended to each line. Take care; if you execute the
program multiple times, these counts accumulate. If this
isn't what you want, delete or rename the data file
between runs.
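Here is a small sketch of clearing the data before a fresh run. The file name demo.da matches the example above; newer gcc releases use a .gcda extension instead, so the sketch removes both.

```shell
# Simulate stale data left by an earlier run, then clear it.
touch demo.da
rm -f demo.da demo.gcda
[ ! -e demo.da ] && echo "counts cleared; the next ./demo run starts from zero"
```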


17.5.3 Profiling Parallel Programs with gprof and gcov


Depending on the version of
MPI you are using, you may be able to use gprof
and gcov with your MPI programs. However,
you'll need to make some adjustments. If we naively
try to profile an MPI program with gprof or
gcov, we run into problems. Here is an example
using rect from Chapter 14:

[sloanjd@amy PROFILE]$ mpicc -pg rect.c -o rect
[sloanjd@amy PROFILE]$ mpirun -np 4 rect
Enter number of chunk to divide problem into:
7
Enter number of steps per chunk:
20
Enter low end of interval:
2.0
Enter high end of interval:
5.0
Area for process 1, is: 2.107855
Area for process 2, is: 2.999984
Area for process 3, is: 4.049546
Area for process 1, is: 5.256543
Area for process 2, is: 6.620975
Area for process 3, is: 8.142841
Area for process 1, is: 9.822141
The area from 2.000000 to 5.000000 is: 38.999885

Everything appears to work. All the expected files are created. Here
is the flat profile.

[sloanjd@amy PROFILE]$ gprof -bp rect
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 40 0.00 0.00 f
0.00 0.00 0.00 2 0.00 0.00 chunkArea
0.00 0.00 0.00 1 0.00 0.00 _GLOBAL__I_main

We've obviously profiled something. The question is
what. Because we are using NFS, all the processes are writing out
their logfiles to the same directory. Since files are overwritten,
what we have is the data from the last process to finish. From the
data, we can see this is either process 2 or process 3.

If we are going to avoid this problem, we need to use the local
filesystem on each node. For example, we could copy the compiled code
over to /tmp on each machine and run it from
there.[2]

[2] This example uses scp to
copy the file. If you have installed the C3 tools, discussed in Chapter 9, then you can use cpush
instead.


[sloanjd@amy PROFILE]$ cp rect /tmp/
[sloanjd@amy PROFILE]$ scp rect oscarnode1:/tmp/
rect 100% |*****************************| 372 KB 00:00
[sloanjd@amy PROFILE]$ scp rect oscarnode2:/tmp/
rect 100% |*****************************| 372 KB 00:00
[sloanjd@amy PROFILE]$ scp rect oscarnode3:/tmp/
rect 100% |*****************************| 372 KB 00:00
[sloanjd@amy PROFILE]$ cd /tmp
[sloanjd@amy tmp]$ mpirun -np 4 rect
Enter number of chunk to divide problem into:
7
Enter number of steps per chunk:
20
Enter low end of interval:
2.0
Enter high end of interval:
5.0
Area for process 1, is: 2.107855
Area for process 2, is: 2.999984
Area for process 3, is: 4.049546
Area for process 1, is: 5.256543
Area for process 2, is: 6.620975
Area for process 3, is: 8.142841
Area for process 1, is: 9.822141
The area from 2.000000 to 5.000000 is: 38.999885

Here is the flat profile for oscarnode1:

[sloanjd@oscarnode1 tmp]$ gprof -bp rect
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 60 0.00 0.00 f
0.00 0.00 0.00 3 0.00 0.00 chunkArea
0.00 0.00 0.00 1 0.00 0.00 _GLOBAL__I_main

And here is the flat profile for oscarnode2:

[sloanjd@oscarnode2 tmp]$ gprof -bp rect
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 40 0.00 0.00 f
0.00 0.00 0.00 2 0.00 0.00 chunkArea
0.00 0.00 0.00 1 0.00 0.00 _GLOBAL__I_main

If you compare these to the output, you'll see the
calls to chunkArea now make sense: it is
called three times by process 1 on oscarnode1
and twice by process 2 on oscarnode2. (We could
also look at the head node, amy, but because of
the master/slave design of the program, there isn't
much to see.)

Unfortunately, using gcov with a parallel
program is even more complicated. You'll recall that
the compile options used with gcov create two
additional files and that a datafile is created when the program is
run. To run gcov, you'll need
all these files as well as the executable and the original source
code file. To further complicate matters, the datafile that is
created when the program is run is created in the directory where the
source was compiled.

To use gcov on a cluster, first copy the source
code file to /tmp or a similar directory. Next,
compile the program with the appropriate switches. Once
you've done this, you'll need to
copy the compiled code, the original source code, the
.bb file, and the .bbg file
to each node on the cluster. Now you can run the program. A datafile
will be created on each node in /tmp. Once
you have done this, you can then log onto each node and run
gcov. For example,

[sloanjd@oscarnode1 tmp]$ gcov rect
66.67% of 57 source lines executed in file rect.c
Creating rect.c.gcov.

Don't forget that the gcov datafile accumulates
information across runs. If you want fresh data,
you'll need to delete it from each node. This is a
lot of copying, so you'll want to automate it as
much as possible. You'll also need to clean up after
you are done. In this example, I copied rect,
rect.c, rect.bb, and
rect.bbg to each node. Fortunately, for this
demonstration I only needed to copy them to a few nodes.
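One way to script it, as a hedged sketch: the node names, file list, and /tmp staging area below match this chapter's example, and DRYRUN defaults to echo so the commands are only printed rather than executed (set DRYRUN= to an empty value on a real cluster to run them).

```shell
# Sketch of automating the copy/run/collect cycle described above.
DRYRUN=${DRYRUN:-echo}
NODES="oscarnode1 oscarnode2 oscarnode3"
FILES="rect rect.c rect.bb rect.bbg"

# Push the executable, source, and gcov support files to every node.
for node in $NODES; do
    $DRYRUN scp $FILES $node:/tmp/
done

# Run the job, then generate and fetch each node's report.
$DRYRUN mpirun -np 4 /tmp/rect
for node in $NODES; do
    $DRYRUN ssh $node "cd /tmp && gcov rect"
    $DRYRUN scp $node:/tmp/rect.c.gcov rect.c.gcov.$node
done
```

Using cpush from the C3 tools, as noted in the footnote above, would replace the scp loop.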

A couple of warnings are in order. All of this is based on the
assumption that each MPI process will be able to access and write to
the local filesystem. With MPI, there is no guarantee this is the
case. The approach outlined here seems to work with LAM/MPI and
MPICH, but if you are using some other version of MPI, all bets are
off.

