High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI [Electronic resources] (text version)

Joseph D. Sloan

9.3 LAM/MPI


The Local Area Multicomputer/Message Passing Interface (LAM/MPI) was
originally developed by the Ohio Supercomputer Center. It is now
maintained by the Open Systems Laboratory at Indiana University. As
previously noted, LAM/MPI (or LAM for short) is both an MPI library
and an execution environment. LAM was designed around an extensible
component framework known as the System Services Interface (SSI), one
of its major strengths, although the details of SSI are beyond the
scope of this book. LAM works well in a wide variety of environments
and supports several methods of inter-process communication using
TCP/IP. It will run on most Unix machines (but not Windows). New
releases are tested with both Red Hat and Mandrake Linux.

Documentation can be downloaded from the LAM site, http://www.lam-mpi.org/. There are also
tutorials, a FAQ, and archived mailing lists. This chapter provides
an overview of the installation process and a description of how to
use LAM. For more up-to-date and detailed information, you should
consult the LAM/MPI Installation Guide and the LAM/MPI User's Guide.


9.3.1 Installing LAM/MPI


You have two basic choices when installing LAM. You can download and
install a Red Hat package, or you can download the source and
compile it yourself. The package approach is very quick, easy to
automate, and uses somewhat less space. If you have a small cluster
and are manually installing the software, it will be a lot easier to
use packages. Installing from the source allows you to customize the
installation, i.e., to select which features are enabled and to
determine where the software is installed. It is probably a bad idea
to mix the two approaches within one cluster, since you could easily
end up with different versions of the software on different machines,
something you'll definitely want to avoid.

Installing from a package works just as you'd expect. Download the
package from http://www.lam-mpi.org/ and install it as you would any
other Red Hat package.

[root@fanny root]# rpm -vih lam-7.0.6-1.i586.rpm
Preparing... ########################################### [100%]
1:lam ########################################### [100%]

The files will be installed under the /usr
directory. The space used is minimal. You can use the
laminfo command to see the details of the
installation, including the compiler bindings and which modules are
installed.

If you need more control over the
installation, you'll want to do a manual install:
fetch the source, compile, install, and configure. The manual
installation is only slightly more involved. However, it does take
considerably longer, something to keep in mind if
you'll be repeating the installation on each machine
in your cluster. But if you are building an image, this is a one-time
task. The installation requires a POSIX-compliant operating system,
an appropriate compiler (e.g., the GNU 2.95 compiler suite), utilities
such as sed, grep, and awk, and a modern make. You
should have no problem with most versions of Linux.

First, you'll need to decide where to put
everything, a crucial step if you are installing more than one
version of MPI. If care isn't taken, you may find
that part of an installation has been overwritten. In this example,
the source files are saved in
/usr/local/src/lam-7.0.6 and the installed code
in /usr/local/lam-7.0.6. Begin by downloading the
appropriate file from http://www.lam-mpi.org/ to
/usr/local/src. Next, uncompress and unpack the
file.

[root@fanny src]# bunzip2 lam-7.0.6.tar.bz2
[root@fanny src]# tar -xvf lam-7.0.6.tar
...
[root@fanny src]# cd lam-7.0.6

You'll see a lot of files stream by as the source is
unpacked. If you want to capture this output, you can
tee it to a log file. Just append | tee tar.log
to the end of the line and the output will be
copied to the file tar.log. You can do something
similar with subsequent commands.
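
The tee idiom can be tried on a throwaway archive before using it on
the LAM tarball. The sketch below assumes a POSIX shell with tar and
mktemp available; the names demo, a.txt, and b.txt are invented for
illustration. The added 2>&1 is an assumption on my part, not part of
the text: it also sends error messages to the log.

```shell
# Build a tiny archive in a scratch directory (names are illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir demo
echo "one" > demo/a.txt
echo "two" > demo/b.txt
tar -cf demo.tar demo
rm -r demo

# Unpack verbosely. tee copies the file listing to tar.log while
# still letting it reach the terminal; 2>&1 captures errors as well.
tar -xvf demo.tar 2>&1 | tee tar.log

# The log now holds the same listing that scrolled past.
echo "lines logged: $(wc -l < tar.log)"
```

The same pattern applies to longer steps such as make, where scrolling
back through the terminal is less practical than searching a log.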

Next, create the directory where the executables will be installed
and configure the code, specifying that directory with the
--prefix option. You may also include any other
options you desire. The example uses a configuration option to
specify SSH as well. (You could also set this through the
environment variable LAMRSH rather than compiling it into the
code, something you must do if you use a package installation.)

[root@fanny lam-7.0.6]# mkdir /usr/local/lam-7.0.6
[root@fanny lam-7.0.6]# ./configure --prefix=/usr/local/lam-7.0.6 \
> --with-rsh="ssh -x"

If you don't have a FORTRAN compiler,
you'll need to add --without-fc
to the configure command. A description of other
configuration options can be found in the documentation. However, the
defaults are quite reasonable and will be adequate for most users.
Also, if you aren't using the GNU compilers, you
need to set and export compiler variables. The documentation advises
that you build LAM/MPI with the same compiler you'll later use to
compile your MPI programs.

Next, you'll need to make and install the code.

[root@fanny lam-7.0.6]# make
...
[root@fanny lam-7.0.6]# make install
...

You'll see a lot of output with these commands, but
all should go well. You may also want to make the examples and clean
up afterwards.

[root@fanny lam-7.0.6]# make examples
...
[root@fanny lam-7.0.6]# make clean
...

Again, expect a lot of output. You only need to make the examples on
the cluster head. Congratulations, you've just
installed LAM/MPI. You can verify the settings and options with the
laminfo command.


9.3.2 User Configuration


Before you can use LAM,
you'll need to do a few more things. First,
you'll need to create a host file or schema, which
is basically a file that contains a list of the machines in your
cluster that will participate in the computation. In its simplest
form, it is just a text file with one machine name per line. If you
have multiple CPUs on a host, you can repeat the host name or you can
append a CPU count to a line in the form cpu=n, where n is the
number of CPUs. However, you should
realize that the actual process scheduling on the node is left to the
operating system. If you need to change identities when logging into
a machine, it is possible to specify that username for a machine in
the schema file, e.g., user=smith. You can create
as many different schemas as you want and can put them anywhere on
the system. If you have multiple users, you'll
probably want to put the schema in a public directory, for example,
/etc/lamhosts.
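
As a rough illustration of the schema format just described, the
sketch below writes a small host file and tallies the CPUs it
declares. The first two host names come from this chapter's examples;
node3 and node4 are hypothetical. It also assumes that lines starting
with # are treated as comments. The file is written to a scratch
location so the sketch can be tried without touching /etc.

```shell
# Write a sample schema to a scratch file; a shared copy might live
# at /etc/lamhosts as suggested in the text.
schema=$(mktemp)
cat > "$schema" <<'EOF'
# Sample LAM boot schema (node3 and node4 are hypothetical hosts)
fanny.wofford.int
george.wofford.int
node3.wofford.int cpu=2
node4.wofford.int cpu=2 user=smith
EOF

# Count the CPUs the schema declares: each host line contributes
# its cpu=n value, or 1 when no count is given.
total=$(awk '
    /^[ \t]*#/ || NF == 0 { next }     # skip comments and blank lines
    {
        n = 1
        for (i = 2; i <= NF; i++)
            if ($i ~ /^cpu=[0-9]+$/) n = substr($i, 5) + 0
        sum += n
    }
    END { print sum + 0 }
' "$schema")
echo "schema declares $total CPUs"
```

A tally like this is only a sanity check on the file. As noted above,
where processes actually run on a multi-CPU node is decided by the
operating system.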

You'll also want to set your
$PATH variable to include the LAM
executables, which can be trickier than it might seem. If you are
installing both LAM/MPI and MPICH, there are several programs (e.g.,
mpirun, mpicc, etc.) that
have the same name with both systems, and you need to be able to
distinguish between them. While you could rename these programs for
one of the packages, that is not a good idea. It will confuse your
users and be a nuisance when you upgrade software. Since it is
unlikely that an individual user will want to use both packages, the
typical approach is to set the path to include one but not the other.
Of course, as the system administrator, you'll want
to test both, so you'll need to be able to switch
back and forth. OSCAR's solution to this problem is
a package called switcher that allows a user to
easily change between two configurations.
switcher is described in Chapter 6.
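
For administrators who are not running OSCAR, a small shell function
can play a similar role. The following is a hand-rolled sketch, not
the switcher package and not anything LAM or MPICH provides: the
function name use_mpi and the MPICH directory are assumptions made
for illustration.

```shell
# Install prefixes; the MPICH location is assumed, not from the text.
LAM_BIN=/usr/local/lam-7.0.6/bin
MPICH_BIN=/usr/local/mpich/bin

# use_mpi lam|mpich: put one MPI's bin directory at the front of PATH
# and drop the other, so that mpirun and mpicc are unambiguous.
use_mpi() {
    case "$1" in
        lam)   want=$LAM_BIN;   drop=$MPICH_BIN ;;
        mpich) want=$MPICH_BIN; drop=$LAM_BIN ;;
        *)     echo "usage: use_mpi lam|mpich" >&2; return 1 ;;
    esac
    new=$want
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # keep every existing entry except the two MPI directories
        if [ "$dir" != "$want" ] && [ "$dir" != "$drop" ]; then
            new="$new:$dir"
        fi
    done
    IFS=$old_ifs
    PATH=$new
    export PATH
}

use_mpi lam
echo "$PATH"
```

Because the function only edits the current shell's PATH, it suits an
administrator testing both packages in one session; ordinary users
would still have a single MPI on their path.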

A second issue is making sure the path is set properly for both
interactive and noninteractive or non-login shells. (The path you
want to add is /usr/local/lam-7.0.6/bin if you
are using the same directory layout used here.) The processes that
run on the compute nodes are run in noninteractive shells. This can
be particularly confusing for bash users. With bash, if the path is
set in .bash_profile and not in .bashrc, you'll be able to log
onto each individual system and run the appropriate programs, but you
won't be able to run the programs remotely. Until
you realize what is going on, this can be a frustrating problem to
debug. So, if you use bash, don't forget to set your path in
.bashrc. (And while you are setting paths, add the directory holding
the manpages as well, e.g., /usr/local/lam-7.0.6/man.)
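
One way to write the corresponding .bashrc lines is sketched below. It
assumes the /usr/local/lam-7.0.6 prefix used in this chapter; the
variable name LAM_HOME is only a convenience of the sketch. The guard
keeps the path from growing each time a nested shell re-reads the
file.

```shell
# Lines one might add to ~/.bashrc, the file the text recommends so
# that remote, noninteractive shells also see the LAM programs.
LAM_HOME=/usr/local/lam-7.0.6    # prefix used in this chapter

# Add the LAM executables only if they are not already on the path.
case ":$PATH:" in
    *":$LAM_HOME/bin:"*) ;;
    *) PATH="$LAM_HOME/bin:$PATH" ;;
esac

# Same treatment for the manual pages. If MANPATH was empty, this
# leaves a trailing colon, which man-db reads as "then search the
# default locations too".
case ":${MANPATH:-}:" in
    *":$LAM_HOME/man:"*) ;;
    *) MANPATH="$LAM_HOME/man:${MANPATH:-}" ;;
esac
export PATH MANPATH
```

Some distributions ship a .bashrc that returns early when the shell is
not interactive. On such systems these lines would need to appear
above that check to have any effect on remote commands.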

It should be downhill from here. Make sure you have
ssh-agent running and that you can log onto
other machines without a password. Setting up and using SSH is
described in Chapter 4. You'll
also need to ensure that there is no output to
stderr whenever you
log in using SSH. (When LAM sees output to
stderr, it thinks something bad is happening and
aborts.) Since you'll get a warning message the
first time you log into a system with SSH as it adds the remote
machine to the known hosts, often the easiest thing to do (provided
you don't have too many machines in the cluster) is
to manually log into each machine once to get past this problem.
You'll only need to do this once.
recon, described in the subsection on testing,
can alert you to some of these problems.
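
Since any output on stderr is enough to stop LAM, it can help to check
for it directly. The helper below is a sketch with an invented name
(quiet_on_stderr); it is demonstrated with local stand-in commands so
it can be tried without a cluster, and the ssh loop in the closing
comment is an untested outline of how it might be applied to the
hosts in a schema.

```shell
# quiet_on_stderr CMD...: run a command, discard its stdout, and
# succeed only if the command wrote nothing to stderr.
quiet_on_stderr() {
    err=$("$@" 2>&1 >/dev/null)
    [ -z "$err" ]
}

# Local demonstration with stand-in commands:
quiet_on_stderr true && echo "true: silent"
quiet_on_stderr sh -c 'echo warning >&2' || echo "sh: wrote to stderr"

# Against a real cluster one might loop over the schema instead, e.g.
#   for h in $(awk '!/^#/ && NF { print $1 }' /etc/lamhosts); do
#       quiet_on_stderr ssh -x -o BatchMode=yes "$h" true ||
#           echo "$h: fix before running lamboot"
#   done
```

The redirection order matters here: 2>&1 first points stderr at the
captured stream, and only then is stdout sent to /dev/null, so the
variable ends up holding stderr alone.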

Also, the directory /tmp must be writable.
Don't forget to turn off or reconfigure your
firewall as needed.


9.3.3 Using LAM/MPI


The basic steps in creating and executing a program with LAM are as
follows:

Boot the runtime system with lamboot.

Write and compile a program with the appropriate compiler, e.g.,
mpicc. (Strictly speaking, you don't need to boot the system to
compile code.)

Execute the code with the mpirun command.

Clean up any crashed processes with lamclean if
things didn't go well.

Shut down the runtime system with the command
lamhalt.

Each of these steps will now be described.

In order to use LAM, you will need to launch the runtime environment.
This is referred to as booting LAM and is done with the
lamboot command. Basically,
lamboot starts the lamd
daemon, the message server, on each machine.


Because running lamboot as root raises considerable security issues,
lamboot is configured so that it will refuse to start if you invoke
it as root.

You specify the schema you want to use as an argument.

[sloanjd@fanny sloanjd]$ lamboot -v /etc/lamhosts
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<9677> ssi:boot:base:linear: booting n0 (fanny.wofford.int)
n-1<9677> ssi:boot:base:linear: booting n1 (george.wofford.int)
...
n0<15402> ssi:boot:base:linear: finished

As noted above, you must be able to log onto the remote systems
without a password and without any error messages. (If this command
doesn't work the first time, try it a couple more times to clear out
any one-time error messages.) If you
don't want to see the list of nodes, leave out the
-v. You can always use the
lamnodes command to list the nodes later if you
wish.

[sloanjd@fanny sloanjd]$ lamnodes
n0 10.0.32.144:1:origin,this_node
n1 10.0.32.145:1:
...

You'll only need to boot the system once at the
beginning of the session. It will remain loaded until you halt it or
log out. (Also, you can omit the schema and just use the local
machine. Your code will run only on the local node, but this can be
useful for initial testing.)

Once you have entered your program
using your favorite editor, the next step is to compile and link the
program. You could do this directly by typing in all the compile
options you'll need. But it is much simpler to use
one of the wrapper programs supplied with LAM. The programs
mpicc, mpiCC, and
mpif77 will respectively invoke the C, C++, and
FORTRAN 77 compilers on your system, supplying the appropriate
command-line arguments for LAM. For example, you might enter
something like the following:

[sloanjd@fanny sloanjd]$ mpicc -o hello hello.c

(hello.c is one of the examples that comes with
LAM and can be found in
/usr/local/src/lam-7.0.6/examples/hello if you
use the same directory structure used here to set up LAM.) If you
want to see which arguments are being passed to the compiler, you can
use the -showme argument. For example,

[sloanjd@fanny sloanjd]$ mpicc -showme -o hello hello.c
gcc -I/usr/local/lam-7.0.6/include -pthread -o hello hello.c -L/usr/local/lam-7.0.6/lib -llammpio -llamf77mpi -lmpi -llam -lutil

With -showme, the program isn't
compiled; you just see the arguments that would have been used had it
been compiled. Any other arguments that you include in the call to
mpicc are passed on to the underlying compiler
unchanged. In general, you should avoid using the
-g (debug) option when it isn't
needed because of the overhead it adds.

To compile the program, rerun the last command without
-showme if you haven't done so.
You now have an executable program. Run the program with the
mpirun command. Basically,
mpirun communicates with the remote LAM daemon
to fork a new process, set environment variables, redirect I/O, and
execute the user's command. Here is an example:

[sloanjd@fanny sloanjd]$ mpirun -np 4 hello
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4

As shown in this example, the argument -np 4
specifies that four processes be used when running the program. If
more than four machines are available, only four will be used. If
fewer machines are available, some machines will be used more than once.

Of course, you'll need the executable on each
machine. If you're using NFS to mount your home
directories, this has already been taken care of if you are working
in that directory. You should also remember that
mpirun can be run on a single machine, which can
be helpful when you want to test code away from a cluster.

If a program crashes, there may be extraneous processes running on
remote machines. You can clean these up with the
lamclean command. This is a command
you'll use only when you are having problems. Try
lamclean first and if it hangs, you can escalate
to wipe. Rerun lamboot
after using wipe. This isn't
necessary with lamclean. Both
lamclean and wipe take a
-v for verbose output.

Once you are done, you can shut down LAM with the
lamhalt command, which kills the
lamd daemon on each machine. If you wish, you
can use -v for verbose output. Two other useful
LAM commands are
mpitask and mpimsg, which
are used to monitor processes across the cluster and to monitor the
message buffer, respectively.
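
The whole cycle described in this section can be collected into a
short script. What follows is a sketch rather than a recommended
tool: the function names and the DRY_RUN convention are inventions of
this example, and with DRY_RUN set the commands are only printed, so
the sequence can be inspected on a machine without LAM installed.

```shell
# run CMD...: print the command when DRY_RUN is set, otherwise execute it.
# DRY_RUN is a convention of this sketch, not a LAM feature.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# lam_session SCHEMA SOURCE NPROCS: one boot/compile/run/halt cycle.
lam_session() {
    schema=$1
    src=$2
    np=$3
    exe=${src%.c}                      # hello.c -> hello
    run lamboot -v "$schema" || return 1
    run mpicc -o "$exe" "$src" || { run lamhalt; return 1; }
    # If the program fails, sweep up stray processes before halting.
    run mpirun -np "$np" "$exe" || run lamclean -v
    run lamhalt
}

DRY_RUN=1
lam_session /etc/lamhosts hello.c 4
```

Unsetting DRY_RUN makes the same function issue the real commands. In
that case it should be run as an ordinary user, since lamboot will not
start as root.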


9.3.4 Testing the Installation


LAM comes with a set of examples, tests, and tools that you can use to
verify that it is properly installed and runs correctly.
We'll start with the simplest tests.

The recon tool verifies that LAM will boot properly. recon is not a
complete test, but it confirms that the user can execute commands on
the remote machine, and that the LAM executables can be found and
executed.

[sloanjd@fanny bin]$ recon
-----------------------------------------------------------------------------
Woo hoo!
recon has completed successfully. This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee). See the lamboot(1) manual page for more
information on the lamboot command.
If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails. Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
-----------------------------------------------------------------------------

Since lamboot is required to run the next tests,
you'll need to run these tests as a non-privileged
user. Once you have booted LAM, you can use the
tping command to check basic connectivity.
tping is similar to ping
but uses the LAM echo server. This confirms both that the network is
reachable and that the LAM daemon is listening. For example, the
following command sends two one-byte messages to nodes n1 through n3
of your cluster.

[sloanjd@fanny sloanjd]$ tping n1-3 -c2
1 byte from 3 remote nodes: 0.003 secs
1 byte from 3 remote nodes: 0.002 secs
2 messages, 2 bytes (0.002K), 0.006 secs (0.710K/sec)
roundtrip min/avg/max: 0.002/0.003/0.003

If you want to probe every machine, use n without
a count.

The LAM test suite is the most comprehensive way to test your system.
It can be used to confirm that you have a complete and correct
installation. Download the test suite that corresponds to your
installation and then uncompress and unpack it.

[sloanjd@fanny sloanjd]$ bunzip2 lamtests-7.0.6.tar.bz2
[sloanjd@fanny sloanjd]$ tar -xvf lamtests-7.0.6.tar
...

This creates the directory lamtests-7.0.6 with
the tests and a set of directions in the file
README. Next, you should start LAM with
lamboot if you haven't already
done so. Then change to the test directory and
run configure.

[sloanjd@fanny sloanjd]$ cd lamtests-7.0.6
[sloanjd@fanny lamtests-7.0.6]$ ./configure
...

Finally, run make.

[sloanjd@fanny lamtests-7.0.6]$ make -k check
...

You'll see lots of output scroll past.
Don't be concerned about an occasional error message
while it is running. What you want is a clean bill of health when it
is finally done. You can run specific tests in the test suite by
changing into the appropriate subdirectory and running
make.

