High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI

Joseph D. Sloan
12.1 PVFS


PVFS is a freely available, software-based solution jointly developed
by Argonne National Laboratory and Clemson University. PVFS is
designed to distribute data among the disks throughout the cluster
and works with both serial and parallel programs. Programs can access
it using traditional Unix file I/O semantics, the MPI-2 ROMIO
semantics, or the native PVFS semantics. It provides a consistent
namespace and transparent access using existing utilities, along with
a mechanism for programming application-specific access. Although
PVFS is developed on x86-based Linux platforms, it runs on some other
platforms as well. It is available for both OSCAR and Rocks. PVFS2, a
second-generation PVFS, is in the works.

On the downside, PVFS does not provide redundancy, does not support
symbolic or hard links, and does not provide an fsck-like utility.

Figure 12-1 shows the overall architecture for a cluster using PVFS.
Machines in a cluster using PVFS fall into three possibly overlapping
categories based on functionality. Each PVFS filesystem has one
metadata server. This is a filesystem management node that maintains
or tracks information about the filesystem, such as file ownership,
access privileges, and file locations, i.e., the filesystem's
metadata.


Figure 12-1. Internal cluster architecture

Because PVFS distributes files across the cluster nodes, the actual
files are located on the disks on I/O servers. I/O servers store the
data using the existing hardware and filesystem on that node. By
spreading or striping a file across multiple nodes, applications have
multiple paths to data. A compute node may access a portion of the
file on one machine while another node accesses a different portion
of the file located on a different I/O server. This eliminates the
bottleneck inherent in a single file server approach such as NFS.

The remaining nodes are the client nodes. These are the actual
compute nodes within the clusters, i.e., where the parallel jobs
execute. With PVFS, client nodes and I/O servers can overlap. For a
small cluster, it may make sense for all nodes to be both client and
I/O nodes. Similarly, the metadata server can also be an I/O server
or client node, or both. Once you start writing data to these
machines, it is difficult to change the configuration of your system.
So give some thought to what you need.


12.1.1 Installing PVFS on the Head Node


Installing and configuring PVFS is more
complicated than most of the other software described in this book,
for a couple of reasons. First, you will need to decide
how to partition your cluster. That is, you must decide which machine
will be the metadata server, which machines will be clients, and
which machines will be I/O servers. For each type of machine, there
is different software to install and a different configuration. If a
machine is going to be both a client and an I/O server, it must be
configured for each role. Second, in order to limit the overhead of
accessing the filesystem through the kernel, a kernel module is used.
This may entail further tasks such as making sure the appropriate
kernel header files are available or patching the code to account for
differences among Linux kernels.

This chapter describes a simple configuration where
fanny is the metadata server, a client, and an
I/O server, and all the remaining nodes are both clients and I/O
servers. As such, it should provide a fairly complete idea about how
PVFS is set up. If you are configuring your cluster differently, you
won't need to do as much. For example, if some of
your nodes are only I/O nodes, you can skip the client configuration
steps on those machines.

In this example, the files are downloaded, compiled, and installed on
fanny since fanny plays all
three roles. Once the software is installed on
fanny, the appropriate pieces are pushed to the
remaining machines in the cluster.

The first step, then, is to download the appropriate software. To
download PVFS, first go to the PVFS
home page (http://www.parl.clemson.edu/pvfs/) and follow
the link to files. This site has links to several download sites.
(You'll want to download the documentation from this
site before moving on to the software download sites.) There are two
tar archives to download: the sources for PVFS and for the kernel
module.

You should also look around for any patches you might need. For
example, at the time this was written, because of customizations to
the kernel, the current version of PVFS would not compile correctly under
Red Hat 9.0. Fortunately, a patch from http://www.mcs.anl.gov/~robl/pvfs/redhat-ntpl-fix.patch.gz
was available.[1] Other patches may also be available.

[1] Despite the URL, this was an
uncompressed text file at the time this was written.


Once you have the files, copy the files to an appropriate directory
and unpack them.

[root@fanny src]# gunzip pvfs-1.6.2.tgz
[root@fanny src]# gunzip pvfs-kernel-1.6.2-linux-2.4.tgz
[root@fanny src]# tar -xvf pvfs-1.6.2.tar
...
[root@fanny src]# tar -xvf pvfs-kernel-1.6.2-linux-2.4.tar
...

It is simpler if you install these under the same directory. In this
example, the directory /usr/local/src is used.
Following the documentation that comes with PVFS, you can create a
link to the source directory.

[root@fanny src]# ln -s pvfs-1.6.2 pvfs

This will save a little typing but isn't essential.


Be sure to look at the README and INSTALL files that come with the
sources.

Next, apply any patches you may need. As noted, with this version the
kernel module sources need to be patched.

[root@fanny src]# mv redhat-ntpl-fix.patch pvfs-kernel-1.6.2-linux-2.4/
[root@fanny src]# cd pvfs-kernel-1.6.2-linux-2.4
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# patch -p1 -b < \
> redhat-ntpl-fix.patch
patching file config.h.in
patching file configure
patching file configure.in
patching file kpvfsd.c
patching file kpvfsdev.c
patching file pvfsdev.c

Apply any other patches that might be needed.

The next steps are compiling PVFS and the PVFS kernel module. Here
are the steps for compiling PVFS:

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cd /usr/local/src/pvfs
[root@fanny pvfs]# ./configure
...
[root@fanny pvfs]# make
...
[root@fanny pvfs]# make install
...

There is nothing new here.

Next, repeat the process with the kernel module.

[root@fanny src]# cd /usr/local/src/pvfs-kernel-1.6.2-linux-2.4
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# ./configure
...
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# make
...
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# make install
install -c -d /usr/local/sbin
install -c mount.pvfs /usr/local/sbin
install -c pvfsd /usr/local/sbin
NOTE: pvfs.o must be installed by hand!
NOTE: install mount.pvfs by hand to /sbin if you want 'mount -t pvfs' to work

This should go very quickly.

As you see from the output, the installation for the kernel requires
some additional manual steps. Specifically, you need to decide where
you want to put the kernel module. The following works for Red Hat
9.0.

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# mkdir \
> /lib/modules/2.4.20-6/kernel/fs/pvfs
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cp pvfs.o \
> /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o

If you are doing something different, you may need to poke around a
bit to find the right location.
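For example, you can find the kernel version you are running, and
therefore the matching directory under /lib/modules, with
uname. This is a minimal sketch; the version string will, of course,
differ on your machines.

[root@fanny root]# uname -r
2.4.20-6
[root@fanny root]# ls -d /lib/modules/$(uname -r)/kernel/fs
/lib/modules/2.4.20-6/kernel/fs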


12.1.2 Configuring the Metadata Server


If you have been following along,
at this point you should have all the software installed on the head
node, i.e., the node that will function as the metadata server for
the filesystem. The next step is to finish configuring the metadata
server. Once this is done, the I/O server and client software can be
installed and configured.

Configuring the meta-server is straightforward. First, create a
directory to store filesystem data.

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# mkdir /pvfs-meta

Keep in mind, this directory is used to store information about the
PVFS filesystem. The actual data is not stored in this directory.
Once PVFS is running, you can ignore this directory.

Next, create the two metadata configuration files and place them in
this directory. Fortunately, PVFS provides a script to simplify the
process.

[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cd /pvfs-meta
[root@fanny pvfs-meta]# /usr/local/bin/mkmgrconf
This script will make the .iodtab and .pvfsdir files
in the metadata directory of a PVFS file system.
Enter the root directory (metadata directory):
/pvfs-meta/
Enter the user id of directory:
root
Enter the group id of directory:
root
Enter the mode of the root directory:
777
Enter the hostname that will run the manager:
fanny
Searching for host...success
Enter the port number on the host for manager:
(Port number 3000 is the default)
3000
Enter the I/O nodes: (can use form node1, node2, ... or
nodename{#-#,#,#})
fanny george hector ida james
Searching for hosts...success
I/O nodes: fanny george hector ida james
Enter the port number for the iods:
(Port number 7000 is the default)
7000
Done!

Running this script creates the two configuration files
.pvfsdir and .iodtab. The
file .pvfsdir contains permission information
for the metadata directory. Here is the file the
mkmgrconf script creates when run as shown.

84230
0
0
0040777
3000
fanny
/pvfs-meta/
/

The first entry is the inode number of the configuration file. The
remaining entries correspond to the questions answered earlier.

The file .iodtab is a list of
the I/O servers and their port numbers. For this example, it should
look like this:

fanny:7000
george:7000
hector:7000
ida:7000
james:7000

Systems can be listed by name or by IP address. If the default port
(7000) is used, it can be omitted from the file.
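For example, an equivalent .iodtab could list the servers by address
and omit the default port. This is only a sketch; the addresses below
are hypothetical and should be replaced with the actual addresses of
your I/O servers.

10.0.0.1
10.0.0.2
10.0.0.3
10.0.0.4
10.0.0.5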


The .iodtab file is an ordered list of I/O
servers. Once PVFS is running, you should not change the
.iodtab file. Otherwise, you will almost
certainly render existing PVFS files inaccessible.


12.1.3 I/O Server Setup


To set up the I/O servers, you need to
create a data directory on the appropriate machines, create a
configuration file, and then push the configuration file, along with
the other I/O server software, to the appropriate machines. In this
example, all the nodes in the cluster including the head node are I/O
servers.

The first step is to create a directory with the appropriate
ownership and permissions on all the
I/O servers. We start with the head node.

[root@fanny /]# mkdir /pvfs-data
[root@fanny /]# chmod 700 /pvfs-data
[root@fanny /]# chown nobody.nobody /pvfs-data

Keep in mind that these directories are where the actual pieces of a
data file will be stored. However, you will not access this data in
these directories directly. That is done through the filesystem at
the appropriate mount point. These PVFS data directories, like the
meta-server's metadata
directory, can be ignored once PVFS is running.

Next, create the configuration file
/etc/iod.conf using your favorite text editor. (This
is optional, but recommended.) iod.conf
describes the iod environment. Every line, apart
from comments, consists of a key and a corresponding value. Here is a
simple example:

# iod.conf-iod configuration file
datadir /pvfs-data
user nobody
group nobody
logdir /tmp
rootdir /
debug 0

As you can see, this specifies a directory for the data, the user and
group under which the I/O daemon iod will run,
the log and root directories, and a debug level. You can also specify
other parameters such as the port and buffer information. In general,
the defaults are reasonable, but you may want to revisit this file
when fine-tuning your system.
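For instance, to set the port explicitly (7000 is the default), you
could add a line such as the following to iod.conf. This is only a
sketch; verify the exact keyword, and any buffer-related settings,
against the iod documentation before relying on it. Remember that a
nondefault port must also appear in the .iodtab entries.

# assumed syntax -- check the iod documentation
port 7000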

While this takes care of the head node, the process must be repeated
for each of the remaining I/O servers. First, create the directory
and configuration file for each of the remaining I/O servers. Here is
an example using the C3 utilities. (C3 is described in Chapter 10.)

[root@fanny /]# cexec mkdir /pvfs-data
...
[root@fanny /]# cexec chmod 700 /pvfs-data
...
[root@fanny /]# cexec chown nobody.nobody /pvfs-data
...
[root@fanny /]# cpush /etc/iod.conf
...

Since the configuration file is the same, it's
probably quicker to copy it to each machine, as shown here, rather
than re-create it.

Finally, since the iod daemon was created only
on the head node, you'll need to copy it to each of
the remaining I/O servers.

[root@fanny root]# cpush /usr/local/sbin/iod
...

While this example uses C3's
cpush, you can use whatever you are comfortable
with.

If you aren't configuring every machine in your
cluster to be an I/O server, you'll need to adapt
these steps as appropriate for your cluster. This is easy to do with
C3's range feature.


12.1.4 Client Setup


Client setup is a little more involved.
For each client, you'll need to create a PVFS device
file, copy over the kernel module, create a mount point and a PVFS
mount table, and copy over the appropriate executable along with any
other utilities you might need on the client machine. In this
example, all nodes including the head are configured as clients. But
because we have already installed software on the head node, some of
the steps aren't necessary for that particular
machine.

First, a special character file needs to be created on each of the
clients using the mknod command.

[root@fanny /]# cexec mknod /dev/pvfsd c 60 0
...

/dev/pvfsd is used to communicate between the
pvfsd daemon and the kernel module
pvfs.o. It allows programs to access PVFS files,
once mounted, using traditional Unix filesystem semantics.

We will need to distribute both the kernel module and the daemon to
each node.

[root@fanny /]# cpush /usr/local/sbin/pvfsd
...
[root@fanny /]# cexec mkdir /lib/modules/2.4.20-6/kernel/fs/pvfs/
...
[root@fanny /]# cpush /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o
...

The kernel module registers the filesystem with the kernel while the
daemon performs network transfers.

Next, we need to create a mount point.

[root@fanny root]# mkdir /mnt/pvfs
[root@fanny /]# cexec mkdir /mnt/pvfs
...

This example uses /mnt/pvfs, but
/pvfs is another frequently used alternative.
The mount directory is where the files appear to be located. This is
the directory you'll use to access or reference
files.

The mount.pvfs executable is used to mount a
filesystem using PVFS and should be copied to each client node.

[root@fanny /]# cpush /usr/local/sbin/mount.pvfs /sbin/
...

mount.pvfs can be invoked by the
mount command on some systems, or it can be
called directly.
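For example, with mount.pvfs copied to /sbin as above, either of the
following forms can be used once the daemons described in the next
section are running. This is a sketch for the configuration used in
this chapter; on systems where mount does not recognize the pvfs
type, call mount.pvfs directly.

[root@fanny /]# mount -t pvfs fanny:/pvfs-meta /mnt/pvfs
[root@fanny /]# /sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs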

Finally, create
/etc/pvfstab, a mount table for the PVFS system. This
needs to contain only a single line of information as shown here:

fanny:/pvfs-meta  /mnt/pvfs  pvfs  port=3000  0  0

If you are familiar with /etc/fstab, this should
look very familiar. The first field is the path to the metadata
information. The next field is the mount point. The third field is
the filesystem type, which is followed by the port number. The last
two fields, traditionally used to determine when a filesystem is
dumped or checked, aren't currently used by PVFS.
These fields should be zeros. You'll probably need
to change the first two fields to match your cluster, but everything
else should work as shown here.

Once you have created the mount table, push it to the remaining nodes.

[root@fanny /]# cpush /etc/pvfstab
...
[root@fanny /]# cexec chmod 644 /etc/pvfstab
...

Make sure the file is readable as shown.

While it isn't strictly necessary, there are some other files that
you may want to push to your client nodes. The installation of PVFS
puts a number of utilities in /usr/local/bin.
You'll need to push these to the clients before
you'll be able to use them effectively. The most
useful include mgr-ping,
iod-ping, pvstat, and
u2p.

[root@fanny root]# cpush /usr/local/bin/mgr-ping
...
[root@fanny root]# cpush /usr/local/bin/iod-ping
...
[root@fanny root]# cpush /usr/local/bin/pvstat
...
[root@fanny pvfs]# cpush /usr/local/bin/u2p
...

As you gain experience with PVFS, you may want to push other
utilities across the cluster.

If you want to do program development using PVFS, you will need
access to the PVFS header files and libraries and the
pvfstab file. By default, header and library
files are installed in /usr/local/include and
/usr/local/lib, respectively. If you do program
development only on your head node, you are in good shape. But if you
do program development on any of your cluster nodes,
you'll need to push these files to those nodes. (You
might also want to push the manpages as well, which are installed in
/usr/local/man.)
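As a sketch, you can push the development files with the same C3
commands used earlier and then link against the PVFS library. The
file names below (pvfs.h, libpvfs.a) and the -lpvfs flag are
assumptions; verify the actual names in /usr/local/include and
/usr/local/lib on your head node. The program myapp.c is a
hypothetical application that uses the PVFS library.

[root@fanny /]# cpush /usr/local/include/pvfs.h
...
[root@fanny /]# cpush /usr/local/lib/libpvfs.a
...
[root@fanny /]# gcc -I/usr/local/include -L/usr/local/lib -o myapp myapp.c -lpvfs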


12.1.5 Running PVFS


Finally,
now that you have everything installed, you can start PVFS. You need
to start the appropriate daemons on the appropriate machines and load
the kernel module. To load the kernel module, use the
insmod command.

[root@fanny root]# insmod /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o
[root@fanny root]# cexec insmod /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o
...

Next, run the
mgr daemon on the metadata server. This is the
management daemon.

[root@fanny root]# /usr/local/sbin/mgr

On each I/O server, start the iod daemon.

[root@fanny root]# /usr/local/sbin/iod
[root@fanny root]# cexec /usr/local/sbin/iod
...

Next, start the pvfsd daemon on each client node.

[root@fanny root]# /usr/local/sbin/pvfsd
[root@fanny root]# cexec /usr/local/sbin/pvfsd
...

Finally, mount the filesystem on each
client.

[root@fanny root]# /usr/local/sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs
[root@fanny /]# cexec /sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs
...

PVFS should be up and running.[2]

[2] Although not described
here, you'll probably want to make the necessary
changes to your startup file so that this is all done automatically.
PVFS provides scripts enablemgr and
enableiod for use with Red Hat machines.


To shut PVFS down, use the umount command to
unmount the filesystem, e.g., umount /mnt/pvfs,
stop the PVFS processes with kill or
killall, and unload the
pvfs.o module with the
rmmod command.
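Putting those steps together, a shutdown might look like the
following sketch. Run killall mgr only on the metadata server and
killall iod only on the I/O servers; the exact set of commands on
each node depends on the roles it plays.

[root@fanny root]# umount /mnt/pvfs
[root@fanny root]# killall pvfsd
[root@fanny root]# killall iod
[root@fanny root]# killall mgr
[root@fanny root]# rmmod pvfs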


12.1.5.1 Troubleshooting

There
are several things you can do to quickly check whether everything is
running. Perhaps the simplest is to copy a file to the mounted
directory and verify that it is accessible on other nodes. If you
have problems, there are a couple of other things you might want to
try to narrow things down.
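For example, a quick sanity check along those lines, using C3's
cexec as before (the file name is arbitrary):

[root@fanny root]# cp /etc/hosts /mnt/pvfs/testfile
[root@fanny root]# cexec ls -l /mnt/pvfs/testfile
...

If the file shows up with the expected size on every node, the basic
pieces are in place.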

First, use ps to ensure the daemons are running on
the appropriate machines. For example,

[root@fanny root]# ps -aux | grep pvfsd
root     15679  0.0  0.1  1700  184 ?  S  Jun21  0:00 /usr/local/sbin/pvfsd

Of course, mgr should be running only on the
metadata server and iod should be running on all
the I/O servers (but nowhere else).
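You can run the same sort of check across the cluster with cexec;
here is a sketch, quoting the command so the pipe runs on each node
rather than locally.

[root@fanny root]# cexec 'ps -aux | grep iod'
...
[root@fanny root]# cexec 'ps -aux | grep mgr'
...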

Each process will create a log file, by default in the
/tmp directory. Look to see if these are present.

[root@fanny root]# ls -l /tmp
total 48
-rwxr-xr-x 1 root root 354 Jun 21 11:13 iolog.OxLkSR
-rwxr-xr-x 1 root root 0 Jun 21 11:12 mgrlog.z3tg11
-rwxr-xr-x 1 root root 119 Jun 21 11:21 pvfsdlog.msBrCV
...

The garbage at the end of the filenames is generated to produce a
unique filename.

The mounted PVFS will be included in the listing given with the mount
command.

[root@fanny root]# mount
...
fanny:/pvfs-meta on /mnt/pvfs type pvfs (rw)
...

This should work on each node.

In addition to the fairly obvious tests just listed, PVFS provides a
couple of utilities you can turn to. The utilities
iod-ping and
mgr-ping can be used to check whether the I/O
and metadata servers are running and responding on a particular
machine.

Here is an example of using iod-ping:

[root@fanny root]# /usr/local/bin/iod-ping
localhost:7000 is responding.
[root@fanny root]# cexec /usr/local/bin/iod-ping
************************* local *************************
--------- george.wofford.int---------
localhost:7000 is responding.
--------- hector.wofford.int---------
localhost:7000 is responding.
--------- ida.wofford.int---------
localhost:7000 is responding.
--------- james.wofford.int---------
localhost:7000 is responding.

The iod daemon seems to be OK on all the
nodes. If you run mgr-ping, only the metadata
server should respond.
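A local run of mgr-ping might look like the following sketch; the
output format is assumed to parallel iod-ping, so the exact wording
may differ in your version.

[root@fanny root]# /usr/local/bin/mgr-ping
localhost:3000 is responding.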

