High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI

Joseph D. Sloan

3.1 Design Decisions


While you may have some idea of what
you want, it is still worthwhile to review the implications of your
choices. There are several closely related, overlapping key issues to
consider when acquiring PCs for the nodes in your cluster:

Will you have identical systems or a mixture of hardware?

Will you scrounge for existing computers, buy assembled computers, or
buy the parts and assemble your own computers?

Will you have full systems with monitors, keyboards, and mice,
minimal systems, or something in between?

Will you have dedicated computers, or will you share your computers
with other users?

Do you have a broad or shallow user base?

This is the most important thing I'll say in this chapter: if at all
possible, use identical systems for your nodes. Life will
be much simpler. You'll need to develop and test
only one configuration and then you can clone the remaining machines.
When programming your cluster, you won't have to
consider different hardware capabilities as you attempt to balance
the workload among machines. Also, maintenance and repair will be
easier since you will have less to become familiar with and will need
to keep fewer parts on hand. You can certainly use heterogeneous
hardware, but it will be more work.

In constructing a cluster, you can scrounge for existing computers,
buy assembled computers, or buy the parts and assemble your own.
Scrounging is the cheapest way to go, but this approach is often the
most time consuming. Usually, using scrounged systems means
you'll end up with a wide variety of hardware, which
creates both hardware and software problems. With older scrounged
systems, you are also more likely to have even more hardware
problems. If this is your only option, try to standardize hardware as
much as possible. Look around for folks doing bulk upgrades when
acquiring computers. If you can find someone replacing a number of
computers at one time, there is a good chance the computers being
replaced will have been a similar bulk purchase and will be very
similar or identical. These could come from a computer laboratory at
a college or university or from an IT department doing a periodic
upgrade.

Buying new, preassembled computers may be the simplest approach if
money isn't the primary concern. This is often the
best approach for mission-critical applications or when time is a
critical factor. Buying new is also the safest way to go if you are
uncomfortable assembling computers. Most
system integrators will allow
considerable latitude over what to include with your systems,
particularly if you are buying in bulk. If you are using a system
integrator, try to have the integrator provide a list of MAC
addresses and label each machine.

Building your own system is cheaper,
provides higher performance and reliability, and allows for
customization. Assembling your own computers may seem daunting, but
it isn't that difficult. You'll
need time, personnel, space, and a few tools. It's a
good idea to build a single system and test it for hardware and
software compatibility before you commit to a large bulk order. Even
if you do buy preassembled computers, you will still need to do some
testing and maintenance. Unfortunately, even new computers are
occasionally DOA.[1] So the extra time may be less
than you'd think. And by building your own,
you'll probably be able to afford more computers.

[1] Dead on arrival: nonfunctional when
first installed.

If you are constructing a
dedicated cluster, you will not need
full systems. The more you can leave out of each computer, the more
computers you will be able to afford, and the less you will need to
maintain on individual computers. For example, with dedicated
clusters you can probably do without monitors, keyboards, and mice
for each individual compute node. Minimal machines have the smallest
footprint, allowing larger clusters when space is limited, and they
have smaller power and air conditioning requirements. With a minimal
configuration, wiring is usually significantly easier, particularly
if you use rack-mounted equipment. (However, heat dissipation can be
a serious problem with rack-mounted systems.) Minimal machines also
have the advantage of being less likely to be reallocated by middle
management.

The size of your user base will also affect
your cluster design. With a broad user base, you'll
need to prepare for a wider range of potential uses: more
applications software and more systems tools. This implies more
secondary storage and, perhaps, more memory. There is also the
increased likelihood that your users will need direct access to
individual nodes.

Shared machines, i.e., computers that have other uses in addition to
their role as a cluster node, may be a way of constructing a
part-time cluster that would not be possible otherwise. If your
cluster is shared, then you will need complete, fully functioning
machines. While this book won't focus on such
clusters, it is certainly possible to have a setup that is a computer
lab on work days and a cluster on the weekend, or office machines by
day and cluster nodes at night.


3.1.1 Node Hardware


Obviously,
your computers need adequate hardware for all intended uses. If your
cluster includes workstations that are also used for other purposes,
you'll need to consider those other uses as well.
This probably means acquiring a fairly standard workstation. For a
dedicated cluster, you determine your needs and there may be a lot
you won't need: audio cards and speakers, video
capture cards, etc. Beyond these obvious expendables, there are other
additional parts you might want to consider omitting such as disk
drives, keyboards, mice, and displays. However, you should be aware
of some of the potential problems you'll face with a
truly minimalist approach. This subsection is a quick review of the
design decisions you'll need to make.


3.1.1.1 CPUs and motherboards

While you can certainly purchase CPUs and motherboards from different
sources, you need to select each with the other in mind. These two
items are the heart of your system. For optimal performance,
you'll need total compatibility between these. If
you are buying your systems piece by piece, consider buying an Intel-
or AMD-compatible motherboard with an installed CPU. However, you
should be aware that some motherboards with permanently affixed CPUs
are poor performers, so choose with care.

You should also buy your equipment from a known, trusted source with
a reputable warranty. For example, in recent years a number of boards
have been released with low-grade electrolytic capacitors. While
these capacitors work fine initially, the board life is
disappointingly brief. People who bought these boards from
fly-by-night companies were out of luck.

In
determining the performance of a node, the most important factors are
processor clock rate, cache size, bus speed, memory capacity, disk
access speed, and network latency. The first four are determined by
your selection of CPU and motherboard. And if you are using
integrated EIDE interfaces and network adapters, all six are at least
influenced by your choice of CPU and motherboard.

Clock speed can be misleading. It is best used to compare processors
within the same family since comparing processors from different
families is an unreliable way to measure performance. For example, an
AMD Athlon 64 may outperform an Intel Pentium 4 when running at the
same clock rate. Processor speed is also very application dependent.
If your data set fits within the large cache in a Prescott-core
Pentium 4 but won't fit in the smaller cache in an
Athlon, you may see much better performance with the Pentium.

Selecting a processor is a balancing act. Your choice will be
constrained by cost, performance, and compatibility. Remember, the
rationale behind a commodity off-the-shelf (COTS) cluster is buying
machines that have the most favorable price to performance ratio, not
pricey individual machines. Typically you'll get the
best ratio by purchasing a CPU that is a generation behind the
current cutting edge. This means comparing the numbers. When
comparing CPUs, you should look at the increase in performance versus
the increase in the total cost of a node. When
the cost starts rising significantly faster than the performance,
it's time to back off. When a 20 percent increase in
performance raises your cost by 40 percent, you've
gone too far.
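
As a purely illustrative check, you can compute performance per dollar
for each candidate node and watch for the point where the ratio starts
to fall. The prices and performance figures below are invented
placeholders, not vendor numbers:

    # Performance per dollar for two hypothetical node configurations.
    # Option A: a baseline node. Option B: 20% more performance, 40% higher cost.
    echo "scale=4; 100 / 1000" | bc   # option A: .1000 performance units per dollar
    echo "scale=4; 120 / 1400" | bc   # option B: .0857 -- a worse ratio, so back off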

Since Linux works with most major chip families, stay mainstream and
you shouldn't have any software compatibility
problems. Nonetheless, it is a good idea to test a system before
committing to a bulk purchase. Since a primary rationale for building
your own cluster is the economic advantage, you'll
probably want to stay away from the less common chips. While clusters
built with UltraSPARC systems may be wonderful performers, few people
would describe these as commodity systems. So unless you just happen
to have a number of these systems that you aren't
otherwise using, you'll probably want to avoid
them.[2]

[2] Radajewski and Eadline's Beowulf HOWTO refers to "Computer
Shopper"-certified equipment. That is, if equipment isn't advertised in
Computer Shopper, it isn't commodity equipment.

With standalone workstations, the overall benefit of multiple
processors (i.e., SMP systems) is debatable since a second processor
can remain idle much of the time. A much stronger argument can be
made for the use of multiple processor systems in clusters where
heavy utilization is assured. They add additional CPUs without
requiring additional motherboards, disk drives, power supplies,
cases, etc.

When comparing motherboards, look to see what is integrated into the
board. There are some significant differences. Serial, parallel, and
USB ports along with EIDE disk adapters are fairly standard. You may
also find motherboards with integrated FireWire ports, a network
interface, or even a video interface. While you may be able to save
money with built-in network or display interfaces (provided they
actually meet your needs), make sure they can be disabled should you
want to install your own adapter in the future. If you are really
certain that some fully integrated motherboard meets your needs,
eliminating the need for daughter cards may allow you to go with a
small case. On the other hand, expandability is a valuable hedge
against the future. In particular, having free memory slots or
adapter slots can be crucial at times.

Finally, make sure the BIOS Setup options are compatible with your
intended configuration. If you are building a minimal system without
a keyboard or display, make sure the BIOS will allow you to boot
without them attached. That's not true for some
BIOSs.


3.1.1.2 Memory and disks

Subject
to your budget, the more cache and RAM in your system, the better.
Typically, the faster the processor, the more RAM you will need. A
very crude rule of thumb is one byte of RAM for every floating-point
operation per second. So a processor capable of 100 MFLOPs would need
around 100 MB of RAM. But don't take this rule too
literally.
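
Spelled out as a quick sketch (the processor rating here is
hypothetical, and the rule is only a starting point):

    # Roughly one byte of RAM per floating-point operation per second.
    MFLOPS=100                        # assumed sustained rating for a node
    echo "${MFLOPS} MFLOPS suggests roughly ${MFLOPS} MB of RAM by this rule"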

Ultimately, what you will need depends on your applications. Paging
creates a severe performance penalty and should be avoided whenever
possible. If you are paging frequently, then you should consider
adding more memory. It comes down to matching the memory size to the
cluster application. While you may be able to get some idea of what
you will need by profiling your application, if you are creating a
new cluster for as yet unwritten applications, you will have little
choice but to guess what you'll need as you build
the cluster and then evaluate its performance after the fact. Having
free memory slots can be essential under these circumstances.

Which disks to include, if any, is perhaps the most controversial
decision you will make in designing your cluster. Opinions vary
widely. The cases both for and against diskless systems have been
grossly overstated. This decision is one of balancing various
tradeoffs. Different contexts tip the balance in different
directions. Keep in mind, diskless systems were once much more
popular than they are now. They disappeared for a reason. Despite a
lot of hype a few years ago about thin clients, the reemergence of
these diskless systems was a spectacular flop. Clusters are, however,
a notable exception. Diskless clusters are a widely used, viable
approach that may be the best solution in some circumstances.

There are a number of obvious advantages to diskless systems. There
is a lower cost per machine, which means you may be able to buy a
bigger cluster with better performance. With rapidly declining disk
prices, this is becoming less of an issue. A small footprint
translates into lowered power and HVAC needs. And once the initial
configuration has stabilized, software maintenance is simpler.

But the real advantage of diskless systems, at least with large
clusters, is reduced maintenance. With diskless systems, you
eliminate all moving parts aside from fans. For example, the average
life (often known as mean time between failures, mean time before
failure, or mean time to failure) of one
manufacturer's disks is reported to be 300,000 hours
or 34 years of continuous operation. If you have a cluster of 100
machines, you'll replace about three of these drives
a year. This is a nuisance, but doable. If you have a cluster with
12,000 nodes, then you are looking at a failure, on average, every 25
hours, roughly once a day.
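
The arithmetic behind those figures is straightforward and uses only
the numbers quoted above:

    # Expected drive failures, assuming the quoted MTBF of 300,000 hours per disk.
    MTBF=300000
    HOURS_PER_YEAR=8760
    echo "scale=1; 100 * $HOURS_PER_YEAR / $MTBF" | bc   # about 2.9 failures a year for 100 nodes
    echo "scale=1; $MTBF / 12000" | bc                   # about 25 hours between failures for 12,000 nodes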

There is also a downside to consider. Diskless systems are much
harder for inexperienced administrators to configure, particularly
with heterogeneous hardware. The network is often the weak link in a
cluster. In diskless systems the network will see more traffic from
the network file system, compounding the problem. Paging across a
network can be devastating to performance, so it is critical that you
have adequate local memory. But while local disks can reduce network
traffic, they don't eliminate it. There will still
be a need for network-accessible file systems.

Simply put, disk-based systems are more versatile and more forgiving.
If you are building a dedicated cluster with new equipment and have
experience with diskless systems, you should definitely consider
diskless systems. If you are new to clusters, a disk-based cluster is
a safer approach. (Since this book's focus is
getting started with clusters, it does not describe setting up
diskless clusters.)

If you are buying hard disks, there are three issues: interface type
(EIDE vs. SCSI), disk latency (a function of rotational speed), and
disk capacity. From a price-performance perspective, EIDE is probably
a better choice than SCSI since virtually all motherboards include a
built-in EIDE interface. And unless you are willing to pay a premium,
you won't have much choice with respect to disk
latency. Almost all current drives rotate at 7,200 RPM. While a few
10,000 RPM drives are available, their performance, unlike their
price, is typically not all that much higher. With respect to disk
capacity, you'll need enough space for the operating
system, local paging, and the data sets you will be manipulating.
Unless you have extremely large data sets, when recycling older
computers a 10 GB disk should be adequate for most uses. Often
smaller disks can be used. For new systems, you'll
be hard pressed to find anything smaller than 20 GB, which should
satisfy most uses. Of course, other non-cluster needs may dictate
larger disks.

You'll probably want to include either a floppy
drive or CD-ROM drive in each system. Since CD-ROM drives can be
bought for under $15 and floppy drives for under $5, you
won't save much by leaving these out. For disk-based
systems, CD-ROMs or floppies can be used to initiate and customize
network installs. For example, when installing the software on
compute nodes, you'll typically use a boot floppy
for OSCAR systems and a CD-ROM on Rocks systems. For diskless
systems, CD-ROMs or floppies can be used to boot systems over the
network without special BOOT ROMs on your network adapters. The only
compelling reason to not include a CD-ROM or floppy is a lack of
space in a truly minimal system.

When buying any disks, don't forget the cables.


3.1.1.3 Monitors, keyboards, and mice

Many minimal systems elect not to include monitors, keyboards, or
mice but rely on the network to provide local connectivity as needed.
While this approach is viable only with a dedicated cluster, its
advantages include lower cost, less equipment to maintain, and a
smaller equipment footprint. There are also several problems you may
encounter with these headless systems. Depending
on the system BIOS, you may not be able to boot a system without a
display card or keyboard attached. When such systems boot, they probe
for an attached keyboard and monitor and halt if none are found.
Often, there will be a CMOS option that will allow you to override
the test, but this isn't always the case.

Another problem comes when you need to configure or test equipment. A
lack of monitor and keyboard can complicate such tasks, particularly
if you have network problems. One possible solution is the use of a
crash cart: a cart with keyboard, mouse, and display that can be
wheeled to individual machines and connected temporarily. Provided
the network is up and the system is booting properly, X Windows or
VNC provide a software solution.

Yet another alternative, particularly for small clusters, is the use
of a keyboard-video-mouse (KVM) switch.
With these switches, you can attach a single keyboard, mouse, and
monitor to a number of different machines. The switch allows you to
determine which computer is currently connected.
You'll be able to access only one of the machines at
a time, but you can easily cycle among the machines at the touch of a
button. It is not too difficult to jump between machines and perform
several tasks at once. However, it is fairly easy to get confused
about which system you are logged on to. If you use a KVM switch, it
is a good idea to configure the individual systems so that each
displays its name, either as part of the prompt for command-line
systems or as part of the background image for GUI-based systems.
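
For command-line systems, one simple way to do this is to put the
hostname in the shell prompt. The example below is for bash and uses
the usual configuration files; it is a common convention rather than
anything specific to cluster software:

    # Add to /etc/profile (system-wide) or ~/.bashrc (per user) so every
    # prompt shows the user and machine name, e.g. [jds@node07 ~]$
    export PS1='[\u@\h \W]\$ '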

There are a number of different switches available. Avocent even sells
a KVM switch that operates over IP and can be used with remote
clusters. Some KVM switches can be very pricey, so be sure to shop
around. Don't forget to include the cost of cables
when pricing KVM switches. Frequently, these are not included with
the switch and are usually overpriced. You'll need a
set for every machine you want to leave connected, but not
necessarily every machine.

The interaction between the system and the switch may provide a
surprise or two. As previously noted, some systems
don't allow booting without a keyboard, i.e., there
is no CMOS override for booting without a keyboard. A KVM switch may
be able to fool these systems. Such systems may detect a keyboard
when connected to a KVM switch even when the switch is set to a
different system. On the other hand, if you are installing Linux on a
computer and it probes for a monitor, unless the switch is set to
that system, the monitor won't be found.


Keep in mind, both the crash cart and the KVM switch approaches
assume that individual machines have display adapters.

For this reason, you should seriously consider including a video card
even when you are going with headless systems. Very inexpensive
cards or integrated adapters can be used since you
won't need anything fancy. Typically, embedded video
will only add a few dollars to the price of a motherboard.

One other possibility is to use serial consoles. Basically, the idea
is to replace the attached monitor and keyboard with a serial
connection to a remote system. With a fair amount of work, most Linux
systems can be reconfigured to work in this manner. If you are using
rack-mount machines, many of them support serial console redirection
out of the box. With this approach, the systems use a connection to a
serial port to eliminate the need for a KVM switch. Additional
hardware is available that will allow you to multiplex serial
connections from a number of machines. If this approach is of
interest, consult the Remote Serial Console HOWTO at http://www.tldp.org/HOWTO/Remote-Serial-Console-HOWTO/.
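
As a rough sketch of what that HOWTO describes (the kernel path, root
partition, serial device, and speed below are assumptions you would
adjust for your own hardware), redirecting the console takes a kernel
parameter plus a login process on the serial port:

    # In the GRUB kernel line (e.g., in /boot/grub/grub.conf), send console
    # output to the first serial port as well as the local display:
    kernel /vmlinuz ro root=/dev/hda1 console=tty0 console=ttyS0,9600n8

    # In /etc/inittab, start a login prompt on the same serial line:
    S0:2345:respawn:/sbin/agetty -L 9600 ttyS0 vt100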


3.1.1.4 Adapters, power supplies, and cases

As just noted, you should include a video adapter. The network
adapter is also a key component. You must buy an adapter that is
compatible with the cluster network. If you are planning to boot a
diskless system over the network, you'll need an
adapter that supports it. This translates into an adapter with an
appropriate network BOOT ROM, i.e., one with Preboot Execution
Environment (PXE) support. Many adapters come with a
built-in (but empty) BOOT ROM socket so that the ROM can be added.
You can purchase BOOT ROMs for these cards or burn your own. However,
it may be cheaper to buy a new card with an installed BOOT ROM than
to add the BOOT ROMs. And unless you are already set up to burn ROMs,
you'll need to be using several machines before it
becomes cost effective to buy an EPROM burner.

To round things out, you'll need something to put
everything in and a way to supply power, i.e., a case and power
supply. With the case, you'll have to balance
keeping the footprint small and having room to expand your system. If
you buy too small a power supply, it won't meet your
needs or allow you to expand your system. If you buy too large a
power supply, you waste money and space. If you add up the power
requirements for your individual components and add in another 50
percent as a fudge factor, you should be safe.
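
For example, with invented component wattages (they are placeholders,
not measurements), the calculation is just a sum plus 50 percent:

    # Add up the component power draws and apply a 50% fudge factor.
    WATTS="70 30 15 10"               # hypothetical: CPU, motherboard, disk, fans
    TOTAL=0
    for W in $WATTS; do TOTAL=$((TOTAL + W)); done
    echo "Components draw ${TOTAL} W; buy a supply rated for at least $(( TOTAL * 3 / 2 )) W"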

One last word about node selection: while we have considered
components individually, you should also think about the system
collectively before you make a final decision. If collectively the
individual systems generate more heat than you can manage, you may
need to reconsider how you configure individual machines. For
example, Google is said to use less-powerful machines in its clusters
in order to balance computation needs with total operational costs, a
judgment that includes the impact of cooling
needs.


3.1.2 Cluster Head and Servers


Thus far, we have been looking at the
compute nodes within the cluster. Depending on your configuration,
you will need a head node and possibly additional servers. Ideally,
the head node and most servers should be complete systems since it
will add little to your overall cost and can simplify customizing and
maintaining these systems. Typically, there is no need for these
systems to use the same hardware that your compute nodes use. Go for
enhancements that improve performance, even ones you might not be able
to afford on every node. These machines are the place for large, fast
disks and lots of fast memory. A faster processor is also in order.

On smaller clusters, you can usually use one machine as both the head
and as the network file server. This will be a dual-homed machine
(two network interfaces) that serves as an access point for the
cluster. As such, it will be configured to limit and control access
as well as provide it. When the services required by the network file
systems put too great a strain on the head node, the network file
system can be moved to a separate server to improve performance.
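
As a minimal illustration, a dual-homed head node simply has one
interface on the outside network and one on the private cluster
network. The interface names and addresses below are arbitrary
examples; cluster kits such as OSCAR and Rocks will normally set this
up for you:

    # eth0 faces the campus or office network; eth1 faces the private cluster LAN.
    ifconfig eth0 192.168.0.10 netmask 255.255.255.0 up
    ifconfig eth1 10.0.0.1 netmask 255.255.255.0 up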

If you are setting up systems as I/O servers for a parallel file
system, it is likely that you'll want larger and
faster drives on these systems. Since you may have a number of I/O
servers in a larger cluster, you may need to look more closely at
cost and performance trade-offs.


3.1.3 Cluster Network


By definition, a cluster is a
networked collection of computers. For commodity clusters, networking
is often the weak link. The two key factors to consider when
designing your network are bandwidth and latency. Your application or
application mix will determine just how important these two factors
are. If you need to move large blocks of data, bandwidth will be
critical. For real-time applications or applications that have lots
of interaction among nodes, minimizing latency is critical. If you
have a mix of applications, both can be critical.

It should come as no surprise that a number of approaches and
products have been developed. High-end
Ethernet is
probably the most common choice for clusters. But for some
low-latency applications, including many real-time applications, you
may need to consider specialized low-latency hardware. There are a
number of choices. The most common alternative to Ethernet is Myrinet
from Myricom, Inc. Myrinet is a proprietary solution providing
high-speed bidirectional connectivity (currently about 2 Gbps in each
direction) and low latencies (currently under 4 microseconds).
Myrinet uses a source-routing strategy and allows arbitrary length
packets.

Other competitive technologies that are emerging or are available
include cLAN from Emulex, QsNet from Quadrics, and Infiniband from the
Infiniband consortium. These are high-performance solutions and this
technology is rapidly changing.

The problem with these alternative technologies is their extremely
high cost. Adapters can cost more than the combined cost of all the
other hardware in a node. And once you add in the per node cost of
the switch, you can easily triple the cost of a node. Clearly, these
approaches are for the high-end systems.

Fortunately, most clusters will not need this extreme level of
performance. Continuing gains in speed and rapidly declining costs
make Ethernet the network of choice for most clusters. Now that
Gigabit
Ethernet is well established and 10 Gigabit Ethernet has entered the
marketplace, the highly expensive proprietary products are no longer
essential for most needs.

For Gigabit Ethernet, you will be better served with an
embedded adapter rather than an add-on
PCI board since Gigabit can swamp the PCI bus. Embedded adapters use
workarounds that take the traffic off the PCI bus. Conversely, with
100BaseT, you may prefer a separate adapter rather than an embedded
one since an embedded adapter may steal clock cycles from your
applications.

Unless you are just playing around, you'll probably
want, at minimum, switched Fast Ethernet. If your goal is just to
experiment with clusters, almost any level of networking can be used.
For example, clusters have been created using FireWire ports. For two
(or even three) machines, you can create a cluster using crossover
cables.

Very high-performance clusters may have two
parallel networks. One is used for
message passing among the nodes, while the second is used for the
network file system. In the past, elaborate technology,
architectures, and topologies have been developed to optimize
communications. For example, channel bonding uses multiple interfaces
to multiplex channels for higher bandwidth. Hypercube topologies have
been used to minimize communication path length. These approaches are
beyond the scope of this book. Fortunately, declining networking
prices and faster networking equipment have lessened the need for
these approaches.

