Having built a Beowulf cluster using instructions
found on the Internet or in popular magazines, some zealous individuals
are disgusted to discover that their favorite word processor and
spreadsheet packages will not run on their powerful new creation.
This is the case because unless a particular application has been
specifically programmed to use multiple processors, it must be
rewritten, or parallelized, before it will be able to run on either a
parallel supercomputer or a cluster of PCs. Moreover, since Beowulf
clusters consist of computational and network elements that may be
combined to optimally solve a particular problem or run a particular
algorithm, the target application should play a significant role in the
design of any new cluster.
Parallel computing is accomplished by splitting up a large
computational problem into smaller tasks that may be performed
simultaneously by multiple processors. For example, the addition of two
very long vectors of numbers, A and B, can be performed by two
processors if one of them adds the first half of vector A to the first
half of vector B, while the second adds the second half of vector A to
the second half of vector B. While this theoretically halves the time
needed to solve the problem, the resulting vector, C, is now split
across two different processor memories.
Because of this
split, communication must occur to get the entire solution in one
place. The distribution of the initial data, A and B, and the
collection of the result, C, adds overhead to the computational problem
and reduces the actual speedup of the entire task to less than double.
Symmetric
multi-processor (SMP) and other shared memory computers can be used to
reduce the amount and cost (in terms of time) of this communication;
however, these systems typically have only a small number of processors
or are very expensive, custom-built supercomputers. On the other hand,
distributed memory platforms -- including Beowulf clusters -- are
relatively inexpensive and can be scaled to hundreds or thousands of
processors.
Most clusters built today are hybrids; they
consist of many nodes (i.e., individual computers), each having two or
more processors.
Decomposition and Granularity
Computational
problems may be parallelized in a variety of ways. Parallelization may
be accomplished by decomposing the data (as in the vector addition
example above), by decomposing functionality so that one processor
performs one type of operation while other processors simultaneously
perform different operations, or by decomposing both the data and the
functionality.
This decomposition may be established a priori
or, more often, is performed dynamically once the program is running.
Good parallel code most often automatically decomposes the problem at
hand and allows the processors to communicate with each other when
necessary -- this is called "message passing" -- while performing
individual tasks.
Not all computational problems are amenable
to parallel computing. If an algorithm cannot be restructured so that
sub-tasks can be performed simultaneously or if the model components
are highly interdependent, attempts to parallelize these codes may
result in increased time-to-solution. Such "fine grained" problems do
not scale well as more processors are applied to the computation.
Performance of the finest-grained problems is limited by the speed of
the fastest single CPU that is available.
Other computational
problems, such as image processing where which each pixel may be
manipulated independently, are "coarse grained." Image processing is a
good example of this type of problem. These problems are generally
easier to parallelize and tend to benefit the most from parallel
processing, particularly in distributed memory environments. The
coarsest grained problems are often referred to as "embarrassingly
parallel."
Fortunately, most complex scientific problems may
be decomposed by performing separate tasks independently and
simultaneously on multiple processors or by splitting up the space
and/or time coordinates of the system being modeled. These problems
tend to fall somewhere between coarse and fine granularity and usually
require a moderate amount of interprocessor communication to coordinate
activities and share data.
For example, values for cells on
a map may depend on neighboring cell values. If the map is decomposed
into two pieces, each being processed on a separate CPU, the processors
must exchange cell values along the adjacent edges of the map.
Problem
decomposition is very important for successful parallel processing. A
balance must be struck between computation and communication so that
what was a computational problem on a single processor does not become
a communications problem in a parallel environment. Writing good
parallel code is actually more of an art than a science; practitioners
must be able to think about algorithms in novel ways.
Message Passing
Many alternatives
for code parallelization have been developed, but explicit
message-passing strategies have met with the most success on a wide
variety of applications and platforms. The most popular parallel
environments and applications programming interfaces (APIs) are MPI
(Message Passing Interface) and PVM (Parallel Virtual Machine). Because
these two APIs are widely used, parallel code that performs message
passing using these libraries can be run on every platform from laptops
to the largest commercial supercomputers without changing the source
code.
Both PVM and MPI are available on a wide range of
computer platforms and have bindings for C, C++, and FORTRAN. The two
most popular MPI implementations are LAM (Local Area Multicomputer) and
MPICH (MPI Chameleon). While PVM has task scheduling and advanced
features that are not available in MPI, MPI is increasingly used for
code development because it's based on a community standard and has
adequate features for most parallel applications.
All three of
these implementations work well on Beowulf clusters and are easy to
install under Linux. One or more of these libraries is often included
in standard Linux distributions. RPMs and other types of installation
packages are available on their respective Web sites: http://www.epm.ornl.gov/pvm/ for PVM, http://www.lam-mpi.org/ for LAM, and http://www-unix.mcs.anl.gov/mpi/mpich/for MPICH.
Using MPI and PVM
The desired
message passing environment should be downloaded and installed on all
the computers that will be used for parallel processing. On a Beowulf
cluster with a shared filesystem, a single installation can be made
accessible to all nodes (for more information on Beowulf cluster file
system topologies, please consult the February 2002 issue).
In the case of MPICH, the installation will create the file /usr/local/mpich/share/machines.
This file specifies which nodes in the cluster are available for use
and should usually contain the names of the all the nodes in the
cluster (a machines file may also be specified at runtime for finer control). If LAM is used, the lamboot command should be executed to initiate a daemon on each node.
If PVM is used, the PVM shell can be used to construct the desired virtual machine and start up daemons on each node. The PVM_ROOT environment variable should be set for each host, and user-created binaries should be placed in a special directory, pvm3/ bin/ARCH/, below the user's home directory, where ARCH should be replaced with the appropriate architecture name.
MPI
programs can be compiled in many ways, but most MPI implementations
provide an easy-to-use script that will set desired compiler flags,
point the compiler at the right directory for MPI header files, and
include the necessary libraries for the linker. MPICH and LAM both
provide a script called mpicc that is used for compiling MPI codes.
Figure One shows how to compile and run a simple "Hello World!" program called mpi_hello (see Listing One, ). After compiling with mpicc, the mpirun command executes the program. The -np flag tells mpirun how many processes to start. mpirun
starts one process on the local node (no matter how many processors are
contained in the node), then initiates one process on each node listed
in the machines file. If the number of processes that are
specified exceeds the number of nodes available, additional processes
are created in round-robin fashion until all processes have been
started.
Figure One: Building and Running with MPI
[forrest@node01 mpi]$ mpicc -O -o mpi_hello mpi_hello.c [forrest@node01 mpi]$ mpirun -np 6 mpi_hello Hello World! I'm rank 4 of 6 on node05 Hello World! I'm rank 1 of 6 on node02 Hello World! I'm rank 5 of 6 on node06 Hello World! I'm rank 2 of 6 on node03 Hello World! I'm rank 3 of 6 on node04 Hello World! I'm rank 0 of 6 on node01
Listing One: mpi_hello.c
#include #include "mpi.h"
int main(int argc, char **argv) { int myrank, nprocs, namelen; char processor_name[MPI_MAX_PROCESSOR_NAME];
printf("Hello World! I'm rank %d of %d on %s\n", myrank, nprocs, processor_name);
MPI_Finalize(); return 0; }
Each
process is assigned a rank: a numeric identifier starting at zero that
indicates the order in which processes were initiated. The name of the
executable file and any command-line arguments are the final arguments
to mpirun.
In this example, six processes were started on different nodes (node01 through node06),
and each process determined its own rank (0 through 5) and printed out
the total number of processes involved in the job (6). While each
process started at the same time, the printed output appears in no
particular order. This is normal behavior when multiple processes all
print at the same time.
In order to successfully compile the code, the MPI header file (mpi.h) must be included at the top. Just inside main(), MPI_Init()
must be called and handed the command-line arguments so that the
environment is set up correctly for the program to be able to run in
parallel.
The next three MPI routines return information about
the parallel environment. Although this example program merely prints
out the information, it is usually used to do automatic problem
decomposition and to set up communication between processes.
The MPI_Comm_size() routine returns the number of processes (subsequently stored in nprocs) in the communicator group MPI_COMM_WORLD. MPI_COMM_WORLD
is a special communicator that denotes all of the processes available
at initialization. The rank of the calling process (ranging from 0 to nprocs - 1) is provided by MPI_Comm_rank(). The rank is stored in myrank. MPI_Get_processor_ name() provides the hostname of the node (not the individual processor) that is being used, which is stored in processor_name, and the length of this hostname, which is stored in namelen.
Next, the code prints "Hello World!" and the values of the variables obtained in the three previous MPI calls. Finally, MPI_Finalize() is called to terminate the parallel environment.
A similar program can be written using PVM. In this case, the PVM
shell is used to construct a virtual machine of the desired size by
adding hosts using the add command (see Figure Two). The virtual machine configuration may be verified using the conf command.
Figure Two: Building and Running with PVM
[forrest@node01 forrest]$ cd pvm3/bin/LINUX [forrest@node01 LINUX]$ gcc -O -I/usr/share/pvm3/include / -L/usr/share/pvm3/lib/LINUX -o pvm_hello pvm_hello.c -lpvm3 [forrest@node01 LINUX]$ pvm pvm> add node02 node03 node04 add node02 node03 node04 3 successful HOST DTID node02 80000 node03 c0000 node04 100000 pvm> spawn -4 -> pvm_hello spawn -4 -> pvm_hello [1] 4 successful t80001 tc0001 t100001 t40002 [1:t40002] Hello World! My task id is t40002 on node01 with 4 hosts. [1:t40002] EOF [1:t80001] Hello World! My task id is t80001 on node02 with 4 hosts. [1:t80001] EOF [1:tc0001] Hello World! My task id is tc0001 on node03 with 4 hosts. [1:tc0001] EOF [1:t100001] Hello World! My task id is t100001 on node04 with 4 hosts. [1:t100001] EOF [1] finished pvm> halt halt Terminated
Listing Two: pvm_hello.c
#include #include "pvm3.h" > int main(int argc, char **argv) { int i, mytid, dtid, info, nhost, narch; struct pvmhostinfo *hostp;
for (i = 0; i < nhost && hostp[i].hi_tid != dtid; i++); printf("Hello World! My task id is t%x on %s with %d hosts.\n", mytid, hostp[i].hi_name, nhost);
pvm_exit(); return 0; }
Figure Two shows how the pvm_ hello program in Listing Two is compiled and run. In this example, the PVM shell is used to execute the program by invoking it with the spawn command. The number of tasks is specified via the -4 flag, and the output is sent to the display (hence the ->
flag). In this example, one process is spawned on each of the four
nodes included in the virtual machine (again, regardless of how many
processors that node may have), and each process prints its task
identifier and host name. The PVM shell is exited using the halt command so that the virtual machine is disassembled (i.e., the PVM daemons on each of the hosts are terminated).
Listing Two shows the PVM calls necessary to start and end a simple program. The PVM header file (pvm3.h) is included at the top of the code. The pvm_mytid() call returns the task ID of the calling process, and the pvm_ tidtohost() call returns the task ID of the daemon running on the host used by the task mytid. The pvm_ config()
call returns information about the configuration of the virtual
machine, including the total number of hosts, the number of different
architectures, as well as some additional host-specific information.
After calling these routines, the code prints "Hello World!" and the
values obtained from these library calls. Finally, pvm_ exit() is called to terminate the parallel environment.
The Tip of the Iceberg
This brief
introduction to MPI and PVM has merely demonstrated the routines used
for parallel program initiation and termination. Message passing and
data reduction routines are needed for most useful parallel codes. Many
of these routines will be presented in the next Extreme Linux column.