Professional Documents
Culture Documents
Pteris Ledi
University of Latvia Nov 16, 2004
What is Beowulf
Historically
Beowulf is the earliest surviving epic poem written in English. It is a story about a hero of great strength and courage who defeated a monster called Grendel. See History to find out more about the Beowulf hero. Famed was this Beowulf: far flew the boast of him, son of Scyld, in the Scandian lands. So becomes it a youth to quit him well with his father's friends, by fee and gift, that to aid him, aged, in after days, come warriors willing, should war draw nigh, liegemen loyal: by lauded deeds shall an earl have honor in every clan.
Computer related
The Beowulf architecture was first introduced by scientists (Thomas Sterling and Don Becker) at NASA who were trying to develop a supercomputer class of machine using inexpensive off-the-shelf computer technology.
COW
COW cluster of Workstations
A set of computers with one server with user /home and /usr/local directories via NFS. All clients have local *nix installed with rsh remote shell Workstations used by people at day, by computing processes at nights.
COW vs Beowulf
Ordinary nodes of Beowulf have no videocards, keyboards, etc. Users see Beowulf as one computer
Global PID
Hardware classification
From Beowulf how-to:
A CLASS I Beowulf is a machine that can be assembled from parts found in at least 3 nationally/globally circulated advertising catalogs Computer shopper test A CLASS II Beowulf is simply any machine that does not pass the Computer Shopper certification test
Beowulfery
Explore the possibilities of building supercomputers out of the mass-market electronic equivalent of coat hangers and chewing gum. Open source
Beowulf design is best driven (and extended) by one's needs of the moment and vision of the future and not by a mindless attempt to slavishly follow the original technical definition anyway.
Software
Building clusters is straightforward, but managing its software can be complex. Oscar (Open Source Cluster Application Resources) Scyld scyld.com from the scientists of NASA - commercial
OSCAR
http://oscar.openclustergroup.org/ Cluster on a CD automates cluster install process Wizard driven Can be installed on any Linux system, supporting RPMs Components: Open source and BSD style license
Rocks
Award Winning Open Source High Performance Linux Cluster Solution The current release of NPACI Rocks is 3.3.0. Rocks is built on top of RedHat Linux releases Two types of nodes
Frontend
Compute
Disk drive for caching the base operating environment (OS and libararies)
Rocks Base CD, HPC Roll CD and Kernel Roll CD Bootable Base CD User-friendly wizard mode installation
Cluster information Local hardware Both ethernet interfaces
Run a program which captures compute node DHCP requests and puts their information into the Rocks MySQL database
Nodes divided in cabinets Possible to install compute nodes of different architecture than frontend
MPI
MPI is a software systems that allows you to write message-passing parallel programs that run on a cluster, in Fortran and C.
MPI (Message Passing Interface) is a defacto standard for portable message-passing parallel programs standardized by the MPI Forum and available on all massively-parallel supercomputers.
Parallel programming
Mind that memory is distributed each node has its own memory space Decomposition divide large problems into smaller Use mpi.h for C programs
Message passing
Message Passing Program consists of multiple instance of serial program that communicate by library calls. These calls may be roughly divided into four classes:
Calls used to initialise, manage and finally terminate communication Calls used to communicate between pairs of processors Calls that communicate operations among group of processors Calls used to create arbitrary data type
Helloworld.c
#include <stdio.h> #include <mpi.h> int main(int argc, char *argv[]) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); printf("Hello world! I am %d of %d\n", rank, size); MPI_Finalize(); return(0); }
Communication
/* * The root node sends out a message to the next node in the ring and * each node then passes the message along to the next node. The root * node times how long it takes for the message to get back to it. */ #include<stdio.h> /* for input/output */ #include<mpi.h> /* for mpi routines */ #define BUFSIZE 64 /* The size of the messege being passed */ main( int argc, char** argv) { double start,finish; int my_rank; /* the rank of this process */ int n_processes; /* the total number of processes */ char buf[BUFSIZE]; /* a buffer for the messege */ int tag=0; /* not important here */ MPI_Status status; /* not important here */ MPI_Init(&argc, &argv); /* Initializing mpi */ MPI_Comm_size(MPI_COMM_WORLD, &n_processes); /* Getting # of processes */ MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); /* Getting my rank */
Communication again
/* * If this process is the root process send a messege to the next node * and wait to recieve one from the last node. Time how long it takes * for the messege to get around the ring. If this process is not the * root node, wait to recieve a messege from the previous node and * then send it to the next node. */ start=MPI_Wtime(); printf("Hello world! I am %d of %d\n", my_rank, n_processes); if( my_rank == 0 ) { /* send to the next node */ MPI_Send(buf, BUFSIZE, MPI_CHAR, my_rank+1, tag, MPI_COMM_WORLD); /* receive from the last node */ MPI_Recv(buf, BUFSIZE, MPI_CHAR, n_processes-1, tag, MPI_COMM_WORLD, &status); }
Compiling
Compile code using mpicc MPI C compiler.
Rocks computing
Mpirun on Rocks clusters is used to launch jobs that are linked with the Ethernet device for MPICH.
"mpirun" is a shell script that attempts to hide the differences in starting jobs for various devices from the user. On workstation clusters, you must supply a file that lists the different machines that mpirun can use to run remote jobs MPICH is an implementation of MPI, the Standard for message-passing libraries.
software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers.
Create a file in your home directory named machines, and put two entries in it, such as:
compute-0-0 compute-0-1
Download the the two-processor HPL configuration file and save it as HPL.dat in your home directory. Now launch the job from the frontend:
Rocks cluster-fork
Same jobs of standard Unix commands on different nodes
By default, cluster-fork uses a simple series of ssh connections to launch the task serially on every compute node in the cluster. My processes on all nodes:
$ cluster-fork ps -U$USER
Hostnames of nodes:
$ cluster-fork ps -U$USER
$ cluster-fork --query="select name from nodes where name like 'compute-1-%'" [cmd] Or use --nodes=compute-0-%d:0-2
optimally place computing tasks and balance the load on a set of networked computers allow users to generate and queue more computing tasks than can be run at the moment ensure that tasks are executed with respect to priority and to providing all users with a fair share of access over time
Give parameters for the processes to be executed and how should they be executed
Run script: qsub testjob.sh Query status: qstat -f STDOUT, STDERR output is kept in files using name of the script that is run
The other 2 files pertain to Grid Engine status and they are named:
testjob.sh
#!/bin/bash #$ -S /bin/bash # # Set the Parallel Environment and number of procs. #$ -pe mpi 2 # Where we will make our temporary directory. BASE="/tmp" # make a temporary key export KEYDIR=`mktemp -d $BASE/keys.XXXXXX` # Make a temporary password. // Makepasswd is quieter, and presumably more efficient. //We must use the -s 0 flag to make sure the password contains no quotes. if [ -x `which mkpasswd` ]; then export PASSWD=`mkpasswd -l 32 -s 0` else export PASSWD=`dd if=/dev/urandom bs=512 count=100 | md5sum | gawk '{print $1}'` fi
testjob.sh
/usr/bin/ssh-keygen -t rsa -f $KEYDIR/tmpid -N "$PASSWD" cat $KEYDIR/tmpid.pub >> $HOME/.ssh/authorized_keys2 # make a script that will run under its own ssh-agent # cat > $KEYDIR/launch-script <<"EOF" #!/bin/bash expect -c 'spawn /usr/bin/ssh-add $env(KEYDIR)/tmpid' -c \ 'expect "Enter passphrase for $env(KEYDIR)/tmpid" \ { send "$env(PASSWD)\n" }' -c 'expect "Identity"' echo # Put your Job commands here. #-----------------------------------------------/opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines \ /opt/hpl/gnu/bin/xhpl #-----------------------------------------------EOF chmod u+x $KEYDIR/launch-script
testjob.sh
# start a new ssh-agent from scratch -- make it forget previous ssh-agent connections unset SSH_AGENT_PID unset SSH_AUTH_SOCK /usr/bin/ssh-agent $KEYDIR/launch-script # # cleanup # grep -v "`cat $KEYDIR/tmpid.pub`" $HOME/.ssh/authorized_keys2 \ > $KEYDIR/authorized_keys2 mv $KEYDIR/authorized_keys2 $HOME/.ssh/authorized_keys2 chmod 644 $HOME/.ssh/authorized_keys2 rm -rf $KEYDIR
Monitoring Rocks
Set of web pages to monitor activities and configuration
Apache webserver with access from internal network only From outside - viewing webpages involves sending a web browser screen over a secure, encrypted SSH channel
ssh frontendsite, start Mozilla there, see http://localhost mozilla --no-remote Modify IPtables
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
Default services
411 Secure information service
Distribute files to nodes password changes, login files In cron run every hour
Sources
http://www.tldp.org/HOWTO/Beowulf-HOWTO.html http://www.cs.iusb.edu/beowulf.html
http://www.beowulf.org http://www.rocksclusters.org/Rocks/