
Building of a Virtual Cluster from Scratch

Yu Zhang, Computer Architecture Group, Chemnitz University of Technology, January 25, 2011

Abstract
Computing clusters usually run on physical computers. With a virtualization approach, clusters can also be virtualized. This article describes the building of a virtual cluster based on VirtualBox.

Contents
1 VirtualBox Installation
2 Creation of Virtual Machines
  2.1 Creation of New Machines
  2.2 Normal Installation on Master Node
  2.3 Minimal Installation on Slave Nodes
  2.4 Network Configuration
3 Application of SLURM
  3.1 Possible Problems During Installation
  3.2 Installation and Configuration
  3.3 Automatic Startup when Booting
4 Cluster Network Configuration
  4.1 Hostnames
  4.2 IP Addresses
  4.3 Host List
  4.4 Password-less SSH
  4.5 Network File System
5 Test with Applications
  5.1 Simple MPI Program Test
  5.2 Tachyon Ray Tracer Test
6 Further Work
A Removing Virtual Machine
B A Bug in NFS

1 VirtualBox Installation
There are two basic editions of VirtualBox: VirtualBox and VirtualBox OSE (Open Source Edition). Both provide almost the same functionality, apart from some different features targeting different customers. In Ubuntu the command

sudo apt-get install virtualbox-ose

installs the VirtualBox OSE package, while

wget http://download.virtualbox.org/virtualbox/4.0.0/ \
virtualbox-4.0_4.0.0-69151~Ubuntu~lucid_amd64.deb
sudo dpkg -i virtualbox-4.0_4.0.0-69151~Ubuntu~lucid_amd64.deb

installs the VirtualBox package, where virtualbox-4.0_4.0.0-69151~Ubuntu~lucid_amd64.deb can also be replaced by

* virtualbox-4.0_4.0.0-69151~Ubuntu~maverick_amd64.deb
* virtualbox-4.0_4.0.0-69151~Ubuntu~karmic_amd64.deb
* virtualbox-4.0_4.0.0-69151~Ubuntu~jaunty_amd64.deb
* virtualbox-4.0_4.0.0-69151~Ubuntu~hardy_amd64.deb
* virtualbox-4.0_4.0.0-69151~Debian~squeeze_amd64.deb
* virtualbox-4.0_4.0.0-69151~Debian~lenny_amd64.deb
* ...

according to the distribution and version of the host OS. The package architecture has to match the Linux kernel architecture, that is, install the appropriate AMD64 package for a 64-bit CPU. It does not matter whether it is an Intel or an AMD CPU.
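Since these file names all follow a fixed pattern, the right package name can be assembled mechanically; a small sketch (the function name pick_vbox_deb is our choice, and the build number 69151 is simply taken from the names above):

```shell
# Assemble a VirtualBox 4.0 package file name from the distribution
# family, release codename and architecture, following the pattern above.
pick_vbox_deb() {
  local family="$1" codename="$2" arch="$3"
  echo "virtualbox-4.0_4.0.0-69151~${family}~${codename}_${arch}.deb"
}

pick_vbox_deb Ubuntu lucid amd64
# prints: virtualbox-4.0_4.0.0-69151~Ubuntu~lucid_amd64.deb
```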

2 Creation of Virtual Machines


Although many Linux distributions can be used on a cluster, a carefully selected one simplifies building and management, and hence saves a considerable amount of effort in the future. One principle worth keeping in mind is: uniformity makes simplicity, that is, apply the same Linux distribution on the master node as well as on the slave nodes. For its pleasant features, Debian is considered a suitable choice for cluster building.

(a) before

(b) later

Figure 1: VirtualBox Manager

2.1 Creation of New Machines


When VirtualBox is started, a window like Figure 1(a) should come up. The window is called the VirtualBox Manager. On the left there is a pane which will later list all the installed guest machines. Since no guest has been created yet, it is empty. A row of buttons above allows a guest machine to be created and existing machines to be operated, if any. The pane on the right displays the properties of the machine currently selected. After several machines have been installed, the VirtualBox Manager may look like Figure 1(b). It is simple to create new machines within the VirtualBox Manager. Click on the menu Machine and select New, or just press the button New, then follow the dialog boxes and make choices for your machine's basic configuration. The main steps are shown in Figure 2.

2.2 Normal Installation on Master Node


A normal Debian installation with GUI makes cluster management easier. 8 GB of hard disk space is necessary for possible further packages. It can certainly be installed from a full .iso image; here, as an alternative, we tried the Debian netinst .iso, with which all packages except the base system are installed from a Debian mirror site instead of from a local CD. Later, any update can be installed with the package manager as long as Internet access is available, saving the effort of manually updating the corresponding entries in the file /etc/apt/sources.list

(a) VM Name and OS Type

(b) Virtual Harddisk

(c) Memory

(d) Virtual Disk Location and Size

Figure 2: Machine Creating Steps

2.3 Minimal Installation on Slave Nodes


Compared with the master node, the work performed on a slave node is relatively simple. Therefore only the base system of a Debian installation is enough; for this, 1 GB of hard disk space is needed. To keep the installation simple and uniform, the best practice is first to make a standard slave node image by installing Debian and all the necessary packages on a single guest, then to clone the hard disk of this image into as many copies as desired. Next, the hostnames and IP addresses need to be changed according to the order listed on the left pane, for further convenience. A scene taken from the virtual disk image cloning process is presented here. After the clone has completed successfully, create a new machine in the VirtualBox Manager with the cloned disk as its storage.

Figure 3: master node

Figure 4: virtual disk image clone
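The cloning step can also be scripted with VBoxManage clonehd; the following dry run only prints the commands (drop the echo to execute; slave.vdi and the nodeN.vdi names are our choices, not fixed by VirtualBox):

```shell
# Print the VBoxManage commands that would clone the standard slave
# image slave.vdi into disks for node2..node9 (echo = dry run).
for i in $(seq 2 9); do
  echo VBoxManage clonehd slave.vdi "node$i.vdi"
done
```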

2.4 Network Configuration


Communication is probably the most important and complex aspect for a cluster designer to consider. VirtualBox provides several kinds of networks, each with many network adapters as alternatives. The following graphics present the network configurations on the master and slave nodes respectively. It is important to have the same ethernet card

(a) ethernet 0 on master node

(b) ethernet 1 on master node

(c) ethernet 1 on slave node

Figure 5: ethernet adapter configurations

on each node. If not, say, if with the command /sbin/ifconfig ethernet cards other than eth1 appear, put the right ethernet card entries in the file /etc/udev/rules.d/70-persistent-net.rules, then reload the driver and restart the services like this,

sudo modprobe -r e1000
sudo modprobe e1000
sudo /etc/init.d/udev restart
sudo /etc/init.d/networking restart
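The edit to the persistent-net rules can be tried safely on a copy first; a sketch using a temporary file in place of /etc/udev/rules.d/70-persistent-net.rules (the MAC address and the eth2-to-eth1 rename are made-up examples):

```shell
# Rewrite the NAME= entry of a persistent-net rule so the card comes up
# as eth1; done here on a temporary copy instead of the real file.
rules=$(mktemp)
cat > "$rules" <<'EOF'
SUBSYSTEM=="net", ATTR{address}=="08:00:27:aa:bb:cc", NAME="eth2"
EOF
sed -i 's/NAME="eth2"/NAME="eth1"/' "$rules"
cat "$rules"
rm -f "$rules"
```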

3 Application of SLURM
Resource management is a non-trivial effort on a cluster with an ever-growing number of nodes. SLURM is designed for this purpose on Linux clusters of all sizes. It provides exclusive or non-exclusive resource access and monitors the present state of all nodes in a cluster. Beginners tend to run into trouble with the SLURM installation, as I did previously.

3.1 Possible Problems During Installation


In principle, SLURM can be installed by following the steps below,

wget http://en.sourceforge.jp/projects/sfnet_slurm/downloads/slurm/ \
version_2.1/2.1.15/slurm-2.1.15.tar.bz2/
tar xvjf slurm-2.1.15.tar.bz2
cd slurm-2.1.15
./configure
make
sudo make install

Some common error messages during configure are listed here.

1. Lack of GCC

Normally GCC comes with a Debian installation, but if the configure process suddenly stops like one of the following, it suggests GCC is missing.

checking build system type... auxdir/config.guess: unable to guess system type
...
UNAME_MACHINE = i686
UNAME_RELEASE = 2.6.26-2-686
UNAME_SYSTEM = Linux
UNAME_VERSION = #1 SMP Thu Nov 25 01:53:57 UTC 2010
configure: error: cannot guess build type; you must specify one
...
configure: error: in /home/zhayu/slurm-2.1.15:
configure: error: no acceptable C compiler found in $PATH

Install GCC with the command,

sudo apt-get install gcc

or from the Synaptic Package Manager.

2. Warnings in Configuration

configure: WARNING: Unable to locate NUMA memory affinity functions
...
configure: WARNING: Unable to locate PAM libraries
...
configure: WARNING: Can not build smap without curses or ncurses library
...
checking for GTK+ - version >= 2.7.1... no
*** Could not run GTK+ test program, checking why...
*** The test program failed to compile or link. See the file config.log for the
*** exact error that occured. This usually means GTK+ is incorrectly installed.
checking for mysql_config... no
configure: WARNING: *** mysql_config not found. Evidently no MySQL \
install on system.
checking for pg_config... no
configure: WARNING: *** pg_config not found. Evidently no PostgreSQL \
install on system.
...
checking for munge installation... configure: WARNING: unable to locate munge installation
...
configure: WARNING: unable to locate blcr installation

Solution:

sudo apt-get install libnuma1 libnuma-dev
sudo apt-get install libpam0g libpam0g-dev
sudo apt-get install libncurses5-dev
sudo apt-get install libgtk2.0-dev
sudo apt-get install libmysql++-dev
sudo apt-get install libpq-dev
sudo apt-get install libmunge2 libmunge-dev

configure: WARNING: Unable to locate PLPA processor affinity functions
...

Solution:

wget http://www.open-mpi.org/software/plpa/v1.1/downloads/plpa-1.1.1.tar.gz
tar xzvf plpa-1.1.1.tar.gz
cd plpa-1.1.1/
./configure
make
sudo make install

configure: WARNING: Could not find working OpenSSL library

Solution:

cd src/plugins
make
sudo make install

configure: Cannot support QsNet without librmscall
configure: Cannot support QsNet without libelan3 or libelanctrl!
configure: Cannot support Federation without libntbl
configure: WARNING: unable to locate blcr installation

Solution: remains unknown to me, but it does not matter too much for our test purpose.

3.2 Installation and Configuration


Only when the above mentioned error and warning messages no longer appear can the work be pushed forward, where other pitfalls are already lying in wait. SLURM includes an authentication service for creating and validating credentials. The security certificate as well as the key must be generated as shown,

openssl genrsa -out /usr/local/etc/slurm.key 1024
openssl rsa -in /usr/local/etc/slurm.key -pubout \
-out /usr/local/etc/slurm.cert

Last but not least, configure SLURM. The configuration file slurm.conf in our case is shown as an example. Then copy the three files to the directory /usr/local/etc/ on all nodes one by one. Take the ssh copy from node1 to node2 as an example,

scp slurm.* zhayu@node2:/home/zhayu

and on node2,

sudo cp slurm.* /usr/local/etc/

Finally the SLURM control daemon and the SLURM daemon can be started with the commands,

sudo /usr/local/sbin/slurmctld start
sudo /usr/local/sbin/slurmd start
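For orientation, a minimal slurm.conf along these lines might look as follows. This is a sketch, not the exact file from Figure 6; the node names follow the node1..node9 scheme used in this article, and the credential paths match the openssl step above:

```
ControlMachine=node1
AuthType=auth/none
CryptoType=crypto/openssl
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
NodeName=node[1-9] State=UNKNOWN
PartitionName=debug Nodes=node[1-9] Default=YES State=UP
```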

3.3 Automatic Startup when Booting


Append the following lines to /etc/rc.local so that these daemons get started at boot time, which relieves the administrator of manual cluster management and is especially important for headless nodes started with neither GUI nor terminal.

/usr/local/sbin/slurmctld start
/usr/local/sbin/slurmd start

Figure 6: configuration file for slurm


Figure 7: graphical user interface to view and modify SLURM state when successfully started


4 Cluster Network Configuration


4.1 Hostnames
We have one master node for management (and computing when needed) and eight slave nodes for computing; hence the hostnames range from node1 to node9. In Debian-based Linux, the hostname can be set in the file /etc/hostname

4.2 IP Addresses
Every node has a unique IP address within the cluster. We set 192.168.56.100 for the master node, and 192.168.56.101 to 192.168.56.108 for the 8 slave nodes respectively. IP addresses can be changed in the file /etc/network/interfaces. After that, the network needs to be restarted to apply the new address.
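A static entry in /etc/network/interfaces for the first slave node might look like this (a sketch; the interface name eth1 and the /24 netmask are assumptions matching our setup):

```
auto eth1
iface eth1 inet static
    address 192.168.56.101
    netmask 255.255.255.0
```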

4.3 Host List


Launching an SSH login with an IP address is not always convenient. A better choice is to save all the hostnames in a file for further reference when IP address-involved operations are performed. We append the following lines to the file /etc/hosts:

192.168.56.100 node1
192.168.56.101 node2
192.168.56.102 node3
192.168.56.103 node4
192.168.56.104 node5
192.168.56.105 node6
192.168.56.106 node7
192.168.56.107 node8
192.168.56.108 node9
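These entries follow a simple pattern and can be generated instead of typed, for example:

```shell
# Emit the /etc/hosts entries: node N gets address 192.168.56.(99+N).
for i in $(seq 1 9); do
  printf '192.168.56.%d node%d\n' $((99 + i)) "$i"
done
# first line: 192.168.56.100 node1
# last line:  192.168.56.108 node9
```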

4.4 Password-less SSH


Install the SSH packages on all nodes with the following command.

sudo apt-get install openssh-server openssh-client

SSH login from the master node to all slave nodes should be no problem if the network has been properly configured. To run MPI programs across nodes within a cluster, password-less SSH login is needed. Actually, saving the public key of the local node on the remote destination node does the trick. Here is an example establishing password-less SSH access from node1 to node2,

zhayu@node1:~$ ssh-keygen -t rsa
zhayu@node1:~$ ssh zhayu@node2 mkdir -p .ssh
zhayu@node2's password:
zhayu@node1:~$ cat .ssh/id_rsa.pub | ssh zhayu@node2 \
'cat >> .ssh/authorized_keys'
zhayu@node2's password:

SSH login works as follows, if properly done,

zhayu@node1:~$ ssh node2
Linux node2 2.6.26-2-686 #1 SMP Thu Sep 16 19:35:51 UTC 2010 i686

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Jan 17 19:23:14 2011
zhayu@node2:~$

If, on the other hand, a password is still demanded after this, or a message comes as the following,

Agent admitted failure to sign using the key.
Permission denied (publickey).

then simply run ssh-add on the client node. Repeat this until SSH logins from the master node to all the slave nodes require no password any more. If the following error message comes, it means the SSH server daemon has not yet been started on the SSH server.

ssh: connect to host node2 port 22: Connection refused

Issue a command like:

sudo /etc/init.d/ssh status

to see whether the SSH daemon is running on the SSH server. If not, type

sudo /etc/init.d/ssh start

to start the SSH daemon.
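The manual key copy above can also be done with ssh-copy-id, which ships with the OpenSSH client; a dry run over all slave nodes looks like this (drop the echo to execute; each node still asks for the password once):

```shell
# Print the ssh-copy-id invocations for node2..node9 (echo = dry run).
for i in $(seq 2 9); do
  echo ssh-copy-id "zhayu@node$i"
done
```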


4.5 Network File System


The Network File System (NFS) allows nodes within the entire computing cluster to share part of their file system, which is important for running parallel code like MPI programs. One or more nodes hold the file system on their physical hard disk and act as the NFS server, while the other nodes mount the file system locally. To the user, the file exists on all the nodes at once. A node acting as an NFS server needs to install the NFS related packages like this,

sudo apt-get install nfs-common nfs-kernel-server

Append a line of the form

<directory to share> <allowed machines>(options)

to the file /etc/exports on the server end, and a line of the form

<NFS server>:<remote location> <local mount point> nfs <options> 0 0

to /etc/fstab on the client end, as Figure 8 shows.

(a) setting for NFS server

(b) setting for NFS client

Figure 8: NFS settings
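A minimal sketch of the two entries for our address scheme; the shared directory /home and the mount options are illustrative choices, not the exact settings from Figure 8:

```
# /etc/exports on the NFS server (node1)
/home 192.168.56.0/24(rw,sync,no_subtree_check)

# /etc/fstab on each client
node1:/home  /home  nfs  defaults  0 0
```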


Figure 9: NFS gets mounted at boot on slave nodes

5 Test with Applications


5.1 Simple MPI Program Test
As any test at the very beginning is simple, we want to see whether our virtual cluster works with a simple example, Hello World. The program looks like the following

/* hello.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    char name[80];
    int length;

    /* note that argc and argv are passed by address */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);
    printf("Hello World MPI: processor %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

We compile and execute the program as Figure 10 shows.

Figure 10: hello world test
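Without a batch system, the compile and run steps behind Figure 10 can be sketched roughly like this; the machine file name machines.txt is our choice, and the mpicc/mpirun lines are left as comments since they need the running cluster:

```shell
# Build a machine file listing node1..node9 for mpirun.
seq 1 9 | sed 's/^/node/' > machines.txt
cat machines.txt
# Then, on the cluster:
#   mpicc hello.c -o hello
#   mpirun -np 9 -machinefile machines.txt ./hello
```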

5.2 Tachyon Ray Tracer Test


Tachyon is a library developed for parallel graphics rendering with support for both distributed and shared memory parallel models, which makes it an ideal benchmark application for clusters. We apply it here to illustrate how to run an MPI parallel application on a cluster. Download and build it like this,

wget http://jedi.ks.uiuc.edu/~johns/raytracer/ \
files/0.98.9/tachyon-0.98.9.tar.gz
tar xvzf tachyon-0.98.9.tar.gz
cd tachyon
make

A plain make lists all the architectures Tachyon supports; linux-beowulf-mpi is the one matching our case, so build it with,

make linux-beowulf-mpi

As Tachyon will be started by MPI, mpicc should be used instead of gcc. So change the option CC=gcc to CC=mpicc within the linux-beowulf-mpi section of the file tachyon/unix/Make-arch, or else the following error messages come and the make process is aborted.

make[2]: *** [../compile/linux-beowulf-mpi/libtachyon/parallel.o] Error 1
make[1]: *** [all] Error 2
make: *** [linux-beowulf-mpi] Error 2

Figure 11: Beowulf cluster architecture

With Tachyon, we would like to illustrate not only parallel rendering on a 9-node virtual cluster, but also a brief application of the above mentioned batch system SLURM. The more nodes a cluster embraces, the more obvious its power becomes. Here is a simple example of a SLURM script, containing all the necessary task specifications. Then submit the task and wait for its completion.

Figure 12: slurm job submitting and outcome

#!/bin/bash
#SBATCH -n 9
mpirun tachyon/compile/linux-beowulf-mpi/tachyon \
tachyon/scenes/dna.dat -fullshade -res 4096 4096 -o dna2.tga

Submit the SLURM job with the command,

sbatch ./task1.sh

All results will be saved in a specified file. When the SLURM job is submitted to the batch system, the available computing nodes are allocated for the task. The performance speedup achieved by 9 nodes is presented in Figure 14 below. Tests were made with different task allocation methods, namely 1PN9, 3PN9, 1P1N and 9P1N, that is, an arbitrary number of processes on an arbitrary number of nodes. We can see clearly from the graphic the speedup a 9-node virtual cluster achieves.
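The four allocation methods can be requested with sbatch resource flags; the mapping below is our reading of the labels (e.g. 1PN9 = one process per node on 9 nodes, 9P1N = nine processes on one node) and is only printed here, not submitted:

```shell
# Print the sbatch invocation corresponding to each allocation method.
echo "1PN9: sbatch -N 9 --ntasks-per-node=1 task1.sh"
echo "3PN9: sbatch -N 9 --ntasks-per-node=3 task1.sh"
echo "1P1N: sbatch -N 1 -n 1 task1.sh"
echo "9P1N: sbatch -N 1 -n 9 task1.sh"
```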


Figure 13: Tachyon runs on the allocated computing nodes

Figure 14: performance speedup with different task allocation methods


6 Further Work
This is only the first half of our task. The aim is to control the virtual machines in a cluster with a batch system, where PXE boot of the slave nodes from a master node is necessary. However, to make SLURM work in this setting, every PXE-booted node should have its own file system rather than the one shared from the master node.


A Removing Virtual Machine


To remove a virtual machine which you no longer need, right-click on it in the Manager's VM list and select Remove from the context menu that comes up. A confirmation window will come up that allows you to select whether the machine should only be removed from the list of machines or whether the files associated with it should also be deleted. The Remove menu item is disabled while a machine is running.[6] In Linux, if a machine needs to be created with a name that has been used before, you may run into trouble. It is caused by trying to register a machine with an already existing UUID. Go to .VirtualBox/VirtualBox.xml and delete the lines containing the name of the machine to be created.


B A Bug in NFS
When NFS shares are to be mounted by the slave nodes at boot, the following error message always comes:

if-up.d/mountnfs[eth0]: lock /var/run/network/mountnfs exist, not mounting

Just replace the latter part of /etc/network/if-up.d/mountnfs with the following code:[7]

...
# Using no != instead of yes = to make sure async nfs mounting is
# the default even without a value in /etc/default/rcS
if [ no != "$ASYNCMOUNTNFS" ]; then
    # Not for loopback!
    [ "$IFACE" != "lo" ] || exit 0

    # Lock around this otherwise insanity may occur
    mkdir /var/run/network 2>/dev/null || true
    if [ -f /var/run/network/mountnfs ]; then
        msg="if-up.d/mountnfs[$IFACE]: lock /var/run/network/mountnfs exist, not mounting"
        log_failure_msg "$msg"
        # Log if /usr/ is mounted
        [ -x /usr/bin/logger ] && /usr/bin/logger -t "if-up.d/mountnfs[$IFACE]" "$msg"
        exit 0
    fi
    touch /var/run/network/mountnfs
    on_exit() {
        # Clean up lock when script exits, even if it is interrupted
        rm -f /var/run/network/mountnfs 2>/dev/null || exit 0
    }
    trap on_exit EXIT # Enable emergency handler
    do_start
elif [ yes = "$FROMINITD" ] ; then
    do_start
fi

This will use a file instead of a directory to lock the action, and files will be cleaned up on boot.
