GPFS Installation
Overview
• Plan the installation
Before installing any software, it is important to plan the GPFS installation by choosing
the hardware, deciding which kind of disk connectivity to use (direct attached or
network attached disks), selecting the network capabilities (which depends a lot on
the disk connectivity), and, perhaps most important, verifying that your
application can take advantage of GPFS.
2
Overview (continued)
• Start GPFS
After the nodeset is created, you should start it before defining the disks. Use the
mmstartup command to start the GPFS daemons.
• Disk definition
All disks used by GPFS in a nodeset have to be described in a file, and then this file
has to be passed to the mmcrnsd command. This command gives a name to each
described disk and ensures that all the nodes included in the nodeset are able to
gain access to the disks with their new name.
3
Setup GPFS Environments
• Add the GPFS binary and man page directories to the $PATH and
MANPATH environment variables on all nodes. First create the profile
directory:
mkdir -p /cfmroot/etc/profile.d
• Then create the file /cfmroot/etc/profile.d/mmfs.sh containing these
two lines:
PATH=$PATH:/usr/lpp/mmfs/bin
MANPATH=$MANPATH:/usr/lpp/mmfs/man
• Type:
chmod 755 /cfmroot/etc/profile.d/mmfs.sh
cfmupdatenode -a
cp /cfmroot/etc/profile.d/mmfs.sh /etc/profile.d
. /etc/profile.d/mmfs.sh
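• As a quick sanity check (not part of the lab), you can build the mmfs.sh file in /tmp and confirm that sourcing it extends PATH and MANPATH as intended:

```shell
# Sanity check (not part of the lab): build the mmfs.sh file from the slide
# and confirm that sourcing it really extends PATH and MANPATH. /tmp is used
# here so the check runs anywhere; on the cluster the file lives in
# /cfmroot/etc/profile.d.
cat > /tmp/mmfs.sh <<'EOF'
PATH=$PATH:/usr/lpp/mmfs/bin
MANPATH=$MANPATH:/usr/lpp/mmfs/man
EOF
. /tmp/mmfs.sh
case ":$PATH:" in
    *:/usr/lpp/mmfs/bin:*) echo "PATH ok" ;;
    *)                     echo "PATH missing GPFS bin dir" ;;
esac
```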
4
Install GPFS
• The GPFS install and update files are located on the management
node in the /lab/gpfs directory.
• Extract updates.
mkdir -p /tmp/gpfs/updates
cp -r /lab/gpfs/* /tmp/gpfs
cd /tmp/gpfs/updates
tar zxvf *update.tar.gz
• http://www-1.ibm.com/servers/eserver/clusters/software/gpfs_faq.html
• Lab Note: There are a few patches that should be applied; consult the
FAQ above for the current list. In this lab we will not apply the
patches, to save time.
5
Prepare kernel
• Clean up tree
cd /usr/src/linux
make mrproper
6
Prepare kernel (continued)
• Check the content of the VERSION, PATCHLEVEL, SUBLEVEL, and
EXTRAVERSION variables in the /usr/src/linux/Makefile file
to match the release version of your kernel.
• Edit Makefile
VERSION = 2
PATCHLEVEL = 4
SUBLEVEL = 21
EXTRAVERSION = -27.ELsmp
• Type:
make oldconfig
make dep
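• A quick way to double-check the edit is to reassemble the release string from the Makefile and compare it against uname -r. This small helper is not part of the lab (the function name is ours):

```shell
# Helper (not in the original lab): assemble the kernel release string from
# the VERSION/PATCHLEVEL/SUBLEVEL/EXTRAVERSION variables in a 2.4-style
# kernel Makefile, for comparison against `uname -r`.
kver_from_makefile() {
    awk -F' *= *' '
        $1 == "VERSION"      { v = $2 }
        $1 == "PATCHLEVEL"   { p = $2 }
        $1 == "SUBLEVEL"     { s = $2 }
        $1 == "EXTRAVERSION" { e = $2 }
        END { printf "%s.%s.%s%s\n", v, p, s, e }' "$1"
}
```

On a node you would then verify the match with:
[ "$(kver_from_makefile /usr/src/linux/Makefile)" = "$(uname -r)" ] && echo "Makefile matches"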
7
Build the GPFS open source portability layer
• You have to build the GPFS open source portability layer manually on one
node (in our case, the management node), then copy them through all nodes.
• Below are the steps to build the GPFS open source portability layer. Also, check
the /usr/lpp/mmfs/src/README file for more up-to-date information on
building the GPFS open source portability layer:
export SHARKCLONEROOT=/usr/lpp/mmfs/src
cd /usr/lpp/mmfs/src/config
cp site.mcr.proto site.mcr
Edit the /usr/lpp/mmfs/src/config/site.mcr file, checking the sections that
apply to your Linux distribution and kernel level:
cd ..
make World
make InstallImages
8
Distribute the GPFS portability layer
• Copy the portability layer binaries to the /cfmroot/usr/lpp/mmfs/bin
directory and distribute them to all nodes using the cfmupdatenode
command or your own scripts:
mkdir -p /cfmroot/usr/lpp/mmfs/bin
cd /usr/lpp/mmfs/bin
cp mmfslinux lxtrace tracedev dumpconv /cfmroot/usr/lpp/mmfs/bin
cfmupdatenode -a
9
Creating the GPFS nodes descriptor file
• ssh to node1. All GPFS commands should be run from nodes that will be
running GPFS. The management node will NOT be running GPFS.
ssh node1
• When creating your GPFS cluster, you need to provide a file containing a list of
node descriptors, one per line for each node to be included in the cluster,
including the storage nodes. Each descriptor must be specified in the form:
NodeName:NodeDesignations
• where NodeDesignations is an optional list of node roles, separated
by a dash:
manager|client
quorum|nonquorum
10
Creating the GPFS nodes descriptor file
• Create a file /tmp/gpfs.allnodes with a list of your nodes and
their roles. Ensure there is at least one node with quorum and
manager roles defined. For example:
node1:manager-quorum
node2:manager-quorum
node3:quorum
node4:
• The above file signifies that we have four nodes in our GPFS cluster.
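• For larger clusters, the same file can be generated with a loop. This sketch is not required by the lab; it simply reproduces the four-node example above:

```shell
# Sketch (not required by the lab): generate /tmp/gpfs.allnodes with a loop
# instead of typing it by hand. The roles match the four-node example above.
: > /tmp/gpfs.allnodes
for n in 1 2 3 4; do
    case $n in
        1|2) roles="manager-quorum" ;;  # manager + quorum nodes
        3)   roles="quorum" ;;          # quorum-only node
        *)   roles="" ;;                # plain client node
    esac
    echo "node$n:$roles" >> /tmp/gpfs.allnodes
done
cat /tmp/gpfs.allnodes
```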
11
Defining the GPFS cluster
• Run the mmcrcluster command to define the GPFS cluster.
• Define node1 as the primary and node2 as the secondary cluster configuration
data server, with ssh as the remote shell command and scp as the remote file
copy command.
• For example:
mmcrcluster -p node1 -s node2 -n /tmp/gpfs.allnodes -r /usr/bin/ssh -R
/usr/bin/scp
• After creating the cluster definitions, you can see the definitions using the
mmlscluster command. Type:
mmlscluster
12
Starting GPFS
• After creating the GPFS cluster, you can start the GPFS services on
every node in the cluster by issuing the mmstartup command with
the -a parameter. The -a parameter will start GPFS on all nodes in the
cluster.
• Type:
mmstartup -a
13
Prepare Disks (Skip)
• For Fiber disks, create arrays, LUNs, and mappings
• Use CSM and GPFS Redbook as a guide
• Use GPFS documentation
• Use DS4xxx (FAStT) documentation
• Lab Note: We were unable to obtain DS4xxx controllers and disk.
• For each disk to be used for GPFS on each node use fdisk to remove
any partitions. NOTE: We will be using disk /dev/hdc on node1 through
node4.
For Example:
ssh node1
fdisk /dev/hdc
14
Disk definitions
• A GPFS cluster with NSD network attached servers means that all
access to the disks, and all replication, goes through one or two storage
attached servers (also known as storage nodes). If your cluster has an
internal network segment, this segment will be used for this purpose.
• If a disk is defined with only one NSD server and that server fails,
the disk becomes unavailable to GPFS. If the disk is defined with two
NSD servers, GPFS automatically transfers the I/O requests to the
backup server.
• Lab Note: The four nodes in your cluster (e.g. node1 - node4) each
contain a single 40GB drive (/dev/hdc). You will use this as your
GPFS storage.
15
Creating Network Shared Disks (NSDs)
• You will need to create a descriptor file before creating your NSDs.
This file should contain information about each disk that will be an NSD,
and should have the following syntax:
DeviceName:PrimaryNSDServer:SecondaryNSDServer:DiskUsage:FailureGroup
DeviceName The real device name of the external storage partition (such as /dev/hdc).
PrimaryServer The host name of the server that the disk is attached to. Remember, you must
always use the node names defined in the cluster definitions.
SecondaryServer The host name of the NSD server that takes over I/O for this disk if the primary
server fails. This field may be left empty.
DiskUsage The kind of information to be stored on this disk. The valid values
are data, metadata, and dataAndMetadata (default).
FailureGroup An integer value (0 to 4000) that identifies the failure group to which this disk
belongs. All disks with a common point of failure must belong to the same
failure group. The value -1 indicates that the disk has no common point of failure
with any other disk in the file system. GPFS uses the failure group information
to assure that no two replicas of data or metadata are placed in the same group
and thereby become unavailable due to a single failure. When this field is not
specified, GPFS assigns a failure group (higher than 4000) automatically to
each disk.
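• Field-count mistakes in the descriptor file are easy to make. The following small helper (not part of the lab; the function name is ours) sanity-checks a descriptor file before you pass it to mmcrnsd:

```shell
# Helper (not part of the lab): check that every line of an NSD descriptor
# file has the five expected colon-separated fields and a FailureGroup
# between -1 and 4000.
check_descfile() {
    awk -F: '
        NF != 5 { printf "line %d: expected 5 fields, got %d\n", NR, NF; bad = 1 }
        NF == 5 && ($5 < -1 || $5 > 4000) {
            printf "line %d: bad failure group %s\n", NR, $5; bad = 1
        }
        END { exit bad }' "$1"
}
```

Usage: check_descfile /tmp/descfile && echo "descriptor file looks sane"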
16
Creating Network Shared Disks (NSDs)
• Create a new file /tmp/descfile. For example:
/dev/hdc:node1::dataAndMetadata:-1
/dev/hdc:node2::dataAndMetadata:-1
/dev/hdc:node3::dataAndMetadata:-1
/dev/hdc:node4::dataAndMetadata:-1
• After successfully creating the NSDs for the GPFS cluster, mmcrnsd
comments out each original disk device line in the descriptor file and
inserts the GPFS-assigned global name for that disk on the following
line. Run cat /tmp/descfile to see the changes.
cat /tmp/descfile
• You can see the new device names by using the mmlsnsd command.
mmlsnsd
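• To make the rewrite concrete, this is roughly what /tmp/descfile looks like afterwards. It is an illustration only; the gpfs1nsd-style names are assumptions, since GPFS assigns the actual global names:

```shell
# Illustration only (not output captured from the lab): after mmcrnsd runs,
# each original line is commented out and the GPFS-assigned global name is
# inserted on the following line. The gpfs1nsd/gpfs2nsd names are assumed.
cat <<'EOF' > /tmp/descfile.after
# /dev/hdc:node1::dataAndMetadata:-1
gpfs1nsd:::dataAndMetadata:-1
# /dev/hdc:node2::dataAndMetadata:-1
gpfs2nsd:::dataAndMetadata:-1
EOF
cat /tmp/descfile.after
```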
17
Creating the GPFS file system
• Once you have your NSDs ready, you can create the GPFS file
system. In order to create the file system, you will use the mmcrfs
command, where you must define the following attributes in this order:
– The mount point.
– The name of the device for the file system.
– The descriptor file (-F).
• Type:
mmcrfs /gpfs1 /dev/gpfs1 -F /tmp/descfile -A yes -B 256K -n 4 -v no
• Mount filesystems, exit node1 and type from the mgmt1 node:
dsh -a mount -a
• Validate with df. You should have a single 156GB filesystem spanning
4 disks in 4 nodes available to all nodes.
dsh -a df
18
Removing GPFS (Skip)
• It is often necessary to completely remove GPFS and start over. The
most common cause is SSH or DNS setup issues that cause distributed
GPFS commands to fail. Cleanup can be difficult.
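• The lab gives no commands for cleanup, but a typical teardown, sketched from the GPFS manuals, looks like the dry-run script below. The command order and options are assumptions and vary by GPFS release; review each step against your documentation before running it for real:

```shell
# Hedged sketch, not from the lab: the usual GPFS teardown order per the
# manuals (exact options vary by release). RUN defaults to echo, so this is
# a dry run; set RUN="" to actually execute on the cluster.
RUN=${RUN:-echo}
gpfs_teardown() {
    $RUN dsh -a umount /gpfs1        # 1. unmount the file system everywhere
    $RUN mmdelfs gpfs1               # 2. delete the GPFS file system
    $RUN mmdelnsd -F /tmp/descfile   # 3. remove the NSD definitions
    $RUN mmshutdown -a               # 4. stop the GPFS daemons on all nodes
}
gpfs_teardown
```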
19
Authentication
• HPC clusters require a global authentication solution enabling all
nodes to view all users with the same properties. Create a cluster
user on the management node:
useradd bob
20
Authentication (continued)
• Back up the existing /etc/passwd and /etc/group files first. If for any
reason /etc/passwd gets corrupted, you will be unable to log in even
as root, and a reboot into single-user mode will be required to recover
the backup.
dsh -a cp /etc/passwd /etc/passwd.SAVE
dsh -a cp /etc/group /etc/group.SAVE
• Verify
dsh -a grep (username) /etc/passwd. For example:
dsh -a grep bob /etc/passwd (check the output)
• Generate SSH keys for each cluster user. (root is NOT a cluster
user). rsh clusters may need to create a .rhosts file per user.
21
File Systems
• Like authentication, HPC clusters also require a global file system solution
enabling all nodes to view the same files with the same properties.
GPFS is usually not required for user, application, and library directories. GPFS is
best suited for data directories.
• To set up NFS, you must first export the /home and /usr/local file systems
from your management node. Append the following lines to your
/etc/exports file:
/home *(rw,no_root_squash,sync)
/usr/local *(rw,no_root_squash,sync)
• Restart NFS.
service nfs restart
22
File Systems (continued)
• Verify
dsh -a ls -l /home | grep (user name you added, e.g. bob)
For example:
dsh -a ls -l /home | grep bob
• This verification checks that both the file systems and authentication
are working properly. Your dsh output should have listed the
/home/username directory for your cluster user, AND the user should
own the directory.
23
MPICH-IP
• MPICH is a freely available, portable implementation of MPI, the standard
for message-passing libraries, that runs over IP.
su - bob
mkdir ~/bench/
cp /lab/hpc/mpiiotest.tgz ~/bench/
cd ~/bench/
tar zxvf mpiiotest.tgz
• Build mpiiotest
export MPICH=/usr/local/mpich/1.2.7/ip/i686/up/gnu/ssh
export PATH=$MPICH/bin:$PATH
cd ~/bench/
make clean
make
25
mpiiotest (continued)
• Set up the user's environment:
ssh node1
cd ~/bench
export MPICH=/usr/local/mpich/1.2.7/ip/i686/up/gnu/ssh
export PATH=$MPICH/bin:$PATH
26
mpiiotest (continued)
• First, mpiiotest creates the file in parallel. Each red band represents the
write progress of the corresponding process; when the bars are completely
red, the file has been written.
• Next, mpiiotest reads the created file. Each blue band represents the
read progress of the corresponding process; when the bars are completely
blue, the file has been read.
27
mpiiotest (continued)
• The performance of any filesystem is affected by the blocksize used
by that filesystem versus the blocksize that the application is using. Since
the GPFS filesystem was setup with a 256K blocksize, the optimal
blocksize for this test should be 256K. Test this by trying a couple of
different blocksizes, recording the total read and write performance for
each run.
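• As a rough illustration (not part of the lab), dd can show the block-size effect before you rerun mpiiotest. TARGET defaults to /tmp so the sketch runs anywhere; point it at /gpfs1 on the cluster, and use a much larger count for a real measurement:

```shell
# Rough illustration (not part of the lab) of writing the same data with
# different block sizes using dd. TARGET defaults to /tmp so this runs
# anywhere; set TARGET=/gpfs1 on the cluster. COUNT is kept tiny here.
TARGET=${TARGET:-/tmp}
COUNT=${COUNT:-4}
for bs in 64K 256K 1M; do
    # dd reports elapsed time and throughput on stderr for each run
    dd if=/dev/zero of="$TARGET/ddtest.$bs" bs=$bs count=$COUNT
done
ls -l "$TARGET"/ddtest.*
```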
28
Modify GPFS Block Size
• If time permits, modify the blocksize of the GPFS filesystem and rerun
the mpiiotest benchmark with the same three blocksizes used above.
• Follow the steps above to rerun the mpiiotest benchmark with the
three blocksizes (pages 27-30)
29