
MPI IO

Timothy H. Kaiser, Ph.D. tkaiser@sdsc.edu

Purpose
Introduce Parallel IO
Introduce MPI-IO
Not give an exhaustive survey
Explain why you would want to use it
Explain why you wouldn't want to use it
Give a nontrivial and useful example

References
http://www.nersc.gov/nusers/resources/software/libs/io/mpiio.php

Parallel I/O for High Performance Computing, John May

Using MPI-2: Advanced Features of the Message Passing Interface, William Gropp, Ewing Lusk and Rajeev Thakur

What & Why of parallel IO


Same motivation as going parallel initially:
You have lots of data
You want to do things fast
Parallel IO will (hopefully) enable you to move large amounts of data to/from disk quickly

What & Why of parallel IO


Parallel implies that some number (or all) of your processors (simultaneously) participate in an IO operation
Good parallel IO shows speedup as you add processors
I write about 300 Mbytes/second, others faster

A Motivating Example

Earthquake Model E3d
Finite difference simulation with the grid distributed across N processors

On BlueGene we run at sizes of 7509 x 7478 x 250 = 14,021,250,000 cells, or 56 GBytes per volume; we output 3 velocity volumes per dump

For a restart file we write 14 volumes

Simple (nonMPI) Parallel IO


Each processor dumps its portion of the grid to a separate unique file

char* unique(char *name,int myid) {
    static char unique_str[40];
    int i;
    for(i=0;i<40;i++) unique_str[i]=(char)0;
    sprintf(unique_str,"%s%5.5d",name,myid);
    return unique_str;
}

Simple (nonMPI) Parallel IO

module stuff
contains
  function unique(name,myid)
    character (len=*) name
    character (len=20) unique
    character (len=80) temp
    write(temp,"(a,i5.5)")trim(name),myid
    unique=temp
    return
  end function unique
end module
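To make the per-file approach concrete, here is a small standalone sketch (not from the original deck) in which each rank builds a unique name with unique() and writes a made-up local buffer with ordinary C stdio; the prefix "dump_" and the buffer size are invented for the example.

#include <mpi.h>
#include <stdio.h>

/* unique() as defined on the slide above */
char* unique(char *name,int myid) {
    static char unique_str[40];
    int i;
    for(i=0;i<40;i++) unique_str[i]=(char)0;
    sprintf(unique_str,"%s%5.5d",name,myid);
    return unique_str;
}

int main(int argc, char *argv[]) {
    int myid, i;
    int local[1000];                       /* stand-in for this rank's piece of the grid */
    FILE *fp;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    for(i=0;i<1000;i++) local[i]=myid;     /* fake data */
    fp=fopen(unique("dump_",myid),"wb");   /* e.g. dump_00003 on rank 3 */
    fwrite(local,sizeof(int),1000,fp);     /* every rank writes its own file */
    fclose(fp);
    MPI_Finalize();
    return 0;
}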

Why not just do this?


Might write thousands of files
Could be very slow
Output is dependent on the number of processors

We might want the data in a single file

MPI-IO to the rescue

MPI has over 55 calls related to file input and output
Available in most modern MPI libraries
Can produce exceptional results
Supports striping: a collection of distributed files looks like one
We will look at outputs to a single file
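As a minimal sketch of the single-file idea (not taken from the deck), each rank below writes its own block of integers into one shared file at a rank-based byte offset; the file name and block size are arbitrary choices for the example.

#include <mpi.h>

#define N 1000                 /* integers per rank, chosen for the example */

int main(int argc, char *argv[]) {
    int myid, i, buf[N];
    MPI_File fh;
    MPI_Status status;
    MPI_Offset offset;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    for(i=0;i<N;i++) buf[i]=myid;          /* fake data */
    MPI_File_open(MPI_COMM_WORLD,"single_file.dat",
                  MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL,&fh);
    /* rank k writes at byte offset k*N*sizeof(int): one file, no overlap */
    offset=(MPI_Offset)myid*N*sizeof(int);
    MPI_File_write_at_all(fh,offset,buf,N,MPI_INT,&status);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}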

Why not?
Some functionality might not be available (3d data types)
More likely to have/introduce bugs:
Memory leak
File system overload
Just hangs

Why not?
More complex than normal output
Need support from the file system for good performance

Have seen 200 bytes/second, NOT Megabytes
Have run out of file locks

Our Real World Example


We have a 3d volume of some data v distributed across N processors
The size and distribution are input and not the same on each processor
We are outputting some function of v, V=f(v)
Each processor writes its values to a common file

Typical Small Case


Grid size 472 x 250 x 242 on 16 processors
Color shows processor number

vista --raw 472 250 242 --minmax -1 15 --skip 36 \ -x 640 -y 480 --outformat png --swapbytes -r 1 0.53 0.51 \ --fov 30 -g 0.9 0.9 0.9 1.0 \ -a 0.002 --opacity 0.01 volume1.0010.3DSMPI

Special Considerations
We are calculating our output on the fly
Create a buffer
Fill the buffer and write
Different processors will have a different number of writes
We want to use a collective write operation
Each process must call the collective write the same number of times

Special Considerations
Each process must determine how many writes it needs to do
Some processors might call write with no data
The total number of writes is the max, as sketched below
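A minimal sketch of that bookkeeping (written for this writeup): my_writes is a hypothetical count of the real writes a rank performs; every rank agrees on the maximum and then pads with zero-length collective writes.

#include <mpi.h>

/* Pad with empty collective writes so every rank calls       */
/* MPI_File_write_at_all the same number of times.            */
void pad_collective_writes(MPI_File fh, int my_writes)
{
    int max_writes;
    MPI_Status status;
    MPI_Allreduce(&my_writes,&max_writes,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD);
    while(my_writes < max_writes){
        /* zero-count write: participates in the collective but moves no data */
        MPI_File_write_at_all(fh,(MPI_Offset)0,(void *)0,0,MPI_INT,&status);
        my_writes++;
    }
}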

Procedure #1
Allocate a temporary output buffer
Open the file
Set the view of the file to the beginning
Process 0 writes the file header (36 bytes)

Procedure #2
Create a description of how the data is distributed
A defined data type: the hardest part of the whole process
Set the view of the file to this description
Determine how many writes are needed and AllReduce into do_call_max

Procedure #3
Loop over the grid
Fill buffer
If buffer is full: write it, adjust offset, do_call_max=do_call_max-1
Call write with no data until do_call_max=0

The MPI-IO Routines



MPI_File_open(MPI_COMM_WORLD,fname,(MPI_MODE_RDWR|MPI_MODE_CREATE),MPI_INFO_NULL,&fh);
MPI_File_set_view(fh,disp,MPI_INT,filetype,"native",MPI_INFO_NULL);
MPI_File_write_at(fh, 0, header, hl, MPI_INT,&status);
MPI_File_write_at_all(fh, offset, ptr, i2, MPI_INT,&status);
MPI_File_close(&fh);

MPI_File_open
Synopsis: Opens a file
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh)

Input Parameters
comm     communicator (handle)
filename name of file to open (string)
amode    file access mode (integer)
info     info object (handle)

Output Parameters
fh       file handle (handle)
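A minimal usage sketch (file name invented for the example); the open is collective, so every rank in the communicator must make the call.

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_File fh;
    int ierr;
    MPI_Init(&argc,&argv);
    /* collective: all ranks of MPI_COMM_WORLD open (and create) the same file */
    ierr=MPI_File_open(MPI_COMM_WORLD,"example.dat",
                       MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL,&fh);
    if(ierr!=MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD,ierr);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}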

MPI_File_set_view
Synopsis: Sets the file view
int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info)

Input Parameters
fh       file handle (handle)
disp     displacement (nonnegative integer)
etype    elementary datatype (handle)
filetype filetype (handle)
datarep  data representation (string)
info     info object (handle)
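A small sketch (names and sizes invented): the view skips a 12 byte header and sets the etype to MPI_INT, so later offsets are counted in integers rather than bytes.

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_File fh;
    MPI_Status status;
    int myid, val;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_File_open(MPI_COMM_WORLD,"view.dat",
                  MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL,&fh);
    /* displacement of 12 bytes: every rank "moves" past a 3-integer header */
    MPI_File_set_view(fh,(MPI_Offset)12,MPI_INT,MPI_INT,"native",MPI_INFO_NULL);
    val=myid;
    /* offset is now in etype (int) units relative to the displacement */
    MPI_File_write_at(fh,(MPI_Offset)myid,&val,1,MPI_INT,&status);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}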

MPI_File_write_at
Synopsis: Write using explicit offset, not collective
int MPI_File_write_at(MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype, MPI_Status *status)

Input Parameters
fh       file handle (handle)
offset   file offset (nonnegative integer)
buf      initial address of buffer (choice)
count    number of elements in buffer (nonnegative integer)
datatype datatype of each buffer element (handle)

Output Parameters
status   status object (Status)
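Because it is not collective, it suits one-rank jobs such as the header write in this talk's program; a short sketch with a made-up 3-integer header:

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_File fh;
    MPI_Status status;
    int myid;
    int header[3]={100,50,75};          /* made-up grid dimensions */
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_File_open(MPI_COMM_WORLD,"header.dat",
                  MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL,&fh);
    /* only rank 0 calls the non-collective write; the others simply skip it */
    if(myid==0)
        MPI_File_write_at(fh,(MPI_Offset)0,header,3,MPI_INT,&status);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}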

MPI_File_write_at_all
Synopsis: Collective write using explicit offset
int MPI_File_write_at_all(MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype, MPI_Status *status)

Input Parameters
fh       file handle (handle)
offset   file offset (nonnegative integer)
buf      initial address of buffer (choice)
count    number of elements in buffer (nonnegative integer)
datatype datatype of each buffer element (handle)

Output Parameters
status   status object (Status)
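Every rank of the file's communicator must make the call (possibly with a count of 0, as done later in this talk); a short sketch (file name invented) where each rank writes one integer at a rank-based byte offset:

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_File fh;
    MPI_Status status;
    int myid;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_File_open(MPI_COMM_WORLD,"collective.dat",
                  MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL,&fh);
    /* collective: every rank participates, each at its own byte offset */
    MPI_File_write_at_all(fh,(MPI_Offset)(myid*sizeof(int)),&myid,1,MPI_INT,&status);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}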

MPI_File_close

Synopsis: Closes a file
int MPI_File_close(MPI_File *fh)

Input Parameters
fh       file handle (handle)

The Data type Routines


Our preferred routine creates a 3d description

MPI_Type_create_subarray(3,gsizes,lsizes,istarts,MPI_ORDER_C,old_type,new_type);

On some platforms we need to fake it



MPI_Type_contiguous(sx,old_type,&VECT);
MPI_Type_struct(sz,blocklens,indices,old_types,&TWOD);

MPI_Type_commit(&TWOD);

MPI_Type_create_subarray
Synopsis: Creates a datatype describing a subarray of an N dimensional array
int MPI_Type_create_subarray(int ndims, int *array_of_sizes, int *array_of_subsizes, int *array_of_starts, int order, MPI_Datatype oldtype, MPI_Datatype *newtype)

Input Parameters
ndims             number of array dimensions (positive integer)
array_of_sizes    number of elements of type oldtype in each dimension of the full array (array of positive integers)
array_of_subsizes number of elements of type oldtype in each dimension of the subarray (array of positive integers)
array_of_starts   starting coordinates of the subarray in each dimension (array of nonnegative integers)
order             array storage order flag (state)
oldtype           old datatype (handle)

Output Parameters
newtype           new datatype (handle)
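A small sketch (sizes invented, two ranks assumed): a global 4 x 6 array of ints is split into a 4 x 3 block per rank, and the subarray type is then used as the filetype in a view so a collective write lands each block in the right place.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myid, nprocs, i, local[4*3];
    int gsizes[2]={4,6}, lsizes[2]={4,3}, starts[2];
    MPI_Datatype filetype;
    MPI_File fh;
    MPI_Status status;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
    if(nprocs!=2){ if(myid==0) printf("run with 2 ranks\n"); MPI_Finalize(); return 1; }
    starts[0]=0; starts[1]=3*myid;             /* rank 0: columns 0-2, rank 1: columns 3-5 */
    for(i=0;i<4*3;i++) local[i]=myid;          /* fake data */
    MPI_Type_create_subarray(2,gsizes,lsizes,starts,MPI_ORDER_C,MPI_INT,&filetype);
    MPI_Type_commit(&filetype);
    MPI_File_open(MPI_COMM_WORLD,"subarray.dat",
                  MPI_MODE_RDWR|MPI_MODE_CREATE,MPI_INFO_NULL,&fh);
    /* the subarray type as the filetype: each rank sees only its own block */
    MPI_File_set_view(fh,0,MPI_INT,filetype,"native",MPI_INFO_NULL);
    MPI_File_write_all(fh,local,4*3,MPI_INT,&status);
    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}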

MPI_Type_contiguous
Synopsis: Creates a contiguous datatype
int MPI_Type_contiguous( int count,MPI_Datatype old_type,MPI_Datatype *newtype)

Input Parameters
count   replication count (nonnegative integer)
oldtype old datatype (handle)

Output Parameter
newtype new datatype (handle)
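A tiny sketch (row length invented): a contiguous type covering one x-row of sx ints, checked with MPI_Type_size.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int sx=8, size;                          /* row length, arbitrary for the example */
    MPI_Datatype VECT;
    MPI_Init(&argc,&argv);
    MPI_Type_contiguous(sx,MPI_INT,&VECT);   /* sx consecutive ints */
    MPI_Type_commit(&VECT);
    MPI_Type_size(VECT,&size);
    printf("VECT covers %d bytes\n",size);   /* sx * sizeof(int) */
    MPI_Type_free(&VECT);
    MPI_Finalize();
    return 0;
}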

MPI_Type_struct
Synopsis: Creates a struct datatype
int MPI_Type_struct( int count, int blocklens[], MPI_Aint indices[], MPI_Datatype old_types[], MPI_Datatype *newtype )

Input Parameters
count number of blocks (integer) -- also number of entries in arrays array_of_types , array_of_displacements and array_of_blocklengths blocklens number of elements in each block (array) indices byte displacement of each block (array) old_types type of elements in each block (array of handles to datatype objects)

Output Parameter
newtype new datatype (handle)
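A short sketch in the spirit of the "fake it" path (sizes invented): sz copies of a row type placed nx*sizeof(int) bytes apart describe one x-z slice. The sketch uses MPI_Type_create_struct, the newer name for the same call, since MPI_Type_struct is deprecated in recent MPI standards.

#include <mpi.h>
#include <stdio.h>

#define SX 4              /* ints per row in my block (example value)   */
#define SZ 3              /* rows in my block (example value)           */
#define NX 10             /* ints per row in the global array (example) */

int main(int argc, char *argv[]) {
    int i, blocklens[SZ];
    MPI_Aint indices[SZ], lb, extent;
    MPI_Datatype old_types[SZ], VECT, TWOD;
    MPI_Init(&argc,&argv);
    MPI_Type_contiguous(SX,MPI_INT,&VECT);             /* one x-row */
    MPI_Type_commit(&VECT);
    for(i=0;i<SZ;i++){
        blocklens[i]=1;
        old_types[i]=VECT;
        indices[i]=(MPI_Aint)(i*NX*sizeof(int));       /* rows are NX ints apart in the file */
    }
    MPI_Type_create_struct(SZ,blocklens,indices,old_types,&TWOD);
    MPI_Type_commit(&TWOD);
    MPI_Type_get_extent(TWOD,&lb,&extent);
    printf("TWOD extent is %ld bytes\n",(long)extent);
    MPI_Type_free(&VECT);
    MPI_Type_free(&TWOD);
    MPI_Finalize();
    return 0;
}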

Our Program...

MPI_Init(&argc,&argv);
MPI_Comm_rank( MPI_COMM_WORLD, &myid);
MPI_Comm_size( MPI_COMM_WORLD, &numprocs);
MPI_Get_processor_name(name,&resultlen);
printf("process %d running on %s\n",myid,name);
/* we read and broadcast the global grid size (nx,ny,nz) */
if(myid == 0) {
    if(argc != 4){
        printf("the grid size is not on the command line assuming 100 x 50 x 75\n");
        gblsize[0]=100;
        gblsize[1]=50;
        gblsize[2]=75;
    }
    else {
        gblsize[0]=atoi(argv[1]);
        gblsize[1]=atoi(argv[2]);
        gblsize[2]=atoi(argv[3]);
    }
}
MPI_Bcast(gblsize,3,MPI_INT,0,MPI_COMM_WORLD);
/********** a ***********/

/* the routine three takes the number of processors and returns a 3d
   decomposition or topology.  this is simply a factoring of the number
   of processors into 3 integers stored in comp */
three(numprocs,comp);
/* the routine mpDecomposition takes the processor topology and the
   global grid dimensions and maps the grid to the topology.
   mpDecomposition returns the number of cells a processor holds and
   the starting coordinates for its portion of the grid */
if(myid == 0 ) {
    printf("input mpDecomposition %5d%5d%5d%5d%5d%5d\n",gblsize[1],gblsize[2],gblsize[0],
           comp[1], comp[2], comp[0]);
}
mpDecomposition(gblsize[1],gblsize[2],gblsize[0],comp[1],comp[2],comp[0],myid,dist);
printf(" out mpDecomposition %5d%5d%5d%5d%5d%5d%5d\n",myid,dist[0],dist[1],dist[2],
       dist[3],dist[4],dist[5]);
/********** b ***********/

Example Distribution
Processor  Size X  Size Y  Size Z  Start X  Start Y  Start Z
    0        50     200      13       0        0        0
    1        50     200      13      50        0        0
    2        50     200      13       0        0       13
    3        50     200      13      50        0       13
    4        50     200      12       0        0       26
    5        50     200      12      50        0       26
    6        50     200      12       0        0       38
    7        50     200      12      50        0       38

Global size 50 x 200 x 100 on 8 processors

Back to our program...

/* global grid size */
nx=gblsize[0];
ny=gblsize[1];
nz=gblsize[2];
/* amount that i hold */
sx=dist[0];
sy=dist[1];
sz=dist[2];
/* my grid starts here */
x0=dist[3];
y0=dist[4];
z0=dist[5];
/********** c ***********/

/* allocate memory for our volume */
vol=getArrayF3D((long)sy,(long)0,(long)0,
                (long)sz,(long)0,(long)0,
                (long)sx,(long)0,(long)0);
/* fill the volume with numbers 1 to global grid size */
/* the program from which this example was derived, e3d, holds its data
   as a collection of vertical planes.  plane number increases with y.
   that is why we loop on y with the outer most loop. */
k=1+(x0+nx*z0+(nx*nz)*y0);
for (ltmp=0;ltmp<sy;ltmp++) {
    for (mtmp=0;mtmp<sz;mtmp++) {
        for (ntmp=0;ntmp<sx;ntmp++) {
            val=k+ntmp+ mtmp*nx + ltmp*nx*nz;
            if(val > (long)INT_MAX)val=(long)INT_MAX;
            vol[ltmp][mtmp][ntmp]=(int)val;
        }
    }
}
/********** d ***********/

/* create a file name based on the grid size */
for(j=1;j<80;j++) {
    fname[j]=(char)0;
}
sprintf(fname,"%s_%3.3d_%4.4d_%4.4d_%4.4d","mpiio_dat",
        numprocs,gblsize[0],gblsize[1],gblsize[2]);
/* we open the file fname for output, info is NULL */
ierr=MPI_File_open(MPI_COMM_WORLD, fname,(MPI_MODE_RDWR | MPI_MODE_CREATE),
                   MPI_INFO_NULL, &fh);
/* we write a 3 integer header */
hl=3;
/* set the view to the beginning of the file */
ierr=MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native",MPI_INFO_NULL);
/* process 0 writes the header */
if(myid == 0) {
    header[0]=nx;
    header[1]=ny;
    header[2]=nz;
    /* MPI_File_write_at is not a collective so only 0 calls it */
    ierr=MPI_File_write_at(fh, 0, header, hl, MPI_INT,&status);
}
/********** 01 ***********/

/* we create a description of the layout of the data */
/* more on this later */
printf("mysubgrid0 %5d%5d%5d%5d%5d%5d%5d%5d%5d%5d\n",myid,nx,ny,nz,sx,sy,sz,x0,y0,z0);
mysubgrid0(nx, ny, nz, sx, sy, sz, x0, y0, z0, MPI_INT,&disp,&filetype);
/* length of the header */
disp=disp+(4*hl);
/* every processor "moves" past the header */
ierr=MPI_File_set_view(fh, disp, MPI_INT, filetype, "native",MPI_INFO_NULL);
/********** 02 ***********/

/* we are going to create the data on the fly */
/* so we allocate a buffer for it */
t3=MPI_Wtime();
isize=sx*sy*sz;
buf_size=NUM_VALS*sizeof(FLT);
if( isize < NUM_VALS) {
    buf_size=isize*sizeof(FLT);
} else {
    buf_size=NUM_VALS*sizeof(FLT);
}
ptr=(FLT*)malloc(buf_size);
offset=0;
/* find the max and min isize over all processors */
ierr=MPI_Allreduce ( &isize, &max_size, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
ierr=MPI_Allreduce ( &isize, &min_size, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
/********** 03 ***********/

/* find out how many times each processor will dump its buffer */
i=0; i2=0; do_call=0; sample=1;
grid_l=y0+sy;
grid_m=z0+sz;
grid_n=x0+sx;
/* could just do division but that would be too easy */
for(l = y0; l < grid_l; l = l + sample) {
    for(m = z0; m < grid_m; m = m + sample) {
        for(n = x0; n < grid_n; n = n + sample) {
            i++; i2++;
            if(i == isize || i2 == NUM_VALS){
                do_call++;
                i2=0;
            }
        }
    }
}
/* get the maximum number of times any processor will dump its buffer */
ierr= MPI_Allreduce ( &do_call, &do_call_max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
/********** 04 ***********/

/* finally we start to write the data */
i=0; i2=0;
/* we loop over our grid filling the output buffer */
for(l = y0; l < grid_l; l = l + sample) {
    for(m = z0; m < grid_m; m = m + sample) {
        for(n = x0; n < grid_n; n = n + sample) {
            ptr[i2] = getS3D(vol,l, m, n,y0,z0,x0);
            i++; i2++;
/********** 05 ***********/

            /* when we have all our data or the buffer is full we write */
            if(i == isize || i2 == NUM_VALS){
                t5=MPI_Wtime();
                t7++;
                if((isize == max_size) && (max_size == min_size)) {
                    /* as long as every processor has data to write we use the collective version */
                    /* the collective version of the write is MPI_File_write_at_all */
                    ierr=MPI_File_write_at_all(fh, offset, ptr, i2, MPI_INT,&status);
                    do_call_max=do_call_max-1;
                } else {
                    /* if only I have data to write then we use MPI_File_write_at */
                    /*ierr=MPI_File_write_at(fh, offset, ptr, i2, MPI_INT,&status);*/
                    /* Wait!  Why was that line commented out?  Why are we using MPI_File_write_at_all? */
                    /* Answer:  Some versions of MPI work better using MPI_File_write_at_all */
                    /* What happens if some processors are done writing and don't call this? */
                    /* Answer:  See below. */
                    ierr=MPI_File_write_at_all(fh, offset, ptr, i2, MPI_INT,&status);
                    do_call_max=do_call_max-1;
                }
                offset=offset+i2;
                i2=0;
                t6=MPI_Wtime();
                dt[5]=dt[5]+(t6-t5);
            }
}}}
/********** 06 ***********/

/* Here is where we fix the problem of unmatched calls to MPI_File_write_at_all. */
/* If a processor is done with its writes and others still have data to write,   */
/* the done processor just calls MPI_File_write_at_all with 0 values to write.   */
/* All processors call MPI_File_write_at_all the same number of times,           */
/* so everyone is happy.                                                          */
while(do_call_max > 0) {
    ierr=MPI_File_write_at_all(fh, (MPI_Offset)0, (void *)0, 0, MPI_INT,&status);
    do_call_max=do_call_max-1;
}
/* We finally close the file */
ierr=MPI_File_close(&fh);
/********* ierr=MPI_Info_free(&fileinfo); *********/
MPI_Finalize();
exit(0);
/********** 07 ***********/

Our output:

vista --rawtype int --minmax 0 1000000 --skip 12 -x 640 -y 480 --outformat png --fov 30 bonk --raw 100 50 200 -r .5 .25 1.0 -g 0.9 0.9 0.9 1.0 -a 0.002 --opacity 0.01

Source

http://peloton.sdsc.edu/~tkaiser/mpiio/mpiio.c

/* the routine mpDecomposition takes the processor topology (nx,ny,nz)
   and the global grid dimensions (l,m,n) and maps the grid to the
   topology.  mpDecomposition returns the number of cells a processor
   holds, dist[0:2], and the starting coordinates for its portion of
   the grid, dist[3:5] */

void mpDecomposition(int l, int m, int n, int nx, int ny, int nz, int node, int *dist)
{
    int nnode, mnode, rnode;
    int grid_n,grid_n0,grid_m,grid_m0,grid_l,grid_l0;
    /* x decomposition */
    rnode = node%nx;
    mnode = (n%nx);
    nnode = (n/nx);
    grid_n = (rnode < mnode) ? (nnode + 1) : (nnode);
    grid_n0 = rnode*nnode;
    grid_n0 += (rnode < mnode) ? (rnode) : (mnode);
    /* z decomposition */
    rnode = (node%(nx*nz))/nx;
    mnode = (m%nz);
    nnode = (m/nz);
    grid_m = (rnode < mnode) ? (nnode + 1) : (nnode);
    grid_m0 = rnode*nnode;
    grid_m0 += (rnode < mnode) ? (rnode) : (mnode);
    /* y decomposition */
    rnode = node/(nx*nz);
    mnode = (l%ny);
    nnode = (l/ny);
    grid_l = (rnode < mnode) ? (nnode + 1) : (nnode);
    grid_l0 = rnode*nnode;
    grid_l0 += (rnode < mnode) ? (rnode) : (mnode);
    dist[0]=grid_n; dist[3]=grid_n0;
    dist[1]=grid_l; dist[4]=grid_l0;
    dist[2]=grid_m; dist[5]=grid_m0;
}
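To see what mpDecomposition produces, a small non-MPI driver (written for this handout, not part of mpiio.c) can loop over 8 ranks; reading the Example Distribution table as x, y, z extents of 100, 200 and 50 on a 2 x 1 x 4 topology, the driver reproduces that table.

#include <stdio.h>

void mpDecomposition(int l, int m, int n, int nx, int ny, int nz, int node, int *dist);
/* link against the definition above */

int main(void) {
    int node, dist[6];
    printf("node   sx   sy   sz   x0   y0   z0\n");
    for(node=0; node<8; node++){
        /* l = global y, m = global z, n = global x; topology 2 x 1 x 4 in (x,y,z) */
        mpDecomposition(200, 50, 100, 2, 1, 4, node, dist);
        printf("%4d %4d %4d %4d %4d %4d %4d\n",
               node, dist[0], dist[1], dist[2], dist[3], dist[4], dist[5]);
    }
    return 0;
}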

/* we have two versions of mysubgrid0, the routine that creates the
   description of the data layout.  This version builds up the
   description from primitives.  we start with x, then create VECT,
   which is a vector of x values.  we then take a collection of VECTs
   and create a vertical slice, TWOD.  note that the distance between
   each VECT in TWOD is given in indices[i].  we next take a collection
   of vertical slices and create our volume.  again we have the
   distances between the slices given in indices[i] */

#define BSIZE 5000
void mysubgrid0(int nx, int ny, int nz, int sx, int sy, int sz,
                int x0, int y0, int z0,
                MPI_Datatype old_type, MPI_Offset *location, MPI_Datatype *new_type)
{
    MPI_Datatype VECT;
    int blocklens[BSIZE];
    MPI_Aint indices[BSIZE];
    MPI_Datatype old_types[BSIZE];
    MPI_Datatype TWOD;
    int i;
    if(myid == 0)printf("using mysubgrid version 1\n");
    if(sz > BSIZE)mpi_check(-1,"sz > BSIZE, increase BSIZE and recompile");
    ierr=MPI_Type_contiguous(sx,old_type,&VECT);
    ierr=MPI_Type_commit(&VECT);
    for (i=0;i<sz;i++) {
        blocklens[i]=1;
        old_types[i]=VECT;
        indices[i]=i*nx*4;
    }
    ierr=MPI_Type_struct(sz,blocklens,indices,old_types,&TWOD);
    ierr=MPI_Type_commit(&TWOD);
    for (i=0;i<sy;i++) {
        blocklens[i]=1;
        old_types[i]=TWOD;
        indices[i]=i*nx*nz*4;
    }
    ierr=MPI_Type_struct(sy,blocklens,indices,old_types,new_type);
    ierr=MPI_Type_commit(new_type);
    *location=4*(x0+nx*z0+(nx*nz)*y0);
}

/* This one is actually preferred.  it uses a single call to the mpi
   routine MPI_Type_create_subarray with the grid description as input.
   what we get back is a data type that is a 3d strided volume.
   Unfortunately, MPI_Type_create_subarray does not work for 3d arrays
   for some versions of MPI, in particular LAM. */

void mysubgrid0(int nx, int ny, int nz, int sx, int sy, int sz,
                int x0, int y0, int z0,
                MPI_Datatype old_type,
                MPI_Offset *location,
                MPI_Datatype *new_type)
{
    int gsizes[3],lsizes[3],istarts[3];
    gsizes[2]=nx; gsizes[1]=nz; gsizes[0]=ny;
    lsizes[2]=sx; lsizes[1]=sz; lsizes[0]=sy;
    istarts[2]=x0; istarts[1]=z0; istarts[0]=y0;
    if(myid == 0)printf("using mysubgrid version 2\n");
    ierr=MPI_Type_create_subarray(3,gsizes,lsizes,istarts,MPI_ORDER_C,old_type,new_type);
    ierr=MPI_Type_commit(new_type);
    *location=0;
}
