INTRODUCTION (cont)
• Traditional approaches
– Each process writes to a separate file
• Often requires an additional post-processing step
• Without post-processing, restarts must use the same number of processors
– Result sent to a master processor, which collects results and writes out to disk
– Each processor calculates its position in the file and writes individually

INTRODUCTION (cont)
• What is MPI-I/O?
– MPI-I/O is a set of extensions to the original MPI standard
– This is an interface specification: it does NOT give implementation specifics
– It provides routines for file manipulation and data access
– Calls to MPI-I/O routines are portable across a large number of architectures
Additional argument for a parallel I/O standard
• Standard UNIX I/O is not portable
– Endianness becomes a serious problem across different machines
• Writing wrappers to perform byte conversion is tedious

Simple parallel I/O
[Figure: each process writes from its own memory to a separate file]
• Multiple independent files are written
– Same as the parallel I/O allowed under OpenMP
– May still be able to use sequential I/O libraries
– Significant increases in throughput are possible
• Drawbacks are potentially serious
– Now have multiple files to keep track of (concatenation may not be an option)
– May be non-portable if the application reading these files must use the same number of processes
Simple parallel I/O (no MPI calls)
• I/O operations are completely independent across the processes
• Append the rank to each file name to specify the different files
• Need to be very careful about performance – individual files may be too small for good throughput

#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}

Simple parallel I/O (using MPI calls)
• Rework the previous example to use MPI calls
• Note the file pointer has been replaced by a variable of type MPI_File
• Under MPI-I/O, open, write (read) and close statements are provided
• Note MPI_COMM_SELF denotes a communicator over the local process

#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    MPI_File myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &myfile);
    MPI_File_write(myfile, buf, BUFSIZE, MPI_INT,
                   MPI_STATUS_IGNORE);
    MPI_File_close(&myfile);
    MPI_Finalize();
    return 0;
}
MPI_File_open arguments
• MPI_File_open(comm, filename, accessmode, info, filehandle, ierr)
– comm – choose immediately whether file access is collective or local
• MPI_COMM_WORLD or MPI_COMM_SELF
– Access mode is specified by or'ing flags together, e.g. MPI_MODE_CREATE | MPI_MODE_WRONLY
• Same as Unix open
– The file handle is passed back from this call to be used later in MPI_File_write

MPI_File_write
• MPI_File_write(filehandle, buf, count, datatype, status, ierr)
– Very similar to the message-passing interface
• Imagine the file handle as providing the destination
• "Address, count, datatype" interface
– Specification of non-contiguous writes would be done via a user-defined datatype
– MPI_STATUS_IGNORE can be passed in the status field
• Informs the system not to fill the field since the user will ignore it
• May slightly improve I/O performance when not needed
Parallel I/O to a single file using MPI calls
• Rework the previous example to write to a single file
• Write is now collective
– MPI_COMM_SELF has been replaced by MPI_COMM_WORLD
• All processes agree on a collective name for the file, "testfile"
• Access to given parts of the file is specifically controlled
– MPI_File_set_view
– Shift displacements according to the local rank

#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    MPI_File thefile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile,
                      myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native",
                      MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT,
                   MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}

File view
• A file view defines which portion of a file is "visible" to a given process
– On first opening a file the entire file is visible
– Data is described at the byte level initially
• MPI_File_set_view provides information to
– enable reading of the data (datatypes)
– specify which parts of the file should be skipped
MPI_File_set_view
• MPI_File_set_view(filehandle, disp, etype, filetype, datarep, info, ierr)
– disp controls the BYTE offset from the beginning of the file
– The displacement is of type MPI_Offset, larger than a normal MPI_INT, to allow 64-bit addressing (byte offsets can easily exceed the 32-bit limit)
– etype is the datatype of the buffer
– filetype is the corresponding datatype in the file, which must be either etype or derived from it
– datarep
• native: data is stored in the file as in memory
• internal: implementation-specific format that may provide a level of portability
• external32: 32-bit big-endian IEEE format (defined for all MPI implementations)
– Only use if portability is required (conversion may be necessary)

Definitions
• Displacement – file position from the beginning of a file
• etype – unit of data access
• filetype – template for accessing the file
• view – current set of data accessible by a process; repetitions of the filetype pattern define a view
• offset – position relative to the current view
etype and filetype in action
• Suppose we have a buffer with an etype of MPI_INT
• The filetype is defined as 2 MPI_INTs followed by a hole of 4 MPI_INTs (extent = 6)

MPI_File_set_view(fh, 5*sizeof(int), etype, filetype, "native", MPI_INFO_NULL);
MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);

[Figure: file laid out as repetitions of the 6-int filetype pattern; etype = MPI_INT]
Examples
• Check how much data has been read by using the status handler
• Pass it to MPI_Get_count
– Determines the number of datatypes that were read
• If the number of read items is less than expected then (hopefully) EOF has been reached

int myrank, numprocs, bufsize, count;
int *buf;
MPI_Status status;
MPI_Offset filesize;
MPI_File thefile;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

MPI_File_open(MPI_COMM_WORLD, "testfile",
              MPI_MODE_RDONLY, MPI_INFO_NULL,
              &thefile);
MPI_File_get_size(thefile, &filesize);
filesize = filesize / sizeof(int);
bufsize = filesize / numprocs + 1;
buf = (int *) malloc(bufsize * sizeof(int));
MPI_File_set_view(thefile,
                  myrank * bufsize * sizeof(int),
                  MPI_INT, MPI_INT, "native",
                  MPI_INFO_NULL);
MPI_File_read(thefile, buf, bufsize, MPI_INT,
              &status);
MPI_Get_count(&status, MPI_INT, &count);
printf("process %d read %d ints\n", myrank, count);
MPI_File_close(&thefile);
MPI_Finalize();
return 0;
}

Explicit offsets
• With individual file pointers, a seek must be performed to arrive at the correct region of data
• Explicit offset functions don't use an individual file pointer
– The file offset is passed directly as an argument to the function
– A seek is not required
– MPI_File_read_at / MPI_File_write_at
– Must use this version in a multithreaded environment

DiSCoV, 26 April 2007
Revised code for explicit offsets
• No need to apply seeks
• No specific movement of the file pointer; instead the offset is passed as an argument
• Remember offsets must be of kind MPI_OFFSET_KIND
• Same issue here with precalculating the offset and resolving it into a variable with the appropriate typing

PROGRAM main
  include 'mpif.h'
  integer FILESIZE, MAX_BUFSIZE, INTSIZE
  parameter (FILESIZE=1048576, MAX_BUFSIZE=1048576)
  parameter (INTSIZE=4)
  integer buf(MAX_BUFSIZE), rank, ierr, fh, nprocs, nints
  integer status(MPI_STATUS_SIZE), count
  integer(kind=MPI_OFFSET_KIND) offset

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_File_open(MPI_COMM_WORLD, 'testfile', &
       MPI_MODE_RDONLY, &
       MPI_INFO_NULL, fh, ierr)
  nints = FILESIZE/(nprocs*INTSIZE)
  offset = rank*nints*INTSIZE
  call MPI_File_read_at(fh, offset, buf, nints, &
       MPI_INTEGER, status, ierr)
  call MPI_Get_count(status, MPI_INTEGER, count, ierr)
  print *, 'process ', rank, ' read ', count, ' ints'
  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
END PROGRAM main

Dealing with multidimensional arrays
• Storage formats differ for C versus Fortran
– Row major versus column major
• MPI_Type_create_darray and MPI_Type_create_subarray are used to specify derived datatypes
– These datatypes can then be used to resolve local regions within a linearized global array
– Specifically they deal with the noncontiguous nature of the domain decomposition
• See the Using MPI-2 book for more details
Default Hints (Info values)
• access_style (comma-separated list of strings)
– read_once, write_once, read_mostly, write_mostly, sequential, reverse_sequential, and random
• collective_buffering (boolean)
• cb_buffer_size (integer)
• cb_nodes (integer)
• filename (string)
• file_perm (string)
• io_node_list (comma-separated list of strings)
• nb_proc (integer)
• num_io_nodes (integer)
• striping_factor (integer)
• striping_unit (integer)

API

Positioning        Synchronism   Non-collective          Collective
-----------------  ------------  ----------------------  ---------------------------
Explicit offsets   Blocking      MPI_FILE_READ_AT        MPI_FILE_READ_AT_ALL
                   Non-blocking  MPI_FILE_IREAD_AT       MPI_FILE_READ_AT_ALL_BEGIN
                                                         MPI_FILE_READ_AT_ALL_END
Individual file    Blocking      MPI_FILE_READ           MPI_FILE_READ_ALL
pointers           Non-blocking  MPI_FILE_IREAD          MPI_FILE_READ_ALL_BEGIN
                                                         MPI_FILE_READ_ALL_END
Shared file        Blocking      MPI_FILE_READ_SHARED    MPI_FILE_READ_ORDERED
pointers           Non-blocking  MPI_FILE_IREAD_SHARED   MPI_FILE_READ_ORDERED_BEGIN
                                                         MPI_FILE_READ_ORDERED_END