
COSC 6385 Computer Architecture - Homework

Edgar Gabriel Spring 2012


Hardware performance counters


A set of special-purpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems
Low overhead compared to software-based methods
Types and meanings of hardware counters vary from one kind of architecture to another due to the variation in hardware organizations

COSC 6385 Computer Architecture Edgar Gabriel

Overflow handling
Generate an overflow signal every time a threshold number of events has been counted
Each counter has to be registered separately
The value of each registered hardware counter is maintained separately
(LONG_)LONG_MAX:
    32 bit: 2,147,483,647
    64 bit: 9,223,372,036,854,775,807
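
To get a feeling for these limits, the following sketch compares a rough event count for the matrix multiply against the two maxima listed above. It assumes that a counter would see about 2*n^3 events (one multiply and one add per innermost loop iteration); the helper names `matmul_flops` and `exceeds_counter` are made up for this illustration and are not part of PAPI. Which events a real counter records, and how wide it is, depends on the processor and the PAPI version.

```c
#include <stdint.h>

/* Rough event estimate (an assumption for illustration): an n x n matrix
   multiply executes n^3 innermost iterations with one multiply and one
   add each. */
uint64_t matmul_flops(uint64_t n)
{
    return 2 * n * n * n;
}

/* Returns 1 if `events` is larger than the maximum value of a signed
   counter of the given width (32 or 64 bit), 0 otherwise. */
int exceeds_counter(uint64_t events, int bits)
{
    uint64_t max = (bits == 32) ? (uint64_t)INT32_MAX : (uint64_t)INT64_MAX;
    return events > max;
}
```

Under this assumption, a matrix of size 1024 yields 2,147,483,648 events, which is one more than the 32 bit maximum, while a matrix of size 512 stays well below it. Both values fit easily into a 64 bit counter.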

overflow_handler(): user-defined function to process overflow events
    The function is called by the PAPI library every time the threshold is reached

overflow_vector: a bit-array that can be processed to determine which event(s) caused the overflow
e.g. using PAPI_get_overflow_event_index()
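
To illustrate what "bit-array" means here, the following sketch walks over the bits of a vector and collects the positions that are set. It is only an illustration of the idea and assumes that bit i corresponds to the i-th registered event; `decode_overflow_vector` is a hypothetical helper, not a PAPI function. In real code, PAPI_get_overflow_event_index() should be used for this step.

```c
/* Hypothetical helper (not part of PAPI): stores the positions of all set
   bits of `vector` in idx[0..cap-1] and returns how many were stored.
   Assumption: bit i stands for the i-th registered event. */
int decode_overflow_vector(long long vector, int *idx, int cap)
{
    unsigned long long bits = (unsigned long long)vector;
    int n = 0;

    for (int i = 0; bits != 0 && n < cap; i++, bits >>= 1) {
        if (bits & 1ULL) {
            idx[n++] = i;   /* event i has overflowed */
        }
    }
    return n;
}
```

For a vector with the value 5 (binary 101), the helper reports two entries, positions 0 and 2.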

Software vs. hardware overflow:
If the processor does not support hardware overflow, software emulates it by periodically checking the counter values
Software overflow handling is less accurate and more expensive than hardware handling
Often implemented using a zero-crossing algorithm: the value of the counter is set to the threshold and increased accordingly
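
The periodic checking can be pictured with a small simulation. This is a minimal sketch of the polling idea only, not PAPI's actual implementation and not the zero-crossing algorithm; the type `sw_overflow_t` and both functions are hypothetical names chosen for this example.

```c
/* Hypothetical state for a software-emulated overflow check. */
typedef struct {
    long long threshold;  /* number of events between two overflow signals */
    long long next;       /* counter value at which the next signal is due */
} sw_overflow_t;

void sw_overflow_init(sw_overflow_t *s, long long threshold)
{
    s->threshold = threshold;
    s->next = threshold;
}

/* Called periodically with the current counter value. Returns how many
   overflow signals have become due since the previous call. */
int sw_overflow_poll(sw_overflow_t *s, long long counter)
{
    int fired = 0;

    while (counter >= s->next) {
        fired++;
        s->next += s->threshold;
    }
    return fired;
}
```

With a threshold of 1000, a poll at counter value 1500 reports one overflow, but 500 events after the threshold was actually crossed. This delay is one way to see why the software variant is less accurate than hardware overflow.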

1st Assignment
Rules
Each student should deliver:
    Source code (.c files)
        Please: no .o files and no executables!
    Documentation (.pdf, .doc, .tex or .txt file)
Deliver electronically to gabriel@cs.uh.edu
Expected by Friday, March 9, 11:59pm
In case of questions: ask the TA first; if he doesn't know the answer, he will ask me
Ask early, not the day before the submission is due


About the Project


You are given the source code for a matrix-multiply operation (file hw-matmul.c)
The code contains a trivial implementation of the matrix multiply operation and a blocked implementation
The blocked implementation is called with block sizes of 16, 32, 64 and 128
You can compile the C file, e.g. with
    cc -O3 hw-matmul.c -o hw-matmul
Once you have added the PAPI functions:
    cc -O3 hw-matmul.c -o hw-matmul -I/opt/papi/4.2.0/include -L/opt/papi/4.2.0/lib -lpapi -lperfctr
Run:
    allocate a node (see later in the lecture)
    type: ./hw-matmul <matrix-dimension>
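
For readers who have not seen a blocked matrix multiply before, the sketch below contrasts the two variants. It is not the code from hw-matmul.c; the function names, the row-major layout and the loop order are assumptions made for this illustration, and the provided source file may be organized differently.

```c
#include <stddef.h>

/* Trivial variant: c = a * b for row-major n x n matrices. */
void matmul_trivial(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Blocked variant: same result, but the work is split into bs x bs tiles
   so that the data of one tile can be reused while it is still in cache. */
void matmul_blocked(size_t n, size_t bs, const double *a, const double *b,
                    double *c)
{
    for (size_t i = 0; i < n * n; i++) {
        c[i] = 0.0;
    }
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t imax = (ii + bs < n) ? ii + bs : n;
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t kmax = (kk + bs < n) ? kk + bs : n;
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t jmax = (jj + bs < n) ? jj + bs : n;
                /* multiply one tile of a with one tile of b */
                for (size_t i = ii; i < imax; i++) {
                    for (size_t k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jmax; j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; only the order in which the elements are touched differs, and that order is what the hardware counters in this assignment are meant to make visible.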


Part 1: Instrument the code to use hardware performance counters to determine the behavior of the trivial and of the blocked implementation for different block sizes
The goal is to see how the counter values change with the block size
You will have to provide measurements for matrices of size 512 and 1024
Note that for development purposes you can of course run the code with much smaller matrices, e.g. 64


The hardware performance counters should be based on the PAPI library, and you could monitor the following values:
    L1 and L2 cache misses and/or cache miss rate
    Translation lookaside buffer (TLB) misses
    Stall cycles waiting for various events
    Conditional branch instructions mispredicted
Whether you can access these values will depend on the processor you are actually using!
You will have to add code to handle counter overflow, or convince me that overflow does not occur. If you just ignore this item, you will lose points.
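
A miss rate is derived from two counter values rather than read directly. The sketch below assumes that one counter delivers the number of cache misses and another the number of cache accesses (for example the PAPI presets PAPI_L1_DCM and PAPI_L1_DCA, if the processor offers them); `miss_rate` is a hypothetical helper name.

```c
/* Miss rate = misses / accesses. Returns 0.0 if no access was counted,
   so that an empty measurement does not divide by zero. */
double miss_rate(long long misses, long long accesses)
{
    if (accesses <= 0) {
        return 0.0;
    }
    return (double)misses / (double)accesses;
}
```

For example, 25 misses out of 100 accesses give a miss rate of 0.25.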

Part 2: Run the modified code on the shark cluster.
Generate graphs for 3-5 PAPI hardware counters showing the values for each block size identified in Part 1, separately for both matrix sizes of 512 and 1024.
Comment on your findings on how the parameter values change with the block sizes for each matrix size
Make sure you run your tests multiple times, and document how often you ran them and whether you show the average, minimum, maximum etc.
Please document (you can use PAPI to determine many of these things!):
    Processor type, frequency
    Operating system (as precisely as possible)
    Cache hierarchies and sizes
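
One possible way to condense repeated runs into the minimum, maximum and average mentioned above is sketched here. The type `run_stats` and the function `summarize_runs` are hypothetical names for this example; the counter values are assumed to be collected in an array, one entry per run.

```c
#include <stddef.h>

/* Summary of one counter over several runs. */
typedef struct {
    long long min;
    long long max;
    double    avg;
} run_stats;

/* Computes minimum, maximum and average of values[0..n-1].
   Assumption: n is at least 1. */
run_stats summarize_runs(const long long *values, size_t n)
{
    run_stats s = { values[0], values[0], 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (values[i] < s.min) s.min = values[i];
        if (values[i] > s.max) s.max = values[i];
        sum += (double)values[i];
    }
    s.avg = sum / (double)n;
    return s;
}
```

Reporting all three values, together with the number of runs, makes it easier to judge how stable a measurement is.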

Notes
The PAPI version installed on shark is 4.2.0
On the front-end node you can find tons of examples in C and Fortran on how to use PAPI in /opt/papi/4.2.0/share/examples/ctests, e.g.
    all_events.c -> how to check on a processor whether a counter is available
    low-level.c -> how to use the low-level API of PAPI
    memory.c -> how to extract information about the memory subsystem (e.g. cache sizes)
    overflow_index.c -> how to handle overflow correctly


1st Assignment
The Documentation should contain
(Brief) Problem description
Solution strategy
Results section
    Description of resources used
    Description of measurements performed
    Results (graphs + findings)


1st Assignment
The document should not contain
Replication of the entire source code: that is why you have to deliver the sources
Description of your laptop, ssh implementation used etc.
    Only items contributing towards the results matter!
Screen shots of every single measurement you made
    Actually, no screen shots at all.
The slurm output files


How to use a cluster


A cluster usually consists of a front-end node and compute nodes
You can log in to the front-end node using ssh (from Windows or Linux machines) with the login name and the password assigned to you.
The front-end node is there for editing and compiling, not for running jobs!
    If 40 students ran their jobs on the same processor, everything would stall!
To allocate a node for interactive development:
    studxy@shark:~> salloc -N 1 bash
    studxy@shark:~> squeue
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    489 calc studxy R 0:02 shark08
    studxy@shark:~> ssh shark08

How to use a cluster (II)


Once your code is correct and you would like to do the measurements:
You have to submit a batch job
The command you need is sbatch, e.g.
    sbatch -N 1 ./measurements.sh
Your job goes into a queue and will be executed as soon as a node is available. You can check the status of your job with squeue


How to use a cluster (III)


The output of squeue gives you a job-id for your job
Once your job finishes, you will have a file called slurm-<jobid>.out in your home directory, which contains all the output of your printf statements etc.
Note that the batch script used for the job submission (e.g. measurements.sh) has to be executable. This means that after you have downloaded it from the webpage and copied it to shark, you have to type
    chmod +x measurements.sh
Please do not edit the ImageAnalysis.sh file on MS Windows. Windows editors end lines with CR+LF instead of the UNIX LF line ending, and this confuses slurm when reading the file.

One more comment on PAPI


PAPI requires a particular device to work properly and will otherwise generate an error
If you are not sure whether the error message that you get is your fault or PAPI's fault, you can run the command papi_avail on the node.
    If the output says something about PASSING, the PAPI device works properly.
    If papi_avail says FAILED, contact the TA and me, telling us precisely which node (e.g. shark08) the problem occurred on, and we can restart the device.


Notes
PAPI Documentation:
    http://icl.cs.utk.edu/projects/papi/wiki/Main_Page
If you need hints on how to use a UNIX/Linux machine through ssh:
    http://www.cs.uh.edu/~gabriel/courses/cosc4397_s06/ParCo_08_IntroductionUNIX.pdf
How to use a cluster such as shark:
    http://pstl.cs.uh.edu/resources.shtml
