
GRIDDING FOR RADIO ASTRONOMY ON COMMODITY GRAPHICS HARDWARE USING OPENCL

Alexander Ottenhoff
School of Electrical and Electronic Engineering
University of Western Australia

Supervisor: Dr Christopher Harris, Research Associate, International Centre for Radio Astronomy Research

Co-Supervisor: Associate Professor Karen Haines, Western Australian Supercomputer Program

October 2010


16 Arenga Crt, Mount Claremont WA 6010
October 29, 2010

The Dean
Faculty of Engineering, Computing and Mathematics
The University of Western Australia
35 Stirling Highway
CRAWLEY WA 6009

Dear Sir,

I submit to you this dissertation entitled Gridding for Radio Astronomy on Commodity Graphics Hardware using OpenCL in partial fulfilment of the requirements for the award of Bachelor of Engineering.

Yours faithfully,

Alexander Ottenhoff

Abstract

With the emergence of large radio telescope arrays, such as the MWA, ASKAP and SKA, the rate at which data is generated is nearing the limits of what can currently be processed or stored in real time. Since processor clock rates have plateaued, computer hardware manufacturers are pursuing different strategies, such as developing massively parallel architectures, in order to create more powerful processors. A major challenge in high performance computing is the development of parallel programs which can take advantage of these new processors. Due to their extremely high instruction throughput and low power consumption, fully programmable Graphics Processing Units (GPUs) are an ideal target for radio-astronomy applications. This research investigates gridding, a very time-consuming stage of the radio astronomy image synthesis process, and the challenges involved in devising and implementing a parallel gridding kernel optimised for programmable GPUs using OpenCL. A parallel gridding implementation was developed, which outperformed a single-threaded reference gridding program in all but the smallest test cases.



Acknowledgements

I thank my supervisors Christopher Harris and Professor Karen Haines for providing guidance throughout the course of this project. They provided feedback and advice on my work and helped me refine my research and academic writing skills. Thanks to Xenon Technologies for providing the computer used throughout this project. I'd like to acknowledge the technical support staff at WASP, Jason Tan and Khanh Ly, for providing me with access to WASP facilities and setting up the computer used during this project. Thanks to Paul Bourke for providing me with a small CUDA project that got me started in GPU programming. Thanks also to Derek Gerstmann for organising the OpenCL Summer School, where I was able to become familiar with the OpenCL API before starting this project. I'd also like to thank Ankur Sharda and Stefan Westerlund, with whom I shared the Hobbit Room, for offering suggestions and feedback on various ideas. Finally, thanks to my family for supporting me over the course of this project. In particular, my mother for staying up all night proofreading the final version of this document.

Contents

Abstract
Acknowledgements
List of Figures
1 Introduction
2 Background
   2.1 Radio Astronomy
   2.2 Aperture Synthesis
   2.3 Gridding
   2.4 Parallel Processors
   2.5 OpenCL
3 Literature Review
4 Model
   4.1 Scatter
   4.2 Gather
   4.3 Pre-sorted Gather
5 Testing
6 Discussion
   6.1 Work-Group Optimisation
   6.2 Performance Profiling
   6.3 Performance Comparison
7 Conclusion
   7.1 Project Summary
   7.2 Future Consideration
A Original Proposal

List of Figures

2.1 Aperture synthesis data processing pipeline.
2.2 Overview of the gridding operation
2.3 Comparison of CPU and GPU architectures.
   (a) CPU Layout
   (b) GPU Layout
2.4 OpenCL memory hierarchy.
4.1 Gridding with a scatter kernel.
4.2 Gridding with a gather kernel.
4.3 Gridding with a pre-sorted gather kernel.
5.1 Thread topology optimisation
5.2 Performance profile of GPU gridding implementation compared with CPU gridding.
5.3 Performance profile of GPU gridding implementation with sorting running on the CPU.
5.4 CPU and GPU gridding performance for a varying number of visibilities
5.5 Thread optimisation for a range of convolution filter widths
5.6 CPU and GPU gridding performance for a varying convolution filter width

Chapter 1

Introduction

Astronomers can gain a better understanding of the creation and early evolution of the universe, test theories and attempt to answer many questions in physics by producing images from radio waves emitted by distant celestial entities. With the construction of vast radio telescope arrays, such as the Murchison Wide-field Array (MWA), Australian SKA Pathfinder (ASKAP) and Square Kilometre Array (SKA), many engineering challenges have to be overcome. ASKAP alone will generate data at a rate of 40 Gb/s, producing over 12 PB in a single month [6], and the SKA will produce several orders of magnitude more, so data processing and storage are major issues. As we reach the limit of how fast we can make single core CPUs run, we need to look to parallel processors such as multi-core CPUs, GPUs and digital signal processors to process this vast amount of data. One of the biggest problems limiting the popularity of parallel processors has been the lack of a standard language that runs on a wide variety of hardware. To address this, the Khronos Group produced the OpenCL standard.


OpenCL is an open standard for heterogeneous parallel programming [17]. One of the major advantages of code written in OpenCL is that it allows programmers to write software capable of running on any device with an OpenCL driver, eliminating the need to rewrite large amounts of code for each vendor's hardware. This partially solves the issue of vendor lock-in, a major problem in general purpose GPU (GPGPU) programming up until now, where, due to the lack of standardisation, software is often restricted to running on a series of architectures produced by a single company.

In this project I aim to develop an efficient parallel algorithm in OpenCL for the gridding stage of radio-interferometric imaging, which has traditionally been the most time-consuming stage of the imaging process [30]. Due to the large amount of data that will be generated by the next generation of radio telescopes, the rate at which data can be processed in real time may be a serious performance bottleneck. Since a cluster of GPUs with computational performance equal to a traditional supercomputer consumes a fraction of the energy, an efficient OpenCL implementation would be a significantly less expensive option. I will primarily target GPU architectures, in particular the NVIDIA Tesla C1060, although I will also attempt to benchmark and compare performance on several different devices.

Chapter 2, the background, will explain the theory behind radio astronomy with a focus on the aperture synthesis process. It will also provide an overview of GPU architectures, NVIDIA's Tesla series of graphics cards and OpenCL. In Chapter 3, the literature review, previous implementations of gridding on other heterogeneous parallel architectures will be discussed. The model in Chapter 4 will provide a detailed explanation of gridding and detail several ways of adapting it to GPU hardware. Chapter 5 will outline and present the results of various tests performed in order to determine the parameters which result in the best performance of the GPU-based gridding algorithm. The results of these tests will be discussed in Chapter 6, as well as other discoveries made over the course of this project. Finally, Chapter 7 will summarise the important results of this work and outline possible areas for future research.


Chapter 2

Background

This chapter will explain background information on various topics that are useful for understanding this project. It will discuss the theory behind radio astronomy, as well as what scientists in this field can discover. An overview of the aperture synthesis process, used to generate two-dimensional images from multiple radio telescopes, will be given. General Graphics Processing Unit (GPU) design will then be outlined, with a particular focus on the NVIDIA GT200 architecture used in the Tesla C1060. OpenCL will also be discussed, explaining the features used in this project and why it was chosen over other programming languages.

2.1 Radio Astronomy

Radio astronomy is the branch of astronomy which focuses on observing electromagnetic waves emitted by celestial entities lying in the radio band of the electromagnetic spectrum. While the visible light spectrum observed by optical telescopes


can pass through the atmosphere with only a small amount of atmospheric distortion, radio waves with wavelengths ranging from 3 cm to around 30 m are not distorted at all by the Earth's atmosphere. Also, unlike visible waves, which are mostly produced by hot thermal sources such as stars, radio waves can originate from a wide variety of sources including gas clouds, pulsars and even background radiation left over from the Big Bang [31]. It is also possible to observe radio waves through clouds as well as during the day, when the amount of light emitted by the Sun vastly exceeds that which reaches Earth from distant sources, allowing radio telescopes to operate when optical astronomy is impossible.

Due to the long wavelengths of the signals being measured, radio telescopes are generally far larger than their optical counterparts. For a single dish style radio telescope, the angular resolution R of the image generated from a signal of wavelength λ is related to the diameter of the dish D.

    R = λ / D                                    (2.1)
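As an illustrative example (the numbers here are my own, not taken from this dissertation): a 1 GHz signal has a wavelength of λ = c/f ≈ 0.3 m, so even a 300 m dish would only achieve

    R = 0.3 / 300 = 1 × 10⁻³ rad ≈ 3.4 arcminutes.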

Since R is a measure of the finest object a telescope can detect, a dish designed to create detailed images of signals below 1 GHz would need to be several hundred metres in diameter. Constructing a dish of this size is both difficult and extremely expensive and, for wavelengths longer than around 3 metres, the diameter required for a good resolution can surpass what can realistically be constructed. A technique called radio interferometry makes it possible to combine multiple telescopes to make observations with a finer angular resolution than each telescope could achieve individually. When using this technique, no single telescope measures the

brightness of a frequency in the sky directly. Instead, each pair of telescopes in the array measures a component of the brightness, and this data is combined in a process known as aperture synthesis.

2.2 Aperture Synthesis

Aperture synthesis works by combining signals from multiple telescopes to produce an image with a resolution approximately equal to that of a single dish with a diameter equal to the maximum distance between antennas. The process transforms the signals measured by each pair of telescopes in an array into a two-dimensional image, and consists of the several stages shown in Figure 2.1.

The first stage of this process involves taking the signals from each pair of antennas and cross-correlating them to form a baseline. The relationship between the number of antennas in an array, a, and the total number of baselines, b (including the baseline formed by correlating each antenna with itself), is shown in Equation 2.2.

    b = a(a - 1)/2 + a                           (2.2)
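As a quick numerical check of Equation 2.2 (my own example, not from the original text): an array of a = 6 antennas yields

    b = 6(6 - 1)/2 + 6 = 15 + 6 = 21

baselines: 15 from the distinct antenna pairs plus 6 from each antenna correlated with itself.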

These signals are combined to produce a set of complex visibilities, one for each baseline, frequency channel and period of time. The complex visibilities for each baseline are created by cross-correlating sampled voltages from a pair of telescopes.


The next stage is to calibrate the visibilities to remove noise introduced by atmospheric interference and small irregularities in the shape and positioning of the radio dishes. The calibrated visibilities can then be used to generate a two-dimensional image by converting them to the spatial domain. The visibilities are first mapped to a regular two-dimensional grid in a process referred to as gridding. This is followed by applying the two-dimensional inverse Fast Fourier Transform (FFT) to the gridded visibilities, converting them to the spatial domain. The output of this operation is known as the dirty image, because it still contains artifacts introduced during the aperture synthesis process.

In order to remove these synthesis artifacts, the dirty image is finally processed with a deconvolution technique. Two common algorithms used to perform this operation are the CLEAN algorithm [2] and the Maximum Entropy Method (MEM) [25]. The CLEAN algorithm works by finding and removing point sources in the dirty image, and then adding them back to the image after removing the associated side lobe noise. The MEM process involves defining a function to describe the entropy of an image and then searching for the maximum of this function. The result of the deconvolution process is known as a clean image. Several radio astronomy software packages exist which are able to perform the aperture synthesis process, including Miriad [21], which is used in this project. Of the stages used in aperture synthesis, gridding is the focus of this research and will be discussed in more depth.

[Figure 2.1 diagram: Correlation, Calibration, Gridding, Fast Fourier Transform, Deconvolution, Imaging]

Figure 2.1: Aperture synthesis data processing pipeline [30]. Shown is an overview of the major software stages involved in taking sampled radio wave data from a pair of radio telescopes and generating an image. The signals from a pair of telescopes are correlated with each other to provide a stream of visibilities. These visibilities are then calibrated to correct for irregularities in the telescope's dish and small errors in the telescope's alignment, and to account for some atmospheric interference. After being calibrated, these visibilities are converted into a two-dimensional image through a three stage process consisting of interpolation to a regular grid, transformation to the spatial domain with a Fast Fourier Transform (FFT) and deconvolution using a technique such as the CLEAN algorithm [2] or Maximum Entropy Method [25].



2.3 Gridding

Gridding is the stage of the aperture synthesis process which converts a list of calibrated visibilities into a form that can be transformed to the spatial domain with an inverse Fast Fourier Transform. This operation involves sampling the measured visibilities onto a two-dimensional grid aligned with the u and v axes, which are defined for each baseline by the Earth's rotation. An example of visibilities measured by a telescope array containing eight baselines is shown in Figure 2.2. In order to minimise aliasing effects in the image plane, i.e. distortion introduced due to sampling, each visibility is mapped across a small region of the grid defined by a convolution window. The convolution function used in this project is the spheroidal function, which emphasises aliasing suppression near the centre of the image, typically near the object of interest [23].

Typically, instead of computing the spheroidal function every time it is used, coefficients are generated ahead of time and stored in an array. Because the same function is used for gridding a visibility in both the u and v directions, the coefficients of the convolution function can be stored in a one-dimensional array. The ratio between the length of the convolution array and the width of the convolution function is known as the oversampling ratio. Because a high oversampling ratio results in better suppression of aliasing in the final image, the convolution array is significantly larger than the width of the function.
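To make the indexing concrete, the following C sketch shows one common way such a look-up can be organised. It is an illustration only; the function name, the centring convention and the argument layout are my own assumptions rather than details taken from Miriad or from the kernels developed in this project.

    /* Look up a separable convolution coefficient.
     * cgf   : precomputed 1D spheroidal coefficients (ncgf elements)
     * ncgf  : length of the coefficient array, e.g. 2048
     * width : convolution function width in grid cells, e.g. 6
     * du    : offset, in grid cells, between a grid point and the exact
     *         (fractional) u coordinate of the visibility
     * The oversampling ratio is ncgf / width.                             */
    static float cgf_coeff(const float *cgf, int ncgf, int width, float du)
    {
        float oversample = (float)ncgf / (float)width;
        int idx = ncgf / 2 + (int)(du * oversample);   /* peak stored at the
                                                          centre of the array */
        if (idx < 0 || idx >= ncgf)
            return 0.0f;                   /* outside the window's support    */
        return cgf[idx];
    }

Because the window is separable, the two-dimensional weight applied to a grid point is simply the product of two such one-dimensional look-ups, one in u and one in v.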


[Figure 2.2 diagram: visibilities sampled along u(t), v(t) tracks, gridded onto a regular grid]
Figure 2.2: Overview of the gridding operation. The gridding operation takes a set of visibilities sampled from multiple baselines of a radio telescope array and convolves them onto a regular grid. This operation is necessary to prepare the visibility data for the two-dimensional inverse Fast Fourier Transform (FFT) operation, which is used to transform these visibilities from the frequency domain into a two-dimensional image in the spatial domain. Each red line represents measurements made by a separate baseline taken over a period of time.



2.4 Parallel Processors

Computer manufacturers have shifted their focus in recent years from designing fast single core processors to creating processors which can execute multiple threads simultaneously and minimise memory access latency with on-chip cache. Since these multi-core processors are still relatively new, a diverse range of architectures is available, including multi-core x86 processors such as the AMD Phenom and Intel Core i7, GPUs like NVIDIA's Tesla and AMD's FireStream series, as well as other types of processors including IBM's Cell/B.E. One of the factors limiting the usage of parallel processors by developers is the vast amount of code that has been developed for single processor computers. Often, due to inter-dependencies between operations, rewriting these legacy programs to take advantage of multiple concurrent threads is not a trivial task.

While originally developed as co-processors optimised for graphics calculations, GPUs are being designed with increasingly flexible instruction sets and are emerging as affordable massively parallel processors. NVIDIA's recent Tesla C1060 GPU is capable of 933 single precision GigaFLOPS [8] (floating point operations per second), compared to one of the fastest CPUs available at the time, Intel's Core i7 975, with a reported theoretical peak of 55.36 GigaFLOPS [14]. Part of the reason that GPUs can claim such high performance figures is their architecture. As shown in Figure 2.3, GPUs devote more die space to data processing. GPUs are thus highly optimised for performing simple vector operations on large amounts of data faster than a processor using that die space for other purposes. However, this performance comes at the expense of control circuitry, meaning that GPUs cannot make use of advanced run time optimisations commonly found on modern desktop CPUs such


as branch prediction and out-of-order execution. GPUs also sacrifice the amount of circuitry used for local cache, which has a major impact on the average amount of time a process needs to wait after requesting data from memory.



[Figure 2.3 diagram: (a) CPU Layout; (b) GPU Layout]

Figure 2.3: Comparison of CPU and GPU architectures [7]. This figure shows the difference in layout between a CPU and a GPU. CPUs are designed to be general purpose processors capable of performing a wide variety of tasks quickly. Because of this, a large amount of space on the chip's die is dedicated to control logic and local cache, both of which can be used to optimise programs at run time. GPUs are highly tuned to perform graphics operations, which are mostly simple vector operations on large amounts of data. This performance is achieved by dedicating most of the chip to the Arithmetic and Logic Units (ALUs) which perform instructions, at the expense of cache and control circuitry. Because of this, GPUs often lack many of the advanced run time optimisations commonly found on modern desktop CPUs, such as branch prediction and out-of-order execution, and accessing system memory has a higher average latency.


2.5 OpenCL

OpenCL is a programming language created by the Khronos Group with the design goal of enabling the creation of code that can run across a wide range of parallel processor architectures without needing to be modified. To deal with the many different types of processors that can be used for processing data, the OpenCL runtime separates them into two different classes: device and host. The host, which represents a general purpose computer, is in charge of transferring both device programs compiled at run time (kernels) and data to a device. It also instructs devices to run kernels and sends requests for data to be transferred back from a device to the host. A host can make use of a command queue object in order to schedule data transfers and the execution of kernels on various devices asynchronously, so that it remains free to perform other operations while the devices are busy.
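As a minimal host-side sketch of the flow just described (error handling, resource clean-up and most kernel arguments are omitted, and names such as grid_src, grid_kernel and the buffer sizes are placeholders of my own rather than the project's actual code), the steps map onto the OpenCL C API roughly as follows:

    #include <CL/cl.h>
    #include <stddef.h>

    /* Illustrative only: compile a kernel at run time, move data to the
     * device, launch the kernel and read the result back.                 */
    void run_gridding(const char *grid_src, const float *h_vis, size_t nvis,
                      float *h_grid, size_t grid_elems)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* Device programs (kernels) are compiled at run time from source. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &grid_src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "grid_kernel", NULL);

        /* Global-memory buffers, visible to both host and device.  Each
         * visibility is assumed to be packed as a float4 (u, v, re, im).  */
        cl_mem d_vis  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                       nvis * 4 * sizeof(float), NULL, NULL);
        cl_mem d_grid = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                       grid_elems * sizeof(float), NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_vis);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_grid);

        /* Commands are placed on the queue and run asynchronously; only
         * the final read (CL_TRUE) blocks until the results are ready.    */
        size_t global[2] = {1280, 2112};   /* grid size rounded up to a
                                              multiple of the local size   */
        size_t local[2]  = {8, 8};
        clEnqueueWriteBuffer(queue, d_vis, CL_FALSE, 0,
                             nvis * 4 * sizeof(float), h_vis, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, d_grid, CL_TRUE, 0,
                            grid_elems * sizeof(float), h_grid, 0, NULL, NULL);
    }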

The job of a device is simply to execute a kernel in parallel across a range of data, storing the results locally, and to then alert the host when the kernel has finished execution so the results can be transferred back. Each device can be divided into a collection of compute units and, in turn, each of these compute units is composed of one or more Processing Elements (PEs). Memory on a device is organised into four distinct regions: global, constant, local and private. Global and constant memory are shared among all compute units on a device and are the only regions of memory accessible to the host. The only major difference between these two regions is that constant memory can only be written to by the host, while global memory can be written to by both host and device. Local memory is shared by all processing elements within a work-group; it can be allocated by the host but only manipulated by the device. Finally, private memory is memory available to only



a single processing element. Figure 2.4 shows how the hierarchy of processors and the various memory types are linked together.

In order to run a kernel, the host initialises an NDRange, which represents a one, two or three dimensional array with a specific length in each dimension. The size of this NDRange, also known as an index space, determines the number of kernel instances launched. Each instance of a kernel running on a device is known as a work-item and is provided with an independent global ID representing a position in the index space. Work-items are organised into work-groups, each of which has its own group ID and provides the work-items within it with independent local IDs. When a kernel is executed, each work-group is executed on a compute unit and each work-item maps to a processing element. Limitations on various parameters, such as the maximum number of work-items in a work-group and the amount of memory available in each region, depend on the architecture of the device.
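Inside a kernel, these identifiers are available through OpenCL C built-in functions. The toy kernel below (my own illustration, not code from this project) simply records the indices each work-item sees:

    /* Illustrative only: each work-item writes its global, local and group
     * indices for a two-dimensional NDRange of size (width x height).     */
    __kernel void show_ids(__global int4 *out, int width)
    {
        int gx = get_global_id(0);      /* position in the index space      */
        int gy = get_global_id(1);
        int lx = get_local_id(0);       /* position within the work-group   */
        int gid = get_group_id(1) * get_num_groups(0) + get_group_id(0);

        out[gy * width + gx] = (int4)(gx, gy, lx, gid);
    }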

For GPU devices based on the CUDA architecture, such as the NVIDIA Tesla C1060 used in this project, OpenCL compute units correspond to hardware objects called multiprocessors. While each multiprocessor can process 32 threads in parallel (known as a warp), it is capable of storing the execution context (program counters, registers, etc.) of multiple warps simultaneously and switching between them very quickly [7]. This technique can be used to efficiently run work-groups larger than 32 threads on a single multiprocessor. Since this context switch can occur between two consecutive instructions, the multiprocessor can instantly switch to a warp with threads ready to execute if the current context becomes idle, such as when reading or writing global memory. Each multiprocessor possesses a single set of registers and


[Figure 2.4 diagram: host system with CPU, cache and system memory, connected to compute devices, each containing compute units, processing elements and their global, constant, local and private memory regions]

Figure 2.4: OpenCL memory hierarchy [7]. A system running OpenCL consists of a host, which can be any computer capable of running the OpenCL API, and one or more devices. Each device can be divided into a collection of compute units and, in turn, each of these compute units is composed of one or more Processing Elements. Memory on a device is arranged in a similar hierarchy. Global and constant memory are shared among all compute units on a device, local memory is only available to a single compute unit and private memory is specific to a single processing element. OpenCL devices typically represent GPUs, multi-core CPUs, Digital Signal Processors (DSPs) and other parallel processors. Since a host represents a general purpose computer, it has its own CPU and memory, which are used to issue commands and transfer data to the various devices as well as perform other operations outside the OpenCL environment.



a fixed amount of local memory, which are shared between all active warps. Because of this trade-off between work-group size and the memory available to each work-item, finding a balance between these parameters is essential to obtaining optimal performance.

Chapter 3

Literature Review

The gridding algorithm used in aperture synthesis is widely documented in scientific literature [4, 5, 10, 18, 23, 32]. A large part of the research effort has been focused on improving the quality of the images generated, by devising methods to programmatically determine the ideal convolution window for a given set of data, as well as on minimising artifacts introduced from oversampling.

There have been various efforts to implement this algorithm on parallel hardware [11, 19, 28-30]. Before the OpenCL standard was published, IBM's Cell processor was a major target for research efforts, although recently GPUs have become cheaper, more powerful and easier to program, leading to more research on parallelisation with GPUs, particularly with NVIDIA's CUDA based cards.

Gridding is also used in Magnetic Resonance Imaging (MRI) applications and several papers have been written on the topic of improving the gridding algorithm




as well as creating various implementations targeting heterogeneous parallel processors [1, 12, 15, 16, 20, 22, 24, 26]. While the process used to convert MRI data into images is completely different to the aperture synthesis process used in radio astronomy, both processes involve transforming irregularly sampled data in the Fourier domain into a spatial image.

An early attempt to parallelise gridding on IBM's Cell Broadband Engine is described in an article entitled Radio-Astronomy Image Synthesis on the Cell/B.E. [29], published in 2008. This paper describes an application of gridding and its inverse function, degridding, and compares the performance between an Intel Pentium D x86 CPU and two different platforms containing the Cell processor: Sony's PlayStation 3 and IBM's QS20 Server Blade. On average, the results for the Cell platforms showed a twentyfold increase in performance compared to the Pentium D, although the speed increase was negligible for small convolution kernels of less than 17x17. One of the main conclusions reached in this paper is that I/O delay and memory latency are the largest bottlenecks in scaling this algorithm to a cluster of processors.

The parallel gridding implementation detailed in this paper took advantage of the Cell's high bandwidth between processors by using the Power Processing Element (PPE) to distribute the visibility data, along with the relevant convolution and grid indices, to the Synergistic Processing Elements (SPEs) on the fly. The PPE stored separate queues as well as separate copies of the grid for each of the SPEs. Therefore, if multiple adjacent visibilities were located close to each other, they would be allocated to a single SPE to reduce the number of memory accesses. To prevent too much work from piling up in a single queue, a maximum queue size was established so that the PPE would not continuously fill a single queue while the other


SPEs idled. Each of the SPEs performed a simple loop of polling its queue until work was available, fetching the appropriate data from system memory with Direct Memory Access (DMA), performing the gridding operation and writing the results to its copy of the grid in system memory. Once all visibilities were processed, the PPE added each of the grids together to produce the output.

A follow-up paper was written by the same research team in 2009, entitled Building high-resolution sky images using the Cell/B.E. [30], detailing further optimisations to their Cell-based gridding implementation. The largest optimisation detailed in this paper was to check consecutive visibilities to see if they had identical u-v coordinates and, if so, add them together and then enqueue the combined visibility. The result of these further optimisations was a scalable version of the previous gridding algorithm designed to run on a cluster of Cell processors, with each Cell core able to process all data generated by 500 baselines and 500 frequency channels at a rate of one sample per second.

More recently, an effort was made to implement several stages of the aperture synthesis process using CUDA, which is outlined in Enabling a High Throughput Real Time Data Pipeline for a Large Radio Telescope Array with GPUs [9]. The purpose of this research was to design a data pipeline capable of processing data generated by the Murchison Widefield Array in real time. While the data pipeline required over 500 seconds of processing time running on a single core of an Intel Core i7 920, the same pipeline implemented in CUDA could be processed in under 7.5 seconds on a single NVIDIA Tesla C1060. Excluding data transfer times, the GPU implementation of gridding developed as part of this research demonstrated an average speedup of twenty-twofold when compared to the CPU version.



This research demonstrates that gridding has been successfully implemented on several different parallel processor architectures with significant performance improvements compared to existing serial implementations. Most of the research conducted to date has focused on implementing gridding on a single processor architecture or on comparing the performance of multiple independent implementations written for different devices. Due to the portability of software written in OpenCL, a parallel version of gridding implemented as an OpenCL kernel could be combined with kernels implementing other stages of aperture synthesis and run on a system comprised of multiple different compute devices.

Chapter 4

Model

The gridding algorithm is used to interpolate a set of visibilities onto a regular grid, as illustrated in Figure 2.2. Each visibility sample is projected onto a region of the grid by convolving its brightness value with a two-dimensional set of coefficients. In this chapter I will outline the model I developed which implements the gridding algorithm on the parallel architecture of NVIDIA's Tesla C1060 GPU. I describe three approaches. Firstly, the scatter approach, where each visibility is mapped to an OpenCL work-item and the kernel performs a similar convolution operation to the original serial implementation. Secondly, the gather approach, where the two-dimensional location of each pixel on the grid corresponds to a thread on the GPU and the kernel reads in the entire list of visibilities, only writing to the grid address corresponding to its global ID. Finally, the pre-sorted gather approach, which is similar to the normal gather approach except that the visibilities are sorted and placed into bins prior to gridding and each kernel instance only reads through a subset of the list of visibilities.




4.1 Scatter

Scatter communication occurs when the ID value given to a kernel processing a stream of data corresponds to a single input element and the kernel writes to multiple locations, scattering information to other parts of memory [13]. In the context of parallel gridding, a scatter kernel is implemented so that the global ID of each work-item corresponds to a single visibility and the kernel convolves this visibility over a region of the grid. Of the different parallelisation approaches discussed, scatter is the closest to a traditional serial implementation because the kernel effectively performs the same operations, although instead of looping through the list of visibilities, these operations are performed simultaneously. An example of this type of kernel is shown in Figure 4.1.

In the case of a scatter kernel operating over a set of v visibilities with a convolution function of width c, v threads are launched and each thread performs c² multiplications by looping across the convolution function in two dimensions. This results in a computational complexity of O(vc²). Although the complexity is the same as that of the serial implementation, a scatter kernel can scale across a large number of processors with a proportional speed increase.

Even though the scatter approach is very fast, it does nothing to prevent multiple threads attempting to write to the same memory location simultaneously, which can lead to a write conflict in which the result of one thread is lost. A possible solution is to provide each processor with a unique copy of the grid to write to, and to add an extra step at the end of the process to add all the grids together.


While this solution would be ideal on a multi-core CPU, it would be impractical on a GPU-like device with hundreds of processing elements, since the amount of memory needed would likely exceed that which is available for any practical grid size.
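A scatter gridding kernel of the kind described in this section might be sketched in OpenCL C as follows. This is my own illustration rather than the kernel developed in this project; it assumes each visibility is packed as a float4 of (u, v, real, imaginary) in grid coordinates, and the unguarded additions to the grid are exactly where the write conflicts discussed above occur.

    /* Illustrative scatter kernel: one work-item per visibility.          */
    __kernel void grid_scatter(__global const float4 *vis,
                               __global const float *cgf, int ncgf, int width,
                               __global float *grid_re, __global float *grid_im,
                               int gw, int gh)
    {
        int i = get_global_id(0);
        float4 v = vis[i];
        float oversample = (float)ncgf / (float)width;
        int u0 = (int)v.x - width / 2;
        int v0 = (int)v.y - width / 2;

        for (int dy = 0; dy < width; dy++) {
            for (int dx = 0; dx < width; dx++) {
                int gu = u0 + dx, gv = v0 + dy;
                if (gu < 0 || gu >= gw || gv < 0 || gv >= gh)
                    continue;
                int iu = ncgf / 2 + (int)((gu - v.x) * oversample);
                int iv = ncgf / 2 + (int)((gv - v.y) * oversample);
                if (iu < 0 || iu >= ncgf || iv < 0 || iv >= ncgf)
                    continue;
                float w = cgf[iu] * cgf[iv];
                /* Unsynchronised writes: two work-items convolving nearby
                 * visibilities can update the same cell and lose a result. */
                grid_re[gv * gw + gu] += w * v.z;
                grid_im[gv * gw + gu] += w * v.w;
            }
        }
    }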



[Figure 4.1 diagram: visibilities A, B and C assigned to separate threads, each applying the convolution function to an overlapping region of the grid]

Figure 4.1: Gridding with a scatter kernel. The scatter strategy involves assigning each visibility to a different thread, where each thread applies the convolution function. While this approach is very fast, it runs into problems when threads attempt to write to the same grid location, as shown in the magenta region where kernels A and B overlap as well as the cyan region where kernels B and C overlap. When this occurs, only one of the values being written is saved while all the other values are lost.


4.2 Gather

A gather kernel works by mapping each address in the output of a function to a thread and processing the set of input data separately at each location. The gather approach to gridding works by assigning a thread to each pixel on the output grid and having each thread process the list of visibilities separately. Since each thread only writes to a single pixel of the output grid, this approach avoids the problem of write conflicts found in scatter kernels, as shown in Figure 4.2.

Given a set of v visibilities which are to be convolved onto a w by h grid, a gather kernel needs to iterate through the list of visibilities once for each thread. Because a thread is launched for each position on the grid, this results in a complexity of O(vwh). Since the grid is always significantly larger than the convolution function, the gather approach is far more algorithmically complex than the scatter approach and therefore takes longer to run. When it comes to writing, gather kernels have one major advantage: because each thread writes to a single location, the number of writes is only w × h. Since all writes can be performed independently, given a GPU with p processing elements, the complexity of writing is only O(wh/p).

A major disadvantage of this approach is that the total number of operations performed is significantly larger, since each visibility is processed once for each thread, whereas the scatter approach only processes each visibility once. Because the convolution function used to map visibilities to the grid is significantly smaller than the grid itself, most visibilities processed by each thread fall outside the convolution width and can safely be ignored. With this in mind, an optimised version of the



gather approach was developed and is discussed in the following section.
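In code, a gather kernel along these lines might look as follows (again an illustrative sketch with assumed argument names and data layout, not the project's kernel). Each work-item owns one grid point, scans the whole visibility list, and performs exactly one write:

    /* Illustrative gather kernel: one work-item per grid point.           */
    __kernel void grid_gather(__global const float4 *vis, int nvis,
                              __global const float *cgf, int ncgf, int width,
                              __global float *grid_re, __global float *grid_im,
                              int gw)
    {
        int gu = get_global_id(0);
        int gv = get_global_id(1);
        float oversample = (float)ncgf / (float)width;
        float half_w = width / 2.0f;
        float re = 0.0f, im = 0.0f;

        for (int i = 0; i < nvis; i++) {
            float4 v = vis[i];
            float du = gu - v.x;
            float dv = gv - v.y;
            if (fabs(du) >= half_w || fabs(dv) >= half_w)
                continue;                  /* outside this pixel's window   */
            float w = cgf[ncgf / 2 + (int)(du * oversample)]
                    * cgf[ncgf / 2 + (int)(dv * oversample)];
            re += w * v.z;
            im += w * v.w;
        }
        /* The only write performed: no conflict with other work-items.    */
        grid_re[gv * gw + gu] = re;
        grid_im[gv * gw + gu] = im;
    }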


[Figure 4.2 diagram: threads indexed over the grid, each gathering contributions from the full list of visibilities A, B and C]

Figure 4.2: Gridding with a gather kernel. The gather approach to gridding works by assigning a thread to each pixel on the output grid and having each thread process the list of visibilities separately. Since each thread only writes to a single pixel of the output grid, this approach avoids the problem of write conflicts found in scatter kernels. A disadvantage to this approach is that the total number of operations performed is significantly larger, since the list of visibilities is read once for each thread, whereas the scatter approach only reads it in once.



4.3 Pre-sorted Gather

The pre-sorted gather approach attempts to significantly improve the performance of the regular gather approach by performing an additional series of steps before the gridding operation. These steps attempt to reduce the number of visibilities processed by each thread while still producing correct output. This sequence of steps, collectively called binning, works by splitting the list of visibilities into a collection of shorter lists, whereby each short list contains the visibilities located in a particular region of the grid, or bin. The binning process begins by determining a bin size, which must be equal to or larger than the convolution function. This is followed by identifying which bin each visibility is located in, and using these values to create a list of keys, as sketched below.
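A host-side sketch of this key computation is shown here; the bin dimensions and the flattening of bin coordinates into a single key are my own assumptions about how such a scheme can be laid out, not the project's exact code.

    /* Illustrative binning: compute one key per visibility.  bin_w and
     * bin_h match the work-group dimensions (e.g. 8 x 8) and must be at
     * least as large as the convolution width.                            */
    void make_bin_keys(const float *u, const float *v, int nvis,
                       int bin_w, int bin_h, int grid_w,
                       unsigned int *keys)
    {
        int bins_per_row = (grid_w + bin_w - 1) / bin_w;   /* bins per grid row */
        for (int i = 0; i < nvis; i++) {
            int bx = (int)u[i] / bin_w;        /* bin column of visibility i */
            int by = (int)v[i] / bin_h;        /* bin row of visibility i    */
            keys[i] = (unsigned int)(by * bins_per_row + bx);
        }
    }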

Once the list of keys has been generated, the visibilities are sorted based on the value stored in each visibility's corresponding key, which results in a list where the visibilities in each bin are grouped together. The list of visibilities is then processed a second time in order to generate an array containing the index of the first and last visibility in each bin. Following this step, a modified gather kernel is launched to perform the gridding process, with the array of bin indices passed as an additional argument. The size and position of each work-group corresponds to the size and location of each bin. Instead of looping through each visibility in the list, each work-item only iterates through the visibilities located in its own bin and the eight bins directly adjacent to it. An illustration of the pre-sorted gather approach is shown in Figure 4.3. While the easiest approach to sorting the visibilities into bins would be to sort them


[Figure 4.3 diagram: the grid divided into bins 00 to 33, with work-items A, B and C gathering visibilities from their own and adjacent bins]

Figure 4.3: Gridding with a pre-sorted gather kernel. Since the grid is significantly larger than the convolution window, each thread only needs to consider visibilities located nearby. To take advantage of this, the grid is divided into sub-regions called bins and the list of visibilities is sorted into an order where visibilities in each bin are grouped together. Each work-item processes visibilities located in its own bin and adjacent bins. The red, green and blue boxes on the left correspond to the list of visibilities processed by an individual work-item. The tinted bins represent adjacent bins for their corresponding coloured work-items.



on the CPU with a traditional algorithm such as Quicksort, it is possible to do this on the GPU using a parallel sorting algorithm such as a bitonic sort. Bitonic sorting is based on a network of threads taking a divide and conquer approach to sorting, implemented with two kernels: Bitonic Sort, which orders the data into alternating increasing and decreasing subsequences, and Bitonic Merge, which takes a pair of these ordered subsequences and combines them together. This was implemented with a modified version of the Bitonic Sorting network example found in NVIDIA's GPU Computing SDK, with the datatype of the values converted from uint to float4 in order to handle visibilities.
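To tie Section 4.3 together, a pre-sorted gather kernel might be structured roughly as follows. This is an illustrative sketch under the same assumptions as the earlier examples (float4 visibilities, separable coefficient look-ups, one work-group per bin); it is not the kernel developed in this project.

    /* Illustrative pre-sorted gather kernel.  bin_start[b] and bin_end[b]
     * give the first and one-past-last index of bin b in the sorted
     * visibility list; each work-group corresponds to one bin.            */
    __kernel void grid_presorted(__global const float4 *vis,
                                 __global const int *bin_start,
                                 __global const int *bin_end,
                                 int bins_x, int bins_y,
                                 __global const float *cgf, int ncgf, int width,
                                 __global float *grid_re, __global float *grid_im,
                                 int gw)
    {
        int gu = get_global_id(0);
        int gv = get_global_id(1);
        int bx = get_group_id(0);
        int by = get_group_id(1);
        float oversample = (float)ncgf / (float)width;
        float half_w = width / 2.0f;
        float re = 0.0f, im = 0.0f;

        /* Visit this work-group's own bin and the eight bins around it.   */
        for (int nby = by - 1; nby <= by + 1; nby++) {
            for (int nbx = bx - 1; nbx <= bx + 1; nbx++) {
                if (nbx < 0 || nbx >= bins_x || nby < 0 || nby >= bins_y)
                    continue;
                int b = nby * bins_x + nbx;
                for (int i = bin_start[b]; i < bin_end[b]; i++) {
                    float4 v = vis[i];
                    float du = gu - v.x, dv = gv - v.y;
                    if (fabs(du) >= half_w || fabs(dv) >= half_w)
                        continue;
                    float w = cgf[ncgf / 2 + (int)(du * oversample)]
                            * cgf[ncgf / 2 + (int)(dv * oversample)];
                    re += w * v.z;
                    im += w * v.w;
                }
            }
        }
        grid_re[gv * gw + gu] = re;
        grid_im[gv * gw + gu] = im;
    }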

Chapter 5

Testing

A GPU gridding program was successfully implemented and tested, using the pre-sorted gather model presented in Chapter 4. The testing compared the GPU gridding implementation with a single core CPU implementation in order to determine the suitability of parallel architectures to the gridding stage of aperture synthesis.

All testing was performed on a Xenon Nitro A6 Tesla workstation. This system contained a Foxconn Destroyer motherboard featuring a single AM2+ CPU socket, four dual channel DDR2 memory slots, four PCIe v2.0 slots, an NVIDIA nForce 780a SLI chipset and a 5.2 GT/s HyperTransport bus connecting the CPU with the northbridge. The CPU used was an AMD Phenom II X4 955 clocked at 3.2 GHz. 8 GB of RAM was installed, consisting of four 2 GB DIMMs running at DDR2-800. Two different graphics cards were available, an NVIDIA Tesla C1060 and an NVIDIA Quadro 5800, both of which include a 240 core GPU clocked at 1.3 GHz, with 8 GB and 4 GB respectively of on-board GDDR3 RAM clocked at 800 MHz and connected over a 512-bit interface with a bandwidth of 102 GB/s. Both graphics cards were connected to the




motherboard through a PCIe x16 bus.

The operating system used in the tests was the AMD64 release of Ubuntu Linux 9.10 (Karmic Koala), running Linux kernel 2.6.31-22. Version 4.4.1 of the GNU Compiler Collection was used to compile all C and FORTRAN code. The reference implementation of gridding was a version of the invert function from the 2010-04-22 release of the Miriad data reduction package, modified to measure and output its run time. The NVIDIA drivers installed were version 195.36.15, along with version 3.0 of the NVIDIA Toolkit, which includes the OpenCL libraries for NVIDIA GPUs, and version 3.0 of the NVIDIA GPU Computing SDK. All performance timing data was measured using the gettimeofday function found in the Unix library sys/time.h.

Performance tests were conducted using a sample dataset of 1337545 visibilities taken by the Australia Telescope Compact Array (ATCA) of Supernova SN1987A. Unless specified otherwise, the grid size used is 1186 by 2101 and the convolution function width is 6, with the convolution function data comprising a spheroidal function stored in a 2048 element array. The data used to generate each performance plot was created by running the relevant program five times in a row and averaging the execution time of the last three runs, in order to minimise the impact of hard disk seek times and power saving features on the results.

The objective of the first test, shown in Figure 5.1, was to determine the optimal local work-group size for the gridding kernel operating on the sample dataset. This value also determines the size of the bins used in the binning stage of the gridding



Figure 5.1: Thread topology optimisation. This figure shows how the performance of the GPU gridding kernel varies with a number of local work-group sizes on the sample data. This test was performed in order to find an optimal work-group size for later tests. The x-axis shows the number of work-items in each group, with each datapoint labelled with the width and height of the work-group it represents. The y-axis is the execution time of the gridding process measured in milliseconds. Because it is not perfectly clear in the diagram, the fastest work-group sizes are 6x10, 6x9, 7x8, 6x8, 10x6 and 8x8.



process. This test was conducted by iterating through each combination of work-group width and height and recording the time taken by the entire gridding process, including the time taken transferring data between the device and host. A value of 6 was used for the minimum number of elements in both dimensions, since the gridding kernel is only designed to work with both work-group dimensions equal to or greater than the convolution function width. Values greater than 16 in either dimension are not displayed on the plot, since increasing either work-group dimension past this value significantly decreased performance. While a work-group size of 6x10 resulted in the fastest execution time, 8x8 was used in further tests for reasons that are explained in Chapter 6.

Figure 5.2 illustrates the execution time of each stage of the GPU gridding implementation compared to the total execution time taken by Miriad's gridding implementation. A performance profile of the GPU gridding process with sorting handled on the CPU is shown in Figure 5.3. The purpose of these diagrams is to visualise the amount of time spent in each stage of the gridding process in order to determine if any stage in particular is acting as a performance bottleneck. Each item listed in the key of either diagram represents a distinct stage of the GPU gridding process. Binning represents the time spent determining which bin each visibility is located in. Device Transfer represents the total amount of time spent transferring the binned visibilities and convolution function from host to device. Sorting is a measure of the total time taken by the sorting stage. Bin Processing represents the time taken to transfer the sorted visibilities from device to host, build an array containing indices for the first and last visibility in each bin, and transfer this new array onto the device. Kernel Execution represents the time spent performing the actual gridding operation. Finally, Host Transfer represents the time taken transferring the grid from the device back to the host.



Figure 5.2: Performance profile of GPU gridding implementation compared with CPU gridding. This diagram illustrates the time spent in each stage of the GPU gridding process compared with the total execution time of Miriad's gridding implementation, using the sample dataset. Each item listed in the key represents a distinct stage of the GPU gridding process. Binning represents the time spent determining which bin each visibility is located in. Device Transfer represents the total amount of time spent transferring the binned visibilities and convolution function from host to device. Sorting is a measure of the total time taken by the sorting stage. Bin Processing represents the time taken to transfer the sorted visibilities from device to host, build an array containing indices for the first and last visibility in each bin and transfer this new array onto the device. Kernel Execution represents the time spent performing the actual gridding operation. Finally, Host Transfer represents the time taken transferring the grid from the device back to the host.




Figure 5.3: Performance profile of GPU gridding implementation with sorting running on the CPU. This diagram illustrates the time spent in each stage of the GPU gridding process with sorting of the visibilities handled by the CPU, using the sample dataset. This plot indicates the large amount of processing time needed to sort the visibilities into bins on the CPU. Because of this major performance bottleneck, the sorting stage was adapted to run on the GPU, which led to a significantly faster gridding implementation, as shown in the second column of Figure 5.2.


Because Miriad's gridding process is performed entirely on the host without any pre-processing of visibility data, its performance profile consists only of the kernel execution stage.

Figure 5.4 compares the execution time of GPU gridding with Miriad for visibility lists of various sizes. This test was done to compare how the performance of each program scales when provided with larger datasets to process. The large datasets used in this test were generated by repeating the visibilities in the SN1987A dataset as many times as necessary for each test.

In order to compare the performance for convolution windows of various sizes, the optimal work-group size for each convolution width needed to be measured. Because the graphics card used for testing only allows work-groups of up to 512 elements in size, only convolution windows up to 22x22 elements in size could be tested, since the convolution width acts as a lower bound for the work-group size. The results of this test are shown in Figure 5.5 and the measured optimal work-group sizes are listed in Table 5.1.

Figure 5.6 shows how the GPU gridding program performs compared with Miriad over a number of different convolution filter widths. The work-group sizes used for the GPU kernel in this test are the optimal values displayed in Table 5.1. This test was conducted by changing the convolution width parameter provided to both gridding programs. Since the convolution function used in the sample data has a large oversampling ratio, the array of convolution coefficients did not require modification.




Figure 5.4: CPU and GPU gridding performance for a varying number of visibilities. This graph compares the performance of the optimised GPU gridding implementation with Miriad as the number of elements in the visibility list, N, increased. Results were plotted starting at N = 250000 and repeated for every multiple of 250000 up to a maximum of N = 10000000. Each datapoint was generated by averaging the runtime for each value of N over four runs.



Figure 5.5: Thread optimisation for a range of convolution filter widths. The purpose of this test was to determine the optimal work-group sizes for the GPU gridding kernel for a range of convolution filter widths. Due to the way the gridding kernel handles bins, the work-group size needs to be equal to or greater than the convolution width in both dimensions in order to generate correct output, so only datapoints satisfying this criterion were plotted. Another limitation is that the GPU only allows work-groups with 512 elements or fewer, leading to upper bounds of 22 for the convolution filter width and 22x22 for the work-group size.




Figure 5.6: CPU and GPU gridding performance for a varying convolution filter width. This test was designed to demonstrate the differences in performance between the Miriad and OpenCL gridding implementations over a range of convolution widths. These convolution widths, on the x-axis, are plotted against run time, on the y-axis. CGF widths were tested over a range of 1 to 22. The maximum value was imposed due to the current GPU implementation's requirement of a work-group size equal to or greater than the convolution width, with 22x22 being the maximum work-group size possible on the NVIDIA Tesla C1060.




Convolution width   Optimal work-group     Convolution width   Optimal work-group
1                   8x8                    12                  12x12
2                   8x8                    13                  13x13
3                   7x7                    14                  14x14
4                   8x8                    15                  16x16
5                   7x7                    16                  16x16
6                   8x8                    17                  20x20
7                   7x7                    18                  21x21
8                   8x8                    19                  21x21
9                   11x11                  20                  21x21
10                  11x11                  21                  21x21
11                  11x11                  22                  22x22

Table 5.1: Optimal work-group sizes for various convolution filter widths. This table shows the best-performing local work-group sizes for a range of convolution widths, as determined by the results of the test shown in Figure 5.5.

Chapter 6

Discussion

The results presented in the previous chapter are now discussed. I will begin by examining the selection of an optimal work-group size and explaining the effect of this parameter on performance. Subsequently, the performance profile of the OpenCL gridding implementation will be discussed. Finally, the performance of both the Miriad and OpenCL gridding implementations will be compared and the parameters affecting each program will be analysed.

6.1 Work-Group Optimisation

The first major goal of testing was to determine the optimal local work-group size, which makes gridding on the GPU run in the shortest amount of time. This parameter has a major impact on GPU performance in a number of ways, as it determines how many work-items can be run simultaneously, the number of registers and




amount of shared memory available to each processor, as well as determining the size of the bins that visibilities are sorted into. For a given number of work-items in a work-group T and warp size Wsize (which is equal to 32 for GPUs based on NVIDIA's CUDA architecture), the total number of warps required by a work-group, Wwg, is given by Equation 6.1 [7], where ceil(x, y) denotes x rounded up to the nearest multiple of y:

    Wwg = ceil(T / Wsize, 1)                     (6.1)

Given the warp allocation granularity GW (equal to 2 on the Tesla), the number of registers used by a kernel Rk and the thread allocation granularity GT (512 on the Tesla), the number of registers allocated to each work-group, Rwg, can be expressed by Equation 6.2:

    Rwg = ceil(ceil(Wwg, GW) × Wsize × Rk, GT)   (6.2)

From Figure 5.1, the best performing work-group size was determined to be 60 work-items arranged as 6x10. The reason a work-group size of 8x8 was chosen instead is that it contains 64 work-items, which happens to be the maximum number that can fit into two warps. This maximises the number of work-items capable of running simultaneously on the GPU without decreasing the number of registers available to each warp.

The optimal work-group sizes measured for a range of convolution widths, shown in Figure 5.5 and Table 5.1, show several interesting patterns. For convolution widths up to 8, work-group sizes of 7x7 and 8x8 appear to produce very similar results, outperforming all the other sizes. While the 8x8 and 7x7 work-groups


contain 64 and 49 work-items respectively, they both take up 2 warps. A possible explanation for the fast performance of the 7x7 work-group is that, despite running fewer work-items in parallel, the smaller bin size reduces the number of visibilities processed in each work-group. A similar pattern can be seen in the performance of the 15x15 and 16x16 work-groups, which both require 8 warps, but contain 225 and 256 work-items respectively.

Another observation is that, while small work-groups generally outperform large work-groups, work-groups below 6x6 in size show the opposite trend. Because each multiprocessor processes a single work-group at a time, work-groups smaller than a single warp do not contain enough work-items to make use of the full set of processing elements. With a 1x1 work-group, each multiprocessor only performs operations on one processing element, while the other 31 idle.

6.2 Performance Profiling

The performance profiles shown in Figures 5.2 and 5.3 show a breakdown of the time spent in each stage of the separate gridding implementations. These plots can be used to identify performance bottlenecks within a single implementation, by comparing the run-times of its stages, as well as to measure speed-up by comparing the plots against one another.

The gridding implementation developed in this project, labelled as OpenCL with GPU sort, demonstrates a speedup of 1.46x over the original Miriad implementation. Excluding the time taken by the device and host transfer stages, as well as the transfers listed as part of bin-processing, this speedup is 2.28x. While this value is lower than the performance obtained in the other parallel gridding implementations detailed in Chapter 3, the two values cannot be directly compared due to the different datasets used.

A comparison of both OpenCL implementations in these plots reveals the impact sorting has on the runtime of pre-sorted gather-based gridding. Compared to the CPU-based sort, sorting the visibilities on the GPU is 103x faster. Combined with the other stages of the gridding process, this resulted in a total speedup of 6.36x.

6.3 Performance Comparison

Figure 5.4 shows several sharp increases in the runtime of the GPU gridding algorithm as the number of visibilities grows. Because the bitonic sorting kernel requires the list of visibilities to be padded with empty values so that its length is a power of two, these sudden runtime increases represent a combination of two factors. Firstly, the sorting kernel needs to process twice as many visibilities, which doubles the time required for that operation. Secondly, the current version of GPU gridding pads the list of visibilities with zeros on the host, so the amount of data that must be transferred to the device also doubles at each jump in runtime on the graph. These steep increases in runtime could be partially reduced by padding the visibility data with empty values on the GPU.
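
A minimal host-side sketch of this padding step is shown below; the visibility structure and function names are illustrative assumptions, not the data layout used by Miriad.

#include <stdlib.h>
#include <string.h>

/* Assumed visibility layout: u-v coordinates plus a complex sample. */
typedef struct { float u, v, re, im; } visibility;

/* Smallest power of two that is >= n. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Pad the visibility list with zero-valued entries so that its length is a
 * power of two, as required by the bitonic sorting kernel.  The caller owns
 * the returned buffer; *padded_n receives the new length. */
static visibility *pad_visibilities(const visibility *in, size_t n, size_t *padded_n)
{
    size_t m = next_pow2(n);
    visibility *out = calloc(m, sizeof *out);  /* zero padding */
    if (out != NULL)
        memcpy(out, in, n * sizeof *in);
    *padded_n = m;
    return out;
}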

As shown in Figure 5.6, the relative performance of gridding on a GPU compared to on a CPU greatly increases with large convolution widths. This is a major benefit of the gather approach over the scatter approach: while larger convolution windows increase the number of calculations performed per visibility in both algorithms, in a gather-based kernel this extra work is spread across a large number of threads. In both the CPU implementation and the scatter kernels, the number of operations performed on each visibility is proportional to the square of the convolution width.


Chapter 7

Conclusion

This project has implemented the gridding stage of aperture synthesis on a GPU using OpenCL, and compared its performance with the single-threaded gridding process used in Miriad. This chapter summarises the process of developing the GPU gridding algorithm and concludes with future considerations for extending this work.

7.1 Project Summary

The initial target of my research was to write a CPU-based gridding implementation in C. The purpose of this implementation was to gain an understanding of the gridding process and to develop wrapper code to handle input and output tasks not supported by the GPU. In order to avoid rewriting a large amount of code unrelated to the main task of gridding, this program was implemented by replacing the MapIt function call in Miriad's Mapper subroutine with a call to my own gridding function and returning the gridded output to Miriad. Once I verified that all gridding calculations were being performed within this function, I began researching different approaches to performing the operation on parallel hardware.

My first attempt at an OpenCL implementation running on the GPU made use of a kernel based on the scatter approach described in Section 4.1. This implementation was similar to the original CPU implementation, since the operations performed by the kernel on each visibility were exactly the same as the original. The major difference was that these operations were performed in parallel by work-items on the GPU instead of inside a loop on the CPU. Although this program was able to run extremely fast, I was unable to overcome the problems caused by simultaneous writes to the same memory address.
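
To illustrate the problem, the following OpenCL kernel is a simplified sketch of a scatter-style gridder rather than the exact kernel used in this project; the argument names, the WIDTH and GRID_SIZE build-time constants, and the pre-scaled u-v coordinates are all assumptions. Each work-item handles one visibility and adds its convolved contribution to the surrounding grid cells, so work-items whose footprints overlap can perform unsynchronised read-modify-write operations on the same grid address.

/* Scatter gridding sketch: one work-item per visibility.  The grid is stored
 * as interleaved (real, imaginary) floats; WIDTH (convolution support) and
 * GRID_SIZE (grid side length) are assumed build-time constants, and the u-v
 * coordinates are assumed to be pre-scaled to grid cells.  Edge handling is
 * omitted for brevity. */
__kernel void grid_scatter(__global const float4 *vis,   /* (u, v, re, im)       */
                           __global const float  *conv,  /* convolution function */
                           __global float        *grid,
                           const int n_vis)
{
    int i = get_global_id(0);
    if (i >= n_vis)
        return;

    float4 v  = vis[i];
    int    cu = (int)v.x;
    int    cv = (int)v.y;

    for (int dv = -WIDTH / 2; dv <= WIDTH / 2; dv++) {
        for (int du = -WIDTH / 2; du <= WIDTH / 2; du++) {
            float w   = conv[(dv + WIDTH / 2) * WIDTH + (du + WIDTH / 2)];
            int   idx = 2 * ((cv + dv) * GRID_SIZE + (cu + du));
            /* Unsynchronised read-modify-write: work-items handling nearby
             * visibilities can update the same cell at the same time, which
             * is the write conflict described in the text. */
            grid[idx]     += w * v.z;
            grid[idx + 1] += w * v.w;
        }
    }
}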

After further examination of the gridding process, I developed a new gridding kernel using the gather approach outlined in Section 4.2. This kernel was primarily designed to eliminate the issue of simultaneous writing, which was achieved by launching a separate thread for each pixel in the output grid. Initially this kernel was incredibly slow, taking over half an hour to grid the sample dataset described in Chapter 5. Since the gather kernel managed to produce correct results, improving its performance replaced correcting the scatter kernel as the main focus of development.
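
A simplified sketch of such a gather kernel is given below, under the same assumptions as the scatter sketch above: each work-item owns one output cell and scans the entire visibility list, so no two work-items ever write to the same address, but every visibility is examined by every work-item, which explains the very long initial runtimes.

/* Gather gridding sketch: one work-item per output grid cell. */
__kernel void grid_gather(__global const float4 *vis,   /* (u, v, re, im)             */
                          __global const float  *conv,  /* convolution function       */
                          __global float2       *grid,  /* one complex value per cell */
                          const int n_vis)
{
    int cu = get_global_id(0);   /* this work-item's grid cell */
    int cv = get_global_id(1);
    float2 acc = (float2)(0.0f, 0.0f);

    for (int i = 0; i < n_vis; i++) {
        float4 v  = vis[i];
        int    du = cu - (int)v.x;
        int    dv = cv - (int)v.y;
        if (abs(du) <= WIDTH / 2 && abs(dv) <= WIDTH / 2) {
            float w = conv[(dv + WIDTH / 2) * WIDTH + (du + WIDTH / 2)];
            acc.x += w * v.z;
            acc.y += w * v.w;
        }
    }
    grid[cv * get_global_size(0) + cu] = acc;   /* single writer per cell */
}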

It soon became apparent that the gather kernel's performance could be drastically improved by sorting the visibilities based on their location in the u-v plane and modifying the gridding kernel to only process visibilities close to the grid location designated by its global ID. This additional sorting step is explained in Section 4.3. The first version of the pre-sorted gather approach performed the visibility sorting operation on the CPU before transferring the sorted visibility data to the graphics card and running the gridding kernel. Because I was planning to eventually implement sorting on the GPU, I wrote my own version of the bitonic sort algorithm which ran on a single CPU core. Performance tests revealed that although this new approach to gridding was several hundred times faster than the original unsorted gather approach, it was still five times slower than Miriad.

Profiling this new gridding implementation showed that the visibility sorting stage was responsible for 90% of the total runtime. In order to improve overall performance, the sorting algorithm was replaced by a sorting kernel running on the GPU. This sorting kernel was taken from the OpenCL Sorting Networks example in NVIDIA's GPU Computing SDK, which matched my requirements with only slight modification. This change finally improved the performance of my gridding implementation enough to run faster than Miriad over a wide range of parameters.

7.2 Future Considerations

This research thoroughly investigated many aspects of a GPU-based gridding implementation. However, there are still many related areas yet to be explored, as well as a number of areas within the scope of this project which warrant further investigation. These include utilising different memory regions on the device and performing the remaining CPU-based binning stages on the GPU. Combining gridding with other stages of the aperture synthesis pipeline in OpenCL and adapting this work to other hardware architectures are also discussed.

The pre-sorted gather approach to gridding consists of four stages: determining each visibility's bin, sorting the visibilities, constructing an array of the indices of the first and last visibility in each bin, and gridding. Currently only the sorting and gridding stages are implemented on the GPU, while the other stages are processed on the host. Determining each visibility's bin on the GPU could be performed faster than on the CPU. Building the array of indices on the CPU requires the sorted visibilities to be transferred to the host and the array of bin locations to be transferred back to the device. Performing this calculation on the GPU would not only be faster, but would also eliminate both of these transfers.
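
As an illustration of how the first of these stages could be moved to the device, the kernel below computes a bin index for each visibility; the bin dimensions and key layout are assumptions and would need to match the format expected by the sorting kernel.

/* Compute the bin that each visibility falls into so that the result can be
 * used as the sort key.  BIN_W and BIN_H are the bin dimensions in grid
 * cells (the local work-group size) and BINS_X the number of bins along the
 * u axis; all are assumed build-time constants, and u-v coordinates are
 * assumed to be pre-scaled and non-negative. */
__kernel void compute_bins(__global const float4 *vis,
                           __global uint         *bin_key,
                           const int n_vis)
{
    int i = get_global_id(0);
    if (i >= n_vis)
        return;

    uint bx = (uint)vis[i].x / BIN_W;
    uint by = (uint)vis[i].y / BIN_H;
    bin_key[i] = by * BINS_X + bx;
}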

Since each work-group is comprised of work-items located in the same bin, each work-item processes the same set of visibilities. Currently the gridding kernel requests each visibility from global memory individually, waiting after each request. By allocating a small amount of local memory as a visibility cache, the kernel could alternate between filling the cache by requesting a series of consecutive visibilities in parallel and looping through each visibility in the cache. This optimisation does not guarantee a performance increase on all devices, since the OpenCL specification allows devices without work-group specific local memory to map it to a region of global memory. On such a device, any attempt at caching data from global memory in local memory would actually slow down a kernel.
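
A sketch of this proposed optimisation is shown below, assuming the gather kernel structure from the earlier sketches and that the first and last visibility indices of the work-group's bin are passed as arguments; the tile size is an arbitrary example value.

#define TILE 64   /* visibilities cached per iteration -- assumed value */

/* Gather kernel with a local-memory visibility cache. */
__kernel void grid_gather_cached(__global const float4 *vis,
                                 __global const float  *conv,
                                 __global float2       *grid,
                                 const int first,
                                 const int last)
{
    __local float4 cache[TILE];
    int cu    = get_global_id(0);
    int cv    = get_global_id(1);
    int lid   = get_local_id(1) * get_local_size(0) + get_local_id(0);
    int lsize = get_local_size(0) * get_local_size(1);
    float2 acc = (float2)(0.0f, 0.0f);

    for (int base = first; base < last; base += TILE) {
        /* Co-operatively fill the cache: consecutive work-items fetch
         * consecutive visibilities in parallel. */
        for (int j = lid; j < TILE && base + j < last; j += lsize)
            cache[j] = vis[base + j];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Every work-item then loops over the tile from fast local memory. */
        int count = min(TILE, last - base);
        for (int j = 0; j < count; j++) {
            float4 v  = cache[j];
            int    du = cu - (int)v.x;
            int    dv = cv - (int)v.y;
            if (abs(du) <= WIDTH / 2 && abs(dv) <= WIDTH / 2) {
                float w = conv[(dv + WIDTH / 2) * WIDTH + (du + WIDTH / 2)];
                acc.x += w * v.z;
                acc.y += w * v.w;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    grid[cv * get_global_size(0) + cu] = acc;
}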


The aperture synthesis pipeline, outlined in Section 2.2, consists of several sequential stages that convert radio signals collected by radio telescopes into a two-dimensional image of the radio source. As discussed in Chapter 3, parallel versions of most stages of aperture synthesis have been developed in CUDA in order to process data generated by the Murchison Widefield Array in real time [9]. Future work could focus on an OpenCL version of this pipeline, which could make use of a wider variety of GPUs as well as other devices supporting OpenCL.

Since kernels written in OpenCL are capable of running on any OpenCL device with sufficient memory, the gridding implementation developed in this project is able to run on a wide variety of hardware without modification. A subsequent project could focus on optimising the gridding kernel, developed here for the NVIDIA Tesla C1060, for various other devices and comparing performance across a wide range of parameters. Another potential area for further research is implementing a version of gridding capable of running on multiple OpenCL devices simultaneously.
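
For example, the same host program could enumerate every OpenCL-capable device in the system before deciding how to distribute work between them; the sketch below simply lists the available devices.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint n_platforms = 0;
    clGetPlatformIDs(8, platforms, &n_platforms);

    for (cl_uint p = 0; p < n_platforms; p++) {
        cl_device_id devices[16];
        cl_uint n_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &n_devices);

        for (cl_uint d = 0; d < n_devices; d++) {
            char name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof name, name, NULL);
            printf("platform %u, device %u: %s\n", p, d, name);
        }
    }
    return 0;
}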


Appendix A

Original Proposal

Radio interferometric image reconstruction on commodity graphics hardware using OpenCL


Alexander Ottenho
01 April 2010

The Problem
Astronomers can gain a better understanding of the creation and early evolution of the universe, test theories and attempt to solve many mysteries in physics by producing images from radio waves emitted by distant celestial entities. With the construction of vast radio-telescope arrays such as the Square Kilometre Array (SKA), Australian SKA Pathfinder (ASKAP) and Murchison Wide-field Array (MWA), many engineering challenges need to be overcome. ASKAP alone will generate data at a rate of 40Gb/s, producing over 12PB in a single month [6], and the SKA will produce many times this, so data processing and storage are major issues. As we reach the limit of how fast a single CPU core can run, we need to look to parallel processors such as multi-core CPUs, GPUs and digital signal processors to process this vast amount of data. One of the biggest problems limiting the popularity of parallel processors has been the lack of a standard language that runs on a wide variety of hardware, although a new language named OpenCL may change that.

First published by the Khronos Group in late 2008, OpenCL is an open standard for heterogeneous parallel programming [17]. One of the major advantages of code written in OpenCL is that it allows programmers to write software capable of running on any device with an OpenCL driver, eliminating the need to rewrite large amounts of code for each vendor's hardware. This partially solves the issue of vendor lock-in, a major problem in general purpose GPU (GPGPU) programming up until now, where, due to a lack of standardisation, software is often restricted to running on a single architecture produced by one company.

In this project I aim to develop an efficient way to adapt radio-interferometric imaging to parallel processors using OpenCL, in particular the gridding algorithm, as this has traditionally been the most time-consuming part of the imaging process [30]. Due to the large amount of data that will be generated by ASKAP, the rate at which data can be processed in real time may be a serious performance bottleneck. Since a cluster of GPUs with computational performance equal to a traditional supercomputer consumes a fraction of the energy, an efficient OpenCL implementation would be a significantly less expensive option. I will primarily target GPU architectures, in particular the NVIDIA Tesla C1060, although I will also attempt to benchmark and compare performance on several different devices.

Background

Radio interferometry background

The goal of radio astronomy is to gain a better understanding of the physical universe via the observation of radio waves emitted by celestial bodies. Part of this is achieved by forming images from the signals received by radio telescopes. For a single dish style radio telescope, the angular resolution R of the image generated from a signal of wavelength \lambda is related to the diameter of the dish D by (A.1).

R = \frac{\lambda}{D}    (A.1)

Since R is a measure of the finest object a telescope can detect, a dish designed to create detailed images of low frequency signals can be several hundred metres in diameter. Constructing a dish of this size is, however, both difficult and extremely expensive, so most modern radio astronomy projects utilise an array of telescopes.

Aperture synthesis is a method of combining signals from multiple telescopes to produce an image (as shown in figure ??) with a resolution approximately equal to that of a single dish with a diameter equal to the maximum distance between antennae. The first stage of this process involves taking the signals from each pair of antennas and cross-correlating them to form a baseline. The relationship between the number of antennas in an array a, and the total number of baselines b, including those autocorrelated with themselves, is shown by (A.2).

b = \frac{a(a-1)}{2} + a    (A.2)
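
As a small worked example of (A.2), the following C helper counts the baselines for an array of a antennas.

/* Number of baselines, including autocorrelations, for an array of a
 * antennas (Equation A.2).  For example, num_baselines(4) == 10. */
static int num_baselines(int a)
{
    return a * (a - 1) / 2 + a;
}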

These signals are combined to produce a set of complex visibilities, one for each baseline, frequency channel and period of time. The next stage is to generate a dirty image from these complex visibilities by translating and interpolating them to a regular grid so that the Fast Fourier Transform (FFT) can be applied. Finally, the dirty image may be deconvolved to eliminate artifacts introduced during imaging.

A common name for the stage of aperture synthesis where complex visibilities are mapped to a regular grid is gridding. The relationship between the 2-dimensional sky brightness I, 3-dimensional visibility V and primary antenna beam pattern A is shown in (A.3) [27].

A(l,m)\,I(l,m) = \iint V(u,v)\, e^{2\pi i(ul+vm)}\, du\, dv    (A.3)

The primary beam pattern A is removed during the deconvolution stage to obtain the sky brightness. For radio-telescope arrays with sufficiently large baselines or a wide field of view, images are distorted because the curvature of the earth introduces a height element w to the location of each antenna. One technique used to counter this distortion is faceting, where the sky is divided into patches small enough that the baselines can be treated as coplanar, which are then combined into one image.


Another common approach, known as W-projection, involves gridding the entire dataset as if w were 0 and then convolving each point in the dirty image with the function G given in (A.4) [3].

G(u,v,w) = \frac{i}{w}\, e^{\pi i \left[\frac{u^2+v^2}{w}\right]}    (A.4)

Parallel hardware background

Computer manufacturers have shifted their focus in recent years from designing fast, single-core processors to creating processors which can execute multiple threads simultaneously and minimise memory access latency. Since these multi-core processors are still relatively new, a diverse range of architectures is available, including multi-core x86 processors such as the AMD Phenom and Intel Core i7, IBM's Cell/B.E. and GPUs like NVIDIA's Tesla and AMD's Radeon 5800 series. One of the factors limiting the usage of parallel processors by developers is the vast amount of code that has been developed for single-processor computers. Often, due to interdependencies between operations, rewriting these legacy programs to take advantage of multiple concurrent threads is not a trivial task.

While originally developed as co-processors optimised for graphics calculations, GPUs are being designed with increasingly flexible instruction sets and are emerging as economical massively parallel processors. NVIDIA's recent Tesla C1060 GPU is capable of 933 single precision GigaFLOPS [8] (floating point operations per second) compared to the fastest CPU available at the time, Intel's Core i7 Extreme 965, which has been benchmarked at 69 single precision GigaFLOPS [?]. Part of the reason that GPUs can claim such high performance figures is their architecture, as shown in figure ??. By devoting more transistors to data processing, GPUs are highly optimised for performing simple vector operations on large amounts of data significantly faster than a processor using those transistors for other purposes. This performance, however, comes at the expense of control circuitry, meaning that GPUs can't make use of advanced run-time optimisations commonly found on modern desktop CPUs such as branch prediction and out-of-order execution. GPUs also sacrifice the amount of circuitry used for local cache, which has a major impact on the average amount of time a process must wait between requesting data from memory and receiving it.

OpenCL is a programming language created by the Khronos Group with the design goal of enabling the creation of code that can run across a wide range of parallel processor architectures without needing to be modified. To deal with the many different types of processors that can be used for processing data, the OpenCL runtime separates them into two classes: hosts and devices. The host, generally a single CPU core, is in charge of managing memory and transferring programs compiled for the device at run time (kernels) and data to and from devices. A device's job is simply to execute a kernel in parallel across a range of data, storing the results locally, and alert the host when finished so the results can be transferred back. Command queues are used so the host can queue up several instructions waiting for device execution while still being free to perform whatever other operations are necessary while waiting for results. An important feature for code executed on the device is the availability of vector data types, allowing each ALU on a device with SIMD instructions to perform an operation on multiple variables simultaneously. Because of the diverse range of devices supported, memory management on the device is left up to the programmer so they can efficiently make use of the limited local cache available on GPU threads as well as take advantage of optional device features such as texture memory.
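
The host/device model described above can be illustrated by a minimal host program (error handling omitted) that compiles a trivial kernel at run time, transfers data to the device, queues the kernel for execution and reads the result back; the kernel shown is a placeholder rather than the gridding kernel.

#include <CL/cl.h>

static const char *src =
    "__kernel void scale(__global float *x) {"
    "    x[get_global_id(0)] *= 2.0f;"
    "}";

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Kernels are compiled for the device at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    /* Transfer data to the device, run the kernel and read the result back. */
    float data[1024] = {0};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof data, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof buf, &buf);
    size_t global = 1024;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);
    clFinish(queue);
    return 0;
}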

Plan
This project aims to evaluate whether GPUs programmed using OpenCL are a suitable platform for running the gridding stage of imaging radio astronomy data in real time. So far, various research papers and journal articles have been read in an effort to understand the variety of techniques currently being used to improve gridding performance in existing projects [1, 3, 15, 23, 32], as well as previous efforts to parallelise gridding on similar processors [19, 26, 28-30]. The next step will be to construct a theoretical model through analysis of the algorithms used in the most relevant papers and research into the specifications of the target language and platform [7, 17]. This model will be used to determine where any data dependencies exist in the algorithm and to plan out a GPU-optimised solution.

Before implementing this model on a GPU target using OpenCL, a serial version will be written in ANSI C. The serial implementation will be developed first as a reference to determine the correctness of the OpenCL version. This will then be followed by an OpenCL implementation optimised for NVIDIA's Tesla C1060 processor on an x86 workstation running Ubuntu Linux. Various optimisations will be tested to improve the execution time, and the final version will be benchmarked on several different platforms.


Figure A.1: The software pipeline. [29] The first stage of this process involves taking the signals from each pair of antennas and cross-correlating them to form a baseline. These signals are combined to produce a set of complex visibilities, one for each baseline, frequency channel and period of time. The next stage is to generate a dirty image by translating and interpolating them to a regular grid and applying the Fast Fourier Transform (FFT). Finally, the dirty image may be deconvolved to eliminate artifacts introduced during imaging.

Figure A.2: A comparison of CPU and GPU architectures. [7] By devoting more transistors to data processing, GPUs are highly optimised for performing simple vector operations on large amounts of data significantly faster than a processor using those transistors for other purposes. This performance, however, comes at the expense of control circuitry, meaning that GPUs can't make use of advanced run-time optimisations commonly found on modern desktop CPUs such as branch prediction and out-of-order execution. GPUs also sacrifice the amount of circuitry used for local cache, which has a major impact on the average amount of time a process must wait between requesting data from memory and receiving it.

References
[1] P.J. Beatty, D.G. Nishimura, and J.M. Pauly. Rapid gridding reconstruction with a minimal oversampling ratio. IEEE Transactions on Medical Imaging, 24(6):799-808, 2005.

[2] B.G. Clark. An efficient implementation of the algorithm CLEAN. Astronomy and Astrophysics, 89:377, 1980.

[3] T.J. Cornwell, K. Golap, and S. Bhatnagar. Wide field imaging problems in radio astronomy. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), volume 5, pages v-861-v-864, March 2005.

[4] T.J. Cornwell. Radio-interferometric imaging of very large objects. Astronomy and Astrophysics, 202:316-321, 1988.

[5] T.J. Cornwell, M.A. Holdaway, and J.M. Uson. Radio-interferometric imaging of very large objects: implications for array design. Astronomy and Astrophysics, 271:697, 1993.

[6] T.J. Cornwell and G. van Diepen. Scaling Mount Exaflop: from the pathfinders to the Square Kilometre Array.

[7] NVIDIA Corporation. OpenCL Programming Guide for the CUDA Architecture. Available from: http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf

[8] NVIDIA Corporation. Tesla C1060 computing processor board specification. Available from: http://www.nvidia.com/docs/IO/56483/Tesla_C1060_boardSpec_v03.pdf

[9] R.G. Edgar, M.A. Clark, K. Dale, D.A. Mitchell, S.M. Ord, R.B. Wayth, H. Pfister, and L.J. Greenhill. Enabling a high throughput real time data pipeline for a large radio telescope array with GPUs. Computer Physics Communications, 2010.


[10] S. Frey and L. Mosoni. A short introduction to radio interferometric image reconstruction.

[11] K. Golap, A. Kemball, T. Cornwell, and W. Young. Parallelization of wide-field imaging in AIPS++. In Astronomical Data Analysis Software and Systems X, volume 238, page 408, 2001.

[12] A. Gregerson. Implementing Fast MRI Gridding on GPUs via CUDA.

[13] Mark Harris. Mapping computational concepts to GPUs. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 50, New York, NY, USA, 2005. ACM.

[14] Intel. Intel microprocessor export compliance metrics. Available from: http://www.intel.com/support/processors/sb/CS-023143.htm

[15] J.I. Jackson, C.H. Meyer, D.G. Nishimura, and A. Macovski. Selection of a convolution function for Fourier inversion using gridding [computerised tomography application]. IEEE Transactions on Medical Imaging, 10(3):473-478, 1991.

[16] W.Q. Malik, H.A. Khan, D.J. Edwards, and C.J. Stevens. A gridding algorithm for efficient density compensation of arbitrarily sampled Fourier-domain data.

[17] A. Munshi. OpenCL: Parallel Computing on the GPU and CPU. SIGGRAPH, Tutorial, 2008.

[18] S.T. Myers. Image Reconstruction in Radio Interferometry.

[19] S. Ord, L. Greenhill, R. Wayth, D. Mitchell, K. Dale, H. Pfister, and R.G. Edgar. GPUs for data processing in the MWA. Arxiv preprint arXiv:0902.0915, 2009.

[20] D. Rosenfeld. An optimal and efficient new gridding algorithm using singular value decomposition. Magnetic Resonance in Medicine, 40(1):14-23, 1998.

[21] R.J. Sault, P.J. Teuben, and M.C.H. Wright. A retrospective view of Miriad. Arxiv preprint astro-ph/0612759, 2006.

[22] T. Schiwietz, T. Chang, P. Speier, and R. Westermann. MR image reconstruction using the GPU. In Proc. SPIE, volume 6142, pages 1279-1290. Citeseer, 2006.

[23] F.R. Schwab. Optimal gridding of visibility data in radio interferometry. In Indirect Imaging: Measurement and Processing for Indirect Imaging, page 333, 1984.

[24] H. Sedarat and D.G. Nishimura. On the optimality of the gridding reconstruction algorithm. IEEE Transactions on Medical Imaging, 19(4):306-317, 2000.

[25] D.J. Smith. Maximum Entropy Method. Marconi Review, 44(222):137-158, 1981.


[26] T.S. Sorensen, T. Schaeffter, K.O. Noe, and M.S. Hansen. Accelerating the nonequispaced fast Fourier transform on commodity graphics hardware. IEEE Transactions on Medical Imaging, 27(4):538-547, 2008.

[27] G.B. Taylor, C.L. Carilli, and R.A. Perley. Synthesis imaging in radio astronomy II. In Synthesis Imaging in Radio Astronomy II, volume 180, 1999.

[28] A.S. van Amesfoort, A.L. Varbanescu, H.J. Sips, and R.V. van Nieuwpoort. Evaluating multi-core platforms for HPC data-intensive kernels. In Proceedings of the 6th ACM Conference on Computing Frontiers, pages 207-216. ACM, 2009.

[29] Ana Lucia Varbanescu, Alexander S. Amesfoort, Tim Cornwell, Andrew Mattingly, Bruce G. Elmegreen, Rob Nieuwpoort, Ger Diepen, and Henk Sips. Radio astronomy image synthesis on the Cell/B.E. In Euro-Par '08: Proceedings of the 14th International Euro-Par Conference on Parallel Processing, pages 749-762, Berlin, Heidelberg, 2008. Springer-Verlag.

[30] Ana Lucia Varbanescu, Alexander S. van Amesfoort, Tim Cornwell, Ger van Diepen, Rob van Nieuwpoort, Bruce G. Elmegreen, and Henk Sips. Building high-resolution sky images using the Cell/B.E. Sci. Program., 17(1-2):113-134, 2009.

[31] T.L. Wilson, K. Rohlfs, and S. Huttemeister. Tools of Radio Astronomy. Springer Verlag, 2009.

[32] M. Yashar and A. Kemball. TDP Calibration & Processing Group CPG Memo: Computational costs of radio imaging algorithms dealing with the non-coplanar baselines effect: I. 2009.
