
FAST GPU BASED ADAPTIVE FILTERING OF 4D ECHOCARDIOGRAPHY

M.TECH FIRST SEMESTER SEMINAR REPORT SIGNAL PROCESSING

Submitted in partial fulfillment of the requirements for the award of the M. Tech Degree in Electronics and Communication Engineering (Signal Processing) of the University of Kerala

Submitted by:

JERRIN THOMAS PANACHAKEL

DEPARTMENT OF ELECTRONICS AND COMMUNICATION COLLEGE OF ENGINEERING TRIVANDRUM 2012

DEPARTMENT OF ELECTRONICS AND COMMUNICATION


COLLEGE OF ENGINEERING TRIVANDRUM

Certificate
This is to certify that this report entitled FAST GPU BASED ADAPTIVE FILTERING OF 4D ECHOCARDIOGRAPHY is a bona fide record of the seminar presented by JERRIN THOMAS PANACHAKEL, under our guidance, towards partial fulfillment of the requirements for the award of the Master of Technology Degree in Electronics and Communication Engineering (Signal Processing) of the University of Kerala during the year 2012.

Prof. Jeena R.S. Asst. Prof, Dept. of ECE College of Engineering, Trivandrum (Seminar Guide)

Prof. Prajith C.A. Assoc. Prof, Dept. of ECE College of Engineering, Trivandrum (Seminar Coordinator)

Dr. Jiji C.V. Professor, Dept. of ECE College of Engineering, Trivandrum (P. G. Coordinator)

Dr. J. David Professor, Dept. of ECE College of Engineering, Trivandrum (Head of Department)

ACKNOWLEDGEMENTS

I am thankful to Dr. J. David, Head of the Department, and Dr. Jiji C.V., P.G. Coordinator, of the Department of Electronics and Communication for their help and support.

I extend my hearty gratitude to Prof. Prajith C.A., Prof. James T.G. and Prof. Susan R.J., seminar coordinators, Department of Electronics and Communication, for providing the necessary facilities and for their sincere cooperation.

I would like to express my sincere gratitude and heartfelt indebtedness to my seminar guide, Prof. Jeena R.S., Assistant Professor, Department of Electronics and Communication Engineering, for her valuable guidance and encouragement in pursuing this seminar.

I also acknowledge the other members of the faculty of the Department of Electronics and Communication Engineering and all my friends and family for their wholehearted cooperation and encouragement.

Above all, I am thankful to God Almighty for his love and blessings.

JERRIN THOMAS PANACHAKEL

Abstract
Time-resolved three-dimensional echocardiography generates four-dimensional data sets that bring new possibilities to clinical practice. The image quality of four-dimensional (4D) echocardiography is, however, regarded as poorer than that of conventional echocardiography, where time-resolved 2D imaging is used. Advanced image processing filtering methods can be used to achieve image improvements, but at the cost of heavy data processing. The recent development of graphics processing units (GPUs) enables highly parallel general-purpose computations, which considerably reduces the computational time of advanced image filtering methods. In this study, multidimensional adaptive filtering of 4D echocardiography was performed using GPUs. Filtering was done using multiple kernels implemented in OpenCL (Open Computing Language) working on multiple subsets of the data. The results show a substantial speed increase of up to 74 times, resulting in a total filtering time of less than 30 s on a common desktop. This implies that advanced adaptive image processing can be accomplished in conjunction with a clinical examination.

Contents

1 INTRODUCTION
2 ANISOTROPIC ADAPTIVE FILTERING
3 ECHOCARDIOGRAPHY
  3.1 Purpose
  3.2 Transthoracic Echocardiogram
  3.3 Transesophageal Echocardiogram
    3.3.1 Advantages
    3.3.2 Disadvantages
  3.4 M-mode Echocardiography
  3.5 Two-Dimensional Echocardiography
  3.6 Three-dimensional echocardiography
4 HARDWARE AND SOFTWARE
  4.1 Echocardiographic Image Acquisition
  4.2 Hardware
  4.3 OpenCL
5 GPU IMPLEMENTATION
6 RESULTS
  6.1 Timing Comparisons
  6.2 Filtering Efficiency
7 CONCLUSION
Bibliography

List of Figures

2.1 Visualization of a quadrature filter
3.1 Normal heart (TTE view)
3.2 3D echocardiogram of a heart viewed from the apex
5.1 Illustration of kernel invocations and data flow between CPU/GPU and GPU/GPU
6.1 Timing comparison for different kernel sizes
6.2 Timing comparisons for 3D and 4D filtering
6.3 Computational time for filtering of the aortic valve data set
6.4 Comparison of filtering efficiency
6.5 Intensity plot along a central horizontal line from the original, 3D, and 4D filtered aortic valve view
6.6 Intensity plot along a central horizontal line from the original, 3D, and 4D filtered four chamber view

Chapter 1 INTRODUCTION
CARDIAC ultrasound began with single-crystal transducer displays of the amplitude (A-mode) of reflected ultrasound versus depth on an oscilloscope screen, creating images that were difficult to interpret. When the time dimension was introduced (M-mode), the images became somewhat easier for the clinician to interpret, and interpretation became easier still with the introduction of 2D images. Among the disadvantages of 2D imaging is that structures cannot be seen in all three spatial dimensions or viewed from different orientations, which can be of interest, for instance, when localizing a prolapse or vegetation of the mitral valve. The latter can be achieved using 3D/4D echocardiography, which also gives the opportunity to study structures from the surgical view and makes it possible to travel through the heart, which is of special interest in patients with complex congenital heart disease. The 3D echocardiographic data set can be used to generate multiple 2D image planes, which is useful, for instance, when calculating the left ventricular stroke volume. Real-time 3D transesophageal echocardiography may be used during complex interventional procedures like percutaneous edge-to-edge repair of the mitral valve. A new area of use for 3D echocardiography is, for instance, the assessment of myocardial perfusion during adenosine stress. However, the clinical use of 3D/4D echocardiography is heavily reliant on image quality, even more so than standard 2D, and over the years the use of 3D/4D has been limited by complex and sometimes time-consuming requirements for postprocessing, often on a separate workstation after the examination.

The trend today towards 3D and 4D imaging modalities of ever higher resolution poses an increasing computational challenge for the processing and filtering of these data sets. At the same time, the trend of the last few decades towards processors of ever increasing clock frequency has been hampered by basic physical limitations of electronics and has been replaced by a focus on parallelization of hardware. This shift in focus implies that basic image filtering algorithms no longer automatically scale up to larger data sets with later generations of processors, but instead often require a complete redesign of algorithms and programming tools. As such, there is an increasing need for efficient algorithms that can exploit new computational hardware such as multicore processors and, lately, general purpose computations on graphics processors (GPUs). The possibility to substantially speed up processing using GPUs has been exploited in different applications of ultrasound imaging. Real-time tracking, image registration, and segmentation are examples where GPUs have facilitated computationally demanding algorithms. It has also been shown how GPU computations can be used for image denoising of ultrasound images. In a recent publication it was shown how exploiting the full dimensionality (4D) of the data improves image denoising of 4D cardiac CT data sets. There is, however, no published work on image denoising of 4D echocardiography data in which the full dimensionality of the data is taken into account in the denoising process. In this work, an efficient data-parallel version of an off-the-shelf adaptive filtering method for image denoising is discussed. The algorithm was applied to 3D and 4D echocardiography data, and it is concluded that analysis of 3D and 4D echocardiography can be done in more practical time frames with the method presented here. For the method to be of practical clinical use, the filtering should be possible to perform on a normal desktop computer within a time frame allowing for a normal patient throughput. Due to the size of the data to be filtered and the computational cost per sample, this imposes a hard requirement of efficient implementation of the computations. For reference, the data sets under consideration here range roughly from 10^7 to 10^8 samples, and the computation requires on the order of 10^4 to 10^5 convolution steps per sample to measure the local orientations. To satisfy this computational requirement, an algorithm was developed that performs this filtering on recent graphics processing units (GPUs), which promise to deliver a floating point performance that is sufficient for the filtering of the 4D data. To deliver this performance, the basic filtering steps have been rewritten and optimized to a form suitable for GPUs.

Chapter 2 ANISOTROPIC ADAPTIVE FILTERING


Adaptive filters are commonly used in image processing to enhance or restore data by removing noise without significantly blurring the structures in the image. The adaptive filtering literature is vast and cannot adequately be summarized in a short chapter. However, a large part of that literature concerns one-dimensional (1D) signals. Such methods are not directly applicable to image processing, and there is no straightforward way to extend 1D techniques to higher dimensions, primarily because there is no unique ordering of data points in dimensions higher than one. Since higher-dimensional medical image data are not uncommon (2D images, 3D volumes, 4D time-volumes), this chapter focuses on adaptive filtering techniques that can be generalized to multidimensional signals. On the basis of the characteristics of the human visual system, local anisotropy is an important property in images, and Knutsson et al. introduced an anisotropic component into Abramatic and Silverman's model:

H_{γ,α} = H + (1 − γ)(α + (1 − α) cos²(θ − φ))(1 − H)    (2.1)

where γ (inherited from the isotropic model) controls the amount of filtering, the parameter α controls the level of anisotropy, φ defines the angular direction of the filter coordinates, and θ is the orientation of the local image structure. The specific choice of the weighting function cos²(θ − φ) was imposed by its ideal interpolation properties: the directed anisotropy function can be implemented as a steerable filter from three fixed filters. The local orientation and the degree of anisotropy are estimated with three oriented Hilbert transform pairs, so-called quadrature filters, with the same angular profiles as the three basis functions describing the steerable weighting function. Figure 2.1 shows one of these Hilbert transform pairs.
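As an editorial aside (a 2D illustration, not part of the original text): the steerability rests on the identity cos²x = 1/2 + (1/2)cos 2x, from which one can verify that the weighting function for an arbitrary direction φ is a linear combination of three fixed basis functions,

cos²(θ − φ) = Σ_{k=1..3} [(1 + 2 cos(2(φ − φ_k))) / 3] · cos²(θ − φ_k),    φ_k ∈ {0, π/3, 2π/3}.

Analogous constructions with more basis directions underlie the 3D and 4D quadrature filter sets used later in this report.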

In areas of the image lacking a dominant orientation, α is set to 1, and Eq. 2.1 reverts to the isotropic Abramatic and Silverman solution. The more dominant the local orientation, the smaller the α value and the more anisotropic the filter. This method can intuitively be seen as a linear combination of low pass and high pass filters, where the combination of the filters is spatially variant and relies on the orientation of the local structures surrounding each data sample. These orientation estimates adjust the filter to preserve the edges of surfaces while the low pass components remove the noise. The theory behind this concept can be found in the literature; only a concise introduction to the method is given here. First, the local orientation is determined using a set of quadrature filters q_k. The number of quadrature filters used depends on the dimensionality of the data: in 3D, six quadrature filters are used, while twelve are used in 4D. Each quadrature filter has a specific orientation and consists of a kernel pair with even and odd convolution kernels that are sensitive to lines and edges, respectively. The output q_k of each filter is a complex number whose magnitude |q_k| is an estimate of the certainty of a signal change corresponding to a line or an edge. Based on the responses from the quadrature filters, the local orientation tensor T is given by

T = Σ_k |q_k| (α n̂_k n̂_k^T − β I)    (2.2)

where n̂_k is the direction of the quadrature filter q_k, I is the identity tensor, and α, β are constants with values α = 5/4, β = 1/4 in 3D and α = 1, β = 1/6 in 4D. By calculating the eigenvalues and eigenvectors of T, the local orientation can be interpreted. If all eigenvalues are approximately equal, then T describes an isotropic neighborhood with no dominating orientation, while in other cases, when the eigenvalues differ in magnitude, neighborhoods of, for example, planes and lines are described. Based on the intrinsic information in T regarding the orientation of the local neighborhood, an adaptive filter synthesis is performed, where the resulting adaptive filter s_ap is given by a weighted sum of fixed filters

s_ap = s_lp + a_hp Σ_k c_k s_k    (2.3)

where s_lp is the result from a low pass filter, a_hp is the high-pass amplification factor, which gives a trade-off between filtering quality and the risk of introducing high-pass filtering artifacts exaggerating edges and lines, s_k is the output from a high-pass filter with the same direction as the quadrature filter, and c_k is a weighting coefficient, defined below.

Figure 2.1: Visualization of a quadrature filter (Hilbert transform pair) used in the estimation of local anisotropy. (Top) The plots show the filter in the spatial domain: the real part (left) and the imaginary part (right). It can be appreciated that the real part can be viewed as a line filter and the imaginary part as an edge filter. The color coding is: green, positive real; red, negative real; blue, positive imaginary; orange, negative imaginary. (Bottom) The left plot shows the magnitude of the filter with the phase of the filter color coded. The right plot shows the quadrature filter in the Fourier domain. Here the filter is real and zero on one half of the Fourier domain.

The weighting coefficient c_k is given by

c_k = C · (α n̂_k n̂_k^T − β I)    (2.4)

where C is the control tensor and · symbolizes the scalar product between tensors. The control tensor is used to control the degree of anisotropy of the adaptive filter. When determining C, a low-pass filtered version of the local orientation tensor, T_lp, is used. Calculating the weighted outer products of the eigenvectors of T_lp gives the control tensor

C = ((λ_1 − λ_2)/(λ_1 + λ_2)) Σ_{k=1..N} λ_k ê_k ê_k^T    (2.5)

where λ_k is the eigenvalue of T_lp, with λ_i ≥ λ_{i+1} for all i = 1, …, M, μ is a resolution parameter ranging from zero to one, and ê_k is the eigenvector of T_lp corresponding to λ_k.
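To make the synthesis step concrete, the following plain-C sketch (an editorial illustration; the function names, the symmetric-tensor component layout, and the hard-coded six 3D filter directions are assumptions, not the original implementation) computes one output sample according to Eqs. (2.3) and (2.4):

    /* Per-voxel adaptive synthesis, Eqs. (2.3)-(2.4), 3D case.
     * Symmetric 3x3 tensors are stored as 6 components. */
    enum { XX, YY, ZZ, XY, XZ, YZ };

    /* Scalar product <A, B> of two symmetric tensors;
     * off-diagonal components are counted twice. */
    static float tensor_dot(const float a[6], const float b[6])
    {
        return a[XX] * b[XX] + a[YY] * b[YY] + a[ZZ] * b[ZZ]
             + 2.0f * (a[XY] * b[XY] + a[XZ] * b[XZ] + a[YZ] * b[YZ]);
    }

    /* s_ap = s_lp + a_hp * sum_k c_k * s_k, with
     * c_k = C . (alpha n_k n_k^T - beta I), k = 1..6 in 3D. */
    float adaptive_sample(float s_lp, float a_hp,
                          const float C[6],     /* control tensor */
                          const float n[6][3],  /* filter directions */
                          const float s_hp[6],  /* high pass outputs */
                          float alpha, float beta)
    {
        float s_ap = s_lp;
        for (int k = 0; k < 6; ++k) {
            const float M[6] = {
                alpha * n[k][0] * n[k][0] - beta,   /* XX */
                alpha * n[k][1] * n[k][1] - beta,   /* YY */
                alpha * n[k][2] * n[k][2] - beta,   /* ZZ */
                alpha * n[k][0] * n[k][1],          /* XY */
                alpha * n[k][0] * n[k][2],          /* XZ */
                alpha * n[k][1] * n[k][2]           /* YZ */
            };
            s_ap += a_hp * tensor_dot(C, M) * s_hp[k];  /* c_k * s_k */
        }
        return s_ap;
    }

With α = 5/4 and β = 1/4 this corresponds to the 3D constants quoted above; the 4D case is analogous, with 12 directions and symmetric 4×4 tensors stored as 10 components.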

Chapter 3 ECHOCARDIOGRAPHY
An echocardiogram, often referred to in the medical community as a cardiac ECHO or simply an ECHO, is a sonogram of the heart, also known as a cardiac ultrasound. It uses standard ultrasound techniques to image two-dimensional slices of the heart. The latest ultrasound systems now employ 3D real-time imaging. In addition to creating two-dimensional pictures of the cardiovascular system, an echocardiogram can also produce accurate assessments of the velocity of blood and cardiac tissue at any arbitrary point using pulsed or continuous wave Doppler ultrasound. This allows assessment of cardiac valve areas and function, any abnormal communication between the left and right sides of the heart, any leaking of blood through the valves (valvular regurgitation), and calculation of the cardiac output as well as the ejection fraction. Other parameters measured include cardiac dimensions (luminal diameters and septal thicknesses) and the E/A ratio. Echocardiography was an early medical application of ultrasound. It was also the first application of intravenous contrast-enhanced ultrasound, a technique that injects gas-filled microbubbles into the venous system to improve tissue and blood delineation. Contrast is also currently being evaluated for its effectiveness in evaluating myocardial perfusion, and it can be used with Doppler ultrasound to improve flow-related measurements. Echocardiography is performed by cardiac sonographers, cardiac physiologists or doctors trained in cardiology. The purpose of echocardiography in general and the various types of echocardiograms are discussed below.

3.1 Purpose

Echocardiography is used to diagnose cardiovascular diseases. In fact, it is one of the most widely used diagnostic tests for heart disease. It can provide a wealth of helpful information, including the size and shape of the heart, its pumping capacity, and the location and extent of any damage to its tissues. It is especially useful for assessing diseases of the heart valves. It not only allows doctors to evaluate the heart valves, but it can also detect abnormalities in the pattern of blood flow, such as the backward flow of blood through partly closed heart valves, known as regurgitation. By assessing the motion of the heart wall, echocardiography can help detect the presence and assess the severity of any wall ischemia that may be associated with coronary artery disease. Echocardiography also helps determine whether any chest pain or associated symptoms are related to heart disease. It can also help detect cardiomyopathies, such as hypertrophic cardiomyopathy, among others. The biggest advantage of echocardiography is that it is noninvasive (it doesn't involve breaking the skin or entering body cavities) and has no known risks or side effects.

3.2 Transthoracic Echocardiogram

A standard echocardiogram is also known as a transthoracic echocardiogram (TTE), or cardiac ultrasound. In this case, the echocardiography transducer (or probe) is placed on the chest wall (or thorax) of the subject, and images are taken through the chest wall. This is a non-invasive, highly accurate and quick assessment of the overall health of the heart. A cardiologist can quickly assess a patient's heart valves and degree of heart muscle contraction (an indicator of the ejection fraction). The images are displayed on a monitor and recorded either on videotape (analog) or by digital techniques. An echocardiogram can be used to evaluate all four chambers of the heart. It can determine the strength of the heart, the condition of the heart valves, the lining of the heart (the pericardium), and the aorta. It can be used to detect a heart attack, enlargement or hypertrophy of the heart, and infiltration of the heart with an abnormal substance. Weakness of the heart, cardiac tumors, and a variety of other findings can be diagnosed with an echocardiogram. With advanced measurements of the movement of the tissue over time (tissue Doppler), it can measure diastolic function, fluid status and dyssynchrony.

The TTE is highly accurate for identifying vegetations (masses consisting of a mixture of bacteria and blood clots), but the accuracy can be reduced in up to 20% of adults because of obesity, chronic obstructive pulmonary disease, chest-wall deformities, or otherwise technically difficult patients. TTE in adults is also of limited use for the structures at the back of the heart, such as the left atrial appendage. Transesophageal echocardiography may be more accurate than TTE because it excludes the variables previously mentioned and allows closer visualization of common sites of vegetations and other abnormalities. Transesophageal echocardiography also affords better visualization of prosthetic heart valves.

Bubble contrast TTE involves the injection of agitated saline into a vein, followed by an echocardiographic study. The bubbles are initially detected in the right atrium and right ventricle. If bubbles appear in the left heart, this may indicate a shunt, such as a patent foramen ovale, atrial septal defect, ventricular septal defect or arteriovenous malformations in the lungs.

Figure 3.1: Normal heart (TTE view)

3.3 Transesophageal Echocardiogram

A transesophageal echocardiogram, or TEE, is an alternative way to perform an echocardiogram. A specialized probe containing an ultrasound transducer at its tip is passed into the patient's esophagus. This allows image and Doppler evaluation which can be recorded. It has several advantages and some disadvantages compared with a transthoracic echocardiogram (TTE).

3.3.1 Advantages

The advantage of TEE over TTE is usually clearer images, especially of structures that are difficult to view transthoracically (through the chest wall). The explanation for this is that the heart rests directly upon the esophagus, leaving only millimeters for the ultrasound beam to travel. This reduces the attenuation (weakening) of the ultrasound signal, generating a stronger return signal and ultimately enhancing image and Doppler quality. Comparatively, transthoracic ultrasound must first traverse skin, fat, ribs and lungs before reflecting off the heart and back to the probe before an image can be created. All these structures, along with the increased distance the beam must travel, weaken the ultrasound signal, thus degrading the image and Doppler quality. In adults, several structures can be evaluated and imaged better with the TEE, including the aorta, the pulmonary artery, the valves of the heart, both atria, the atrial septum, the left atrial appendage, and the coronary arteries. TEE has a very high sensitivity for locating a blood clot inside the left atrium.

3.3.2 Disadvantages

TEE has several disadvantages:

- It requires a fasting patient (the patient must follow the ASA NPO guidelines, i.e. usually not eat or drink anything for eight hours prior to the procedure).
- It requires a team of medical personnel.
- It takes longer to perform.
- It may be uncomfortable for the patient.
- It may require sedation or general anesthesia.
- It has some risks associated with the procedure (esophageal perforation, 1 in 10,000, and adverse reactions to the medication).

3.4 M-mode Echocardiography

The M-mode echocardiogram yields a one-dimensional (ice-pick) view of the cardiac structures moving over time. The echoes from various tissue interfaces along the axis of the beam move during the cardiac cycle and are swept across the screen, providing the dimension of time. The lines on the recordings correspond to the position of the imaged structures in relation to the transducer and other cardiac structures at any instant in time. More accurate placement of the M-mode cursor within the heart is achieved by using the two-dimensional (2-D) real-time image as a guide. The M-mode echocardiogram uses a high sampling rate and can yield cleaner images of cardiac borders, allowing the echocardiographer to obtain more accurate measurements of cardiac dimensions and to evaluate cardiac motion more critically. Careful placement of the M-mode beam at the appropriate locations within the heart and obtaining clean echoes of endocardial surfaces are critical to obtaining accurate measurements and to making the calculations performed from these measurements meaningful. Standard M-mode views are obtained from the right parasternal position. The M-mode cursor should be positioned within the heart using the right parasternal short axis view, to avoid inclusion of a papillary muscle within the left ventricular free wall thickness. The standard M-mode views utilized in veterinary medicine include the left ventricle (at the level of the chordae tendineae), the mitral valve, and the aortic root (aorta/left atrial appendage) view.

3.5 Two-Dimensional Echocardiography

Two-dimensional echocardiography allows a plane of tissue (both depth and width) to be imaged in real time. Thus, the anatomic relationships between various structures are easier to appreciate than with M-mode echocardiographic images. An infinite number of imaging planes through the heart are possible; however, standard views are used to evaluate the intra- and extracardiac structures. The standard views are obtained from the right parasternal window in all species, and from the left parasternal window in adult large animals or in other species when imaging the heart from the left side is desirable. Occasionally, images are obtained from subxiphoid (subcostal) or thoracic inlet (suprasternal) positions. These views are usually only feasible in small animals or young large animals. The standard views include the right parasternal long axis views of the 4 chambers (4 chamber view), the left ventricular outflow tract, and the right ventricular outflow tract, and the short axis views perpendicular to this plane (left ventricle at the chordal level, mitral valve, and aorta/left atrial appendage). In large animals, left parasternal long axis views of the mitral valve, aorta and pulmonary artery are also obtained when indicated.


Figure 3.2: 3D echocardiogram of a heart viewed from the apex

3.6 Three-dimensional echocardiography

3D echocardiography (also known as 4D echocardiography when the picture is moving) is now possible, using a matrix array ultrasound probe and an appropriate processing system. This enables detailed anatomical assessment of cardiac pathology, particularly valvular defects and cardiomyopathies. The ability to slice the virtual heart in infinite planes in an anatomically appropriate manner and to reconstruct three-dimensional images of anatomic structures makes 3D echocardiography unique for the understanding of the congenitally malformed heart. Real-time 3-dimensional echocardiography can be used to guide the location of bioptomes during right ventricular endomyocardial biopsies, the placement of catheter-delivered valvular devices, and many other intraoperative assessments.


Chapter 4 HARDWARE AND SOFTWARE


4.1 Echocardiographic Image Acquisition

For the purpose of validating the filtering algorithms presented here, volumetric data sets collected from a healthy volunteer using a GE Vingmed Vivid E9 cardiovascular ultrasound system were used. The heart of the volunteer was sampled from multiple projections, most importantly a parasternal basal short-axis view at the level of the aortic valve and an apical four-chamber view. Volume data were acquired over four consecutive cardiac cycles (multibeat). These recordings were exported as envelope data sets from the proprietary system to DICOM format, which was read using a custom program running on standard laptops and desktops.

4.2 Hardware

The implementation of the above algorithm was evaluated on a modern desktop PC and an off-the-shelf laptop. The former consists of a 3.33 GHz AMD Phenom II six-core processor, 4 GB DDR3 RAM, and an AMD 6950 graphics card. The latter is an Asus G73JH laptop containing an Intel Core i7 720QM (4 hyper-threaded CPU cores at 1.6 GHz), 8 GB DDR3 RAM, and an AMD Mobility Radeon 5870 graphics card. For reference, the graphics cards above contain 1408 and 800 individual stream processors operating at 800 and 700 MHz, respectively. As seen in the results, these numbers reflect almost linearly on the performance of the algorithm. Parsing of the volumetric data files of 40–700 MB each was performed on the CPU following the standard DICOM specification with the VolDICOM extension as specified by the manufacturer of the ultrasound machines.

4.3 OpenCL

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus application programming interfaces (APIs) that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. It has been adopted by Intel, Advanced Micro Devices, Nvidia, and ARM Holdings. For example, OpenCL can be used to give an application access to a graphics processing unit for non-graphical computing (see general-purpose computing on graphics processing units). Academic researchers have investigated automatically compiling OpenCL programs into application-specific processors running on FPGAs, and commercial FPGA vendors are developing tools to translate OpenCL to run on their FPGA devices.

The simplest way of implementing the algorithm is to let each sample of the data correspond to one work item, which is possible only if there are no inter-data dependencies. As we will see below, this can be improved to give better performance by allowing dependencies and working around the constraints posed by limitations in the memory available for storing temporary data.
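To make this concrete, a minimal OpenCL C kernel in this one-work-item-per-sample style might look as follows (an illustrative sketch; the buffer names and the choice of computing the quadrature magnitudes are assumptions for the example, not code from the implementation described here):

    /* One work item per data sample; no inter-sample dependencies.
     * Computes s_k = |q_k| from the real/imaginary parts of one
     * quadrature filter response. Buffer names are illustrative. */
    __kernel void quadrature_magnitude(__global const float *q_re,
                                       __global const float *q_im,
                                       __global float *s_mag)
    {
        size_t i = get_global_id(0);
        float re = q_re[i];
        float im = q_im[i];
        s_mag[i] = sqrt(re * re + im * im);  /* |q_k| */
    }

Each sample is then processed by exactly one work item, matching the dependency-free case described above.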


Chapter 5 GPU IMPLEMENTATION


To perform adaptive filtering of typical 4D echocardiography data sets within reasonable time frames, the adaptive filtering algorithm was implemented onboard a graphics processor (GPU). By structuring the problem and modifying the algorithms accordingly, a significant speedup compared to a traditional implementation of the adaptive filtering algorithm could be obtained. To analyze the trade-off between execution time and quality of filtering, both 3D and 4D based filtering of the data sets were implemented. In the first case, the filtering method was applied to one volume of ultrasound data for each time frame of the data set. In the latter case, the whole data set was considered as a four dimensional array of data and filtering was performed also along the fourth dimension.

In the standard way of describing the adaptive filtering method, one typically constructs the orientation tensor for each data sample and applies a low pass filter componentwise to the tensors. We note that performing the low pass filtering before the tensor construction gives an equivalent result, but with fewer operations, since the constructed orientation tensors are linear in the quadrature filter responses:

T ∗ f_lp = Σ_k (|q_k| n̂_k n̂_k^T) ∗ f_lp    (5.1)
         = Σ_k (|q_k| ∗ f_lp) n̂_k n̂_k^T    (5.2)

Thus, for the 3D and 4D filtering cases we require only 6 or 12 low pass filtering operations, since q_k only contains the magnitude of the complex quadrature filter responses.

The convolution kernels for line and edge detection are precomputed for a given radius r_c, together with a Gaussian low pass kernel of radius r_g, both measured in data samples. The filter q_k could be computed by a combined kernel of size e.g. (2r_g + 1 + 2r_c + 1)^3 for the 3D filtering. However, we obtain the same result by performing convolutions with the high pass filters of size (2r_c + 1)^3 followed by three (respectively four) consecutive convolutions with a 1D Gaussian filter of size (2r_g + 1). By performing this sequence of convolutions with smaller kernels, fewer convolution operations are needed. The full filtering algorithm could be implemented in a naively data parallel fashion where each work item would perform convolutions with the larger kernels followed by the construction of control tensors and the final filtering. Such an approach would require no intermediate storage of values and thus no communication between different work items. However, since the convolutions with the smaller quadrature filters and 1D low pass filters give a significant reduction in the number of computations, the computation was instead split into a series of kernels. The intermediate values from the different kernel computations are stored onboard the GPU. At a high level, the adaptive filtering algorithm can be described by the steps below.

Step 1: Compute the quadrature filter response q_k for each data point as a combination of the line and edge detection filter kernels.

Step 2: Let s_k = |q_k| for each data point.

Step 3: Perform a convolution of |q_k| by applying three 1D low pass Gaussian filters oriented along each of the first three dimensions consecutively, forming |q_k| ∗ f_lp. Apply the same convolutions to form the low pass filtered data s_lp.

Step 4: Form the orientation tensor T (Eq. 2.2) using the low pass filtered |q_k| and the corresponding directions n̂_k.

Step 5: Compute the eigenvalues λ_i of the tensor T from the characteristic equation det(T − λ_i I) = 0 and find the corresponding eigenvectors ê_i by Gauss-Jordan elimination.

Step 6: Form the control tensor C (Eq. 2.5) and the weighting coefficients c_k (Eq. 2.4) for each high pass filter.


Step 7: Compute the final output s_ap (Eq. 2.3) from the weighting coefficients, the low pass filtered data and the high pass filtered data.

Data dependencies between different data points occur only in Step 3 above, and the algorithm can thus be implemented with the following three data parallel computation kernels:
- the quadrature convolution kernel, which performs steps 1–2 of the algorithm;
- the low pass filtering kernel, which performs filtering with a 1D Gaussian kernel;
- the adaptive filtering kernel, which performs steps 4–7 of the algorithm,

where the low pass filtering kernel is invoked once per dimension. A straightforward data parallel implementation of these kernels, running on one or more processor cores on a desktop machine and processing the data frame by frame, can easily be implemented. To store the intermediate results for q_k we require two floating point values per filter and data sample, in at least two copies during the low pass convolutions. For the considered data sets this consumes at least 1536 MiB of data for the 3D filtering case and (2r_g + 1) times that for the 4D filtering case, since multiple frames need to be stored. Since the GPUs of today cannot handle computations with such large temporary data, the computational task was split up into the consecutive filtering of a number of subvolumes, each responsible for computing N^3 data points of the filtered data on each frame. For steps 4–7 of the algorithm above, we require the corresponding N^3 values of T, s_lp, and s_k to compute s_ap. For the multiple executions of step 3, however, we require between N^3 and (N + 2r_g)^3 values of q_k to correctly handle the low pass filtering of the overlap between subvolumes. This gives a computational cost ((N + 2r_g)/N)^3 times higher than if all the intermediate calculations could be saved. For an illustration of the split-up between kernel executions and the intermediate data sets that are stored, see Figure 5.1. With the above value for N, the five kernel executions per subvolume will be performed 64 times per frame of the data set. Low pass filtering was not performed along the time dimension for 4D filtering. This is due to the requirement of storing a four dimensional q_k, requiring (2r_g + 1) times as much memory for intermediate calculations, and either storing the computed q_k between the invocations for different frames (requiring too much onboard memory, or swapping to the CPU's RAM) or paying the corresponding (2r_g + 1) times multiplied cost of recomputing the values for each new frame. However, since the input data sets are of considerably smaller size, the individual values of q_k for an N^3 subset can be computed along the fourth dimension for reasonable sizes of r_c.
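As an illustration of one such pass, a 1D Gaussian low pass kernel in OpenCL C might look as follows (an editorial sketch, not the original implementation; the names and the border handling by sample replication are assumptions):

    /* One 1D Gaussian pass along x; the same kernel is re-invoked
     * once per dimension with the roles of the axes exchanged.
     * gauss holds the 2*rg + 1 filter taps. */
    __kernel void lowpass_x(__global const float *src,
                            __global float *dst,
                            __constant float *gauss,
                            const int rg,
                            const int nx, const int ny)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int z = get_global_id(2);
        float acc = 0.0f;
        for (int t = -rg; t <= rg; ++t) {
            int xc = min(max(x + t, 0), nx - 1);   /* replicate border */
            acc += gauss[t + rg] * src[(z * ny + y) * nx + xc];
        }
        dst[(z * ny + y) * nx + x] = acc;
    }

Because the Gaussian is separable, invoking such a pass once per dimension reproduces the full multidimensional low pass filtering at a fraction of the cost, as discussed above.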

Figure 5.1: Illustration of kernel invocations and data flow between CPU/GPU and GPU/GPU. Data on the CPU side is stored with one byte per volumetric sample; temporary data on the GPU side is stored as floating point vectors/tensors per sample. R is the radius of the convolution kernels and N = 64 is the size of each subvolume for which filtering is performed.


1) Computational Performance: The computationally most expensive steps in the algorithm above are the convolutions with the quadrature filters and with the low pass filters. For the latter, we have already seen how it can be changed from using 3D or 4D convolution kernels to a sequence of 1D kernels, giving a speedup of (2r_g + 1)^d / (d(2r_g + 1)), which for r_g = 4 gives respectively 27 and 182 times fewer computations for the 3D and 4D cases. For the former, we cannot easily reduce the problem to lower dimensions. For this purpose, optimization of the basic convolution algorithm itself for use on GPU hardware is of interest. Due to the relatively small size of the convolution kernels and the large number of convolutions to perform, spatial domain convolution was used on the GPU, and the algorithm was optimized to fit the number of simultaneous convolution operations, the specific kernel sizes, and the dimensionality of the problem. If we let s be the linear size of each frame in the data set, f the number of frames, d the dimensionality of the problem (3D or 4D) and r_c the convolution radius, the number of basic convolution operations (one multiplication and one addition) required is (2r_c + 1)^d s^3 f for each of the 6 or 12 complex convolutions. For 36 frames of size 256^3 and r_c = 3 we thus require 5 × 10^12 operations for the 3D filtering and 6.9 × 10^13 operations for the 4D filtering. Compare these numbers with the typical floating point performance of modern multi-core CPUs, theoretically capable of 10–100 × 10^9 floating point operations per second, with practically measured LINPACK benchmarks of 13 × 10^9 operations per second. Although the exact details of the floating point performance of CPUs depend on the form of the computations and other circumstances (see Pratx and Xing for an overview of the relative performance differences between CPUs and GPUs), we see that a CPU requires at least 690 s to perform the 4D filtering with the most optimistic of the alternatives above. By using GPUs for these computations, with theoretical speeds of up to 3.5 × 10^12 operations per second, we instead have a lower bound on the required time of 19.5 s. Note that this is a lower bound that we will not reach in practice, since attaining this speed requires that no other bottlenecks, e.g., memory bandwidth, slow down the computations. These numbers cannot be improved without changing the computational task significantly.
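As a rough consistency check on these totals (an editorial sketch: it assumes each tap of a complex-valued kernel is counted as four real operations, i.e., two multiplications and two additions):

7^3 × 256^3 × 36 × 6 × 4 ≈ 5.0 × 10^12 (3D),    7^4 × 256^3 × 36 × 12 × 4 ≈ 7.0 × 10^13 (4D),

which reproduces the 3D and 4D figures quoted above.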

2) Handling Border Data: Due to the nature of the sampled data, there is an artificial cutoff of the data outside the pyramid that can be sampled by the ultrasound probe. This transition between inside and outside data requires special treatment in the quadrature filter convolution kernels to avoid creating an artificial edge.

3) Numerical Precision: The only step of the algorithm that has proven sensitive to numerical precision is the computation of eigenvalues and eigenvectors in the final kernel. However, since the amount of time spent in this kernel is relatively low (< 1% for the 4D case), 64 bit (double precision) floating point numbers were selected for these computations and only 32 bit (single precision) floating point numbers for the rest of the computations. This has been shown to cause no detectable effect on the filtered data, while giving a speedup of 5.8 times for the GPU based convolutions. Note that this speed advantage is higher than the theoretical advantage of 5 times given by the difference in 32 and 64 bit floating point computational power stated by the manufacturer.


Chapter 6 RESULTS
6.1 Timing Comparisons

Timing results for applying the filtering based on both a 3D orientation estimate and a 4D orientation estimate are provided, and how the timing scales on GPU and/or CPU cores is investigated. The measurements for CPU cores were performed using an identical OpenCL implementation, exploiting vectorization and with manual fine tuning of the work group sizes to optimize speed. As such, the relative numbers between GPU and CPU performance should reflect the true difference in computational speed between these types of devices. Figure 6.1 shows the actual time spent in the convolution operations when performing 3D and 4D filtering on one data set for different values of k_0, for kernel sizes ranging from 5–17 and 5–15, respectively. From this we see that the optimal value of k_0 varies significantly depending on the kernel size (favoring larger values of k_0) and the dimension of the problem. Since higher values of k_0 and the higher dimension require more intermediate results, the number of work items that can be executed in parallel decreases. From this we can conclude that the optimal value of k_0 differs depending on the filter parameters, but note that it is invariant to the data set used. Furthermore, Figure 6.2 shows the total time and the time spent in convolutions when no optimization was performed (k_0 = 1) and when the most optimal value of k_0 was selected. It is concluded that the total time spent in the algorithm is a constant time plus a cubic (respectively quartic) expression dependent only on the filter diameter. Furthermore, we see that the result of the optimization provides a consistent speedup for the different kernel sizes. Table 6.1 presents the fractions of the total computational time spent in the different computational kernels for a filter size of 7.


Figure 6.1: Timing comparison for different kernel sizes and values of the optimization parameter k_0 when performing 3D (left) and 4D (right) filtering of the aortic valve data set (116 × 200 × 117 × 36 samples) for different filter diameters.

Figure 6.2: Timing comparisons for 3D (left) and 4D (right) filtering of the aortic valve data set (116 × 200 × 117 × 36 samples), showing the time for convolutions (center, solid line) and the total time (upper, dashed line) with no optimization (k_0 = 1). The time for convolutions with the best value of k_0 is shown in the lowermost dashed line. The horizontal axis shows the total number of elements in the convolution kernels, proportional to the cube (respectively fourth power) of the kernel radius, i.e. (2r + 1)^d.


Table 6.1: Fraction of the total computational time spent in the different computational kernels (filter size 7).

                         3D/CPU    3D/GPU    4D/CPU    4D/GPU
Quadrature convolution    61.6%     23.0%     96.2%     69.9%
Lowpass filtering         12.3%     20.6%      1.0%     12.5%
Adaptive filtering        18.0%     34.5%      2.4%      7.7%
Other                      7.9%     21.9%      0.4%      9.9%

Figure 6.3: Computational time in seconds for filtering of the aortic valve data set consisting of 116 × 200 × 117 × 36 samples and kernel size 7.

As we can see, the CPU based computations are dominated by the cost of performing convolutions. Although the convolutions are among the more expensive operations for the GPU as well, we see that for the smaller 3D convolution problem they are of the same magnitude as the other kernels, while for the 4D case they scale up and take a larger share of the computational cost. For filtering with larger kernel radii, the total computational cost increases with the cube (respectively, fourth power) of the kernel radius, as supported by Figure 6.2. Figure 6.3 compares the time for filtering onboard different GPUs and CPUs. Due to the deterministic nature of the algorithm, timing measurements give a standard deviation less than the clock resolution of 1 ms, implying significant differences between the computational times for each of the presented categories in Figure 6.3. A quadrature filter diameter of 7 was chosen here as the smallest filter size that gives good visual results. The timing results thus demonstrate a large gain from the application of GPUs compared to CPUs for the filtering of the data. As we can see, both the 3D and 4D filtering can be performed onboard GPUs, for the given data set size, within a time span suitable for analysis immediately after and in conjunction with the physical examination.

6.2 Filtering Efficiency

To illustrate the effect of 3D and 4D adaptive filtering on the considered echocardiography data sets, a 2D slice at one time frame of each data set is presented before filtering and after 3D/4D filtering, respectively (Figure 6.4). The top row shows the original (a), 3D filtered (b) and 4D filtered (c) data taken from one frame of the aortic valve view data set. The second row shows the original (d), 3D filtered (e) and 4D filtered (f) data taken from one frame of the four chamber view data set. Visual assessment of the adaptive filtering indicated, according to a clinician with over 15 years of experience in echocardiography and cardiology (KE), an improvement of image quality in both the 3D and 4D filtered data sets. This improvement is illustrated in Figures 6.5 and 6.6, where intensity lines demonstrate the effect of the filtering. The 3D and 4D filtering decrease the noise in isotropic signal areas while preserving the high frequency content of edges and lines. When comparing the 4D and 3D filtered images, further improvements, according to the same clinician, are noticed in the 4D filtered images, where for instance the atrioventricular valves are more distinctly visualized, which makes the interpretation of the image even easier.


Figure 6.4: On the top row, cross sections of the aortic valve in basal short axis view using the original (a), 3D filtered (b), and 4D filtered (c) signal. On the bottom row, cross sections of the four chamber view data set using the original (d), 3D filtered (e) and 4D filtered (f) signal.


Figure 6.5: Intensity plot along a central horizontal line from the original, 3D, and 4D filtered aortic valve view.

Figure 6.6: Intensity plot along a central horizontal line from the original, 3D, and 4D filtered four chamber view.


Chapter 7 CONCLUSION
A general method for fast local orientation estimation and filtering of 4D echocardiographic data sets using GPU hardware was presented. This specific combination of 3D and 4D filtering shows promising results; further studies are required to determine its suitability in echocardiographic examinations. Such a clinical evaluation would preferably be performed as a double blind study involving several data sets and clinicians. This specific application of filtering of ultrasound data should be seen as an example of what can be implemented on large 4D data sets using the GPU based quadrature filtering approach. The method holds promise for any other technique requiring local orientation estimates on large data sets. Looking at the recent development of computational hardware, an exponential growth in the number of parallel processing elements matching Moore's law is seen on the GPU side. With the advent of accelerated processing units (APUs), where GPU and CPU processors are combined on the same chip, this exponential growth in the number of processing elements can be expected to continue in the near future. Given the performance of the algorithms on modern hardware, it is not unreasonable to assume that real-time interactive filtering can be performed during clinical examinations within a few years. It is known that multibeat acquisition might introduce stitching artifacts that degrade the quality of the image. Since such artifacts are caused by misregistration of volumetric data from different heart beats, a simple low-level filtering along the boundary of the artifact would not solve the underlying problem of the possible differences in the stitched images. Since the presented filtering algorithm contains no image registration step, or other mechanisms for dealing with the underlying problem of stitching artifacts, only data sets that contained no apparent stitching artifacts were used. This may pose a limitation when applying the method to certain patient cases. In conclusion, GPUs facilitate the use of demanding adaptive image filtering techniques that enhance 4D echocardiographic data sets. This may open up for improvements in diagnosis and pre- and even per-surgical examinations using 4D echocardiograms. This general methodology of implementing parallelism is also applicable to other medical multidimensional data sets, such as MRI and CT, that would benefit from fast adaptive image processing.


Bibliography
[1] J. S. Gottdiener, J. Bednarz, R. Devereux, J. Gardin, A. Klein, W. J. Manning, A. Morehead, D. Kitzman, J. Oh, M. Quinones, N. B. Schiller, J. H. Stein, and N. J. Weissman, "A report from the American Society of Echocardiography's Guidelines and Standards Committee and the Task Force on Echocardiography in Clinical Trials," Journal of the American Society of Echocardiography, vol. 17, no. 10, pp. 1021–1122, Oct. 2004.

[2] C. Otto, "Principles of echocardiographic image acquisition and Doppler analysis," in Textbook of Clinical Echocardiography. Philadelphia, PA: Saunders, 2004, pp. 1–29.

[3] C. Otto, "Other echocardiographic modalities," in Textbook of Clinical Echocardiography. Philadelphia, PA: Saunders, 2004, pp. 100–104.

[4] M. Broxvall, K. Emilsson, and P. Thunberg, "Fast GPU based adaptive filtering of 4D echocardiography," IEEE Transactions on Medical Imaging, vol. 31, no. 6, pp. 1165–1172, June 2012.

[5] R. H. Bamberger and M. J. T. Smith, "A filter bank for the directional decomposition of images: Theory and design," IEEE Transactions on Signal Processing, vol. 40, no. 4, pp. 882–893, Apr. 1992.

