
A GPU-based Real-Time CUDA Implementation for Obtaining Visual Saliency

Rahul Agrawal
Indian Institute of Technology Kharagpur
West Bengal, India - 721302
rahul@ece.iitkgp.ernet.in

Soumyajit Gupta
Indian Institute of Technology Kharagpur
West Bengal, India - 721302
smjtgupta@gmail.com

Jayanta Mukherjee
Indian Institute of Technology Kharagpur
West Bengal, India - 721302
jay@cse.iitkgp.ernet.in

Ritwik Kumar Layek
Indian Institute of Technology Kharagpur
West Bengal, India - 721302
ritwik@ece.iitkgp.ernet.in

ABSTRACT
We present a GPU-based implementation of the saliency model proposed by Achanta et al. [1] to perform real-time, detailed saliency map generation. We map all the components of the algorithm to GPU-based kernels and data structures. The parallel version of the algorithm accurately reproduces the desired results in very little time. We describe the streaming pipeline and address many issues in obtaining high throughput on multi-core GPUs. We highlight the parallel performance of the algorithm on three different generations of GPUs. On a high-end NVIDIA Tesla K20m, we observe up to a 600x performance improvement compared to a single-threaded CPU-based implementation, and about a 300x improvement over a CPU-based OpenCV implementation.

Keywords
Saliency, FSRD, GPU, OpenCV, CUDA

1. INTRODUCTION

Visual attention is one of the most important components of primate vision. Psychovisual experiments suggest that, in
the absence of any external guidance, attention is directed
to visually salient locations in an image. It is a low-cost
pre-processing step by which artificial and biological visual
systems select the most relevant information from a scene,
and relay it to higher-level cognitive areas that perform complex processes such as scene understanding, action selection,
and decision making.
Various computational models [1, 2, 3, 4, 5, 6, 7, 8] have been proposed so far. Most of them [2, 3, 4, 5, 6, 7] generate regions that have low resolution, poorly defined borders,


or are expensive to compute. Additionally, some methods [5, 7] produce higher saliency values at object edges instead of generating maps that uniformly cover the whole object, which results from failing to exploit all the spatial frequency content of the original image. On the other hand, [8] is relatively fast (as a CPU implementation) and gives good prediction accuracy, but it does not have much parallel structure that can be exploited using CUDA. We choose the spectral model proposed by Achanta et al. [1], namely Frequency-Tuned Salient Region Detection (FSRD), for parallel implementation. It is a bottom-up, frequency-tuned approach to computing saliency in images using the low-level features of color and luminance; it is easy to implement, fast, and provides full-resolution saliency maps exploiting most of the frequency content of the input image.
The algorithm was meant to be a pre-processing step for applications like image re-targeting, cropping, object detection, etc. Hence, it needs to be as fast as possible to be usable in real time. The CPU version was found to be very slow when implemented using OpenCV (http://opencv.org/) library functions (Table 4). We exploit the parallelism present in the algorithm for a fast CUDA implementation.
The paper is organized as follows. Section 2 gives an overview of the saliency model proposed in [1]. Sections 3 and 4 briefly introduce the GPU and CUDA. The GPU implementation details are presented in Section 5. Section 6 discusses profiling issues and their implications. In Section 7, our implementation is described and compared with other implementations. Conclusions are drawn in Section 8.

2. SALIENCY MODEL

2.1 Description

The method of calculating the saliency map S for an image I of width W and height H pixels can be formulated as:

S(x, y) = |Iμ − Ihc(x, y)|    (1)

where Iμ is the global mean pixel value of the image and Ihc is the Gaussian-blurred version of the original image, used to eliminate fine texture details as well as noise and coding artifacts. Only the magnitude of the difference is considered, which keeps the computation efficient. Also, all operations are done on the original image without any downsampling; hence a full-resolution saliency map is obtained in this process.

To accommodate features of color and luminance, Eq. 1 is rewritten as:

S(x, y) = ‖Iμ − Ihc(x, y)‖    (2)

where Iμ is the global mean image feature vector, Ihc(x, y) is the corresponding image pixel vector in the Gaussian-blurred version of the original image (using a 5 × 5 separable binomial kernel), and ‖.‖ is the L2 norm. Using the Lab color space [14], each pixel is represented by an [L a b]T vector, and the L2 norm corresponds to the Euclidean metric.
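As an illustration, evaluating Eq. 2 maps naturally onto one GPU thread per pixel. The following is a minimal CUDA sketch with hypothetical names; it assumes the Gaussian-blurred Lab image is packed as a float4 array, as described later in Section 5:

// Minimal sketch: one thread evaluates Eq. (2) for one pixel.
// labBlurred : Gaussian-blurred image in Lab space, packed as float4 (L, a, b, unused)
// meanLab    : global mean feature vector (L, a, b)
// saliency   : output saliency map, one float per pixel
__global__ void saliencyKernel(const float4* labBlurred, float4 meanLab,
                               float* saliency, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float4 p = labBlurred[y * width + x];
    float dL = meanLab.x - p.x;
    float da = meanLab.y - p.y;
    float db = meanLab.z - p.z;
    // L2 norm of the difference between the mean vector and the pixel vector
    saliency[y * width + x] = sqrtf(dL * dL + da * da + db * db);
}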

2.2 Limitation

Although the algorithm produces a detailed saliency map at full resolution and operates at O(N) complexity, it is still limited by its execution time (Table 4). Thus it fails to satisfy the requirements of the applications for which it was originally proposed.

3. GRAPHICS PROCESSING UNIT

In recent years, due to limitations in the heat dissipation and transistor size of CPUs, focus has shifted to GPUs. A GPU is a specialized electronic circuit designed to manipulate graphics. Compared to a CPU, the GPU is designed such that more transistors are devoted to data processing rather than to data caching and flow control, resulting in higher overall throughput with efficient power dissipation.

The GPU is efficient at launching thousands of threads in parallel, where each thread executes the same program (kernel) independently on each data element, hence processing many elements at the same time. This lowers the requirement for sophisticated flow control. However, due to the absence of large data caches in GPUs, the memory access latency is quite expensive. To hide this latency, data-parallel problems with high arithmetic intensity (the ratio of arithmetic operations to memory operations) are better suited for GPU implementation.

4. CUDA

Compute Unified Device Architecture (CUDA) is a scalable parallel programming model and a software environment for parallel computing. The CUDA architecture includes a unified shader pipeline, allowing every arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations. It supports a heterogeneous serial-parallel programming model. A general program flow followed in CUDA is shown in Fig. 1.

Figure 1: A typical GPU program

The basic unit of computation is a thread. Threads are organized as a grid of blocks, where each block can contain up to 1024 threads. Each thread has its own local memory and registers. For intra-block communication, threads use shared memory, while for inter-block communication global, texture, and constant memory are used. The CUDA device memory hierarchy is composed of five levels; the memory hierarchy and thread organization are shown in Fig. 2. Each type of memory has a different throughput depending on its access pattern. The memory types used are briefly discussed below:
Figure 2: Memory hierarchy and thread organization in CUDA enabled GPUs


Global Memory: Threads are executed in batches of warps (groups of 32 threads, which is the minimum unit of data processed in SIMD fashion by a CUDA multiprocessor). When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more memory transactions, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In other words, the memory requests of the threads within a warp should be served with as few memory word fetches as possible, by ensuring that the threads within a warp access contiguous memory locations with proper data alignment, thereby reducing global memory cache misses.
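As an illustration (not taken from the paper; kernel names are hypothetical), the difference between a well-coalescing and a poorly coalescing access pattern looks as follows:

// With consecutive thread indices mapping to consecutive addresses, the 32 accesses
// of a warp coalesce into a small number of memory transactions.
__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // thread i touches element i
    if (i < n) out[i] = in[i];                        // contiguous: coalesces well
}

// The same amount of work with a large stride scatters each warp's accesses across
// many memory segments, so each access may turn into a separate transaction.
__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = ((long long)i * stride) % n;        // non-contiguous: poor coalescing
    if (i < n) out[i] = in[j];
}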

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via
a variable or a pointer) to data residing in global memory
compiles to a single global memory instruction, if and only
if the size of the data type is 1, 2, 4, 8, or 16 bytes, and
the data is naturally aligned (i.e., its address is a multiple
of that size). If this size and alignment requirement is not
fulfilled, the access compiles to multiple instructions with
interleaved access patterns that prevent these instructions
from fully coalescing.
Constant Memory: It resides in device memory and is
cached in the constant cache. A constant memory request
for a warp is first split into two requests, one for each half-warp, that are issued independently. A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput
by a factor equal to the number of separate requests. The
resulting requests are then serviced at the throughput of the
constant cache in case of a cache hit, or at the throughput
of device memory otherwise.
Shared Memory: Because it is on-chip, shared memory has much higher bandwidth and much lower latency
than local or global memory. To achieve high bandwidth,
shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.
Any memory read or write request made of n addresses that
fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times
as high as the bandwidth of a single module.
However, if two addresses of a memory request fall in the
same memory bank, there is a bank conflict and the access
has to be serialized. The hardware splits a memory request
with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal
to the number of separate memory requests. If the number
of separate memory requests is n, the initial memory request
is said to cause n-way bank conflicts.
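One common way to avoid such conflicts, shown here purely as an illustration (a matrix transpose is not part of the FSRD pipeline), is to pad shared-memory tiles by one element per row so that column-wise accesses fall into distinct banks:

// Padding the tile by one column keeps the transposed (column-wise) reads in
// distinct banks; without the +1, a warp would hit the same bank 32 times.
#define TILE 32

__global__ void transposeTile(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];            // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;         // transposed coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];
}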
Local memory: The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and
are subject to the same requirements for memory coalescing. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs.
Accesses are therefore fully coalesced as long as all threads
in a warp access the same relative address.

5. PROPOSED PARALLEL MODEL

We exploited the parallelism present in the semi-parallel structure of the algorithm to provide real-time performance even for very large images. Each of the serially dependent steps, namely filtering, RGB to Lab conversion, mean vector computation, and Euclidean metric calculation, was implemented as a separate kernel, with each kernel launch acting as a global synchronization point between dependent steps. The proposed parallel architecture (each box represents a kernel call) is shown in Fig. 3.

Figure 3: FSRD in GPU (pipeline of kernel calls: allocation of global memory for the input image, Lab image, output image and global-synchronization buffers; copy of the input image from host to device memory; uchar4 (BGRA) to float4 (RGBA) conversion; convolution with the kernel stored in constant memory; RGBA to LabA kernel; three shared-memory reduce kernels, each acting as a global synchronization point, to find the image average; mean-distance kernel computing the L2 norm from the average; copy of the saliency map back to host memory).

5.1 Initialization

First, the GPU is initialized. Since memory allocation on the GPU takes a long time, memory is allocated up front for the different images, namely the input image, the image in the Lab space, and the saliency map, along with buffers used for global synchronization. The input image is also resized to the next higher multiple of 32 in both directions to exploit warp coalescing.
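A minimal sketch of this initialization, with hypothetical buffer names, could look as follows (the exact set of buffers in the authors' implementation may differ):

// One-time allocation of device buffers and per-frame upload of the input image.
#include <cuda_runtime.h>

struct DeviceBuffers {
    unsigned char* bgr;       // raw 8-bit, 3-channel input image
    float4*        rgba;      // input converted to float4 RGBA
    float4*        lab;       // image in Lab space
    float*         saliency;  // output saliency map
    float4*        partial;   // per-block partial sums used by the mean reduction
};

void initBuffers(DeviceBuffers& b, int width, int height, int numReduceBlocks)
{
    size_t n = (size_t)width * height;
    cudaMalloc(&b.bgr,      n * 3 * sizeof(unsigned char));
    cudaMalloc(&b.rgba,     n * sizeof(float4));
    cudaMalloc(&b.lab,      n * sizeof(float4));
    cudaMalloc(&b.saliency, n * sizeof(float));
    cudaMalloc(&b.partial,  numReduceBlocks * sizeof(float4));
}

void uploadFrame(DeviceBuffers& b, const unsigned char* hostBgr, int width, int height)
{
    // Host-to-device copy of the raw frame before the kernels are launched.
    cudaMemcpy(b.bgr, hostBgr, (size_t)width * height * 3, cudaMemcpyHostToDevice);
}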

5.2 Datatype Conversion

The input image has three 8-bit channels, namely blue, green, and red. The image data is copied from the CPU into the global memory (DRAM) of the GPU. The data type is changed from char to float to increase the precision of computation. The size of the data type float3 (3-channel float) is 4 × 3 = 12 bytes, which does not satisfy the coalesced alignment requirement, resulting in higher access time. To counter this issue and increase throughput, a strategy similar to the one reported in [13] is followed: an extra alpha channel padded with zeros is added (Fig. 4), which makes the data type float4. Hence the given image is converted from BGR format to RGBA format.

Figure 4: Image Data Padding (adapted from [13])
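The conversion itself is a simple per-pixel kernel; a minimal sketch with hypothetical names, assuming a tightly packed 3-byte BGR input, is:

// Convert an 8-bit, 3-channel BGR image into a float4 RGBA image, padding the alpha
// channel with zero so that each pixel occupies 16 aligned bytes and global-memory
// accesses can be coalesced.
__global__ void bgrToRgbaFloat4(const unsigned char* bgr, float4* rgba,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    const unsigned char* p = bgr + 3 * idx;           // B, G, R byte triplet
    rgba[idx] = make_float4((float)p[2],              // R
                            (float)p[1],              // G
                            (float)p[0],              // B
                            0.0f);                    // padded alpha channel
}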

5.3 Filter

A GPU is efficient at launching a large number of threads that work in parallel. The mapping between threads and memory is called the communication pattern. A very basic parallel filter divides the image into blocks of threads (the maximum number of threads per block is 1024), where each thread corresponds to one pixel of the filtered output. Each thread gathers the corresponding pixel value of the input image and the neighbourhood values needed for the mask, applies the mask, and stores the result to the output pixel. This communication pattern is called stencil (Fig. 5).

Figure 5: Stencil Communication Pattern
To implement the stencil operation more efficiently, a separable 5 × 5 filter with elements [1 4 6 4 1] is used. A full two-dimensional convolution requires 5 × 5 = 25 multiplications for each output pixel, whereas a separable filter split into two consecutive one-dimensional convolutions requires only 5 + 5 = 10 multiplications per output pixel. Following the technique reported in [10], the filter is computed separately in horizontal (row) and vertical (column) passes, with a write to global memory between the passes, so each pixel is loaded five times at most. The pixels at the edge of a thread block depend on pixels outside the block, shown in yellow and called the apron region (Fig. 6). Thus each thread block must load into shared memory both the pixels to be filtered and the apron pixels. With a separable filter it is no longer necessary to load the top and bottom apron regions for the horizontal pass (and, correspondingly, the left and right apron regions for the vertical pass), which allows more pixels to be loaded for processing in each thread block.
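A simplified sketch of the horizontal pass is given below, following the shared-memory-with-apron idea of [10]; the block size and names are assumptions, and the implementation in [10] is more elaborate (it processes several pixels per thread and has an analogous vertical pass):

// Horizontal pass of the separable [1 4 6 4 1]/16 filter. Each block loads its pixels
// plus a 2-pixel apron on the left and right into shared memory, then convolves along
// the row. A second, analogous kernel performs the vertical pass.
#define RADIUS  2
#define BLOCK_W 256

__constant__ float d_kernel[2 * RADIUS + 1];          // {1,4,6,4,1}/16 in constant memory

__global__ void convolveRow(const float4* in, float4* out, int width, int height)
{
    __shared__ float4 smem[BLOCK_W + 2 * RADIUS];

    int blockStart = blockIdx.x * BLOCK_W;
    int tid = threadIdx.x;
    int x   = blockStart + tid;
    int y   = blockIdx.y;                             // one image row per block
    if (y >= height) return;

    // Load the block's own pixel (clamped at the right image border).
    smem[tid + RADIUS] = in[y * width + min(x, width - 1)];

    // The first RADIUS threads also load the left and right apron pixels.
    if (tid < RADIUS) {
        smem[tid]                    = in[y * width + max(blockStart - RADIUS + tid, 0)];
        smem[BLOCK_W + RADIUS + tid] = in[y * width + min(blockStart + BLOCK_W + tid, width - 1)];
    }
    __syncthreads();

    if (x >= width) return;
    float4 sum = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        float  w = d_kernel[k + RADIUS];
        float4 v = smem[tid + RADIUS + k];
        sum.x += w * v.x;  sum.y += w * v.y;  sum.z += w * v.z;
    }
    out[y * width + x] = sum;
}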

Figure 6: Image in device memory and apron region (adapted from [10])

5.4 RGBA to LabA

Representing the image in the Lab color space is implemented in parallel by launching one thread per pixel, each computing [L a b]T for the corresponding [R G B]T. This communication pattern is called map (Fig. 7), following a one-to-one correspondence.

Figure 7: Map Communication Pattern
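A compact sketch of the RGBA-to-LabA map kernel, assuming linear RGB values already scaled to [0, 1] and the D65 reference white of the standard CIE formulation [14], is shown below (gamma handling is omitted for brevity, so the constants illustrate the idea rather than the authors' exact conversion):

// One thread per pixel (map pattern): float4 RGBA in, float4 LabA out.
__device__ float labF(float t)
{
    // CIE helper function: cube root above the threshold, linear segment below it.
    return (t > 0.008856f) ? cbrtf(t) : (7.787f * t + 16.0f / 116.0f);
}

__global__ void rgbaToLaba(const float4* rgba, float4* laba, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    float4 p = rgba[i];
    // Linear RGB to XYZ, normalized by the D65 white point.
    float X = (0.4124f * p.x + 0.3576f * p.y + 0.1805f * p.z) / 0.9505f;
    float Y = (0.2126f * p.x + 0.7152f * p.y + 0.0722f * p.z);
    float Z = (0.0193f * p.x + 0.1192f * p.y + 0.9505f * p.z) / 1.0890f;

    float fx = labF(X), fy = labF(Y), fz = labF(Z);
    laba[i] = make_float4(116.0f * fy - 16.0f,        // L
                          500.0f * (fx - fy),         // a
                          200.0f * (fy - fz),         // b
                          0.0f);                      // unused alpha slot
}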

5.5 Mean

Operations such as the filter and the RGB to Lab conversion can be decomposed into small per-pixel tasks that are independent of each other. Obtaining the mean, however, cannot be mapped onto such independent parallel tasks, so the reduce algorithm is used to obtain the mean of the image. Computing the sum of n elements (n = 8 in Fig. 8(a)) serially takes O(n) steps (7 steps); in parallel, we can pair the elements in groups of two and sum each pair to get intermediate results, which are again paired and added. This process continues until a single result remains, and the complexity of this parallel algorithm is O(log2 n) (3 steps).

Figure 8: Reduce tree kernel: (a) serial version, (b) parallel version.

The global memory (where the image is stored) is divided into several blocks, and the data of each block is reduced as shown in Fig. 9. Since the access time of shared memory is lower, one would first copy the image block data into the faster shared memory, but that would result in reduced thread occupancy: from the second step onwards, half of the threads launched per block are idle, which decreases occupancy and thus overall utilization, and this effect repeats at each level of the tree with the number of active threads halving. To improve thread occupancy while still reducing access latency, the first step of the reduction is done on global memory, and the intermediate results are buffered in shared memory for their subsequent use (Fig. 9).

Figure 9: Sequential Addressing Reduce (adapted from [11])

The straightforward approach would be to apply the reduce algorithm to the whole image stored in global memory. However, due to the limit on the maximum number of threads that can be launched per block, the limited amount of shared memory, and the absence of any global synchronization between threads of different blocks in CUDA, this approach does not work for large arrays (such as images in our case). So multiple reduce kernels (three in our case) are used, each serving as a global synchronization point, as shown in Fig. 10. After each kernel launch, the partially reduced result of every block of the grid is obtained as an array of partial results. These partial results are again divided into a grid of blocks, and each block is reduced to give, once again, an array of partial results. These are reduced one final time to give the final value in the last step. This four-step processing, similar to the approach reported in [11], is shown in Fig. 9.
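A minimal sketch of one such block-level reduction pass, in the spirit of [11] and with hypothetical names, is given below; it reduces a single float channel (the actual implementation accumulates the three Lab channels), and the host-side chaining of the passes is omitted:

// One block-level reduction pass with sequential addressing. Each block writes one
// partial sum; the kernel is relaunched on the array of partial sums until a single
// value remains.
__global__ void reduceSum(const float* in, float* partial, int n)
{
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x * 2 + tid;

    // The first addition happens during the load from global memory: each thread sums
    // two elements, which keeps more threads busy than starting the tree right away.
    float v = 0.0f;
    if (i < n)              v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing (no bank conflicts).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = sdata[0];     // one partial result per block
}

Launched with a dynamic shared-memory size of blockDim.x * sizeof(float), first on the image and then repeatedly on the arrays of partial results, this reproduces the global-synchronization chain of Fig. 10.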

Figure 10: Global Synchronization

5.6 Euclidean Metric Calculation

Independent threads are launched, each computing the L2 norm of the difference between its pixel value and the mean value of the corresponding channels, and mapping the result to the output pixel.

5.7 Intrinsic Functions

To maximize instruction throughput, the use of arithmetic instructions with low throughput is minimized. This includes trading precision for speed when it does not affect the end result: instead of the regular functions sqrt() and cbrt() from math.h, the CUDA single-precision functions sqrtf() and cbrtf() are used, which work in single rather than double floating-point precision.
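As a small illustration (the exact call sites in the authors' code are not shown in the paper):

// Single-precision math functions in place of their double-precision counterparts;
// the literals also carry the 'f' suffix so the computation stays in single precision.
__device__ float l2norm3(float dx, float dy, float dz)
{
    return sqrtf(dx * dx + dy * dy + dz * dz);        // instead of sqrt()
}

__device__ float cubeRoot(float t)
{
    return cbrtf(t);                                  // instead of cbrt()
}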

6. PROFILING

NVIDIA Nsight Visual Studio Edition ver. 3.2.2 is used to gather profiling information on the algorithm. We address two important aspects of the experimental results provided by the profiler, namely Instruction Statistics and Issue Efficiency.

6.1 Instruction Statistics

This experiment provides an understanding of the overall utilization of the target device when executing the kernel. It addresses questions such as: whether the kernel grid is able to keep all multiprocessors busy for the complete execution duration; whether a well-balanced distribution of workloads is achieved; and whether the achieved instruction throughput comes close to the hardware's peak performance. The following metrics represent these factors.

6.1.1 Instructions Per Clock (IPC)

Issued IPC is the average number of issued instructions per cycle, accounting for every iteration of instruction replays. Executed IPC is the average number of executed instructions per cycle. Optimally, Issued IPC should be as close as possible to Executed IPC.

6.1.2 Instruction Serialization (IS)

It is defined as the ratio of required instruction replays to the total number of issued instructions; lower values are better:

IS = (Instructions Issued − Instructions Executed) / Instructions Issued    (3)

6.1.3 Streaming Multiprocessor (SM) Activity

Shows the percentage of time each multiprocessor was active during the duration of the kernel launch. A multiprocessor is considered active if at least one warp is currently assigned for execution.

6.1.4 Instructions Per Warp (IPW)

Shows the average number of executed instructions per warp for each multiprocessor.

6.1.5 Warps Launched

Shows the total number of warps launched per multiprocessor for the executed kernel grid.

6.2 Issue Efficiency


The Issue Efficiency experiment provides information about the device's ability to issue instructions. It indicates whether the device was able to issue instructions every cycle.

6.2.1 Warps Per SM

Each warp scheduler manages a fixed, hardware-given maximum number of warps. This defines the device limit of warps per SM, the upper bound on how many warps can be resident at once on each SM. An Active warp is active from the time it is scheduled on a multiprocessor until it completes its last instruction; each warp scheduler maintains its own list of assigned active warps. An Eligible warp is an active warp that is able to issue its next instruction; each warp scheduler selects the next warp to issue an instruction from the pool of eligible warps. Warps that are not eligible report an Issue Stall Reason. Theoretical Occupancy acts as an upper limit on active warps, and consequently also on eligible warps, per SM.

Function      Issued IPC   Executed IPC   IS (%)   SM Activity (%)   IPW       Warps Launched   Blocks Launched
Filter        1.81         1.77           2.68     99.98             2459.25   4096             1024
RGBA2LabA     1.96         1.60           18.52    99.98             409.00    32768            8192
Reduce1       1.59         1.42           10.20    99.98             669.75    16384            4096
Reduce2       1.24         1.11           10.61    99.04             669.75    64               16
Reduce3       0.10         0.09           9.51     88.89             875.00    1                1
Euclid        1.76         1.57           10.91    99.99             427.00    32768            8192

Table 1: Profiling data for Instruction Statistics on NVIDIA GT 610M.

              Warps per SM                    Warp Issue Efficiency (%)      Issue Stall Reasons (%)
Function      Active   Eligible   Occupancy   No Eligible   >=1 Eligible     IF      ED      DR     Synch.
Filter        27.13    3.20       28          2.16          97.84            69.84   22.71   -      -
RGBA2LabA     30.33    5.62       32          1.14          98.96            49.84   33.28   9.71   -
Reduce1       24.69    2.81       28          7.73          92.27            16.59   52.43   -      23.74
Reduce2       19.68    2.14       28          24.52         75.48            18.32   15.85   -      23.18
Reduce3       1.00     0.09       8           90.94         9.06             79.88   15.47   -      -
Euclid        30.60    2.82       32          4.52          95.48            -       -       -      -

Table 2: Profiling data for Issue Efficiency on NVIDIA GT 610M. '-' indicates Not Applicable.


6.2.2 Warp Issue Efficiency

It is the distribution of the availability of eligible warps per cycle across the GPU. The values are reported as a sum across all warp schedulers for the duration of the kernel execution. The No Eligible metric is the number of cycles for which a warp scheduler had no eligible warp to select from and therefore did not issue an instruction. The One or More Eligible metric is the number of cycles for which a warp scheduler had at least one eligible warp to select from; this is equal to the total number of cycles for which an instruction was issued, summed across all warp schedulers.

6.2.3 Issue Stall Reasons

The issue stall reasons capture why an active warp is not eligible. Per multiprocessor and per cycle, the sum of the stall reasons is therefore incremented by a value between zero (if all warps are eligible) and the number of active warps (if all warps are stalled). A warp incurs an Instruction Fetch (IF) stall if the fetch unit has not returned the next instruction for the warp. A warp incurs an Execution Dependency (ED) stall if an input dependency is not yet available; this includes waiting for results from any global/local/shared memory access or any low-latency operation. A warp incurs a Data Request (DR) stall if a data request cannot be made because the required resources are not available, are fully utilized, or too many operations of that type are already outstanding. A warp incurs a Synchronization stall if it is blocked at a __syncthreads() call or a memory barrier.

6.3 Discussion

From Table 1, it is found that the device utilization of our algorithm is good. As expected, Issued IPC is very close to Executed IPC (within about 0.3). Low Instruction Serialization (around 10%) and high Streaming Multiprocessor Activity (at least 88%) for all kernels further confirm the high degree of parallelism present. This means that all kernels have good access patterns and few bank conflicts.

From Table 2, it is observed that the Active warp count is very close to the theoretical upper limit of warp parallelism on an SM (Occupancy). A good fraction of the Active warps per SM are Eligible warps for all kernels except Reduce3, owing to the very small amount of data to be reduced in that kernel. However, this does not have a significant effect on performance, since its execution time is only about 0.05% of the total execution time.

The Reduce kernels suffer from synchronization stalls, as the bottom layer of the reduction has to wait for the top layer to finish before passing on the data; this results from the use of __syncthreads() between the reduction steps. Execution dependency stalls are also low for all kernels except Reduce1, since it operates on the entire image in a semi-parallel manner.

7. RESULTS

In this section, our implementation is described and the performance of our parallel version of the algorithm is highlighted.

7.1 Implementation

The proposed parallel version of the algorithm is implemented on three different generations of GPUs, namely the NVIDIA GeForce GTS 450, NVIDIA GeForce GT 610M, and NVIDIA Tesla K20m (Table 3). For all these NVIDIA GPUs, CUDA toolkit 5.5, OpenCV 2.4.6, and Visual Studio 2010 are used as the APIs and development environment. All tests were carried out on a standard PC (Windows 7 Ultimate 64-bit, Intel Core i3 CPU @ 2.3 GHz, 4 GB DDR3 RAM).
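The per-frame execution times reported in the next subsection are commonly measured with CUDA events; the following harness is an illustration of that technique, not the authors' exact measurement code:

// Times an arbitrary sequence of kernel launches (e.g. the whole FSRD pipeline).
#include <cuda_runtime.h>

float timeKernelMs(void (*launchPipeline)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launchPipeline();                                  // enqueue all kernels
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                        // wait for the GPU to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);            // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}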

7.2 Performance

The original algorithm demonstrated very high execution times. To speed it up on the CPU, an OpenCV version of the algorithm was implemented (Table 4).
Figure 11: Top Row: Input Images, Bottom Row: Corresponding Saliency Maps.

GPU                            GTS 450    GT 610M    Tesla K20m
Number of Cores                192        48         2496
Number of SMs                  4          1          13
Shared Memory per Block (bytes)  49152    49152      49152
Memory Capacity (GB)           1          2          5
Memory Bus Width (bits)        128        64         320
Memory Clock Rate (MHz)        1804       900        2600
GPU Clock Rate (MHz)           1566       1250       706

Table 3: GPU specifications.

The speedups obtained for the GPU implementations are shown in Table 6, with the corresponding execution times in Table 5 and the corresponding frames per second (FPS) for different image sizes in Fig. 12. Saliency maps obtained for different inputs using the GPU implementation are shown in Fig. 11.

               FSRD - CPU            FSRD - OpenCV
Resolution     Time (s)    fps       Time (s)    fps
256 × 256      0.37        2.70      0.09        11.11
512 × 512      0.68        1.47      0.27        3.70
640 × 480      0.71        1.41      0.33        3.03
1024 × 768     1.19        0.84      0.75        1.33
1024 × 1024    1.42        0.70      0.87        1.15
2048 × 2048    3.41        0.29      2.37        0.42

Table 4: Performance evaluation data showing execution time and frame rate for the CPU versions of FSRD.

               GTS 450      GT 610M      Tesla K20m
Resolution     Time (ms)    Time (ms)    Time (ms)
256 × 256      0.83         3.44         0.35
512 × 512      2.27         10.83        0.74
640 × 480      2.62         13.26        0.85
1024 × 768     6.41         30.87        1.81
1024 × 1024    8.35         40.72        2.41
2048 × 2048    32.54        160.88       8.89

Table 5: Performance evaluation data showing execution time for the GPU versions of FSRD.

               NVIDIA       NVIDIA       NVIDIA
Resolution     GTS 450      GT 610M      Tesla K20m
256 × 256      108.44       26.16        257.17
512 × 512      119.06       24.96        365.23
640 × 480      125.97       24.88        388.27
1024 × 768     117.30       24.35        415.41
1024 × 1024    104.14       21.36        360.82
2048 × 2048    73.17        14.79        267.83

Table 6: Speedup factors of the GPU implementations against the OpenCV implementation.


8. CONCLUSION

In this paper, a GPU implementation of an existing method, namely the FSRD algorithm for determining visually salient locations in an image, is presented. FSRD is a bottom-up approach using low-level attributes (luminance and color). The important contribution of this paper is in adapting this algorithm to a parallel architecture to obtain a significant speedup. The algorithm has been implemented in CUDA C, and it is found to satisfy the requirements of real-time processing of high-resolution images and videos.

Figure 12: Plot of FPS vs. image size on various platforms

9. ACKNOWLEDGMENTS

The work was partially supported by sponsorship from ISRO, IIT Kharagpur Cell, approved at the JPC dated 07.03.2013.

10. REFERENCES
[1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1597-1604, 2009.
[2] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[3] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Advances in Neural Information Processing Systems, vol. 19, MIT Press, Cambridge, MA, pp. 545-552, 2006.
[4] S. Frintrop, "VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search," Lecture Notes in Computer Science (LNCS), vol. 3899, Springer, pp. 7-31, 2006.
[5] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2007.
[6] V. Navalpakkam and L. Itti, "Modeling the influence of task on attention," Vision Research, vol. 45, no. 2, pp. 205-231, 2005.
[7] J. Li, D. Levine, X. An, and X. Xu, "Visual saliency based on scale-space analysis in the frequency domain," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 4, pp. 996-1010, April 2013.
[8] S. Gupta, R. Agrawal, R. Layek, and J. Mukhopadhyay, "Psychovisual saliency in color images," Proc. IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1-4, 2013.
[9] NVIDIA Corporation, "NVIDIA CUDA C Programming Guide," June 2011.
[10] V. Podlozhnyuk, "Image Convolution with CUDA," NVIDIA Corporation white paper, June 2008.
[11] S. Sengupta, M. Harris, and M. Garland, "Efficient Parallel Scan Algorithms for GPUs," NVIDIA Technical Report NVR-2008-003, December 2008.
[12] G. Bradski, "OpenCV computer vision library," Dr. Dobb's Journal of Software Tools, 2000.
[13] T. Xu, T. Pototschnig, K. Kuhnlenz, and M. Buss, "A high-speed multi-GPU implementation of bottom-up attention using CUDA," Proc. of IEEE Conf. on Robotics and Automation, pp. 41-47, 2009.
[14] G. Hoffmann, "CIE Color Space," Tech. Rep., FHO Emden, 2008.
