
FACULTY OF AUTOMATION AND COMPUTER SCIENCE

COMPUTER SCIENCE DEPARTMENT

IMAGE PROCESSING ON SYSTEM ON CHIP FPGA DEVICES USING LABVIEW

LICENSE THESIS

Graduate: Gergő PAPP-SZENTANNAI

Supervisor: Sl. Dr. Ing. Mihai NEGRU

2018

DEAN: Prof. dr. eng. Liviu MICLEA

HEAD OF DEPARTMENT: Prof. dr. eng. Rodica POTOLEA

Graduate: Gergő PAPP-SZENTANNAI

IMAGE PROCESSING ON SYSTEM ON CHIP FPGA DEVICES USING LABVIEW

1. Project proposal: We propose to implement a real-time image processing system on a System on Chip device in the LabVIEW graphical programming language.

2. Project contents: Presentation pages, Introduction – Project Context, Project Objectives, Bibliographic Research, Analysis and Theoretical Foundation, Detailed Design and Implementation, Testing and Validation, User's manual, Conclusions, Bibliography, Appendices

3. Place of documentation: Technical University of Cluj-Napoca, Computer Science Department

4. Consultants: Vlad MICLEA

5. Date of issue of the proposal: March 19, 2018

6. Date of delivery: July 9, 2018

Graduate: ________________________________

Supervisor: ________________________________

Declaration of Sole Responsibility Regarding the Authenticity of the License Thesis

I, the undersigned, Papp-Szentannai Gergő, identified by identity card [REDACTED], CNP [REDACTED], author of the thesis „PROCESARE DE IMAGINI PE DISPOZITIVE FPGA SYSTEM ON CHIP FOLOSIND LABVIEW” (translation: “IMAGE PROCESSING ON SYSTEM ON CHIP FPGA DEVICES USING LABVIEW”), written in order to sit the final examination of the license (bachelor) studies at the Faculty of Automation and Computer Science, Computer Science in English specialization, of the Technical University of Cluj-Napoca, in the summer session of the 2017-2018 academic year, declare on my own responsibility that this work is the result of my own intellectual activity, based on my own research and on information obtained from sources that have been cited in the text of the thesis and in the bibliography.
I declare that this work does not contain plagiarized portions and that the bibliographic sources were used in accordance with Romanian legislation and international copyright conventions.
I also declare that this work has not been presented before any other license examination committee.
Should these declarations later be found to be false, I will bear the administrative sanctions, namely the annulment of the license examination.
Table of Contents
Chapter 1. Introduction – Project Context ............................................ 5
1.1. Computer Vision .............................................................................................. 5
1.2. Image Processing ............................................................................................. 5
1.2.1. General image processing ......................................................................... 5
1.2.2. Image processing as a subcategory of digital signal processing .............. 7
1.3. Specialized Hardware for Image Processing ................................................... 7
1.3.1. The need for specialized hardware ........................................................... 7
1.3.2. Possible hardware for image processing applications .............................. 8
Chapter 2. Project Objectives ................................................................. 9
2.1. Problem specification ...................................................................................... 9
2.1.1. Real-time image processing...................................................................... 9
2.1.2. Restrictions in real-time image processing ............................................... 9
2.1.3. Problem statement and proposal ............................................................. 11
2.2. Objectives and Requirements of the Project .................................................. 11
2.2.1. Functional requirements ......................................................................... 11
2.2.2. Non-functional requirements .................................................................. 13
2.2.3. Other objectives ...................................................................................... 14
Chapter 3. Bibliographic Research ...................................................... 15
3.1. General Image Processing ............................................................................. 15
3.1.1. Definition of digital image processing ................................................... 15
3.1.2. Origins of digital image processing ........................................................ 16
3.1.3. Examples of image processing ............................................................... 17
3.1.4. The fundamental steps and components of an image processing system ................ 18
3.1.5. Image sensing and acquisition ................................................................ 19
3.1.6. Mathematical tools used in digital image processing ............................. 19
3.2. Properties of image processing algorithms and examples ............................. 20
3.2.1. Some properties of image processing algorithms ................................... 20
3.2.2. Example of an image processing algorithm - Linear Spatial Filter (Convolution Filter) ................ 21
3.3. Real-time image processing ........................................................................... 23
3.3.1. High-level dataflow programming for real-time image processing on smart cameras ................ 23
3.3.2. Fast prototyping of a SoC-based smart-camera: a real-time fall detection case study ................ 24

3.3.3. An image processing system for driver assistance ................................. 25
3.4. Existing Image Processing Implementations in Hardware and their Comparison ................ 25
3.4.1. FPGA-based implementations of image processing algorithms and systems ................ 25
3.4.2. Performance comparison of FPGA, GPU and CPU in image processing ................ 26
3.5. SoC Image Processing ................................................................................... 27
3.5.1. Image Processing Towards a System on Chip........................................ 27
3.5.2. A Survey of Systems-on-Chip Solutions for Smart Cameras................. 28
3.5.3. FPGA implementation of a license plate recognition SoC using automatically generated streaming accelerators ................ 28
3.6. Other usages of FPGA and SoC devices ....................................................... 28
3.6.1. SoC-FPGA implementation of the sparse fast Fourier transform algorithm ................ 29
3.6.2. A fully-digital real-time SoC FPGA based phase noise analyzer with cross-correlation ................ 29
3.7. Other bibliographical research ....................................................................... 29
Chapter 4. Analysis and Theoretical Foundation ............................... 31
4.1. Overall Architecture ...................................................................................... 31
4.1.1. High-level architecture ........................................................................... 31
4.1.2. System on Chip overview ....................................................................... 32
4.1.3. Offloading work to the FPGA ................................................................ 33
4.2. Image Acquisition .......................................................................................... 34
4.2.1. Acquisition device .................................................................................. 35
4.2.2. Image capturing ...................................................................................... 36
4.3. Image and Data Transfer ............................................................................... 37
4.3.1. Digital image representation ................................................................... 37
4.3.2. Data decomposition and streaming......................................................... 38
4.4. Processing ...................................................................................................... 39
4.5. Display ........................................................................................................... 40
4.6. Possible hardware configuration.................................................................... 40
4.6.1. SoC vendors ............................................................................................ 40
4.6.2. SoCs in academical embedded devices .................................................. 41
Chapter 5. Detailed Design and Implementation ................................ 43
5.1. Ecosystem and Development Environment ................................................... 43
5.1.1. Development environment – LabVIEW ................................................. 43
5.1.2. NI myRIO hardware and software specifications................................... 45

5.2. System Architecture....................................................................................... 46
5.2.1. The system as a LabVIEW project ......................................................... 47
5.2.2. „Main” VIs and top-level view ............................................................... 49
5.3. Image Acquisition .......................................................................................... 52
5.3.1. Camera session ....................................................................................... 52
5.3.2. Image initialization ................................................................................. 53
5.3.3. Image capturing ...................................................................................... 53
5.4. Image Transfer using DMA FIFO Channels ................................................. 54
5.4.1. Ways of transferring data between the FPGA and the host device ........ 54
5.4.2. DMA FIFO implementation ................................................................... 55
5.5. Image Processing on the FPGA ..................................................................... 57
5.5.1. General structure..................................................................................... 57
5.5.2. Storing the image in a local memory ...................................................... 58
5.5.3. Applying a convolution kernel ............................................................... 59
5.5.4. Synchronization ...................................................................................... 60
5.5.5. Improving the FPGA code and preliminary results ................................ 61
5.6. FPGA Resource summary ............................................................................. 63
Chapter 6. Testing and Validation ....................................................... 65
6.1. Technological Motivation .............................................................................. 65
6.2. System Performance ...................................................................................... 65
6.2.1. Different versions of the LabVIEW SoC implementation ..................... 65
6.2.2. Comparison with other implementations ................................................ 67
6.3. System Scalability ......................................................................................... 68
Chapter 7. User’s manual ...................................................................... 69
7.1. Requirements ................................................................................................. 69
7.1.1. Hardware ................................................................................................ 69
7.1.2. Software .................................................................................................. 69
7.2. User’s Manual ................................................................................................ 69
7.2.1. Setting up the development environment ............................................... 69
7.2.2. Building the LabVIEW project............................................................... 69
7.2.3. Deploying and running the project ......................................................... 70
7.2.4. Validating results .................................................................................... 70
Chapter 8. Conclusions .......................................................................... 71
8.1. Result Analysis and Achievements ............................................................... 71
8.2. Future Work ................................................................................................... 71
8.2.1. Using the AXI standard for inter-SoC communication .......................... 72
8.2.2. Interfacing the acquisition device directly with the FPGA .................... 72
Bibliography ........................................................................................... 73
Appendix 1 – Acknowledgements ......................................................... 77
Appendix 2 – Table of Figures .............................................................. 78
Appendix 3 – Source Code .................................................................... 80


Chapter 1. Introduction – Project Context


This chapter presents an overview of digital image processing to provide
context for the following sections. It also briefly presents the field of computer vision,
which is closely related to the presented subject. We will also focus on presenting
different types of hardware that are relevant for implementing various image processing
algorithms.

1.1. Computer Vision


As defined in [1], computer vision is a field that includes methods for analyzing and understanding images or other high-dimensional data from the real world. Computer vision produces results in the form of numerical or symbolic information; this result can be a decision or the identification of a real-world object.
Computer vision usually involves other fields too, such as artificial intelligence or pattern recognition [2]. Applications of computer vision include autonomous navigation, robotic assembly and industrial inspection, among many others [3].
Complete computer vision systems are beyond the scope of this project, because they include methods for acquiring, processing, analyzing and understanding an image [2]. Achieving this set of functionalities requires low-level image processing algorithms. Our focus will be on these low-level algorithms, which do not necessarily generate a semantic or meaningful result but are highly important and necessary for higher-level applications. For example, a computer vision system that can read a newspaper might use image processing algorithms such as thresholding and edge detection to identify each character one by one.

1.2. Image Processing


There is no exact definition of image processing, because it is hard to draw the line between processing an image and analyzing it. We will use the definition from chapter 1 of [4]: image processing covers “processes whose inputs and outputs are images and, in addition, encompasses processes that extract attributes from images, up to and including the recognition of individual objects”.
In the remaining parts of this chapter, general aspects and properties of image
processing are presented, as well as a motivation for choosing the project in the field of
image processing.

1.2.1. General image processing


As previously defined, an image processing algorithm can either transform an image into a different form or extract valuable information from it. In both cases the input of the algorithm is an image, which we can represent as a matrix (or a 2D vector/array).
We call the elements of the matrix pixels. Accessing a pixel of an image 𝐼 is denoted by 𝐼(𝑖, 𝑗), where 𝑖 represents the index of the row and 𝑗 the index of the column. The size of the image is denoted by (𝑀, 𝑁), meaning that the image has 𝑀 rows and 𝑁 columns. It follows that an image has 𝑀 ∗ 𝑁 pixels, and by convention the first pixel is 𝐼(0, 0) and the last pixel is 𝐼(𝑀 − 1, 𝑁 − 1).
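To make the notation concrete, the following C sketch (our own illustration; the type and function names are assumptions, not code from the project) shows one possible in-memory representation of a grayscale image and the corresponding pixel access:

```c
#include <stdint.h>

/* One possible in-memory representation of a grayscale image with
 * M rows and N columns, stored row by row (names are assumptions). */
typedef struct {
    int M;            /* number of rows                      */
    int N;            /* number of columns                   */
    uint8_t *pixels;  /* M * N intensities, 8 bits per pixel */
} Image;

/* Access pixel I(i, j): i is the row index, j is the column index. */
static inline uint8_t get_pixel(const Image *img, int i, int j)
{
    return img->pixels[i * img->N + j];
}

static inline void set_pixel(Image *img, int i, int j, uint8_t value)
{
    img->pixels[i * img->N + j] = value;
}
```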
Using this notation, we will discuss both cases of image processing and give
representative examples from [5]:


1.2.1.1. Image-to-image transformations


Given an image defined by 𝐼(𝑖, 𝑗), 0 ≤ 𝑖 < 𝑀 𝑎𝑛𝑑 0 ≤ 𝑗 < 𝑁 (input image), we
generate an output image of the form 𝐽(𝑘, 𝑙). We can denote the transformation as a
function 𝐹, where 𝐹(𝐼) = 𝐽. The resulting image, 𝐽 can be of any size, but in most cases,
it will be the same size as the size of the original image or it will be a fraction of the
size of the original image.
A few of the most used image transformation functions and examples are:
• Image transformation – color to grayscale transformation, image shrinking, transforming between the spatial and frequency domains
• Morphological (binary) image processing – opening, closing, dilation,
erosion
• Image filtering in the spatial and frequency domains – thresholding or
filtering, applying convolution kernels

Some of these algorithms will be discussed in more detail in the following chapters, but for now we are just focusing on their general properties. The most relevant property for us is the complexity of the algorithms and the necessary resources (inputs) required to compute one or several pixels of the resulting image.
In the simplest form of image-to-image transformations, a pixel in the output
image depends only on one pixel in the input image. The simplest example is negating
a binary (black and white) image: the value of pixel 𝐽(𝑖, 𝑗) in the resulting image only
depends on the pixel 𝐼(𝑖, 𝑗) of the input image. This is a very important property,
because all the pixels can be computed independently of each other and without the
need of any previous computation. The complexity of these algorithms is 𝑂(𝑀 ∗ 𝑁)
and because computing the pixels is done independently, the algorithm is highly
scalable – theoretically we can compute each pixel in parallel.
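A minimal C sketch of such a point-wise transformation, reusing the hypothetical Image type introduced earlier, could look as follows; since every output pixel depends on exactly one input pixel, the loop iterations are fully independent:

```c
/* Negate a binary (0/255) image: J(i, j) = 255 - I(i, j).
 * Every output pixel depends on exactly one input pixel, so all
 * M * N iterations are independent and could run in parallel. */
void negate_image(const Image *in, Image *out)
{
    for (int i = 0; i < in->M; i++)
        for (int j = 0; j < in->N; j++)
            set_pixel(out, i, j, (uint8_t)(255 - get_pixel(in, i, j)));
}
```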
A slightly more complicated class of algorithms consists of those in which several pixels from the input image are needed to calculate a single pixel in the output image. A
well-known example is applying a convolution kernel on an image. The number of
input pixels required to compute a single pixel of the output image is defined by the
size of the convolution kernel: for example, applying a kernel of size 3 ∗ 3, we need 9
input pixels. The complexity of the algorithm remains in the same class; however, one
input pixel is used for calculating several output pixels – this might decrease the
performance of these algorithms compared to the first category.
We can also define algorithms that are more complex. Many of these algorithms
have a complexity greater than 𝑂(𝑀 ∗ 𝑁) – or have a much larger constant factor. A
very basic example is the histogram equalization algorithm: firstly, we must read all
pixel values to compute the cumulative probability distribution function (CPDF) of the
image; secondly, we normalize each resulting pixel value based on the previously
computed CPDF.
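The two-pass structure of this example can be sketched in C as follows (again an illustration only, assuming 8-bit grayscale pixels and the Image type from the earlier sketch):

```c
/* Histogram equalization for an 8-bit grayscale image.
 * Pass 1: build the histogram and the cumulative distribution.
 * Pass 2: remap every pixel through the cumulative distribution.
 * The second pass cannot start before the first one has finished. */
void histogram_equalize(const Image *in, Image *out)
{
    long long total = (long long)in->M * in->N;
    long long hist[256] = {0};

    for (long long k = 0; k < total; k++)      /* pass 1 */
        hist[in->pixels[k]]++;

    uint8_t lut[256];
    long long cumulative = 0;
    for (int v = 0; v < 256; v++) {            /* scaled CPDF as a lookup table */
        cumulative += hist[v];
        lut[v] = (uint8_t)((cumulative * 255) / total);
    }

    for (long long k = 0; k < total; k++)      /* pass 2 */
        out->pixels[k] = lut[in->pixels[k]];
}
```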
The previous example was a very simple one and there are much more complex
ones that are not detailed here. An important property of these algorithms is that they
cannot be fully parallelized: we must first compute some initial value and only after
that can we move on to further stages of the algorithm.

1.2.1.2. Extracting image attributes


Although our focus will be mostly on image-to-image transformations, it is
important to mention a few algorithms that only extract features or attributes of an
image. Some of these are:


• Mean value and standard deviation of (the intensity levels of) an image
• Geometrical features of binary objects – area, center of mass, perimeter,
aspect ratio, etc.
• Histogram calculation
• Labelling or clustering an image
• Edge/border detection
In many cases these algorithms are used as part of a bigger, more complex
algorithm.

1.2.2. Image processing as a subcategory of digital signal processing


Because (digital) image processing is a category of digital signal processing
(DSP), many algorithms and design decisions presented in the following chapters can
also be applied to DSP in general.
In our case the “signal” is a 2D vector, each element being a numerical value
(e.g. the intensity of a pixel). In DSP, the signal can be of any size and dimension. In
most signal processing applications, the signal is a constant flow (or stream) of values,
that must be processed in well-defined time intervals.
As an example, voice can be represented as a digital signal on an audio CD,
having over 44.1 thousand samples per second, each sample having 16 bits [6]. In image
processing, the number of samples per second is much lower, usually below 60 samples
per second. The size of each sample (image) is however usually much larger: compared
to the 16 bits of an audio sample, the size of a grayscale image can be several kilobytes.
It becomes clear that processing digital images at high rates remains a great challenge,
especially because of the size of the signal.
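To make the comparison concrete, assume (purely as an illustration, not a figure taken from the cited sources) a modest 640 × 480 grayscale image with 8 bits per pixel, captured at 30 frames per second:

640 ∗ 480 ∗ 1 byte ≈ 307 KB per frame, so 30 frames/s ≈ 9.2 MB/s,

whereas the audio stream above carries only about 44,100 ∗ 2 bytes ≈ 88 KB/s. Even at this low resolution, the image stream carries roughly a hundred times more data per second.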

1.3. Specialized Hardware for Image Processing


So far, we have seen several types of image processing algorithms. In this part
possible hardware circuits are presented that can be used for executing image
processing algorithms.

1.3.1. The need for specialized hardware


Most books and articles, such as [4], assume that an image processing algorithm
will be run on a personal computer. Also, most examples are given in imperative
languages, such as C/C++, MATLAB or Python, all written for serial processors. We
might ask ourselves: why bother with different hardware? The short answer is that current image processing systems might not be fast enough. The amount of data to be processed keeps growing, while the execution time of the algorithms is expected to decrease as much as possible.
This high-performance expectation is driven, among other factors, by the recent spread of image processing algorithms in the autonomous driving industry, where every millisecond (ms) counts.
To give a numeric example, let us suppose that we can reduce the reaction time of a computer vision system that detects and avoids collisions by 20 ms. If the car's velocity is 150 km/h, this decrease in reaction time means that the car could stop 0.83 meters earlier – this might not seem like a large improvement, but we should at least try to push the limits of the technology further.
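The 0.83-meter figure follows directly from the distance travelled during the saved reaction time: 150 km/h = 150 / 3.6 ≈ 41.7 m/s, and 41.7 m/s ∗ 0.020 s ≈ 0.83 m.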
As Moore’s law might become invalid in the following years, we must
investigate different approaches to speeding up the execution of image processing

algorithms – in our case this means using special hardware devices. These possible
hardware devices are presented in the next part.

1.3.2. Possible hardware for image processing applications


Using special hardware for image processing is not a new field. We will see
several such applications in the Bibliographic Research chapter. There are already
several classifications and comparisons between these hardware types, but in our case,
we are mostly interested in how we can combine two different types of hardware under
the same application. That is why we first present the most common integrated circuits
used and then we focus on how we can combine these circuits.

1.3.2.1. Using a single type of hardware


The most common environment in image processing is to use a general-purpose
computer under a certain operating system (e.g. Windows, Linux). The general
approach is to read the image from an I/O or peripheral device (e.g. hard drive or
webcam), load it into the RAM and process the image on the central processing unit
(CPU). This is the simplest form of executing image processing algorithms, especially
because the user has a large variety of programming languages to choose from.
Although most of these algorithms are written for serial execution, it is easy to
parallelize them and use the power of today’s multi-core CPUs.
Because many image processing algorithms are considered “embarrassingly
parallel”, several algorithms have been developed for graphical processing units
(GPUs). GPUs have thousands of cores that can execute the same instructions in
parallel, so it is no wonder that they can be used for image processing too. Developing an application for a GPU is slightly harder than for a CPU; however, exploiting the parallelism of image processing is a clear advantage.
Another possibility is to develop application-specific integrated circuits (ASIC),
designed especially for image processing purposes. Unfortunately, this approach is hard
to develop and maintain.
To offer more flexibility to the developer, field-programmable gate arrays
(FPGAs) can be used. Programming these circuits is still much harder than writing a
program for a CPU, however an FPGA can be reprogrammed, which offers more
flexibility than an ASIC.

1.3.2.2. Combining several hardware types in a system on chip device


When several integrated circuits are combined in the same chip, and in addition the chip also includes the components of a computer (memory, input/output ports and secondary storage), we call it a system on chip (SoC) device. This usually means
combining a microcontroller (having a CPU, memory, I/O, etc.) and a secondary unit,
such as a GPU, a coprocessor or an FPGA [7]. Usually the two circuits (e.g.
microcontroller and FPGA) are interconnected by several channels or buses and both
circuits can be programmed.
In our project we will use a system on chip device that has a microcontroller
and an FPGA. In the next chapter we will see how we want to use this type of hardware
to implement image processing algorithms.


Chapter 2. Project Objectives


This chapter describes the project theme and specifies the problem we want to
solve. We also present the requirements that must be met and a proposal for solving the
specified problem.

Note – Throughout the next chapters, we will extensively use the concept of
“real-time” image processing. There is no exact definition for what real-time means,
but it is generally accepted that a real-time image processing system should be able to
process around 30 images per second [8].

2.1. Problem specification

2.1.1. Real-time image processing


Image processing has become an important field in several real-time
applications. A well-known example is using image processing algorithms in
autonomous driving, such as embedded vision systems or smart cameras [9]. The
autonomous vehicle has a set of sensors that capture images and other relevant data
(using a camera, radar or other device that provides a multi-dimensional representation
of the environment). The captured data must be processed, analyzed and in most cases
a decision must be made by the system. In these scenarios it is crucial that the system
can guarantee a fast response. Some applications that use this kind of image processing
systems are collision detection, traffic sign detection and autonomous lane keeping.
A different scenario is given by the fact that we live in an era of internet of
things (IoT), where we might want to push data acquired by a camera to the cloud.
These applications can be very demanding for the cloud servers, so we must consider filtering and pre-processing close to the acquisition device, before sending data further
[9]. Processing the image before uploading can both reduce the size of the data (by
compressing the images) that needs to be sent over the network and give less work to
the server that needs to finally process the images (by doing some of the processing or
pre-processing close to the acquisition device). These improvements can speed up the
overall algorithm and reduce the required bandwidth for the application.

2.1.2. Restrictions in real-time image processing

2.1.2.1. Hardware-related considerations


We have already identified that performing image processing in real-time
applications must be done close to the image acquisition device, because sending the
data over a network to a more performant device is either not possible or it induces
delays that are unacceptable in a real-time scenario.
As a result, most applications use embedded devices that have major size and
power usage limitations compared to a general-purpose computer (that is much larger,
but may include a performant CPU, a large RAM and several other components). As
presented in the Specialized Hardware for Image Processing part of the first chapter,
industries that require real-time image processing, usually use specialized hardware to
meet their performance demands.
A vendor may choose to design its own integrated circuit from scratch with all
the necessary components required for image processing. This solution might guarantee

a good performance; however, it is extremely unfeasible to develop and maintain such a system. In addition, there is minimal to no flexibility – once the circuit is ready, it is impossible to modify it, unless a new circuit is built. On a scale from very specific and hard to develop to very general and easy to develop, this solution obviously fits in the “very specific and hard to develop” category.
On the other side of the scale, we could use a simple CPU or microcontroller
and develop a program written in a well-known imperative language, such as C. This
solution would bring high flexibility (changing the system would mean changing the
source code, recompiling and deploying the executable – this usually does not take
more than a few minutes). On the other hand, the performance of this system would
probably be much lower.
As with almost anything in life, we must try to achieve balance¹. In our current project theme, this means finding a solution that is both performant and offers some flexibility. Based on the already known hardware types that we might use for image processing, we must choose the hardware based on its performance but also on its flexibility (and ease of development). An estimate of these parameters (performance and flexibility) for the considered hardware types is given below:

Type of hardware                                Flexibility       Performance
Developing an integrated circuit from scratch   minimal to none   very high
ASIC                                            minimal           high
FPGA                                            low               medium-high
SoC (microcontroller and FPGA)                  medium            medium
GPU                                             medium-high       medium-low
CPU (microcontroller)                           high              low

Table 2.1 Types of hardware that we considered for image processing, sorted by flexibility (low to high), including the estimated performance

From Table 2.1 we can deduce that a balanced choice would be the usage of a
system on chip device, having both an FPGA and a microcontroller. The reasoning is
that we can separate the system into two different components: a smaller, time-critical
portion of the application can be developed on the FPGA, while the rest of the
application can be deployed to the microcontroller, which is much easier to program.
This way the flexibility is not too high, but we have considerable performance
improvements over choosing a CPU.
From now on, we will mostly concentrate on system on chip devices, however
we will still mention other types of hardware, especially in the Bibliographic Research
chapter. Also, in the Conclusions chapter, we will present a comparison of the same
algorithm implemented on different hardware.

2.1.2.2. Development-related considerations


In software (and hardware) development, a product must not only be fast and performant – we must also deliver the product as soon as possible. This is obviously more relevant for commercial products, but in the case of a non-commercial research project we should also aim for fast delivery.
Unfortunately, it is much harder to create low-level and hardware-based products, because of the complexity of these systems. In general, developing an application on an FPGA is much slower than implementing a software-based solution using a traditional imperative or object-oriented programming language, such as C or Java. Also, FPGA development has a much steeper learning curve than gaining experience in purely software development. Most universities do not even include low-level development in their curriculum, probably because of these considerations.
Despite these disadvantages, we still want a fast and flexible way of developing. Therefore, we will choose an environment that accelerates our productivity, is flexible and lets us deliver our solution much faster.

¹ Based on the author's own experience

2.1.3. Problem statement and proposal


The requirement for real-time image processing has grown significantly in the last decades. The size of images is also growing, and these images must be processed even faster. As a result, engineers are facing issues with implementing image processing algorithms
that meet today’s performance requirements.
We want to propose a hardware and software solution, using system on chip
devices, having a microcontroller and an FPGA, that can be used to speed up image
processing. Using this solution, we should be able to make significant progress in
acquiring and processing images.
To deliver the solution faster, we are going to use the LabVIEW development
environment, which enables the rapid development of low-level software and hardware
components. This way we can also tackle the problems discussed in the Development-
related considerations part.

2.2. Objectives and Requirements of the Project


In the previous part, the main problem that we want to solve was identified – i.e. the need for more performant image processing. An initial proposal was also made to solve this problem by designing a system using system on chip devices. In this part we present the main objectives that the system must fulfil.
As in most software products, we can define the objectives of our system as
functional and non-functional requirements. In many software products, engineers tend
to concentrate more on the functional requirements rather than on the non-functional
ones. Contrary to this, in our project, we may be more interested in the non-functional
requirements that the functional ones. As an example, the correctness of an algorithm
will be considered important, however we are a lot more interested in the execution
time of that algorithm.
Besides the requirements that are discussed in the following part, we can also define objectives that are hard to express in the form of software requirements (see the Other objectives section).

Because we have already decided to use LabVIEW as the development environment, our very first objective will be to figure out whether it is even possible to
implement such a system using this language. Therefore, one of the conclusions of this
research must be regarding the usability of LabVIEW as a tool for SoC-based image
processing. Surprisingly, we could not find any bibliographical sources that would
even mention LabVIEW in this field – this is why it is our duty to do so now.

2.2.1. Functional requirements
Functional requirements define WHAT our system must do. These can be
broken down into a set of steps that must be performed by the system to be able to
successfully process images. These requirements are enumerated in a logical order in

the following part. If any of these requirements is missing, our system cannot be considered an image processing system. The initial requirements are also specified in Figure 2.1 as a series of tasks that must be performed by the system.

Acquire image → Transfer image to FPGA → Process image → Transfer back results → Display

Figure 2.1 Requirements of the system organized as a series of tasks that must be performed
In the following part we will describe the details of each requirement and the
dependencies between them.

2.2.1.1. Acquire a stream of images


The first step that needs to be performed is to acquire an image to be processed.
We can either use a peripheral device, such as a web-cam or camera, or we can save
the images in the persistent storage of the device and then load them into memory when
needed.
We must also provide the images at a given rate: this can either mean capturing
a new image every few milliseconds (from an external device) or reading the same
image repeatedly.
The images must be of a specific format (resolution and bitness of a pixel – the
number of bits required to represent one pixel). We must also predefine whether the
images are color, grayscale or black and white.
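For illustration only, these acquisition parameters could be grouped into a single configuration record; the C sketch below and the concrete values in it (640 × 480, 8-bit grayscale, 30 frames per second) are assumptions rather than values fixed by the project:

```c
/* Illustrative acquisition configuration; field names and the example
 * values are assumptions, not parameters fixed by the project. */
typedef enum { PIXEL_BINARY, PIXEL_GRAYSCALE, PIXEL_COLOR } PixelType;

typedef struct {
    int       width;           /* columns (N)                */
    int       height;          /* rows (M)                   */
    int       bits_per_pixel;  /* "bitness" of one pixel     */
    PixelType type;            /* color / grayscale / binary */
    int       frame_rate;      /* images acquired per second */
} AcquisitionConfig;

static const AcquisitionConfig example_config = {
    .width = 640, .height = 480,
    .bits_per_pixel = 8, .type = PIXEL_GRAYSCALE,
    .frame_rate = 30
};
```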

2.2.1.2. Transfer image from the microcontroller (UC²) to the FPGA


Once an image is loaded into the memory of the UC, it must be transferred to the FPGA. The way the transfer is done is limited by the implementation of the actual SoC that will be used for the project (in general this can be done using buses or dedicated channels between the UC and the FPGA).
In some cases, it is acceptable to pre-process the image on the UC before
sending it to the FPGA – this can include operations such as scaling, resizing or
transforming into a different representation (e.g. from color to grayscale).
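As an example of such a pre-processing step, a color-to-grayscale conversion on the UC could use a standard luminance weighting. The following C sketch is illustrative only; the interleaved RGB layout and the BT.601 weights are assumptions, not choices made by the project:

```c
#include <stdint.h>

/* Convert an interleaved 8-bit RGB buffer to 8-bit grayscale using the
 * common ITU-R BT.601 luminance weights (the layout and the weights are
 * assumptions for this sketch). */
void rgb_to_grayscale(const uint8_t *rgb, uint8_t *gray, int num_pixels)
{
    for (int k = 0; k < num_pixels; k++) {
        uint8_t r = rgb[3 * k];
        uint8_t g = rgb[3 * k + 1];
        uint8_t b = rgb[3 * k + 2];
        gray[k] = (uint8_t)(0.299f * r + 0.587f * g + 0.114f * b);
    }
}
```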

2.2.1.3. Process the image using already known algorithms and generate
transformed image or relevant data
This is one of the most important steps of the system. All previous and subsequent steps are “preparation” and “finalization” stages, respectively. In this stage we already
To process the image, we first need an already known (and frequently used)
image processing algorithm, so that we can easily compare our results to existing
results, considering the speed of the image processing. After selecting one or more
representative algorithms, these must be implemented on the FPGA.
It depends on the algorithms whether the input image can be overwritten by the
resulting image or it must be saved to a different location on the FPGA. A third option is to directly transfer the result as the image is processed – if this is the case, this step and the next step can be merged into one.

² Abbreviation of microcontroller or μ-controller

2.2.1.4. Transfer the result to the UC


In most cases the result of the image processing algorithm will be an image as well (the transformed image), so this step will be similar to the second step (Transfer image from the microcontroller (UC) to the FPGA), but in the reverse direction.
If the result of the previous step is not an image but rather attributes of an image (e.g. the mean of the intensity), then this step is simplified, and we only have to transfer a small number of values (compared to transferring a whole image).

2.2.1.5. Display the resulting image to the user


In a real embedded application, this is usually not a required step, because our
system would be part of a much larger system that would take as input the output image
resulting from our image processing algorithm. However, because we want to verify
the correctness of the algorithms visually too, it is important to see the outputs of the
application. This will also aid debugging the application.
Most probably implementing this step comes with major performance penalties.
Therefore, the user should be able to turn this step off – this is like using a test
environment for development instead of the production environment.

2.2.2. Non-functional requirements
Non-functional requirements define HOW our system must behave while
performing the functional requirements. These are enumerated below:

2.2.2.1. Deployment
Once the system is ready to be deployed from a development computer (in the form of an executable or bitfile³), it should be easy to connect to the system on chip target device and start the application. This means that we should also be able to remotely start the execution of the SoC application with minimal user interaction.

³ A stream of bits used to configure the FPGA

2.2.2.2. Hardware constraints


It is well known that most UCs and FPGAs have far fewer resources (memory, clock frequency, etc.) than general-purpose computers. We must design the system so that these resource limitations are respected. In the case of the FPGA, we must not exceed the number of available reconfigurable blocks and we must meet certain timing constraints imposed by hardware limitations.

2.2.2.3. Speed/performance
We must not forget that our goal in experimenting with image processing algorithms on FPGA-based SoC devices is to increase the performance of embedded image processing systems. Therefore, one of the most important requirements is related to speed and performance.
We are mostly interested in the time it takes to perform the steps defined in the
Functional requirements part – i.e. to acquire, process and present the resulting image.
The execution time of this process will also define the frequency of the image processing application or, in our terms, the frames that can be processed per second (FPS).
We will try to design, implement and optimize the system to reach high FPS values, comparable to today's processing frequencies, which are above 30 FPS [8].
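In other words, if one acquire–transfer–process–display cycle takes t milliseconds, the achievable rate is FPS = 1000 / t, so the 30 FPS target leaves a budget of at most 1000 / 30 ≈ 33 ms for the whole pipeline.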

2.2.2.4. Deterministic execution (optional)


It is not enough that the system performs well in most of the cases. Because we
are working in the field of real-time processing, the system may also need to be
deterministic – that is, to always guarantee processing of the images under a certain
time limit. This requirement can also result in a steady FPS over time.
In our current project we may choose not to implement this non-functional
requirement, because it may be beyond the scope of our research.

2.2.3. Other objectives
Throughout the next chapters we will design and implement a system, keeping
in mind the functional and non-functional requirements. We will need to be able to
measure the performance of our system. We must also implement the same algorithms
on several different types of hardware to be able to compare our results.
We expect to achieve greater performance using our proposed solution than with already existing solutions. However, if these expectations are not met (while the
requirements are still fulfilled), we do not consider our project a failure. The conclusion
of our project in that case will simply be that it is not feasible to use system on chip
devices having FPGAs for image processing. We will however try to avoid this result
as much as possible.


Chapter 3. Bibliographic Research


In this chapter we will present already existing research about image processing,
as well as the state of the art in this field. We will start by presenting image processing as a more general research field and then narrow down our focus to existing bibliography that concerns our own project. We also present a representative algorithm and its properties.

3.1. General Image Processing


One of the most representative books in our field is entitled “Digital Image
Processing”, authored by Rafael C. Gonzalez and Richard E. Woods, and published by
the well-known Pearson Prentice Hall® [4]. The first two chapters give us a great
introduction to the following notions and fields:
• Definition of digital image processing
• Origins of digital image processing
• Examples of image processing
• The fundamental steps and components of an image processing system
• Image sensing and acquisition
• Image representation
• Mathematical tools used in digital image processing
In the following part we will briefly describe each of these subjects. Please note that the notion of real-time image processing and the use of any special kind of hardware are not covered in this book. Still, the subjects presented here can be considered a very good theoretical foundation for our project too, because they present the basics of image processing in general.
The following sub-sections are all based on, cite or reference [4] in some way.

3.1.1. Definition of digital image processing


The very first paragraph of the first chapter in [4] defines an image as a two-dimensional function, 𝑓(𝑥, 𝑦), where (𝑥, 𝑦) are coordinates and the amplitude of 𝑓 at any pair of (𝑥, 𝑦) coordinates is called the intensity or gray level of the image at that point. For an image to be digital (or discrete), all values of 𝑓 must be finite.
Image processing and the field of computer vision aim to reproduce the vision system of a human. This is not surprising, because vision is one of our most advanced senses. The basic components of this system include the eyes, the brain and a neuronal network that interconnects them. We have seen that several image processing applications can reproduce this system with success. Computerized vision can go far beyond the capabilities of the human system, because it is not limited to the visible electromagnetic (EM) spectrum. If proper sensors are used, we can apply image processing to the whole spectrum of EM waves. Figure 3.1 shows that the visible spectrum is only a small part of the entire EM spectrum.


Figure 3.1 Electromagnetic Waves Spectrum, from Wikipedia (author: Philip Ronan)

As we have already seen in the Introduction – Project Context chapter, there is no clear boundary between image processing and artificial intelligence (AI). Obviously, simple algorithms that transform images and do not give a “meaning” to the image will be categorized as image processing algorithms, whereas a process that can read and understand sentences from a newspaper will most likely be categorized as AI.
In the “What Is Digital Image Processing?” section of the first chapter in [4], the authors define a paradigm that considers three levels of computerized processes involved in computer vision. These processes are:
• Low-level processes: “involve primitive operations such as image
preprocessing to reduce noise, contrast enhancement, and image
sharpening”
• Mid-level processes: “involve tasks such as segmentation, description
of those objects to reduce them to a form suitable for computer
processing, and classification (recognition) of individual objects”
• Higher-level processes: “involve “making sense” of an ensemble of
recognized objects, as in image analysis, and, at the far end of the
continuum, performing the cognitive functions normally associated with
vision”
In our research and in the state of the art of image processing, presented in this
chapter, we will mostly exemplify low- and mid-level processes.

3.1.2. Origins of digital image processing


Chapter 1.2 of [4] introduces the origins of digital image processing. The first examples in this field came from the printing (newspaper) industry in the early 1920s, when digitized images were sent over the Atlantic Ocean using submarine cables. These images had a low quality and transmitting them was extremely slow by today's expectations (it took around three days to send an image).
As time passed, significant research was carried out in the field of image processing; however, the first computers were too slow to allow the development and implementation of these algorithms. Therefore, the evolution of image processing was tightly coupled with the evolution of computers.
In the 1960s, the first breakthrough in our field was made because of the space program. The huge research effort invested in “reaching the sky” also made possible the implementation of more powerful computers, which then allowed IP algorithms to gain importance in the following years.
In fact, some of the first image processing algorithms were used for the space program: in 1964 pictures of the moon were transmitted by a spacecraft and a computer had to apply several image processing algorithms to enhance them.

In the present, computers are much faster, and the field of computer vision has
grown exponentially. Today we can easily find IP algorithms almost anywhere: in
transportation, defense, social media, geography, space research, and the list could
continue. In the following part we will exemplify some of these fields.

3.1.3. Examples of image processing


Because the usage of image processing is so varied and so wide, it is hard to categorize the given examples. The authors in [4] exemplify image processing based on the principal energy source used to take an image. These are usually various bands of the EM spectrum that serve as the source for an image. Figure 3.2 shows an overview of these bands with respect to the energy per photon, which characterizes the intensity of the EM wave in that region. We will enumerate some of these and give a few examples based on the electromagnetic source that generated the processed image:
• Gamma-rays – mostly used in nuclear medicine and astronomy
• X-rays – well known for their usage in medical diagnostics; also used in industrial applications
• Ultraviolet – used in several fields, such as industrial inspection,
microscopy, lasers, biological imaging, and astronomical observations
• Visible and Infrared – this is the most familiar for us, because it is used
in our everyday life; e.g. taking a portrait of a person or using face
detection at a border control
• Microwaves – e.g. radar, that can be used for navigation
• Radio waves – mostly used in medicine and astronomy, e.g. for magnetic resonance imaging (MRI⁴)

Figure 3.2 The electromagnetic spectrum arranged according to energy per photon, from [4] (chapter 1.3, pg. 7)
There are also other ways images can be acquired. We will give some examples
from [4], but we are not going to detail them:
• Acoustic imaging
• Electron microscopy
• Synthetic (computer-generated) imaging

⁴ Widely used in hospitals for medical diagnostics; considered safer than using X-rays


3.1.4. The fundamental steps and components of an image processing system
Chapters 1.4 and 1.5 in [4] summarize the fundamental steps (or processes) that an image processing system should perform and also define the components that can fulfill these steps. The following list enumerates these steps (note that an image processing system does not have to implement all of them – in fact, most systems will only implement a subset):
• Image acquisition
• Image filtering and enhancement
• Image restoration
• Color image processing
• Wavelets and multiresolution image processing
• Compression
• Morphological processing
• Segmentation
• Representation and description
• Object recognition
The first items on the list (from image acquisition until morphological
processing) generally produce images as outputs, while the remaining steps (from
morphological processing until object recognition) are generally considered to be
algorithms that extract attributes from images.

Figure 3.3 Components of a general-purpose image processing system, from [4] (chapter 1.5, pg. 27)


Figure 3.3 presents an overview of the components of an image processing application. Not all of these components have to be present in a given application. We will use this architecture to structure our system in Chapter 4 and Chapter 5.

3.1.5. Image sensing and acquisition


The second chapter in [4] focuses on the fundamentals of digital images and
introduces the reader to the concepts of image sensing and image acquisition. When
speaking about acquiring an image, we must first describe the source of the image.
Usually this source is an “illumination” reflected on the scene. The source of
illumination may be a source of electromagnetic energy, as described in the previous
sections (e.g. X-rays, infrared or visible light).
To acquire an image from the provided source, sensors that can react to the scene are used. In their most general aspect, these acquisition devices are no more than analog-to-digital converters (ADCs) that transform an analog signal, such as light, into a digital form, usually represented by a two-dimensional array. The referenced book gives much deeper detail about these aspects, but they are beyond the scope of our project and are not relevant for us right now.

3.1.6. Mathematical tools used in digital image processing


In chapter 2.6 of [4], the authors' principal objective is to present the mathematical background needed for the following parts. The most used concepts and operations are presented and exemplified.
The first mathematical terms that are introduced are array and matrix operations. In image processing, the element-wise array product is used much more often than the conventional matrix product. We can represent an image both as an array and as a matrix, so it becomes straightforward that both of these operations can be applied to images. It is important to note that these operations are mostly composed of addition and multiplication operations.
One of the most important classifications of IP methods is based on linearity. [4] defines an operator, 𝐻, that can be applied to an image defined by 𝑓(𝑥, 𝑦) and generates an output 𝑔(𝑥, 𝑦). We can express this in the following form:
𝐻[𝑓(𝑥, 𝑦)] = 𝑔(𝑥, 𝑦)

If 𝐻 is a linear operator, then we can decompose the functions 𝑓 and 𝑔 in the following way:
𝐻[𝑎₁𝑓₁(𝑥, 𝑦) + 𝑎₂𝑓₂(𝑥, 𝑦)] = 𝑎₁𝐻[𝑓₁(𝑥, 𝑦)] + 𝑎₂𝐻[𝑓₂(𝑥, 𝑦)] = 𝑎₁𝑔₁(𝑥, 𝑦) + 𝑎₂𝑔₂(𝑥, 𝑦)

In the previous example 𝐻 is both additive and homogeneous. This can have
significant importance in the following chapters, when we discuss performance.
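For instance (our own illustration, not an example taken from [4]), the operator H[f](x, y) = 2 ∗ f(x, y), which doubles every intensity, satisfies the relation above and is therefore linear, while H[f](x, y) = f(x, y)² is not, because the square of a sum of two images is not the sum of their squares.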
Some other mathematical operations that are presented are listed below:
• Arithmetic operations – e.g. addition, subtraction, multiplication or
division
• Set and logical operations – e.g. the difference of two images
• Logical operations – e.g. inverting (negating) a black and white image
• Spatial operations – applied on a single pixel, on several neighboring pixels, or on the entire image at once, e.g. kernel⁵ or transformation operations

⁵ To be explained in the following chapters


3.2. Properties of image processing algorithms and examples

3.2.1. Some properties of image processing algorithms


The algorithms that we will choose should be representative in the field of image
processing and should be well-known to any computer scientist who has at least a basic
knowledge in our field. It is good to choose an algorithm for which some performance
evaluations have been already made, because it will be easier for us to self-evaluate our
own project.
We also do not want to “reinvent the wheel” by experimenting with new algorithms, because the main objective of our project is to implement already existing algorithms on different hardware – thus we do not even discuss any kind of “new” algorithm.
In the following parts, we will discuss some properties related to image
processing algorithms in general, such as linearity and memory usage, as well as the
type of the output generated by the algorithm. Finally, we will choose our algorithm(s)
based on these criteria.

3.2.1.1. Linearity
In section 3.1.6, we have identified several mathematical operations that can be
used to process images. We have seen that most algorithms are linear, having a
complexity of 𝑂(𝑛), with a small constant factor. These algorithms are usually a good
choice to be parallelized or to be moved to more performant hardware, because linear
algorithms usually scale well, resulting in a good speedup.
Algorithms that are more complex, having a higher complexity class, are much harder to scale. Such an algorithm, having for example a polynomial or exponential complexity (e.g. 𝑂(𝑛²) or 𝑂(𝑒ⁿ), where 𝑛 is directly proportional to the resolution of the image), might not even fit on the FPGA, because of the limitations imposed by the hardware.
Based on these considerations, we will implement linear algorithms that have a complexity of 𝑂(𝑛), with a low constant factor.

3.2.1.2. Memory usage


All image processing algorithms⁶ require an image as input. This image is
transferred to the “image processor” from another component in the system and must
be saved locally. We should measure the memory footprint of the algorithms. We will
categorize this memory usage in the following paragraphs.
If processing the image can be done while still receiving the image and we don’t
have to retain the pixel values, then we don’t even have to keep the image in the
memory. In other words, we can begin processing before the image is fully available
and the final result will not be an image. A basic example is the computation of the
mean of the intensity values of an image, where we only have to keep some parts of the
image in memory. After processing some pixels, we can discard them, and keep
working on the next set of pixels. Note that in this example we suppose that we have
access to the image via a continuous stream of pixels.
If we change the previous example, so that we use the computed mean to apply
thresholding on the input image, we will first have to save each pixel (the entire image)
in the memory and then we will have to apply thresholding on the saved image. This
approach has a higher memory footprint.
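To make the difference concrete, the following C sketch (our own illustration with hypothetical names, assuming 8-bit grayscale pixels) contrasts the streaming mean computation with thresholding that needs the stored image:

#include <stdint.h>
#include <stddef.h>

/* Streaming case: the mean can be updated chunk by chunk, so already
 * processed pixels can be discarded and the full image is never stored. */
double update_running_mean(const uint8_t *chunk, size_t n,
                           uint64_t *sum, uint64_t *count)
{
    for (size_t i = 0; i < n; i++) {
        *sum += chunk[i];
    }
    *count += n;
    return (double)(*sum) / (double)(*count);
}

/* Stored-image case: thresholding against the global mean can only start
 * once the whole image is in memory, because the mean is known at the end. */
void threshold_with_mean(uint8_t *image, size_t num_pixels, uint8_t mean)
{
    for (size_t i = 0; i < num_pixels; i++) {
        image[i] = (image[i] > mean) ? 255 : 0;
    }
}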

6 From this book, at least


If more steps are performed on the same image, we might even have to keep
two copies of the image. This is very often the case with motion detection algorithms
or algorithms that require two consecutive images captured from an acquisition device.
Using these algorithms will have an increased memory usage.

3.2.1.3. Result of the algorithm


We have already identified, in previous sections, that the result of an image
processing algorithm can be either an image (that was generated by transforming the
input image) or some other property of the input image. In general, these two cases
should not influence the complexity of the algorithms, however they will certainly
influence the performance of the application – if we don’t generate an output image,
then we don’t even have to send one, which eliminates the need to perform the slower
operation of transferring the resulting image.
We should experiment with both types of these algorithms, to see the difference
in behavior between them. However, our focus should be on algorithms that generate an
output image.

Based on the last three sections, we will select one or several representative
algorithms that we will describe and finally implement in the Detailed Design and
Implementation chapter.

3.2.2. Example of an image processing algorithm – Linear Spatial Filter (Convolution Filter)
We will present a representative algorithm that is widely used in computer vision, namely the linear spatial filter, which is also called a convolution filter.
Spatial filtering, as defined in [10] is the process of transforming a digital image
by performing the following tasks:
1. Selecting a center point, (x, y)
2. Performing an operation that involves only the pixels in a predefined
neighborhood about (𝑥, 𝑦)
3. Letting the result of that operation be the “response” of the process at
that point
4. Repeating this process at every point in the image
If the computations performed on the neighboring pixels are linear, then the process is called Linear Spatial Filtering. The term spatial convolution refers to this same operation and is used more often. The "spatial" attribute refers to the fact that the images are represented in the spatial domain – as opposed to the frequency domain, which can be reached by applying the Fourier transform to the image.
The "atomic" linear operations performed by the algorithm are array operations (see section 3.1.6). These operations are multiplications and additions, performed between coefficients and array elements extracted from the image.
These coefficients are arranged as a matrix (or array) and are called the convolution
kernel.

The algorithm and the pseudocode are specified in the next parts, as well as
properties and examples of the algorithm.


3.2.2.1. Algorithm and pseudocode


We can define the convolution process in the spatial domain, from [5], as a process that applies a kernel H on a source image I_S and generates an output image, I_D. H is a matrix, having a symmetric shape and size w * w, where usually w = 2k + 1 (e.g. w = 3 or w = 7). H is said to be a constant because it usually does not change inside one image processing application. Each pixel in the output image is defined as:

I_D(i, j) = (H * I_S)(i, j) = \sum_{u=0}^{w-1} \sum_{v=0}^{w-1} H(u, v) \cdot I_S(i + u - k, j + v - k)

The above formula is applied to each pixel of I_S, except the border of the image, and therefore applying this algorithm implies "scanning" the image, as also presented in Figure 3.4.

Figure 3.4 Illustration of the convolution process, from laboratory 9 in [5]


The pseudocode for applying the convolution kernel is presented in Figure 3.5,
where 𝑖𝑚𝑎𝑔𝑒𝐻𝑒𝑖𝑔ℎ𝑡 and 𝑖𝑚𝑎𝑔𝑒𝑊𝑖𝑑𝑡ℎ denote the size of the image and 𝑘 is the
coefficient in the size of the convolution kernel (𝑤 = 2𝑘 + 1).

for row = 1 to (imageHeight − 1) do
    for col = 1 to (imageWidth − 1) do
        sum = 0
        for i = −k to k do
            for j = −k to k do
                sum = sum + H(i, j) ∗ I_S(row − j, col − i)
            end for
        end for
        I_D(row, col) = sum
    end for
end for
Figure 3.5 Pseudocode of convolution filtering
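As a textual reference implementation, a minimal C sketch of the pseudocode above is given below (our own illustration, assuming an 8-bit grayscale image stored row-major and a (2k+1) × (2k+1) kernel of float coefficients; the border pixels are simply left untouched):

#include <stdint.h>

/* src, dst: row-major grayscale images of size height x width
   kernel:   (2k+1) x (2k+1) coefficients, row-major            */
void convolve(const uint8_t *src, uint8_t *dst,
              int height, int width,
              const float *kernel, int k)
{
    int w = 2 * k + 1;
    for (int row = k; row < height - k; row++) {
        for (int col = k; col < width - k; col++) {
            float sum = 0.0f;
            for (int i = -k; i <= k; i++) {
                for (int j = -k; j <= k; j++) {
                    sum += kernel[(i + k) * w + (j + k)]
                         * src[(row - i) * width + (col - j)];
                }
            }
            if (sum < 0.0f)   sum = 0.0f;    /* clamp to the 8-bit range */
            if (sum > 255.0f) sum = 255.0f;
            dst[row * width + col] = (uint8_t)sum;
        }
    }
}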

3.2.2.2. Properties of the algorithm


The convolution filter is a linear filter, because the value of each pixel in the
result is determined by a linear combination of a constant number of pixels in the
neighborhood of the pixel. Therefore, several pixels may be computed in parallel.


Because each pixel from the input image influences only a small number of
output pixels (9 in the case of our convolution kernels), we can implement an “in-
memory” image processing algorithm. That means that we do not have to make a copy
of the original image to generate the result image. Instead, we can use the same memory
location, thus we overwrite the input image with the output image.

3.2.2.3. Examples
Figure 3.6 shows the result of applying a Gaussian kernel and a Sobel kernel on
a color input image.

Figure 3.6 Example of applying the Sobel filters (2nd image) and the Gaussian
blur (3rd image) on a color image (1st image), from [11]
The kernels are defined as follows:

Gaussian: \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}

Vertical Sobel filter: \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix},   Horizontal Sobel filter: \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}

3.3. Real-time image processing


An excellent source of inspiration is the Journal of Real-Time Image Processing, which has published research articles in the field of real-time image processing since 2006. Fifteen volumes have been published up to June 2018, containing over 700 articles. The journal presents state-of-the-art solutions to current problems in image processing. We have selected two articles that are representative for our project,
image processing. We have selected two articles that are representative for our project,
as well as a third article that was published in a different journal. These will be
presented in the following parts.

3.3.1. High-level dataflow programming for real-time image processing on smart cameras
Authors in [12] describe the application of CAPH to implement a real-time
image processing system. CAPH is a “domain-specific language for describing and
implementing stream-processing applications on reconfigurable hardware, such as
FPGAs” [13]. The language is based on a data-flow programming paradigm, where
entities exchange data using unidirectional channels.
The researchers in this paper identified that FPGAs are a very good solution for
image processing algorithms, because of the fine-grained parallelism that can be
achieved. On the downside, it was identified that programming an FPGA can be

23
Chapter 3

extremely difficult and requires a large skillset from the developer. To program an
FPGA, usually hardware description languages (HDL) are used, such as VHDL7 or
Verilog. Defining the personality of the FPGAs using these languages can be hard and
very complex. Therefore HLS8 tools are used that provide a layer of abstraction
between the low-level HDL and the abstract model of the system. Despite the effort
invested in these tools, they are still not performant enough and do not generate good
enough HDL code.
In response to these limitations, authors of the mentioned paper used CAPH to
describe the system that shall be implemented on the FPGA. The CAPH code is
compiled into highly optimized VHDL code, which is then synthesized and finally
programmed on the FPGA.
Authors exemplified the usage of this language on the “implementation of a
real-time image processing application on an FPGA embedded in a smart camera
architecture" [12]. In conclusion, we have seen that this approach is well suited
for architectures such as smart cameras.

3.3.2. Fast prototyping of a SoC-based smart-camera: a real-time fall detection case study
Authors in [14] present a case study involving real-time image processing on
smart cameras. A fall detection system is presented that could be helpful especially for
the elderly in their daily life. The proposed system is a hardware/software (HW/SW)
solution that has a single camera and a Zynq SoC device from Xilinx.
The authors focus on the development process, specifically on enabling fast prototyping of the HW/SW solution. This results in fast architecture exploration
and optimization. Another contribution of this work is the design of a hardware
accelerator that is dedicated for boosting-based classification, which is a “hot topic” in
today’s image processing research.
The research project presented in [14] focuses extensively on the process of
partitioning software and hardware components. The development process is split in
three parts:
1. Standard flow of implementation as a software product, using C and
OpenCV libraries – this step includes C software development,
debugging and profiling
2. HW/SW implementation – delegating some work defined in the first part
to the hardware component; this step involves extensive use of the
Xilinx Vivado development environment
3. Final implementation on the target system – this involves the final
compilation, execution, debugging and profiling of the system defined
in the previous step, on real hardware and software components
In conclusion, using multi-CPU/FPGA systems (such as a Xilinx Zynq
embedded device) is a good choice for real-time image processing algorithms. We have
seen that the most complex part of the development was the HW/SW interfacing and
porting software-defined parts of the application to the programmable logic (to the
FPGA). This development time was slightly reduced by using C-to-HDL high level

7 Very High Speed Integrated Circuit Hardware Description Language
8 High-Level Synthesis


synthesis tools and creating intellectual property cores (IP9) that implement specific
algorithms, such as the AdaBoost10 classifier algorithm.

3.3.3. An image processing system for driver assistance


Another representative example that was published in a different journal than
the previous two, is entitled “An image processing system for driver assistance” [15].
The article presents an image processing system with focus on different methods for
analyzing driving-relevant scenes.
Authors present a system that captures and processes images from a camera
mounted on a moving vehicle. Three main computational tasks are defined. These are:
• Initial segmentation and object detection
• Object tracking
• Information processing
As a short parenthesis, we can see how these tasks correspond to the three levels
of computerized processes involved in computer vision, defined in 3.1.1. Segmentation
is a low-level task, object tracking is slightly more complicated and already involves
identification and labelling of objects, while information processing can be considered
as a higher-level task that gives “meaning” to the image or takes a decision based on
the lower levels.
The described application was implemented on a general-purpose computer that
was not designed for image processing usage. The system can still meet the
requirements of real-time processing, however only by removing or simplifying some parts of the algorithm. Authors state that these limitations will be lifted once more performant hardware becomes available – the article was published in 2000, so today's computers can meet the requirements of the presented system.

3.4. Existing Image Processing Implementations in Hardware and their Comparison
We will also present papers that focus extensively on hardware implementations
of IP algorithms. We will try to give FPGA-based examples as well as usages of other
hardware. At the end of this section we will present a paper that compares several types
of these hardware implementations.

3.4.1. FPGA-based implementations of image processing algorithms and systems
Authors in [16] exploit the fact that most image processing algorithms are easily
parallelized. Therefore, a proposal is made to use FPGAs that can exploit the spatial
and temporal parallelism of image processing. Hardware issues, such as concurrency,
pipelining and resource constraints are among the problems that authors try to solve.
The paper presents a way to increase the performance of the algorithms, as well as the
development speed by using high-level languages and compilers. This way, the
complexity of the hardware can be hidden from the developer and parallelism can
automatically be extracted from an initially serial algorithm.
[17] presents the implementation of algorithms such as filtering, smoothing, Sobel edge
detection or motion blur in FPGA hardware. Results using an image of size 585x450

9 Not to be confused with the abbreviation of image processing!
10 Adaptive Boosting, as defined by Wikipedia, is a machine learning meta-algorithm mostly used in image processing classification


show how these algorithms are well suited for FPGA. The paper also states that good
results can still be achieved after increasing the image size, if the memory constraints
of the device are met.
Paper [18] specifies that the integrated chips most suited for image processing are ASICs, DSP chips (Digital Signal Processors) and FPGAs. In this paper an FPGA-based application is presented that was designed for image preprocessing. Authors proposed and implemented a fast median filtering algorithm on an FPGA, which resulted in reduced cost and higher performance than a similar implementation on conventional hardware. Results show that this approach can also be used for real-time
image processing.
An example of an FPGA-based embedded vision system is presented in [19].
Authors stress the massive parallelism that is implemented in the system and give
examples of algorithms that benefit from this hardware. The chosen hardware solution
is both fast and cost-effective. Authors could reach a processing frequency of over 100
FPS, compared to the 50 FPS of the same algorithm implemented on a serial processor.
It is also proposed in the Future Work section, to use “System-on-a-Programmable-
Chip (SOPC)” technology – we simply call this system on chip in our book.
Authors in [20] combine the already presented FPGA-based approaches with
digital signal processing (DSP) hardware, to achieve a highly parallel and
reconfigurable system intended for fast computer vision applications. A host-
independent architecture was designed which allows “dealing with high-level real-time
image processing routines”.

3.4.2. Performance comparison of FPGA, GPU and CPU in image processing
So far only FPGAs were presented as suitable hardware components for image
processing applications. In the paper entitled “Performance comparison of FPGA, GPU
and CPU in image processing” [21], we are introduced to the implementation of image
processing algorithms on three different circuits. The paper compares the performance
of several simple algorithms executed on CPU, GPU or FPGA and states that FPGAs
have an obvious advantage over CPUs, while GPUs outperform the CPUs only when
most pixels can be processed in parallel (no dependencies between large data sets).
Figure 3.7 is a comparison of the performance (measured in FPS) of the k-means
clustering algorithm between the three hardware types. It is obvious that the FPGA
outperforms both the CPU and GPU.


Figure 3.7 Performance of the k-means clustering algorithm, from [21] (Fig. 8.
of the original paper)

3.5. SoC Image Processing


So far, we have extensively studied general image processing examples and
implementations on FPGAs. It is time to also present the state of the art in the field of
system on chip processing combined with image processing. We will present related
articles and give representative examples.
Unfortunately, we could not find any LabVIEW System on Chip
implementation in the field of image processing, so we are not able to present that topic.
We currently see two possible explanations for the lack of sources:
1. Most LabVIEW SoC image processing systems are kept as company
secrets, because they represent the intellectual property of that company
– sharing these projects would generate financial losses for these
companies
2. There has been little or no research in this approach yet
Either way, we would like to change this with the contribution of this work.
In similar fields, we could find image processing-related projects, such as digital
signal processing and vision system implementations on CPUs developed in LabVIEW.
However, these projects did not seem relevant for our research.

3.5.1. Image Processing Towards a System on Chip


Authors in [22] present the recent evolution in image sensing devices. They
have identified that CMOS11 image sensors have taken over the traditional CCD12
technology that was used to capture and digitize images. While CMOS technology is much cheaper, has lower power usage and can be more easily integrated into other

11 Complementary metal–oxide–semiconductor, a technology for constructing integrated circuits
12 Charge-coupled device


systems, its quality is not as good as what CCD offers. Therefore, images are
noisier, have less contrast and are blurrier.
To face these issues and to also provide high performance, the paper proposes
to process (or preprocess) the images close to the acquisition device. This is done by
using a “retina” – a small computer vision system that mainly focuses on sensing and
initial processing on the same small device.
Authors implement several filtering algorithms (e.g. smoothing or Sobel filter)
on the retina, which is basically a system on chip device having an integrated camera
module. The success of this project shows how well-suited image processing algorithms
are for on-chip processing.

3.5.2. A Survey of Systems-on-Chip Solutions for Smart Cameras


In [23], researchers conduct a survey about using Systems-on-Chip solutions
for smart cameras. They start from a specification of a smart camera architecture and
define the Quality of Service (QoS) attributes that must be taken into consideration.
Some of the identified quality attributes are:
• Frame rate
• Transfer delay
• Image resolution
• Video compression rate
Authors also present current SoC-based solutions in the field of real-time image
processing, exemplifying again that these chips can be a perfect choice for smart
cameras and embedded vision systems.

3.5.3. FPGA implementation of a license plate recognition SoC using automatically generated streaming accelerators
Another representative example in this field is provided by the Embedded
Systems Research group at Motorola, Inc. in [24]. Authors present a practical
implementation in FPGA using a system on chip device to detect and recognize license
plates. Contrary to several previous examples, the system not only processes but also
gives a “semantic meaning” to the images by extracting license plate information. This
is not considered to be in the field of artificial intelligence yet, but the system requires
considerable processing power.
Authors use a streaming data model that creates streaming data, which is easily
processed in parallel by different parts of the FPGA hardware. A template-based
hardware generation is also presented, that automatically generates streaming
accelerators in hardware that process the previously generated data.
The final solution is the development of a methodology and prototype tool that
accelerates the construction of the hardware components (that are executed on the
FPGA). The resulting system is performant and similar approaches may be used in other
fields of SoC processing as well.

3.6. Other usages of FPGA and SoC devices


So far, we have only seen system on chip implementations targeting the computer vision industry. However, these systems can be used in several other fields, such as mathematical computations and signal processing. Examples include real-time thermal image processing [25], among many others. In the following part, two more examples are detailed.


3.6.1. SoC-FPGA implementation of the sparse fast Fourier transform algorithm
Authors in [26] implement the sparse fast Fourier transform algorithm. The fast
Fourier transform (FFT) “is an algorithm that samples a signal over a period of time (or
space) and divides it into its frequency components" [27]. Authors use an ARM Cortex-A9 dual-core processor in combination with an Altera FPGA. The system shows how programmable logic can be used side-by-side with an open-source operating system, such as Linux. The resulting system provides low execution time for highly intensive processing algorithms, with high scalability and medium development time (compared to an FPGA-only implementation).

3.6.2. A fully-digital real-time SoC FPGA based phase noise analyzer with cross-correlation
Another interesting example of FPGA-based SoC systems is presented in article
[28]. The paper presents a “fully-digital and real-time operation of a phase noise
analyzer”. Phase noise is represented by fluctuations and jitter in the phase of a
waveform.
Authors analyze the possibility to use system on chip devices for signal
processing and in time & frequency research. Results show that authors could
successfully develop a reconfigurable and fully digital system that performs well in
real-time scenarios. All this was made possible by the combined usage of the FPGA
(programmable logic) and the software-based application on the same chip.

3.7. Other bibliographical research


Several other interesting articles and papers were analyzed when building the Bibliographic Research chapter of this work, but the decision was made not to include them – mainly because none of them was related to LabVIEW. For
example, paper [29] presents a low-cost internet of things (IoT) application for image
processing that uses an FPGA-SoC-based approach.
There is a huge variety of other papers too, but we consider that the articles presented here efficiently capture the current state of hardware-based image processing, and they provide a good bibliographical introduction and theoretical foundation for the next chapter.


Chapter 4. Analysis and Theoretical Foundation


The purpose of this chapter is to explain the operating principles of the application, as well as to analyze the problem and create an initial design. Low-level hardware projects are usually highly dependent on the chosen devices; however, in this
initial design we won’t specify any specific hardware component. In this sense, it
should be feasible to implement the identified design on most SoC devices that offer a
processor and an FPGA.
We will also specify the architecture of the system in the form of diagrams and modules. Most development processes favor using UML13 as a modelling language because the generated diagrams are easily mapped to actual software components, especially in object-oriented programming languages. Unfortunately, this is not the case in such a low-level hardware design, so more generic diagrams will be used instead of UML. Because the implementation will be done in LabVIEW, which is a graphical data-flow programming language (or engineering tool), many of the presented diagrams will map easily to LabVIEW code. For this reason, we will try to define the overall architecture and the larger modules in simple and generic diagrams that represent a sequence of tasks/processes – and not by using standardized diagrams, such as a UML activity diagram.
We will also detail the design of the chosen image processing algorithm
(Convolution Filter) and introduce to the reader some of the initial performance and
implementation considerations.
Please note that the design that is presented in the following part is mostly
independent of the used technologies and development environment (LabVIEW), so
that this analysis and design could also be reused in other, similar projects.

4.1. Overall Architecture

4.1.1. High-level architecture
Most image processing systems are similar to data acquisition (DAQ) and control systems, at least from a high-level view. These can be characterized as having three
distinct procedures: acquisition, processing and control. We can adapt this structure to
our needs, as described in Figure 4.1 – most of the systems presented in Chapter 3 are
also implementing this structure in some way.

Figure 4.1 Overall Architecture, as a data acquisition and control process: Image acquisition → Image processing → Display

13 Unified Modeling Language (http://www.uml.org/)


The first step is acquiring the image and the last one is displaying it. These might
not seem important, especially displaying the image, however because we would like
to visualize the result of our system from real-time data, we need these steps too. The
analysis of these steps, as well as the image processing part, will be detailed in the
following sections.
Because our focus will be on the “Image Processing” part of the diagram, which
in fact will be implemented on the system on chip device, it is necessary to break down
the design to smaller logical components, that we will be able to map to specific parts
of our hardware. Figure 4.2 describes the logical components of the system, by splitting
the second part of the diagram.

Figure 4.2 Overall Architecture, from a simple, logical point of view: Image acquisition → SoC (UC and FPGA) → Display

4.1.2. System on Chip overview


In Figure 4.2, we can also identify the SoC device, having its two distinct
components, the microcontroller and the FPGA respectively. Although these two
components are physically on the same chip, logically they are separate and the
interaction between them is not straightforward. Another reason why the UC and FPGA
are represented separately is because developing an application for them is much
different and might require other development environments and developer skillsets.
At this stage we might ask ourselves why we should choose an SoC device when we still need to target the CPU and FPGA individually. The key answer is that these circuits
“provide higher integration, lower power, smaller board size, and higher bandwidth
communication between the processor and FPGA” [30]. With these aspects in mind,
we have a higher chance of meeting our project objectives – that is to implement faster
image processing algorithms. Another benefit is that SoC devices are generally much
cheaper than using a microcontroller and a separate FPGA.


4.1.3. Offloading work to the FPGA


So far, we have identified the main components of the system and we have seen that both the UC and the FPGA will be used to some extent. We want to define how much work to do on the FPGA and what to leave in the responsibility of the processor.
We will start by breaking down the process of image processing into steps that require less work; we will call these "tasks". Initially these tasks will be intended to be executed serially on the processor. Then we identify the most critical parts and we will offload the work of some tasks to the FPGA. We will have to deal with communication overhead as well, but we would like the decrease in execution time gained by using the FPGA to be much larger than this overhead. The following paragraphs describe our strategy to organize the application between the processor and
the programmable logic14.

4.1.3.1. Serial tasks


The serial tasks to be performed by the application can be easily depicted by a
simple flow chart:

Figure 4.3 Serial tasks performed by the SoC device: Capture image → Scale → Apply IP algorithm → Generate output


In the initial implementation, all the tasks from Figure 4.3 will be implemented on the processor. The acquisition and display devices
are beyond the scope of this part, but they are also described in the diagram, as the
boundaries of these tasks. Note that these are only preliminary tasks and additional tasks
might be added later. Also, the „Apply IP algorithm” will be detailed once we select an
appropriate algorithm (to be done later in this chapter).
To give precise results about the speedup of our system, we will have to measure
the execution speed of the system that is implemented only on the processor. We have
intentionally specified the acquisition and display device as separate parts, because we
don’t want to include these components when measuring performance.
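A minimal C sketch of how this processor-only sequence could be organized and timed is shown below (our own illustration; the four task functions are hypothetical placeholders, not an existing API):

#include <time.h>

/* Hypothetical placeholders for the four serial tasks of Figure 4.3. */
static void capture_image(void)      { /* copy the newest frame from the acquisition buffer */ }
static void scale_image(void)        { /* adapt resolution / pixel representation           */ }
static void apply_ip_algorithm(void) { /* e.g. the convolution filter                       */ }
static void generate_output(void)    { /* write or send the resulting image                 */ }

static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* One processor-only iteration, timed so it can later be compared
   against the version that offloads the IP algorithm to the FPGA.  */
double timed_iteration(void)
{
    double t0 = seconds_now();
    capture_image();
    scale_image();
    apply_ip_algorithm();
    generate_output();
    return seconds_now() - t0;
}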
Scaling was added because many acquisition devices support only a single
image type and resolution, but our implementation of the image processing algorithm
might require a different representation of a digital image.
The “Capture image” and “Generate output” tasks might seem straightforward,
but they are important steps in the system. In the simplest case these tasks would mean
reading or writing to and from a local memory, but in more complicated scenarios, we
might have to transfer data over a network or we might have strict memory restrictions
– these must be taken into consideration when implementing the system, because they
might have a significant performance penalty.

4.1.3.2. Selecting tasks to be run on the FPGA


We can already estimate that one of the most computationally intensive tasks from Figure 4.3 is the "Apply IP algorithm" one.
As already mentioned, we are not going to discuss the algorithm yet, however we can

14 FPGAs are often referred to as programmable logic devices


define the way the processor interacts with the FPGA. We can free the processor from
the responsibility of processing the image, by executing the processing part on the
FPGA. Unfortunately, this induces overhead, because the images must be transferred
between the two devices even if physically they are on the same chip. This behavior is
visualized in Figure 4.4.

Figure 4.4 Delegating work from the UC to the FPGA: the UC captures and scales the image, transfers it to the FPGA (Read image → Apply IP algorithm → Write image), then the result is transferred back and the UC generates the output


The resulting diagram from offloading work to the FPGA presents how the same
tasks are distributed among the two physical components. We have also added two more tasks, "Read image" and "Write image" respectively, as well as two more arrows, labelled "Transfer". These new items are required for
communication between the two components and they are an easy way of representing
overhead.
During implementation, we will have to individually measure time required for
the image transfer, as well as separately measure the time required for the image
processing itself.

The following parts will explain the operating principles of the different
components described here, starting from acquisition and data transfer, as well as the
image processing part and finally the visualization of the image.

4.2. Image Acquisition


In the Overall Architecture part, we have defined the architecture of the system,
as well as the main components that we must design and implement. The first such
component is represented by the image acquisition. We must define possible solutions
for the choice of hardware that will acquire images and we will see how this device
interacts with the system on chip circuit.
In many FPGA-based image processing projects, it was supposed that the image
was already in the memory of the FPGA. If we compare the performance of such a
system to an implementation in other types of hardware (e.g. CPU or GPU), where
placing the image in memory takes some time, we get unrealistic and unrepresentative
results. Therefore, we must take into consideration the process of "getting" images.
We will also define methods for capturing images from the acquisition device,
so that we can apply the image processing algorithm on the specific captured image.


4.2.1. Acquisition device
In a more realistic embedded device – e.g. one used in an ECU15 of an
autonomous vehicle – the acquisition device would be directly connected to the FPGA.
This would not require the images to be transferred from the processor to the FPGA,
instead the FPGA would directly access the image. This would eliminate much of the
communication overhead and the latency of the application would be lower. Smart
cameras are an example of this behavior, where the acquisition device is much closer
to the processor – or they are on the same physical unit.
Unfortunately, we cannot use such industrial and high-performance cameras.
The first reason is that such a camera is not available for this project and the second
reason is that even if we had a camera intended for embedded image processing, we
would have to implement an interface to that camera, which is beyond the scope of our
project. Instead we will find different alternatives, that may or may not generate extra
overhead, but they are accessible to us. These are to be discussed in the next paragraphs.
We have identified two relatively simple methods of reproducing a high-
performance embedded camera:
• USB webcam
• Persistent storage device
The possible usage of these devices is detailed below.

4.2.1.1. USB webcam


We can use a low-cost USB webcam that is compatible with most devices that
have a USB port. Our SoC device will need appropriate drivers to communicate with
the webcam. Fortunately, most system on chip devices are shipped with an operating
system (mostly Linux) that already supports these cameras. If there is no driver support
for the given webcam, we can still most probably download a driver from a third-party
supplier.
A common webcam usually streams 30 images per second. Most of today’s
webcams have HD resolution of 720p (1,280 ∗ 720 pixels) or 1080p (1,920 ∗ 1,080
pixels) and generate color images.
Even if we consider the streaming rate of the webcam to be acceptable (30
frames in a second), the latency induced by transferring the image from the camera over
USB and then loading it into the main memory of the processor can be considerably
high for a real-time application.
When evaluating the system, we should be able to measure the overhead caused
by the webcam compared to using a more performant device. It would also be interesting to measure the time needed by the processor to load an image from the I/O device – that is, the time needed for completing the I/O and memory-related
operations performed by the processor. This way we can simulate a real system that
does not have low-performance peripherals.

4.2.1.2. Persistent storage


Most microcontrollers and system on chip devices have a secondary, persistent
memory device, where larger amounts of data can be stored. This storage is usually
implemented as flash memory – in a general purpose personal computer, this device
would be the hard drive of the computer. In more advanced devices that have a running

15 Electronic Control Unit – mostly used in vehicles


operating system, this storage can be accessed over the operating system’s file system
– this storage is also where the program and user data files are stored.
Instead of using a pluggable camera, we can gather images from a different
source and save them to the device’s permanent storage. When the image processing
application is executed, we can load the images into memory by simply reading the
contents of a file. This is much simpler than using a camera because all the functionality
of opening the file and reading its contents to the main memory are handled by the
operating system (if it exists).
The performance of this solution would be much better than the previous one,
because reading from a file is considered to be much faster than reading from a
peripheral device, even if both are considered input-output devices.
The only downside of using this approach is that we cannot test our system with
live data. Instead, pre-defined images will be used (we can reuse the same image several
times).

4.2.2. Image capturing
In section 4.2.1 we defined several ways that can be used to acquire images. We still need to explicitly define what "capturing" an image represents.
We can take the example of a webcam that acquires several images every
second. These images will not be automatically processed. In fact, they will not even
necessarily be available to our image processing system. Most webcams stream images
at a given rate and always keep the most recent image in a buffer. We must keep in
mind that only the most recent image is kept in the buffer and the rest of the (previous)
images are discarded.
If we want to process an image we must first capture it, i.e. copy it from the
buffer to a location in the main memory that is accessible from the image processing
application. Thus, the action performed to access one particular image to be processed
from the stream of incoming images is defined as capturing.

4.2.2.1. Capturing modes


We can specify two capturing modes:
• live mode;
• offline mode – “reuse last captured image”.
In live mode, at the beginning of each image processing iteration, the most
recent image from the image buffer will be captured. If the source of images is a
webcam, then we will be implementing real-time image processing. In this mode, the
system could react with a low delay to changes in the scene that were acquired by the
webcam. For example, if we use an obstacle detection algorithm, and the webcam
acquires images from a real obstacle, the system would be able to react within a reasonable delay and detect the obstacle.
In the other, offline mode, that can also be called “reuse last captured image”,
we suppose that an image was already captured and saved into the main memory at
some previous point. Instead of reading a new image, we reuse the same old image – in
other words we entirely skip the capturing process and do not update the input image
of the algorithm. In this mode, we must make sure that the images are not discarded
between different iterations.

We would like to be able to dynamically switch between the two modes. If we


start the application in live mode, using a webcam, and later we switch to offline mode,

36
Chapter 4

we want the input image to “freeze”. All future iterations of the algorithm will use the
last image that was captured before switching to offline mode.
To be able to start the application in offline mode, we must make sure that an
image is already available in the memory – this can be done by executing a “fake”
capturing operation before the application starts.
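A possible way to express the two capturing modes in software is sketched below (our own C illustration with hypothetical names; the buffer-access routine stands in for whichever driver call the platform offers):

#include <stdbool.h>
#include <stdint.h>

#define IMG_BYTES (640 * 480)              /* illustrative grayscale image size */

static uint8_t last_captured[IMG_BYTES];   /* survives between iterations       */

/* Placeholder for the real driver call that copies the newest frame
   from the acquisition buffer into dst.                              */
static void read_newest_frame(uint8_t *dst) { (void)dst; }

/* Live mode captures the newest frame; offline mode simply reuses the
   last captured image and skips the capture step entirely.            */
const uint8_t *capture(bool live_mode)
{
    if (live_mode) {
        read_newest_frame(last_captured);
    }
    return last_captured;
}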

4.2.2.2. Capturing performance


At this point we can define two more notions: the acquisition frequency and the
capturing frequency. The first one defines the rate at which the acquisition device
pushes new images to the image buffer and the second one refers to the rate at which
images are captured from the image buffer. These two rates are ideally synchronized
but they can also be different. In our implementation we will focus only on the capturing
frequency, that is defined by the overall frequency of the system.
Note that if the capturing frequency is two times higher than the acquisition one,
then the same image will be processed two times. In the opposite scenario, when
acquisition is faster than capturing, only every second image will be captured, and half
of the images will be discarded.

If we want to measure the performance of the system without taking into


consideration the capturing operation, we can simply switch to the already defined
offline mode. This gives us an easy way to temporarily eliminate (or greatly reduce)
the overhead of image acquisition and image capturing.

4.3. Image and Data Transfer


So far, we have defined means of generating a stream of input images, as well
as capturing one image that shall be processed. All these operations take place on the
processor (and the acquisition devices, e.g. the webcam). Because the actual image
processing will take place on the FPGA we have to define how to transfer the image
between the processor (the processor’s main memory) and the FPGA.
In this section we will detail how to transfer the image from the UC to the FPGA
and similarly, how to transfer the image from the FPGA to the UC. These two
operations are usually symmetric, and we will only discuss them once.
In some cases, however, the FPGA to UC transmission can be much simpler,
when the output of the algorithm is some property of the image (and not another
transformed image). As an example, we can compute the mean of the intensity values
in a grayscale image: the result will be one single value (the mean), represented in just
one or a few bytes. As a result, this operation is much faster than transferring back a whole image. Because of its simplicity, and because most algorithms that we discuss are image transformations, we do not discuss this case further here.
Because the acquisition device and our image processing algorithm might
represent images in different formats, we might need to transform the image first to a
different representation. Only after this operation can we transfer the image between
the two components. Both operations are detailed in the following part.

4.3.1. Digital image representation


At this stage, the captured image is already in the processor’s main memory.
The format of this image is defined by the acquisition device (e.g. webcam or the format
that was used to save an image to a file), so we have no control over it. Reprogramming


the acquisition device or replacing it is usually not possible, so we have to use the
images that are given to us.
However, in the next parts of the system, we might want to use several types of
image processing algorithms. These might require other types of images. We do not
want to restrict our system to only be able to use the image format provided by the input
devices, so we might have to first apply a simple transformation to the captured images.
We can suppose that all images are represented in memory as a matrix, having
the size (𝑀, 𝑁), which defines the resolution – see Chapter 1.2 for more details. For
color images, each pixel is represented by three values, for grayscale images, however
only one value is required. Each such (pixel) value can also be represented with
different precision – i.e. each value can be represented by one or several bytes in
memory.
We can already see that there are several parameters used to define the "type" of the image. These can all be different for the capturing and the image processing parts. To summarize, these parameters are enumerated below (a small illustrative data structure is sketched after the list):
• Resolution, e.g. 256 ∗ 256
• Samples/pixel – number of values required to represent one pixel, e.g.
1 sample for grayscale and 3 samples for color images
• Sample depth (bitness) – size of one sample, e.g. 8 bits/sample or 32
bits/sample, for very high-quality images
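Grouped together, these parameters could be described as follows (a minimal C sketch with names of our own choosing, not an existing API):

#include <stdint.h>
#include <stddef.h>

/* Descriptor of a digital image, following the parameters listed above. */
typedef struct {
    uint32_t width;              /* resolution, e.g. 256                    */
    uint32_t height;             /* resolution, e.g. 256                    */
    uint8_t  samples_per_pixel;  /* 1 for grayscale, 3 for color            */
    uint8_t  bits_per_sample;    /* sample depth, e.g. 8 or 32              */
    uint8_t *data;               /* row-major pixel data                    */
} image_t;

/* Size of the pixel buffer in bytes, derived from the parameters above. */
static inline size_t image_size_bytes(const image_t *img)
{
    return (size_t)img->width * img->height
         * img->samples_per_pixel * (img->bits_per_sample / 8);
}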

In the ideal case, these parameters are the same for the acquisition device and
for the image processing algorithm. For our research project, we do not have the
resources to choose between several acquisition devices, therefore the parameters of the capture device are fixed. To solve this mismatch, we have to transform the images,
for example to change the resolution (scale operation) or change the representation from
color to grayscale.
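For example, the color-to-grayscale transformation can use the usual luminance weighting (a sketch assuming 8-bit interleaved RGB input; the weights are the common ITU-R BT.601 convention, not something specific to this project):

#include <stdint.h>
#include <stddef.h>

/* Convert an interleaved 8-bit RGB image to 8-bit grayscale. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t num_pixels)
{
    for (size_t i = 0; i < num_pixels; i++) {
        unsigned r = rgb[3 * i + 0];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];
        /* 0.299 R + 0.587 G + 0.114 B, in fixed point (divide by 256) */
        gray[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}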
These operations can introduce significant delays and they can decrease the
performance of the overall system. Unfortunately, we do not have a proper workaround
for this issue. The only thing that we can do is to individually measure the execution
time of these operations and subtract them from the overall execution time. This way
we can estimate the performance of a system in which the type of the captured image
and the processed image match.

4.3.2. Data decomposition and streaming


It is relatively hard to design the aspects of data streaming between the processor's memory and the FPGA without knowing the exact type of the system on chip device. We will define the actual SoC in the next chapter instead. Therefore, in this part we are going to present general ways that can usually be used to transfer (or stream) data between the two logical devices.
Most SoCs have dedicated buses that can be used by multiple readers and
writers to share data. If these are available, we can use them to interface the processor’s
memory and the FPGA.
Another type of link that can exist between the two components are dedicated
channels (or links) that offer direct, point-to-point communication. These operate under
the principles of a FIFO16 queue and usually offer only unidirectional communication
(as opposed to a bus). Because FIFOs usually have only one reader and one writer,
synchronization becomes much simpler.

16 First In First Out


To read and write data from and to a channel or bus, buffers have to be allocated
on both sides of the links. This way the processor (or FPGA) knows from where to send
the outgoing data or where to write the incoming data.
In some cases, we can directly use the main memory instead of a dedicated
buffer. This is called direct memory access, or DMA. When DMA is available, it can
become easier to access large amounts of data. Also, in some cases, the memory used
for DMA transfer can be a shared memory (shared between the FPGA and processor),
which increases performance even more.

Links between the two components are usually serial, so we have to send the
image as a stream of data. If more than one physical link is available to send the image,
we might consider using several links in the same direction, by transferring the image
in parallel. This could reduce the transfer time, but the image will have to be
decomposed – and at the receiving end it will have to be reconstructed. This also implies
extra synchronization but could still improve the overall algorithm.
Fortunately, decomposing an image is fairly simple, because it is easy to split a matrix into several equal parts. For example, we can split an image in two parts by sending the first M/2 rows over the first link and the remaining rows over the second link (M represents the number of rows). We can also decompose by columns, by sub-matrices or in other, more sophisticated ways.
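A row-wise split over two links could look like the sketch below (our own illustration; the write primitive is a placeholder for whichever FIFO or DMA channel the platform provides):

#include <stdint.h>
#include <stddef.h>

/* Placeholder for the platform-specific per-link write (FIFO or DMA). */
static void link_write(int link_id, const uint8_t *data, size_t bytes)
{
    (void)link_id; (void)data; (void)bytes;
}

/* Split a row-major grayscale image of M rows and N columns in two halves
   and send each half over its own link, so they can travel in parallel.   */
void send_image_two_links(const uint8_t *image, size_t M, size_t N)
{
    size_t half = M / 2;
    link_write(0, image, half * N);                    /* first M/2 rows  */
    link_write(1, image + half * N, (M - half) * N);   /* remaining rows  */
}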

Sending data is usually done with the aid of a communication protocol. Most protocols, however, add some supplementary control data, which generates communication overhead. To reduce this overhead, it is recommended to send large chunks of data at once (instead of sending an image pixel-by-pixel). We must also pay attention not to send too large amounts of data at once, because of the limitations of the used communication channels, or because the memory buffers might overflow.

4.4. Processing
So far, we have prepared nearly every aspect that we need to finally be able to
process the images. In this stage the image is in the specified format and is already
accessible by the FPGA.
The algorithm that we will use is the Convolution Filter that was already presented in section 3.2.2. The kernel that is used for the convolution should be parametrizable, i.e. we should be able to easily change it if we want to test several kernels.
Initially we want to use the Gaussian blur, but several other kernels can also be
used. In fact, the used kernel type is not relevant at all – it should just be easy to verify
the correctness of the algorithm. For example, after applying the Gaussian kernel, the
output image should be smoother and blurrier. Also, in some cases we will have to apply
the division operator – this will be detailed in the implementation part.
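For instance, the 3 ∗ 3 Gaussian kernel from section 3.2.2.3 has coefficients that sum to 16, so the convolution result has to be divided by this sum to keep the output in the original intensity range (this is the division operator mentioned above):

\sum_{u,v} H(u,v) = 1 + 2 + 1 + 2 + 4 + 2 + 1 + 2 + 1 = 16, \qquad
I_D(i,j) = \frac{1}{16} \sum_{u=0}^{2} \sum_{v=0}^{2} H(u,v)\, I_S(i+u-1,\, j+v-1)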
The presented examples all use 3 ∗ 3 kernels. We will also start the development
using this size. However, once we have a stable product (that possibly meets the
objectives), we will start gradually increasing the kernel size. This is required, because
we want to measure the scalability of the system. Applying a larger kernel is also
important, because we have to test the system in computationally more intensive
scenarios. Another reason why it is important to try to use large kernels, is because it
might not be worth using an FPGA for a really small kernel: the communication
overhead would be too high compared to the speedup of the image processing part itself.


4.5. Display
The embedded image processing device that we are reproducing would
normally not be able to present images in a visual way. Therefore, displaying the
resulting image of the algorithm (or displaying a computed value that was generated by
the image processing algorithm) has only debugging and validation purposes. We
would like to be able to manually verify the correctness of the system: for example, if
we apply a smoothing kernel in a convolution filter algorithm, we expect the resulting
image to be “smoother”.
Adding a visual representation of the output can greatly aid the development
process and it is also much easier to demo the application.
Displaying the image does not have to be done on the system on chip device. In fact, this is usually not even possible, because we would require special display devices.
Instead, we can send the result of the algorithm over a network to a different device
(e.g. a general-purpose PC) and display the image there.
We should keep in mind that displaying the image should not have a major
impact on our system. If, however, displaying the image generates a large overhead, we
should be able to turn off this feature when measuring the performance of the system.

4.6. Possible hardware configuration


We will start by presenting several vendors and their products in the SoC market. We will only mention products that incorporate a microprocessor and an FPGA. Two of the best-known SoC vendors are Xilinx and Altera (now owned by Intel). There are several other vendors in the market, such as Microsemi17 or ON Semiconductor18, but we are not going to present them.
We have decided to include this part in the Analysis and Theoretical Foundation chapter and not in the Bibliographic Research chapter. The reasoning behind this decision is that we did not want to present too many hardware-specific details in the bibliography. After all, we should be able to use the identified design for several types of system on chip devices.

4.6.1. SoC vendors
Xilinx offers three types of SoCs, depending on the customer’s needs and the
complexity of the application. These categories are: Cost-optimized, Mid-range and
High-end. The cost-optimized category, represented by the Zynq-7000 SoC device,
mainly targets the education and research industry and also users who need a fast
prototyping board that is within budget and speeds up the development process (and
time to market). These boards are shipped with single- or dual-core ARM processors.
Xilinx also offers a well-known development tool for programming the FPGA of the
SoC, called Vivado. For programming the software-defined part, usually Eclipse is used
as a free and open-source IDE19 [31].
Since the acquisition of Altera, Intel has also released several SoC devices (such
as the Intel Stratix® 10 SoC, exemplified in Figure 4.5). These are less known in our
academic environment but can be an interesting alternative to the Zynq chips. Intel also

17 From Wikipedia: Microsemi Corporation was a California-based provider of semiconductor and system solutions for aerospace & defense, communications, data center and industrial markets
18 From Wikipedia: ON Semiconductor is a global semiconductor supplier company and was ranked in the 500 of the largest US corporations by total revenue for their respective fiscal years
19 Integrated Development Environment


offers a variety of development tools compatible with their devices, but we are not
going to detail them here [32].

Figure 4.5 Intel Stratix 10 TX FPGA, from altera.com

4.6.2. SoCs in academic embedded devices


SoCs alone are not usable, unless they are placed on a larger system, such as a
motherboard, a system on module20 (SoM) or a development and prototyping board.
Most boards that are targeting the academic sector (and are available to us) are
development and evaluation boards, lacking high performance. However, these boards
come equipped with a large variety of interfaces, such as USB, HDMI21, VGA22 and
Ethernet ports, I/O pins for digital and analog signal processing, audio jacks and on-
board LEDs23. Most boards are equipped with static memory (e.g. flash memory) and
volatile memory (e.g. RAM). Most boards also provide a level of abstraction over the
hardware by supporting an operating system (OS) – which is usually open-source, such
as Linux-based OSes.
The Zynq®-7000 family of system on chip devices is our best solution for the
current research project (and also the only one available). There are three boards/kits that we selected, which are shipped with this SoC:
• Zynq-7000 SoC ZC702 Evaluation Kit, sold by Xilinx
• myRIO, sold by National Instruments (NI)
• ZedBoard, sold by Avnet, in cooperation with Digilent
When looking only at the provided hardware, there is not much difference
between these boards (mainly because they use the same family of SoCs). Because we
want to use LabVIEW as a development environment, which is a tool by National
Instruments, it becomes obvious that we will use the myRIO device.

The NI myRIO's hardware and software configuration deeply influences the implementation of the solution. Therefore, we will detail the specification of this device only in the Ecosystem and Development Environment sub-chapter.

20 Small, integrated, single-board computers
21 High-Definition Multimedia Interface
22 Video Graphics Array
23 Light emitting diodes


Chapter 5. Detailed Design and Implementation


This chapter is the largest (and probably most important) part of this work. We
present the final stages of development, by first choosing the required hardware and
software environment and then expanding the analysis and design identified in the
previous chapter, to finally guide the reader through the implementation of the proposed
solution.
Because low-level applications, such as the one presented here, are highly
dependent on the chosen platform and technology, we will first justify our choice of the development environment and hardware equipment. The presentation of the
development environment (ecosystem) will be followed by a general view of the design
and implementation of the system. Based on a top-down approach, after the
presentation of the architecture, we will focus on sub-modules and smaller components
of the application.

5.1. Ecosystem and Development Environment


As already mentioned, defining the environment is a key step in developing
low-level hardware solutions. This, however, does not mean that our
design/implementation is suited only for a specific set of hardware and software
components. The methods and concepts presented here can apply to any system on chip
application targeting the field of image processing.
So far, we already know that the development will be done in LabVIEW and that we target the myRIO embedded device.

5.1.1. Development environment – LabVIEW


We want to stress again the importance of the chosen environment (LabVIEW and myRIO). Therefore, we briefly present other alternatives that could have been used.
The key factor that decided between the boards mentioned in the previous chapter is the development environment. To program the ZedBoard, one can use the
Xilinx Vivado Design Suite, that is an IDE specialized in FPGA development. The
programmable logic can be configured by specifying an HDL description of the design
(in VHDL or Verilog) and then performing the steps from Figure 5.1, to obtain a bitfile
(FPGA configuration bitstream). The bitfile can then be deployed on the FPGA part of
the SoC. These steps are automatically performed by Vivado.

Figure 5.1 Tool flow for FPGA configuration compilation, from [33] (chapter 2.1, pg. 30)
We consider hardware description languages to be "hard" to master, because a
text-based representation of the hardware can be extremely complicated and not
intuitive. There are tools that provide a graphical representation by allowing the

43
Chapter 5

interconnection of different components as a diagram. The low-level components of the


system, however, must still be specified in an HDL format.
To specify the software behavior of the SoC, usually low-level imperative
languages are used, such as C or C++. We can take advantage of the operating system
running on the given board and compile the C/C++ programs targeting that OS with an
IDE such as Eclipse. To interface the programmable logic and the software, pre-defined
libraries can be used on the software-side, and Xilinx intellectual property blocks can
be used on the FPGA side.
The programmer needs deep knowledge of embedded programming and FPGA
design, and also has to study the specifications and manuals of the given SoC with
great attention. This makes the development of SoC-based applications very
hard for beginners and slows down research progress in this field.

National Instruments provides a graphical, data-flow programming language
called LabVIEW (Laboratory Virtual Instrument Engineering Workbench). This is a
very good solution for overcoming these limitations and problems. In LabVIEW, we can
implicitly represent the system in a graphical way, and the "code" maps much better to
the underlying hardware, reducing the semantic gap between the specification of the
hardware and the actual implementation.
As opposed to text-based languages, in LabVIEW we write code by graphically
placing elements in a virtual instrument (VI). A VI has a front panel that specifies the
interface of the VI by means of Controls (inputs or parameters) and Indicators
(outputs). The functionality of a VI is defined on the block diagram. In Figure 5.2 we
can observe the block diagram of a simple VI that performs operations on an array.

Figure 5.2 Snippet of a VI's block diagram that computes f(x) = gain ∗ x + offset on each element of an array (x)
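For readers more accustomed to text-based languages, a rough C equivalent of the transformation in Figure 5.2 might look like the following (a minimal sketch only; the function name and types are illustrative and not part of the original VI):

#include <stddef.h>

/* Apply f(x) = gain * x + offset to every element of an array,
 * the same element-wise transformation performed by the VI in Figure 5.2. */
static void scale_and_offset(double *x, size_t n, double gain, double offset)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = gain * x[i] + offset;
    }
}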
One of the great advantages of using LabVIEW is that the same VI can be used
to specify software functionality and to specify the behavior of the FPGA (with some
constraints and limitations). The above VI can be run on different computers with
different operating systems, as long as these support the LabVIEW runtime engine. On the
other hand, if this VI is used for FPGA development, its contents will first be translated
into corresponding VHDL code and automatically synthesized for the specified
FPGA hardware (using the same tools that we would use in Vivado) – therefore,
LabVIEW can also be used as a High-Level Synthesis tool for FPGA development.

The following part presents the myRIO in detail and introduces the LabVIEW
concepts that will be used throughout the development of the system. In general,
LabVIEW will be presented in some detail; however, the reader is expected to have a
basic understanding of graphical data-flow code.

NI myRIO hardware and software specifications


We would normally include hardware and software specifications either in the
bibliographical study or in a final chapter that lists the different hardware and
software requirements.
However, we believe it is important to specify these aspects here, in the
Detailed Design and Implementation chapter, because, as opposed to the initial design
and theoretical parts, the implementation is highly dependent on the hardware used.

5.1.2.1. Hardware
As specified in the NI myRIO-1900 user guide and specification [34], the
myRIO-1900 is a “portable reconfigurable I/O (RIO) device that students can use to
design control, robotics, and mechatronics systems”. The hardware block diagram is
specified in Figure 5.3. We can see the clear separation between the processor and the
FPGA, even if these are on the same chip. The embedded device also includes several
peripherals, such as buttons (including a reset button), USB host and device ports,
LEDs, DDR3 (double data rate, type three) and nonvolatile memories, as well as a variety of I/O ports.
The USB host port supports most web cameras that are UVC compliant (USB
Video Device Class protocol), as well as machine vision cameras that conform to the
USB3 Vision standard and are USB 2.0 backward compatible [34]. We will use the
USB host port to connect a webcam for acquiring images.
The SoC used in the myRIO comes equipped with a dual-core 32-bit ARM
processor with a maximal frequency of 667 MHz. The device has 256 MB of RAM and
512 MB of nonvolatile memory. Both USB ports comply with the USB 2.0 Hi-Speed
specification. The device has to be powered from an external DC voltage source of
between 6 and 16 V.


Figure 5.3 NI myRIO-1900 Hardware Block Diagram, from [34]


The user guide specifies many other components of the device as well; however,
these are not required for our project.

5.1.2.2. Software
The most important aspect of the myRIO software environment is that it runs a
Linux-based real-time operating system (RTOS) developed by National Instruments, called
"NI Linux Real-Time". Therefore, we have access to a large Linux ecosystem and
to real-time functionality – this means that we can write deterministic code with strict
timing constraints. The RTOS is fully compatible with the NI ecosystem, so we do not
have to worry about compatibility issues when developing the application.
Some additional software components have to be installed on top of the default
configuration, but these components are all provided by NI. We
will include the full list of required hardware and software in the User's manual
chapter.

5.2. System Architecture


We will present the structure of the system as seen from the development
environment.

The system as a LabVIEW project


LabVIEW has a project-based approach to developing code for an
application. A LabVIEW project is a file with the "lvproj" extension and has a
tree structure. Below the root level of the tree, the user can specify the devices targeted
for development. These can be the Windows-based host computer that is used for
development or other target devices supported by NI, such as the myRIO.
In Figure 5.4 we can see the structure of our project. The root node contains
three Targets:
• My Computer – represents the current PC (this target is always present
in a project, because a host computer is required to connect to other
targets);
• NI-myRIO-1900-Gergo – this is the myRIO used as the SoC embedded
device. Note the two "RT Main" VIs that define the behavior
of the processor on the myRIO. This device also contains an FPGA
Target node, under the Chassis node, which represents the programmable
logic hardware of the system;
• Simulated myRIO Target – this target is a replica of the
previous one, but it is configured so that the VIs it contains
will be simulated on the host computer (instead of a real target).

Figure 5.4 Overview of a LabVIEW project file (lvproj)


When placing a VI under a specific target, it is compiled for that target
– for example, deploying to a 32-bit Linux-based target will generate a target-specific,
Linux-compatible binary. Running such a VI will deploy the compiled VI to the target,
where it will be executed. The front panel of such a VI can be opened on the host
computer, where the controls and indicators will be automatically updated as the block
diagram is executed on the target device. The communication between the target device
and the host (the development environment) is done over a network using the IP
protocol.
As can be seen in Figure 5.4, we have specified two myRIO targets, of
which the second is a simulated one. In the following part we will detail why this
is very important in the development process.

5.2.1.1. Simulation environment


When the execution mode of an FPGA target is set to Simulation (see Figure
5.5), we can achieve almost the same functionality as in a real environment. The
LabVIEW code will be interpreted on the processor and all the I/O operations will be
simulated. The performance of such a simulated environment is very low – processing
one image with a simple algorithm takes several seconds. The behavior and results of
the algorithms will, however, be the same as if they were executed on the FPGA.
Therefore, we can use this environment to test the correctness of our algorithms before
compiling them for a specific hardware. This approach saves a lot of time, because
running a simulated FPGA VI starts almost instantly (we do not have to wait for the
compilation).

Figure 5.5 Selecting the execution mode of an NI FPGA target


Another great advantage of using the simulated VI is that it gives us the
possibility to debug the FPGA code. We can use the powerful debugger built into
LabVIEW, which lets us place breakpoints and probes, pause the execution and execute
the code step-by-step, using "step in", "step out" and "step over" instructions. Throughout
the development, most of the FPGA debugging was done in this mode. Unfortunately,
the timing and synchronization behavior (including the execution speed) cannot be
observed in this mode, but this is a small downside compared with the benefits
already mentioned.
We have added two myRIO targets, each having an FPGA target, so that we do
not have to manually switch between the execution modes. When running the RT Main
VI from the “My Computer” target, the simulated device will automatically be selected.

5.2.1.2. Real (production) environment


In its default behavior, an FPGA VI is compiled into a bitfile that is used to
define the behavior of the hardware. In our development environment, we have installed
both a local FPGA Compile Server and a compile worker. The server receives compilation
requests from LabVIEW and delegates them to the worker – in a real production
environment, the workers are usually very high-performance computers or cloud
computers. The compile worker uses the Xilinx compilation tools, which are installed
locally (the compilation toolchain includes Vivado).
The steps performed to generate the bitfile from the LabVIEW FPGA VI are
taken from [35] and are also shown below (note that “compile worker” refers to the
Vivado application that was installed with the Xilinx compilation tools):
1. Generation of intermediate files – LabVIEW converts the FPGA VI
into intermediate files (HDL code) to send to the compile server;
2. Queuing – The compile server queues jobs and sends the intermediate
files to the compile worker for compiling;
3. HDL compilation, analysis, and synthesis – The compile worker
transforms intermediate files (HDL code) into digital logic elements.
4. Mapping – The compile worker divides the application logic between
the physical building blocks on the FPGA;
5. Placing and routing – The compile worker assigns the logic to physical
building blocks on the FPGA and routes the connections between the
logic blocks to meet the space or timing constraints of the compilation;
6. Generating programming file – The compile worker creates binary
data that LabVIEW saves inside a bitfile;
7. Creating bitfile – LabVIEW saves the bitfile in a subdirectory of the
project directory and can download and/or run the application on the
FPGA VI.
As one can probably imagine, performing the steps mentioned above can be a
very long process, requiring large amounts of memory. In the early stages of our development,
several compilations failed due to insufficient memory, extremely long compile times
(several days) or because timing and resource constraints on the FPGA were not met.
In later stages of development, most of our VIs were optimized, allowing compilation
times below 20 minutes.
Once the steps needed to compile an FPGA VI are successfully completed, the
bitfile can be deployed on the target device. The VIs that are going to be executed by
the myRIO's processor must also be deployed. Therefore, we need to connect the host
computer (the development PC that contains the LabVIEW project and the compiled
application) to the myRIO via a USB cable. When both devices are configured
properly, a LAN (local area network) is created over the USB connection and IP
addresses are assigned to the host and target devices. We can then open a connection to
the target by specifying the IP address of the target device in the LabVIEW project.
Once the connection is made, VIs, bitfiles and other deployment items can be
transferred from the host to the target.

"Main" VIs and top-level view


We have already identified the main components of the LabVIEW project file,
and we are now going to detail how the system can be started from the project. We have
split this explanation into two parts: first we present how the execution of the
application can be initiated, and then we describe the VI that represents the entry
point of the system.

5.2.2.1. Starting the application


There are two ways of starting a LabVIEW application on a remote target that
is connected over the network to the host PC. For both, we must choose a main
VI that should be executed first. This is similar to specifying a "main" function in a
C/C++ application or the "public static void Main()" method in the C# language. This
VI must be placed under the specific target in the LabVIEW project – as a reminder,
we only work with VIs in the context of a project.
The first method is to simply run the VI, as we would run it under "My
Computer". The deployment (and compilation, if needed) will start shortly, and once
all deployment items are transferred, the main VI is executed remotely on the target. The
contents of the front panel will, however, still be updated on the host by an automatic
mechanism that polls the target device to acquire the latest front panel values.
This induces some communication overhead for the target device, but it is
unnoticeable for front panels that contain small amounts of data. In this mode, it is
also possible to remotely debug the block diagram or its sub-VIs (obviously at the cost
of some performance degradation).
The second choice for starting up the system is to create a new "Real-Time
Application" build specification in the project and set the main VI as the build
specification's startup VI. As its name suggests, a build specification can be built,
resulting in a folder that contains all the compiled items, dependencies and deployment
items that are needed by the application. Therefore, in this mode, everything is
"precompiled", which saves some time. On the other hand, starting the VI is somewhat
less intuitive, because we have to set the build specification as the default startup item
for our target device. Once the device is restarted, it will automatically start executing
the main VI.

We will mostly use the first approach because it implicitly lets us visualize the
front panel of the VI, which helps us in debugging and also lets us manually (visually)
verify the correctness of our image processing algorithms by displaying the contents
of the processed image. We also created a build specification, but this is mostly
intended for a "releasable" product and does not suit the requirements of a research
and development project. If our solution were offered to the market, creating a real-time
executable or shared object would probably be the most appropriate way.

5.2.2.2. Top-level view


In the previous part we defined how to start the first VI; however, we have
not yet defined the contents of that VI. In this part we present the top-level or
"main" VI.
We start from a template VI for myRIO development, provided by NI. The
template VI contains three major parts: the initialization, processing and finalization
phases. A simplified version of the template is provided in Figure 5.6. The "FPGA
Target" item is configured to reference the VI that is set as the FPGA's main VI, which
is automatically compiled and deployed when running the template. The Main Loop in
the figure is currently empty, but it will shortly be populated.

50
Chapter 5

Figure 5.6 Template VI for myRIO development using custom FPGA personality
We can notice that the previous example is very similar to a Data Acquisition
(DAQ) and Control application, where we first initialize the system and then
continuously read, process and write data (in a while loop). In the following part, we
will present the three main parts from Figure 5.6.

A. Initialization
In the initialization part, we introduced a Conditional Disable Structure, which is
similar to preprocessor directives in C/C++. The structure has two different behaviors
(implemented in two different subdiagrams). When executed on the host development
PC (running a Windows OS), we open an FPGA reference to a VI that is placed under
the Simulated FPGA target – this allows us to automatically execute the application in
a simulated FPGA environment on the host computer. When the Conditional Disable
Structure is executed on the target device, which runs a Linux operating system, we
load a reference in the default way, to the actual FPGA, so we have a real, production
environment. The condition that determines which subdiagram is executed is a string
that is interpreted before the VI is compiled; the two conditions are written below, and
a rough C analogy is sketched after the list:
• “OS==Linux” – when evaluated to true, we open a reference to the real
FPGA
• “OS==Win” – when evaluated to true, we will simulate the behavior of
the FPGA
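As a conceptual analogy only (the macro and function names below are illustrative and are not LabVIEW or NI APIs), the Conditional Disable Structure behaves like conditional compilation in C, where the branch is chosen before the code is built:

#include <stdio.h>

/* Stand-in for opening an FPGA reference; purely illustrative. */
static void open_reference(const char *which) { printf("Opening %s\n", which); }

int main(void)
{
#if defined(__linux__)                 /* "OS==Linux": executed on the myRIO target */
    open_reference("real FPGA bitfile");
#else                                  /* "OS==Win": executed on the host PC        */
    open_reference("simulated FPGA VI");
#endif
    return 0;
}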
The initialization section is also responsible for opening a connection to
the image acquisition device, as well as for creating any other references and executing
setup instructions – these will be presented in detail when we start describing the
different components of the system, in the next subchapters.

B. Processing
The data processing part is responsible for most of the work done by the
application. In this part we continuously execute a loop that acquires images, transfers
them to the FPGA and then transfers the resulting image (or image attribute) back to
the processor. These are the main responsibilities, which are also visible in the
previous figure and in Figure 5.7. These tasks will be presented in more detail in the
following subchapters.

Figure 5.7 The main responsibilities of the main VI: acquire image, transfer to FPGA, transfer from FPGA

51
Chapter 5

Besides the main responsibilities, there are several other tasks that must be
performed in the main loop. We have to update the indicators that present the acquired
input image and the resulting output. Because the data that populates these indicators
comes from the target device and we want to display it on the host, a large
amount of data has to be transferred between the target and the host over the LAN.
Although the provided USB is capable of transferring hundreds of Mb of information
per second, the latency and computational overhead on both devices are significant.
Therefore, we placed a boolean control on the front panel, which lets the user
deactivate displaying the images on the front panel.
To measure the performance of the application, we compute the elapsed time
between two iterations of the processing loop. This is done by reading a millisecond
counter. We subtract the value read in the previous iteration from the value read in the
current iteration, which gives the elapsed time between the iterations, also known as the
execution time of one iteration (including all the communication and additional
overhead). To measure the frame rate (FPS) of the application, we compute the inverse
of the elapsed time; the factor of 1000 converts milliseconds to seconds:

FPS = 1000 / (current time − previous time)   [Hz]
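The same calculation, written as a small C helper for clarity (a sketch under the assumption that the counter is a free-running 32-bit millisecond tick, which is how we use it here):

#include <stdint.h>

/* Frame rate of one iteration, computed from two readings of a free-running
 * millisecond counter (previous and current iteration); the factor 1000
 * converts milliseconds to seconds, so the result is in Hz (FPS). */
static double iteration_fps(uint32_t previous_ms, uint32_t current_ms)
{
    uint32_t elapsed_ms = current_ms - previous_ms;
    return (elapsed_ms > 0) ? 1000.0 / (double)elapsed_ms : 0.0;
}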
C. Finalization
The finalization phase is the simplest and has the responsibility of closing any
references that were previously created. Skipping this phase could put the myRIO in a
faulty state – even if the currently executing VI is closed after finishing execution, the
LabVIEW process is not closed; therefore, unwanted items can remain in the process's
memory. In this phase we also display any errors that occurred during execution.

5.3. Image Acquisition


To create a stream of images we are going to use a traditional USB webcam
connected to the USB host port of the myRIO (note that the USB device port is
connected to the development PC). We will first create a session to the camera and
initialize the images, then we will capture data from the camera in the main loop. These
tasks are presented in the next part.

Camera session
We use the NI Vision Acquisition Software to create a session to the camera
and enable capturing images from it. These steps are presented in Figure 5.8.

Figure 5.8 Opening and configuring a camera using NI Vision Acquisition Software VIs
Specifying "cam0" as the input device will automatically select the first
available webcam that the system detects. In the simulation environment, this will select
the first camera available on the host computer – we are using a built-in webcam for
this environment. In the real scenario, the webcam connected to the myRIO board
will be used.
We are using a Logitech QuickCam Pro 5000, which provides color images with
a resolution of 640 ∗ 480 pixels at a frequency of approximately 30 FPS.

Image initialization
We will have to declare and initialize the input and output images in LabVIEW.
This is presented below, in Figure 5.9.

Figure 5.9 Declare and initialize the input and output images
We will place the captured data from the webcam into the input image, while
the output image will contain the result of the processed image. This way we can display
both images, so that it is possible to visually compare the initial image with its
transformation.
Both subVIs in the figure allocate a grayscale 8-bit image with a resolution
of 256 ∗ 256 pixels. We have chosen the resolution to be a power of two so that
memory addressing is easier; however, other values are also acceptable. To
enable the execution to work without a camera, the input image is populated with data
from the myRIO's permanent storage. This can be very helpful when measuring
performance, because we can disable the data acquisition part, which generates a large
overhead (that would not be present in the case of a camera embedded in the myRIO).

Image capturing
We have placed a boolean control on the front panel of the main VI, called
"Live". When this boolean is set to true, we acquire the most recent frame that the
camera session provides – this is called the Snap operation.
Recall from the previous parts that the camera provides a new
image 30 times a second, that is, roughly every 33 milliseconds. If less than 33 ms
elapses between iterations, then we will most probably snap the same image
consecutively. On the other hand, if the frequency of the main loop is lower than the
frequency of the acquisition device (30 Hz or 30 FPS), then some images provided by
the acquisition device might be lost/skipped. This is not an issue, but it is good to keep
these concepts in mind.

53
Chapter 5

Figure 5.10 Capturing an image


After acquiring the frame, we transform it into an 8-bit grayscale image and scale
it to the 256 ∗ 256 resolution. At this point we have a reference to an image that is
compatible with the FPGA-based implementation. Because the next step (transferring
the image to the FPGA) requires the image to be represented as an array of characters,
the final part of image capturing is acquiring the Image Pixels from the
image reference. This process can be seen in the rightmost subVI in Figure 5.10.

5.4. Image Transfer using DMA FIFO Channels

Ways of transferring data between the FPGA and the host device
National Instruments defines three ways of transferring data between the FPGA
and the host device – in our case, between the myRIO FPGA and the myRIO LabVIEW
application process. According to [36], these are the following:
• Programmatic Front Panel Communication
• Direct Memory Access (DMA)
• User-Defined I/O Variables
Using the front panel can work for small sets of data and has a low call overhead,
however at the cost of higher CPU usage. This method is mostly used to pass
configuration data, report status from the FPGA or transfer single-point data. It is,
however, not recommended for passing large amounts of data because of the low
throughput. Another downside of this approach is that the user has to implement a
synchronization mechanism – e.g. to pause execution if data is not yet available, or
resume it when data transfer can be initiated.
For transferring large sets of data, it is recommended to use DMA. This
approach has a much higher throughput and also a lower call overhead. Another
advantage is the built-in synchronization mechanism. DMA communication is based
on a FIFO mechanism: a buffer is allocated on each endpoint of the transfer. Sending
data from device A to B means that we read the content of A's buffer and place it in
the DMA FIFO channel. The NI DMA Engine will place the data in B's buffer, from
which B can read it. An example of such a communication is provided in Figure 5.11,
where we transfer data from the FPGA to the host.

54
Chapter 5

Figure 5.11 Illustration of DMA FIFO transfer from the FPGA to the host, from [37]
Using User-Defined I/O Variables is similar to the first option but has a
lower host CPU usage and provides automatic synchronization. However, the performance
and throughput of this method are much worse than those of the FIFO-based method.

It becomes obvious at this point that the best way to transfer the image arrays
between the two components of the SoC is to use Direct Memory Access.

DMA FIFO implementation

5.4.2.1. DMA channels


To implement the DMA FIFO communication, we first have to declare the FIFO
channels. The SoC device has 16 available channels that can be configured, and each of
them is unidirectional. Therefore, to have bidirectional communication, we have to use
at least one channel in each direction. In general, we want to maximize performance
while minimizing resource usage. In our case we have decided to use two channels for
each direction. The reasoning is that this offers much better performance than using a
single channel, while using more than two channels did not seem to decrease execution
time by much. This is probably because the communication also requires significant
CPU overhead. Therefore, using two channels per direction was a good choice: the
processor has two cores, so each core can handle one channel.
We will allocate the following FIFOs (for simplicity we have named the FIFOs
“A” and “B”):
• RT to FPGA FIFO A
• RT to FPGA FIFO B
• FPGA to RT FIFO A
• FPGA to RT FIFO B
Configuring a FIFO is done by specifying the type of the FIFO, the requested
number of elements, the data type of these elements and the number of elements to
read or write simultaneously when accessing the FIFO. These configurations are
slightly different between the "RT to FPGA" and "FPGA to RT" FIFOs; however, the
"A" and "B" FIFOs of the same direction are configured identically.

55
Chapter 5

The common configuration values are the element size, which is set to 1 byte,
representing an unsigned character (8-bit grayscale value), and the number of elements
to be read or written, which is set to 1. The differences are detailed separately in the
next part.

5.4.2.2. Host to FPGA communication


The first step in transferring the data is to decompose the 2D array (which
represents the image) into two equal parts. Each decomposed part is serialized
into a long 1D array, which is much easier to transfer on the channels. We have
chosen to decompose row-by-row, so that each odd row is sent via FIFO A and the
remaining even rows are sent on FIFO B (see Figure 5.12).

Figure 5.12 Decompose image in two parts and transfer it in parallel on two DMA FIFO channels
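In C terms, the decomposition performed on the host before writing to the two channels could be sketched as follows (an illustration only; the array layout and function name are assumptions, and the actual write is done with the LabVIEW DMA FIFO Write method):

#include <stdint.h>

#define ROWS 256
#define COLS 256

/* Split the image row-by-row: odd rows (1st, 3rd, 5th, ...) go to FIFO A,
 * even rows to FIFO B. Each half is serialized into a 1D array of
 * ROWS/2 * COLS = 32,768 elements that is then written to its DMA channel. */
static void split_for_dma(const uint8_t image[ROWS][COLS],
                          uint8_t fifo_a[ROWS / 2 * COLS],
                          uint8_t fifo_b[ROWS / 2 * COLS])
{
    for (int r = 0; r < ROWS; r++) {
        uint8_t *dst = (r % 2 == 0) ? fifo_a : fifo_b;
        int half_row = r / 2;                 /* row index inside the half image */
        for (int c = 0; c < COLS; c++) {
            dst[half_row * COLS + c] = image[r][c];
        }
    }
}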
The "Write" methods send the incoming arrays by writing the elements one-by-
one on the actual physical channel on the SoC. The number of transferred elements is
given by the image resolution: 256 ∗ 256 pixels = 65,536 elements, or 32,768 elements
per channel. At the FPGA side, these values are placed in a buffer that is much
smaller than the number of sent elements. We have allocated a buffer of 256 elements
per channel, so that one row can be buffered at a time.
This communication mechanism includes built-in synchronization. If the buffer
is full, then the transfer will stop and wait for a specific duration, specified by the
"Timeout" control. In our implementation, we have set the control value to −1, so that
the transfer waits indefinitely for the buffer to free up space in case of slow
communication.
In reality, this is rarely the case, because the FPGA can read from the buffer
much faster than the host can write to it. This makes the 256-element buffer large
enough, while saving important FPGA resources.

5.4.2.3. FPGA to host communication


The FPGA-to-host communication is symmetric to the previous one, but there
are still some important differences. The first difference is that using the "Write"
method provided in the FPGA VI, we can only send one element at a time – as opposed
to the previous example, where we sent half an image (32,768 elements) at a time.
Secondly, we have to allocate a much larger buffer on the host side. This is required
because the FPGA writes to the FIFO much faster than the host can read from the
FIFO. As a result, we have allocated buffers large enough to hold an entire image
(32,768 elements per buffer). The large buffer does not affect the performance of the
host, because the myRIO's processor has a much larger memory capacity than the
FPGA. This approach is also faster, because the FPGA can send a processed image
even if the host is not yet ready to receive it.

5.5. Image Processing on the FPGA


So far, we have implemented all the required host-side logic (that is executed
on the CPU) and we have also specified how to read and write an image at the FPGA
level. We are going to implement the remaining parts of the programmable logic
(FPGA).
As a reminder, recall that the behavior of the FPGA is defined by a
special VI that is transformed into VHDL code and synthesized for the given Xilinx
target. Because an FPGA has serious hardware limitations compared to a CPU-based
platform, several restrictions apply to the set of elements, subVIs and structures that
may be used under an FPGA target. We must also keep in mind that strict resource and
timing constraints apply when developing on an FPGA.
We will present the FPGA-based development (and implementation) in a top-
down way. First the overall architecture is elaborated, followed by the details of several
components. At the end, we will present many improvements that had to be applied to
meet the restrictions imposed by the FPGA hardware.
Throughout the development, we first test most versions of the application in
the simulated environment and then, if the simulation yields positive results, we compile
the system and use the real environment.

General structure
As in most signal processing applications, we can structure the FPGA image
processing into three parts: acquisition, processing and acting/producing an output. In
section 5.4, we have already defined how to read and write the image. All we have to
do is specify how we store the image locally and how the IP algorithm is implemented.

To increase the performance of the FPGA, we split the three identified parts into
three independent loops. This allows LabVIEW to generate VHDL code that is more
performant. Figure 5.13 shows the independent loops that can be executed in parallel.
The figure is just a template – in the actual implementation the "#TODO" comments
are replaced with actual LabVIEW code.

57
Chapter 5

Figure 5.13 Independent loops in the FPGA VI


In the following parts we will define how to implement the following concepts:
• Saving image to local memory
• Applying the convolution kernel
• Synchronization
Also, several optimization-related techniques and “tricks” will be presented.

Storing the image in a local memory

5.5.2.1. Memory Design


Because most IP algorithms require the image to be in memory, we also have to
save the incoming image from the buffer into a local memory. The size of the memory
should be large enough to hold the entire image. Therefore, it should have an address
space of 65,536 elements, and the depth of each element is 1 byte (an 8-bit value).
Because 65,536 = 2^16, the memory can be addressed by a 16-bit value.
In a similar way to defining FIFOs, there are several additional parameters that
have to be specified. The first property refers to the implementation type of the memory.
This can be either a Block RAM (BRAM) or Look-up table (LUT) implementation.
Most FPGAs contain pre-built memory blocks that can be used to implement BRAMs
without affecting the resources of the FPGA. In the case of a LUT however, the logic
gates of the circuit are used to implement the memory. This usually reduces latency,
but a large amount of important FPGA resources would be lost. Therefore, we will use
the Block RAM implementation.
The BRAM will be configured with dual-port read access, so that two values
can be read in the same clock cycle. To increase the performance even more, we set the
number of cycles of read latency to the maximal value, 3. This means that internally the
BRAM access is pipelined, and several clock cycles are needed to read a value from
the memory. This introduces a minor delay but also allows higher clock rates. Because
it is possible to read and write the memory in the same clock cycle, we also specify that
arbitration should always happen when multiple writers try to access the memory or
several readers try to read from it. This can introduce a minor degradation in
execution speed; however, it is required for the correctness of the algorithms.

If an algorithm does not require the whole image to be in memory at the
same time, we do not even need to save the image to a local memory. To exemplify this,
we can calculate the mean of the pixel intensity values while the image is still being
received from the incoming FIFO channel (a small sketch of this idea follows). We
consider that most real-world image processing algorithms do not have this advantage,
so we will not pursue this improvement – even if it could be applied to the convolution
filter in some way.
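As an illustration of an algorithm that needs no frame buffer, a running mean over the incoming pixel stream could look like this in C (a sketch only; in the FPGA VI this accumulation would happen inside the acquisition loop, one element per iteration):

#include <stdint.h>
#include <stddef.h>

/* Mean pixel intensity computed directly on the incoming stream:
 * no local image memory is needed because each pixel is consumed once. */
static double stream_mean(const uint8_t *stream, size_t n_pixels)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n_pixels; i++) {
        sum += stream[i];            /* accumulate as pixels arrive from the FIFO */
    }
    return (n_pixels > 0) ? (double)sum / (double)n_pixels : 0.0;
}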

5.5.2.2. In-memory image representation


On general-purpose PCs we usually allocate a two-dimensional array for storing the
image. This is done by first allocating an array of arrays (an array of pointers, where
each pointer locates another array). The allocation of these items is done by the
operating system, and we do not have control over the allocated memory (which might
not be contiguous). Addressing an index of an image would be done by reading the
pointer corresponding to the "row" index. This index gives the offset of another array
that represents a row in the image. To access the desired pixel, we jump to the element
indicated by the "column" index.
Contrary to this example, we want accessing an element to be as simple
as possible. Because the pixels are already coming as a stream, forming a 1D array, it
is much easier and more performant to represent the image as a 1D array in memory. To
access a pixel at the coordinates (x, y), we must access the (x ∗ rows) + y-th element,
where rows represents the number of rows in the image (currently 256).
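The corresponding address computation, written in C for clarity (a sketch; since the image is square, the number of rows equals the length of a row, 256 pixels):

#include <stdint.h>

#define ROWS 256   /* the image is square, so one row also holds 256 pixels */

/* Read the pixel at coordinates (x, y) from the flattened 1D image buffer,
 * using the (x * ROWS) + y addressing described above. */
static uint8_t get_pixel(const uint8_t *image, int x, int y)
{
    return image[x * ROWS + y];
}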

Applying a convolution kernel


The method of applying a convolution kernel was already presented in the section
entitled Example of an image processing algorithm – Linear Spatial Filter (Convolution
Filter). The pseudocode provided there was simply translated to LabVIEW code.
In the first iterations of development, we implement a small kernel
having only 9 elements. The algorithm is specified as follows:
we keep in a buffer a 3 ∗ 3 array representing 9 neighboring elements and
overlap it with the convolution kernel to compute the value of the element in the
middle of the 9 neighboring pixels. We iterate over the image by moving the 3 ∗ 3
window at each iteration. The borders of the image are excluded from the convolution,
because they do not have enough neighbors.
After computing the value of a pixel by convolution, we have the possibility to
divide it by a constant. This is called scaling. Because division is a highly
expensive operation, we use the "scale by power of two" operation, which is
much faster because it only uses logical shift operations. For example, to divide by 16,
we scale the values by 2^(−4), which in fact results in a right shift by four bit positions
(if the most significant bit is on the left side).
We will use the Gaussian kernel and an edge detection kernel, as seen below:

Gaussian: [1 2 1; 2 4 2; 1 2 1],   Edge detect: [−1 −1 −1; −1 8 −1; −1 −1 −1]
We also make sure that the values are within the allowed range of [0, 255] – if
not, the values are saturated. This can result in some loss of data. The best method to
deal with this is to use the histogram equalization algorithm. However, this is beyond
the scope of our hardware-level implementation.

Finally, once the simple kernels are implemented, we can extend the
implementation to larger kernels. This will not be detailed here, because the method is
very similar – we just have to increase the kernel size and include the new kernel
elements in the computation. Also, the borders of the image where the convolution is
not applied will be thicker.

Synchronization
The three main loops that we defined (get data and save to memory, process
image and write result) have dependencies because they share the same memory. If the
processing loop starts to process the image before the required data is available, we can
generate incorrect results. Similarly, if we send the image before it is processed, the
resulting image might be corrupted.

5.5.4.1. Concept of occurrences


Therefore, we must implement a synchronization mechanism that splits the
concurrent phases into three different steps. This can also be called a partial
serialization of the three parallel components. We can notice in Figure 5.14 that even if
we have synchronization, the three regions can still overlap – i.e. some operations can
still be executed in parallel. This is because we can start processing the image even if
only a portion of the data is available. For example, we could start applying the
convolution kernel when the first three lines of the image are available. In the first
development iteration we will implement the loops as three completely serial
operations, and only after that will we improve them by enabling parallelism between
the loops.

Figure 5.14 The three loops of the FPGA VI with and without synchronization



We use the basic principles of working with semaphores, just like in an
operating system. We will use a synchronization mechanism built into the LabVIEW
FPGA module, called an "occurrence". The occurrence is based on the consumer-producer
idea and defines two methods: "Wait on occurrence" and "Set occurrence". The usage
of these primitives is very similar to the concept of locks in higher-level languages.
We will define three occurrences:
• “Image received” occurrence
• “Image processed” occurrence
• “Image sent” occurrence

5.5.4.2. Implementation
In the initial state, we manually generate an "Image sent" occurrence – this
marks the starting point of the FPGA execution. The Read image loop will start
executing and will read the elements coming from the FIFOs until an image is fully read.
Once the image has arrived, we generate an "Image received" occurrence, which
triggers the execution of the second loop. In the meantime, the first loop is blocked
because it waits for another "Image sent" occurrence.
Once the second loop finishes processing the image, it will be blocked again and
it will generate the "Image processed" occurrence, which will unblock the third loop
that sends the image over the DMA FIFO channels back to the processor. We can see
that this way we have "serialized" the execution of the three tasks (a software analogy
of this scheme is sketched below).
The first improvement that we can make is to enable partial parallelism between
the tasks, as we have already mentioned. This is very similar to the concept of
pipelining, which we will use frequently in the improvement part of the
implementation.
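The serialized behavior of the three loops can be mimicked in C with POSIX semaphores playing the role of the three occurrences (a conceptual sketch only; on the FPGA the loops are hardware processes, not threads, and the occurrence primitives are LabVIEW nodes, not this API):

#include <pthread.h>
#include <semaphore.h>

/* Software analogy of the three occurrences: each semaphore is "set" by one
 * loop and "waited on" by the next, serializing read -> process -> write. */
static sem_t image_sent, image_received, image_processed;

static void *read_loop(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&image_sent);        /* wait until the previous image was sent out */
        /* ... read one image from the DMA FIFOs into local memory ... */
        sem_post(&image_received);    /* "Image received" occurrence                */
    }
}

static void *process_loop(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&image_received);
        /* ... apply the convolution kernel ... */
        sem_post(&image_processed);   /* "Image processed" occurrence               */
    }
}

static void *write_loop(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&image_processed);
        /* ... send the result back over the DMA FIFOs ... */
        sem_post(&image_sent);        /* "Image sent" occurrence restarts the cycle */
    }
}

int main(void)
{
    pthread_t t[3];
    sem_init(&image_sent, 0, 1);      /* the initial "Image sent" starts the pipeline */
    sem_init(&image_received, 0, 0);
    sem_init(&image_processed, 0, 0);
    pthread_create(&t[0], NULL, read_loop, NULL);
    pthread_create(&t[1], NULL, process_loop, NULL);
    pthread_create(&t[2], NULL, write_loop, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    return 0;
}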

Improving the FPGA code and preliminary results

5.5.5.1. Preliminary results


The initial performance of the system was very low, between 10 and 15 FPS
for a 3 ∗ 3 convolution kernel. This result does not meet the requirements specified in
the first chapters. We have iteratively improved the performance of the FPGA VI by
applying a series of optimizations, gradually increasing the throughput of the
algorithm to over 40 FPS.
These results will be detailed in the Testing and Validation chapter. In the
following part we show some of the improvements that had to be made in order to
increase performance.

5.5.5.2. Pipelining
The performance of the FPGA is highly dependent on the frequency of its clock.
The base clock of the myRIO FPGA is set to 40 MHz, but this value can be extended
to 80, 120, 160 or 200 MHz. The initial implementation did not allow frequencies
greater than 40 MHz; therefore, we had to optimize the design.
The frequency of the FPGA is closely related to the propagation delay of the
implemented circuits and is determined by the longest path generated by the
FPGA compilation tools. As an example, if the propagation delay of the longest path is
0.01 microseconds (1e-8 seconds), then the maximal frequency will be 100 MHz (10^8
Hz).

61
Chapter 5

To reduce the maximal propagation delay, we first identified the longest
paths using the log generated by the Xilinx compilation tool. To reduce the delay, we
must break down a long "execution path" into smaller ones – in FPGA development we
can achieve this by pipelining.
Therefore, we use pipelining almost everywhere in the design (and even on the
processor in some cases – because the CPU has two cores, only one pipeline stage is
worth implementing there).

To exemplify the pipelining procedure, we present a simplified version of the
image acquisition loop in the main FPGA VI. This loop has the responsibility of reading
the incoming pixels from the FIFO DMA channels and saving these pixels in a local
memory. In Figure 5.15 we can observe that the FIFO Read and the Memory
Write operations are connected directly by wires. Therefore, the execution time of one
iteration of the loop is determined by adding the propagation delays of both operations
and also the delay caused by transferring the data from one item to the other.

Figure 5.15 Serial FIFO Read and Memory Write operations


Results showed that only low clock frequencies were achievable on the FPGA
using the previous example. To solve this issue, we can remove the dependency
between the two operations by pipelining. Figure 5.16 shows that there are no more
direct dependencies between the two operations. We can also see the usage of Shift
Registers in the block diagram. These elements act as a feedback node in the loop and
allow us to propagate values between consecutive iterations. Therefore, the values read
from the FIFO in iteration n will only be saved in iteration n + 1 (a C sketch of this
idea follows). This increases latency by one extra loop iteration, but the frequency of
the loop (and consequently, the throughput) will almost double.
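In software terms, the shift register acts like a variable carried over to the next iteration. The following C sketch illustrates the idea (an analogy only; the arrays stand in for the FIFO and the block RAM, which are LabVIEW nodes in the real design):

#include <stdint.h>
#include <stdbool.h>

#define N_PIXELS (256 * 256)

/* Software illustration of the pipelined loop: the value read in iteration n
 * is only written to memory in iteration n+1, so the "read" and "write"
 * stages no longer depend on each other inside a single iteration. */
static void acquisition_loop(const uint8_t *fifo, uint8_t *memory)
{
    uint8_t shift_register = 0;       /* plays the role of the LabVIEW shift register */
    bool valid = false;               /* nothing to write in the very first iteration */

    for (int n = 0; n <= N_PIXELS; n++) {
        if (valid) {
            memory[n - 1] = shift_register;   /* stage 2: write the previous value */
        }
        if (n < N_PIXELS) {
            shift_register = fifo[n];         /* stage 1: read the current value   */
            valid = true;
        }
    }
}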

62
Chapter 5

Figure 5.16 Pipelined FIFO Read and Memory Write operations


Several other parts of the design use pipelining, especially where
memory-related and computationally intensive operations are present.

5.5.5.3. Removing multiplication and division


The FPGA implementation contains several parts that require multiplication
or division. These operations are expensive in terms of FPGA resources and timing.
An example that requires division is the scaling performed after computing
the convolution of a pixel in the image. In the case of the 3 ∗ 3 Gaussian kernel we must
divide by 16. Fortunately, because 16 is a power of two, we can use logical shift
operations instead of actually dividing. This greatly improves the performance.
If, however, we had to divide by a number that is not a power of 2, we
could use the built-in high-throughput mathematical functions from the LabVIEW FPGA
module. When using these functions (or VIs), we can specify the number of pipeline
stages that are implemented in the multiplier or divider.
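A one-line C illustration of the same trick, applied to the scaling of the Gaussian convolution sum (a sketch; the function name is ours, not part of the design):

#include <stdint.h>

/* Dividing by 16 (a power of two) is replaced by a right shift by 4 bits;
 * on the FPGA this costs almost nothing, while a true divider would consume
 * significant resources and lengthen the critical path. */
static uint32_t scale_by_16(uint32_t convolution_sum)
{
    return convolution_sum >> 4;      /* equivalent to convolution_sum / 16 */
}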

5.5.5.4. Parallelize read, process and write operations


We can further increase performance by parallelizing the "image read",
"process" and "image write" loops. Currently only one of these loops runs at a time,
because of the shared memory that each loop uses. However, we could partially overlap
the operations, because at any moment each loop uses only a part of the memory. If we
can implement synchronization on smaller sections of the memory, several loops could
run in parallel, given that they use different parts of the memory. We must also note
that it is not possible to fully parallelize, because the memory only has one write
interface and one read interface.
This optimization is not yet implemented, but we expect it to further increase
performance. Theoretically, the frequency of the FPGA clock would not improve, but
we would be able to execute more tasks in parallel, which would result in lower
execution time.

5.6. FPGA Resource summary


The FPGA has the following resources and device utilization, when
implementing the 3 ∗ 3 Gaussian blur operation, with a clock frequency of 160 MHz:

63
Chapter 5

Device utilization     Used      Total     Percent
Total Slices           4329      4400      98.4
Slice Registers        13953     35200     36.6
Slice LUTs             13409     17600     76.2
Block RAMs             25        60        41.7
DSPs                   8         80        10
Table 5.1 Total FPGA device utilization
We can conclude from Table 5.1 that almost all slices of the device have been used
and that most of the LUTs are in use too. On the other hand, Block RAMs and Registers
have a medium-low usage, indicating that we could increase the memory requirements
of our application. However, adding much more logic to the FPGA might not fit on
the device.
We should also mention that the required clock speed (160 MHz) could only be
met when we manually configured the Xilinx tools to increase placement efficiency.
The compilation time of such a design, with these settings, is around 16-18 minutes.


Chapter 6. Testing and Validation


This chapter presents the testing and validation phase. We start by
presenting other technologies that could have been used, followed by an evaluation of
the system's performance.

6.1. Technological Motivation


The first question that we asked ourselves was why use LabVIEW for the CPU
execution, when several highly optimized libraries already exist in C and C++. It is also
known that C libraries (DLLs, i.e. dynamically loadable libraries, or SOs, shared objects
used in a Linux environment) can be very easily integrated into LabVIEW
code. The main reason is that we wanted to experiment as much as possible with the
NI ecosystem, including the LabVIEW IDE and programming language. We already
knew from examples in the literature that imperative languages are often used in
embedded image processing. We wanted to find out, mostly out of curiosity, whether
LabVIEW can be a good environment for this application. When evaluating the
performance, we will compare different LabVIEW implementations of the same
algorithm on different execution systems. However, we will also compare the results to
similar algorithms implemented in C/C++.
For the programmable logic part, it was obvious to use the LabVIEW FPGA
module. If we had to write our own VHDL code and design in a different
environment, such as the Xilinx Vivado Design Suite, we would probably not have been
able to finish the project by now or meet any of our deadlines. It would be great to
compare the performance of the LabVIEW-generated code with a hand-written VHDL
implementation, but unfortunately this is not possible.

6.2. System Performance


We will present the performance differences between different versions of the
SoC system: we will see how different optimizations affect the performance, as well as
the difference between turning several features/components of the system on or off.
Because we want to measure the overall performance of the system, we will
use the FPS as the performance metric.

Different versions of the LabVIEW SoC implementation


We have iteratively developed the application, and we consider that presenting
the performance of the intermediary implementations is very important. Therefore, we
summarized the performance, measured in FPS, of each major version of the system in
Figure 6.1. The major versions of the system are:
1. The initial implementation – images were read only from memory and
the code was highly unoptimized; this is the first version that
successfully compiled on the hardware and whose image processing
algorithm yielded correct results (manually/visually validated)
2. Version 2 – Added a live camera implementation and improved the
application by parallelizing the FIFO read and write operations on the
host, adding a duplicate FIFO channel and sending large chunks of data at
once to reduce communication overhead
3. Version 3 – Removed error handling on the processor (after testing the
implementation in detail) and optimized the execution mode (e.g. by
disabling debugging); on the FPGA side, we improved the design by using the
smallest possible numeric representations for variables
4. Version 4 – Pipelined the operations between the FIFO and BRAM, added
multi-cycle BRAM read operations and implemented multi-stage division
and multiplication operations (having several pipeline stages) – these
improvements shortened the longest path in the design and allowed clock
speeds up to 160 MHz
5. Last version – not yet complete, but preliminary results show a great
increase in performance; in this stage, we execute the three loops of the
FPGA VI in parallel, similarly to having one large pipelined solution

Figure 6.1 compares the performance of the different iterations of
the development phase, using the algorithm with a 3 ∗ 3 kernel. This is essentially
the performance measurement obtained by fixing the image and kernel size and
increasing the processing power.

[Chart: Comparison of the different versions of LabVIEW SoC implementations – FPS of each version (the initial implementation, Version 2, Version 3, Version 4, last version) measured in four modes: None, Display, Live, Live & Display; horizontal axis: FPS, 0 to 50]

Figure 6.1 Comparison of the different LabVIEW SoC implementations


We have included four different types of measurements based on the features
that were used (or turned off). These features are Image Display and Live image
capturing. We can see that the best performance is obtained when both the Image
Display and Live image capturing modes are turned off. In this mode, we have
successfully achieved speeds of over 40 FPS, which meets the standards of real-time
image processing. Development is also ongoing to provide a system that fully
parallelizes the three loops of the FPGA VI – the performance of this system could be
much higher (probably between 50 and 60 FPS).


Comparison with other implementations


We will compare the LabVIEW SoC implementation that uses the 3 ∗ 3 kernel
with several other implementations as well. These are:
• Single-core implementation in LabVIEW using only the CPU
• Dual-core implementation in LabVIEW using only the CPU
• NI implementation in C using only the CPU

[Chart: Implementations on other platforms – FPS of the single-core LabVIEW (CPU), dual-core LabVIEW (CPU) and C (NI, CPU) implementations in four modes: None, Display, Live, Live & Display; horizontal axis: FPS, 0 to 90]

Figure 6.2 Comparison of Convolution Filter implementations on other platforms


Based on Figure 6.2, we can see that our LabVIEW-based SoC implementation
is much faster than a similar implementation that only uses the CPU of the myRIO.
Although the CPU-only version has almost no communication overhead, it is much
slower, because the target's CPU performs the image processing much more slowly
than the FPGA.
We can also see that a similar application written in C by National Instruments
is much faster than any of our implementations – using the NI DLLs, we could reach
processing rates slightly below 90 FPS. From this point of view, we can say that there
is no speedup at all when using a LabVIEW-based SoC application instead of a
highly optimized C library.

The question is whether it is worth using our solution when the problem size
increases. Therefore, we increased the kernel size to 5 ∗ 5 and 7 ∗ 7 (we have also
implemented the 15 ∗ 15 version, but it did not fit on the FPGA). Figure 6.3 shows the
performance comparison of the C and LabVIEW implementations when using different
kernel sizes.


[Chart: Comparison of LabVIEW SoC and CPU implementations while increasing the kernel size – two series (LabVIEW SoC and CPU); vertical axis: FPS, 0 to 100; horizontal axis: kernel size, 1 to 15]

Figure 6.3 Comparing the performance of the LabVIEW SoC implementation with the
C implementation executed on CPU. *performance for the 15 ∗ 15 kernel was estimated
We have seen from the previous figure that if the problem is large enough (e.g.
the size of the image or the size of the kernel is increased), the SoC-based
implementation outperforms the highly optimized CPU-based one. Note that these results
on the SoC were achieved by keeping the frequency of the FPGA constant (at 160
MHz). Because of the limited FPGA resources, it was very hard to reach this frequency
– recompiling the same LabVIEW specification that led to these results might not even
succeed. The estimated result for the 15 ∗ 15 kernel would only be obtainable on a
larger FPGA that can meet both the timing and resource constraints.

It would be interesting to see the performance of a similar SoC application
implemented entirely in C and VHDL – but this is something that we may only do in the
future.

6.3. System Scalability


When speaking of scalability, we can either scale the application by fixing the
image or kernel size and increasing performance, or by fixing the FPS rate while trying
to increase the image or kernel size (or we can do both).
We have seen that increasing the size of the kernel has a much smaller effect on
the SoC execution time than on the CPU. This is because we apply the coefficients
of the kernel by executing the multiplication and addition operations in parallel – using
a larger kernel simply increases the parallelism of the application, without severely
affecting the performance. On the other hand, the operations on the CPU are performed
serially, so they scale very badly.

We have to keep in mind that the FPGA-based approach scales really well only
as long as the FPGA resource constraints are met.


Chapter 7. User’s manual


7.1. Requirements

Hardware
Our system has the following hardware requirements:
o NI myRIO-1900
o Host computer (development PC)
o Generic USB webcam
o USB cable to connect the myRIO to the PC
o Power source for the myRIO

Software
The following software must be installed on the Windows development PC:
• LabVIEW 2017 Development Edition, including the following
modules:
o LabVIEW Real-time module
o LabVIEW FPGA module
o myRIO add-on
o LabVIEW Vision Development module
• LabVIEW 2017 FPGA Module Xilinx Compilation Tool for Vivado
2015.4
On the myRIO, we must also install the necessary software (besides the software
packages that are automatically shipped):
• NI Vision RT
• LabVIEW Real-Time
• NI IMAQdx (image acquisition drivers for the webcam)

7.2. User’s Manual

Setting up the development environment


After installing the necessary software components, one needs to download the
source code (provided with this book or specified in Appendix 3 – Source Code). The
user must then open the LabVIEW project, set the DNS (Domain Name Server) identifier
or IP address of the myRIO and connect to it by right-clicking on the target in the
project tree and selecting "Connect".

Building the LabVIEW project


Once we establish a connection, we need to make sure that the FPGA VI is
compiled. We can test this by opening the “RT Main” VI and running it. The VIs that
are executed on the CPU of the target will automatically be built, but if the FPGA VI
is not compiled, we must compile it by opening the FPGA VI and “running” it – this
will automatically trigger the compilation process.



Deploying and running the project


If the compilation was successful, we can deploy and run the project by opening the RT Main VI and running it. To enable capturing from the camera, the “Live” Boolean control must be set to true. To display the input and output images, the user also has to set the “Display” control to true.

Validating results
Once the application is running, the user can visualize the results of the system on the front panel. The FPS indicator shows the performance of the system.
In the right image you should see a “blurrier” image than in the left one. This is because we apply the Gaussian convolution kernel to the image on the left.
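For reference, a commonly used normalized 3 ∗ 3 Gaussian kernel is shown below (the coefficients compiled into our FPGA design may differ; this is only meant to illustrate why the filtered image looks smoother):

/* A typical normalized 3x3 Gaussian kernel (illustrative only). Each output
 * pixel becomes a weighted average of itself and its eight neighbors, which
 * attenuates noise and fine detail - hence the "blurrier" right-hand image. */
static const float gaussian_3x3[9] = {
    1.0f/16, 2.0f/16, 1.0f/16,
    2.0f/16, 4.0f/16, 2.0f/16,
    1.0f/16, 2.0f/16, 1.0f/16
};

Convolving the input image with such a kernel (k = 3 in the serial sketch from Section 6.3) produces the same smoothing effect on the CPU.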


Chapter 8. Conclusions
Over the past several months, we have identified the main requirements of real-time image processing systems and decided to implement a System on Chip-based software and hardware solution. As opposed to many existing implementations, we proposed a new development approach, using the LabVIEW graphical, data-flow programming language to specify the behavior of both the processor and the FPGA.

8.1. Result Analysis and Achievements


In summary, we made a significant contribution to the field of System on Chip-based image processing by developing a complete hardware-software solution that meets real-time image processing requirements. As opposed to most examples in the existing literature, we used a platform-based approach, relying on LabVIEW and the NI ecosystem, which is not common in scientific image processing projects.
For a small problem size, we have seen that the performance is much better than that of non-SoC LabVIEW implementations; however, we did not succeed in outperforming a well-written serial C implementation. On the other hand, we have shown that, as the problem size increases, the LabVIEW System on Chip-based approach becomes a much better solution than the serial, software-based one.

We have created a well-defined structure and architecture for SoC-based applications that require high data throughput. These aspects can be reused in several other fields, not only in image processing. Our system also offers much better scalability than traditional image processing systems.
We have also shown that the time needed to develop a fully functional SoC system is greatly reduced by using the LabVIEW ecosystem. We can say with high confidence that the future of low-level embedded development (such as microprocessor programming or FPGA design) will be positively influenced by high-level engineering tools such as LabVIEW.
We have also estimated that a non-LabVIEW implementation, even if it performs better, is much harder to develop. Therefore, we sacrificed some performance in favor of delivering a valuable and fully operational system in time. We hope that the concepts and even the implementation details presented in this book can and will someday be reused in other low-level System on Chip applications. Therefore, our implementation will also be a contribution to the open-source community (by being published online).

This project involved acquiring a large amount of experience in the fields of FPGA design, real-time processing, graphical programming, embedded device programming and image processing. We have also learned important aspects of DMA and FIFO communication, as well as of FPGA pipelining and parallel programming.

8.2. Future Work


A low-level and complex project such as this one can always be improved. We have selected two interesting areas for improvement, which are presented below:


Using the AXI standard for inter-SoC communication


AXI is a protocol for transferring data on chips. It is adopted in several Xilinx products, such as the Zynq used in the myRIO, and enables data transmission at high speeds. NI does not fully support this standard yet, but it allows users to integrate Xilinx AXI Intellectual Property cores into the LabVIEW FPGA design.
In the future, we would like to implement the image transfer between the CPU and the FPGA using the AXI standard and compare the results.
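For orientation, the streaming variant of the protocol (AXI4-Stream) reduces to a simple valid/ready handshake: a data word is transferred on a clock edge only when the producer asserts valid and the consumer asserts ready. The short C model below illustrates this behavior; the names are ours and do not correspond to any NI or Xilinx API.

#include <stdbool.h>
#include <stdint.h>

/* Minimal behavioral model of one AXI4-Stream clock cycle (illustration
 * only). A beat of data moves from producer to consumer exactly when both
 * TVALID and TREADY are high on the same rising clock edge. */
typedef struct {
    uint32_t tdata;   /* payload word (e.g. packed pixels)       */
    bool     tvalid;  /* producer has valid data on the bus      */
    bool     tready;  /* consumer is able to accept data         */
    bool     tlast;   /* marks the last word of a line or frame  */
} axi_stream_beat;

/* Returns true if the beat was accepted during this clock cycle. */
bool axi_stream_clock(const axi_stream_beat *bus, uint32_t *sink)
{
    if (bus->tvalid && bus->tready) {
        *sink = bus->tdata;   /* consumer captures the data word */
        return true;
    }
    return false;             /* producer must keep tdata stable */
}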

Interfacing the acquisition device directly with the FPGA


Our current implementation has the major drawback that it uses a low-performance USB webcam. The captured images must take a long path from the camera until they are transferred to the FPGA, which decreases the performance of the system.
In a possible next release, we propose to implement the system so that the camera is connected directly to the FPGA pins. This usually requires an embedded camera and more development on the FPGA side, but it would probably bring significant performance benefits.


Bibliography

[1] S. Nedevschi, "Image Processing," 2018. [Online]. Available: ftp.utcluj.ro/pub/users/nedevschi/IP/.
[2] Wikipedia, "Computer vision," [Online]. Available:
https://en.wikipedia.org/wiki/Computer_vision. [Accessed 30 May 2018].
[3] T. S. Huang, "Computer Vision: Evolution and Promise," 19th
CERN School of Computing, pp. 21-25, 8-21 Sep 1996.
[4] R. Gonzalez and R. Woods, "Digital Image Processing," 3rd ed., Pearson Prentice Hall, 2008.
[5] U. T. C.-N. Computer Science Department, "Image Processing -
Laboratory 1-11," Cluj-Napoca.
[6] Wikipedia, "Digital audio," [Online]. Available:
https://en.wikipedia.org/wiki/Digital_audio. [Accessed 13 6 2018].
[7] Wikipedia, "System on a chip," [Online]. Available:
http://en.wikipedia.org/wiki/System_on_a_chip. [Accessed 15 3 2018].
[8] "What is meant by real-time image processing? - Quora," [Online].
Available: https://www.quora.com/What-is-meant-by-real-time-image-
processing. [Accessed 20 April 2018].
[9] G. Papp-Szentannai, Proposal of the Diploma Project entitled:
Image Processing on System on Chip FPGA Devices, Cluj-Napoca, 2018.
[10] R. Gonzalez, R. Woods and S. Eddins, "Intensity Transformations
and Spatial Filtering," in Digital Image Processing Using MATLAB®
Second Edition, Gatesmark Publishing, 2009, pp. 109-114.
[11] DANA – Distributed Asynchronous Numerical & Adaptive computing framework, DANA Handbook, 2012.
[12] J. Serot, F. Berry and C. Bourrasset, "High-level dataflow
programming for real-time image processing on smart cameras," Journal
of Real-Time Image Processing, vol. 12, no. 4, pp. 635-647, 2016.
[13] J. Serot, "CAPH," 5 May 2018. [Online]. Available:
http://caph.univ-bpclermont.fr/CAPH/CAPH.html. [Accessed 8 June
2018].
[14] B. Senouci, I. Charfi, B. Heyrman, J. Dubois and J. Miteran, "Fast
prototyping of a SoC-based smart-camera: a real-time fall detection case
study," Journal of Real-Time Image Processing, vol. 12, no. 4, pp. 649-
662, December 2016.
[15] U. Handmann, T. Kalinke, C. Tzomakas, M. Werner and W. v.
Seelen, "An image processing system for driver assistance," Image and
Vision Computing, vol. 18, no. 5, pp. 367-376, 2000.
[16] C. T. Johnston, K. T. Gribbon and D. G. Bailey, "Implementing
Image Processing Algorithms on FPGAs," 2018.
[17] M. I. AlAli, K. M. Mhaidat and I. A. Aljarrah, "Implementing
image processing algorithms in FPGA hardware," 2013 IEEE Jordan
Conference on Applied Electrical Engineering and Computing
Technologies (AEECT), pp. 1-5, 2013.


[18] R. Lu, X. Liu, X. Wang, J. Pan, K. Sun and H. Waynes, "The Design of FPGA-based Digital Image Processing System and Research on Algorithms," International Journal of Future Generation Communication and Networking, vol. 10, no. 2, pp. 41-54, 2017.
[19] S. McBader and P. Lee, "An FPGA implementation of a flexible,
parallel image processing architecture suitable for embedded vision
systems," Proceedings International Parallel and Distributed Processing
Symposium, p. 5, 2003.
[20] J. Batlle, J. Marti, P. Ridao and J. Amat, "A New FPGA/DSP-
Based Parallel Architecture for Real-Time Image Processing," Real-Time
Imaging, vol. 8, no. 5, pp. 345-356, 2002.
[21] S. Asano, T. Maruyama and Y. Yamaguchi, "Performance
comparison of FPGA, GPU and CPU in image processing," 2009
International Conference on Field Programmable Logic and
Applications, pp. 126-131, 2009.
[22] A. Elouardi, S. Bouaziz, A. Dupret, L. Lacassagne, J. Klein and R.
Reynaud, "Image Processing: towards a System on Chip," in Image
Processing, 2009.
[23] A. Ahmadinia and D. Watson, "A Survey of Systems-on-Chip Solutions for Smart Cameras," in Distributed Embedded Smart Cameras, C. Bobda and S. Velipasalar, Eds., New York, NY, Springer-Verlag New York, 2014, pp. 25-41.
[24] N. Bellas, S. Chai, M. Dwyer and D. Linzmeier, "FPGA
implementation of a license plate recognition SoC using automatically
generated streaming accelerators," Proceedings 20th IEEE International
Parallel & Distributed Processing Symposium, pp. 8-, 2006.
[25] G. Bieszczad, "SoC-FPGA Embedded System for Real-time
Thermal Image Processing," in Mixed Design of Integrated Circuits and
Systems, Lodz, Poland, 2016.
[26] A. Lopez-Parrado and J. Velasco-Medina, "SoC-FPGA
Implementation of the Sparse Fast Fourier Transform Algorithm," in
Circuits and Systems (MWSCAS), Boston, MA, USA, 2017.
[27] Wikipedia, "Fast Fourier transform," [Online]. Available:
https://en.wikipedia.org/wiki/Fast_Fourier_transform. [Accessed 1 July
2018].
[28] P.-Y. Bourgeois, G. Goavec-Merou, J.-M. Friedt and E. Rubiola,
"A fully-digital realtime SoC FPGA based phase noise analyzer with
cross-correlation," Frequency and Time Forum and IEEE International
Frequency Control Symposium (EFTF/IFCS), 2017 Joint Conference of
the European, pp. 578-582, 2017.
[29] S. Dhote, P. Charjan, A. Phansekar, A. Hegde, S. Joshi and J.
Joshi, "Using FPGA-SoC interface for low cost IoT based image
processing," 2016 International Conference on Advances in Computing,
Communications and Informatics (ICACCI), pp. 1963-1968, 2016.
[30] Altera Corporation, "Architecture Brief: What is an SoC FPGA?," 2014. [Online]. Available: https://www.altera.com/en_US/pdfs/literature/ab/ab1_soc_fpga.pdf. [Accessed 6 June 2018].
[31] Xilinx Inc., "SoCs, MPSoCs & RFSoCs," 2018. [Online].
Available: https://www.xilinx.com/products/silicon-devices/soc.html.
[Accessed 1 July 2018].
[32] Intel Corporation, "SoCs Overview," 2018. [Online]. Available:
https://www.altera.com/products/soc/overview.html. [Accessed 1 July
2018].
[33] E. Vansteenkiste, New FPGA Design Tools and Architectures,
2016.
[34] National Instruments, "myRIO-1900 User Guide and
Specifications," 16 May 2016. [Online]. Available:
http://www.ni.com/pdf/manuals/376047c.pdf. [Accessed 17 March
2018].
[35] National Instruments, "Understanding the LabVIEW FPGA
Compile System (FPGA Module)," March 2017. [Online]. Available:
http://zone.ni.com/reference/en-XX/help/371599N-
01/lvfpgaconcepts/compiling_fpga_vis/. [Accessed July 2018].
[36] National Instruments, "Transferring Data between the FPGA and
Host (FPGA Module)," 2017. [Online]. Available:
http://zone.ni.com/reference/en-XX/help/371599N-
01/lvfpgaconcepts/fpga_data_transfer_overview/. [Accessed 16 February
2018].
[37] National Instruments, "How DMA Transfers Work (FPGA
Module)," 2017. [Online]. Available: http://zone.ni.com/reference/en-
XX/help/371599N-01/lvfpgaconcepts/fpga_dma_how_it_works/.
[Accessed 18 June 2018].
[38] Wikipedia, "Field-programmable gate array," [Online]. Available:
http://en.wikipedia.org/wiki/Field-programmable_gate_array. [Accessed
15 3 2018].
[39] Wikipedia, "Internet of Things," [Online]. Available:
http://en.wikipedia.org/wiki/Internet_of_things. [Accessed 15 3 2018].
[40] National Instruments, "National Instruments: Test, Measurement,
and Embedded Systems - National Instruments," [Online]. Available:
www.ni.com/en-us.html. [Accessed 15 March 2018].
[41] National Instruments, "National Instruments: Test, Measurement,
and Embedded Systems - National Instruments," [Online]. Available:
http://www.ni.com/en-us/shop/labview.html. [Accessed 15 March 2018].
[42] Viewpoint Systems Inc, "LabVIEW FPGA: Features, Benefits &
Drawbacks | Viewpoint Systems," Viewpoint Systems, [Online].
Available: https://www.viewpointusa.com/IE/ar/labview-fpga-the-good-
the-bad-and-the-ugly/. [Accessed 18 March 2018].
[43] Viewpoint Systems, Inc, "LabVIEW FPGA: Features, Benefits &
Drawbacks | Viewpoint Systems," Viewpoint Systems, [Online].
Available: https://www.viewpointusa.com/IE/ar/labview-fpga-the-good-
the-bad-and-the-ugly/. [Accessed 18 March 2018].


Several sources cited in this section were distributed under the GNU Free Documentation License. Therefore, reusing or distributing this document must also comply with the GNU Free Documentation License and the GNU General Public License, which are available at https://fsf.org.


Appendix 1 – Acknowledgements

The hardware (myRIO) and software (LabVIEW) components required for this project were provided by National Instruments ® Romania, with headquarters in Cluj-Napoca, Romania (the corporate headquarters being in Austin, TX, USA).
I want to thank the Romanian team for the opportunity and the support they gave to make the implementation of this project possible.


Appendix 2 – Table of Figures


Figure 2.1 Requirements of the system organized as a series of tasks that must
be performed ................................................................................................................ 12
Figure 3.1 Electromagnetic Waves Spectrum, from Wikipedia (author: Philip
Ronan) .......................................................................................................................... 16
Figure 3.2 The electromagnetic spectrum arranged according to energy per
photon, from [4] (chapter 1.3, pg. 7)............................................................................ 17
Figure 3.3 Components of a general-purpose image processing system, from [4]
(chapter 1.5, pg. 27) ..................................................................................................... 18
Figure 3.4 Illustration of the convolution process, from laboratory 9 in [5] ... 22
Figure 3.5 Pseudocode of convolution filtering ............................................... 22
Figure 3.6 Example of applying the Sobel filters (2nd image) and the Gaussian
blur (3rd image) on a color image (1st image), from [11] ............................................. 23
Figure 3.7 Performance of the k-means clustering algorithm, from [21] (Fig. 8.
of the original paper) .................................................................................................... 27
Figure 4.1 Overall Architecture, as a data acquisition and control process ..... 31
Figure 4.2 Overall Architecture, from a simple, logical point of view ............ 32
Figure 4.3 Serial tasks performed by the SoC device ...................................... 33
Figure 4.4 Delegating work from the UC to the FPGA ................................... 34
Figure 4.5 Intel Stratix 10 TX FPGA, from altera.com ................................... 41
Figure 5.1 Tool flow for FPGA configuration compilation, from [33] (chapter
2.1, pg. 30) ................................................................................................................... 43
Figure 5.2 Snippet of a VI’s block diagram that computes f(x) = gain ∗ x + offset on each element of an array (x) ..................................................................... 44
Figure 5.3 NI myRIO-1900 Hardware Block Diagram, from [34] .................. 46
Figure 5.4 Overview of a LabVIEW project file (lvproj) ................................ 47
Figure 5.5 Selecting the execution mode of an NI FPGA target ..................... 48
Figure 5.6 Template VI for myRIO development using custom FPGA
personality .................................................................................................................... 51
Figure 5.7 The main responsibilities of the main VI ....................................... 51
Figure 5.8 Opening and configuring a camera using NI Vision Acquisition
Software VIs ................................................................................................................ 52
Figure 5.9 Declare and initialize the input and output images ........................ 53
Figure 5.10 Capturing an image....................................................................... 54
Figure 5.11 Illustration of DMA FIFO transfer from the FPGA to the host, from
[37] ............................................................................................................................... 55
Figure 5.12 Decompose image in two parts and transfer it in parallel on two
DMA FIFO channels.................................................................................................... 56
Figure 5.13 Independent loops in the FPGA VI .............................................. 58
Figure 5.14 The three loops of the FPGA VI with and without synchronization
...................................................................................................................................... 60
Figure 5.15 Serial FIFO Read and Memory Write operations......................... 62
Figure 5.16 Pipelined FIFO Read and Memory Write operations ................... 63


Figure 6.1 Comparison of the different LabVIEW SoC implementations ...... 66


Figure 6.2 Comparison of Convolution Filter implementations on other
platforms ...................................................................................................................... 67
Figure 6.3 Comparing the performance of the LabVIEW SoC implementation
with the C implementation executed on CPU. *performance for the 15 ∗ 15 kernel was
estimated ...................................................................................................................... 68


Appendix 3 – Source Code

Because LabVIEW is a graphical, data-flow programming language, it is impossible to give a text-based representation of the code. Representative code sections were already shown as screenshots in the Detailed Design and Implementation chapter.

The whole source code can be viewed online at the following GitHub web page:
https://github.com/gergo13/SystemOnChip-ImageProcessing-myRIO
