
Neural Networks and Deep Learning

Manuel Carbonell - Autonomous University of Barcelona


Master in Modelling for Science and Engineering
Tutor: Ruben Tous
20th June 2016
https://github.com/manucarbonell/convnet
Abstract
In this article we study artificial neural network models applied to computer vision, and how modifications of their architecture affect training performance and prediction accuracy. To do this we used the machine learning library TensorFlow. After several training rounds we observed that deeper networks generally gave better classification results, with the exception of adding a normalization layer, which calls for further discussion.

Contents

1 Introduction and previous concepts
  1.1 Motivation and objectives
  1.2 Artificial Intelligence
    1.2.1 History
  1.3 Feedforward neural networks
    1.3.1 Perceptrons
    1.3.2 Perceptron output with step function
    1.3.3 An example
    1.3.4 Layers
    1.3.5 Network training and sigmoid neurons
  1.4 Learning with gradient descent
    1.4.1 Cost function
    1.4.2 Backpropagation algorithm equations
    1.4.3 Backpropagation algorithm steps
  1.5 Types of layers
    1.5.1 Convolutional layers
    1.5.2 An example
    1.5.3 Pooling Layers
    1.5.4 Rectifier linear units layers
    1.5.5 Local Response Normalization layers
    1.5.6 Softmax layers
  1.6 Why are CNN so effective on image data?
  1.7 Tensorflow
  1.8 Programming model

2 Related work
  2.1 Universal approximation of functions
  2.2 Recurrent Neural Networks
  2.3 Deep Belief Networks (DBNs)
  2.4 Random forest
  2.5 Deep Dream
  2.6 Image Inpainting


3 Methodology
  3.1 The dataset
  3.2 Hardware setup
  3.3 Implemented and customized programs
  3.4 Transfer learning
    3.4.1 Bottlenecks


4 Results
  4.1 Single softmax layer network
  4.2 Convolution
  4.3 Convolution and pooling
  4.4 Two convolutional and pooling layers
    4.4.1 Normalization Layers
  4.5 Image size augmentation
  4.6 Overfitting
  4.7 Data amount augmentation
  4.8 Batch size
  4.9 Extracted features
  4.10 Retrain ImageNet Model


5 Conclusions and future work

1 Introduction and previous concepts

1.1 Motivation and objectives

The idea of a computer able to acquire, process, analyze, and understand images at a human level, known as computer vision, has been a challenge for over 40 years. Lately the results of research in this field have been evolving so fast that in some particular classification tasks computers perform as well as humans (e.g. MNIST digit classification, [8]). These improvements have been possible mainly thanks to new computing capabilities that allow us to process large quantities of data very fast, a phenomenon known as big data. Among others, an important advance was the possibility of using models such as artificial neural networks (ANNs), which boosted the predictive quality of computer vision and other artificial intelligence applications.
The purpose of this work is to discover some insights about convolutional neural networks, a special kind of ANN, the reasons why they have a particular structure, and how well they perform classification on a self-elaborated image data set. We first give a general contextualization and introduction to ANNs and their parts, and then describe how to train a fully connected network with the gradient descent method. Then we check how essential each part of the network is by running a training algorithm with different setups. On the way we get introduced to the recently launched library TensorFlow and discuss the obtained results.

1.2 Artificial Intelligence

Artificial neural networks are one of the trending research topics in artificial intelligence (AI). Among the main goals into which the simulation of intelligence is split, namely deduction, knowledge representation, planning, natural language processing (communication), perception, motion and manipulation, and learning, neural networks belong to the last one, more specifically called machine learning (ML). In that branch of AI we study algorithms that improve automatically through experience, by approximating functions according to given data, usually called training data; that is why when running ANN algorithms we say that the network is learning or being trained.
Machine learning is split between supervised, unsupervised and reinforcement learning. In unsupervised learning the objective is to infer a function that describes hidden structure in unlabeled data, e.g. finding similarity between data points, clustering data or reducing its dimension. Examples of unsupervised learning are k-means, maximum likelihood estimation and principal component analysis. Supervised learning consists of using labeled training data to estimate a map that returns the right label when receiving new data following the same pattern as the training set. Convolutional neural networks are an example of supervised learning, where for example we classify images by their label according to a labeled training data set. In reinforcement learning, programs are rewarded for taking actions in an environment so as to maximize some notion of cumulative reward.
There is disagreement about whether biological foundations are important for continuing the development of AI, as happens with ANNs, which are inspired by biological neurons, or whether it should be a completely independent research field, the same way bird biology contributes little to most aeronautical engineering ideas. Until now AI research has been mostly statistics related. In many AI-specific tasks, e.g. recognizing a song, where a fingerprint of the audio frequencies is generated, a purely statistical method produces results similar to what a human would give. But are we then just doing classical statistics work? We can find some differences between classical statistics and AI:
- The dimensions of the data. In classical statistics we have low-dimensional data sets, e.g. fewer than 100 dimensions; in AI we can have much more than that.
- In classical statistics we have a lot of noise in the data, which might make it difficult to find a structure; in AI the noise is not sufficient to hide the structure of the data when properly processed.
- In classical statistics there is not much structure in the data, and if there is, it can be represented with a simple model with few parameters; in AI the structure is too complicated to be represented with a simple model with few parameters.
- Usually the objective in classical statistics is to reveal a structure hidden by noise; in AI the objective is to present a complicated structure in a way that can be learned.


1.2.1 History

The first scientific approach to artificial neural networks was made by Warren McCulloch and Walter Pitts [1] in 1943. Their objective was to mathematically formalize the behavior that brain neurons have when we perform logical reasoning or read our senses' inputs.
In 1949 the neuroscientist Donald Hebb proposed a theory for the adaptation of the neurons in the brain during the learning process, which was named after him: Hebbian learning. He already thought of the idea of weights in the connections between neurons, which appear in present artificial neural network (ANN) models. The model states that the weight between two neurons increases if the two neurons activate simultaneously, and decreases if they activate separately. This is not exactly the principle of convolutional neural networks (CNNs), but it was a start.
In 1948 Alan Turing suggested a model of computation called the unorganized machine, thinking of the human cortex of an infant, which is largely random initially but can be trained to perform particular tasks. In this model Turing defined A-type machines, which consisted of randomly connected networks of NAND logic gates, and B-type machines, which were built by taking A-type machines and substituting inter-node connections with structures called connection modifiers, themselves made of A-type nodes. These connection modifiers were supposed to undergo appropriate interference, mimicking education.
Frank Rosenblatt first built an electronic device called the perceptron around 1960, which represented a neuron as a logical gate with weights and a bias. The utility of the model, though, was not observable due to a lack of computing resources, which is why the idea of the neural network did not trend until recent times.
It was in 1975 when Paul Werbos [4] thought of applying the backpropagation algorithm to find an optimal solution for the parameters of a neural network, which greatly improved the problem-solving capability of neural networks. After this advance much research in this field was done again and the goodness of the predictions gradually improved until present times, where ANNs are the state of the art for image recognition, achieving human-level precision and substituting methods such as support vector machines or random forests, as happened in the ImageNet 2012 contest, where the winning image classification team used a convolutional neural network [2], standing out from all other methods. Next we introduce the components of a convolutional neural network in present time.

1.3 Feedforward neural networks

1.3.1 Perceptrons

An artificial neural network is a graph whose nodes are a special kind of logical gates called perceptrons or artificial neurons, which have some parameters that allow their behavior to change so that patterns can be recognized in data sets. They are called neural networks because the original model was inspired by animal neurons, which have the property of gradually changing chemically when performing synapses and of extracting an abstract concept when connected. The same happens with the artificial neural network model, where values of parameters are changed instead of physical and chemical properties. Perceptrons receive many inputs and compute an output by means of weights and biases. We can imagine perceptrons as decision-making units that consider different sources of information and give different importance to these sources.

Figure 1: Animal neurons gradually change with synapses.

Figure 2: Perceptron with 3 inputs and 1 output.

1.3.2 Perceptron output with step function

We first represent the output $a$ of a perceptron with inputs $\{x_i\}_{i=1}^{m}$ weighted by $\{w_i\}_{i=1}^{m}$ and bias $b$ with the step function:
\[
a(x_1,\dots,x_m,w_1,\dots,w_m,b)=\begin{cases}0 & \text{if } \sum_{i=1}^{m} x_i w_i < b\\ 1 & \text{if } \sum_{i=1}^{m} x_i w_i \ge b\end{cases} \tag{1}
\]
1.3.3 An example

A real-life example of a perceptron could be the decision of enrolling or not in a given master program. The input variables could be:
- The subjects are or are not of your interest.
- The references you have about job possibilities after the master are good or bad.
- The university is close or not to your place.
You would take a positive decision if the sum of the weighted input variables is greater than a given bias. Let's say you give a weight of 5 to the interest of the subjects, a 4 to the goodness of job opportunities and a 3 to the closeness of the university. Then if the bias is, for example, 5 it would be enough to have interesting content in the subjects to enroll in the master, but if the bias is, for example, 7 then two conditions should be positive to enroll in the master. A small sketch of this decision rule is shown below.
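As an illustration, the following minimal Python sketch (not part of the original work) implements the step-function perceptron of equation (1) with the hypothetical weights and biases of the enrollment example:

```python
def perceptron(inputs, weights, bias):
    """Step-function perceptron: returns 1 if the weighted sum reaches the bias."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= bias else 0

# Enrollment example: binary inputs for interesting subjects,
# good job references and university closeness.
weights = [5, 4, 3]

x = [1, 0, 0]                       # only the subjects are interesting
print(perceptron(x, weights, 5))    # 1: enough with bias 5
print(perceptron(x, weights, 7))    # 0: bias 7 requires a second positive condition
```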
From now on we will use the following notation for inputs, weights and biases: $w^l := (w^l_{jk})$ is the weight matrix for layer $l$, where each element $w^l_{jk}$ denotes the weight of the connection from the $k$-th neuron in the $(l-1)$-th layer to the $j$-th neuron in the $l$-th layer; $x := (x_1,\dots,x_m)$ denotes the input vector, and $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ is the weighted input to the activation function for neuron $j$ in layer $l$.

Figure 3: Step function.

1.3.4 Layers

Perceptrons are organized in different layers which represent levels of abstraction of the decision making, from lower to higher. That means the first layers recognize concrete, non-general patterns in the data, and the last layers give an abstract classification of the data, such as: is the picture a 0 or a 1, or is there an eye in this picture? Many layers together form a neural network.

Figure 4: Neural network with 4 layers (input layer, hidden layers, output layer).

The first layer is called the input layer, and receives the information that has to be processed by the network (in our case image pixel intensities). Coming up next are the hidden layers. In the example picture we only show two hidden layers, but we can find networks with 12 hidden layers, for example. Hidden layers process the input layer outputs to give the output layer a final result.

1.3.5 Network training and sigmoid neurons

Our purpose is to get the network to give the output we want when we give it a determined input. For this we proceed with a method called network training or learning. The idea of training the network consists of giving a big amount of input data together with the expected output results, and adapting the network parameters to fit these expected results as well as possible. Let's say we want the network to recognize whether faces appear in a picture; then we give as input many pictures that contain a face and many pictures that don't, together with the labels of the images telling whether a face actually appears. For every picture we gradually modify the weights and biases of our network so that the output is more and more frequently the same as the label. To do that we use the gradient descent method, which we explain below. After the training, the network should be able to recognize whether there is a face in a new input picture with high accuracy. The perceptrons described above are an intuitive, approximate, simple idea of a neuron model that was developed into sigmoid neurons, which we define below. The output function of a sigmoid neuron is not the one described in (1), since with a step function the behavior of the network would be chaotic when modifying weights and biases. We introduce then sigmoid neurons, which compute their output with the sigmoid function, a smooth-shaped version of the step function:
\[
\sigma(z) = \frac{1}{1+e^{-z}} \tag{2}
\]

Figure 5: Sigmoid function.

The reason why such a function is chosen for the perceptron output is its smooth shape and the property that makes it similar to the step function: for weighted inputs much greater than the bias (i.e. $z \to \infty$) the output is close to 1, and equivalently for $z \to -\infty$ the output is close to 0. The important difference with the step function is that this time, when we slightly change the weights and biases of a perceptron, the output is going to change slightly too, due to the continuity and smoothness of the sigmoid function. This is going to allow us to search for the optimal weights and biases of each perceptron such that we get the targeted output for a given input, by using the gradient descent method. Putting together the definitions of neural network and sigmoid neuron, the activation of the $j$-th neuron in layer $l$ is going to be:
\[
a^l_j = \sigma\!\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right) = \sigma(z^l_j) \tag{3}
\]
where $w^l_{jk}$ is the weight applied to the activation of the $k$-th neuron of the $(l-1)$-th layer entering the $j$-th neuron of the $l$-th layer, and $b^l_j$ is the bias of the $j$-th neuron in the $l$-th layer.
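To make equation (3) concrete, here is a minimal NumPy sketch (an illustration, not the code used in this work) of a forward pass through one sigmoid layer; the layer sizes and values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Equation (3): a^l = sigmoid(W^l a^(l-1) + b^l)."""
    z = W @ a_prev + b
    return sigmoid(z)

rng = np.random.default_rng(0)
a0 = rng.random(4)            # 4 input activations
W1 = rng.normal(size=(3, 4))  # layer with 3 neurons
b1 = np.zeros(3)
print(layer_forward(a0, W1, b1))
```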

1.4 Learning with gradient descent

1.4.1 Cost function

We denote the target output or desired output of the network when $x$ is the input by $y(x)$, and the network output by $a(x,w,b)=a(z)$ (the desired output doesn't depend on the weights and biases of the neurons but the network output does). We want the training algorithm to determine which weights and biases best approximate the outputs $a(x,w,b)$ to $y(x)$ for all inputs $x$.
We define a cost function (also called loss function) as a measurement of the goodness of fit of a neural network with weights $w$ and biases $b$ to a target $y(x)$. First we have the quadratic cost function:
\[
C(w,b) = \frac{1}{2n}\sum_x \left(a(x,w,b) - y(x)\right)^2 \tag{4}
\]

where $n$ is the number of training inputs and the sum is over each input $x$. In the quadratic cost function we can easily observe the main properties that any cost function should have:
- It is positive on all of its domain.
- The more the outputs of the network differ from the label, the higher the value taken by the function.
- The cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over cost functions $C_x$ for individual training examples $x$.
- It can be written as a function of the outputs from the neural network.
We will write from now on $a$ instead of $a(z)$ and $y$ instead of $y(x)$ to ease the reading.
Cost functions are defined with the objective of finding some weight values $w$ and biases $b$ such that the output $a$ is as frequently as possible the same as the target $y$; equivalently, of finding a minimum of the function $C$ by varying $w$ and $b$. We could use an analytic method, solving the equation that sets the gradient to zero to find local minima and checking which one is the lowest, but since the number of variables is going to be very large and the shape of the function tends to be pretty complicated, that method would be too costly and we would probably not get close to the real minimum; that is why we use the gradient descent method instead.
The gradient descent method consists of gradually getting closer to a local or absolute minimum $(w_0, b_0)$ of the function by subtracting the gradient scaled by a small value called the learning rate, based on the fact that in a multidimensional scalar function the gradient vector indicates the direction of maximum growth, so the opposite vector indicates the maximum descent. So in each step of the gradient descent method the weights and biases change following:
\[
(w,b)_{n+1} = (w,b)_n - \eta \, \nabla C_x(w,b) \tag{5}
\]
where $w$ is the weight vector, $b$ the biases, $\eta$ the learning rate, $x$ a fixed input and $C_x(w,b)$ the cost function. A toy sketch of this update rule is shown below.
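As an illustration of equation (5), the following NumPy sketch (not from the original work) runs gradient descent on a simple quadratic cost whose gradient is known in closed form; the function and learning rate are arbitrary choices:

```python
import numpy as np

def cost(p):
    """Toy quadratic cost with minimum at (3, -1)."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

def grad(p):
    """Closed-form gradient of the toy cost."""
    return np.array([2.0 * (p[0] - 3.0), 2.0 * (p[1] + 1.0)])

eta = 0.1                     # learning rate
p = np.array([0.0, 0.0])      # initial parameters
for step in range(100):
    p = p - eta * grad(p)     # equation (5)
print(p, cost(p))             # converges towards (3, -1), cost ~ 0
```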
1.4.2 Backpropagation algorithm equations

Now the question is, how do we calculate the gradient of this cost function, of which we don't even know the concrete expression? The answer is to use the backpropagation algorithm. Before describing it we need to define a couple of equations.
We define the error of a neuron $j$ in layer $l$ as the variation of the cost function with respect to the weighted input plus bias in that neuron:
\[
\delta^l_j := \frac{\partial C}{\partial z^l_j} \tag{6}
\]
For two matrices $A$, $B$ of the same dimensions $m \times n$, the Hadamard product, or element-wise matrix product, $A \odot B$, is a matrix of the same dimension as the operands, with elements given by
\[
(A \odot B)_{i,j} = (A)_{i,j}\,(B)_{i,j} \tag{7}
\]

Claim 1. Let $L$ be the number of layers of a neural network and $j$ one of its neurons; then we have the following equality for the neuron error:
\[
\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j) \tag{8}
\]
or, in matrix form, $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
Proof. Let's first apply the definition of error to a neuron in layer $L$:
\[
\delta^L_j = \frac{\partial C}{\partial z^L_j} \tag{9}
\]
We have to develop the right-hand side until we get to the right-hand side of equation (8). If we differentiate the cost function applying the chain rule, taking into account that the activations $a^L_k$ of the neurons in layer $L$ depend on $z^L_j$, we get the intermediate step
\[
\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k}\,\frac{\partial a^L_k}{\partial z^L_j} \tag{10}
\]
where the sum is over all the neurons in the output layer. The activation of a neuron in a given layer only depends on the input received by that same neuron, not on the other neurons of the layer, so the term $\frac{\partial a^L_k}{\partial z^L_j}$ is equal to zero when $k \ne j$. In consequence we have
\[
\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\frac{\partial a^L_j}{\partial z^L_j} \tag{11}
\]
but from (3) we know that $\frac{\partial a^L_j}{\partial z^L_j} = \sigma'(z^L_j)$, which finishes the proof.

Claim 2. The errors of two consecutive neural network layers are related by the following equality:
\[
\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) \tag{12}
\]
where $(w^{l+1})^T$ is the transpose of the weight matrix for layer $l+1$.


Proof. Taking into account the relation between the input of a neuron in a layer and the inputs of the previous layer we can write
\[
\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\,\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\,\delta^{l+1}_k \tag{13--15}
\]
and from the definitions of $z$ and the activations
\[
z^{l+1}_k = \sum_j w^{l+1}_{kj}\, a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj}\,\sigma(z^l_j) + b^{l+1}_k \tag{16}
\]
If we differentiate we get
\[
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\,\sigma'(z^l_j) \tag{17}
\]
Substituting back into the previous expression we get
\[
\delta^l_j = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j) \tag{18}
\]
and in matrix form
\[
\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)
\]

Claim 3. We have the following equalities for the gradient components:
\[
\frac{\partial C}{\partial b^l_j} = \delta^l_j; \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j \tag{19}
\]

Proof. For the first equality we differentiate the expression $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ with respect to $b^l_j$:
\[
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\cdot 1 = \delta^l_j \tag{20--21}
\]
and for the second:
\[
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\,\frac{\partial}{\partial w^l_{jk}}\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right) = a^{l-1}_k\,\delta^l_j \tag{22--25}
\]

1.4.3 Backpropagation algorithm steps

Now that we have shown all the necessary equations, we can list the steps of the backpropagation algorithm to calculate the gradient of the cost function. We denote by $a^{x,l} = (a^l_1, \dots, a^l_n)$ the vector of neuron activations in layer $l$ when $x$ is the input, where $a^l_j$ is defined in (3).
- Input: for all neurons in the input layer, set the neuron values $a^1$ to the corresponding pixel intensities of the example image.
- Feedforward: for each layer $l \in \{2, \dots, L\}$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
- Output error $\delta^L$: calculate the output error $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
- Backpropagation: after calculating the last layer error we backpropagate it to the first layers: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ for $l$ in $\{L-1, \dots, 2\}$.
- Output gradient components: we compute the components of the cost function gradient as given in Claim 3: $\frac{\partial C}{\partial b^l_j} = \delta^l_j$ and $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$.
This process is done for all examples $x$ in a given subset of the training set, usually called a batch, and then the weights are updated (gradient descent step):
\[
w^l \to w^l - \frac{\eta}{m}\sum_x \delta^{x,l}\,(a^{x,l-1})^T \tag{26}
\]
\[
b^l \to b^l - \frac{\eta}{m}\sum_x \delta^{x,l} \tag{27}
\]
A compact sketch of these steps for a small fully connected network is given below.
Notice that the output error is very simple to calculate in the case of a quadratic cost function:
\[
\delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y)\,\sigma(z^L)(1 - \sigma(z^L)) \tag{28}
\]
since
\[
\sigma'(z) = \frac{\partial}{\partial z}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)(1-\sigma(z)) \tag{29}
\]

The backpropagation algorithm description, concretely equation (8), tells us that the variation of weights and biases in every step depends on the derivative of the output function, which in general will be the sigmoid. Looking at the limits of the sigmoid function we see that $\lim_{z \to \infty} \sigma'(z) = \lim_{z \to -\infty} \sigma'(z) = 0$, so for very large or very small values of $z$ the variation of the cost will be very small. Sometimes that happens and the learning process gets increasingly slower; when it does, we say that the neuron is saturated. To avoid learning slowdown by neuron saturation, we use the cross-entropy cost function
\[
C = -\frac{1}{n}\sum_x \sum_j \left[ y_j \ln a^L_j + (1-y_j)\ln(1-a^L_j) \right] \tag{30}
\]
The motivation to use such a function is that, in addition to fulfilling the desired properties of a cost function mentioned before, its partial derivatives do not depend on the derivative of the activation function $\sigma'(z)$, which, as we said, causes the saturation. Indeed, if we calculate the derivative of the cross-entropy cost function with respect to a weight:
\[
\frac{\partial C}{\partial w_j} = -\frac{1}{n}\sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{31}
\]
\[
= -\frac{1}{n}\sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \sigma'(z)\,x_j \tag{32}
\]
\[
= \frac{1}{n}\sum_x \frac{\sigma'(z)\,x_j}{\sigma(z)(1-\sigma(z))}\,\bigl(\sigma(z) - y\bigr) \tag{33}
\]
\[
= \frac{1}{n}\sum_x x_j\,\bigl(\sigma(z) - y\bigr) \tag{34}
\]
and with respect to the bias:
\[
\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x \bigl(\sigma(z) - y\bigr) \tag{35}
\]

In any case, when we use linear neurons, that is, neurons with a (non-constant) linear activation function, neuron saturation does not happen, because the derivative of the activation is neither zero nor asymptotically close to it, so in that case we could use the quadratic cost function.
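The following small NumPy sketch (an illustration added here, with arbitrary values) compares the output-layer error terms of the quadratic cost, $(a-y)\sigma'(z)$, and of the cross-entropy cost, $(a-y)$, for a saturated neuron, showing why the latter avoids the learning slowdown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z, y = 8.0, 0.0                 # saturated neuron: output close to 1, target 0
a = sigmoid(z)
quadratic_delta = (a - y) * sigmoid_prime(z)   # tiny: sigma'(z) ~ 3e-4
cross_entropy_delta = a - y                    # close to 1: learning proceeds
print(quadratic_delta, cross_entropy_delta)
```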

1.5 Types of layers

We now have a general idea of how a feedforward neural network works. We focus next on some types of layers contained in feedforward neural networks. We studied some of them, and they are all implemented in the library that we used, but there was not enough time to test all of them.
1.5.1 Convolutional layers

In the previous description we supposed that each perceptron of a layer was connected to all perceptrons of the next layer, but that is not how convolutional networks are generally built. For image recognition, the training performance and the accuracy of the network prediction improve if we use so-called local receptive fields. The relation between two pixels that are next to each other is clearly more important than the relation between two pixels on opposite sides of the image: when we recognize a pattern in an image we scan the image looking for concrete shapes in parts of it. This is what characterizes convolutional layers, where a reduced-size region of the input image layer is connected with a single neuron of the hidden layer next to it.
Sliding the local receptive field, also called kernel, to the right by one neuron, or by any number of neurons defined as the stride, we connect the new region obtained with the next hidden neuron, saving its activation, and do so for the whole layer.

So if, for example, we choose a region of 5x5 neurons, the activation described in equation (3) becomes
\[
a^{l+1}_{jk} = \sigma\!\left( b + \sum_{m=0}^{4}\sum_{n=0}^{4} w_{m,n}\, a^l_{j+m,\,k+n} \right) \tag{36}
\]

The previous operation is called a convolution, which gives its name to the network model. Notice that the weights and biases are shared by all local receptive fields, so with this process we are checking to what degree a feature is present all across the image, and slightly modifying it during the learning process. Also, this way we greatly reduce the number of parameters compared with the fully connected network, and get more meaningful information for each neuron in the hidden layer.
We call the map from local receptive fields to a hidden layer a feature map. A convolutional layer can have many feature maps; this way we can recognize different shapes in the images. So to speak, each feature map tells us whether a given pattern is present in a region or not, with a real value between 0 and 1.
With the kernel defined as above, the output of a convolutional layer would have smaller dimensions than the input, but there are cases when this is not desired. In some cases we extend the layer at the edges with padding values so that the output of the convolution has the same size and shape as the input. A small sketch of the convolution in equation (36) is shown below.
1.5.2 An example

Let's say for example that we want a network to recognize whether an eye is present in a picture. A possible feature map with a 4-neuron local receptive field, telling whether the following sub-features are present in the right positions, would indicate a case in which an eye is present. The previous layer would have further feature maps to recognize whether those sub-features are present or not, and so on for every lower level of abstraction until we get to the input layer with the image data.
1.5.3 Pooling Layers

After a convolutional layer we usually have pooling layers, which simplify the information of the previous layer. A commonly used one is the max-pooling layer, which takes the maximum activation value in a given region, say 2x2 neurons, of the previous layer:
\[
a^{l+1}_{jk} = \max_{m,n \in \{0,1\}} a^l_{2j+m,\,2k+n} \tag{37}
\]
The idea of max-pooling is to summarize in a layer the most relevant information of the feature maps, namely whether they appear or not in an approximate part of the image, since we do not care about the exact position of a feature when we are looking for patterns. A small sketch is shown below.
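A minimal NumPy sketch of 2x2 max-pooling as in equation (37) (again an illustration with an arbitrary input):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Equation (37): keep the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    out = feature_map[:h - h % 2, :w - w % 2]          # crop to even size
    return out.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))    # [[ 5.  7.] [13. 15.]]
```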

1.5.4 Rectifier linear units layers

As we explained before, when using a sigmoid output for all the neurons in a layer, the state of many neurons can become saturated, due to the shape of this output function. Rectifier layers are characterized by having their neurons' activation function defined as
\[
f(x) = \max(0, x) \tag{38}
\]
Using this kind of output we avoid saturation, so this kind of layer is usually combined with convolutional layers with a sigmoid function. A smooth approximation to the rectifier is the analytic function
\[
f(x) = \ln(1 + e^x) \tag{39}
\]
which is called the softplus function. These layers have the property of accelerating the learning process, that is, of achieving a lower cost value in fewer steps.
1.5.5 Local Response Normalization layers

Local response normalization layers perform a kind of lateral inhibition by normalizing over local input regions. After Krizhevsky's work [3] we know that they should improve the classification goodness. Their activation function is given by
\[
a^i_{x,y} = z^i_{x,y} \Bigg/ \left( k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} \bigl(z^j_{x,y}\bigr)^2 \right)^{\!\beta} \tag{40}
\]
where $z^i_{x,y}$ is the activity of a neuron computed by applying kernel $i$ at position $(x,y)$ and then applying the ReLU nonlinearity, and $a^i_{x,y}$ is the response-normalized activity of $z^i_{x,y}$. The sum runs over $n$ adjacent kernel maps at the same spatial position and $N$ is the total number of kernels in the layer. Details about the other parameters ($k$, $\alpha$, $\beta$) can be found in [3].
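A NumPy sketch of equation (40) over a stack of N feature maps, assuming the parameter values reported in [3] ($k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$):

```python
import numpy as np

def local_response_norm(z, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Equation (40): z has shape (N, H, W), one slice per kernel map."""
    N = z.shape[0]
    out = np.empty_like(z)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(z[lo:hi + 1] ** 2, axis=0)) ** beta
        out[i] = z[i] / denom
    return out

z = np.random.default_rng(0).random((8, 24, 24))
print(local_response_norm(z).shape)    # (8, 24, 24)
```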

1.5.6 Softmax layers

Softmax layers transform the activations from the previous layer into a probability distribution, keeping the same information as the activations. Each neuron of the softmax layer has the following activation function:
\[
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \tag{41}
\]
The activations $z^L_j$ of the previous layer are not necessarily between 0 and 1, nor do they sum to 1 over the whole layer, so with the softmax layer we make sure that we have a better representation of the probability that the image belongs to a particular class. We will actually not count softmax as a layer that helps to train the network, but as a layer that helps to make the classification results human readable.
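A short NumPy sketch of equation (41), with the usual max-subtraction for numerical stability (an implementation detail not discussed in the text):

```python
import numpy as np

def softmax(z):
    """Equation (41): turn a vector of activations into a probability distribution."""
    e = np.exp(z - np.max(z))    # subtracting the max does not change the result
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1
```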

1.6 Why are CNN so effective on image data?

We studied convolutional networks because they have been a major advance for working with image, sound and video data. But why are they mostly used only on this kind of data? The most plausible reason is that such data shares a common fundamental property: local stationarity and a multi-scale compositional structure, which allows expressing long-range interactions in terms of shorter, localized interactions. That is, video and image data always have the property that if we look at the data closely enough we usually see a smoothness of values; for example, in any part of an image, if we zoom in close enough the color changes gradually from one value to another. This smoothness of values might be an advantage in making gradient descent training work well. By multi-scale compositional structure we mean that at different scales of observation of the data there are always patterns that let us identify concepts: looking closely it is types of edges and points, looking from afar it is different complex geometrical shapes.

1.7 Tensorflow

To run the operations needed to train a neural network we used Google's recently launched open source machine learning library TensorFlow. TensorFlow is an open source software library for numerical computation using data flow graphs. It is an alternative to previous libraries with similar purposes such as Caffe, Torch, Theano, DL4J or SciKit-Learn. In each graph, nodes represent mathematical operations, from simple ones like matrix multiplication and addition to more complex ones like convolution or softmax. Graph edges represent the multidimensional data arrays (tensors) communicated between them. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

1.8 Programming model

TensorFlow computations are represented by a directed graph, which consists of a set of nodes. The graph describes a data flow computation, with extensions allowing some kinds of nodes to maintain and update persistent state and allowing branching and looping control structures within the graph. TensorFlow computational graphs are typically constructed using one of the supported front-end languages (C++ or Python); in our case it was Python.
An operation has a name and represents an abstract computation (e.g., matrix multiply, or add). An operation can have attributes, and all attributes must be provided or inferred at graph-construction time in order to instantiate a node that performs the operation. One common use of attributes is to make operations polymorphic over different tensor element types (e.g., add of two tensors of type float versus add of two tensors of type int32). A kernel is a particular implementation of an operation that can be run on a particular type of device (e.g., CPU or GPU). A TensorFlow binary defines the sets of operations and kernels available via a registration mechanism, and this set can be extended by linking in additional operation and/or kernel definitions/registrations.
A tensor in TensorFlow is understood as a typed, multidimensional array. A variety of tensor element types are supported, including signed and unsigned integers ranging in size from 8 bits to 64 bits, IEEE float and double types, a complex number type, and a string type (an arbitrary byte array).
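As a minimal sketch of this programming model (written against the graph-mode Python API of the TensorFlow versions available at the time of writing; newer releases expose the same calls under tf.compat.v1), a tiny graph is first constructed and then executed in a session:

```python
import tensorflow as tf

# Graph construction: nodes are operations, edges carry tensors.
x = tf.placeholder(tf.float32, shape=[None, 3], name="input")
W = tf.Variable(tf.random_normal([3, 2]), name="weights")
b = tf.Variable(tf.zeros([2]), name="biases")
logits = tf.matmul(x, W) + b            # matrix multiply and add operations
probs = tf.nn.softmax(logits)           # softmax operation

# Graph execution: a session places the operations on devices and runs them.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(probs, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```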

2 Related work

2.1 Universal approximation of functions

The way artificial neural networks have evolved until today, where they are useful for solving many classification problems with good results, was heuristic; it was not determined analytically, with deductive steps, that neural networks could properly model certain types of data such as audio and video. However, it can be analytically proven that linear combinations of sigmoid functions can uniformly approximate any continuous function, which tells us that we could approximate any data set with neural networks. Details can be found in [6]. In spite of this formalization of the function approximation capability of ANNs, it is accepted that they have a black-box nature in terms of feature extraction. The interpretation of the learned weights and biases is not exactly known, although we could observe that basic shapes that might be present in images are identified as features.

2.2 Recurrent Neural Networks

In our work we used feedforward neural networks throughout, which propagate the activations in one direction, but it is also important to remark that there are other commonly used types of ANNs, such as recurrent neural networks, in which the connections form a directed cyclic graph.

2.3 Deep Belief Networks (DBNs)

The main condition for using CNNs is to have labeled data, which in most real cases does not happen. Sometimes we have similar kinds of problems that need to be solved in an unsupervised way, and to do this we can use deep belief networks, which are the unsupervised learning version of artificial neural networks. Another inconvenience of CNNs and backpropagation is that the weights and biases can get stuck in a poor local optimum, keeping the model far from good prediction results. To overcome these limitations Smolensky [7] thought of a network that learns hidden patterns in the data. The idea is to have only one visible layer and many hidden ones, to infer the states of the hidden variables for some visible variable states, and to be able later to generate new samples of the visible variables. In the case of images, we would learn the probability of some features appearing in a given image, without it being labeled.
DBNs are composed of Restricted Boltzmann Machines (RBMs). RBMs are a version of ANNs, simpler than CNNs, that learn a probability distribution

over a set of inputs. In the case of image sets, the network learns a set of features given an input image dataset. This can be used to initialize the feature values of deep neural networks. RBMs only have 2 layers, a visible one with $m$ neurons (in this method also called units) and a hidden one with $n$ units, with binary boolean values. The same way as in CNNs, there is a weight matrix $W = (w_{i,j})$ of size $m \times n$, where $w_{i,j}$ determines the weight of the connection between visible unit $v_i$ and hidden unit $h_j$, and also biases, $a_i$ for visible units and $b_j$ for hidden units. In RBMs we have a function that associates a scalar value called energy to each configuration of the variables:
\[
E(v,h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m}\sum_{j=1}^{n} v_i w_{i,j} h_j \tag{42}
\]

or, in matrix notation,
\[
E(v,h) = -a^T v - b^T h - v^T W h \tag{43}
\]
Learning corresponds to modifying that energy function so that its shape has desirable properties: we would like plausible or desirable configurations to have low energy. We also have a probability distribution for each configuration of the network, which depends on the energy function:
\[
P(v,h) = \frac{1}{Z} e^{-E(v,h)} \tag{44}
\]
where $Z = \sum_{(v,h)} e^{-E(v,h)}$ is a normalizing constant ensuring that the probability distribution sums to 1; the sum is over all possible configurations of visible and hidden units. Plausible configurations should have a higher probability value, that is, an energy value as low as possible. In a similar way, the probability of a given visible unit vector is the normalized sum of the exponentials of the energy function over all possible hidden unit configurations:
\[
P(v) = \frac{1}{Z} \sum_h e^{-E(v,h)} \tag{45}
\]

Visible as well as hidden unit activations are independent within their layer, which is why they are called restricted. For this reason the conditional probability of a configuration of the visible units $v$, given a configuration of the hidden units $h$, is
\[
P(v|h) = \prod_{i=1}^{m} P(v_i|h) \tag{46}
\]
and, the other way around, the conditional probability of $h$ given $v$ is
\[
P(h|v) = \prod_{j=1}^{n} P(h_j|v) \tag{47}
\]
The individual activation probabilities are given by
\[
P(h_j = 1 \mid v) = \sigma\!\left( b_j + \sum_{i=1}^{m} w_{i,j} v_i \right) \tag{48}
\]
and
\[
P(v_i = 1 \mid h) = \sigma\!\left( a_i + \sum_{j=1}^{n} w_{i,j} h_j \right) \tag{49}
\]

where $\sigma$ is the sigmoid function described in the introduction. Given a training set $V$, the idea is to maximize the product of configuration probabilities $P(v)$ by varying the weights, that is, to find
\[
\arg\max_W \prod_{v \in V} P(v) \tag{50}
\]
or, equivalently, to maximize the expected log probability of $V$:
\[
\arg\max_W \; \mathbb{E}\!\left[ \sum_{v \in V} \log P(v) \right] \tag{51}
\]

To do that, the Stochastic Maximum Likelihood (SML) or Persistent Contrastive Divergence (PCD) algorithm is commonly used. This is the equivalent of the backpropagation algorithm, which is also performed inside gradient descent, to find the optimum weights. The algorithm computes a negative and a positive gradient to calculate the gradient descent step. The procedure for a sample of values of the visible layer is as follows:
- Take a training sample $v$, compute the probabilities of the hidden units and sample a hidden activation vector $h$ from this probability distribution.
- Compute the positive gradient, which is the outer product between $v$ and $h$.
- From $h$, sample a reconstruction $v'$ of the visible units, then resample the hidden activations $h'$ from this (Gibbs sampling step).
- Compute the negative gradient $v'h'^T$.
- Let the weight update $\Delta w_{i,j}$ be the positive gradient minus the negative gradient, times some learning rate: $\Delta w_{i,j} = \epsilon\,(v h^T - v' h'^T)$.
- In a similar way we update the biases $a$ and $b$.
This algorithm is implemented in the SciKit-Learn class BernoulliRBM, and as an example we can see the extracted features $W$ when running it with the MNIST data set as input, as in the sketch below.
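A minimal usage sketch with scikit-learn's BernoulliRBM (shown here on the small digits dataset bundled with scikit-learn rather than the full MNIST set; hyperparameter values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

# 8x8 digit images scaled to [0, 1], as the RBM expects binary-like inputs.
X = load_digits().data
X = (X - X.min()) / (X.max() - X.min())

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

# Each row of components_ is a learned feature, reshapable to an 8x8 image.
features = rbm.components_.reshape(-1, 8, 8)
print(features.shape)    # (64, 8, 8)
```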

Composing together many RBMs, making each hidden layer the visible layer of another RBM, we form a deep belief network, which is able to extract features of different levels of abstraction from the data. To train a deep belief network we proceed as follows:
- Given an input data sample $X$, we train a restricted Boltzmann machine on $X$ to obtain its weight matrix $W$. Then we use it as the weight matrix between the lower two layers of the network.
- Then we transform $X$ by the RBM to produce a new data sample $X'$, either by sampling or by computing the mean activation of the hidden units.
- Next we repeat the procedure with $X \leftarrow X'$ for the next pair of layers, until the top two layers of the network are reached.
- At last we fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).

2.4 Random forest

Before neural network models were improved to perform accurate classification tasks, random forest was one of the most used machine learning techniques for solving similar problems. Here we give a short introduction, to get to know another widely used approach for finding patterns and classifying data.
A decision tree is a tree-type graph in which each node represents a logical operation or decision. As input data traverses down the tree through logical operations, the data gets bucketed into smaller and smaller sets. In decision tree learning, a predictive model maps observations about an item to conclusions about the item's target value.
Random forest is a decision tree learning algorithm developed by Leo Breiman in 2001, and consists of a combination of decision trees such that each tree depends on the values of a random vector sampled independently and with the same distribution for each tree.


It is a substantial modification of an algorithm called bootstrap aggregating, also known as bagging, that builds a large collection of uncorrelated trees and then takes their average.
Let $D$ be a standard training set of size $n$; bagging generates $m$ new training sets $D_i$, each of size $n'$, by sampling from $D$ uniformly and with replacement. By sampling with replacement, some observations may be repeated in each $D_i$. If $n' = n$, then for large $n$ the set $D_i$ is expected to contain the fraction $(1 - \frac{1}{e}) \approx 63.2\%$ of the unique examples of $D$, the rest being duplicates. This kind of sample is known as a bootstrap sample. The $m$ models are fitted using the above $m$ bootstrap samples and combined by averaging the output (for regression) or voting (for classification). Applying this idea to prediction trees, we get the random forest algorithm. Given a training set $X = x_1, \dots, x_n$ with responses $Y = y_1, \dots, y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples: for $b = 1, \dots, B$, sample with replacement $n$ training examples from $X, Y$, call these $X_b, Y_b$, and train a decision or regression tree $f_b$ on $X_b, Y_b$. After training, predictions for unseen samples $x'$ can be made by averaging the predictions of all the individual regression trees on $x'$,
\[
f = \frac{1}{B} \sum_{b=1}^{B} f_b(x'),
\]
or by taking the majority vote in the case of decision trees.
So the algorithm uses a similar idea of decision making as perceptrons in CNNs, but looking at recent image classification contest results, it does not reach as good an accuracy on image data as convolutional networks. A small usage sketch is given below.
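A minimal scikit-learn sketch of the bagging-of-trees idea described above (dataset and hyperparameters are illustrative, not those of this work):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = 100 trees, each fitted on a bootstrap sample and combined by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```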

2.5 Deep Dream

Have you ever thought about something or someone for so long that for a second you have the feeling you see it even when it is not there? This is a curious application of deep neural networks: the generation of images reminiscent of hallucinations. The idea of Deep Dream is to maximize the activations of certain layers' features in a network that is already trained, and mix the detected features with an input image. To do this, gradient ascent is used, which is the opposite idea of gradient descent: instead of subtracting the gradient from the weights and biases in each step, we add it, to get a higher activation value. The result of doing this on an image of Barcelona's skyline from Parc Güell, with an Inception network layer that detects the presence of canines and other animals, is shown in the corresponding figure.

2.6 Image Inpainting

Another curious application of neural networks to image data is a common process done by humans: imagining the completion of a missing part of an image. This promising idea was made reality in [9], was accepted at this year's Computer Vision and Pattern Recognition conference in Las Vegas, and represents a big step for computer vision. In figure 8 we can see the impressive results.

Figure 8: A demo of context encoders, an image inpainting method developed using neural networks.

3 Methodology

Our goal was to build our own CNN, taking inspiration from examples that already perform prediction effectively, and to understand their architecture. To do that we use the TensorFlow library and take ideas from its available examples.

3.1 The dataset

The first proposed objective was to train a convolutional network to classify a dataset of food pictures extracted from Instagram and Google Images into the following 10 classes:
- Beer (0)
- Burger (1)
- Coffee (2)
- Croissant (3)
- Fried Eggs (4)
- Other (5)
- Paella (6)
- Pizza (7)
- Sushi (8)
- Wine (9)

Figure 9: A sample of our initial data set.

The aim of the class "other" is to make the model able to tell when a picture does not belong to any of the food classes. The dataset is composed of both Instagram photos and web images. Instagram photos have been obtained from the Instagram API, filtered with user-defined tags and manually purged. As user-defined tags are very noisy, this method proved to be inefficient and very time-consuming. In order to facilitate the generation of more ground truth annotations and a larger training dataset, we also obtained images from Google Images through the Custom Google Search API. This method, which allowed us to automatically annotate a bigger set of images, turned out to be very useful, as almost all the retrieved images showed the desired food category and minimal manual purging was required. The first model that we are going to build is a single layer neural network. The images of the dataset have no specific size or format; we store them in a .bin file whose records contain 32x32-pixel images with 3 RGB channels. Then, to feed the network, we randomly crop them into 24x24-pixel images, to expand the data set size (a sketch of the record layout is given below).
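As an illustration of the record layout (1 label byte followed by 32x32x3 pixel bytes, in the spirit of the CIFAR-10 binary format; the exact byte ordering is an assumption), here is a hypothetical NumPy reader and random-crop helper, not the project's ManuNet_input.py:

```python
import numpy as np

RECORD_BYTES = 1 + 32 * 32 * 3   # label byte + RGB pixel bytes

def read_bin(path):
    """Return (labels, 32x32x3 uint8 images) from a .bin file of fixed-size records."""
    raw = np.fromfile(path, dtype=np.uint8).reshape(-1, RECORD_BYTES)
    labels = raw[:, 0]
    images = raw[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return labels, images

def random_crop(image, size=24, rng=np.random.default_rng()):
    """Randomly crop a 24x24 patch, used to artificially enlarge the data set."""
    y = rng.integers(0, image.shape[0] - size + 1)
    x = rng.integers(0, image.shape[1] - size + 1)
    return image[y:y + size, x:x + size, :]
```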

3.2 Hardware setup

The models were trained on a high-end server with a quad-core Intel i7-3820 at 3.6 GHz with 64 GB of DDR3 RAM, and 4 NVIDIA Tesla K40 GPU cards with 12 GB of GDDR5 each, connected through PCIe 3.0 in x16 mode (containing two PCIe switches). The machine runs a GNU/Linux system, with Linux kernel 3.12 and NVIDIA driver 340.24. We performed experiments with different configurations (downscaling sizes, data augmentation, different number and composition of layers, different layer geometries, etc.). We also tested aspects with no impact on the classification accuracy but with practical implications, such as different input formats (TFRecords, compressed numpy arrays, etc.) or different hardware configurations (one or more CPUs and GPUs, etc.). Our runs include an extensive set of configurations; for brevity, when parameters were shown to be either irrelevant or to have negligible effect, we use default values. Each experimental configuration was repeated at least 5 times. Unless otherwise stated, we report median values in seconds.

3.3 Implemented and customized programs

Our program contains the following parts:
- Data format transformer - folder2bin.py: This script reads images located in a folder containing as many subfolders as classes of pictures (10) and returns a .bin file with all the image data stored in records of length n = picture width x picture height x number of color channels + 1 bytes, where the first byte is the class label of the picture and the rest of the bytes are the pixel intensities. We also implemented the opposite step in bin2folder.py.
- Input reader - ManuNet_input.py: This script contains functions to read .bin data files using a queue of image examples, returning tensors of a given batch size containing image arrays and, separately, labels. It also contains a function to distort images (randomly crop, flip and whiten them) to enlarge the data set.
- Model - ManuNet.py: The network implementation of ManuNet, a customized version of the CIFAR-10 network [2]. This program allows us to extract our data set from our github repository (www.github.com/manucarbonell/datasets) and then build a convolutional network with different architectures to perform experiments. Depending on the value of a mode parameter we build a network with:
  1 fully connected sigmoid layer;
  1 convolutional layer followed by pooling;
  2 convolutional layers with pooling and normalization;
  2 convolutional layers with pooling and normalization followed by 2 fully connected layers.
  It contains a function to save summaries of the cost (the value of the cost function) during the training process, which can be observed later with TensorBoard, a platform to visualize the learning in an interactive way. In the end we didn't use TensorBoard, since we preferred to generate our own custom graphs and training and evaluation outputs. It also contains the function, called during the training process, that builds the model and performs the backpropagation algorithm with learning rate exponential decay.
  As we said before, in each step of the backpropagation algorithm we update weights and biases in the direction of the cost function gradient multiplied by a scalar $\eta$, the learning rate. If we used a constant learning rate, we would soon stop getting closer to the cost function minimum, as we would step over it, the same way it would be difficult to get the ball in the hole using only the driver when playing golf. So after a given number of epochs NUM_EPOCHS_PER_DECAY we decay the learning rate exponentially using the factor LEARNING_RATE_DECAY_FACTOR. In the function train() the training step is defined, calculating the loss and applying the computed gradient. The different kinds of layer ops used are described in the Tensorflow library documentation: https://www.tensorflow.org/versions/r0.8/api_docs/python/nn.html.
- Network training - ManuNet_train.py: This script calls the input reader, builds the network graph, calls the training step from the model program and iterates the process, saving the results in a file. The graph is saved in a .ckpt file so it can be read when evaluating the model. During the training, the loss value, steps and execution time are saved.
- Train network and track precision - ManuNet_train_eval.py: A modified version of the previous program that saves the prediction precision using both train and test data separately. Model inferences are grouped in a scope to allow reusing variables.
- Evaluate model - ManuNet_eval.py: Returns the precision of our model's inference over test data.
- Evaluation sample - ManuNet_eval_sample.py: Performs model inference over the desired number of batches and saves the images with their predicted and correct label.
- Generate confusion matrix - ManuNet_eval_byclasses.py: Performs inference of the images' labels over the desired number of batches and returns a matrix where each entry $a_{ij}$ is the portion of examples that were labeled as class $i$ and predicted as class $j$; this way values on the diagonal $a_{ii}$ give the precision for class $i$.
- Extract features - ManuNet_get_features.py: Performs inference and saves the features extracted by the convolutional layer kernels as images.

3.4 Transfer learning

One may ask oneself: is it normal to need to see 1000 images of an elephant before being able to recognize one the next time? Maybe in the very strange case in which you were just given your sight and an elephant is the first thing you ever saw; otherwise it shouldn't be necessary. Apparently the same happens with CNN learning: once the network has learned many visual concepts, it becomes easier every time to learn new ones. So after seeing some results with a self-trained CNN we move to this approach, which is to retrain a large ImageNet model to recognize the pictures of our data set. This technique is called transfer learning or convolutional network fine-tuning. After learning about the work of Donahue et al. [5], and about the option of loading the Inception neural network with TensorFlow to use the learned features on your own data set, we checked whether this is a better approach than training a net only with our own data.
3.4.1 Bottlenecks

The idea consists of loading the graph of an ImageNet Inception network which is already trained (concretely, for 1000 classes) and, using the learned features, performing a training over the new classes to recognize, this way avoiding a long training process. To do that we have to adapt the last layer of the trained graph before the softmax to the newly added classes. To do this with our dataset we use TensorFlow's program for retraining the ImageNet-trained model with one's own dataset.

4 Results

4.1 Single softmax layer network

The first and simplest possible approach that we've taken was to train the network with a single softmax layer, i.e. a single matrix product of the image data with the weights plus the addition of biases. Since our dataset is quite noisy and not very large, without extracting features at different levels of abstraction the first result is not going to be very good; if it learns anything at all, it could already be considered an achievement.
After running the experiment we can see the results of running stochastic gradient descent with a single layer network in figure 11. Clearly we are not getting close to a minimum of the cost function, since after some steps the cost is barely decreasing.

Figure 10: 1 softmax layer ANN.

When checking the precision of the predictions, feeding the network with 600 test images and dividing the correct classifications by the total number of classifications, we get a disastrous score of 44% (44 out of 100 images are classified correctly).

Figure 11: Training loss in each step for a network with no hidden layers.

So with only a softmax layer receiving the weight product with the input layer plus biases, the model does learn something, since a random classification would get around 10% precision, but it doesn't get much better than that. Taking a look at the loss value evolution, we leave this "hello world" experiment without spending time on experiments with more steps, and go on with the layers that give their name to the networks we are studying. So let's see how the classification improves with a convolution and later a pooling layer.

4.2

Convolution

After the simplest approach of having a network with no hidden layers, we see how the model does with only one hidden convolutional layer (figure 12). The layer extracts 64 kernels of 5x5 neurons, with a stride of a single neuron in each direction and a padding that makes the output of the convolutional layer have the same size as its input. We can think of the convolutional layer as a prism, since it extracts the presence information of 64 features for every part of the image; this information is saved in a 3D tensor (24x24x64).

Figure 12: Convolution network simple structure

The loss value continues decreasing over a longer training (in the simplest model it had almost stopped decreasing within the first 10,000 steps) and ends up with an average value of 0.3. Training took 2 hours 56 minutes to complete 50,000 steps, which we chose as a training length sufficient for the loss to stabilize. The predictions make, as expected, a jump in precision to 85.5% accuracy, in contrast with the previous model's result.

Figure 13: Loss function values during training, in steps and time
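A convolutional layer like the one just described can be written roughly as follows; this is a sketch in TensorFlow 1.x style, where the 24x24x3 input size, the variable names and the initializer are assumptions rather than our exact code.

    import tensorflow as tf

    # Batch of cropped RGB images (assumption: 24x24 crops, as in the CIFAR-10 pipeline).
    images = tf.placeholder(tf.float32, [None, 24, 24, 3])

    # 64 kernels of 5x5 neurons over the 3 input channels.
    kernel = tf.Variable(tf.truncated_normal([5, 5, 3, 64], stddev=5e-2))
    biases = tf.Variable(tf.zeros([64]))

    # Stride of a single neuron in each direction; SAME padding keeps the
    # spatial size of the input, so the output is a 24x24x64 tensor per image.
    conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME')
    conv1 = tf.nn.relu(conv + biases)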

4.3 Convolution and pooling

Now we check the effect of adding a pooling layer after the convolutional layer (figure 14). This time the cost decreases quite a bit faster and reaches a value close to 0.16 after 50,000 training steps, in contrast with the 0.3 of the previous model. In terms of precision we get up to an 88% score, which is actually not bad taking into account that the images are not nearly as simple as, for example, handwritten digits, and that our network has only 2 layers. If we randomly choose a sample of the classifications, we see that indeed more than 8 out of 10 images are classified in the right group. The real label of each picture is after "L:" and the prediction computed by the network is after "P:"; a code sketch of the pooling layer follows the sample.

Figure 14: Convolution network with 1 conv. layer and pooling


L: 0, P: 0    L: 1, P: 1    L: 2, P: 2    L: 3, P: 3    L: 4, P: 8
L: 5, P: 5    L: 6, P: 6    L: 7, P: 6    L: 8, P: 8    L: 9, P: 9
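The max-pooling layer mentioned above can be expressed in TensorFlow 1.x roughly as in the sketch below; the 3x3 pooling window with stride 2 mirrors the CIFAR-10 reference model and is an assumption about the exact values we used.

    import tensorflow as tf

    # conv1 is the output of the previous convolutional layer, e.g. [None, 24, 24, 64].
    conv1 = tf.placeholder(tf.float32, [None, 24, 24, 64])

    # 3x3 pooling window with stride 2: keeps the strongest activation in each
    # neighbourhood and roughly halves the spatial resolution.
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1],
                           strides=[1, 2, 2, 1], padding='SAME')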

We can also see the precision results in the confusion matrix, which tells us, for each class, which portion of the predictions were correct and which fell in wrong classes; rows are the real class labels and columns are the predicted classes. For example, we can observe that 15% of the fried eggs were classified as sushi. That might be caused by the similarity of colors and shapes (white ovals surrounded by black are present in both classes). So maybe we should raise the number of layers and of features detected in our network, or change other parameters, to let the network tell the difference between those classes. Apart from this, there is no other notable confusion (greater than 10%) between classes.
Table 1: Confusion matrix (rows: real class labels, columns: predicted classes)

        0     1     2     3     4     5     6     7     8     9
  0  0.87  0.02  0.00  0.02  0.00  0.08  0.02  0.00  0.00  0.00
  1  0.00  0.85  0.02  0.04  0.00  0.04  0.00  0.02  0.02  0.00
  2  0.08  0.00  0.77  0.00  0.00  0.11  0.02  0.00  0.03  0.00
  3  0.02  0.02  0.00  0.78  0.06  0.04  0.00  0.06  0.02  0.00
  4  0.02  0.00  0.02  0.05  0.72  0.02  0.00  0.02  0.15  0.00
  5  0.01  0.01  0.00  0.00  0.00  0.94  0.01  0.01  0.01  0.00
  6  0.00  0.02  0.00  0.03  0.02  0.00  0.90  0.02  0.01  0.00
  7  0.00  0.02  0.00  0.02  0.02  0.00  0.05  0.85  0.05  0.00
  8  0.01  0.00  0.01  0.01  0.01  0.06  0.01  0.04  0.83  0.00
  9  0.00  0.00  0.04  0.00  0.00  0.10  0.00  0.00  0.00  0.86
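As an aside, a confusion matrix like the one above can be computed from the predicted and real labels with a few lines of NumPy; the sketch below assumes arrays of integer labels and predictions and is not the exact evaluation code we ran.

    import numpy as np

    def confusion_matrix(labels, predictions, num_classes=10):
        """Rows are real class labels, columns are predicted classes."""
        counts = np.zeros((num_classes, num_classes), dtype=np.float64)
        for real, pred in zip(labels, predictions):
            counts[real, pred] += 1
        # Normalize each row so it shows the portion of each real class
        # that ended up in every predicted class.
        row_sums = counts.sum(axis=1, keepdims=True)
        return counts / np.maximum(row_sums, 1)

    # Example with the 600 test images (hypothetical arrays):
    # matrix = confusion_matrix(test_labels, test_predictions)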

Figure 16: Training cost depending on time and steps

So, taking a look at the classifications, we could say that with one convolutional and pooling layer the network can already classify images with a clear color and shape pattern, but it has difficulties with classes that contain many colors and complicated shapes; clearly the idea of convolution is the main key of the learning process of the network. In terms of time, it took us 2 hours and 24 minutes to train the network using the GPU cluster.
It is worth remarking that in this model we are estimating a function with a total of 32x32 + 5x5x64 = 2624 parameters, so we can imagine the complexity of the computation; this is what we meant by the differences between classical statistics and machine learning.

4.4 Two convolutional and pooling layers

Now that we have seen that the main improvement comes from adding a convolutional layer, let us see whether another big improvement comes from adding a second convolutional layer (figure 17).

Figure 17: Convolution network with 2 convolution and pooling layers

Again we will extract 64 features, which turned out to be a good amount in our reference work [3]. Comparing the cost during training with that of the network with a single convolutional layer, we can see that this time the learning curve is not so steep at the beginning; it decreases at a more continuous rhythm during the whole training, and also with fewer oscillations. So we could say that adding a convolutional layer gives us a more stable training process. In terms of precision, in the first 10,000 steps we get almost the same value (0.87). If we run the training process with the GPU version of TensorFlow we can set the number of steps to 50,000, still finishing the training procedure in a reasonable time, and see the difference between 1 and 2 convolutional layers.
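The two-block architecture of figure 17 can be sketched by stacking the convolution and pooling operations shown earlier; the sizes, names and initializers below follow the CIFAR-10 reference model only loosely and are assumptions, not our exact code.

    import tensorflow as tf

    def conv_pool_block(inputs, in_channels, out_channels):
        """One convolution (5x5 kernels, stride 1, SAME padding) followed by pooling."""
        kernel = tf.Variable(
            tf.truncated_normal([5, 5, in_channels, out_channels], stddev=5e-2))
        biases = tf.Variable(tf.zeros([out_channels]))
        conv = tf.nn.relu(
            tf.nn.conv2d(inputs, kernel, strides=[1, 1, 1, 1], padding='SAME') + biases)
        return tf.nn.max_pool(conv, ksize=[1, 3, 3, 1],
                              strides=[1, 2, 2, 1], padding='SAME')

    images = tf.placeholder(tf.float32, [None, 24, 24, 3])
    block1 = conv_pool_block(images, 3, 64)    # first convolution + pooling
    block2 = conv_pool_block(block1, 64, 64)   # second convolution + pooling

    # After two stride-2 poolings a 24x24 input becomes 6x6, so block2 is
    # flattened to 6*6*64 values before the fully connected / softmax layers.
    flat = tf.reshape(block2, [-1, 6 * 6 * 64])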

Figure 18: Training cost depending on time and steps

Here we have the confusion matrix results when training a network with 2
convolutional layers.


Table 2: Confusion matrix for the network with 2 convolution and pooling layers (rows: real class labels, columns: predicted classes)

        0     1     2     3     4     5     6     7     8     9
  0  0.86  0.00  0.02  0.00  0.00  0.08  0.02  0.00  0.00  0.02
  1  0.00  0.79  0.00  0.04  0.00  0.05  0.02  0.04  0.05  0.00
  2  0.10  0.02  0.70  0.03  0.03  0.07  0.00  0.00  0.00  0.05
  3  0.02  0.04  0.00  0.82  0.04  0.00  0.02  0.04  0.02  0.00
  4  0.02  0.00  0.00  0.04  0.85  0.02  0.00  0.02  0.04  0.00
  5  0.01  0.01  0.00  0.00  0.00  0.93  0.02  0.00  0.01  0.00
  6  0.00  0.00  0.00  0.05  0.02  0.00  0.91  0.00  0.02  0.00
  7  0.02  0.03  0.00  0.01  0.00  0.00  0.06  0.85  0.03  0.00
  8  0.00  0.00  0.00  0.02  0.00  0.02  0.00  0.01  0.94  0.00
  9  0.00  0.00  0.00  0.00  0.00  0.09  0.05  0.00  0.00  0.86

It takes 2 hours 51 minutes to train the network with 2 convolutional + pooling layers, but the precision results are almost the same. There are some changes in the confusion, but the general precision score is still 88%, which makes us wonder why many widely used models repeat convolutional layers to improve the classifications. It is possible that our dataset is not big enough for the difference to be noticed when adding repeated layers.
4.4.1 Normalization Layers

If we add a normalization layer after each of the pooling layers, as is done in the cited models, surprisingly our accuracy decreases again, to 86.8%, so we look at which other network configuration values can be modified to get better results. Another parameter we took from previous works is the cropping size, which was 32x32 for all images, but maybe using such small versions of the pictures prevents our network from recognizing some features that need more resolution to be detected.
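The normalization we refer to is local response normalization (section 1.5.5); in TensorFlow 1.x it can be inserted after a pooling layer roughly as below, where the depth radius and the alpha/beta constants follow the CIFAR-10 reference values and are assumptions.

    import tensorflow as tf

    # pool1 is the output of a pooling layer, e.g. [None, 12, 12, 64].
    pool1 = tf.placeholder(tf.float32, [None, 12, 12, 64])

    # Local response normalization across nearby feature maps.
    norm1 = tf.nn.lrn(pool1, depth_radius=4, bias=1.0,
                      alpha=0.001 / 9.0, beta=0.75)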

4.5 Image size augmentation

So the next approach will be to train the network with two convolutional, pooling and normalization layers on a higher-resolution version of the same dataset. We choose a size that does not make the computation too slow but whose difference in resolution is quite noticeable, which will be 48x48 after the cropping step. With this size, the precision of the predictions rises to 89.99%, which is quite a significant improvement. So we can state that a higher resolution together with more convolutional layers gives better classification results. These experiments also take a long computation time: with the last one (2 convolutional layers, 48x48 pixels) it took 8 hours 21 minutes to complete the 50,000 training steps. The confusion matrix shows that the predictions for some classes are now highly accurate, but for some others the neural network still lacks a complete understanding of the patterns.


Table 3: Confusion matrix of the 2 convolution layers model trained with 48x48 images (rows: real class labels, columns: predicted classes)

        0     1     2     3     4     5     6     7     8     9
  0  0.83  0.00  0.04  0.02  0.02  0.07  0.02  0.00  0.00  0.00
  1  0.00  0.82  0.00  0.06  0.00  0.04  0.00  0.04  0.04  0.00
  2  0.02  0.02  0.88  0.00  0.00  0.08  0.00  0.00  0.00  0.00
  3  0.00  0.00  0.02  0.83  0.04  0.00  0.04  0.02  0.04  0.00
  4  0.00  0.00  0.05  0.02  0.81  0.02  0.00  0.02  0.07  0.00
  5  0.01  0.00  0.00  0.00  0.00  0.96  0.01  0.00  0.01  0.00
  6  0.00  0.02  0.02  0.02  0.02  0.02  0.86  0.05  0.00  0.00
  7  0.02  0.03  0.00  0.02  0.00  0.02  0.00  0.90  0.02  0.00
  8  0.00  0.00  0.01  0.01  0.00  0.00  0.01  0.00  0.96  0.00
  9  0.00  0.00  0.04  0.00  0.00  0.09  0.00  0.00  0.00  0.87

4.6 Overfitting

The plotted loss values are calculated with training data, but it would be good to see how the classification precision evolves during the training process, to know whether the network is actually learning the concepts behind the data or only memorizing the training set.

Figure 19: Precision of classifications with the training data set in green and the test data set in red.

To do that we take a look at the evolution of the precision with training and test data. In figure 19 we can see that there is a point where the precision of the classifications stops improving; that means that our network is overfitting, or that some of the neurons are saturated. To avoid this we can try adding a normalization layer.
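A simple way to produce curves like those of figure 19 is to evaluate the same accuracy op periodically on a training batch and on a held-out test batch. The sketch below is only illustrative: the `accuracy`, `train_step`, `x` and `y_` tensors are assumed to exist as in the earlier snippets, and the batch-fetching helpers are hypothetical.

    # Illustrative evaluation loop; `sess`, `train_step`, `accuracy`, `x`, `y_`
    # and the batch helpers are assumed from the earlier sketches (hypothetical).
    for step in range(50000):
        batch_images, batch_labels = next_train_batch()   # hypothetical helper
        sess.run(train_step, feed_dict={x: batch_images, y_: batch_labels})

        if step % 1000 == 0:
            train_acc = sess.run(accuracy,
                                 feed_dict={x: batch_images, y_: batch_labels})
            test_images, test_labels = next_test_batch()  # hypothetical helper
            test_acc = sess.run(accuracy,
                                feed_dict={x: test_images, y_: test_labels})
            # A growing gap between train_acc and test_acc indicates overfitting.
            print(step, train_acc, test_acc)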

4.7 Data amount augmentation

In all those training procedures we used a routine implemented in the CIFAR-10 model which enlarges the dataset by randomly cropping a part of each image to get a new one. In the first round of experiments we cropped the 32x32 images into 24x24. In the image resolution augmentation round we cropped 64x64 images into 48x48. We can see that this data augmentation has a positive effect, because if we run the network with the 64x64 dataset without cropping, i.e. not using data augmentation, we drop again to 87% precision, so it indeed helped to have more examples obtained by distorting the original images.
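The kind of distortion described above can be reproduced with TensorFlow 1.x image ops; the snippet below is a sketch of random cropping (plus an optional flip) from a 64x64 image to 48x48, not the exact CIFAR-10 routine we reused.

    import tensorflow as tf

    # A single decoded training image, e.g. 64x64 RGB.
    image = tf.placeholder(tf.float32, [64, 64, 3])

    # Randomly crop a 48x48 patch and randomly flip it, producing a slightly
    # different training example every time the image is read.
    distorted = tf.random_crop(image, [48, 48, 3])
    distorted = tf.image.random_flip_left_right(distorted)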

4.8 Batch size

Initially we set a commonly used batch size, 128 examples per batch, but we wanted to see how it affects training time and classification precision. We compare the training evolution and results for the last network model, which consisted of two convolutional layers, each followed by pooling and normalization. In figure 20 we see that the loss takes slightly less oscillating and lower values from step 30,000 on. We can see it more clearly in a close-up of the last 1,000 steps of training in figure 21.

Figure 20: Training loss with batch size of 128 examples in blue and 64 in red.


Figure 21: Training loss on 1000 last steps for batch size of 128 examples in blue and
64 in red.

The big difference comes in training time, roughly halved for the lower batch size; we can see the comparison in minutes in figure 22.

Figure 22: Training loss depending on time for batch sizes 128 in blue and 64 in red.

The precision of the predictions stays at 89% in both cases, so the conclusion would be that for our dataset it is clearly better to set a smaller batch size such as 64 examples per batch. In future work we could try other batch sizes, but this one seems to produce a good enough result.
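Changing the batch size only changes how many examples are fed to each gradient step; a minimal NumPy sketch of such mini-batching is shown below, with the helper name and the arrays being illustrative assumptions rather than our pipeline code.

    import numpy as np

    BATCH_SIZE = 64  # compared against the initial value of 128

    def batches(images, labels, batch_size=BATCH_SIZE):
        """Yield shuffled mini-batches of the given size from numpy arrays."""
        order = np.random.permutation(len(images))
        for start in range(0, len(images), batch_size):
            idx = order[start:start + batch_size]
            yield images[idx], labels[idx]

    # Hypothetical usage with the training ops of the earlier sketches:
    # for batch_images, batch_labels in batches(train_images, train_labels):
    #     sess.run(train_step, feed_dict={x: batch_images, y_: batch_labels})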

4.9 Extracted features

Figure 23 shows some of the features extracted by our network, in this case the 64 kernels of the first convolutional layer (5x5 weights each). As usually happens with CNNs, we cannot know why a particular shape or kind of feature is learned during the process, but intuitively it looks like in some of them the network learns basic shapes that help to recognize the boundaries of the objects in the images.

Figure 23: Extracted kernels in first convolutional layer, in a model with 2 convolutional layers.
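A visualization like figure 23 can be obtained by reading the first-layer weights out of the session and saving each 5x5 kernel as a small image; the sketch below (using matplotlib) only illustrates the idea behind our get features.py and assumes a `kernel` variable like the one in the convolution sketch above.

    import numpy as np
    import matplotlib.pyplot as plt

    # weights = sess.run(kernel)  # shape [5, 5, 3, 64] for the first conv layer
    def save_kernels(weights, prefix='kernel'):
        """Rescale each 5x5x3 kernel to [0, 1] and save it as a tiny image."""
        for i in range(weights.shape[-1]):
            k = weights[:, :, :, i]
            k = (k - k.min()) / (k.max() - k.min() + 1e-8)
            plt.imsave('%s_%02d.png' % (prefix, i), k)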

For each of the previous features, we get a tensor of the shape of the image with the activations for that feature, as explained in section 1.5.1, since we use a padding that keeps the output with the same shape as the input.


Figure 24: Output of the first convolution layer for the model of section 4.4

For example, in figure 24 we can see the output of some images after convolution with the first extracted feature, for each color channel. If we take a look at a sample of the classifications of test images for this deeper model, we notice that the network is indeed understanding some patterns within the images. Let us see first a sample of correctly classified images:


Figure 25: Some of the pictures correctly classified by the last built model.

But let us also take a look at the errors to see the level of misunderstanding in some cases:

(a) Image classified as beer; label was coffee.
(b) Image classified as egg; label was croissant.
(c) Image classified as pizza; label was paella.

Figure 26: Some of the pictures wrongly classified by the last built model.

So, as we can see, there is still a lot to improve in our model: for example, in an MNIST classification by a state-of-the-art model, the errors mostly occur in images that are really hard to classify correctly even by a human eye, but here that is not the case. Of course our images are much more complicated than digits and the training set is smaller, but further work in the direction we are taking (more layers, better data quality) would surely let us achieve a much better result.

4.10 Retrain ImageNet Model

As we explained before, the last attempt was to use a pretrained network that recognizes the 1000 classes contained in the ImageNet dataset. To do this we use the Inception v3 model provided by TensorFlow, and this time we achieved a precision of 91.2% in only 17 minutes. So we can definitely state that knowing the patterns of many image classes greatly helps to learn new classes faster and better, as it happens with biological neural networks.
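For completeness, a retrained graph like the one produced by TensorFlow's retraining program can be loaded and queried roughly as below; this is only a sketch, and the file path and the tensor names ('final_result:0', 'DecodeJpeg/contents:0') are assumptions based on the standard retraining example rather than values taken from our runs.

    import tensorflow as tf

    # Load the frozen, retrained graph (hypothetical path).
    with tf.gfile.GFile('output_graph.pb', 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

    with tf.Session() as sess:
        # Tensor names assumed from the standard retraining example.
        softmax = sess.graph.get_tensor_by_name('final_result:0')
        image_data = tf.gfile.GFile('some_image.jpg', 'rb').read()
        predictions = sess.run(softmax,
                               feed_dict={'DecodeJpeg/contents:0': image_data})
        print(predictions)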
Summing up, we can see in figure 27 that, although transfer learning from a large previously trained model gives the best results, our last self-trained model was not far from it, and it helped us understand some insights of the network that would be more difficult to grasp without seeing and tuning the source code used to build the model.


Figure 27: Accuracy results for different network architectures

Conclusions and future work

The first thing that can be said is that this work was helpful for getting introduced to a family of algorithms that achieve amazing results in some particular problems, and now I can understand how they work. After various unexpected accuracy values and training times, we can say that, although some insights of convolutional neural networks are still not completely clear, they do work quite well for image classification. Using the feature extraction program we can visualize an idea of what convolution does, which is to learn edges, colors and points that help to learn and then identify the shapes of objects. One conclusion is that we cannot say that deeper networks are always going to give better classification results: we gradually improved the precision when adding layers, but it did not always happen, as we saw in the case of adding a normalization layer. Further work could investigate why this happened in our case. It is also observable that a higher image resolution is likely to improve classification quality, as we saw in the last results. When looking for documentation on this aspect, we saw that many of the parameters used in state-of-the-art networks, such as batch size or learning rate, are determined as we proceeded, by trial and error, so it remains an open question why a specific number of layers of a given depth works better. Another thing we observed is that, as expected, using a previously trained network that already classifies 1000 classes gives a slightly better accuracy for our dataset than the networks trained from scratch only with our data (91% vs 90%), while also consuming much less time. With this work, taking into account that we did not use a very large dataset, and with further reading, we could now say that object recognition in static images with CNNs is close to being a solved problem. A future line of work would be to continue exploring other applications, such as online user behavior prediction or improving speech recognition, or to look into which other research areas this family of machine learning algorithms can also be useful.


References
[1] McCulloch, Warren; Pitts, Walter (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics.
[2] Krizhevsky, Alex (2009). Learning Multiple Layers of Features from Tiny Images.
[3] Krizhevsky, A.; Sutskever, I.; Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks.
[4] Werbos, P. J. (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.
[5] Donahue, Jeff; Jia, Yangqing; Vinyals, Oriol; Hoffman, Judy; Zhang, Ning; Tzeng, Eric; Darrell, Trevor (2013). A Deep Convolutional Activation Feature for Generic Visual Recognition.
[6] Cybenko, George (1989). Approximation by Superpositions of a Sigmoidal Function.
[7] Smolensky, Paul (1986). Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory.
[8] Wan, Li; Zeiler, Matthew; Zhang, Sixin; LeCun, Yann; Fergus, Rob. Regularization of Neural Networks using DropConnect.
[9] Pathak, Deepak; Krähenbühl, Philipp; Donahue, Jeff; Darrell, Trevor; Efros, Alexei A. Context Encoders: Feature Learning by Inpainting.

