Abstract
We explore the effects of pooling the results of multiple
rapidly trained convolutional neural networks, where each
one is trained with different parameters, with or without
data augmentation, and on different spatial frequency
bands. We find not only that we can obtain better
performance by training with augmented data and in a
different frequency space, but also that appropriately
weighting the results from each CNN leads to a slight
overall improvement in error rate over simply using the
best-performing CNN at our disposal.
1. Introduction
Since 2001, scene recognition, as opposed to the more
widely studied object recognition, has gathered considerable
interest, in particular because of the importance of adding
a layer of contextual meaning to image processing
problems. For example, within a particular scene there is
a likelihood associated with finding a given object in that
scene at a particular instant in time [1].
This paper adds to the literature on scene classification
by exploring the effects of training multiple individual
convolutional neural networks with different architectures
and different pre-processing of the training data, and
finally pooling the results of these networks in a weighted
ensemble. We also explore how reducing the number of
parameters while adding an extra layer to the architecture
of our base reference net, refNet1, affects the amount of
training time required for convergence. The goal of our
research was to improve on the accuracy of refNet1
(~65% on validation data) while reducing the amount of
time required to train the network.
Our results show that reducing the number of
parameters of the network, training different networks
on different spatial frequency bands, and pooling the
results together can achieve this. By the end of this project
we obtained ~75% accuracy on validation data, an
improvement of 10 percentage points over refNet1.
2. Related Work
One of the problems with complex convolutional neural
networks is that they can be prone to over-fitting. One
potential solution is dropout [2], which helps reduce
over-fitting. We do not pursue dropout in this project;
instead we pursue convolutional neural networks that can
be rapidly trained through batch
normalization. Batch normalization takes a step towards
reducing internal covariate shift, and in doing so
dramatically accelerates the training of deep neural nets.
This allows us to use much higher learning rates without
the risk of divergence. Furthermore, batch normalization
regularizes the model and reduces the need for dropout
[3].
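The per-batch transformation at the heart of batch normalization can be sketched in a few lines. This is a minimal per-feature version in plain Python; it ignores the running statistics a real implementation keeps for inference, and the gamma/beta values are just the untrained defaults:

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of activations to zero mean and unit
    variance, then apply a learned scale (gamma) and shift (beta)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# One feature's activations across a mini-batch of four examples:
acts = [2.0, 4.0, 6.0, 8.0]
normed = batch_norm(acts)  # zero mean, (approximately) unit variance
```

Because each layer then sees inputs with a stable distribution regardless of how earlier layers shift during training, much larger learning rates become safe.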
In the aforementioned paper, Ioffe and Szegedy
incorporate ensembles of convolutional neural networks to
improve performance on ImageNet. Since appropriately
weighting diverse neural networks has been shown to
increase performance compared to single networks [4], we
pursue the same strategy to see if we can obtain similar
results, but on scene recognition instead of object
detection.
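The weighted pooling we pursue amounts to a weighted average of each network's per-class probabilities. A minimal sketch, in which the two networks' outputs and their weights are hypothetical numbers chosen for illustration:

```python
def weighted_ensemble(prob_lists, weights):
    """Combine per-class probability vectors from several networks
    by a weighted average; weights are normalized to sum to 1."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
            for c in range(n_classes)]

# Hypothetical softmax outputs over 3 scene classes from two nets,
# with the stronger net given twice the weight of the weaker one:
net_a = [0.6, 0.3, 0.1]
net_b = [0.2, 0.5, 0.3]
pooled = weighted_ensemble([net_a, net_b], weights=[2.0, 1.0])
```

Because the weights are normalized, the pooled vector is still a valid probability distribution, and the predicted class is simply its argmax.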
We also incorporate convolutional neural networks
trained on different spatial frequency bands, partly to
diversify our ensemble, but also because computational
studies have shown that different spatial scales offer
different qualities of information for recognition
purposes [5].
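To illustrate what a band-pass filter keeps, here is a deliberately simple 1D sketch built from a difference of two moving averages (a difference-of-boxes filter). This is only an approximation of band-pass filtering, and the radii below are hypothetical, not the ones used in our pre-processing:

```python
def box_smooth(signal, radius):
    """Moving-average smoothing: a crude low-pass filter."""
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def band_pass(signal, small_radius, large_radius):
    """Subtracting a heavily smoothed copy from a lightly smoothed
    copy keeps only the frequencies between the two cut-offs."""
    low = box_smooth(signal, large_radius)
    mid = box_smooth(signal, small_radius)
    return [m - l for m, l in zip(mid, low)]

# A flat (zero-frequency) signal is removed entirely by the filter:
flat_response = band_pass([5.0] * 20, small_radius=1, large_radius=4)
```

The same idea applies per-pixel in 2D: low frequencies capture the coarse layout of a scene, while high frequencies capture edges and texture.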
Taken together, these papers suggest that there is room
for improvement in scene classification by using ensembles
instead of individual convolutional neural networks and by
training on different frequency spaces, and that with batch
normalization we can reduce the amount of time required
for training while also avoiding over-fitting. We explore
these concepts in our project.
3. Approach
3.1 Architecture Design and Batch Normalization
Our initial approach was to train three different
architectures: shallow, medium, and deep. However, we
realized this was not feasible given both our limited
understanding of how to tune the learning parameters and
the amount of computation time available before the
project deadline. We also initially attempted to use very
large convolutional window sizes, between 15x15 and
11x11. These initial convolutional neural networks learned
very slowly even with batch normalization, so we decided
to use refNet1, the Mini Places Challenge base network, as
a starting template. To reduce training time, we set out to
reduce the number of parameters without sacrificing
performance. From this design goal we built our two
network architectures, potatoNet1 and potatoNet2, which
are shown in the next figure.
Both potatoNet1 and potatoNet2 replace refNet1's fully
connected 6x6 layer with a 3x3 layer and a 4x4 fully
connected layer. So although we increase the depth of the
network, we reduce the number of parameters by
2,359,296 − 147,456 − 1,048,576 = 1,163,264. PotatoNet2
goes one layer deeper by replacing refNet1's first 8x8
layer with two 4x4 layers, which adds 6,144 + 65,536 −
24,576 = 47,104 parameters but still leaves the network
with 1,116,160 fewer parameters than refNet1. We
therefore expected potatoNet1 to require the least training
time, followed by potatoNet2, which should still be
significantly faster than refNet1.
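This bookkeeping can be checked mechanically. The weight count of a convolutional layer is in_channels × out_channels × k × k; the 256-channel configuration shown for the 6x6 layer is our illustrative assumption of one way to reach the quoted figure, not a claim about refNet1's exact channel counts:

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolutional layer (biases ignored)."""
    return in_ch * out_ch * k * k

# A hypothetical 256-in/256-out configuration reproduces the quoted
# size of the 6x6 layer being removed:
six_by_six = conv_params(256, 256, 6)           # 2,359,296

# potatoNet1: swap the 6x6 layer for a 3x3 and a 4x4 layer.
saving_net1 = 2_359_296 - 147_456 - 1_048_576   # 1,163,264 fewer

# potatoNet2: additionally split the first 8x8 layer into two 4x4 layers.
extra_net2 = 6_144 + 65_536 - 24_576            # 47,104 more

net_saving_2 = saving_net1 - extra_net2         # 1,116,160 fewer than refNet1
```

Fewer weights means fewer gradients to compute and less memory traffic per step, which is where the expected training-time savings come from.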
[Table: top-5 validation error for each trained network (row
labels lost in extraction): 35%, 34.50%, 36.35%, 31.68%,
28.57%, 37.40%, 32.94%, 31.13%, 28.81%, 26.85%.]
4. Experimental Results
We trained 9 convolutional neural networks with our
potatoNet architectures, each converging around epoch 6
(around 30 minutes of training). Some of these networks
varied only in the learning rates used to train them, while
others were trained with augmented data (the flipped
images) or with training images passed through a band-pass
filter. The worst-performing networks had a top-5 error
rate on the validation data similar to that of our initial
reference net and baseline, refNet1 (~35%). However,
refNet1 converged only after 30 epochs and took
approximately 5 hours to train. The table above lists the
top-5 validation error for each of our networks.
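The flip augmentation mentioned above simply mirrors each training image horizontally, doubling the effective training set at no labeling cost. A minimal sketch on an image stored as nested lists of pixel values:

```python
def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

img = [[1, 2, 3],
       [4, 5, 6]]
flipped = hflip(img)  # [[3, 2, 1], [6, 5, 4]]
```

Flipping is safe for scene labels (a mirrored kitchen is still a kitchen), which is why it is a standard first augmentation to try.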
5. Conclusion
We can conclude the following. Batch normalization
can have a very significant impact on the amount of time
required to train a convolutional neural network. Without
batch normalization we found it very difficult to choose
the correct learning rates for our networks, and we only
saw improvement (which was drastic) once batch
normalization was enabled. Coupled with parameter
reduction, this can significantly reduce training time
without sacrificing performance. It also allows us to
consider deeper models, since we won't need to be as
concerned with the learning rates as we would without
batch normalization.
We also conclude that training convolutional neural
networks on different spatial frequency bands shows
promise and can provide an improvement over training
on a plain RGB image alone. One possible explanation is
that the band-pass filters remove values from the image
that are unimportant for classification but are otherwise
noisy and may cause misclassification. Another possible
explanation is that certain scenes have more prominent
features in one specific frequency band, so the improved
performance was not a matter of a better network but of
the validation and testing data containing more categories
where this was the case. Both of these hypotheses are
speculation, but our results do show that training on
different frequency bands yielded different performance,
with the high-frequency-trained network having the best
overall performance.
Finally, we also conclude that pooling results from
multiple convolutional neural networks can provide a
performance boost. However, it is important to note that
our results do not show a trend in which adding more
networks to the pool improves performance; rather, in
general, multiple (well trained) convolutional networks
performed better when weighted together than
individually. This is probably because different networks
may be strong at classifying different types of scenes, so
an appropriate weighting scheme can improve
performance, though it is ultimately limited by the
diversity and accuracy of the individual networks in the
pool.
Therefore, the Holy Grail for scene classification may
not be a single convolutional neural network great at
classifying many categories of scenes, but a set of smaller,
accurate, and diverse networks that, when pooled together,
perform with great accuracy.
References
[1] A. Quattoni and A. Torralba. Recognizing indoor scenes.
CVPR, 2009.
[2] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov. Dropout: a simple way to prevent neural
networks from overfitting. The Journal of Machine
Learning Research, 15:1929-1958, 2014.
[3] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating
Deep Network Training by Reducing Internal Covariate
Shift. CoRR, abs/1502.03167, 2015.
[4] P.M. Granitto, P.F. Verdes, and H.A. Ceccatto. Neural
network ensembles: evaluation of aggregation algorithms.
Artificial Intelligence, 163(2):139-162, April 2005.
[5] A. Oliva and A. Torralba. Building the Gist of a Scene: The
Role of Global Image Features in Recognition. Progress in
Brain Research, 155:23-36, 2006.