Effects of Pooling Multiple Different Rapidly Trained Convolutional Neural Networks on Scene Classification


Fernando Yordán
Massachusetts Institute of Technology
77 Mass Ave, Cambridge MA 02139
fyordan@mit.edu

Abstract
We explore the effects of pooling the results of multiple different rapidly trained convolutional neural networks, where each one is trained with different parameters, with or without data augmentation, and on different spatial frequency bands. We find not only that we can obtain better performance by training with augmented data and in a different frequency space, but also that appropriately weighting the results from each CNN leads to an overall, if slight, improvement in error rate over just using the best-performing CNN at our disposal.

1. Introduction
Since the early 2000s, scene recognition, as opposed to the more studied object recognition, has gathered considerable interest, in particular because of the importance of adding a layer of contextual meaning to image processing problems. For example, within a particular scene there is a likelihood associated with finding a given object in that scene at a particular instant of time [1].
This paper adds to the scene classification literature by exploring the effects of training multiple individual convolutional neural networks with different architectures and different pre-processing of the training data, and then pooling the results of these networks in a weighted ensemble. Additionally, we explore how reducing the number of parameters while adding an extra layer to the architecture of our base reference net, refNet1, affects the amount of training time required for it to converge. The goal of our research was to improve on the accuracy of refNet1 (~65% on validation data) while reducing the amount of time required to train the convolutional neural network.
Our results show that reducing the number of parameters of the neural network, training different neural networks on different spatial frequencies, and pooling the results together can achieve this. By the end of this project we obtained ~75% accuracy on validation data, an improvement of 10% over refNet1.

The paper is organized as follows: in Section 2 we detail some of the related work surrounding this problem and the inspiration for our strategy. In Section 3 we explain our approach. In Section 4 we present our results, and in Section 5 we make closing remarks and observations.

2. Related Work
One of the problems with complex convolutional neural networks is that they can be prone to over-fitting. One potential solution is to incorporate dropout [2], which helps reduce over-fitting. We do not pursue this in this project; instead, we pursue convolutional neural networks that can be rapidly trained through batch normalization. Batch normalization takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. This allows much higher learning rates to be used without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for dropout [3].
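As a rough illustration of the transform, here is a minimal numpy sketch of a batch normalization forward pass (our own illustration, not code from [3]): each feature is normalized with mini-batch statistics and then scaled and shifted by the learned parameters gamma and beta.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x:     (batch_size, num_features) activations
    # gamma: (num_features,) learned scale
    # beta:  (num_features,) learned shift
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

# Example: a batch of 200 activation vectors with 64 features.
x = np.random.randn(200, 64) * 3.0 + 5.0
y = batch_norm_forward(x, gamma=np.ones(64), beta=np.zeros(64))

At test time the mini-batch statistics are replaced by running averages collected during training, which is also why our networks expect batched inputs during evaluation (Section 3.1).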
In the aforementioned paper, Sergey Ioffe incorporates ensembles of convolutional neural networks to improve performance on ImageNet. Since appropriately weighting diverse neural networks has been shown to increase performance compared to single networks [4], we pursue this same strategy to see if we can obtain similar results, but on scene recognition instead of object detection.
We also incorporate convolutional neural networks trained on different spatial frequency bands, partly to diversify our ensemble, but also because computational studies have shown that different spatial scales offer different qualities of information for recognition purposes [5].
Together, these papers suggest that there is room for improvement on scene classification by using ensembles instead of individual convolutional neural networks and by training on different frequency spaces, and that with batch normalization we can reduce the amount of time required for training while also avoiding over-fitting. We explore these concepts in our project.

3. Approach
3.1 Architecture Design and Batch Normalization
Our initial approach was to train three different architectures: shallow, medium, and deep. However, we realized this was not feasible given both our limited understanding of how to tune the learning parameters and the computation time we had before the project deadline. We also initially attempted to use very large convolutional window sizes, between 15x15 and 11x11. These initial convolutional neural networks learned very slowly even with batch normalization, so we decided to use refNet1, the Mini Places Challenge base network, as a starting template. To reduce the amount of training time required, we set out to reduce the number of parameters without sacrificing performance. From this design goal, we built our two network architectures, potatoNet1 and potatoNet2, which are shown in Figure 1.
Both potatoNet1 and potatoNet2 replace refNet1's fully connected 6x6 layer with a 3x3 layer and a 4x4 fully connected layer. So although we are increasing the depth of our convolutional neural network, we are reducing the number of parameters by 2,359,296 - 147,456 - 1,048,576 = 1,163,264. PotatoNet2 goes one layer deeper by replacing refNet1's first 8x8 layer with two 4x4 layers, which adds 6,144 + 65,536 - 24,576 = 47,104 parameters to the network but still leaves it with 1,116,160 fewer parameters than refNet1. We therefore expected potatoNet1 to require the least training time, followed by potatoNet2, which should still be significantly faster than refNet1.

Figure 1: Architecture of our CNNs
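The parameter bookkeeping above can be checked in a few lines (a sketch using only the counts quoted in the text; recall that a convolutional layer with a k x k window, C_in input channels, and C_out output channels has k * k * C_in * C_out weights):

refnet_6x6 = 2_359_296        # refNet1's 6x6 layer
new_3x3 = 147_456             # replacement 3x3 layer
new_4x4 = 1_048_576           # replacement 4x4 fully connected layer
saved = refnet_6x6 - new_3x3 - new_4x4
print(saved)                  # 1,163,264 fewer parameters (potatoNet1)

refnet_8x8 = 24_576           # refNet1's first 8x8 layer
two_4x4 = 6_144 + 65_536      # the two replacement 4x4 layers
added = two_4x4 - refnet_8x8
print(added)                  # 47,104 extra parameters (potatoNet2)
print(saved - added)          # potatoNet2: still 1,116,160 fewer than refNet1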

Initially, our potatoNet architecture without batch normalization showed no learning despite multiple modifications to the learning parameters. Once we enabled batch normalization, however, we saw improvements immediately, before even finishing the first epoch. We therefore included batch normalization in both potatoNet1 and potatoNet2. The one thing that was not immediately clear to us, however, was that for analyzing the test images the inputs needed to be passed into the networks in batches. Once we realized this, we chose a batch size of 200 images during both the training and testing phases for consistency.
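A sketch of what that batched evaluation loop looks like (Python for illustration; our actual scripts were MATLAB, and net stands in for a trained model mapping an image batch to per-class scores):

import numpy as np

BATCH_SIZE = 200  # the batch size we used for both training and testing

def evaluate_in_batches(net, images):
    # images: (num_images, H, W, C); net returns (batch, num_classes) scores.
    # Feeding single images to a batch-normalized network is what
    # originally tripped us up, hence the fixed-size batches.
    scores = []
    for start in range(0, len(images), BATCH_SIZE):
        scores.append(net(images[start:start + BATCH_SIZE]))
    return np.concatenate(scores, axis=0)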
3.2 Different Spatial Frequencies
While for some of our neural networks we augmented the data by creating mirrored (flipped) copies of the data set, we focused our attention on transforming the original dataset into different frequency spaces (low, mid, and high frequency). The reason for this is that it is well accepted that augmenting data with scaling, crops, and flipping across the vertical axis improves performance in general, so we opted to study the effects of looking at the information that might be more prominent in a different frequency space, evaluating the performance against convolutional neural networks trained solely on the RGB space. To do so, we incorporated into our training scripts a function that would either augment the training batch with flipped images (if that neural network was one labeled for data augmentation) or transform the image batch into a batch in a different spatial frequency space using Difference of Gaussian filters (similar to 6.869 pset 1). For evaluation we applied the same filters to each batch if the network was one trained at a particular frequency; the networks trained on RGB images did not require additional pre-processing.
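A minimal sketch of the band-splitting and flipping steps (Python for illustration; the sigma values are our illustrative choices, not the exact ones from our scripts):

import numpy as np
from scipy.ndimage import gaussian_filter

def frequency_bands(img, sigma_low=4.0, sigma_mid=1.0):
    # img: float array (H, W, C). Differences of Gaussian-blurred copies
    # split the image into low/mid/high spatial-frequency bands.
    blur_low = gaussian_filter(img, sigma=(sigma_low, sigma_low, 0))
    blur_mid = gaussian_filter(img, sigma=(sigma_mid, sigma_mid, 0))
    low = blur_low             # coarse structure only
    mid = blur_mid - blur_low  # intermediate band
    high = img - blur_mid      # fine detail: edges and texture
    return low, mid, high

def augment_flip(batch):
    # batch: (N, H, W, C). Append copies flipped across the vertical
    # axis, doubling the batch.
    return np.concatenate([batch, batch[:, :, ::-1, :]], axis=0)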
3.3 Weighted Pooling of CNNs
Initially we ran all of our networks separately to obtain validation errors for each one, as well as validation errors for each specific category for each individual network. This means that we did not just evaluate how effective a given network was at classifying all 100 categories, but how effective each network was at classifying each given category. To obtain these weights, we ran a script that checks how many times a given scene was misclassified in the top-5 results for each network and generates an error rate from that. The weight assigned to a given category for a given network i is calculated according to:
$$ w_i = \frac{1/e_i}{\sum_j 1/e_j} $$

where $e_i$ corresponds to the error rate for that given category according to network $i$. The equation has
several convenient properties: first, the weights for each category across all networks sum to 1, and second, each weight is inversely proportional to its network's error rate. An edge case arises if at least one network has a zero error rate on a category, in which case we weight all networks with a zero error rate on that category equally and give a weight of zero to the rest.
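A sketch of this weight computation, including the zero-error edge case (error_rates is a hypothetical array of per-category validation error rates, one row per network):

import numpy as np

def category_weights(error_rates):
    # error_rates: (num_networks, num_categories) per-category top-5
    # error rates. Returns weights with each column summing to 1.
    num_nets, num_cats = error_rates.shape
    weights = np.zeros_like(error_rates, dtype=float)
    for c in range(num_cats):
        e = error_rates[:, c]
        zero = (e == 0)
        if zero.any():
            weights[zero, c] = 1.0 / zero.sum()  # perfect nets share equally
        else:
            inv = 1.0 / e
            weights[:, c] = inv / inv.sum()      # w_i = (1/e_i) / sum_j(1/e_j)
    return weights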
Having automated the validation-error-checking scripts and the weight-generation script, we coded a script, bakePotatoes.m, that takes a list of convolutional neural networks and a set of images as input, generates weights for each network, analyzes the scores for a given image (after applying any necessary pre-processing steps), and sums the weighted output of each network to produce a final top-5 results.txt file. From this file we ran our validation-error-checking code to evaluate the effect of pooling and weighting on our ensemble of convolutional neural networks.
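In spirit, the pooling step of bakePotatoes.m does the following (a Python sketch rather than the original MATLAB; all_scores is a hypothetical array of raw per-category scores from each network, after any per-network pre-processing):

import numpy as np

def pooled_top5(all_scores, weights):
    # all_scores: (num_networks, num_images, num_categories)
    # weights:    (num_networks, num_categories), e.g. from category_weights()
    # Scale each network's score for a category by that network's weight
    # for that category, sum across networks, and take the top-5 labels.
    pooled = (all_scores * weights[:, None, :]).sum(axis=0)
    return np.argsort(pooled, axis=1)[:, ::-1][:, :5]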
Figure 2 shows a diagram of our top-performing ensemble following this weighting scheme.

Figure 2: Pooling net structures. The name of each of the three CNNs describes the data on which it was trained (and, in the case of high frequency, the filter applied to the input image). The CNNs are later pooled together in a weighted ensemble to determine the top-5 labels.

4. Experimental Results
We managed to train nine convolutional neural networks with our potatoNet architecture, each of which converged around epoch 6 (roughly 30 minutes of training). Some of these networks varied only in the learning rates used during training, while others were trained with augmented data (the flipped images) or with training images passed through a band-pass filter. The worst-performing convolutional neural networks had a top-5 error rate on the validation data similar to that of our initial reference net and baseline, refNet1 (~35%). However, refNet1 converged only after 30 epochs and took approximately 5 hours to train. Table 1 shows the top-5 validation error for each of our networks. We also show the error-versus-epoch plots for refNet1 and potatonet-normal_flipped2 (Figure 3); the plots for the other networks look very similar to that of potatonet-normal_flipped2, differing only in the final error value.

Name of CNN                         Top-5 Validation Error
refNet1                             35%
potatonet-normal                    34.50%
potatonet-flipped                   36.35%
potatonet-normal_flipped            31.68%
potatonet-normal_flipped2           28.57%
potatonet-lowFrequency              37.40%
potatonet-midFrequency              32.94%
potatonet-highFrequency             31.13%
potatonet-highFrequency2            28.81%
potatonet-highFrequency_flipped2    26.85%

Table 1: The name of each CNN indicates the data (normal, flipped, normal+flipped, filtered, filtered+flipped) used to train it; a trailing 2 indicates the potatoNet2 architecture.
Figure 3: Error-versus-epoch plots for refNet1 (left) and potatonet-normal_flipped2 (right): better performance with less training.

From Table 1 we see that just by using our potatoNet architecture we can obtain performance similar, if not equal, to refNet1, even though our networks required fewer epochs and less time to train. Unsurprisingly, we also saw that just by adding augmented data (in this case the flipped images) to the training set we obtained an improvement in validation error. Interestingly, we also found that training on the high-frequency data of the training set, as well as on the mid-frequency data, provided an improvement over training on the normal data alone. The best-performing convolutional neural network was potatonet-highFrequency_flipped2, which was trained with the high-frequency data of the training set and its flipped variants. This network outscored refNet1 by around 7% and had only a 27.6% top-5 error rate on the testing data.
We also found that pooling the results from multiple networks, properly weighting the score from each network for each category, and then obtaining a final output can provide an improvement over just using our top-performing individual convolutional neural network. Table 2 shows the combinations of neural networks we tested.
Name of Set                 CNNs                                                                  Val Top-5 Error
Top 2 Above 30%             highFrequency, normal_flipped                                         29.09% (~2% improvement)
Top 2                       normal_flipped2, highFrequency_flipped2                               25.95% (~1% improvement)
Top 3                       highFrequency2, highFrequency_flipped2, normal_flipped2               25.78% (~1% improvement)
Frequencies and RGB space   lowFrequency, midFrequency, highFrequency_flipped2, normal_flipped2   26.55% (~0.3% improvement)
Top 5                       highFrequency_flipped2, normal_flipped2, highFrequency2,              25.86% (~1% improvement)
                            highFrequency, midFrequency
All                         All 9 CNNs                                                            26.09% (~0.75% improvement)

Table 2: Pooling multiple CNNs together consistently showed an improvement over the best-performing CNN in the same set. However, it was not true that simply adding more CNNs to the pool improved results.

The best performance we observed, for both individual networks and pooled networks, came from pooling potatonet-highFrequency2, potatonet-highFrequency_flipped2, and potatonet-normal_flipped2. The pooled results of these three networks had a top-5 validation error of 25.78%, about one percentage point more accurate than our best individual network (potatonet-highFrequency_flipped2) and almost 10% more accurate than our initial refNet1 baseline. Ultimately, we used this set for our final submission to the Mini Places Challenge, scoring 26.5% top-5 error on the testing data, an improvement of 1% over our submission using only our best-performing convolutional neural network.

5. Conclusion
We can conclude the following. Batch normalization can have a very significant impact on the amount of time required to train a convolutional neural network. Without batch normalization we found it very difficult to choose the correct learning rates for our convolutional neural networks, and we only saw improvement (which was drastic) with batch normalization enabled. This, coupled with parameter reduction, can significantly reduce the time required to train a convolutional neural network without sacrificing performance. It also allows us to consider deeper models, since we do not need to be as concerned with learning rates as we would be without batch normalization.
We also conclude that training convolutional neural networks on different spatial frequencies shows promise and can provide an improvement over training on a normal RGB image alone. One possible explanation is that the band-pass filters remove values from the image that are unimportant for classification but otherwise noisy and liable to cause misclassification. Another possible explanation is that certain scenes have more prominent features in only one specific frequency band, so the improved performance was not a matter of a better convolutional neural network but of the validation and testing data containing more categories where this was the case. Both of these hypotheses are speculation, but our results do show that training on different frequency bands yielded different performance, with the high-frequency-trained neural network having the best overall performance.
Finally, we conclude that pooling results from multiple convolutional neural networks can provide a performance boost. However, it is important to note that our results do not show a trend whereby adding more neural networks to the pool improves performance; rather, in general, multiple well-trained convolutional networks performed better when weighted together than individually. This is probably because different convolutional neural networks may be strong at classifying different types of scenes, so an appropriate weighting scheme can improve performance, but it is ultimately limited by the diversity and accuracy of the individual convolutional neural networks in the pool. Therefore, the Holy Grail for scene classification may not be a single convolutional neural network that is great at classifying many categories of scenes, but a set of smaller, accurate, and diverse networks that, when pooled together, perform with great accuracy.

References
[1] A. Quattoni and A. Torralba. Recognizing indoor scenes. CVPR, 2009.
[2] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15:1929-1958, 2014.
[3] S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[4] P.M. Granitto, P.F. Verdes, and H.A. Ceccatto. Neural network ensembles: evaluation of aggregation algorithms. Artificial Intelligence, 163(2):139-162, April 2005.
[5] A. Oliva and A. Torralba. Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research: Visual Perception, pp. 23-36, 2006.
