j,ikR
inMapB
2
cji
)
. (4)
F. Full-Link Layer
We define the bundle of a convolution layer, a max-pooling
layer, a nonlinear layer, and a normalization layer as a stage. A
CNN usually has several stages. After passing through
several stages, raw images are converted into many
low-resolution feature maps. These small feature maps are
concatenated into one long vector. This vector plays the same
role as hand-coded features and is fed to a one-hidden-layer
neural network. The feature vector is fully connected to the
hidden layer.
If the feature vector is an n-element column vector, we usually
append a constant 1 to the end of the vector, making it an
(n + 1)-element vector denoted by inMap. If we set m hidden neurons, the
parameters of a full-link layer form an m × (n + 1) matrix W,
and the outMap follows (5). We choose m by cross-validation.
Thus
$$\mathrm{outMap} = W \cdot \mathrm{inMap}. \tag{5}$$
The outMap is an m-element column vector. A full-link
layer follows the final normalization layer and is followed by
a nonlinear layer.
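The forward computation of a full-link layer can be sketched in a few lines of Python. This is a minimal illustration, not the authors' GPU implementation; the function name `full_link_forward` and the toy weights are assumptions made for the example.

```python
def full_link_forward(weights, feature_vector):
    """Compute outMap = (parameter matrix) * inMap for a full-link layer.

    weights: m rows, each with n + 1 entries (the last entry multiplies
             the appended constant 1).
    feature_vector: the n-element feature vector before the constant is appended.
    """
    in_map = list(feature_vector) + [1.0]  # append the constant 1 -> n + 1 elements
    for row in weights:
        if len(row) != len(in_map):
            raise ValueError("each weight row must have n + 1 entries")
    # One dot product per hidden neuron gives the m-element outMap.
    return [sum(w * x for w, x in zip(row, in_map)) for row in weights]


# Toy example (hypothetical values): n = 2 features, m = 2 hidden neurons.
toy_weights = [[1.0, 0.0, 0.5],
               [0.0, 2.0, -1.0]]
out_map = full_link_forward(toy_weights, [3.0, 4.0])
```

Appending the constant 1 lets the last column of the matrix act as the bias, so no separate bias vector is needed.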
G. Soft-Max Layer
The soft-max layer is the last layer of a CNN. It is a multiclass
logistic regression. For a k-class classification task, the
inMap is a k-element vector with the kth element fixed to
0. The probability of the jth class is given by (6), in which inMap(i)
is the ith element of inMap. The k probabilities sum to
1, so only k − 1 elements of inMap are independent, which is
why the last element is fixed to 0. Thus
$$P(\mathrm{inMap} \in j\mathrm{th} \mid \mathrm{inMap}) = \frac{\exp(\mathrm{inMap}(j))}{\sum_{i=1}^{k} \exp(\mathrm{inMap}(i))}. \tag{6}$$
In the learning procedure, the cost function is the negative
log-likelihood
$$J = -\sum_{i=1}^{k} 1\{\mathrm{inMap} \in i\mathrm{th}\} \log P(\mathrm{inMap} \in i\mathrm{th} \mid \mathrm{inMap}). \tag{7}$$
In the test procedure, the test example is classified according
to the position of the maximum probability.
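As a small illustration of the soft-max probability, the negative log-likelihood cost, and the test-time decision rule, the following Python sketch may help. The function names are chosen for this example only, and subtracting the maximum score is a common numerical-stability step that is not mentioned in the text.

```python
import math


def soft_max(in_map):
    """Class probabilities; the last element of in_map is expected to be 0."""
    shift = max(in_map)  # subtracting the maximum avoids overflow, result unchanged
    exps = [math.exp(v - shift) for v in in_map]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(in_map, true_class):
    """Cost for one example: only the true class's indicator term is non-zero."""
    return -math.log(soft_max(in_map)[true_class])


def predict(in_map):
    """Test-time rule: the position of the maximum probability."""
    probs = soft_max(in_map)
    return max(range(len(probs)), key=probs.__getitem__)


scores = [2.0, 1.0, 0.0]  # toy 3-class inMap with the last element fixed to 0
probs = soft_max(scores)
cost = negative_log_likelihood(scores, 0)
```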
H. Architecture Selection
This section introduces our whole architecture built from the
above layers and explains how we select it. According
to learning theory, if the architecture has too much capacity,
it tends to overfit the training data and generalizes poorly.
If the model has too little capacity, it underfits the training data,
and both the training error and the test error are high. Hence,
choosing the architecture means choosing a model with matching
capacity. There are two ways to achieve this goal.
The first is the forward method. At first, only one stage
is added and the performance is evaluated. Then, a second
stage is added and the model is evaluated again. As the CNN
becomes deeper, the test accuracy first increases and then gradually
converges. When the accuracy is high enough, the model stops
going deeper, because deeper models are much harder to train
and run very slowly.
The second is the backward method. At first, a model big
enough to drive the training error to zero is established,
and then several techniques are used to
reduce the model's capacity. An obvious technique is to reduce
the depth of the model or the size of each layer. Other
techniques, such as weight decay regularization and dropout
[13], have proved very powerful.
When choosing the architecture, we use a greedy strategy.
At first, we choose a default architecture and then use
cross-validation to optimize the hyperparameters one by one. We
divide the training data into five folds; the score of a
parameter setting is the average of the five results. The data
must be divided carefully to ensure that the four training folds and
the one validation fold come from different physical traffic signs.
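One way to keep images of the same physical sign out of both the training folds and the validation fold is to assign whole groups of images to folds. The sketch below assumes each image carries an identifier of the physical sign it was photographed from (called `track_ids` here, a hypothetical name); the balancing heuristic is an illustration, not the authors' procedure.

```python
from collections import defaultdict


def group_folds(track_ids, num_folds=5):
    """Assign every example to a fold so that no physical sign spans two folds.

    track_ids[i] is a hypothetical identifier of the physical sign that
    example i shows.
    """
    examples_by_track = defaultdict(list)
    for index, track in enumerate(track_ids):
        examples_by_track[track].append(index)

    folds = [[] for _ in range(num_folds)]
    # Largest groups first, each into the currently smallest fold, to balance sizes.
    for track in sorted(examples_by_track, key=lambda t: -len(examples_by_track[t])):
        smallest = min(folds, key=len)
        smallest.extend(examples_by_track[track])
    return folds


def cross_validation_splits(folds):
    """Yield (training indices, validation indices), one pair per fold."""
    for k, validation in enumerate(folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation


track_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
folds = group_folds(track_ids)
```

A hyperparameter setting would then be scored by averaging the validation accuracy over the five splits.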
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
4 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS
TABLE I
LAYER SIZE SELECTION
TABLE II
FULL-LINK LAYER HIDDEN NEURONS SELECTION
TABLE III
CONVOLUTION LAYER KERNELSIZE SELECTION
TABLE IV
MAX-POOLING LAYER KERNELSIZE SELECTION
TABLE V
NORMALIZATION LAYER KERNELSIZE SELECTION
Layer Size Selection: After referring to several architectures
[13], [14], [16], we choose a default architecture with three
stages. Then, we change the layers' sizes and evaluate their
performance. The max-pooling layer decreases resolution. To
prevent information from being lost too quickly, we often
add more feature maps after pooling. In Table I, 20-40-60
means the first stage has 20 feature maps, the second 40, and
the third 60.
We choose 100-150-250 as the default. According to
cross-validation results, 70-110-180 is best.
Full-Link Layer Hidden Neurons Selection: We set 300 hidden
neurons in the full-link layer as the default. According to
cross-validation results, 200 is best (see Table II).
Convolution Layer KernelSize Selection: We set the convolution
layer kernelSize to 3-3-3 as the default, which means the first
stage's convolution kernelSize is 3, the second 3, and the third
3. According to cross-validation results, 7-7-7 converges too
slowly and 5-5-5 has the same best result as 5-3-3. However,
5-3-3 is a smaller model, which trains and predicts more quickly
(see Table III).
Max-Pooling Layer KernelSize Selection: A max-pooling
layer downsamples inMaps by a factor of stride². Downsampling
should not be too fast; hence, we fix stride to be
2. We set the max-pooling layer kernelSize to 3 as the default.
According to cross-validation results, 3 is the best (see Table IV).
Normalization Layer KernelSize Selection: We set the
normalization layer kernelSize to 5 as the default. According to
cross-validation results, 5 is the best (see Table V).
TABLE VI
NORMALIZATION LAYER β SELECTION
TABLE VII
NORMALIZATION LAYER k SELECTION
TABLE VIII
NORMALIZATION LAYER α SELECTION
TABLE IX
ARCHITECTURE OF OUR CNN
Normalization Layer Parameters Selection: A normalization
layer has parameters α, k, and β. We choose them one
by one.
We set β = 0.75 as the default. If β is greater than 2, the model
converges too slowly. According to cross-validation results,
0.75 is the best (see Table VI).
We set k = 2 as the default. According to cross-validation
results, 2 is the best (see Table VII).
We set α = 1e−4 as the default. According to cross-validation
results, 1e−4 is the best (see Table VIII).
Whole Architecture: After cross-validation selection, our
CNN has 17 layers in total; details are shown in Table IX
and Fig. 4. In Table IX, "chan" is short for channel, "kS" for
kernelSize, "std" for stride, and "0-B" for zero borders.
Our architecture has 1 162 284 parameters to learn. These
parameters come from the convolution layers and the full-link layers. The
former learn feature detectors and the latter learn a classifier.
III. TRAINING
The CNN has higher capacity than flat models. Training
such models requires a large amount of data. Hence, batch
learning is rather slow, and stochastic gradient descent (SGD) is preferred.
JIN et al.: TRAFFIC SIGN RECOGNITION WITH HINGE LOSS TRAINED CONVOLUTIONAL NEURAL NETWORKS 5
Fig. 4. Our CNN's architecture: inMap → C1 → M1 → Non1 → Nor1 → C2 → M2 → Non2 → Nor2 → C3 → M3 → Non3 → Nor3 → F4 → Non4 → F5 → S6.
Fig. 5. Structure of a layer.
A. Data Augmentation
Data augmentation is a technique used to enlarge the training
data set and give a CNN better generalization. It is widely
used when training CNNs [13], [14], [16]. For the TSR task, each
traffic sign in the training set is jittered in one of the following
three ways. The first is randomly translating the traffic sign
by up to 10% of the image size. The second is randomly rotating
the image by an angle between −5° and 5°. The third is
randomly scaling the traffic sign by a factor between 0.9 and
1.1. These jitters generate new training
examples that differ slightly from the original training
data but still clearly carry the same label.
Data augmentation is performed only on training data, and the
augmented data are all added to the training set.
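The random choice of jitter can be sketched as follows. The sketch only draws the jitter type and its parameter; applying the transform to pixels is left to an image library and is not shown. The 48-pixel image size, the symmetric translation range, and the function name `sample_jitter` are assumptions made for this example.

```python
import random


def sample_jitter(image_size, rng):
    """Draw one of the three jitters with its random parameter.

    Returns (kind, value): a pixel offset pair for "translate", an angle in
    degrees for "rotate", or a scale factor for "scale".
    """
    kind = rng.choice(["translate", "rotate", "scale"])
    if kind == "translate":
        limit = 0.10 * image_size          # assumed: up to 10% in either direction
        value = (rng.uniform(-limit, limit), rng.uniform(-limit, limit))
    elif kind == "rotate":
        value = rng.uniform(-5.0, 5.0)     # degrees
    else:
        value = rng.uniform(0.9, 1.1)      # scale factor
    return kind, value


rng = random.Random(0)                     # fixed seed for a reproducible sketch
jitters = [sample_jitter(48, rng) for _ in range(1000)]
```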
B. BP
The CNN is a deep neural network. We use the backpropagation
(BP) algorithm to train the whole model. When performing
BP, each layer has the structure shown in
Fig. 5. The cost function is calculated according to (7). Then,
∂J/∂inMap of the soft-max layer is calculated. ∂J/∂Map has
the same shape as Map, and each element (∂J/∂Map)_{cxy} is
just ∂J/∂(Map_{cxy}). We define the input image as the "most
previous" layer and the soft-max layer as the "most next"
layer. For each layer, its ∂J/∂outMap is exactly the next
layer's ∂J/∂inMap. Once ∂J/∂outMap is calculated, ∂J/∂inMap
and ∂J/∂parameters are calculated from this layer's inMap,
outMap, parameters, and mathematical rules.
∂J/∂inMap is used to backpropagate derivatives to previous
layers, and ∂J/∂parameters are the gradients of the parameters
that are used in SGD. Different layers follow different
mathematical rules; hence, the partial derivatives must be
derived separately and carefully for each layer type.
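To make the per-layer rule concrete, the sketch below shows the backward step for a full-link layer whose forward rule is a matrix-vector product. Given the derivative with respect to outMap, it returns the derivatives with respect to inMap and to the parameters. The function name and toy numbers are illustrative assumptions, not the authors' code.

```python
def full_link_backward(weights, in_map, grad_out_map):
    """Backward step for outMap = weights * inMap.

    weights: m rows of n + 1 entries; in_map: n + 1 elements;
    grad_out_map: dJ/d(outMap), m elements.
    Returns (dJ/d(inMap), dJ/d(weights)).
    """
    m, width = len(weights), len(in_map)
    # dJ/d(inMap)[i] = sum over neurons r of weights[r][i] * dJ/d(outMap)[r]
    grad_in_map = [sum(weights[r][i] * grad_out_map[r] for r in range(m))
                   for i in range(width)]
    # dJ/d(weights)[r][i] = dJ/d(outMap)[r] * inMap[i]
    grad_weights = [[grad_out_map[r] * in_map[i] for i in range(width)]
                    for r in range(m)]
    return grad_in_map, grad_weights


grad_in, grad_w = full_link_backward([[1.0, 2.0], [3.0, 4.0]],
                                     [5.0, 6.0],
                                     [1.0, -1.0])
```

When the last element of inMap is the appended constant 1, its entry in the returned input gradient is simply not passed on to the previous layer.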
TABLE X
λ SELECTION ON ALL PARAMETERS
TABLE XI
λ SELECTION ON PART PARAMETERS
Once all the ∂J/∂parameters are calculated, the learning
rule is (8), where η is the learning rate and λ · parameters
is the weight decay term. Thus
$$\mathrm{parameters} := \mathrm{parameters} - \eta \left( \frac{\partial J}{\partial\, \mathrm{parameters}} + \lambda \cdot \mathrm{parameters} \right). \tag{8}$$
Before training, all parameters are initialized from a uniform
distribution over [−ε, ε] so that symmetry is broken. We set
ε = 0.1 according to LeCun et al. [22]. Choosing the learning
rate is a little tricky. We usually try learning rates across
orders of magnitude, such as 1e−1, 1e−2, 1e−3, .... If η is
too big, training does not converge. If η is too small,
convergence is rather slow. We start from a large η that does not
diverge (we use 0.001) and gradually decrease it during
training.
We consider two ways of applying λ. If λ acts on
all the parameters, including those of convolution and full-link layers,
then according to cross-validation, 1e−1 is the best, and 1e1 converges
too slowly (see Table X).
If λ acts only on the parameters of full-link layers, the
cross-validation results are shown in Table XI. 1e1 converges
too slowly.
Finally, we choose λ = 1e−1 on all parameters.
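The initialization and the update rule can be sketched as below. The sketch assumes the common form of SGD with weight decay, in which the decay term is added to the gradient before scaling by the learning rate; the function names and the toy values are illustrative only.

```python
import random


def init_parameters(count, epsilon, rng):
    """Uniform initialization in [-epsilon, epsilon] to break symmetry."""
    return [rng.uniform(-epsilon, epsilon) for _ in range(count)]


def sgd_step(parameters, gradients, learning_rate, weight_decay):
    """One SGD update; the decay term pulls every parameter toward zero."""
    return [p - learning_rate * (g + weight_decay * p)
            for p, g in zip(parameters, gradients)]


initial = init_parameters(5, 0.1, random.Random(0))
updated = sgd_step([1.0, -2.0], [0.5, 0.0], learning_rate=0.1, weight_decay=0.1)
```

With a zero gradient, as for the second toy parameter, the update still shrinks the parameter slightly, which is the regularizing effect of weight decay.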
C. Hinge Loss
Fig. 6. Horizontal axis is predicted y, denoted by ŷ, and vertical axis is cost. (Top two plots) SVM hinge loss. (Bottom left plot) Cross-entropy. (Bottom right plot) Hinge loss cross-entropy.
We propose a hinge loss cost function method to train CNNs,
which gives better test accuracy and faster, more stable convergence.
Hinge loss is usually used to train large-margin classifiers such
as SVMs. Its expression is (9), where y is the true label and
ŷ is the predicted label, and its plot is shown in Fig. 6. With hinge
loss, when label y > 0, if ŷ > 1, the training example is rightly
classified and has a margin beyond zero; the cost is then zero
and learning stops. If ŷ < 1, the margin is not big enough or
the training example may even be classified incorrectly;
the cost is then positive and the model learns. If label y < 0, the
analysis is similar. Thus
$$\mathrm{cost} = I(y > 0)\max(0, 1 - \hat{y}) + I(y < 0)\max(0, 1 + \hat{y}). \tag{9}$$
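The SVM hinge cost is easy to evaluate directly. The following minimal Python sketch takes the true label and the real-valued prediction; the function name `hinge_cost` is chosen for this example.

```python
def hinge_cost(label, prediction):
    """SVM hinge cost: zero once the prediction clears the margin of 1."""
    if label > 0:
        return max(0.0, 1.0 - prediction)
    if label < 0:
        return max(0.0, 1.0 + prediction)
    return 0.0  # both indicator terms vanish when label == 0


costs = [hinge_cost(1, 2.0), hinge_cost(1, 0.5),
         hinge_cost(-1, 0.5), hinge_cost(-1, -3.0)]
```

Only the second and third cases, where the margin is too small or the sign is wrong, produce a positive cost and therefore a learning signal.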
When training a CNN, the cost function is (7), in which each
class's output cost is cross-entropy. The cross-entropy plot is
the bottom left plot in Fig. 6. Unlike hinge loss, cross-entropy
never reaches zero cost. Even when some class's output probability
ŷ > 0.9, which means that the remaining probabilities sum to less
than 0.1 and the prediction is right enough, the cost is still
positive and continues to drive the probability from 0.9 toward 1.0.
In our experiments, and also according to the report in [23], a CNN
trained with cross-entropy has most output probabilities very close to
1.0 but keeps several training examples misclassified. In effect,
this phenomenon is a waste of capacity.
This problem motivates our hinge loss CNN.
We revise the cross-entropy by setting a margin
threshold, as in Fig. 6. If a training example is rightly classified
and has an output probability beyond the margin threshold, the
cost is zero and learning stops. In fact, we do not need to
revise the formula of (7). In the SGD training procedure, after
performing forward computation, we check the output probability,
and if it is beyond the margin threshold, no BP happens;
the training example is treated as useless and is dropped.
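The margin-threshold check can be sketched as a gate between the forward and backward passes. In the sketch below, the threshold default of 0.9 follows the value the paper reports as best, while the function names and the use of the standard soft-max gradient (probabilities minus the one-hot target) are assumptions made for illustration.

```python
import math


def soft_max(in_map):
    shift = max(in_map)  # numerical-stability shift, result unchanged
    exps = [math.exp(v - shift) for v in in_map]
    total = sum(exps)
    return [e / total for e in exps]


def needs_backprop(in_map, true_class, margin_threshold=0.9):
    """False when the example is rightly classified beyond the threshold."""
    probs = soft_max(in_map)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return not (predicted == true_class and probs[true_class] > margin_threshold)


def output_gradient_or_skip(in_map, true_class, margin_threshold=0.9):
    """Return dJ/d(inMap) of the soft-max layer, or None to skip BP."""
    if not needs_backprop(in_map, true_class, margin_threshold):
        return None  # example dropped: no backward pass, no parameter update
    probs = soft_max(in_map)
    # Gradient of the negative log-likelihood: probabilities minus one-hot target.
    return [p - (1.0 if i == true_class else 0.0) for i, p in enumerate(probs)]


confident = output_gradient_or_skip([5.0, 0.0, 0.0], 0)   # probability about 0.987
uncertain = output_gradient_or_skip([1.0, 0.0, 0.0], 0)   # probability about 0.576
```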
Our proposed cost function is similar to SVM hinge loss.
This method has two main advantages. The first is improved
model generalization, as shown in the experiments. We
achieve a state-of-the-art test accuracy of 99.65%, compared with the
first-place record of 99.46% in the IJCNN 2011 German Signs recognition
competition [23], which reduces the error rate by 35.19%. This
improvement is due to the optimized usage of model capacity.
Hinge loss drives the CNN to focus on the misclassified
training examples, which, in SVM, are called support vectors.
Compared with the cross-entropy cost, hinge loss lets the CNN spend
its capacity on classifying training data correctly rather than
on making a correct prediction more correct.
Another advantage of hinge loss is improved
training speed. In SGD, the time of every iteration is mostly spent on
forward and backward propagation. Without hinge loss, every
training example, no matter how useful or useless, costs
the time of both forward and backward propagation. With hinge
loss, if a useless training example is identified after forward
propagation, the iteration stops immediately and backward
propagation does not happen. Hinge loss also results in more
stable convergence, because as the model converges, most
training examples become useless and only a few examples
affect the update of parameters. This means that the update
frequency gradually decreases and, finally, updates stop or
rarely happen.
TABLE XII
MARGIN THRESHOLD SELECTION
In practice, if a training example is identified as useless,
we usually assume that it remains useless for the next two or three
traversals of the training set. Hence, during these traversals, such training
examples are omitted and their forward propagation is omitted as well.
After several traversals, they are rechecked.
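The bookkeeping for omitting useless examples over a few traversals can be sketched as follows. The callback `is_useless` is a hypothetical stand-in for the forward pass plus margin-threshold check, and the default of two skipped traversals is one of the values suggested in the text.

```python
def count_forward_passes(num_examples, num_traversals, is_useless, skip_traversals=2):
    """Count forward passes when useless examples are skipped for a few traversals.

    is_useless(example, traversal) is a hypothetical callback standing in for
    the forward pass followed by the margin-threshold check.
    """
    recheck_at = [0] * num_examples  # first traversal at which each example is rechecked
    forward_passes = 0
    for traversal in range(num_traversals):
        for example in range(num_examples):
            if traversal < recheck_at[example]:
                continue  # omitted: neither forward nor backward propagation
            forward_passes += 1
            if is_useless(example, traversal):
                recheck_at[example] = traversal + 1 + skip_traversals
            # Otherwise backpropagation and the parameter update would run here.
    return forward_passes


# Toy run: even-numbered examples are always useless, odd ones never are.
with_skipping = count_forward_passes(4, 6, lambda example, traversal: example % 2 == 0)
without_skipping = count_forward_passes(4, 6, lambda example, traversal: False)
```

In this toy run the two useless examples are each checked only at traversals 0 and 3, so the total number of forward passes falls from 24 to 16.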
We use cross-validation to choose the margin threshold, as shown in
Table XII; 0.9 is the best.
D. Ensemble
To test the performance of our method, we use the same
ensemble method as described in [23]. We use four different
image-preprocessing methods: the original image, the
histogram-equalized image, the image with adjusted intensity values, and
the contrast-limited adaptive histogram-equalized image. These
preprocessing methods normalize the image and give better
contrast. The details of these methods can be found in [23].
For each of the above methods, we train five CNNs. In total,
we have 20 CNNs, all of which are independently initialized
before training. When testing the model, the final output is the
average of the 20 CNNs' outputs, which proves to be much better
than any single CNN.
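Averaging the outputs of several networks for one test image can be sketched as below. The three toy output vectors are invented for illustration; in the setup described above there would be 20 of them, one per CNN.

```python
def ensemble_average(outputs):
    """Average the per-class outputs of several CNNs for one test image."""
    count = len(outputs)
    num_classes = len(outputs[0])
    return [sum(out[c] for out in outputs) / count for c in range(num_classes)]


def ensemble_predict(outputs):
    """Classify by the position of the maximum averaged output."""
    averaged = ensemble_average(outputs)
    return max(range(len(averaged)), key=averaged.__getitem__)


toy_outputs = [[0.6, 0.4, 0.0],
               [0.2, 0.7, 0.1],
               [0.3, 0.6, 0.1]]
label = ensemble_predict(toy_outputs)
```

In this toy case the first network alone would pick class 0, but the averaged output picks class 1, which shows how the ensemble can overrule a single network.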
E. Implementation
Deep neural networks trained with the BP algorithm usually
require a good deal of time to train. Several methods have been
suggested to speed up the training procedure on a CPU [24].
Beyond software optimization, the recent
development of the graphics processing unit (GPU) offers a hardware
speedup that proves more effective. According to our
tests, different layers can be trained 3 to 10 times faster on a GPU.
In fact, a GPU's cores run more slowly than a CPU's
cores; however, the GPU has hundreds of cores, each of
which runs a thread, making the GPU faster as a
whole. However, not all algorithms can be divided into
small independent tasks. In a CNN, the kernels' operations
are not independent. If we design the reading operations to be
independent, then the writing operations generate conflicts, and
the computation results are not guaranteed to be correct. Therefore,
we divide the kernels' operations into subtasks according to
the writing operations and make them independent, as in [25],
which calls such a routine "pulling derivatives."
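For example, if we want to calculate ∂J/∂inMap, each
∂J/∂inMap_{cxy} is a subtask. For inMap_{cxy}, the influenced
outMap_{kji} satisfy
$$(j - 1) \times \mathrm{kernelSize} + 1 \le x \le j \times \mathrm{kernelSize} \tag{10}$$
$$(i - 1) \times \mathrm{kernelSize} + 1 \le y \le i \times \mathrm{kernelSize} \tag{11}$$
$$\text{all } k \text{ in outMap}. \tag{12}$$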
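We should not loop according to outMap, because that
is the way of "pushing derivatives" to ∂J/∂inMap and will
generate writing conflicts. Hence, we loop according to
∂J/∂inMap. Solving (10), (11), and (12), we know that for
each ∂J/∂inMap_{cxy}, we only need to "pull derivatives" from
those outMap_{kji} satisfying
$$\left\lceil \frac{x}{\mathrm{kernelSize}} \right\rceil \le j \le \left\lfloor \frac{x - 1}{\mathrm{kernelSize}} \right\rfloor + 1 \tag{13}$$
$$\left\lceil \frac{y}{\mathrm{kernelSize}} \right\rceil \le i \le \left\lfloor \frac{y - 1}{\mathrm{kernelSize}} \right\rfloor + 1 \tag{14}$$
$$\text{all } k \text{ in outMap}. \tag{15}$$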
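The index range used for pulling derivatives can be checked against a brute-force scan. The sketch below works along one axis with 1-based indices and a window of kernelSize elements per output, so output j covers inputs (j − 1) · kernelSize + 1 through j · kernelSize. The function names are chosen for this example, and the brute-force version exists only to confirm that the closed-form range agrees with the window condition.

```python
def pull_range(x, kernel_size):
    """Output indices j that input position x contributes to (closed form)."""
    lower = -(-x // kernel_size)          # ceil(x / kernel_size) in integer arithmetic
    upper = (x - 1) // kernel_size + 1    # floor((x - 1) / kernel_size) + 1
    return range(lower, upper + 1)


def covering_outputs(x, kernel_size, num_outputs):
    """Brute force: scan every output j and keep those whose window covers x."""
    return [j for j in range(1, num_outputs + 1)
            if (j - 1) * kernel_size + 1 <= x <= j * kernel_size]


example = list(pull_range(7, 3))  # input 7 lies in the window 7..9 of output 3
```

Because each thread computes one input position and only reads from the outputs in its range, no two threads write to the same location.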
We wrote our own GPU CNN package in C, controlled
from MATLAB.
IV. EXPERIMENTS
Our experiments include two parts. The first compares the
convergence of HLSGD and cross-entropy SGD. The second
compares the test accuracy of our HLSGD with that of the IJCNN 2011
German Traffic Signs recognition competition's best algorithm.
Our experiments are conducted on two Tesla C2075 GPUs
and a 12-core Intel(R) Core i7-3960X 3.30-GHz computer. In
training, only two CPU cores are needed.
A. Data Set
We use the same data set as in the final competition session of
the IJCNN 2011 German Traffic Signs recognition competition
[15]. The German Traffic Sign Recognition Benchmark poses a
single-image multiclass classification problem. The data set has
43 classes of traffic signs. The training data set contains 39 209
training images in 43 classes, and the test data set contains
12 630 test images.
The images contain one traffic sign each and a border of
10% around the actual traffic sign (at least 5 pixels) to allow
for edge-based approaches. Image sizes vary between 15 × 15
and 250 × 250 pixels, and the images are not necessarily square. There is
a bounding box for each traffic sign as part of the annotations.
We resize the image within the bounding box to a square image
as our CNN's input.
B. Convergence
We use two GPUs to conduct this experiment. The first GPU
runs SGD and saves the CNN's current parameters to
a hard disk every 1000 iterations. The second GPU constantly reads the
saved parameters from the hard disk and
tests the accuracy on a validation set of 2209 images. We let
the second GPU record 200 monitored accuracy values. A trial takes
about 7 h.
Fig. 7. Horizontal axis: training time, vertical axis: accuracy on validation data. The three plots are from the same curve. (Top) Whole figure. (Bottom left)
Turning part of the curves. (Bottom right) Last 40 monitored accuracy values.
TABLE XIII
RECOGNITION RATE OF THE IJCNN 2011
COMPETITION WINNER'S ALGORITHM
We use only original images as CNN input in this experiment.
The other three preprocessing methods give similar
results. The cross-entropy and hinge loss methods are each run
five times, ten runs in total, and for each
method the five curves are averaged. The final two averaged curves
are shown in Fig. 7.
Analyzing the curves, we find the following.
Before the turning point, in the range where accuracy is below 95%,
both methods improve very quickly and perform
similarly. The reason is that when the model is very bad, all the
training examples lie in the positive range of the cost function;
hence, cross-entropy and hinge loss are the same. However,
after the turning point, more and more training examples
become useless. The hinge loss method learns only from bad examples,
and its accuracy continues to improve until final convergence.
In contrast, cross-entropy keeps driving the outputs of useless examples
close to 1.0, so most parameter update operations are similar
and bad examples are rarely learned. That is why the cross-entropy
curve fluctuates a lot and converges less stably. At the end of
the two curves, the hinge loss curve gives a better validation
accuracy of 98.89%, compared with 98.48% for cross-entropy.
TABLE XIV
RECOGNITION RATE OF OUR ALGORITHM
TABLE XV
RECOGNITION RATE OF DIFFERENT CLASSIFIERS
Fig. 8. 44 errors of our ensemble CNNs.
C. Accuracy
Here, we show that our algorithm gives state-of-the-art test
accuracy. For each preprocessing method, we train five CNNs.
The final model's output is the average of all 20 CNNs'
outputs. The best record on this data set was achieved by the IJCNN
2011 recognition competition's winner, Dan Ciresan. In addition
to the above four preprocessing methods, the winning team uses another
preprocessing method called Conorm. They train five CNNs for
each method, and their performance is shown in Table XIII. Our
algorithm's performance is shown in Table XIV. Whichever
preprocessing is used, our algorithm has better test accuracy, and the
final ensemble model gives the best accuracy of 99.65%, compared with
the winner's 99.46%, which reduces the error rate by 35.19%.
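The quoted relative reduction in error rate follows directly from the two accuracies, and the ensemble accuracy is consistent with 44 errors on the 12 630 test images. A short Python check, with variable names chosen for this example:

```python
winner_accuracy = 99.46   # percent, competition winner
our_accuracy = 99.65      # percent, ensemble of 20 CNNs

winner_error = 100.0 - winner_accuracy   # about 0.54 percent
our_error = 100.0 - our_accuracy         # about 0.35 percent

# Relative reduction of the error rate, in percent.
relative_reduction = (winner_error - our_error) / winner_error * 100.0

# Accuracy implied by 44 misclassified images out of 12 630 test images.
accuracy_from_errors = 100.0 * (1.0 - 44 / 12630)
```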
In the competition, participants using other classifiers
also reported results, as shown in Table XV. To compare
computer and human ability, apart from providing the
benchmark data set, the Institut für Neuroinformatik Real-Time
Computer Vision research group also asked several people
who are good at labeling traffic signs to classify
the test set. The human performance is 98.84%, which is
below our accuracy and that of the winner. Other methods, such
as random forests on hand-coded histogram of oriented gradient
(HOG) features [26] and linear discriminant analysis on HOG
features, are below human performance.
D. Error Analysis
Fig. 8 shows the 44 errors that our ensemble CNNs make on the
12 630 test images. Errors mainly result from low resolution, strong
sunlight, and incorrect bounding box annotations. When an
error happens, the maximum value of the output is usually not
large. If the system prioritizes precision, most errors can be
rejected by requiring a higher output threshold.
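Rejecting low-confidence predictions can be sketched as a threshold on the maximum output. The threshold values and output vectors below are hypothetical; the text does not state which threshold would be used.

```python
def classify_or_reject(probabilities, threshold):
    """Return the predicted class, or None when the top output is below the threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best if probabilities[best] >= threshold else None


confident = classify_or_reject([0.02, 0.95, 0.03], threshold=0.8)  # accepted
doubtful = classify_or_reject([0.40, 0.35, 0.25], threshold=0.8)   # rejected
```

Raising the threshold trades coverage for precision: more images are left unclassified, but fewer wrong labels are emitted.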
V. CONCLUSION
We have designed our TSR system using a CNN, which is a
special kind of deep neural network. The model has the ability
to learn both features and classifiers. The learned features detect
specific local patterns and prove to be better than hand-coded
features. We have described the details of each layer and written
a GPU package for our CNN, which speeds up training
by a factor between 3 and 10 over a CPU.
We have proposed an HLSGD method to train CNNs. We
tested our algorithm on the German Traffic Sign Recognition
Benchmark and compared our results with those of other competitors.
The experiments show that HLSGD gives faster and more stable
convergence and a state-of-the-art recognition rate of 99.65%.
ACKNOWLEDGMENT
The authors would like to thank Tsinghua National Labora-
tory for Information Science and Technology for the Explorer
100 cluster system on which this work was completed.
REFERENCES
[1] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, "Vision-based traffic
sign detection and analysis for intelligent driver assistance systems:
Perspectives and survey," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,
pp. 1484-1497, Dec. 2012.
[2] M. Meuter, C. Nunny, S. M. Gormer, S. Muller-Schneiders, and
A. Kummert, "A decision fusion and reasoning module for a traffic sign
recognition system," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4,
pp. 1126-1134, Dec. 2011.
[3] A. Ruta, Y. Li, and X. Liu, "Robust class similarity measure for traffic sign
recognition," IEEE Trans. Intell. Transp. Syst., vol. 11, no. 4, pp. 846-855,
Dec. 2010.
[4] F. Zaklouta and B. Stanciulescu, "Real-time traffic-sign recognition using
tree classifiers," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,
pp. 1507-1514, Dec. 2012.
[5] J. Greenhalgh and M. Mirmehdi, "Real-time detection and recognition
of road traffic signs," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4,
pp. 1498-1506, Dec. 2012.
[6] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput.
Vis., vol. 57, no. 2, pp. 137-154, May 2004.
[7] N. Dalal and B. Triggs, "Histograms of oriented gradients for human
detection," in Proc. IEEE Comput. Soc. Conf. CVPR, 2005, vol. 1,
pp. 886-893.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints,"
Int. J. Comput. Vis., vol. 60, no. 2, pp. 91-110, Nov. 2004.
[9] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "Introduction to
the special issue on machine learning for traffic sign recognition," IEEE
Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1481-1483, Dec. 2012.
[10] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional
architecture of monkey striate cortex," J. Physiol., vol. 195, no. 1, pp. 215-243,
Mar. 1968.
[11] K. Fukushima, "Neocognitron: A self-organizing neural network model
for a mechanism of pattern recognition unaffected by shift in position,"
Biol. Cybern., vol. 36, no. 4, pp. 193-202, Apr. 1980.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning
applied to document recognition," Proc. IEEE, vol. 86, no. 11,
pp. 2278-2324, Nov. 1998.
[13] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with
deep convolutional neural networks," in Proc. Adv. Neural Inf. Process.
Syst., 2012, vol. 25, pp. 1106-1114.
[14] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, "A committee of
neural networks for traffic sign classification," in Proc. IJCNN, 2011,
pp. 1918-1921.
[15] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "The German Traffic
Sign Recognition Benchmark: A multi-class classification competition,"
in Proc. IJCNN, 2011, pp. 1453-1460.
[16] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale
convolutional networks," in Proc. IJCNN, 2011, pp. 2809-2813.
[17] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field
properties by learning a sparse code for natural images," Nature, vol. 381,
no. 6583, pp. 607-609, Jun. 1996.
[18] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and
composing robust features with denoising autoencoders," in Proc. 25th
Int. Conf. Mach. Learn., 2008, pp. 1096-1103.
[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best
multi-stage architecture for object recognition?" in Proc. IEEE 12th Int.
Conf. Comput. Vis., 2009, pp. 2146-2153.
[20] Q. V. Le, "Building high-level features using large scale unsupervised
learning," in Proc. ICASSP, 2013, pp. 8595-8598.
[21] D. Scherer, A. Müller, and S. Behnke, "Evaluation of pooling operations
in convolutional architectures for object recognition," in Artificial Neural
Networks - ICANN. Berlin, Germany: Springer-Verlag, 2010, pp. 92-101.
[22] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop,"
in Neural Networks: Tricks of the Trade. Berlin, Germany:
Springer-Verlag, 1998, pp. 9-50.
[23] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural
networks for image classification," in Proc. IEEE Conf. CVPR, 2012,
pp. 3642-3649.
[24] J. Bouvrie, "Notes on convolutional neural networks," 2006.
[25] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber,
"Flexible, high performance convolutional neural networks for image
classification," in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, vol. 2,
pp. 1237-1242, AAAI Press.
[26] F. Zaklouta, B. Stanciulescu, and O. Hamdoun, "Traffic sign classification
using K-d trees and random forests," in Proc. IJCNN, 2011, pp. 2151-2155.
Junqi Jin received the B.S. degree from Tsinghua
University, Beijing, China, in 2011, where he is
currently working toward the Ph.D. degree in the
Department of Automation.
Kun Fu received the B.S. degree from Tsinghua
University, Beijing, China, in 2011, where he is
currently working toward the Ph.D. degree in the
Department of Automation.
Changshui Zhang (M'02) received the B.S. degree
in mathematics from Peking University, Beijing,
China, in 1986 and the M.S. and Ph.D. degrees
in control science and engineering from Tsinghua
University, Beijing, in 1989 and 1992, respectively.
In 1992, he joined the Department of Automation,
Tsinghua University, where he is currently a
Professor. He has authored more than 200 papers.
His research interests include pattern recognition and
machine learning.
Dr. Zhang is an Associate Editor of Pattern Recognition
Journal. He is a member of the Standing Council of the Chinese
Association of Artificial Intelligence.