Traffic Sign Recognition With Hinge Loss
Trained Convolutional Neural Networks
Junqi Jin, Kun Fu, and Changshui Zhang, Member, IEEE
Abstract: Traffic sign recognition (TSR) is an important and challenging task for intelligent transportation systems. We describe the details of our model's architecture for TSR and suggest a hinge loss stochastic gradient descent (HLSGD) method to train convolutional neural networks (CNNs). Our CNN consists of three stages (70-110-180) with 1 162 284 trainable parameters. HLSGD is evaluated on the German Traffic Sign Recognition Benchmark, where it offers faster and more stable convergence and a state-of-the-art recognition rate of 99.65%. We write a graphics processing unit package to train several CNNs and build the final classifier in an ensemble way.

Index Terms: Convolutional neural networks (CNNs), hinge loss, stochastic gradient descent (SGD), traffic sign recognition (TSR).
I. INTRODUCTION
TRAFFIC signs play very important roles in both transportation efficiency and safety. An automatic traffic sign recognition (TSR) system is helpful for assisting drivers and is essential for autonomous cars. The research around this issue has long been popular [1], but the task is challenging. Some popular methods include Bayesian classifiers [2], boosting [3], tree classifiers [4], and support vector machines (SVMs) [5]. These methods, from today's point of view, rely on hand-coded features, such as a circle detector in [2], a Haar wavelet [6] in [3], and a histogram of oriented gradients (HOG) [7] or the scale-invariant feature transform (SIFT) [8] in [3]-[5]. However, designing such features requires a good deal of time, and we do not know in advance which feature is robust for a specific task. In recent years, the availability of much more data and deeper models has allowed machine learning algorithms [9] to achieve breakthroughs in TSR, among which the convolutional neural network (CNN) is very popular and powerful.
The CNN is inspired by the early work of Hubel and Wiesel [10]. Based on the cat's visual system, many models have been proposed [11], [12]. The advantage of the CNN is that the model's input is a raw image rather than hand-coded features.
Manuscript received February 1, 2014; accepted February 12, 2014. This
work was supported in part by the 973 Program under Grant 2013CB329503,
by the National Natural Science Foundation of China under Grant 91120301,
and by the Beijing Municipal Education Commission Science and Technology
Development Plan Key Project under Grant KZ201210005007. The Associate
Editor for this paper was J. Zhang.
The authors are with the Department of Automation, State Key Laboratory
of Intelligent Technologies and Systems, Tsinghua National Laboratory for
Information Science and Technology, Tsinghua University, Beijing 100084,
China (e-mail: jjq11@mails.tsinghua.edu.cn; fuk11@mails.tsinghua.edu.cn;
zcs@mail.tsinghua.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2014.2308281
Fig. 1. Structure of a layer and 3-D image of a map.
The model accomplishes feature extraction and classification as a whole, and both building blocks are learned in a supervised procedure. The learned features are usually very robust for the specific task. In recent years, the CNN has achieved several state-of-the-art performances, such as in the ImageNet Challenge [13] and the 2011 International Joint Conference on Neural Networks (IJCNN) competition [14], [15].
The CNN usually has millions of parameters, which makes it more difficult to train than flat models. Stochastic gradient descent (SGD) is preferred for training the CNN, but according to the reports by Sermanet and LeCun [16], Ciresan et al. [14], and LeCun et al. [12], SGD training also requires a good deal of time. The contribution of this paper is a hinge loss stochastic gradient descent (HLSGD) method whose cost function is similar to that of the SVM; the results show that our algorithm makes the training procedure converge more quickly and achieves a state-of-the-art recognition rate of 99.65% on the German Traffic Sign Recognition Benchmark.
II. CNNs
The CNN is a multilayer neural network, consisting of many
different building blocks. Here, details of each layer will be
described.
A. Basic Notations
A single CNN is constructed from several layers. Each layer consists of maps and parameters. To save a trained model, only the parameters need to be stored; for each input image, the maps of each layer are calculated from this layer's parameters and the maps of the previous layer. As shown in Fig. 1, a layer's inMap is exactly the previous layer's outMap. Maps, including inMaps and outMaps, are all 3-D matrices. Each map has channel, width, and height properties, as in Fig. 1. The element at position c along the channel axis, x along the width axis, and y along the height axis of some inMap is denoted by inMap_{cxy}.
Fig. 2. inMap (channel = 2, width = 4, height = 4), kernel (operate channel = 2, kernelSize = 2, stride = 2), outMap (channel = 1, width = 2, height = 2), operation order: red → yellow → blue → green.
Kernels define the parameters and operations of the corresponding layer. A kernel is similar to a sliding window connected to a kernelSize × kernelSize square region of the inMap. Kernels slide along the x- and y-directions. In a specific location, a convolution layer kernel operates on all channels of that location, whereas other kinds of layers operate on only one of the channels. When sliding, kernels usually do not move pixel by pixel; they have a moving step along both the x- and y-directions called the stride, as shown in Fig. 2.
According to the above definitions, the width and height relationships between the inMap and outMap of a layer are
\[ \mathrm{width}_{\mathrm{out}} = \frac{\mathrm{width}_{\mathrm{in}} - \mathrm{kernelSize}}{\mathrm{stride}} + 1 \tag{1} \]
\[ \mathrm{height}_{\mathrm{out}} = \frac{\mathrm{height}_{\mathrm{in}} - \mathrm{kernelSize}}{\mathrm{stride}} + 1. \tag{2} \]
Among all layers, max-pooling layers are used to downsample features. In convolution and normalization layers, we do not want to decrease the resolution; otherwise, the downsampling is too fast. To make the outMaps keep the width and height of the inMaps, we set the stride to 1, usually choose kernelSize to be odd, and add a border around the inMap, as in (3). The border's bandwidth is ⌊kernelSize/2⌋ and the border's values are all zeros. Thus
\[
\mathrm{height}_{\mathrm{out}}
= \frac{\mathrm{heightWithBorder}_{\mathrm{in}} - \mathrm{kernelSize}}{\mathrm{stride}} + 1
= \frac{\mathrm{height}_{\mathrm{in}} + 2\lfloor \mathrm{kernelSize}/2 \rfloor - \mathrm{kernelSize}}{1} + 1
= \mathrm{height}_{\mathrm{in}} + \left(2\lfloor \mathrm{kernelSize}/2 \rfloor - \mathrm{kernelSize} + 1\right)
= \mathrm{height}_{\mathrm{in}}. \tag{3}
\]
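As a quick illustration of (1)-(3), the following sketch (ours, not the authors' C/MATLAB implementation; the function name and the example input size are hypothetical) computes a layer's output size with and without the zero border described above.

```python
def out_size(in_size, kernel_size, stride, zero_border=False):
    """Output width/height per (1)-(2); optional zero border per (3)."""
    if zero_border:
        in_size += 2 * (kernel_size // 2)   # border bandwidth is floor(kernelSize/2)
    return (in_size - kernel_size) // stride + 1

# A convolution layer with stride 1, odd kernelSize, and zero borders keeps the size:
print(out_size(48, 5, 1, zero_border=True))    # 48
# A max-pooling layer (kernelSize 3, stride 2, no border) reduces the resolution:
print(out_size(48, 3, 2, zero_border=False))   # 23
```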
B. Convolution Layer
The convolution layer aims to extract local patterns. This operation is similar to the gradient extraction performed when computing the SIFT or HOG feature [7], [8]. In hand-coded features, gradient templates are used to convolve the inMap; in a CNN, however, kernels are initialized as random templates and are learned to become edge, color, and specific pattern detectors, which are very similar to the basis of sparse coding [17] and the hidden neurons' weights of an autoencoder [18].
Fig. 3. f(x) = max(0, x).

In a convolution layer, a kernel consists of a 3-D matrix and a bias. The 3-D matrix has the same channel as the inMap, and both its width and height are equal to kernelSize. A kernel convolves the inMap as a sliding window moving along the x- and y-axes. In a specific location, an inner product is calculated between the kernel's 3-D matrix and the corresponding elements of the inMap. The inner product plus the bias value is the outMap element. Therefore, a kernel has (inMap.channel × kernelSize² + 1) parameters to be learned.
In our implementation, we add zero borders to keep the width and height. Each kernel generates an outMap with the same width and height as the inMap, but the outMap has only one channel. To generate multichannel outMaps, more different kernels are needed, and the number of kernels is exactly the channel of this layer's outMap. For a convolution layer, there are in total (inMap.channel × kernelSize² + 1) × outMap.channel parameters to be learned.
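To make the kernel bookkeeping concrete, here is a minimal NumPy sketch of the convolution layer forward pass described above (stride 1, zero borders). It is our own illustration of the paper's definitions, not the authors' GPU code, and the variable names are ours.

```python
import numpy as np

def conv_forward(in_map, kernels, biases):
    """in_map: (C_in, H, W); kernels: (C_out, C_in, k, k); biases: (C_out,).
    Stride is 1 and zero borders keep the spatial size, as described above."""
    c_in, h, w = in_map.shape
    c_out, _, k, _ = kernels.shape
    pad = k // 2
    padded = np.pad(in_map, ((0, 0), (pad, pad), (pad, pad)))   # zero borders
    out_map = np.empty((c_out, h, w))
    for o in range(c_out):               # one kernel per output channel
        for y in range(h):
            for x in range(w):
                patch = padded[:, y:y + k, x:x + k]             # all input channels
                out_map[o, y, x] = np.sum(patch * kernels[o]) + biases[o]
    return out_map

# Each kernel has C_in * k**2 + 1 parameters; the layer has (C_in * k**2 + 1) * C_out in total.
```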
C. Nonlinear Layer
The convolution layer acts as a linear filter. To form a nonlinear complex model, the outputs must be passed through nonlinear functions. In traditional neural networks, sigmoid or tanh functions are used as the nonlinear transformations. The work of Krizhevsky et al. [13] shows that a rectified linear neuron gives no worse performance but makes training faster by a factor of about 6. Jarrett et al. [19] also show that a rectified function such as f(x) = |tanh(x)| gives better performance than the traditional nonlinearity without the absolute operation.
In our experiments, the rectified function f(x) = max(0, x) is used, whose shape is shown in Fig. 3. This function is computed very quickly and, apart from the point x = 0, its derivatives are quick and easy to calculate. A nonlinear layer applies the nonlinear function to each element of the inMap; hence, the outMap has the same shape as the inMap. A nonlinear layer usually follows a convolution layer.
D. Max-Pooling Layer
The convolution and nonlinear layers' role is to extract local features of a training image. In the subsequent classification procedure, the precise positions of the features are harmful to performance, because different training instances with the same label place the features at different precise positions. To solve this problem, a reasonable method is to subsample the features and decrease the feature maps' resolution, which is called pooling in a CNN. LeCun et al. [12] use average pooling, whereas Le [20] uses L2 pooling. We adopt max-pooling as in [21], which has been proven to be powerful among CNN pooling operations.
Max-pooling can also be viewed as a kind of kernel operation. The kernel has a kernelSize and a stride. In a specific location, the kernel operates on only one channel of the inMap and selects the maximum element of the local area covered by the sliding kernel as the outMap element. Hence, max-pooling subsamples the inMap channel by channel, and local maximum values are stored in the outMap while small values are dropped. In a max-pooling layer, the outMap has the same channel as the inMap, but the width and height are reduced according to (1) and (2). We do not add zero borders to this kind of layer.
A max-pooling layer usually follows a nonlinear layer by default. However, if the nonlinearity is monotonically increasing, the layers' order can be rearranged as convolution layer, max-pooling layer, and nonlinear layer. This order gives the same result as the default order (convolution layer, nonlinear layer, and max-pooling layer), but it downsamples first and therefore computes fewer nonlinear operations.
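The sketch below (ours, for illustration only) implements max-pooling on one channel and checks the layer-order equivalence just described: because f(x) = max(0, x) is monotonically increasing, pooling before the rectification gives the same result as pooling after it.

```python
import numpy as np

def max_pool(channel, k=3, stride=2):
    """Max-pool a single (H, W) channel; no zero borders, sizes follow (1)-(2)."""
    h, w = channel.shape
    oh, ow = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = channel[y * stride:y * stride + k,
                                x * stride:x * stride + k].max()
    return out

relu = lambda a: np.maximum(0, a)
c = np.random.randn(9, 9)
# Same result either way, but pooling first means fewer ReLU evaluations.
assert np.allclose(relu(max_pool(c)), max_pool(relu(c)))
```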
E. Normalization Layer
According to [13] and [20], a normalization layer gives better generalization. We use a technique similar to that in [13]. The kernel of this layer also operates on one channel at a time. In a specific location, the outMap element is calculated according to (4). In the equation, inMapB is the inMap with zero borders added; hence, the outMap has the same shape as the inMap. The notation kR refers to the kernel region. The parameters k, α, and β are hyperparameters whose values are chosen through cross-validation. A typical setting is k = 2, α = 1e−4, β = 0.75, kernelSize = 5, i.e.,
\[ \mathrm{outMap}_{cxy} = \frac{\mathrm{inMapB}_{cxy}}{\left(k + \alpha \sum_{j,i \in kR} \mathrm{inMapB}_{cji}^{2}\right)^{\beta}}. \tag{4} \]
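A direct NumPy transcription of (4), assuming the kernel region kR is a kernelSize × kernelSize spatial window centered on the current position of the same channel; this is our reading of the text, not the authors' code.

```python
import numpy as np

def local_response_norm(in_map, k=2.0, alpha=1e-4, beta=0.75, kernel_size=5):
    """Per-channel local normalization as in (4); zero borders keep the shape."""
    c, h, w = in_map.shape
    pad = kernel_size // 2
    padded = np.pad(in_map, ((0, 0), (pad, pad), (pad, pad)))   # inMapB
    out = np.empty_like(in_map)
    for ch in range(c):
        for y in range(h):
            for x in range(w):
                region = padded[ch, y:y + kernel_size, x:x + kernel_size]
                center = padded[ch, y + pad, x + pad]           # inMapB_{cxy}
                out[ch, y, x] = center / (k + alpha * np.sum(region ** 2)) ** beta
    return out
```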
F. Full-Link Layer
We define the bundle of a convolution layer, a max-pooling layer, a nonlinear layer, and a normalization layer as a stage. A CNN usually has several stages. After passing through several stages, raw images are converted into many low-resolution feature maps. These small feature maps are concatenated into a long vector. Such a vector plays the same role as hand-coded features, and it is fed into a one-hidden-layer neural network. The feature vector is fully connected to the hidden layer.
If the feature vector is an n-element column vector, we usually append a constant 1 to the end of the vector, making it an (n + 1)-element vector denoted by inMap. If we set m hidden neurons, the parameters of a full-link layer form an m × (n + 1) matrix W, and the outMap follows (5). We choose m by cross-validation. Thus
\[ \mathrm{outMap} = W\,\mathrm{inMap}. \tag{5} \]
The outMap is an m-element column vector. A full-link layer follows the final normalization layer and is followed by a nonlinear layer.
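A one-line sketch of (5): appending the constant 1 to the feature vector turns the last column of the m × (n + 1) matrix into a bias term. The names are ours.

```python
import numpy as np

def full_link(feature_vec, weights):
    """feature_vec: (n,); weights: (m, n + 1). Appending 1 folds the bias into the matrix, as in (5)."""
    in_map = np.append(feature_vec, 1.0)   # (n + 1,)
    return weights @ in_map                # (m,) outMap
```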
G. Soft-Max Layer
The soft-max layer is the last layer of a CNN. It is a multiclass logistic regression. For a k-class classification task, the inMap is a k-element vector with the kth element fixed to 0. The probability of the jth class is given by (6), in which inMap(i) is the ith element of inMap. The sum of the k probabilities is 1, so only k − 1 elements of inMap are independent, which is why the last element is fixed to 0. Thus
\[ P(\mathrm{inMap} \in j\text{th class} \mid \mathrm{inMap}) = \frac{\exp\left(\mathrm{inMap}(j)\right)}{\sum_{i=1}^{k} \exp\left(\mathrm{inMap}(i)\right)}. \tag{6} \]
In the learning procedure, the cost function is the negative log-likelihood
\[ J = -\sum_{i=1}^{k} 1\{\mathrm{inMap} \in i\text{th class}\} \log P(\mathrm{inMap} \in i\text{th class} \mid \mathrm{inMap}). \tag{7} \]
In the test procedure, the test example is classified according to the position of the maximum probability.
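A small sketch of the soft-max layer in (6) and the cost in (7), with the kth input element fixed to 0 as described; this is an illustration, not the authors' implementation.

```python
import numpy as np

def softmax_cost(logits, label):
    """logits: (k - 1,) free elements; the kth element is fixed to 0. label: true class index."""
    in_map = np.append(logits, 0.0)                 # k-element inMap
    probs = np.exp(in_map - in_map.max())           # subtract max for numerical stability
    probs /= probs.sum()                            # Eq. (6)
    cost = -np.log(probs[label])                    # Eq. (7): negative log-likelihood
    prediction = int(np.argmax(probs))              # test-time rule: position of max probability
    return cost, prediction
```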
H. Architecture Selection
This section introduces our whole architecture built from the above layers and explains how we select it. According to learning theory, if the architecture has too much capacity, it tends to overfit the training data and generalize poorly. If the model has too little capacity, it underfits the training data, and both the training error and the test error are high. Hence, choosing the architecture means choosing a model with matching capacity. There are two ways to achieve this goal.
The first is the forward method. At first, only one stage is used and the performance is evaluated. Then, a second stage is added and the model is evaluated again. As the CNN becomes deeper, the test accuracy first increases and then gradually converges. When the accuracy is high enough, the model stops growing deeper, because deeper models are much harder to train and run more slowly.
The second is the backward method. At first, a model large enough to drive the training error to zero is established, and then several techniques are used to reduce the model's capacity. An obvious technique is to reduce the depth of the model or the size of each layer. Other techniques, such as weight decay regularization and dropout [13], have proved very powerful.
When choosing the architecture, we use a greedy strategy. At first, we choose a default architecture and then use cross-validation to optimize the hyperparameters one by one. We divide the training data into five folds; the evaluation of each setting is the average of the five results. The data must be divided carefully to ensure that the four-fold training data and the one-fold validation data come from different physical traffic signs.
TABLE I
LAYER SIZE SELECTION
TABLE II
FULL-LINK LAYER HIDDEN NEURONS SELECTION
TABLE III
CONVOLUTION LAYER KERNELSIZE SELECTION
TABLE IV
MAX-POOLING LAYER KERNELSIZE SELECTION
TABLE V
NORMALIZATION LAYER KERNELSIZE SELECTION
Layer Size Selection: After referring to several architectures [13], [14], [16], we choose a default architecture with three stages. Then, we change the layers' sizes and evaluate their performances. The max-pooling layer decreases the resolution. To prevent the information from being lost too quickly, we often add more feature maps after pooling. In Table I, 20-40-60 means the first stage has 20 feature maps, the second 40, and the third 60.
We choose 100-150-250 as the default. According to the cross-validation results, 70-110-180 is best.
Full-Link Layer Hidden Neurons Selection: We set 300 hidden neurons in the full-link layer as the default. According to the cross-validation results, 200 is best (see Table II).
Convolution Layer KernelSize Selection: We set the convolution layer kernelSize to 3-3-3 as the default, which means the first stage's convolution kernelSize is 3, the second 3, and the third 3. According to the cross-validation results, 7-7-7 converges too slowly, and 5-5-5 has the same best result as 5-3-3. However, 5-3-3 is a smaller model, which trains and predicts more quickly (see Table III).
Max-Pooling Layer KernelSize Selection: A max-pooling layer downsamples inMaps by a factor of stride². The downsampling should not be too fast; hence, we fix the stride to 2. We set the max-pooling layer kernelSize to 3 as the default. According to the cross-validation results, 3 is the best (see Table IV).
Normalization Layer KernelSize Selection: We set the normalization layer kernelSize to 5 as the default. According to the cross-validation results, 5 is the best (see Table V).
TABLE VI
NORMALIZATION LAYER β SELECTION
TABLE VII
NORMALIZATION LAYER k SELECTION
TABLE VIII
NORMALIZATION LAYER α SELECTION
TABLE IX
ARCHITECTURE OF OUR CNN
Normalization Layer Parameters Selection: A normalization layer has parameters β, k, and α. We choose them one by one.
We set β = 0.75 as the default. If β is greater than 2, training converges too slowly. According to the cross-validation results, 0.75 is the best (see Table VI).
We set k = 2 as the default. According to the cross-validation results, 2 is the best (see Table VII).
We set α = 1e−4 as the default. According to the cross-validation results, 1e−4 is the best (see Table VIII).
Whole Architecture: After the cross-validation selection, our CNN has 17 layers in total; the details are shown in Table IX and Fig. 4. In Table IX, chan is short for channel, kS for kernelSize, std for stride, and 0-B for zero borders.
Our architecture has 1 162 284 parameters to learn. These parameters come from the convolution layers and the full-link layers. The former learn feature detectors, and the latter learn a classifier.
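As a sanity check on the bookkeeping, the sketch below tallies parameters with the per-layer formulas from Section II (convolution: (C_in · k² + 1) · C_out; full-link: m · (n + 1)). The 3-channel input is an assumption, and Table IX's exact map sizes are not reproduced here, so the totals are illustrative rather than a derivation of the 1 162 284 figure.

```python
def conv_params(c_in, c_out, k):
    return (c_in * k * k + 1) * c_out        # kernels plus one bias each

def full_link_params(n_in, n_out):
    return n_out * (n_in + 1)                # appended constant 1 acts as the bias

# Three convolution stages 70-110-180 with kernelSize 5-3-3 on a 3-channel input (assumed):
conv_total = conv_params(3, 70, 5) + conv_params(70, 110, 3) + conv_params(110, 180, 3)
print(conv_total)  # 253110; the full-link layers account for the rest of the 1 162 284 parameters
```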
III. TRAINING
The CNN has higher capacity than flat models. To train such models, large amounts of data are needed. Hence, batch learning would be rather slow, and SGD is preferred.
Fig. 4. Our CNN's architecture: inMap → C1 → M1 → Non1 → Nor1 → C2 → M2 → Non2 → Nor2 → C3 → M3 → Non3 → Nor3 → F4 → Non4 → F5 → S6.
Fig. 5. Structure of a layer.
A. Data Augmentation
Data augmentation is a technique used to enlarge the training data set and give the CNN better generalization. It is widely used when training CNNs [13], [14], [16]. For the TSR task, each traffic sign in the training set is jittered in one of the following three ways. The first is randomly translating the traffic sign by up to 10% of the image size. The second is randomly rotating the image by an angle between −5° and 5°. The third is randomly scaling the traffic sign by a factor between 0.9 and 1.1. These augmentations generate new training examples that are slightly different from the original training data but clearly keep the same label. Data augmentation is performed only on the training data, and the augmented examples are all added to the training set.
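A sketch of the three jitter operations using SciPy's image helpers; the parameter ranges follow the text, but the implementation details (interpolation order, cropping, helper names) are our own choices, not the authors'.

```python
import numpy as np
from scipy.ndimage import shift, rotate, zoom

def jitter(img, rng):
    """img: (H, W, C) training image; apply one of translation, rotation, or scaling."""
    choice = rng.integers(3)
    if choice == 0:                                    # translate by up to 10% of the image size
        d = 0.1 * img.shape[0]
        return shift(img, (rng.uniform(-d, d), rng.uniform(-d, d), 0), order=1)
    if choice == 1:                                    # rotate by an angle in [-5, 5] degrees
        return rotate(img, rng.uniform(-5, 5), reshape=False, order=1)
    s = rng.uniform(0.9, 1.1)                          # scale by a factor in [0.9, 1.1]
    scaled = zoom(img, (s, s, 1), order=1)
    out = np.zeros_like(img)                           # crop/pad back (simplified, not centered)
    h = min(img.shape[0], scaled.shape[0])
    w = min(img.shape[1], scaled.shape[1])
    out[:h, :w] = scaled[:h, :w]
    return out

rng = np.random.default_rng(0)
```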
B. BP
The CNN is a deep neural network. We use the backpropagation (BP) algorithm to train the whole model. When performing the BP algorithm, the structure of a layer is as shown in Fig. 5. The cost function is calculated according to (7). Then, ∂J/∂inMap of the soft-max layer is calculated. ∂J/∂Map has the same shape as Map, and each element (∂J/∂Map)_{cxy} is just ∂J/∂(Map_{cxy}). We regard the input image as the earliest layer and the soft-max layer as the last layer. For each layer, its ∂J/∂outMap is exactly the next layer's ∂J/∂inMap. Once ∂J/∂outMap is available, ∂J/∂inMap and ∂J/∂parameters are calculated using this layer's inMap, outMap, parameters, and its mathematical rules. ∂J/∂inMap is used to backpropagate derivatives to previous layers, and ∂J/∂parameters are the gradients of the parameters used in SGD. Different layers have different mathematical rules; hence, the partial derivatives must be derived separately and carefully.
TABLE X
λ SELECTION ON ALL PARAMETERS
TABLE XI
λ SELECTION ON PART OF THE PARAMETERS
Once all the ∂J/∂parameters are calculated, the learning rule is as in (8), where η is the learning rate and λ · parameters is the weight decay term. Thus
\[ \mathrm{parameters} := \mathrm{parameters} - \eta\left(\frac{\partial J}{\partial\,\mathrm{parameters}} + \lambda\,\mathrm{parameters}\right). \tag{8} \]
Before training, all parameters are initialized from a uniform distribution over [−0.1, 0.1] so that symmetry is broken; the bound 0.1 is set according to LeCun et al. [22]. Choosing the learning rate η is a little tricky. We usually try learning rates across orders of magnitude, such as 1e−1, 1e−2, 1e−3, and so on. If η is too big, training does not converge. If η is too small, convergence is rather slow. We start with the largest η that does not diverge (we use 0.001) and gradually decrease it during training.
We consider the usage of λ in two ways. If λ affects all the parameters, including those of the convolution and full-link layers, then according to cross-validation, 1e−1 is the best and 1e1 converges too slowly (see Table X).
If λ affects only the parameters of the full-link layers, the cross-validation results are shown in Table XI; 1e1 again converges too slowly.
Finally, we choose λ = 1e−1 on all parameters.
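A compact sketch of the update rule (8) with weight decay applied to all parameters; the symbols η and λ follow our reconstruction of the equation, and the function itself is illustrative.

```python
def sgd_step(params, grads, lr=0.001, weight_decay=0.1):
    """One SGD update per (8): p <- p - lr * (dJ/dp + weight_decay * p).
    The learning rate starts at 0.001 and is decreased during training;
    weight decay is 1e-1 on all parameters, as chosen by cross-validation."""
    for name in params:
        params[name] -= lr * (grads[name] + weight_decay * params[name])
    return params
```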
C. Hinge Loss
We propose a hinge loss cost function to train the CNN, which gives better test accuracy and faster, more stable convergence. Hinge loss is usually used to train large-margin classifiers such as the SVM. Its expression is given in (9), where y is the true label and ŷ is the predicted output, and it is plotted in Fig. 6. With hinge loss, when the label y > 0: if ŷ > 1, the training example is correctly classified with a margin beyond zero, the cost is zero, and learning stops; if ŷ < 1, the margin is not big enough or the training example may even be classified incorrectly, so the cost is positive and the model learns. If the label y < 0, the analysis is similar. Thus
\[ \mathrm{cost} = I(y > 0)\max(0,\, 1 - \hat{y}) + I(y < 0)\max(0,\, 1 + \hat{y}). \tag{9} \]

Fig. 6. Horizontal axis is the predicted output ŷ; vertical axis is the cost. (Top two plots) SVM hinge loss. (Bottom left plot) Cross-entropy. (Bottom right plot) Hinge loss cross-entropy.
When training a CNN, the cost function is (7), in which each class's output cost is the cross-entropy. The cross-entropy is plotted in the bottom left panel of Fig. 6. Unlike hinge loss, even when some class's output probability ŷ > 0.9, which means the sum of the remaining probabilities is less than 0.1 and the prediction is already right enough, the cost is still positive and continues to drive the probability from 0.9 toward 1.0. In our experiments, and also according to the report in [23], a CNN trained with cross-entropy has most output probabilities very close to 1.0 but keeps several training examples misclassified. In fact, this phenomenon is a waste of capacity.
Our motivation for proposing the hinge loss CNN comes from this problem. We revise the cross-entropy by setting a margin threshold, as in Fig. 6. If a training example is correctly classified and has an output probability beyond the margin threshold, the cost is zero and learning stops. In fact, we do not need to revise the formula of (7). In the SGD training procedure, after performing the forward computation, we check the output probability, and if it is beyond the margin threshold, no BP happens; the training example is treated as useless and dropped.
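A sketch of the HLSGD inner loop as we read it: after the forward pass, an example whose correct-class probability already exceeds the margin threshold is skipped and no backpropagation is performed. The forward, backward, and update callables stand in for the routines of Section III-B and are placeholders, not the authors' API.

```python
def hlsgd_step(example, label, forward, backward, update, margin=0.9):
    """One HLSGD iteration: forward pass, margin check, BP only for useful examples."""
    probs = forward(example)            # soft-max output as in (6)
    if probs[label] > margin:           # correctly classified beyond the margin threshold
        return False                    # 'useless' example: skip backpropagation entirely
    grads = backward(example, label)    # BP as in Section III-B
    update(grads)                       # SGD update as in (8)
    return True                         # this example contributed to learning
```

The returned flag can also be used to leave the example out of the next few traversals, as described later in this section.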
Our proposed cost function is similar to the SVM hinge loss. This method has two main advantages. The first is the improvement of model generalization, as shown in the experiments. We achieve a state-of-the-art test accuracy of 99.65% over the first-place record of 99.46% in the IJCNN 2011 German Traffic Sign recognition competition [23], a 35.19% relative reduction of the error rate. This improvement is due to the optimized usage of model capacity. Hinge loss drives the CNN to focus on the misclassified training examples, which, in the SVM, are called support vectors. Compared with the cross-entropy cost, hinge loss lets the CNN spend its capacity on classifying the training data correctly rather than on making an already correct prediction more correct.
Another advantage of hinge loss is the improvement of training speed. In SGD, each iteration's time is mostly spent on forward and backward propagation. Without hinge loss, every training example, no matter how useful or useless, costs the time of forward and backward propagation. With hinge loss, after forward propagation, if a useless training example is identified, the iteration stops immediately and backward propagation does not happen. Hinge loss also results in more stable convergence, because as the model converges, most training examples become useless and only a few examples affect the parameter updates. This means the update frequency gradually decreases and, finally, updates stop or rarely happen.
In practice, if a training example is identified as useless, we assume that it remains useless for the next two or three traversals. Hence, during these traversals, these training examples are omitted and even their forward propagation is skipped. After several traversals, they are rechecked.
We use cross-validation to choose the margin threshold, as in Table XII; 0.9 is the best.

TABLE XII
MARGIN THRESHOLD SELECTION
D. Ensemble
To test the performance of our method, we use the same ensemble method as described in [23]. We use four different image-preprocessing methods: the original image, the histogram-equalized image, the image with adjusted intensity values, and the contrast-limited adaptive histogram-equalized image. These preprocessing methods normalize the image and give better contrast. Their details can be found in [23]. For each of the above methods, we train five CNNs. Finally, we have 20 CNNs, all of which are independently initialized before training. When testing the model, the final output is the average of the 20 CNNs' outputs, which proves to be much better than any single CNN.
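The ensemble prediction is simply the average of the 20 networks' soft-max outputs; a two-line sketch (ours):

```python
import numpy as np

def ensemble_predict(cnn_outputs):
    """cnn_outputs: list of probability vectors over the 43 classes, one per trained CNN."""
    mean_probs = np.mean(np.stack(cnn_outputs), axis=0)
    return int(np.argmax(mean_probs))
```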
E. Implementation
Deep neural networks trained with the BP algorithm usually require a good deal of time. Several methods have been suggested to speed up the training procedure on a CPU [24]. In addition to software optimization, with the recent development of the graphics processing unit (GPU), hardware acceleration has proved more effective. According to our tests, different layers can be trained 3 to 10 times faster on a GPU.
In fact, the GPU's cores run more slowly than the CPU's cores; however, the GPU has hundreds of cores, each of which runs a thread, making the GPU faster as a whole. However, not all algorithms can be divided into small independent tasks. For a CNN, the kernels' operations are not independent. If we design the reading operations to be independent, then the writing operations will generate conflicts, and the computation results are not guaranteed to be correct. Therefore, we divide the kernels' operations into subtasks according to the writing operations and make them independent, as in [25], which calls such a routine pulling derivatives.
For example, if we want to calculate ∂J/∂inMap, each ∂J/∂inMap_{cxy} is a subtask. For inMap_{cxy}, the influenced outMap_{kji} satisfy
\[ (j-1)\,\mathrm{kernelSize} + 1 \le x \le j\,\mathrm{kernelSize} \tag{10} \]
\[ (i-1)\,\mathrm{kernelSize} + 1 \le y \le i\,\mathrm{kernelSize} \tag{11} \]
\[ \text{all } k \text{ in outMap}. \tag{12} \]
We should not loop over the outMap, because that is the way of pushing derivatives to ∂J/∂inMap and will generate writing conflicts. Hence, we loop over ∂J/∂inMap. Solving (10), (11), and (12), we know that for each ∂J/∂inMap_{cxy}, we only need to pull derivatives from those outMap_{kji} satisfying
\[ \left\lceil \frac{x}{\mathrm{kernelSize}} \right\rceil \le j \le \left\lfloor \frac{x-1}{\mathrm{kernelSize}} \right\rfloor + 1 \tag{13} \]
\[ \left\lceil \frac{y}{\mathrm{kernelSize}} \right\rceil \le i \le \left\lfloor \frac{y-1}{\mathrm{kernelSize}} \right\rfloor + 1 \tag{14} \]
\[ \text{all } k \text{ in outMap}. \tag{15} \]
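The sketch below illustrates the index bookkeeping of pulling derivatives: each GPU thread owns one ∂J/∂inMap element and only reads the outMap derivatives selected by (13)-(15), so no two threads write to the same location. It follows our reconstruction of the inequalities (1-based indices, non-overlapping kernels) and omits the layer-specific chain-rule factors; it is not the authors' CUDA code.

```python
import math

def pulling_range(x, kernel_size):
    """1-based range of outMap indices pulled from for coordinate x, per (13)/(14)."""
    lo = math.ceil(x / kernel_size)
    hi = (x - 1) // kernel_size + 1
    return lo, hi            # for non-overlapping kernels the range contains a single index

def pull_d_in(x, y, d_out, kernel_size):
    """Accumulate dJ/dinMap at (x, y) by pulling from dJ/doutMap[k][j][i]."""
    j_lo, j_hi = pulling_range(x, kernel_size)
    i_lo, i_hi = pulling_range(y, kernel_size)
    total = 0.0
    for k in range(len(d_out)):                    # (15): all k in outMap
        for j in range(j_lo, j_hi + 1):
            for i in range(i_lo, i_hi + 1):
                total += d_out[k][j - 1][i - 1]    # 0-based storage; chain-rule factor omitted
    return total
```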
We write our own GPU CNN package in the C language, controlled by MATLAB.
IV. EXPERIMENTS
Our experiments include two parts. The first compares the convergence of HLSGD and cross-entropy SGD. The second compares the test accuracy of our HLSGD with that of the best algorithm in the IJCNN 2011 German Traffic Sign recognition competition.
Our experiments are conducted on two Tesla C2075 GPUs and a 12-core Intel(R) Core i7-3960X 3.30-GHz computer. In training, only two CPU cores are needed.
A. Data Set
We use the same data set as in the final competition session of the IJCNN 2011 German Traffic Sign recognition competition [15]. The German Traffic Sign Recognition Benchmark poses a single-image multiclass classification problem. The data set has 43 classes of traffic signs. The training set contains 39 209 training images in 43 classes, and the test set contains 12 630 test images.
Each image contains one traffic sign and a border of 10% around the actual sign (at least 5 pixels) to allow for edge-based approaches. Image sizes vary between 15 × 15 and 250 × 250 pixels, and the images are not necessarily square. A bounding box for each traffic sign is part of the annotations. We resize the image within the bounding box to a square image as our CNN's input.
B. Convergence
We use two GPUs to conduct this experiment. The first GPU runs SGD and saves the CNN's current parameters to a hard disk every 1000 iterations. The second GPU constantly reads the saved parameters from the hard disk and tests the accuracy on a validation set of 2209 images. We let the second GPU record 200 monitored accuracies. A trial costs about 7 h.
Fig. 7. Horizontal axis: training time; vertical axis: accuracy on validation data. The three plots are from the same curves. (Top) Whole figure. (Bottom left) Turning part of the curves. (Bottom right) Last 40 monitored accuracies.
TABLE XIII
RECOGNITION RATE OF THE IJCNN 2011 COMPETITION WINNER'S ALGORITHM
We only use the original images as the CNN input in this experiment. The other three preprocessing methods give similar results. The cross-entropy and hinge loss methods are each tried five times, i.e., ten times in total, and for each method the five curves are averaged. The final two averaged curves are shown in Fig. 7.
Analyzing the curves, we find the following. Before the turning point, i.e., the range in which the accuracy is below 95%, both methods improve very quickly and their performances are similar. The reason is that when the model is very bad, all the training examples lie in the positive range of the cost function; hence, cross-entropy and hinge loss behave the same. After the turning point, however, more and more training examples become useless. The hinge loss method only learns the bad examples, and its accuracy continues to improve until final convergence. Cross-entropy, in contrast, keeps driving the outputs of useless examples toward 1.0, so most parameter updates are similar and the bad examples are rarely learned. That is why the cross-entropy curve fluctuates a lot and converges less stably. At the end of the two curves, the hinge loss curve gives a better validation accuracy of 98.89% than the cross-entropy curve's 98.48%.

TABLE XIV
RECOGNITION RATE OF OUR ALGORITHM

TABLE XV
RECOGNITION RATE OF DIFFERENT CLASSIFIERS
Fig. 8. 44 errors of our ensemble CNNs.
C. Accuracy
Here, we show that our algorithm gives state-of-the-art test accuracy. For each preprocessing method, we train five CNNs. The final model's output is the average of all 20 CNNs' outputs. The best previous record on this data set was achieved by the IJCNN 2011 recognition competition winner, Dan Ciresan. In addition to the above four preprocessing methods, they use another preprocessing method called Conorm. They train five CNNs for each method, and their performance is shown in Table XIII. Our algorithm's performance is shown in Table XIV. For every kind of preprocessing, our algorithm has better test accuracy, and the final ensemble model gives the best accuracy of 99.65% over the winner's 99.46%, a 35.19% relative reduction of the error rate.
In the competition, other participants using other classifiers also reported results, as shown in Table XV. To compare computer ability with human ability, apart from providing the benchmark data set, the Institut für Neuroinformatik Real-Time Computer Vision research group also organized several people who are good at labeling traffic signs to give a performance on the test set. The human performance is 98.84%, which is below our accuracy and that of the winner. Other methods, such as random forests on hand-coded histogram of oriented gradients (HOG) features [26] and linear discriminant analysis on HOG features, are below human performance.
D. Error Analysis
We show the 44 errors made by our ensemble CNNs on the 12 630 test images in Fig. 8. Errors mainly result from low resolution, strong sunlight, and incorrect bounding-box annotations. When an error happens, the maximum value of the output is not big enough. If the system focuses on precision, most errors can be rejected by a higher threshold.
V. CONCLUSION
We have designed our TSR system using a CNN, which is a special kind of deep neural network. The model has the ability to learn both features and classifiers. The learned features detect specific local patterns and prove to be better than hand-coded features. We have described the details of each layer and written a GPU package for our CNN, which speeds up training by a factor of 3 to 10 over a CPU.
We have proposed an HLSGD method to train CNNs. We tested our algorithm on the German Traffic Sign Recognition Benchmark and compared our results with those of other competitors. The experiments show that HLSGD gives faster and more stable convergence and a state-of-the-art recognition rate of 99.65%.
ACKNOWLEDGMENT
The authors would like to thank the Tsinghua National Laboratory for Information Science and Technology for the Explorer 100 cluster system on which this work was completed.
REFERENCES
[1] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, "Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1484–1497, Dec. 2012.
[2] M. Meuter, C. Nunny, S. M. Gormer, S. Muller-Schneiders, and A. Kummert, "A decision fusion and reasoning module for a traffic sign recognition system," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1126–1134, Dec. 2011.
[3] A. Ruta, Y. Li, and X. Liu, "Robust class similarity measure for traffic sign recognition," IEEE Trans. Intell. Transp. Syst., vol. 11, no. 4, pp. 846–855, Dec. 2010.
[4] F. Zaklouta and B. Stanciulescu, "Real-time traffic-sign recognition using tree classifiers," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1507–1514, Dec. 2012.
[5] J. Greenhalgh and M. Mirmehdi, "Real-time detection and recognition of road traffic signs," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1498–1506, Dec. 2012.
[6] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, May 2004.
[7] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. CVPR, 2005, vol. 1, pp. 886–893.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[9] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "Introduction to the special issue on machine learning for traffic sign recognition," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1481–1483, Dec. 2012.
[10] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," J. Physiol., vol. 195, no. 1, pp. 215–243, Mar. 1968.
[11] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biol. Cybern., vol. 36, no. 4, pp. 193–202, Apr. 1980.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[13] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, vol. 25, pp. 1106–1114.
[14] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, "A committee of neural networks for traffic sign classification," in Proc. IJCNN, 2011, pp. 1918–1921.
[15] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "The German Traffic Sign Recognition Benchmark: A multi-class classification competition," in Proc. IJCNN, 2011, pp. 1453–1460.
[16] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in Proc. IJCNN, 2011, pp. 2809–2813.
[17] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, Jun. 1996.
[18] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 2146–2153.
[20] Q. V. Le, "Building high-level features using large scale unsupervised learning," in Proc. ICASSP, 2013, pp. 8595–8598.
[21] D. Scherer, A. Müller, and S. Behnke, "Evaluation of pooling operations in convolutional architectures for object recognition," in Artificial Neural Networks–ICANN. Berlin, Germany: Springer-Verlag, 2010, pp. 92–101.
[22] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade. Berlin, Germany: Springer-Verlag, 1998, pp. 9–50.
[23] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Conf. CVPR, 2012, pp. 3642–3649.
[24] J. Bouvrie, "Notes on convolutional neural networks," 2006.
[25] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, vol. 2, pp. 1237–1242.
[26] F. Zaklouta, B. Stanciulescu, and O. Hamdoun, "Traffic sign classification using k-d trees and random forests," in Proc. IJCNN, 2011, pp. 2151–2155.
Junqi Jin received the B.S. degree from Tsinghua
University, Beijing, China, in 2011, where he is
currently working toward the Ph.D. degree in the
Department of Automation.
Kun Fu received the B.S. degree from Tsinghua
University, Beijing, China, in 2011, where he is
currently working toward the Ph.D. degree in the
Department of Automation.
Changshui Zhang (M'02) received the B.S. degree in mathematics from Peking University, Beijing, China, in 1986 and the M.S. and Ph.D. degrees in control science and engineering from Tsinghua University, Beijing, in 1989 and 1992, respectively.
In 1992, he joined the Department of Automation, Tsinghua University, where he is currently a Professor. He has authored more than 200 papers. His research interests include pattern recognition and machine learning.
Dr. Zhang is an Associate Editor of the Pattern Recognition Journal. He is a member of the Standing Council of the Chinese Association of Artificial Intelligence.
