
Optical Recognition of Handwritten Digits

ECE539 Project Report

Pradeep Rajendran
December 20, 2013

1 Introduction
Recognition of handwritten digits has many everyday uses. In particular, it is applied to the
automated sorting of postal addresses based on zip code. A general implementation of such a system
is briefly described below.
As the postal item moves on a conveyor belt, robotic mechanisms reposition the item such that
the address label is visible to a camera. The camera then registers an image of the label. The
image is fed into an image processing system, which dissects it into constituent character
blocks. The digit blocks composing the zip code field are then thresholded and pre-processed to
ensure uniform scale and orientation. After pre-processing, the binary image of a digit block is
ready for use in machine learning tools.
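
The thresholding step mentioned above can be sketched as follows. This is only a minimal illustration and not the pre-processing of an actual sorting system; the function name `binarize` and the cut-off value of 128 are assumptions made for the example.

```python
import numpy as np

def binarize(block, threshold=128):
    """Map an 8-bit grayscale digit block to a binary (0/1) image.

    The threshold of 128 is an assumed midpoint, not a value from the report.
    """
    block = np.asarray(block, dtype=np.uint8)
    return (block >= threshold).astype(np.uint8)

# Tiny 2 x 3 example block with gray values between 0 and 255.
example = np.array([[0, 127, 128],
                    [200, 50, 255]], dtype=np.uint8)
binary = binarize(example)
```

Pixels at or above the cut-off become 1 and all others become 0.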
The pre-processing step is rather involved and is outside the scope of this project, as it requires
many image processing steps. It cannot be skipped, however, as it is crucial to the success of most
machine learning tools. In this project, I focus on how the performance of common machine learning
tools compares.

2 Dataset
The MNIST (Modified National Institute of Standards and Technology) dataset [1] is used in this
project. All the digit images have been pre-processed such that the digit is centered on a 28 × 28
block of 8-bit gray values. Fig. 1 shows an example of a digit block containing the character 8.
Fig. 2 shows an ensemble of digit blocks containing the character 5.
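
To make the link between a 28 × 28 digit block and the 784-dimensional feature vectors used later explicit, a minimal sketch is given below. The function name `block_to_feature_vector` and the division by 255 (which maps gray values to [0, 1]) are assumptions for illustration, and the block used here is synthetic rather than taken from MNIST.

```python
import numpy as np

def block_to_feature_vector(block):
    """Flatten a 28 x 28 block of 8-bit gray values into a 784-dim vector in [0, 1]."""
    block = np.asarray(block)
    if block.shape != (28, 28):
        raise ValueError("expected a 28 x 28 digit block")
    return block.reshape(-1).astype(np.float64) / 255.0

# Synthetic stand-in for a digit block (not real MNIST data).
synthetic_block = np.zeros((28, 28), dtype=np.uint8)
synthetic_block[10:18, 12:16] = 255          # an 8 x 4 bright rectangle
features = block_to_feature_vector(synthetic_block)
```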







Fig. 1: 28 × 28 digit block containing character 8

Fig. 2: A small subset of training ensemble containing character 5

3 Tools applied
3.1 Multi-layer perceptron
A two-layer perceptron implementation from the Neural Network Toolbox is utilized in this
section. The input layer has 784 inputs, one per pixel of the 28 × 28 digit block. The hidden layer
has h neurons, and the output layer consists of 10 neurons corresponding to the 10 class labels
(i.e., the 10 digits).
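
The structure of such a network can be sketched as a forward pass in NumPy. This is not the toolbox implementation: the tanh hidden units, the softmax output, the random weights and the function name `mlp_forward` are all assumptions made for illustration, since the report does not state these details.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a 784 -> h -> 10 perceptron.

    tanh hidden units and a softmax output are assumptions for this sketch.
    """
    hidden = np.tanh(W1 @ x + b1)            # h hidden activations
    scores = W2 @ hidden + b2                # 10 output scores, one per digit
    scores = scores - scores.max()           # stabilise the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
h = 100                                      # hidden-layer size
W1 = rng.normal(scale=0.01, size=(h, 784))   # untrained, random weights
b1 = np.zeros(h)
W2 = rng.normal(scale=0.01, size=(10, h))
b2 = np.zeros(10)

x = rng.random(784)                          # random stand-in for a digit vector
probs = mlp_forward(x, W1, b1, W2, b2)
predicted_digit = int(np.argmax(probs))
```

The predicted class is simply the output neuron with the largest value; with untrained weights the prediction is of course meaningless.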

Fig. 3: An example of 2-layer neural network with 10 hidden neurons

The table below shows the various values of h that were tried and the corresponding performances.

h Error rate (%)

10 8.46
20 8.92
100 2.09
120 11.22
200 21.27

Table 1: Various values of h and corresponding error rates

From Table 1, h = 100 gives the lowest error rate (2.09%), far below that of h = 200 (21.27%). The
poor result for h = 200 might be due to over-fitting. The ROC plots for h = 100 are given in Fig. 4.

(a) ROC for h = 100 (b) ROC for h = 100

Fig. 4: ROC plot and its zoomed version (right)

From the ROC plots, the digit 4 appears to be misclassified most often, as its curve lies furthest
from the ideal point (0, 1).
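
The notion of a curve being "furthest from (0, 1)" can be made concrete with a small sketch. It computes one-versus-rest ROC points by sweeping a threshold over classifier scores and then measures the smallest distance from the curve to the ideal point. The function names and the example scores are made up for illustration, and tied scores are not treated specially.

```python
import numpy as np

def roc_points(scores, is_positive):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over the scores."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    order = np.argsort(-scores)              # highest score first
    hits = is_positive[order]
    tpr = np.cumsum(hits) / hits.sum()       # true positive rate
    fpr = np.cumsum(~hits) / (~hits).sum()   # false positive rate
    return fpr, tpr

def min_distance_to_ideal(fpr, tpr):
    """Smallest Euclidean distance from the curve to the ideal point (0, 1)."""
    return float(np.min(np.hypot(fpr, 1.0 - tpr)))

# Made-up scores for one digit class versus the rest (not data from the report).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
distance = min_distance_to_ideal(fpr, tpr)
```

A larger distance for one digit than for the others indicates that its classifier output separates that digit less cleanly.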

3.2 Support vector machine

The LIBSVM suite of tools developed by Chih-Chung Chang and Chih-Jen Lin [2] is used in this
section. The main tools are svm-scale (used for scaling data), svm-train (used to obtain support
vectors from given data), and svm-predict (used to make predictions based on a trained model).

svm-scale is first used to scale the input feature vectors in the training file so that each feature
lies in the range [-1, +1]. A scale file is also produced along with the scaled training file. The
scale file contains the scale values that have to be applied to the testing data as a pre-processing
step. The scaled training file is then input to svm-train, which produces a model file. This model
file contains the support vectors identified in the scaled training file. Once the model file is
obtained, testing is performed using svm-predict, which uses the model file to classify the test
vectors and produces an output file containing these predictions.
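
The role of the scale file can be illustrated with a short sketch: the scaling parameters are learned from the training data only and then reused unchanged on the test data. This is a NumPy illustration assuming a linear min-max mapping to [-1, +1], not LIBSVM code; the function names `fit_scale` and `apply_scale` are hypothetical, and mapping constant features to the lower bound is a simplification made here.

```python
import numpy as np

def fit_scale(train, lower=-1.0, upper=1.0):
    """Learn per-feature minima and maxima from the training data (the 'scale file')."""
    train = np.asarray(train, dtype=float)
    return {"min": train.min(axis=0), "max": train.max(axis=0),
            "lower": lower, "upper": upper}

def apply_scale(data, params):
    """Linearly map each feature from [min, max] to [lower, upper]."""
    data = np.asarray(data, dtype=float)
    span = params["max"] - params["min"]
    span = np.where(span == 0, 1.0, span)    # constant features: avoid dividing by zero
    unit = (data - params["min"]) / span
    return params["lower"] + (params["upper"] - params["lower"]) * unit

train = [[0.0, 10.0], [255.0, 20.0]]
test = [[127.5, 25.0]]

params = fit_scale(train)                    # learned from training data only
scaled_train = apply_scale(train, params)
scaled_test = apply_scale(test, params)      # same parameters reused for testing
```

Note that a test value outside the training range (25 in the second feature) is mapped outside [-1, +1], which is the expected consequence of reusing the training parameters.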

Choice of parameters
Several parameters can be chosen during the training phase of the SVM. Table 2 shows the error
rates corresponding to different kernels and parameters.

Kernel Type Error rate (%)

Linear 6.85
Polynomial (Order 4) 2.18
Polynomial (Order 5) 2.01
Polynomial (Order 6) 1.92
Polynomial (Order 7) 1.92
Polynomial (Order 8) 1.99
Gaussian 2.66
Sigmoidal 10.71

Table 2: Performance for different kernels

According to Table 2, the highest performance (an error rate of 1.92%) is achieved when a
polynomial kernel of order 6 or 7 is used.
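
For reference, the four kernel families in Table 2 can be written as short functions. The forms below follow the standard definitions (the Gaussian kernel is the radial basis function kernel); the values of `gamma` and `coef0` used in the example are assumptions, since the report does not list the parameter settings that were used.

```python
import numpy as np

def linear_kernel(u, v):
    return float(np.dot(u, v))

def polynomial_kernel(u, v, degree, gamma=1.0, coef0=0.0):
    return float((gamma * np.dot(u, v) + coef0) ** degree)

def gaussian_kernel(u, v, gamma=1.0):
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return float(np.tanh(gamma * np.dot(u, v) + coef0))

# Two small vectors with entries in [-1, +1], as after scaling.
u = np.array([1.0, -1.0, 1.0])
v = np.array([1.0, 1.0, 1.0])

k_linear = linear_kernel(u, v)                                  # u . v = 1
k_poly6 = polynomial_kernel(u, v, degree=6, gamma=1.0, coef0=1.0)
k_gauss = gaussian_kernel(u, v, gamma=0.5)                      # exp(-0.5 * 4)
k_sigmoid = sigmoid_kernel(u, v)
```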

3.3 K-Nearest Neighbor (K-NN)

K-NN is one of the simplest classification methods, but it is also slow and memory intensive.
This is because, during the testing phase, a similarity metric between each test feature vector and
all the training feature vectors has to be calculated. For this dataset, 60 000 similarity metric
calculations therefore have to be performed for each test vector. Since there are 10 000 test
vectors, the total number of similarity metric calculations is 600 000 000.
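
The count above, and the per-test-vector work it represents, can be sketched as follows. The squared Euclidean distance is assumed as the similarity metric for illustration, since the report does not name the metric, and the function name `squared_distances` is hypothetical.

```python
import numpy as np

n_train, n_test = 60_000, 10_000
total_comparisons = n_train * n_test         # one metric evaluation per (test, train) pair

def squared_distances(test_vector, train_matrix):
    """Squared Euclidean distance from one test vector to every training row."""
    diff = np.asarray(train_matrix, dtype=float) - np.asarray(test_vector, dtype=float)
    return np.einsum("ij,ij->i", diff, diff)

# Tiny example: two training vectors, one test vector.
example_distances = squared_distances([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0]])
```

Each call performs one distance evaluation per training vector, so the full test set requires `n_test` such calls.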

Eigen-digit method
The Eigen-digit method uses PCA (Principal Component Analysis) to reduce the dimensionality of the
feature space from M = 784 to m. In this way, each training feature vector becomes an m-dimensional
vector instead of the original 784-dimensional vector, as shown in Fig. 5. While performing PCA,
the m eigenvectors (v_j) and 10 eigendigits (E_j = [v_1 v_2 ... v_m]) are also obtained.

[Figure: a 784-dim vector is projected onto the eigenvectors v1 to v5, giving Weight 1 to Weight 5, which together form a 5-dim vector]

Fig. 5: Example of dimensionality reduction for m = 5
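
The dimensionality reduction can be sketched with NumPy as below. This is a simplified illustration rather than the code used for the project: it computes a single shared PCA basis via the singular value decomposition, whereas the report refers to 10 eigendigit matrices, and it uses random data in place of MNIST. The function names `fit_pca` and `project` are hypothetical.

```python
import numpy as np

def fit_pca(train, m):
    """Return the training mean and the top-m principal directions (as columns)."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:m].T

def project(x, mean, basis):
    """Weights of x along each retained direction (an m-dim vector)."""
    return (np.asarray(x, dtype=float) - mean) @ basis

rng = np.random.default_rng(0)
train = rng.random((200, 784))               # random stand-in for training vectors
mean, basis = fit_pca(train, m=5)
weights = project(train[0], mean, basis)     # 784-dim vector -> 5 weights
```

The same `mean` and `basis` would be reused to project test vectors, so that training and test vectors live in the same m-dimensional space.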

During the testing phase, each 784-dimensional test feature vector is projected to the
m-dimensional feature space using the eigen-digits obtained in the training phase. The resulting
m-dimensional feature vector is then compared with the large collection of labeled m-dimensional
training feature vectors (see Fig. 6). During the comparison, the K closest labeled matches from
the collection are identified. Amongst these K matches, the most frequently occurring label is
taken as the classification output of the Eigen-digit method.
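
The voting step described above can be sketched as follows. The squared Euclidean distance is assumed as the comparison metric, the function name `knn_predict` is hypothetical, and the two-dimensional vectors stand in for the m-dimensional projected vectors; ties between labels are not treated specially.

```python
from collections import Counter

import numpy as np

def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the most frequent label among the k training vectors closest to query."""
    diff = np.asarray(train_vectors, dtype=float) - np.asarray(query, dtype=float)
    distances = np.einsum("ij,ij->i", diff, diff)
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled collection (made-up values, not data from the report).
train_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
train_labels = [5, 5, 3, 8, 8]

label_a = knn_predict([0.2, 0.1], train_vectors, train_labels, k=3)
label_b = knn_predict([5.0, 5.4], train_vectors, train_labels, k=3)
```

For the first query the three closest labels are 5, 3 and 5, so the output is 5; for the second they are 8, 8 and 5, so the output is 8.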

[Figure: the 784-dim test vector is projected using the eigendigits (E0, ..., E5, ...) to 50-dim vectors; the distance between each 50-dim vector and the 50-dim training vectors is then computed]

Fig. 6: Processing steps illustrated for m = 50 and K = 3

Combinations   K = 1    K = 3    K = 5    K = 7    K = 11   Computation time (s)
m = 10         61.156   60.546   61.656   62.286   63.496     71
m = 50          8.981    8.931    9.841   10.991   13.851    410
m = 100         5.261    5.451    6.141    7.011    8.501    837
m = 200         4.01     4.23     4.78     5.381    6.621   1474
m = 250         3.77     4.03     4.59     5.161    6.391   1786
m = 500         3.49     3.76     4.16     4.65     5.731   3348
m = 784         3.47     3.7      4.12     4.59     5.661   5907

Table 3: Error rate (%) and computation time (s) for various combinations of m and K

Increasing K does not generally improve the error rate: for m ≥ 100, K = 1 gives the lowest error.
Increasing m yields diminishing improvements in error rate while the computation time keeps
growing. A trade-off between computation time and error rate can be achieved by picking the
smallest value of m for which the error rate specification is still met.
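
As an illustration of this trade-off, the sketch below picks the cheapest m that meets a given error target, using the K = 1 column and the computation times of Table 3. The 4% target and the function name `cheapest_m` are hypothetical choices made for the example.

```python
# K = 1 error rates (%) and computation times (s), copied from Table 3.
table3_k1 = {
    10: (61.156, 71),
    50: (8.981, 410),
    100: (5.261, 837),
    200: (4.01, 1474),
    250: (3.77, 1786),
    500: (3.49, 3348),
    784: (3.47, 5907),
}

def cheapest_m(table, max_error_percent):
    """Return the m with the lowest computation time whose error meets the target, or None."""
    feasible = [(time_s, m) for m, (error, time_s) in table.items()
                if error <= max_error_percent]
    return min(feasible)[1] if feasible else None

chosen_m = cheapest_m(table3_k1, max_error_percent=4.0)   # hypothetical 4% target
```

With a 4% target, m = 250 is selected: it meets the target at less than a third of the computation time of m = 784.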

4 Conclusion
The SVM seems to be particularly well suited for digit recognition, as it has the best performance
when compared to the MLP or K-NN methods. The best error rates of SVM, MLP and K-NN are 1.92%,
2.09% and 3.47%, respectively. SVM is not only accurate but also faster than the other methods
tested. K-NN with m = 784 and K = 1 took nearly 1 hour and 40 minutes (5907 s) of computation time
to calculate the labels of 10 000 test vectors. This translates to about 0.6 seconds per test
vector, which is too slow.

[1] Y. LeCun and C. Cortes, MNIST handwritten digit database, 2010.
[2] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions
on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.