
Optical Handwritten Character Recognition

National Center for Scientific Research "Demokritos", Athens, Greece
Institute of Informatics and Telecommunications
Computational Intelligence Laboratory (CIL)

Giorgos Vamvakas
gbam@iit.demokritos.gr
Outline

 Handwritten OCR systems
 Greek Handwritten Character Recognition
 Novel Feature Extraction followed by a Hierarchical Classification Scheme for Handwritten Character Recognition
 Historical Character Recognition
 Unconstrained Word Recognition
 Word Spotting
 Parameter Selection based on Clustering for Character Segmentation
OCR Systems

 OCR systems consist of the following major stages:

• Pre-processing
• Segmentation
• Feature Extraction
• Classification
• Post-processing
Feature Extraction Methods

 In the feature extraction stage each character is represented as a feature vector, which becomes its identity. The major goal of feature extraction is to extract a set of features that maximizes the recognition rate with the least number of elements.

 Due to the high degree of variability and imprecision inherent in handwriting, obtaining these features is a difficult task. Feature extraction methods are based on two types of features:

• Statistical
• Structural
Statistical Features

 Representation of a character image by the statistical distribution of points accommodates style variations to some extent.

 The major statistical features used for character representation are:

• Zoning
• Projections and Profiles
• Crossings and Distances


Structural Features

 Characters can be represented by structural features with


high tolerance to distortions and style variations. This type of
representation may also encode some knowledge about the
structure of the object or may provide some knowledge as to
what sort of components make up that object.

 Structural features are based on topological and geometrical


properties of the character, such as aspect ratio, cross points,
loops, branch points, strokes and their directions, inflection
between two points, horizontal curves at top or bottom, etc.
Greek OCR

 The CIL Database was used


• 56 characters
• 625 variations of each character
• 35,000 isolated and labeled Greek handwritten characters

 10 pairs of classes were merged, due to the size normalization step, resulting in a database of 28,750 characters (46 classes).
Feature Extraction - [1]

 Two types of features:

• Features based on zones:
The character image is divided into horizontal and vertical zones and the density of character pixels is calculated for each zone.

• Features based on character projection profiles:
The centre of mass (xt, yt) of the image is first found.
Upper/lower profiles are computed by considering, for each image column, the distance between the horizontal line y = yt and the pixel closest to the upper/lower boundary of the character image. This results in two zones, above and below yt. Both zones are then divided into vertical blocks, and for each block we calculate the area of the upper/lower character profile.
Similarly, we extract the features based on left/right profiles.
[1] G. Vamvakas, B. Gatos, I. Pratikakis, N. Stamatopoulos, A. Roniotis, S.J. Perantonis, "Hybrid Off-Line OCR
for Isolated Handwritten Greek Characters", The Fourth IASTED International Conference on Signal Processing,
Pattern Recognition and Applications (SPPRA’07), pp. 197-202, Innsbruck, Austria, February 2007.
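As a rough illustration of the two feature types above, the sketch below computes zone densities and upper-profile block areas on a binary character image. The function names and the 4×4 zone grid / 4-block split are illustrative choices, not the paper's exact parameters, and the profile here is measured from the top image edge rather than from the centre-of-mass line y = yt, as a simplification.

```python
import numpy as np

def zone_densities(img, n_rows=4, n_cols=4):
    """Split a binary character image into n_rows x n_cols zones and
    return the foreground-pixel density of each zone."""
    h, w = img.shape
    feats = []
    for i in range(n_rows):
        for j in range(n_cols):
            zone = img[i * h // n_rows:(i + 1) * h // n_rows,
                       j * w // n_cols:(j + 1) * w // n_cols]
            feats.append(zone.mean() if zone.size else 0.0)
    return feats

def upper_profile_areas(img, n_blocks=4):
    """For each column, take the distance from the top edge to the first
    foreground pixel; group the distances into vertical blocks and sum
    each block to get its profile area."""
    h, w = img.shape
    dists = []
    for col in range(w):
        rows = np.flatnonzero(img[:, col])
        dists.append(rows[0] if rows.size else h)  # h if the column is empty
    dists = np.array(dists, dtype=float)
    return [dists[b * w // n_blocks:(b + 1) * w // n_blocks].sum()
            for b in range(n_blocks)]

char = np.zeros((16, 16), dtype=np.uint8)
char[4:12, 6:10] = 1  # a toy rectangular "stroke"
print(len(zone_densities(char)), len(upper_profile_areas(char)))  # 16 4
```

The lower, left and right profile features follow the same pattern with the scan direction changed.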
Feature Extraction - [2]

 Three types of features


 zones & projections
 distance features
 profile features
 Dimensionality Reduction
Linear Discriminant Analysis (LDA)

[2] G. Vamvakas, B. Gatos, S. Petridis and N. Stamatopoulos, ''An Efficient Feature Extraction and
Dimensionality Reduction Scheme for Isolated Greek Handwritten Character Recognition'', Proceedings of
the 9th International Conference on Document Analysis and Recognition, Curitiba, Brazil, 2007, pp. 1073-1077.
Experimental Results – Greek OCR

 Classifier: Support Vector Machines (SVM)

                             CIL Database
Feature Extraction [2]          92.05%
Feature Extraction [1]          91.61%
Kavallieratou et al. [3]        88.62%

[3] E. Kavallieratou, N. Fakotakis, G. Kokkinakis, "Handwritten Character Recognition Based on Structural Characteristics", 16th International Conference on Pattern Recognition (ICPR'02), Volume 3, 2002.
Feature Extraction
Hierarchical Classification (v.1)

 Recursive subdivision of the character image based on the introduced Division Point (DP) [4]
 DP = the pixel at the intersection of a horizontal and a vertical line at which the resulting sub-images at each iteration have balanced (approximately equal) numbers of foreground pixels

 At level L, the co-ordinates (xi, yi) of all DPs are stored as features

[4] G. Vamvakas, B. Gatos, S. J. Perantonis, "Hierarchical Classification of Handwritten Characters based on


Novel Structural Features" (ICFHR'08), 11th International Conference on Frontiers in Handwriting Recognition,
Montreal, Canada, August 2008.
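A minimal sketch of one DP computation, assuming a binary numpy image: the vertical line is placed where the cumulative column sums reach half the foreground mass, and likewise for the horizontal line. The function name and the cumulative-sum strategy are illustrative; the papers refine this with sub-pixel accuracy.

```python
import numpy as np

def division_point(img):
    """Return (x, y): a vertical line at x and a horizontal line at y that
    split the foreground into approximately equal halves along each axis."""
    col_cum = np.cumsum(img.sum(axis=0))  # foreground mass left of each x
    row_cum = np.cumsum(img.sum(axis=1))  # foreground mass above each y
    total = col_cum[-1]
    x = int(np.searchsorted(col_cum, total / 2))
    y = int(np.searchsorted(row_cum, total / 2))
    return x, y

img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 1  # a symmetric blob, so the DP lands near its centre
print(division_point(img))  # (3, 3)
```

Recursing into the four sub-images produced by each DP yields the next granularity level, so level L contributes the co-ordinates of all DPs found so far.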
Feature Extraction
Hierarchical Classification (v.1)

 Step 1: Starting from level 1 and gradually proceeding to higher levels of granularity, features are extracted for the training patterns, the confusion matrix is created using cross-validation and the overall recognition rate is calculated, until the recognition accuracy starts decreasing. The level at which the highest recognition rate (Max_RR) is achieved is considered the best performing granularity level (L0).
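The level-selection loop of Step 1 can be sketched as below. `cross_val_accuracy` is a stand-in for extracting level-L features, training, and cross-validating; the toy accuracy table is made up for illustration and is not from the paper.

```python
def choose_level(cross_val_accuracy, max_level=10):
    """Advance through granularity levels until accuracy starts decreasing;
    return the best level L0 and its recognition rate Max_RR."""
    best_level, best_acc = 1, cross_val_accuracy(1)
    for level in range(2, max_level + 1):
        acc = cross_val_accuracy(level)
        if acc < best_acc:       # accuracy starts decreasing -> stop
            break
        best_level, best_acc = level, acc
    return best_level, best_acc

# Toy cross-validation accuracies per level (illustrative values only).
toy = {1: 0.80, 2: 0.88, 3: 0.93, 4: 0.91}
L0, max_rr = choose_level(lambda l: toy.get(l, 0.0), max_level=4)
print(L0, max_rr)  # 3 0.93 -- level 3 gives the highest rate before the drop
```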
Feature Extraction
Hierarchical Classification (v.1)

 Step 2: At L0, where the maximum recognition rate is obtained, the corresponding confusion matrix is scanned and classes with high misclassification rates are merged

 Step 3: For each one of the groups of classes found another


classifier is trained with features extracted at level L0 + 1 of
the granularity procedure in order to distinguish them at a
later stage of the classification

 Step 4: Each pattern of the test set is then fed to the initial classifier with features extracted at level L0. If the classifier decides that this pattern belongs to one of the single classes, the unknown pattern is considered classified. Otherwise, if it is assigned to one of the groups of classes, the new classifier decides the recognition result
Feature Extraction
Hierarchical Classification (v.2)

 In order to improve precision, the DP is calculated with sub-pixel accuracy [5]

(Figure: example DP computation — the absolute difference of foreground pixels between sub-images is 4 without sub-pixel accuracy and 2 with it)

[5] G. Vamvakas, B. Gatos, S. J. Perantonis, “A Novel Feature Extraction and Classification Methodology for the
Recognition of Historical Documents”, 10th International Conference on Document Analysis and Recognition
(ICDAR’09), pp 491-495, Barcelona, Spain, July 2009
Feature Extraction
Hierarchical Classification (v.3)

 For each group of confused classes found in Step 2, do not use features from level L0 + 1 to distinguish them; instead, iterate Step 1 again [6]

(Figure: example subdivisions at Level = 2 and at Level = 4)

[6] G. Vamvakas, B. Gatos, S. J. Perantonis, "Handwritten Character Recognition through Two-Stage Foreground Sub-Sampling", Pattern Recognition, accepted for publication.
Experimental Results

 Databases
 CIL
 CEDAR
• 52 characters (classes) of isolated and labeled English
handwritten characters
• 19145 characters for training

• 2183 characters for testing

 MNIST
• 10 classes of isolated and labeled handwritten digits
• 60000 digits for training
• 10000 digits for testing
Experimental Results

 CIL

                                             CIL Database
[1] - zones & projections                       91.61%
[3] - Kavallieratou et al.                      88.62%
[2] - dimensionality reduction                  92.05%
[4] - DP (no sub-pixel)                         93.21%
[5] - DP (sub-pixel)                            93.65%
[6] - DP (sub-pixel and iteration of Step 1
      of the classification procedure)          95.63%
Experimental Results

 CEDAR
          CEDAR Character Database (52 Classes)
          Uppercase    Lowercase    Overall
          Characters   Characters   Recognition Rate
YAM [7]   NA           NA           75.70%
KIM [8]   NA           NA           73.25%
GAD [9]   79.23%       70.31%       74.77%
DP [6]    86.17%       84.05%       85.11%

[7] H. Yamada and Y. Nakano, "Cursive Handwritten Word Recognition Using Multiple Segmentation Determined by Contour Analysis", IEICE Transactions on Information and Systems, Vol. E79-D, pp. 464-470, 1996.
[8] F. Kimura, N. Kayahara, Y. Miyake and M. Shridhar, "Machine and Human Recognition of Segmented Characters from Handwritten Words", International Conference on Document Analysis and Recognition (ICDAR '97), Ulm, Germany, 1997, pp. 866-869.
[9] P. D. Gader, M. Mohamed and J-H. Chiang, "Handwritten Word Recognition with Character and Inter-Character Neural Networks", IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, Vol. 27, 1997, pp. 158-164.
Experimental Results

 CEDAR

                               CEDAR
          Uppercase Characters         Lowercase Characters
          (26 Classes)                 (26 Classes)
          # Train  # Test  Recogni-    # Train  # Test  Recogni-
          Patterns Patterns tion Rate  Patterns Patterns tion Rate
BLU [10]   7175     939    81.58%      18655    2240    71.52%
DP [6]    11454    1367    95.90%       7691     816    93.50%

[10] M. Blumenstein, X.Y. Liu, B. Verma, "A modified direction feature for cursive character recognition",
IEEE International Joint Conference on Neural Networks, Vol.4, pp. 2983 – 2987, 2007.
Experimental Results

 CEDAR – Merge uppercase and lowercase characters with


similar shapes
Experimental Results

 CEDAR – Merge uppercase and lowercase characters with


similar shapes
                               CEDAR
          Number of      Recognition  Number of        Recognition
          Classes        Rate         Classes          Rate
          (all classes)               (after merging)
SIN [11]  52             NA           36               67%
CAM [12]  52             83.74%       39               84.52%
DP [6]    52             85.11%       35               94.73%

[11] S. Singh and M. Hewitt, "Cursive Digit and Character Recognition on Cedar Database", International Conference on Pattern Recognition (ICPR 2000), Barcelona, Spain, 2000, pp. 569-572.
[12] F. Camastra and A. Vinciarelli, "Combining Neural Gas and Learning Vector Quantization for Cursive Character Recognition", Neurocomputing, Vol. 51, 2003, pp. 147-159.
Experimental Results

 MNIST

• Recognition Rate = 99.03%

• According to [13] the lowest recognition rate reported for the MNIST database is 88% and the highest is 99.61%, while the best available results vary between 98.5% and 99.5%.

[13] The MNIST Database, http://yann.lecun.com/exdb/mnist/


Application to Historical Character
Recognition

 POLYTIMO
 Handwritten Database (HW)
• 51 characters (classes) of Greek historical
handwritten characters
• 5407 characters for training
• 1351 characters for testing

 Typewritten Database (TW)


• 67 classes of Greek historical typewritten
characters
• 11173 characters for training
• 2793 characters for testing
Application to Historical Character
Recognition

 POLYTIMO

                              TW-Database   HW-Database
[1] - zones & projections        95.44%        94.62%
[5] - DP (sub-pixel)             97.71%        94.51%
[6] - DP (sub-pixel and
      iteration of Step 1
      of the classification
      procedure)                 98.24%        95.21%
Application to Historical Character
Recognition

 IMPACT
 Typewritten Database (TW-1)
• 53 characters (classes) of German historical typewritten characters
• 13181 characters for training
• 3246 characters for testing

 Recognition Rate = 99.53%


Unconstrained Word Recognition [14]

 Methodology described in [6] with two changes

 Slant Correction based on Entropy
• The dominant slant of the word is the angle whose slant-corrected version gives the minimum entropy of the vertical projection histogram. The vertical projection histogram is calculated for a range of angles ±R; in our case R = 60 seems to cover all writing styles. The slant of the word, am, is found from:

      am = argmin_{-R <= a <= R} H(a),   H = - Σ_{i=1}^{N} pi log pi

  where pi is the normalized value of the i-th bin of the vertical projection histogram of the word sheared by angle a

• The word is then corrected by am using the shear transform:

      x' = x - y tan(am),   y' = y
Unconstrained Word Recognition [14]

 In order for the features to be invariant to translation, the feature vector does not consist of the co-ordinates (xi, yi) of all the DPs at a level L, but of the pairs (xi - x0, yi - y0), where (xi, yi) are the co-ordinates of DPi at L and (x0, y0) are the co-ordinates of the initial DP (at level L = 0) of the word image

[14] G. Vamvakas, B. Gatos and S.J.Perantonis, “Efficient Character/Word Recognition based on a Hierarchical
Classification Scheme” , International Journal on Document Analysis and Recognition (IJDAR) , under review.
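The translation-invariant encoding above amounts to one subtraction per DP; a minimal sketch (function name illustrative):

```python
def translation_invariant_features(dps, dp0):
    """dps: list of (xi, yi) DP co-ordinates at level L;
    dp0: (x0, y0), the initial DP at level L = 0."""
    x0, y0 = dp0
    return [(x - x0, y - y0) for x, y in dps]

print(translation_invariant_features([(5, 7), (9, 3)], (4, 4)))  # [(1, 3), (5, -1)]
```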
Experimental Results

 IAM
 Handwritten Database
• 147 classes of English handwritten words
• 23171 words for training
• 3799 words for testing
                              IAM Database
GAT [15]                         87.68%
Proposed Methodology
(no slant correction)            89.19%
Proposed Methodology
(slant correction)               90.56%

[15] B. Gatos, I. Pratikakis, A.L. Kesidis and S.J. Perantonis, "Efficient Off-Line Cursive Handwritten Word
Recognition", 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR 2006), La Baule,
France, October 2006, pp. 121-125.
Word Spotting [14]

 Feature extraction technique described in [6]


 Create five lists Ri, i = 1, 2, 3, 4 and 5, each consisting of the Euclidean distances between the keyword and every word in the set of documents, using feature vectors from granularity levels Li, i = 1, 2, 3, 4 and 5 respectively

 Normalize all distances in every Ri to [0, 1] by dividing each one by the maximum distance in Ri

 Merge all five lists into a list Q. Every word in the documents' set is now represented in Q by five distances from the keyword. For each word we keep only the minimum distance and remove the others, resulting in Q′

 Sort Q′ in ascending order. Choose a threshold thr and remove all instances beyond the thr nearest. List Q′ now contains only the thr words nearest to the keyword, which is the result of the matching algorithm
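The list-merging steps above can be sketched as below. The distances are made-up numbers for two granularity levels instead of five; in the method they are Euclidean distances between DP feature vectors, and the function name is illustrative.

```python
def spot(distance_lists, thr):
    """distance_lists: one {word: distance} dict per granularity level
    (all dicts are assumed to share the same word keys)."""
    # Normalize each list to [0, 1] by its maximum distance.
    normed = []
    for d in distance_lists:
        m = max(d.values())
        normed.append({w: v / m for w, v in d.items()})
    # For every word keep the minimum of its normalized distances.
    merged = {w: min(d[w] for d in normed) for w in normed[0]}
    # Sort ascending and keep the thr nearest words.
    ranked = sorted(merged, key=merged.get)
    return ranked[:thr]

lists = [{'nicht': 0.2, 'Natur': 0.9, 'nacht': 0.4},
         {'nicht': 0.1, 'Natur': 0.8, 'nacht': 0.5}]
print(spot(lists, 2))  # ['nicht', 'nacht']
```

Taking the minimum across levels lets a word match the keyword at whichever granularity its writing style agrees best.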
Experimental Results

 Typewritten German Historical Collection (dataset 1) and George Washington's Handwritten Collection (dataset 2)

Keyword       Dataset  # of instances  Threshold  # of words  Recognition Rate
                       in dataset (Y)  (thr)      found (X)   (X / Y) * 100
Durchleucht   1        10              10         5           50%
nicht         1        21              21         20          95.23%
Natur         1        17              17         13          76.47%
appointments  2        10              10         8           80%
public        2        9               9          7           77.77%
government    2        8               8          4           50%
Historical Document Recognition

 Flowchart of the OCR methodology [16]

[16] G. Vamvakas, B. Gatos, N. Stamatopoulos, S.J. Perantonis, "A Complete Optical Character Recognition
Methodology for Historical Documents" (DAS’08), 8th IAPR International Workshop on Document Analysis
Systems, pp.525-532, Nara, Japan, September 2008.
Historical Document Recognition

 Tool for Database Creation


 k-Means Clustering
Historical Document Recognition

 ASCII code to each cluster

 Handling clustering errors


Historical Document Recognition

 Conversion of a historical handwritten document into ASCII


format.
Parameter Selection based on
Clustering for Character Segmentation

 A major difficulty in designing a document image segmentation methodology is the proper selection of values for all free parameters involved.
 Parameter Selection based on Clustering [17]

[17] G. Vamvakas, N. Stamatopoulos, B. Gatos, S. J. Perantonis, "Automatic Unsupervised Parameter Selection for Character Segmentation", accepted to appear in the 9th IAPR International Workshop on Document Analysis Systems (DAS 2010), Boston, USA.
Parameter Selection based on
Clustering for Character Segmentation

 Methodology
 k-Means clustering
 Given a parameter set S, in order to evaluate the performance of the clustering algorithm for every k between k1 and k2, the mean squared distance from the centroids (the within-clusters sum of squares) is calculated as follows:

      W(k) = Σ_{j=1}^{k} Σ_{xi ∈ Cj} || xi - cj ||²    (cj: centroid of cluster Cj)

 The value of W(k) is low when the partition is good, resulting in compact clusters.
 A quality measure of the segmentation result that corresponds to a parameter set S is given as a function of W(k).
 The optimal parameter set Sopt is defined as the one optimizing this quality measure over all candidate parameter sets.
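The within-clusters sum of squares described above can be sketched as follows, assuming numpy arrays for the points, cluster labels, and centroids (the toy data is illustrative):

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-clusters sum of squared distances to centroids, W(k)."""
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Two tight clusters: every point sits at squared distance 1 from its centroid.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
cents = np.array([[0.0, 1.0], [10.0, 1.0]])
print(wcss(pts, labels, cents))  # 4.0
```

Evaluating W(k) for each k between k1 and k2 then lets the method score how compact the clusters produced under a given parameter set S are.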


Experimental Results

 20 Historical German Documents

 Two Character Segmentation Algorithms

 Skeleton Segmentation Paths
• Two parameters: S = {MinCharWidth, MaxCharWidth}

 Run-Length Smoothing Algorithm (RLSA)
• One parameter: S = {a}, where a * LettH defines the threshold
Experimental Results

(Figure: selected parameters — Skeleton Segmentation Paths: MinCharWidth = 0.7, MaxCharWidth = 0.9; RLSA: a = 0.4)
Experimental Results

 Evaluation based on counting the number of matches between the


entities detected by the segmentation algorithm and the entities in
the ground truth
Publications

 Journals
[1] G. Vamvakas, B. Gatos, S. J. Perantonis, “Handwritten Character Recognition
through Two-Stage Foreground Sub-Sampling”, Pattern Recognition, accepted
for publication
[2] G. Vamvakas, B. Gatos, S. J. Perantonis, "Efficient Character/Word Recognition based on a Hierarchical Classification Scheme", International Journal on Document Analysis and Recognition (IJDAR), under review

 Conferences
[1] G. Vamvakas, N. Stamatopoulos, B. Gatos, S. J. Perantonis, "Automatic Unsupervised Parameter Selection for Character Segmentation", accepted to appear in the 9th IAPR International Workshop on Document Analysis Systems (DAS 2010)
[2] G. Vamvakas, B. Gatos, S. J. Perantonis, “A Novel Feature Extraction and
Classification Methodology for the Recognition of Historical Documents”, 10th
International Conference on Document Analysis and Recognition (ICDAR’09), pp
491-495, Barcelona, Spain, July 2009
Publications

 Conferences
[3] G. Vamvakas, B. Gatos, N. Stamatopoulos, S.J. Perantonis, "A Complete Optical
Character Recognition Methodology for Historical Documents," 8th IAPR
International Workshop on Document Analysis Systems (DAS’08), pp.525-532,
Nara, Japan, September 2008
[4] G. Vamvakas, B. Gatos, S. J. Perantonis, "Hierarchical Classification of
Handwritten Characters based on Novel Structural Features", 11th International
Conference on Frontiers in Handwriting Recognition (ICFHR'08), Montreal, Canada,
August 2008.
[5] G. Vamvakas, B. Gatos, S. Petridis, N. Stamatopoulos, "An Efficient Feature
Extraction and Dimensionality Reduction Scheme for Isolated Greek Handwritten
Character Recognition ", 9th International Conference on Document Analysis and
Recognition(ICDAR’07), vol.2, pp.1073-1077, 23-26 September 2007
[6] G. Vamvakas, N. Stamatopoulos, B. Gatos, I. Pratikakis, S. J. Perantonis, "Greek Handwritten Character Recognition", 11th Panhellenic Conference on Informatics (PCI'07), pp. 343-352, Patras, Greece, May 2007
[7] G. Vamvakas, B. Gatos, I. Pratikakis, N. Stamatopoulos, A. Roniotis, S.J.
Perantonis, "Hybrid Off-Line OCR for Isolated Handwritten Greek Characters", The
Fourth IASTED International Conference on Signal Processing, Pattern Recognition,
and Applications (SPPRA’07), pp. 197-202, Innsbruck, Austria, February 2007
