
Optical Handwritten Character Recognition

National Center for Scientific Research "Demokritos", Athens, Greece
Institute of Informatics and Telecommunications
Computational Intelligence Laboratory (CIL)

Giorgos Vamvakas
gbam@iit.demokritos.gr
Outline

 Handwritten OCR systems
 Greek Handwritten Character Recognition
 Novel Feature Extraction followed by a Hierarchical Classification Scheme for Handwritten Character Recognition
 Historical Character Recognition
 Unconstrained Word Recognition
 Word Spotting
 Parameter Selection based on Clustering for Character Segmentation
OCR Systems

 OCR systems consist of the following major stages:

• Pre-processing
• Segmentation
• Feature Extraction
• Classification
• Post-processing
Feature Extraction Methods

 In the feature extraction stage each character is represented as a feature vector, which becomes its identity. The major goal of feature extraction is to extract a set of features that maximizes the recognition rate with the least number of elements.

 Due to the high degree of variability and imprecision inherent in handwriting, obtaining these features is a difficult task. Feature extraction methods are based on two types of features:

• Statistical
• Structural
Statistical Features

 Representation of a character image by the statistical distribution of points accommodates style variations to some extent.

 The major statistical features used for character representation are:

• Zoning
• Projections and Profiles
• Crossings and Distances


Structural Features

 Characters can be represented by structural features with


high tolerance to distortions and style variations. This type of
representation may also encode some knowledge about the
structure of the object or may provide some knowledge as to
what sort of components make up that object.

 Structural features are based on topological and geometrical


properties of the character, such as aspect ratio, cross points,
loops, branch points, strokes and their directions, inflection
between two points, horizontal curves at top or bottom, etc.
Greek OCR

 The CIL Database was used


• 56 characters
• 625 variations of each character
• 35,000 isolated and labeled Greek handwritten characters

 10 pairs of classes were merged, due to the size normalization step, resulting in a database of 28,750 characters (46 classes).
Feature Extraction - [1]

 Two types of features:

• Features based on zones:
The character image is divided into horizontal and vertical zones and the density of character pixels is calculated for each zone.

• Features based on character projection profiles:
The centre of mass (xt, yt) of the image is first found.
Upper/lower profiles are computed by considering, for each image column, the distance between the horizontal line y = yt and the pixel closest to the upper/lower boundary of the character image. This results in two zones, above and below yt. Both zones are then divided into vertical blocks, and for each block we calculate the area of the upper/lower character profile.
Similarly, we extract the features based on left/right profiles.
[1] G. Vamvakas, B. Gatos, I. Pratikakis, N. Stamatopoulos, A. Roniotis, S.J. Perantonis, "Hybrid Off-Line OCR
for Isolated Handwritten Greek Characters", The Fourth IASTED International Conference on Signal Processing,
Pattern Recognition and Applications (SPPRA’07), pp. 197-202, Innsbruck, Austria, February 2007.
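As a rough illustration of the two feature types above, the sketch below computes zone densities and upper-profile block areas on a binary character image. The function names and the 4×4 zone grid / 4-block split are illustrative choices, not the paper's exact parameters, and the profile here is measured from the top image edge rather than from the centre-of-mass line y = yt, as a simplification.

```python
import numpy as np

def zone_densities(img, n_rows=4, n_cols=4):
    """Split a binary character image into n_rows x n_cols zones and
    return the foreground-pixel density of each zone."""
    h, w = img.shape
    feats = []
    for i in range(n_rows):
        for j in range(n_cols):
            zone = img[i * h // n_rows:(i + 1) * h // n_rows,
                       j * w // n_cols:(j + 1) * w // n_cols]
            feats.append(zone.mean() if zone.size else 0.0)
    return feats

def upper_profile_areas(img, n_blocks=4):
    """For each column, take the distance from the top edge to the first
    foreground pixel; group the distances into vertical blocks and sum
    each block to get its profile area."""
    h, w = img.shape
    dists = []
    for col in range(w):
        rows = np.flatnonzero(img[:, col])
        dists.append(rows[0] if rows.size else h)  # h if the column is empty
    dists = np.array(dists, dtype=float)
    return [dists[b * w // n_blocks:(b + 1) * w // n_blocks].sum()
            for b in range(n_blocks)]

char = np.zeros((16, 16), dtype=np.uint8)
char[4:12, 6:10] = 1  # a toy rectangular "stroke"
print(len(zone_densities(char)), len(upper_profile_areas(char)))  # 16 4
```

The lower, left and right profile features follow the same pattern with the scan direction changed.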
Feature Extraction - [2]

 Three types of features


 zones & projections
 distance features
 profile features
 Dimensionality Reduction
Linear Discriminant Analysis (LDA)

[2] G. Vamvakas, B. Gatos, S. Petridis and N. Stamatopoulos, ''An Efficient Feature Extraction and
Dimensionality Reduction Scheme for Isolated Greek Handwritten Character Recognition'', Proceedings of
the 9th International Conference on Document Analysis and Recognition, Curitiba, Brazil, 2007, pp. 1073-1077.
Experimental Results – Greek OCR

 Classifier: Support Vector Machines (SVM)

                             CIL Database
Feature Extraction [2]          92.05%
Feature Extraction [1]          91.61%
Kavallieratou et al. [3]        88.62%

[3] E. Kavallieratou, N. Fakotakis, G. Kokkinakis, "Handwritten Character Recognition Based on Structural Characteristics", 16th International Conference on Pattern Recognition (ICPR'02), Volume 3, 2002.
Feature Extraction
Hierarchical Classification (v.1)

 Recursive subdivision of the character image based on the introduced Division Point (DP) [4]
 DP = the pixel at the intersection of a horizontal and a vertical line at which the resulting sub-images at each iteration have balanced (approximately equal) numbers of foreground pixels

 At level L, the co-ordinates (xi, yi) of all DPs are stored as features

[4] G. Vamvakas, B. Gatos, S. J. Perantonis, "Hierarchical Classification of Handwritten Characters based on


Novel Structural Features" (ICFHR'08), 11th International Conference on Frontiers in Handwriting Recognition,
Montreal, Canada, August 2008.
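A minimal sketch of one DP computation, assuming a binary numpy image: the vertical line is placed where the cumulative column sums reach half the foreground mass, and likewise for the horizontal line. The function name and the cumulative-sum strategy are illustrative; the papers refine this with sub-pixel accuracy.

```python
import numpy as np

def division_point(img):
    """Return (x, y): a vertical line at x and a horizontal line at y that
    split the foreground into approximately equal halves along each axis."""
    col_cum = np.cumsum(img.sum(axis=0))  # foreground mass left of each x
    row_cum = np.cumsum(img.sum(axis=1))  # foreground mass above each y
    total = col_cum[-1]
    x = int(np.searchsorted(col_cum, total / 2))
    y = int(np.searchsorted(row_cum, total / 2))
    return x, y

img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 1  # a symmetric blob, so the DP lands near its centre
print(division_point(img))  # (3, 3)
```

Recursing into the four sub-images produced by each DP yields the next granularity level, so level L contributes the co-ordinates of all DPs found so far.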
Feature Extraction
Hierarchical Classification (v.1)

 Step 1: Starting from level 1 and gradually proceeding to higher levels of granularity, features are extracted for the training patterns, the confusion matrix is created using cross-validation and the overall recognition rate is calculated, until the recognition accuracy starts decreasing. The level at which the highest recognition rate (Max_RR) is achieved is considered the best performing granularity level (L0).
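The level-selection loop of Step 1 can be sketched as below. `cross_val_accuracy` is a stand-in for extracting level-L features, training, and cross-validating; the toy accuracy table is made up for illustration and is not from the paper.

```python
def choose_level(cross_val_accuracy, max_level=10):
    """Advance through granularity levels until accuracy starts decreasing;
    return the best level L0 and its recognition rate Max_RR."""
    best_level, best_acc = 1, cross_val_accuracy(1)
    for level in range(2, max_level + 1):
        acc = cross_val_accuracy(level)
        if acc < best_acc:       # accuracy starts decreasing -> stop
            break
        best_level, best_acc = level, acc
    return best_level, best_acc

# Toy cross-validation accuracies per level (illustrative values only).
toy = {1: 0.80, 2: 0.88, 3: 0.93, 4: 0.91}
L0, max_rr = choose_level(lambda l: toy.get(l, 0.0), max_level=4)
print(L0, max_rr)  # 3 0.93 -- level 3 gives the highest rate before the drop
```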
Feature Extraction
Hierarchical Classification (v.1)

 Step 2: At L0, where the maximum recognition rate is obtained, the corresponding confusion matrix is scanned and classes with high misclassification rates are merged

 Step 3: For each one of the groups of classes found another


classifier is trained with features extracted at level L0 + 1 of
the granularity procedure in order to distinguish them at a
later stage of the classification

 Step 4: Each pattern of the test set is then fed to the initial classifier with features extracted at level L0. If the classifier decides that this pattern belongs to one of the single classes, the unknown pattern is considered classified. Otherwise, if it is assigned to one of the groups of classes, the new classifier decides the recognition result
Feature Extraction
Hierarchical Classification (v.2)

 In order to improve precision, the DP is calculated with sub-pixel accuracy [5]

(Figure: example DP computation — the absolute difference of foreground pixels between sub-images is 4 without sub-pixel accuracy and 2 with it)

[5] G. Vamvakas, B. Gatos, S. J. Perantonis, “A Novel Feature Extraction and Classification Methodology for the
Recognition of Historical Documents”, 10th International Conference on Document Analysis and Recognition
(ICDAR’09), pp 491-495, Barcelona, Spain, July 2009
Feature Extraction
Hierarchical Classification (v.3)

 For each group of confused classes found in Step 2, do not use features from level L0 + 1 to distinguish them; instead, iterate Step 1 again [6]

(Figure: example subdivisions at Level = 2 and at Level = 4)

[6] G. Vamvakas, B. Gatos, S. J. Perantonis, "Handwritten Character Recognition through Two-Stage Foreground Sub-Sampling", Pattern Recognition, accepted for publication.
Experimental Results

 Databases
 CIL
 CEDAR
• 52 characters (classes) of isolated and labeled English
handwritten characters
• 19145 characters for training

• 2183 characters for testing

 MNIST
• 10 classes of isolated and labeled handwritten digits
• 60000 digits for training
• 10000 digits for testing
Experimental Results

 CIL

                                             CIL Database
[1] - zones & projections                       91.61%
[3] - Kavallieratou et al.                      88.62%
[2] - dimensionality reduction                  92.05%
[4] - DP (no sub-pixel)                         93.21%
[5] - DP (sub-pixel)                            93.65%
[6] - DP (sub-pixel and iteration of Step 1
      of the classification procedure)          95.63%
Experimental Results

 CEDAR
          CEDAR Character Database (52 Classes)
          Uppercase    Lowercase    Overall
          Characters   Characters   Recognition Rate
YAM [7]   NA           NA           75.70%
KIM [8]   NA           NA           73.25%
GAD [9]   79.23%       70.31%       74.77%
DP [6]    86.17%       84.05%       85.11%

[7] H. Yamada and Y. Nakano, "Cursive Handwritten Word Recognition Using Multiple Segmentation Determined by Contour Analysis", IEICE Transactions on Information and Systems, Vol. E79-D, pp. 464-470, 1996.
[8] F. Kimura, N. Kayahara, Y. Miyake and M. Shridhar, "Machine and Human Recognition of Segmented Characters from Handwritten Words", International Conference on Document Analysis and Recognition (ICDAR '97), Ulm, Germany, 1997, pp. 866-869.
[9] P. D. Gader, M. Mohamed and J-H. Chiang, "Handwritten Word Recognition with Character and Inter-Character Neural Networks", IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, Vol. 27, 1997, pp. 158-164.
Experimental Results

 CEDAR

                               CEDAR
          Uppercase Characters         Lowercase Characters
          (26 Classes)                 (26 Classes)
          # Train  # Test  Recogni-    # Train  # Test  Recogni-
          Patterns Patterns tion Rate  Patterns Patterns tion Rate
BLU [10]   7175     939    81.58%      18655    2240    71.52%
DP [6]    11454    1367    95.90%       7691     816    93.50%

[10] M. Blumenstein, X.Y. Liu, B. Verma, "A modified direction feature for cursive character recognition",
IEEE International Joint Conference on Neural Networks, Vol.4, pp. 2983 – 2987, 2007.
Experimental Results

 CEDAR – Merge uppercase and lowercase characters with


similar shapes
Experimental Results

 CEDAR – Merge uppercase and lowercase characters with


similar shapes
                               CEDAR
          Number of      Recognition  Number of        Recognition
          Classes        Rate         Classes          Rate
          (all classes)               (after merging)
SIN [11]  52             NA           36               67%
CAM [12]  52             83.74%       39               84.52%
DP [6]    52             85.11%       35               94.73%

[11] S. Singh and M. Hewitt, "Cursive Digit and Character Recognition on Cedar Database", International Conference on Pattern Recognition (ICPR 2000), Barcelona, Spain, 2000, pp. 569-572.
[12] F. Camastra and A. Vinciarelli, "Combining Neural Gas and Learning Vector Quantization for Cursive Character Recognition", Neurocomputing, Vol. 51, 2003, pp. 147-159.
Experimental Results

 MNIST

• Recognition Rate = 99.03%

• According to [13] the lowest recognition rate reported for the MNIST database is 88% and the highest is 99.61%, while the best available results vary between 98.5% and 99.5%.

[13] The MNIST Database, http://yann.lecun.com/exdb/mnist/


Application to Historical Character
Recognition

 POLYTIMO
 Handwritten Database (HW)
• 51 characters (classes) of Greek historical
handwritten characters
• 5407 characters for training
• 1351 characters for testing

 Typewritten Database (TW)


• 67 classes of Greek historical typewritten
characters
• 11173 characters for training
• 2793 characters for testing
Application to Historical Character
Recognition

 POLYTIMO

                              TW-Database   HW-Database
[1] - zones & projections        95.44%        94.62%
[5] - DP (sub-pixel)             97.71%        94.51%
[6] - DP (sub-pixel and
      iteration of Step 1
      of the classification
      procedure)                 98.24%        95.21%
Application to Historical Character
Recognition

 IMPACT
 Typewritten Database (TW-1)
• 53 characters (classes) of German historical typewritten characters
• 13181 characters for training
• 3246 characters for testing

 Recognition Rate = 99.53%


Unconstrained Word Recognition [14]

 Methodology described in [6] with two changes

 Slant Correction based on Entropy
• The dominant slant of the word is the angle whose slant-corrected version gives the minimum entropy of the vertical projection histogram. The vertical projection histogram is calculated for a range of angles ±R; in our case R = 60 seems to cover all writing styles. The slant of the word, am, is found from:

      am = argmin_{-R <= a <= R} H(a),   H = - Σ_{i=1}^{N} pi log pi

  where pi is the normalized value of the i-th bin of the vertical projection histogram of the word sheared by angle a

• The word is then corrected by am using the shear transform:

      x' = x - y tan(am),   y' = y
Unconstrained Word Recognition [14]

 In order for the features to be invariant to translation, the feature vector does not consist of the co-ordinates (xi, yi) of all the DPs at a level L, but of the pairs (xi - x0, yi - y0), where (xi, yi) are the co-ordinates of DPi at L and (x0, y0) are the co-ordinates of the initial DP (at level L = 0) of the word image

[14] G. Vamvakas, B. Gatos and S.J.Perantonis, “Efficient Character/Word Recognition based on a Hierarchical
Classification Scheme” , International Journal on Document Analysis and Recognition (IJDAR) , under review.
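The translation-invariant encoding above amounts to one subtraction per DP; a minimal sketch (function name illustrative):

```python
def translation_invariant_features(dps, dp0):
    """dps: list of (xi, yi) DP co-ordinates at level L;
    dp0: (x0, y0), the initial DP at level L = 0."""
    x0, y0 = dp0
    return [(x - x0, y - y0) for x, y in dps]

print(translation_invariant_features([(5, 7), (9, 3)], (4, 4)))  # [(1, 3), (5, -1)]
```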
Experimental Results

 IAM
 Handwritten Database
• 147 classes of English handwritten words
• 23171 words for training
• 3799 words for testing
                              IAM Database
GAT [15]                         87.68%
Proposed Methodology
(no slant correction)            89.19%
Proposed Methodology
(slant correction)               90.56%

[15] B. Gatos, I. Pratikakis, A.L. Kesidis and S.J. Perantonis, "Efficient Off-Line Cursive Handwritten Word
Recognition", 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR 2006), La Baule,
France, October 2006, pp. 121-125.
Word Spotting [14]

 Feature extraction technique described in [6]


 Create five lists Ri, i = 1, 2, 3, 4 and 5, each consisting of the Euclidean distances between the keyword and every word in the set of documents, using feature vectors from granularity levels Li, i = 1, 2, 3, 4 and 5 respectively

 Normalize all distances in every Ri to [0, 1] by dividing each one by the maximum distance in Ri

 Merge all five lists into a list Q. Every word in the documents' set is now represented in Q by five distances from the keyword. For each word we keep only the minimum distance and remove the others, resulting in Q′

 Sort Q′ in ascending order. Choose a threshold thr and remove all instances beyond the thr nearest. List Q′ now contains only the thr words nearest to the keyword, which is the result of the matching algorithm
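The list-merging steps above can be sketched as below. The distances are made-up numbers for two granularity levels instead of five; in the method they are Euclidean distances between DP feature vectors, and the function name is illustrative.

```python
def spot(distance_lists, thr):
    """distance_lists: one {word: distance} dict per granularity level
    (all dicts are assumed to share the same word keys)."""
    # Normalize each list to [0, 1] by its maximum distance.
    normed = []
    for d in distance_lists:
        m = max(d.values())
        normed.append({w: v / m for w, v in d.items()})
    # For every word keep the minimum of its normalized distances.
    merged = {w: min(d[w] for d in normed) for w in normed[0]}
    # Sort ascending and keep the thr nearest words.
    ranked = sorted(merged, key=merged.get)
    return ranked[:thr]

lists = [{'nicht': 0.2, 'Natur': 0.9, 'nacht': 0.4},
         {'nicht': 0.1, 'Natur': 0.8, 'nacht': 0.5}]
print(spot(lists, 2))  # ['nicht', 'nacht']
```

Taking the minimum across levels lets a word match the keyword at whichever granularity its writing style agrees best.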
Experimental Results

 Typewritten German Historical Collection (dataset 1) and George Washington's Handwritten Collection (dataset 2)

Keyword       Dataset  # of instances  Threshold  # of words  Recognition Rate
                       in dataset (Y)  (thr)      found (X)   (X / Y) * 100
Durchleucht   1        10              10         5           50%
nicht         1        21              21         20          95.23%
Natur         1        17              17         13          76.47%
appointments  2        10              10         8           80%
public        2        9               9          7           77.77%
government    2        8               8          4           50%
Historical Document Recognition

 Flowchart of the OCR methodology [16]

[16] G. Vamvakas, B. Gatos, N. Stamatopoulos, S.J. Perantonis, "A Complete Optical Character Recognition
Methodology for Historical Documents" (DAS’08), 8th IAPR International Workshop on Document Analysis
Systems, pp.525-532, Nara, Japan, September 2008.
Historical Document Recognition

 Tool for Database Creation


 k-Means Clustering
Historical Document Recognition

 ASCII code to each cluster

 Handling clustering errors


Historical Document Recognition

 Conversion of a historical handwritten document into ASCII


format.
Parameter Selection based on
Clustering for Character Segmentation

 A major difficulty in designing a document image segmentation methodology is the proper selection of values for all free parameters involved.
 Parameter Selection based on Clustering [17]

[17] G. Vamvakas, N. Stamatopoulos, B. Gatos, S. J. Perantonis, "Automatic Unsupervised Parameter Selection for Character Segmentation", accepted to appear in the 9th IAPR International Workshop on Document Analysis Systems (DAS 2010), Boston, USA.
Parameter Selection based on
Clustering for Character Segmentation

 Methodology
 k-Means clustering
 Given a parameter set S, in order to evaluate the performance of the clustering algorithm for every k between k1 and k2, the mean squared distance from the centroids (the within-clusters sum of squares) is calculated as follows:

      W(k) = Σ_{j=1}^{k} Σ_{xi ∈ Cj} || xi - cj ||²    (cj: centroid of cluster Cj)

 The value of W(k) is low when the partition is good, resulting in compact clusters.
 A quality measure of the segmentation result that corresponds to a parameter set S is given as a function of W(k).
 The optimal parameter set Sopt is defined as the one optimizing this quality measure over all candidate parameter sets.
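The within-clusters sum of squares described above can be sketched as follows, assuming numpy arrays for the points, cluster labels, and centroids (the toy data is illustrative):

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-clusters sum of squared distances to centroids, W(k)."""
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Two tight clusters: every point sits at squared distance 1 from its centroid.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
cents = np.array([[0.0, 1.0], [10.0, 1.0]])
print(wcss(pts, labels, cents))  # 4.0
```

Evaluating W(k) for each k between k1 and k2 then lets the method score how compact the clusters produced under a given parameter set S are.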


Experimental Results

 20 Historical German Documents

 Two Character Segmentation Algorithms

 Skeleton Segmentation Paths
• Two parameters: S = {MinCharWidth, MaxCharWidth}

 Run-Length Smoothing Algorithm (RLSA)
• One parameter: S = {a}, where a * LettH defines the threshold
Experimental Results

(Figure: selected parameters — Skeleton Segmentation Paths: MinCharWidth = 0.7, MaxCharWidth = 0.9; RLSA: a = 0.4)
Experimental Results

 Evaluation based on counting the number of matches between the


entities detected by the segmentation algorithm and the entities in
the ground truth
Publications

 Journals
[1] G. Vamvakas, B. Gatos, S. J. Perantonis, “Handwritten Character Recognition
through Two-Stage Foreground Sub-Sampling”, Pattern Recognition, accepted
for publication
[2] G. Vamvakas, B. Gatos, S. J. Perantonis, "Efficient Character/Word Recognition based on a Hierarchical Classification Scheme", International Journal on Document Analysis and Recognition (IJDAR), under review

 Conferences
[1] G. Vamvakas, N. Stamatopoulos, B. Gatos, S. J. Perantonis, "Automatic Unsupervised Parameter Selection for Character Segmentation", accepted to appear in the 9th IAPR International Workshop on Document Analysis Systems (DAS 2010)
[2] G. Vamvakas, B. Gatos, S. J. Perantonis, “A Novel Feature Extraction and
Classification Methodology for the Recognition of Historical Documents”, 10th
International Conference on Document Analysis and Recognition (ICDAR’09), pp
491-495, Barcelona, Spain, July 2009
Publications

 Conferences
[3] G. Vamvakas, B. Gatos, N. Stamatopoulos, S.J. Perantonis, "A Complete Optical
Character Recognition Methodology for Historical Documents," 8th IAPR
International Workshop on Document Analysis Systems (DAS’08), pp.525-532,
Nara, Japan, September 2008
[4] G. Vamvakas, B. Gatos, S. J. Perantonis, "Hierarchical Classification of
Handwritten Characters based on Novel Structural Features", 11th International
Conference on Frontiers in Handwriting Recognition (ICFHR'08), Montreal, Canada,
August 2008.
[5] G. Vamvakas, B. Gatos, S. Petridis, N. Stamatopoulos, "An Efficient Feature
Extraction and Dimensionality Reduction Scheme for Isolated Greek Handwritten
Character Recognition ", 9th International Conference on Document Analysis and
Recognition(ICDAR’07), vol.2, pp.1073-1077, 23-26 September 2007
[6] G. Vamvakas, N. Stamatopoulos, B. Gatos, I. Pratikakis, S. J. Perantonis, "Greek Handwritten Character Recognition", 11th Panhellenic Conference on Informatics (PCI'07), pp. 343-352, Patras, Greece, May 2007
[7] G. Vamvakas, B. Gatos, I. Pratikakis, N. Stamatopoulos, A. Roniotis, S.J.
Perantonis, "Hybrid Off-Line OCR for Isolated Handwritten Greek Characters", The
Fourth IASTED International Conference on Signal Processing, Pattern Recognition,
and Applications (SPPRA’07), pp. 197-202, Innsbruck, Austria, February 2007
