Telugu Handwritten Character Recognition using Zoning Features

Panyam Narahari Sastry
Associate Professor
Department of ECE
CBIT
Hyderabad
ananditahari@yahoo.com

T.R.Vijaya Lakshmi
Assistant Professor
Department of ECE
MGIT
Hyderabad

N.V.Koteswara Rao
Professor and Head
Department of ECE
CBIT
Hyderabad

T.V.Rajinikanth
Professor
Department of CSE
SNIST
Hyderabad

Abdul Wahab
Research student
Department of ECE
CBIT
Hyderabad

Abstract: Character recognition is one of the oldest applications of pattern
recognition. Recognizing Hand-Written Characters (HWC) is an effortless task
for humans, but a difficult one for a computer. Research in character
recognition is popular because of its many potential applications, such as in
banks, post offices, defense organizations, reading aids for the blind, library
automation, language processing and multimedia design. Optical Character
Recognition (OCR) is based on an optical mechanism in which a machine
automatically recognizes scanned and digitized characters. Automatic
recognition of handwritten text can be done either Offline or Online. Offline
handwritten recognition is the task of recognizing the image of a handwritten
text, in contrast to Online recognition, where the dynamic characteristics of
the writing are available and recorded while the writer is writing on a special
screen with a pen/stylus made for this application. Zonal based feature
extraction is used in the proposed method. The character image is divided into
a predefined number of zones and a statistical feature is computed from each of
these zones. Usually, this feature is based on the pixels contained in that
zone. The gray values of the pixels in the selected zone are summed up to form
the feature for that zone. The features of all the zones in the image form a
feature vector, which is used for handwritten character recognition. Using this
Zoning method, the recognition accuracy is found to be 78%.
Index Terms: Hand Written Character Recognition, Zonal feature extraction,
Nearest Neighborhood Classifier, Pattern Recognition.

I. INTRODUCTION
Optical Character Recognition is based on an optical mechanism in which a
machine recognizes scanned and digitized character images. Character
recognition (CR) is one of the oldest applications of pattern recognition [1].
Computer technology can also store and process image documents in multimedia
systems. Recognition of handwritten characters is an effortless task for
humans, but a difficult one for a computer. Hand Written Character Recognition
(HWCR) is the process of classifying written characters into the appropriate
class, based on the features extracted from each character image [2]. In the
character recognition area, new methodologies are required for the increasing
needs of newly emerging areas, such as the development of electronic libraries,
multimedia databases and systems which require handwriting data entry. The
intensive research effort in the field of character recognition is not only
because of the challenge of simulating human reading, but also because it
provides efficient applications, such as the automatic processing of bulk
amounts of paper and the transfer of data into machines. HWCR can be performed
either Online or Offline. The Offline and Online character recognition
techniques have different approaches, but they also share many common problems
and solutions. Offline HWCR is relatively more complex and requires more
research compared to Online and machine printed recognition [3].
Handwriting recognition includes the following components of recognition
algorithms: preprocessing, representation, stroke or character segmentation,
feature extraction, recognizers and post-processing steps. Some approaches use
only a subset of these elements. First, an image is cleaned with image
processing techniques. It may be converted to a more concise representation,
and then features are extracted from words or characters. These features may be
precomputed for use in segmentation, computed on individual letters after
segmentation, or both [4], [5]. With the features as input, a recognizer
returns the identified text string. Various classifiers are available, such as
the Nearest Neighborhood Classifier (NNC) based on Euclidean distance, the
support vector machine (SVM) and the Hidden Markov Model (HMM). In the present
work we have used the NNC, i.e., Euclidean distance, for identifying and
classifying Telugu handwritten characters.
A. Difficulties in Indian/Telugu character recognition
India is a multilingual country with many languages, a large number of which
are regional, which clearly shows the need for multilingual and multiscript
recognition systems. Also, the majority of documents in India contain text in
more than one script. Telugu has a large number of speakers worldwide, which
indicates the need for HWCR for Telugu. There are 18 vowels and 36 consonants
in the Telugu language. Telugu words appear on coins that date to 400 BC [1].
The first writing in Telugu, made in 575 AD, was probably by the Renati Cholas,
who started writing royal proclamations in Telugu. Telugu emerged as a poetic
and literary language during the 11th century. Until the 20th century, Telugu
was written in an old style, different from the everyday spoken style. During
the 20th century, a new writing standard similar to the spoken style emerged.

978-1-4799-6541-0/14/$31.00 ©2014 IEEE

Telugu, being the local language, is known to most of the rural population.
Also, there are many documents containing English and Telugu characters
together. The large number of influencing factors in HWCR relates directly to
differences in the conditions for experimentation. One of the main differences
is the type of handwriting database used for experimentation. In some cases,
researchers have constrained their experiments heavily, using only one person's
handwriting, while other researchers' experiments were not performed on
benchmark databases. The greatest difficulty encountered in handwriting
recognition lies in the freedom the user takes when writing. Irregular
handwriting aggravates ambiguities and makes it harder to group symbols and to
distinguish the relations among them. It results in layout problems affecting
the recognition of the whole expression. One reason for this is inexperienced
users, who normally take excessive freedom with the location and alignment of
handwritten symbols. Other kinds of irregular writing arise during the
correction, deletion and insertion of symbols. The challenges faced in
preprocessing concern the choice of whether to convert raw handwriting into a
more efficient form, i.e., whether to binarize the handwriting or keep it in
grey-scale form. Another issue is whether the handwriting should be thinned or
should remain as it is to preserve the features. Feature extraction further
poses the problem of choosing the right features to extract and the right
technique to perform the task. For example, researchers may choose between
extracting a single feature, such as the entire contour of a character, and
extracting many features, such as end-points, loops and holes.
Finally, the task of finding a suitable classification technique has been
exhaustively pursued. However, the variability of handwriting and the lack of
reliable feature extraction and preprocessing techniques have again impaired
many unconstrained approaches. For most of the aforementioned problems,
including feature extraction and classification, there are additional
complexities associated with preprocessing steps such as noise removal, skew
correction and segmentation.
II. RELATED WORK
B. B. Chaudhuri and U. Bhattacharya [6] reported that a major obstacle to
research on handwritten character recognition of Indian scripts is the
non-existence of standard / benchmark databases. They further state that
previous studies were reported on the basis of small databases collected in a
laboratory environment. The characters were normalized to a size of 128×128
after the minimum bounding rectangle concept was applied to the binary images
of Bangla and English handwritten numerals. They mixed 7000 Bangla and 14000
English handwritten numerals at various resolutions, namely 16×16, 32×32 and
64×64, and reported an accuracy of 98.64% on these input samples.
Cheng-Lin Liu and others [7] presented a paper on handwritten digit
recognition, tested on the CENPARMI, CEDAR and MNIST databases. The feature
vector included the chain code feature, gradient feature, profile structure
feature and peripheral direction contributivity. Various classifiers such as
the k-nearest neighbor classifier, the support vector classifier (SVC) and the
Discriminative Learning Quadratic Discriminant Function (DLQDF) were used. It
is reported that the SVC has the highest accuracy in most cases but is highly
expensive in computation and storage. Among the non-SVC classifiers, DLQDF is
found to be the best among the results previously reported on the same
database.
Chaudhuri, Pal and Sinha [8] proposed separating the scripts of several
languages using peak and valley positions. This helps in feeding the separated
text of each language to its respective OCR. The feature selection was based on
the water reservoir principle, contour tracing, profiles, etc. Automatic
recognition of text lines of different Indian scripts was possible with an
overall accuracy of about 97.52%. This scheme does not depend on the size of
the characters in the text line.
III. METHODOLOGY
In the Zoning feature extraction method, a character is usually divided into
zones of predefined size. Whenever a document is considered for recognition,
innumerable factors are involved. In Zoning, any basic Telugu character image
is chosen and binarized. Binarization, i.e., the separation of the background
and foreground of a scanned image, plays an important role in document
processing, since it affects both the segmentation of characters and the
recognition accuracy. The most popular technique for binarization is
thresholding; in this work a threshold of 0.7 is used. Individual characters
can be of different sizes and are hence normalized to images of a fixed size.
Although sizes vary for different characters, a frame of 50×50 pixels is
appropriate as the normalized size for all training and testing images to
contain distinct features. Further, the image is divided into 100 zones of size
5×5. The resulting partitions allow us to determine specific features of the
pattern to be recognized. In any zone, the sum of all the pixel values is
obtained, which forms the feature of that zone. Similarly, the sum for every
other zone in the selected image is computed. After computing this feature for
all the zones, the values of the different zones are concatenated one below the
other. This column vector has 100 rows and becomes the feature vector for the
image.
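The zoning step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `zoning_features` is hypothetical, the input is assumed to be a grayscale image already normalized to 50×50 with values in [0, 1], pixels above the threshold are assumed to map to 1, and the zones are assumed to be read in row-major order (the paper does not state the polarity or the ordering).

```python
import numpy as np

def zoning_features(image, threshold=0.7, zone=5):
    """Return the column vector of zone sums for a grayscale image in [0, 1]."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    if rows % zone or cols % zone:
        raise ValueError("image sides must be multiples of the zone size")
    # Binarize: pixels above the threshold become 1, the rest 0 (assumed polarity).
    binary = (img > threshold).astype(np.int64)
    # View the image as a grid of zone x zone blocks.
    blocks = binary.reshape(rows // zone, zone, cols // zone, zone)
    # Sum the pixels inside each block, giving one value per zone.
    sums = blocks.sum(axis=(1, 3))
    # Stack the zone sums one below the other (row-major order, an assumption).
    return sums.reshape(-1, 1)

rng = np.random.default_rng(0)
sample = rng.random((50, 50))          # stand-in for a normalized character image
features = zoning_features(sample)
print(features.shape)  # (100, 1)
```

With a 50×50 input and 5×5 zones, each of the 100 entries lies between 0 and 25, since a binarized zone contains at most 25 foreground pixels.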
The following is the step-by-step algorithm for zonal feature extraction
implemented on basic and isolated handwritten characters.
1) Load images of size 50×50 pixels.
2) Read all the images.
3) Convert the images into binary type using a threshold value of 0.7.
4) Divide each image into 100 zones of size 5×5.
5) The feature of each zone is the sum of all pixel intensities in the zone.
6) Reshape these 100 features into a column matrix of size 100×1.
7) Repeat the above procedure for all the training/database and test images.
8) Find the Euclidean distance between the column matrix of the test image and
that of each training image.
9) Select the database image which has the minimum distance from the test
image.
10) Display the test image and its corresponding matched database image.
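Steps 8 and 9 amount to a nearest-neighbor search under Euclidean distance. The following is a minimal sketch under stated assumptions: the training features are assumed to be stored one image per column (as in a 100×N matrix), the function name `nearest_neighbor` and the class labels are hypothetical, and random integers stand in for real zone sums.

```python
import numpy as np

def nearest_neighbor(train_features, train_labels, test_vector):
    """Return (label, index, distance) of the training column closest to test_vector."""
    train = np.asarray(train_features, dtype=float)        # shape (n_features, n_train)
    test = np.asarray(test_vector, dtype=float).reshape(-1, 1)
    # Euclidean distance from the test vector to every training column.
    distances = np.sqrt(((train - test) ** 2).sum(axis=0))
    best = int(np.argmin(distances))                        # index of the minimum distance
    return train_labels[best], best, float(distances[best])

# Toy data standing in for real zone features: 100 features, 6 training images.
rng = np.random.default_rng(1)
train_features = rng.integers(0, 26, size=(100, 6))
train_labels = ["a", "a", "b", "b", "c", "c"]               # placeholder class names
# A slightly perturbed copy of training column 3 plays the role of a test image.
test_vector = train_features[:, 3] + rng.integers(0, 2, size=100)

label, index, distance = nearest_neighbor(train_features, train_labels, test_vector)
print(label, index, round(distance, 2))
```

Because every training column is compared with the test vector, the cost of one query grows linearly with the number of training images.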
A. Mathematical model of zoning
In the Zoning method, the database images and test images are first normalized
to 50×50 pixels. Each image is divided into 100 zones, the size of each zone
being 5×5.
To find the feature vector of an image, all the pixel intensities in each zone
are added. The feature vector obtained for each image is therefore a column
vector of size 100×1. This procedure is applied to both training and testing
images.
Let f(x, y) be a digital image and I(x, y) be the pixel intensity at position
(x, y).
Then
$$f(x, y) = I(x, y) \tag{1}$$
Equation (1) represents the gray level intensity values of the pixels in the
given image f(x, y). The sum of the pixel intensities I(x, y) in a zone is
calculated by
$$V_i = \sum_{x=1}^{5} \sum_{y=1}^{5} I_i(x, y) \tag{2}$$
where $1 \le i \le 100$.
Equation (2) describes $V_i$ as the summation of all the pixel gray level
intensity values of the $i$th zone (a submatrix of size 5×5) in the image.
The feature set of an image is
$$F = [V_i] \tag{3}$$
In equation (3), F represents the set of features for any given image.
Similarly, the feature sets of all the training / testing images are found. The
number of images used in this work is 18,750 for training and 500 for testing.
Hence the training set is a matrix of size 100×18750, since there are 100
features for each image.
IV. RESULTS AND DISCUSSIONS
To identify an image distinctly, feature extraction is a very important step.
There are many methods to extract the features of an image. In this work, an
image is divided into a number of zones and, for each zone, the pixel gray
level intensities are summed up to form the feature for that zone. The features
of all the zones are concatenated to form a column vector. The Nearest
Neighborhood Classifier (NNC) is then used for identifying and classifying the
characters, and the results are compared with the published method. As
discussed in the earlier section, the step-by-step algorithm is applied to the
database images and to the 500 test images.

TABLE I
COMPARISON OF ZONAL FEATURE EXTRACTION WITH 2-D FFT

Method                 Training samples      Testing samples   Classifier   Recognition accuracy
Published method [9]   18,750 (375/class)    500 (10/class)    NNC          63%
Proposed method        18,750 (375/class)    500 (10/class)    NNC          78%

In Table I, the Zoning method results are compared with the published method
[9]. The features for Telugu characters were extracted using 2-D FFT [9],
followed by the Nearest Neighborhood Classifier for classification. The
recognition accuracy was reported as 63% for 375 training samples/class. With
the proposed Zoning method, the recognition accuracy increased from 63% to 78%,
which is a considerable improvement. Fig. 1 shows correctly matched test images
together with their database images. There are three sets of images in the
figure. In each set, the left image is the test image, whereas the right image
is the database image to which the test image is matched using the proposed
Zoning algorithm. It is clear from the displayed results that all three test
images shown in Fig. 1 have matched correctly. Fig. 2 displays the results of
test images matched correctly to the group.

Fig. 1. Correctly matched test images to database images
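For reference, recognition accuracy as used in Table I can be read as the percentage of test images assigned to the correct class. The sketch below assumes that reading. The function name `recognition_accuracy` and the labels are hypothetical, and the count of 390 correct matches is derived from 78% of 500 rather than stated in the paper. The 50 classes follow from the 500 test samples at 10 per class.

```python
def recognition_accuracy(predicted, actual):
    """Percentage of test samples whose predicted class equals the true class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical labels: 500 test samples over 50 classes (10 per class).
actual = ["class_%d" % (i % 50) for i in range(500)]
predicted = list(actual)
# Mark the last 110 samples as misclassified, leaving 390 correct matches.
for i in range(390, 500):
    predicted[i] = "wrong"

print(recognition_accuracy(predicted, actual))  # 78.0
```

Under the same reading, the published method's 63% would correspond to 315 correct matches out of 500.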

Fig. 2. Correctly matched to the group

In Telugu, many characters are easily confused with one another, and all the
Telugu characters can be grouped into six groups, as published in the
literature [10], [11].
The first group comprises 11 confusing characters, whereas the second group
consists of 12 confusing characters. The other sets of confusion characters are
shown in Fig. 3 [10], [11]. Fig. 4 shows the results of wrongly matched test
characters using the proposed algorithm. The recognition accuracy of the
published and proposed methods is compared graphically as a bar graph in
Fig. 5.
Fig. 3. Set of confusion characters

Fig. 4. Wrongly matched test images to the database images

Fig. 5. Comparison of published and proposed methods

V. CONCLUSIONS AND FUTURE SCOPE
1) HWCR has less than 60% recognition accuracy, as reported in the literature.
2) HWCR is at a nascent stage for Indian languages due to the smaller number of
speakers, composite characters compared to English, a larger number of
modifiers and non-uniform spacing of characters within a word.
3) There is no standard database for any of the Indian languages for validating
the results, as reported in the literature. Hence a database/knowledge base of
18,750 samples was developed. In addition, 500 sample images for testing the
algorithm were developed.
4) The recognition accuracy obtained using the proposed Zoning method is 78%.
5) Some Telugu characters, such as Va, Ma, Ya, Pa and Na, are very similar to
each other, hence their very low rate of recognition.
6) HWCR is performed only on basic Telugu characters and can hence be extended
to samyukt akshars (combinations of 2 or more basic characters).
7) HWCR of multilingual documents can be performed.

ACKNOWLEDGMENT
This work was done as part of the AICTE project titled "Design and Development
of Palm Leaf Character Recognition System" under the RPS (Research Promotion
Scheme). The authors thank the funding agency, AICTE, New Delhi, India, for its
support in carrying out this work. Further, the authors express their sincere
thanks to TEQIP-II for sponsoring the presentation and publication of this work
at the IEEE International Conference on IT Convergence and Security 2014
(ICITCS 2014).

REFERENCES

[1] P. Sastry and R. Krishnan, "Isolated Telugu palm leaf character recognition using Radon transform, a novel approach," in World Congress on Information and Communication Technologies (WICT), 2012, pp. 795-802.
[2] P. N. Sastry, R. Krishnan, and B. V. S. Ram, "Classification and identification of Telugu handwritten characters extracted from palm leaves using decision tree approach," J. Applied Engn. Sci, vol. 5, no. 3, pp. 22-32, 2010.
[3] H. Swethalakshmi, A. Jayaraman, V. S. Chakravarthy, C. C. Sekhar et al., "Online handwritten character recognition of Devanagari and Telugu characters using support vector machines," 2006.
[4] U. Pal and B. Chaudhuri, "Indian script character recognition: a survey," Pattern Recognition, vol. 37, no. 9, pp. 1887-1899, 2004.
[5] U. Pal, R. Jayadevan, and N. Sharma, "Handwriting recognition in Indian regional scripts: A survey of offline techniques," ACM Transactions on Asian Language Information Processing (TALIP), vol. 11, no. 1, pp. 1:1-1:35, 2012.
[6] U. Bhattacharya and B. B. Chaudhuri, "Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 444-457, 2009.
[7] C.-L. Liu, K. Nakashima, H. Sako, and H. Fujisawa, "Handwritten digit recognition: benchmarking of state-of-the-art techniques," Pattern Recognition, vol. 36, no. 10, pp. 2271-2285, 2003.
[8] U. Pal, S. Sinha, and B. Chaudhuri, "Multi-script line identification from Indian documents," in 2013 12th International Conference on Document Analysis and Recognition, vol. 2. IEEE Computer Society, 2003, pp. 880-880.
[9] T. Lakshmi, P. N. Sastry, and T.V.Rajinikanth, "Palm leaf Telugu character recognition using Hough transform," in Proceedings of the 2nd International conference on Advanced Computing Methodologies, ser. ICACM '13, 2013, pp. 372-376.
[10] P. N. Sastry, R. Krishnan, and B. V. S. Ram, "Telugu character recognition on palm leaves - a three dimensional approach," Technology Spectrum, vol. 2, no. 3, pp. 19-26, 2008.
[11] P. N. Sastry, R. Krishnan, and T.V.Rajinikanth, "Palm leaf Telugu character recognition using Hough transform," in Proceedings of the International conference on Advanced Computing Methodologies, ser. ICACM '11, 2011, pp. 21-28.
