
Optical Character Recognition system for printed Telugu text

MTech Project Report

Submitted in partial fulfillment of the requirements for the degree of

Master of Technology
by
Udaya Kumar Ambati
Roll No : 09305073

under the guidance of


Prof. M. R. Bhujade

Department of Computer Science and Engineering


Indian Institute of Technology, Bombay
April 2010

Acknowledgements
I would sincerely like to thank my guide, Prof. M. R. Bhujade, for his motivating support throughout
the semester and for the consistent direction he has given to my work. I would also like to thank
each and every one who helped me throughout my work.

Abstract

Telugu is a language spoken by more than 66 million people of South India. Not much work
has been reported on the development of optical character recognition (OCR) systems for Telugu
text. Therefore, it is an area of current research. Some characters in Telugu are made up of
more than one connected symbol. Compound characters are written by associating modifiers
with consonants, resulting in a huge number of possible combinations, running into hundreds
of thousands. A compound character may contain one or more connected symbols. Therefore,
systems developed for documents of other scripts, like Roman, cannot be used directly for the
Telugu language.

This project aims at developing a complete Optical Character Recognition system for printed
Telugu text. The system segments the document image into lines and words. The features of
each character are extracted and passed to a Support Vector Machine, which classifies the
characters using a supervised learning algorithm.

Contents

1 Introduction
2 Structure of Telugu text and Segmentation issues [5]
    2.1 Characteristics of Telugu script
    2.2 Segmentation issues in OCR of Telugu script
3 Preprocessing phase
    3.1 Thresholding and noise removal
        3.1.1 The Algorithm
    3.2 Skew detection and correction
        3.2.1 Skew angle Detection
        3.2.2 Image rotation transformation
    3.3 Connected Components
    3.4 Line Segmentation
    3.5 Word Segmentation
    3.6 Feature Extraction
    3.7 Pattern classification [2]
        3.7.1 SVM Classifier [2]
4 Implementation
5 Results
6 Conclusion and Future work

List of Figures

2.1 Harshapriya and Godavari fonts
2.2 Vowels, their associated modifiers (Matras) and their phonetic English representation
2.3 Consonants and their associated modifiers (Matras) and their phonetic English representation
2.4 Various combinations forming compound characters
3.1 Original Text lines
3.2 Smoothed Text lines with Histogram
3.3 Highest peak and vertical line drawn at the middle of highest peak
3.4 Middle line detection for considering small length text
3.5 (a) Initial segmentation line through the white pixels of horizontal histogram (b) Result after considering only the candidate lines from original histogram
3.6 Output for word segmentation
5.1 Home page of the tool
5.2 Displaying the original image
5.3 Bounding Connected Components
5.4 Line Segmentation
5.5 Word Segmentation

Chapter 1

Introduction
During the past few decades, substantial research efforts have been devoted to optical character
recognition (OCR) [7, 6]. The objective of OCR is the automatic reading of optically sensed
document text materials to translate human-readable characters into machine-readable codes.
Research in OCR is popular for its various potential applications in banks, post offices and
defence organizations. Other applications involve reading aids for the blind, library automation,
language processing and multi-media design.

Commercial OCR packages are already available for languages like English. Considerable work
has also been done for languages like Japanese and Chinese [7]. Recently, work has been done
on the development of OCR systems for Indian languages. This includes work on the recognition of
Devanagari, Bengali, Kannada and Tamil characters.

The Indian subcontinent has more than 18 constitutionally recognized languages with several
scripts, but commercial products in Optical Character Recognition (OCR) are very few. Telugu
is one of the oldest and most popular languages of India. Historically, Telugu has evolved from
the ancient Brahmi script. It also used features of the Dravidian (Pali) language for script
generation. In the process of evolution, this script was carved with needles on palm leaves, and
so it favored rounded letter shapes. Work on Telugu character recognition is not substantial.

Motivation

In spite of Telugu being the third most widely used language in India, there are only a few OCR
systems for Telugu script. This motivated us to approach the problem. A further motivation for
developing a Telugu OCR is the digitization of thousands of printed books in Indian languages
by both the private and public sectors. For efficient access to these scanned documents, an OCR
system specific to printed Telugu text is urgently needed.

Scope of the report

The first section of the report explains the structure of Telugu characters and the associated
segmentation issues. The second section explains the algorithm used for noise removal and
binarization. The next section explains an efficient algorithm that segments the given scanned
document into lines and words. The last section explains the concept of Support Vector
Machines (SVM), the method of feature extraction for Telugu letters, and their classification
using SVM.

Most document analysis systems can be visualized as consisting of two steps: the pre-processor
and the recognizer. In preprocessing, the raw image obtained by scanning a page of text is
converted to a form acceptable to the recognizer by extracting individually recognizable characters.
In recognition, the pre-processed image of the character is processed to obtain meaningful elements,
called features; recognition is completed by searching a database of stored feature vectors of all
possible Telugu characters for the one that matches the feature vector of the character
to be recognized.

In Indian scripts, one or more vowel and consonant modifiers are attached to the consonant
forms in a variety of combinations forming compound characters. The total number of possible
compound characters is of the order of hundreds of thousands. Therefore, the question "What
constitutes a character?" assumes many new dimensions for Indian languages. Is a modifier an
independent character or not? Does being treated as an independent character depend on the
way it is written, i.e. whether it is written touching the character it is to modify or separated
from it? A more detailed discussion of these issues for Telugu script is provided in Chapter 2.

In this project, an approach has been presented for Telugu.

Chapter 2

Structure of Telugu text and Segmentation issues [5]

2.1 Characteristics of Telugu script

Telugu is a syllabic language: each written symbol corresponds to a syllable, which leaves little
scope for confusion and spelling problems. In that sense, it is a WYSIWYG (what you see is what
you get) script. This form of script is considered to be the most scientific by linguists. The
Telugu script consists of 18 vowels, 36 consonants and two dual symbols. Of the vowels, sixteen
are in common usage. Fig 2.1 lists some of the vowels in Harshapriya and Godavari fonts.

[5]
Figure 2.1: Harshapriya and Godavari fonts

[5]
Figure 2.2: Vowels, their associated modifiers (Matras) and their phonetic English representation

All vowels and consonants, along with their modifiers and phonetic equivalent symbols, are
listed in Fig 2.2 and Fig 2.3, respectively. Compound characters in Telugu follow some phonetic
sequences that can be represented in grammatical form, as shown in Fig 2.4. Base consonants
are vowel-suppressed consonants. These are typically used when words of other languages are
written in Telugu. The third combination, i.e. of a base consonant and a vowel, is an extremely
important and often used combination in Telugu script. As there are 38 (36 + 2 dual symbols)
base consonants and 16 vowels, logically, 38 × 16 = 608 combinations are possible.

The fourth to seventh combinations are categorized under conjunct formation. Telugu has a
special feature of providing a unique symbol of dependent form for each of the consonants. In
all conjunct formations, the first consonant appears in its actual form. The dependent vowel
sign and the second (third) consonant act as dependent consonants in the formation of the
complete character. The fourth to seventh combinations generate a large number of conjuncts
in Telugu script. The fourth combination logically generates 38 × 38 × 16 = 23,104 different
compound characters. This is an important combination. The fifth combination is similar to the
fourth combination. The second and the third consonants act as the dependent consonants.
Logically 746,496 different compound characters are possible in this combination, but their
frequency of appearance in the text is lower than that of the previous combination. In the sixth
and seventh combinations, 1,296 combinations and 46,656 combinations, respectively, are
logically possible.

The sixth and seventh combinations are used when words from other languages are written in
Telugu script. In these combinations, the vowel is omitted. The first consonant appears as a
base consonant and the other consonants act as dependent consonants.
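
The counts quoted above can be checked with a few lines of arithmetic. The sketch below is only a sanity check: the text states the factors for the third and fourth combinations explicitly, while the factors used here for the fifth to seventh combinations (powers of 36, the number of consonants) are an assumption inferred from the quoted totals.

```python
# Recompute the combination counts quoted in the text.
BASE = 38     # 36 consonants + 2 dual symbols (stated for the third and fourth combinations)
CONS = 36     # consonants only (assumed factor, inferred from the fifth to seventh totals)
VOWELS = 16   # vowels in common usage

counts = {
    "third": BASE * VOWELS,          # base consonant + vowel
    "fourth": BASE * BASE * VOWELS,  # two consonants + vowel
    "fifth": CONS ** 3 * VOWELS,     # three consonants + vowel
    "sixth": CONS ** 2,              # two consonants, vowel omitted
    "seventh": CONS ** 3,            # three consonants, vowel omitted
}

for name, value in counts.items():
    print(f"{name} combination: {value:,}")
```

Summing the five entries gives a total of the order of hundreds of thousands, in line with the figure mentioned in the abstract.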

[5]
Figure 2.3: Consonants and their associated modifiers (Matras) and their phonetic English
representation

[5]
Figure 2.4: Various combinations forming compound characters

2.2 Segmentation issues in OCR of Telugu script

A connected region in an image of Telugu text may be:


1. A part of a character or a compound character
2. A character
3. A compound character

This complicates the segmentation issues. The areas occupied by individual characters in a
line of text are not in a horizontal line, unlike in English text, and in some cases, the area of
a single complex character formation can be equal to the sum of the areas of two individual
characters. The segmentation algorithm has to take these factors into consideration. The basic
question to be answered in segmentation is: What are the symbols that will be isolated during
segmentation and provided to the recognizer for completing the OCR?

The first approach is to treat all types of conjuncts, together with the base consonants, as
units for the purpose of segmentation and further recognition. This is not preferable for a
number of reasons. The first reason is that the sheer number of possibilities has been shown
to be enormous. The second reason is that, in compound characters like KRAI, we have to
identify all the three parts, i.e. below and on the left, as being together in the same compound
character, although they are not connected in the image. This is, in general, difficult because the
association information is difficult to generate until the recognition process is at least partially
completed, and the reason we are segmenting is to perform this recognition process. This is a
catch-22 situation, and, therefore, treating all types of conjuncts together is not possible.

The second alternative is to attempt to isolate the base consonants, vowel modifiers,
etc. This is difficult and leads to unmanageable complications at the segmentation stage, where
the symbols are yet to be recognized. This is primarily because the symbols are full of curves
and their separation is not clear. However, this is a popular approach for Indian scripts like
Devanagari and Bangla [3].

Chapter 3

Preprocessing phase

3.1 Thresholding and noise removal

The task of thresholding is to extract the foreground from the background. Generally, an OCR
system expects text printed against a clean background. Usually a simple global binarization
technique is adopted, which does not handle well text printed against shaded or textured
backgrounds and/or embedded in images.

In this project, a simple yet effective algorithm is used for document image binarization
and cleanup. It is especially robust for extracting text from images.

There are basically two classes of binarization techniques: global and adaptive. Global methods
binarize the entire image using a single threshold. For example, a typical OCR system separates
text from background by global thresholding [12, 8]. A simple way to automatically select a
global threshold is to choose the value at the valley of the intensity histogram of the image,
assuming that there are two peaks in the histogram, one corresponding to the foreground and the
other to the background. Methods have also been proposed to facilitate more robust valley picking.

There are problems with the global thresholding paradigm. First, due to noise and poor
contrast, many documents do not have well differentiated foreground and background intensities.
Second, the bimodal histogram assumption is not always valid in the case of complicated
documents such as photographs and advertisements. Third, the foreground peak is often
overshadowed by other peaks, which makes the valley detection difficult or impossible. Some research
has been carried out to overcome these problems. For example, weighted histograms [1] are used
to balance the size difference between the foreground and background, and/or convert the
valley-finding into maximum peak detection. Minimum-error thresholding models the foreground and
background intensity distributions as Gaussian distributions, and the threshold is selected to
minimize the classification error. Otsu [9] models the intensity histogram as a probability
distribution, and the threshold is chosen to maximize the separability of the resultant background
and foreground classes. Similarly, entropy measures have been used to select the threshold which
maximizes the sum of background and foreground entropies.
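
To make the idea of a separability-maximizing global threshold concrete, here is a minimal sketch of Otsu-style threshold selection. It is an illustration only, not the binarization used in this project: the function name `otsu_threshold` and the convention that pixels with value at most `t` form the dark class are assumptions made for the example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    `pixels` is a flat sequence of integers in [0, levels).
    """
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    mean_all = sum(level * count for level, count in enumerate(hist)) / total

    best_t, best_var = 0, -1.0
    count_dark = 0   # number of pixels in the dark class so far
    sum_dark = 0     # sum of grey levels in the dark class so far
    for t in range(levels):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        if count_dark == 0 or count_dark == total:
            continue  # one class is empty, variance undefined
        w0 = count_dark / total
        m0 = sum_dark / total
        var_between = (mean_all * w0 - m0) ** 2 / (w0 * (1.0 - w0))
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t
```

For a clearly bimodal input, such as a dark cluster around 45 and a bright cluster around 200, the returned threshold falls between the two clusters.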

In contrast, adaptive algorithms compute a threshold for each pixel based on information
extracted from its neighborhood. For images in which the intensity ranges of foreground objects
and backgrounds entangle, different thresholds must be used for different regions.

3.1.1 The Algorithm

The algorithm proposed by Wu and Manmatha [11] works under the assumption that the text in the
input image, or in a region of the input image, has more or less the same intensity value. The
distinctive feature of this algorithm is that it works well even if the text is printed against a
shaded or hatched background.

The following are the steps in the algorithm:

1. Smooth the input text chip.
2. Compute the intensity histogram of the smoothed chip.
3. Smooth the histogram using a low-pass filter.
4. Pick a threshold at the first valley counted from the left side of the histogram.
5. Binarize the smoothed text chip using the threshold.

A low-pass Gaussian filter is used to smooth the text chip in step 1. The smoothing operation
affects the background more than the text because text is normally of lower frequency than
the shading. Thus it cleans up the background.

The histogram generated by step 2 is often jagged, hence it needs to be smoothed to allow
the valley to be detected. Again a Gaussian filter is used for this purpose.

Text is normally the darkest item in the detected chips. Therefore, a threshold is picked
at the first valley closest to the darkest side of the histogram. To extract text against darker
background, a threshold at the last valley is picked instead.
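
The five steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the Gaussian widths (`image_sigma`, `hist_sigma`), the edge-replicating border handling and the simple local-minimum test used for "first valley" are all choices made for the example, and dark text on a lighter background is assumed.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights with a radius of about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_1d(values, kernel):
    """Convolve with the kernel, replicating the border values."""
    r = len(kernel) // 2
    n = len(values)
    return [sum(kernel[k + r] * values[min(max(i + k, 0), n - 1)]
                for k in range(-r, r + 1))
            for i in range(n)]

def smooth_2d(image, kernel):
    """Separable Gaussian smoothing: rows first, then columns."""
    rows = [smooth_1d(row, kernel) for row in image]
    cols = [smooth_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def first_valley(hist):
    """Index of the first local minimum counted from the dark (left) side."""
    for t in range(1, len(hist) - 1):
        if hist[t] < hist[t - 1] and hist[t] <= hist[t + 1]:
            return t
    raise ValueError("histogram has no valley")

def binarize_chip(chip, image_sigma=1.0, hist_sigma=5.0):
    """Return a mask that is True where a pixel is classified as text."""
    smoothed = smooth_2d(chip, gaussian_kernel(image_sigma))        # step 1
    hist = [0] * 256                                                # step 2
    for row in smoothed:
        for value in row:
            hist[min(255, max(0, round(value)))] += 1
    hist = smooth_1d(hist, gaussian_kernel(hist_sigma))             # step 3
    threshold = first_valley(hist)                                  # step 4
    return [[value < threshold for value in row] for row in smoothed]  # step 5
```

For light text on a darker background, the same sketch would pick the last valley and keep the pixels above the threshold instead.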

3.2 Skew detection and correction

Skew estimation of a document refers to the process of finding the angle of inclination made
by the document with respect to the horizontal axis, which is often introduced during document
scanning. For any ensuing document image processing task (such as page layout analysis,
OCR, document retrieval etc.) to yield accurate results, the skew angle must be detected and
corrected beforehand. The algorithms for skew estimation can mainly be classified as those based
on (i) projection profile (PP), (ii) nearest neighbor (NN), (iii) Hough transform (HT) and
(iv) cross-correlation. We used a variation of the Hough transform method [4] to detect skew in
our project.

3.2.1 Skew angle Detection

The skew angle detection process used in this project can be divided into three steps:

1. Detection point determination
2. Coarse skew angle estimation
3. Hough transformation

First, the skewed image is divided vertically into several blocks, each block consisting of one
hundred rows. Then the locations of the detection points in each block are recorded to estimate
the coarse skew angle $\theta_e$. The coarse skew angle is estimated by selecting the angle
which is supported by the most detection points. Finally, the accurate skew angle is determined by
choosing the peak in the Hough plane within the small range $[\theta_e - 3^\circ, \theta_e + 3^\circ]$.
A detailed description of the three steps to detect the skew angle follows.

Step 1. Detection point (DP) determination. First of all, the input image is vertically
divided into several blocks. According to our empirical study, 100 rows are chosen as the size of
each block. A detection point is defined as the left-most black pixel in each block. Each divided
block is scanned from left to right and then from top to bottom to find the detection point. If
the scanned pixel is not a background pixel, it is declared as a detection point. Following the
above procedure, we can find all detection points embedded in the input image. These detection
points are then fed into Step 2 for the estimation of the coarse skew angle.

Step 2. Coarse skew angle estimation. In this step, the coarse skew angle $\theta_e$ is determined
by taking the majority of the local skew angles which are generated from the detection points.
Before the majority selection procedure, the local skew angle $\theta_i$ has to be calculated.
Consider two detection points $DP_{i-1}(x_{i-1}, y_{i-1})$ and $DP_i(x_i, y_i)$ in two consecutive
divided blocks $B_{i-1}$ and $B_i$. The local skew angle $\theta_i$ is defined as

$$\theta_i = \tan^{-1}\left(\frac{\Delta y_i}{\Delta x_i}\right) = \tan^{-1}\left(\frac{y_i - y_{i-1}}{x_i - x_{i-1}}\right) \tag{3.1}$$

Here, the value $\Delta y_i / \Delta x_i$ is adopted to represent the local skew angle $\theta_i$
to avoid the computational burden of the $\tan^{-1}$ function. The coarse skew angle $\theta_e$ is
then assigned as the majority of the local skew angles.

Step 3. Hough transformation. Following the previous two steps, the search range of the skew angle
in the Hough plane is reduced from $[-90^\circ, 90^\circ]$ to $[\theta_e - 3^\circ, \theta_e + 3^\circ]$.
Last, the left-most pixel $P_i(x_i, y_i)$ in each row of the $x$-$y$ plane is transformed to the
Hough plane by making use of the following equation:

$$\rho_i = x_i \cos\theta_i + y_i \sin\theta_i \tag{3.2}$$

where $\theta_i$ is located in the range $[\theta_e - 3^\circ, \theta_e + 3^\circ]$. The skew angle
of the input document can thereby be determined by selecting the angle with the largest value in
the transformed Hough plane.
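
The restricted Hough search can be sketched as below. This is a minimal illustration, not the project's code: it assumes the coarse angle is already available, uses a hypothetical angular step of 0.1 degrees, accumulates votes in unit-width cells of rounded $\rho$, and measures $\theta$ in image coordinates with the y axis pointing down, so the sign of the returned angle depends on that convention.

```python
import math
from collections import Counter

def leftmost_text_pixels(binary):
    """(x, y) of the left-most text pixel in every row that contains text."""
    points = []
    for y, row in enumerate(binary):
        for x, is_text in enumerate(row):
            if is_text:
                points.append((x, y))
                break
    return points

def refine_skew(binary, coarse_deg, half_window=3.0, step=0.1):
    """Search [coarse - half_window, coarse + half_window] for the Hough peak.

    For every candidate angle, each left-most pixel votes for the cell of
    its rounded rho = x cos(theta) + y sin(theta); the angle whose fullest
    cell collects the most votes is returned (in degrees).
    """
    points = leftmost_text_pixels(binary)
    best_angle, best_votes = coarse_deg, 0
    n_steps = int(round(2 * half_window / step))
    for s in range(n_steps + 1):
        theta_deg = coarse_deg - half_window + s * step
        theta = math.radians(theta_deg)
        cells = Counter(round(x * math.cos(theta) + y * math.sin(theta))
                        for x, y in points)
        votes = max(cells.values(), default=0)
        if votes > best_votes:
            best_angle, best_votes = theta_deg, votes
    return best_angle
```

Restricting the search to a window of six degrees keeps the accumulator small: with a 0.1 degree step only 61 angles are evaluated instead of 1801 over the full range.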

3.2.2 Image rotation transformation

In this section, a skewed image is corrected to generate a non-skewed image by rotating it
through the skew angle $\theta$ obtained in Section 3.2.1. The rotation transformation is a mapping
function $f(x, y)$ which maps the coordinates of pixels in the original image to those in the output
image. However, some pixel values in the output image cannot be defined via the mapping function
$f$, because the domain and range used in image processing are integer-valued while the rotated
coordinates generally are not. In the implementation, we therefore use the inverse
function $f^{-1}$ to define all output pixel values from the original image. Each pixel value in the
output image is determined from the value in the original image via the inverse
function $f^{-1}$.
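
A minimal sketch of rotation by inverse mapping is given below. It is an illustration under assumptions: the rotation is taken about the image centre, the source location is rounded to the nearest pixel, and output pixels whose source falls outside the image receive a `background` value; none of these choices is specified in the text.

```python
import math

def rotate_by_inverse_mapping(image, theta_deg, background=0):
    """Rotate `image` (a list of rows) by theta_deg using inverse mapping.

    Every output pixel (xo, yo) is traced back to its source location in
    the original image, so no output pixel is left undefined.
    """
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))
    output = [[background] * width for _ in range(height)]
    for yo in range(height):
        for xo in range(width):
            xr, yr = xo - cx, yo - cy
            # Inverse rotation: (x, y) = (x' cos + y' sin, -x' sin + y' cos)
            xs = round(xr * c + yr * s + cx)
            ys = round(-xr * s + yr * c + cy)
            if 0 <= xs < width and 0 <= ys < height:
                output[yo][xo] = image[ys][xs]
    return output
```

Looping over output pixels rather than input pixels is what avoids the holes that a forward mapping would leave in the result.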

Geometrically, the value of pixel $P'(x', y')$ in the output image can be determined from that
of the corresponding pixel $P(x, y)$ in the original image. The location of pixel $P$ can be obtained
from the location of pixel $P'$ via the following function $f^{-1}$:

$$(x, y) = \left(x'\cos\theta + y'\sin\theta,\; -x'\sin\theta + y'\cos\theta\right) = (x', y')\begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix} \tag{3.3}$$

3.3 Connected Components

The connected components are computed for the whole document using a recursive labeling
algorithm. The algorithm works by first negating the whole image: each black pixel is replaced
by -1 and each white pixel by 0. Each pixel in this image is then checked to see whether it is a
text (black) pixel. If it is, a search function is called which takes the text pixel and its
coordinates and examines its neighbors. This function recursively searches for the black pixels
that are part of the same component and labels them. The scan then continues until it reaches a
new, unlabeled component.
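
The labeling scheme can be sketched as follows. The sketch keeps the -1 / 0 marking described above but replaces the recursion with an explicit stack, since deep recursion on large components can exceed Python's recursion limit; the use of 8-connectivity and the function name `label_components` are assumptions made for the example.

```python
def label_components(binary):
    """Label 8-connected components of text pixels.

    `binary` is a list of rows with truthy values for text pixels.
    Returns (labels, count): labels holds 0 for background and 1..count
    for the components, numbered in scan order.
    """
    height, width = len(binary), len(binary[0])
    # Mark text pixels as -1 (unlabeled) and background as 0.
    labels = [[-1 if px else 0 for px in row] for row in binary]
    count = 0
    for y in range(height):
        for x in range(width):
            if labels[y][x] != -1:
                continue
            count += 1                      # a new component starts here
            labels[y][x] = count
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if labels[ny][nx] == -1:
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count
```

The bounding box of each component, needed later for feature extraction, can be obtained from the minimum and maximum coordinates carrying each label.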

3.4 Line Segmentation

The line segmentation method proposed by Priyanka and Srikanth [10] consists of several steps,
which are described below.

Step 1: Run length smearing. A smoothing algorithm is applied to the text of a document
page. In this step we use the run length smearing technique [12] to increase the strength of the
histogram. Here we consider each consecutive run of white pixels between two black pixels and
compute the length of that white run. If the length of the white run is less than five times
the stroke width, the white run is filled with black. Figure 3.1 shows two original text lines
and Figure 3.2 shows the smoothed text lines with the horizontal histogram corresponding to
them.

[10]
Figure 3.1: Original Text lines

[10]
Figure 3.2: Smoothed Text lines with Histogram
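
Horizontal run length smearing and the resulting histogram can be sketched as below. This is a minimal illustration: the function names are hypothetical, pixels are represented as 1 (black) and 0 (white), and the stroke width is assumed to be known in advance.

```python
def smear_row(row, max_gap):
    """Fill white runs shorter than max_gap that lie between two black pixels."""
    out = list(row)
    last_black = None
    for x, is_black in enumerate(row):
        if is_black:
            if last_black is not None and 0 < x - last_black - 1 < max_gap:
                for k in range(last_black + 1, x):
                    out[k] = 1
            last_black = x
    return out

def smear_page(binary, stroke_width):
    """Apply the smearing rule of Step 1: gaps below 5 x stroke width are filled."""
    return [smear_row(row, 5 * stroke_width) for row in binary]

def horizontal_histogram(binary):
    """Number of black pixels in every row (the horizontal projection profile)."""
    return [sum(row) for row in binary]
```

For example, `smear_row([1, 0, 0, 1, 0, 0, 0, 0, 0, 1], 3)` fills the gap of length 2 but leaves the gap of length 5 untouched.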

Step 2: Recursive procedure to get middle lines for segmentation. After obtaining the
histogram of every line from the smoothed document page, we consider the highest peak of the
projection profile. We then find the middle point of the length of the highest peak and
draw a vertical line from top to bottom through this middle point, as shown in Figure 3.3.

[10]
Figure 3.3: Highest peak and vertical line drawn at the middle of highest peak

This step continues by finding the middle line of each and every peak of the histogram. Along
the vertical line (the line that passes through the middle point of the highest peak) we find the
middle point of each peak, and we draw horizontal lines through these middle points of the width
of the histogram peaks. In some cases not all peaks of the histogram cross this vertical line. For
these cases we find the distances between consecutive middle lines and compute the average of
these distances. If the distance between two middle lines is greater than twice the average value,
we assume that the region contains one or more text lines and that it needs recursive
segmentation. Having obtained that region (the region between two middle lines of peaks), we
apply the same procedure to find the vertical line through the middle of the highest peak and
the middle lines of that particular region. This procedure runs recursively until all middle lines
of the image are found, as shown in Figure 3.4.

[10]
Figure 3.4: middle line detection for considering small length text

Step 3: Finding candidate lines. In this step, starting from the first histogram peak, we
vertically scan the region between the first and second middle lines of the histogram until
we find the first two white pixels. We consider these two white pixels as minimum points: the line
where we find the first white pixel is the first minimum, and the line where we find the second
white pixel is the second minimum. We then calculate the vertical distances from the first middle
line to the first minimum point and from the first middle line to the second minimum point, and
take the larger of the two. The minimum point with the maximum vertical distance is used as the
separator between the two consecutive middle lines. In this way we find all line separators
between consecutive middle lines, as shown in Figure 3.5. If we instead took the point with the
fewest black pixels in the histogram as the separator line, we would get many errors.

[10]
Figure 3.5: (a).Initial segmentation line through the white pixels of horizontal histogram (b).
Result after considering only the candidate lines from original histogram.


3.5 Word Segmentation

In the word segmentation method, a text line is taken as input. After a text line is segmented,
it is scanned vertically. If two or fewer black pixels are encountered in one vertical scan, the
scan is denoted by 0; otherwise the scan is denoted by the number of black pixels. In this way
a vertical projection profile is constructed. If the profile contains a run of at least
k1 consecutive 0s, the midpoint of that run is considered to be the boundary of a word. The
value of k1 is taken as 1/3 of the text line height. Word segmentation results for a Telugu text
line are shown in Figure 3.6.

[10]
Figure 3.6: Output for word segmentation
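
The word boundary rule can be sketched as follows. The sketch is an illustration only: the function name is hypothetical, pixels are 1 (black) and 0 (white), and zero runs that touch the left or right end of the line are treated as margins rather than word gaps, which is an assumption not stated in the text.

```python
def word_boundaries(line):
    """Return the x positions of word boundaries in a binary text line.

    Columns with two or fewer black pixels count as empty; a run of at
    least k1 = height // 3 empty columns yields a boundary at its midpoint.
    """
    height, width = len(line), len(line[0])
    profile = [sum(line[y][x] for y in range(height)) for x in range(width)]
    profile = [0 if count <= 2 else count for count in profile]
    k1 = max(1, height // 3)

    boundaries, run_start = [], None
    for x in range(width + 1):
        is_zero = x < width and profile[x] == 0
        if is_zero and run_start is None:
            run_start = x                       # a run of empty columns begins
        elif not is_zero and run_start is not None:
            run_length = x - run_start
            # Ignore short gaps (inter-character) and page margins.
            if run_length >= k1 and run_start > 0 and x < width:
                boundaries.append(run_start + run_length // 2)
            run_start = None
    return boundaries
```

Treating columns with one or two black pixels as empty makes the profile tolerant of isolated noise pixels that would otherwise bridge two words.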

3.6 Feature Extraction

Feature extraction [5]: the output of the normalization phase is a normalized image of size
N × N. Real-valued directional features are calculated for each normalized image of size N × N.
These are based on the percentage of pixels in each direction range within each partition. An
adaptive gradient magnitude threshold, $r_t$, is computed over the whole character image gradient
map. This threshold is needed to filter out spurious responses to the Sobel operator used to find
the gradients. The threshold value $r_t$ is computed as

$$r_t = \frac{\sum_{i,j} r(i, j)}{D_1 D_2}$$

where $r(i, j)$ denotes the gradient magnitude at pixel $(i, j)$ and $D_1 D_2$ the number of pixels
in the gradient map. Thresholding is performed to nullify the pixels whose gradient magnitude is
below the computed threshold.

The feature vector is extracted based on the direction of the gradient at each pixel. We divide
the whole character image into M × N partitions. In our project we selected M = N = 8. The
directions of the gradient are quantized into K values, so each pixel has a gradient
direction value from 1 to K. The percentage of pixels in each partition with direction quantized
to k is calculated, so each partition gives K such values. In total we have an M × N × K
dimensional feature vector for each character image. We chose the value K = 12. In our project we
have a 192-dimensional feature vector in total for each normalized character image.

The steps to extract the feature vector are as follows. For each connected component:

1. Obtain the bounding box of the connected component, eliminating the blank surrounding
space.
2. Calculate the gradient magnitude and direction at each pixel.
3. Calculate the adaptive threshold of the gradient magnitude and perform thresholding to obtain
the new gradient direction at each pixel.
4. Partition the adaptive gradient direction map and extract the complete feature vector.
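
Steps 2 to 4 can be sketched as below for an already cropped and normalized character image. This is an illustration under assumptions, not the project's code: the border is handled by replication, directions are quantized uniformly over the full circle, and each feature is the fraction of a partition's pixels falling in a direction bin. Note that the feature length is M × N × K: with M = N = 8 and K = 12 this product is 768, whereas a 192-dimensional vector corresponds to, for example, M = N = 4 with K = 12, so the sketch leaves M, N and K as parameters.

```python
import math

def sobel_gradients(img):
    """Horizontal and vertical Sobel responses with a replicated border."""
    h, w = len(img), len(img[0])

    def px(y, x):
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx[y][x] = (px(y - 1, x + 1) + 2 * px(y, x + 1) + px(y + 1, x + 1)
                        - px(y - 1, x - 1) - 2 * px(y, x - 1) - px(y + 1, x - 1))
            gy[y][x] = (px(y + 1, x - 1) + 2 * px(y + 1, x) + px(y + 1, x + 1)
                        - px(y - 1, x - 1) - 2 * px(y - 1, x) - px(y - 1, x + 1))
    return gx, gy

def directional_features(img, M=8, N=8, K=12):
    """Return an M*N*K list of direction-bin fractions, one block per partition."""
    h, w = len(img), len(img[0])
    gx, gy = sobel_gradients(img)
    mag = [[math.hypot(gx[y][x], gy[y][x]) for x in range(w)] for y in range(h)]
    r_t = sum(map(sum, mag)) / (h * w)          # adaptive threshold: mean magnitude

    features = [0.0] * (M * N * K)
    cell_area = [0] * (M * N)
    for y in range(h):
        for x in range(w):
            cell = (y * M // h) * N + (x * N // w)
            cell_area[cell] += 1
            if mag[y][x] > 0 and mag[y][x] >= r_t:   # keep only strong gradients
                angle = math.atan2(gy[y][x], gx[y][x]) % (2 * math.pi)
                k = min(int(angle / (2 * math.pi) * K), K - 1)
                features[cell * K + k] += 1
    return [f / cell_area[i // K] if cell_area[i // K] else 0.0
            for i, f in enumerate(features)]
```

The resulting list can be passed directly to the classifier as the feature vector of the character.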

3.7 Pattern classification [2]

The feature vector extracted from the normalized image has to be assigned a label using a
pattern classifier [2]. There are many methods for designing pattern classifiers, such as the Bayes
classifier based on density estimation, neural networks, linear discriminant functions, nearest
neighbor classification based on prototypes, etc. In this system we have used the Support Vector
Machine (SVM) classifier. SVMs represent a new pattern classification method which grew out
of some of the recent work in statistical learning theory. The solution offered by the SVM
methodology for the two-class pattern recognition problem is theoretically elegant, computationally
efficient and is often found to give better performance by way of improved generalization. In
the next subsection we provide a brief overview of SVMs.

3.7.1 SVM Classifier [2]

The SVM classifier is a two-class classifier based on the use of discriminant functions. A
discriminant function represents a surface which separates the patterns so that the patterns from
the two classes lie on opposite sides of the surface. The SVM is essentially a separating surface
which is optimal according to a criterion explained below.

Consider a two-class problem where the class labels are denoted by +1 and -1. Given a
set of labeled (training) patterns $\chi = \{(x_i, y_i)\}$, $y_i \in \{-1, +1\}$, the hyper-plane
represented by $(w, b)$, where $w \in \Re^d$ represents the normal to the hyper-plane and
$b \in \Re$ the offset, forms a separating hyper-plane or a linear discriminant function if the
following separability conditions are satisfied:

$$w^T x_i + b > 0 \ \text{ for } i : y_i = +1; \qquad w^T x_i + b < 0 \ \text{ for } i : y_i = -1 \tag{3.4}$$

Here, $w^T x_i$ denotes the inner product between the two vectors, and $g(x) = w^T x + b$ is the
linear discriminant function.

In general, the set $\chi$ may not be linearly separable. In such a case one can employ the
generalized linear discriminant function defined by

$$g(x) = w^T \phi(x) + b, \quad \text{where } \phi : \Re^d \to \Re^{d'} \tag{3.5}$$

The original feature vector $x$ is $d$-dimensional. The function $\phi$ represents some nonlinear
transformation of the original feature space and $\phi(x)$ is $d'$-dimensional. By proper choice of
the function $\phi$ one can obtain complicated separating surfaces in the original feature space.
For any choice of $\phi$, the function $g$ given by (3.5) is a linear discriminant function in
$\Re^{d'}$, the range space of $\phi$. However, this by itself does not necessarily mean that one
can (efficiently) learn arbitrary separating surfaces using only techniques of linear discriminant
functions by this trick of using $\phi$. A good class of discriminant functions (say, polynomials
of degree $p$) in the original feature space may need a very high dimensional (of the order of
$d^p$) vector $\phi(x)$, and thus $d'$ can become much larger than $d$. This would mean that the
resulting problem of learning a linear discriminant function in the $d'$-dimensional space can be
very expensive both in terms of computation and memory. Another related problem is that we need to
learn the $d'$-dimensional vector $w$, and hence we would expect that we need a correspondingly
larger number of training samples as well. The methodology of SVMs represents an efficient way of
tackling both these issues. Here we only explain the computational issues.
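
To give a feel for how quickly $d'$ grows, the short sketch below counts the monomials of degree at most $p$ in $d$ variables, which is one standard way to build an explicit polynomial feature map. The function name and the choice of d = 192 (the feature vector length quoted in Section 3.6) are used purely for illustration.

```python
from math import comb

def polynomial_feature_count(d, p):
    """Number of monomials of degree at most p in d variables: C(d + p, p)."""
    return comb(d + p, p)

for degree in (1, 2, 3):
    print(degree, polynomial_feature_count(192, degree))
```

For d = 192 this gives 193, 18,721 and 1,216,865 dimensions for degrees 1, 2 and 3, which illustrates the order-of-$d^p$ growth mentioned above.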


Let $z_i = \phi(x_i)$. Thus we now have a training sample $\{(z_i, y_i)\}$ to learn a separating
hyper-plane in $\Re^{d'}$. The separability conditions are given by (3.4) with $x_i$ replaced by
$z_i$. Since there are only finitely many samples, given any $w \in \Re^{d'}$, $b \in \Re$ that
satisfy (3.4), by scaling them as needed, we can find $w$, $b$ that satisfy

$$y_i \left(w^T z_i + b\right) \geq 1, \quad i = 1, \ldots, l \tag{3.6}$$

Note that we have made clever use of the fact that $y_i \in \{+1, -1\}$ while writing the separability
constraints as above. The $w$, $b$ that satisfy (3.6) define a separating hyper-plane, $w^t z + b = 0$,
such that there are no training patterns between the two parallel hyper-planes given by
$w^t z + b = +1$ and $w^t z + b = -1$. The distance between these two parallel hyper-planes is
$2 / \|w\|$, which is called the margin (of separation) of this separating hyper-plane. It is
intuitively clear that among all separating hyper-planes the ones with higher margin are likely
to be better at generalization. The SVM is, by definition, the separating hyper-plane with
maximum margin.
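The scaling argument and the margin formula can be checked numerically. The sketch below is illustrative only: `canonical_form` and `margin` are hypothetical helper names, and the three 2-D samples are made up. It rescales a separating $(w, b)$ so that the closest sample satisfies the constraint with equality, then evaluates $2 / \|w\|$.

```python
import math


def canonical_form(w, b, samples):
    """Rescale (w, b) so that min_i y_i * (w . x_i + b) equals 1."""
    m = min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in samples)
    if m <= 0:
        raise ValueError("(w, b) does not separate the samples")
    return [wi / m for wi in w], b / m


def margin(w):
    """Distance between the hyper-planes w.z + b = +1 and w.z + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Made-up, linearly separable toy data: (feature vector, label).
samples = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((3.0, 1.0), +1)]
w, b = canonical_form([5.0, 0.0], 0.0, samples)
print(w, b, margin(w))
```

Here the closest samples lie at $x_1 = \pm 2$, so the rescaled hyper-plane has a margin of 4.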

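Hence, the problem of obtaining the SVM can be formulated as an optimization problem of
obtaining $w \in \mathbb{R}^{d'}$ and $b \in \mathbb{R}$, to

\[
\text{Minimize:} \quad \frac{1}{2} \|w\|^2 \tag{3.7}
\]
\[
\text{Subject to:} \quad 1 - y_i \left( z_i^t w + b \right) \le 0, \qquad i = 1, \ldots, l. \tag{3.8}
\]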

Suppose $w^*$ and $b^*$ represent the optimal solution to the above problem. Using the standard
Lagrange multipliers technique, one can show that

\[
w^* = \sum_{i=1}^{l} \alpha_i^* y_i z_i \tag{3.9}
\]

where $\alpha_i^*$ are the optimal Lagrange multipliers. There would be as many Lagrange multipliers
as there are constraints and there is one constraint for each training pattern (3.8). From
standard results in optimization theory, we must have
$\alpha_i^* \left[ 1 - y_i \left( z_i^t w^* + b^* \right) \right] = 0, \ \forall i$. Thus $\alpha_i^* = 0$ for all $i$
such that the separability constraint (3.8) is satisfied by strict inequality. Define a set of indices,

\[
S = \{ i : y_i \left( z_i^t w^* + b^* \right) - 1 = 0, \ 1 \le i \le l \}. \tag{3.10}
\]

Now it is clear that $\alpha_i^* = 0$ if $i \notin S$. Hence we can rewrite (3.9) as

\[
w^* = \sum_{i \in S} \alpha_i^* y_i z_i \tag{3.11}
\]

The set of patterns $\{ z_i : i \text{ s.t. } \alpha_i^* > 0 \}$ are called the support vectors. From (3.11), it is clear
that $w^*$ is a linear combination of support vectors and hence the name SVM for the classifier.
The support vectors are those patterns which are closest to the hyper-plane and are sufficient
to completely define the optimal hyper-plane. Hence these patterns can be considered to be the
most important training examples.
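As a small worked check of these relations, the sketch below uses a made-up three-point data set in the plane, with $\phi$ taken as the identity, for which the maximum-margin solution can be found by hand ($w^* = (0.5, 0.5)$, $b^* = 0$). The multipliers in `alpha` are the hand-derived values for this toy set, not output from any solver. The code rebuilds $w^*$ from the multipliers, recovers the index set $S$, and confirms that the multiplier of every non-support pattern is zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy data (assumed for illustration): two close patterns and one far pattern.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [+1, -1, +1]
alpha = [0.25, 0.25, 0.0]  # hand-derived optimal multipliers for this toy set
b = 0.0

# w* as a weighted sum of the training patterns.
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

# y_i (z_i . w* + b*) - 1 is zero exactly for the patterns on the margin.
gap = [yi * (dot(w, xi) + b) - 1.0 for xi, yi in zip(X, y)]
S = [i for i, g in enumerate(gap) if abs(g) < 1e-12]

print("w* =", w, "S =", S)
```

Only the two patterns nearest the hyper-plane end up in $S$; the third pattern could be deleted without changing $w^*$.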

To learn the SVM all we need are the optimal Lagrange multipliers corresponding to the problem
given by (3.7) and (3.8). This can be done efficiently by solving its dual, which is the optimization
problem given by: Find $\alpha_i$, $i = 1, \ldots, l$, to

\[
\text{Maximize:} \quad \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j z_i^t z_j
\]
\[
\text{Subject to:} \quad \alpha_i \ge 0, \ i = 1, 2, \ldots, l, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0. \tag{3.12}
\]

By solving this problem we obtain $\alpha_i^*$ and using these we get $w^*$ and $b^*$. It may be noted
that the dual given by (3.12) is a quadratic optimization problem of dimension $l$ (recall that $l$
is the number of training patterns) with one equality constraint and nonnegativity constraints
on the variables. This is so irrespective of how complicated the function $\phi$ is. Once the SVM
is obtained, the classification of any new feature vector, $x$, is based on the sign of (recall that
$z = \phi(x)$)

\[
f(x) = \phi(x)^t w^* + b^* = \sum_{i \in S} \alpha_i^* y_i \, \phi(x_i)^t \phi(x) + b^* \tag{3.13}
\]

where we have used (3.11). Thus, both while solving the optimization problem (given by (3.12))
and while classifying a new pattern, the only way the training pattern vectors, $x_i$, come into
the picture is as inner products $\phi(x_i)^t \phi(x_j)$. This is also the only way $\phi$ enters into the picture.
Suppose we have a function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ such that
$K(x_i, x_j) = \phi(x_i)^t \phi(x_j)$. Such a function is called a
Kernel function. Now we can replace $z_i^t z_j$ in (3.12) by $K(x_i, x_j)$. Then we never need to get into
$\mathbb{R}^{d'}$ while solving the dual. Often we can choose the kernel function so that it is computationally
much simpler than computing inner products in $\mathbb{R}^{d'}$. Once (3.12) is solved and $\alpha_i^*$ are obtained,
during classification also we never need to enter $\mathbb{R}^{d'}$. We can calculate the needed value of $f$ defined
by (3.13) once again by using the kernel function.
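The identity $K(x_i, x_j) = \phi(x_i)^t \phi(x_j)$ can be verified directly in a small case. The sketch below is an illustration, assuming $d = 2$ and the degree-2 polynomial kernel $(x^t y + 1)^2$, for which one valid explicit map $\phi$ has six components. The names `phi` and `poly_kernel` are hypothetical.

```python
import math


def phi(x):
    """One explicit feature map for the kernel (x.y + 1)**2 when d = 2."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return (1.0, r * x1, r * x2, x1 * x1, r * x1 * x2, x2 * x2)


def poly_kernel(x, y, p=2):
    """Polynomial kernel evaluated in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1] + 1.0) ** p


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product in R^6
implicit = poly_kernel(x, y)                           # same value from R^2
print(explicit, implicit)
```

Both routes give the same number up to rounding, but the kernel needs only one 2-D inner product and one power, whatever the size of the transformed space.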

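Table 3.1: Some popular kernels for SVMs.

Type of kernel    | $K(x_i, x_j)$                                                 | Comments
------------------|---------------------------------------------------------------|---------------------------------------------------------------
Polynomial kernel | $(x_i^t x_j + 1)^p$                                           | Power $p$ is specified a priori by the user
Gaussian kernel   | $\exp\left( -\frac{1}{2\sigma^2} \|x_i - x_j\|^2 \right)$     | The width $\sigma^2$, common to all the kernels, is specified a priori
Perceptron kernel | $\tanh(\beta_0 x_i^t x_j + \beta_1)$                          | Mercer's condition satisfied only for certain values of $\beta_0$ and $\beta_1$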

Given any symmetric function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, there are some sufficient conditions, called
Mercer's conditions, to ensure that there is some function $\phi$ such that $K$ gives the inner product in
the transformed space. Some of the Kernels used in SVMs are listed in Table 3.1. The polynomial
Kernel results in a separating surface in $\mathbb{R}^d$ represented by a polynomial of degree $p$. With
the Gaussian Kernel, the underlying $\phi$ is such that $\phi(x)$ is infinite dimensional! However,
by the trick of Kernel functions, we can get such arbitrarily complicated separating surfaces
by solving only a quadratic optimization problem given by (3.12). The SVM with a Gaussian
Kernel is equivalent to a radial basis function neural network and the SVM with a perceptron kernel
is equivalent to a three-layer feed forward neural network with sigmoidal activations. In both
cases the learning problem for the SVM (namely, the optimization problem given by (3.12)) is much
simpler computationally.
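The three kernels of Table 3.1 are short to write down. The sketch below is a plain-Python illustration; the function names and the sample points are assumptions made for this example. It also evaluates the quadratic form $\sum_{i,j} c_i c_j K(x_i, x_j)$, which must be non-negative for any kernel that satisfies Mercer's conditions; a single non-negative value is a spot check, not a proof.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, y, p=2):
    return (dot(x, y) + 1) ** p


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def perceptron_kernel(x, y, beta0, beta1):
    # Satisfies Mercer's condition only for some (beta0, beta1).
    return math.tanh(beta0 * dot(x, y) + beta1)


def quadratic_form(points, coeffs, kernel):
    """sum_ij c_i c_j K(x_i, x_j); non-negative for a Mercer kernel."""
    return sum(ci * cj * kernel(xi, xj)
               for ci, xi in zip(coeffs, points)
               for cj, xj in zip(coeffs, points))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
q = quadratic_form(points, [1.0, -2.0, 1.0], gaussian_kernel)
print(q)
```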

The optimization problem given by (3.12) has a lot of interesting structure, and hence many
efficient algorithms are available for solving it.

So far, we have assumed that the optimization problem specified by (3.7) and (3.8), whose dual is
given by (3.12), has a solution. This is so only if in the $d'$-dimensional $\phi$-space, the (transformed)
pattern vectors are linearly separable. In general, this is difficult to guarantee. To overcome
this, we can change the optimization problem to

\[
\text{Minimize:} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i \tag{3.14}
\]
\[
\text{Subject to:} \quad 1 - y_i \left( z_i^t w + b \right) - \xi_i \le 0, \quad i = 1, \ldots, l; \qquad \xi_i \ge 0, \quad i = 1, \ldots, l. \tag{3.15}
\]

Here $\xi_i$ can be thought of as penalties for violating separability constraints. Now these are
also variables over which optimization is to be performed. The constant $C$ is a user specified
parameter of the algorithm and as $C \to \infty$ we get the old problem. It so turns out that the dual
of this problem is the same as (3.12) except that the non-negativity constraint on $\alpha_i$ is replaced by
$0 \le \alpha_i \le C$. The optimal values of the new variables $\xi_i$ are irrelevant to the final SVM solution.
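To illustrate the role of the penalties, note that for a fixed $(w, b)$ the smallest $\xi_i$ satisfying both constraints is $\max(0, 1 - y_i (z_i^t w + b))$. The following sketch evaluates the soft-margin objective with that choice. The helper name `soft_margin_objective` and the toy samples are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_margin_objective(w, b, C, samples):
    """Return (objective, slacks) with the smallest feasible slack per sample."""
    slacks = [max(0.0, 1.0 - y * (dot(w, z) + b)) for z, y in samples]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks


# Toy samples: one well classified, one inside the margin, one misclassified.
samples = [((2.0, 0.0), +1), ((0.5, 0.0), +1), ((0.5, 0.0), -1)]
objective, slacks = soft_margin_objective((1.0, 0.0), 0.0, C=10.0, samples=samples)
print(objective, slacks)
```

A pattern beyond the margin contributes nothing, one inside the margin contributes a slack below 1, and a misclassified one contributes a slack above 1; a larger `C` makes each violation more expensive.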

To sum up, the SVM method for learning two class classifiers is as follows. We choose a
Kernel function and some value for the constant $C$ in (3.14). Then we solve its dual, which is the
same as (3.12) except that the variables $\alpha_i$ also have an upper bound, namely, $C$. (It may be
noted that here we use $K(x_i, x_j)$ in place of $z_i^t z_j$ in (3.12).) Once we solve this problem, all we
need to store are the non-zero $\alpha_i^*$ and the corresponding $x_i$ (which are the support vectors).
Using these, given any new feature vector $x$, we can calculate the output of the SVM, namely, $f(x)$,
through (3.13). The classification of $x$ would be $+1$ if the output of the SVM is positive; otherwise
it is $-1$.
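The classification step summarized above needs only the stored support vectors, their coefficients $\alpha_i^* y_i$, the offset $b^*$ and the kernel. The sketch below shows that computation on made-up values with a linear kernel; `decision_value` and `classify` are hypothetical names, not functions of any library.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, coeffs, b, kernel):
    """f(x) = sum_i coeffs[i] * K(x_i, x) + b, where coeffs[i] = alpha_i * y_i."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + b


def classify(x, support_vectors, coeffs, b, kernel):
    return +1 if decision_value(x, support_vectors, coeffs, b, kernel) > 0 else -1


# Illustrative stored model: two support vectors with alpha = 0.25 each.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
coeffs = [0.25 * (+1), 0.25 * (-1)]
b = 0.0

print(classify((2.0, 0.0), support_vectors, coeffs, b, linear_kernel))
print(classify((-3.0, 1.0), support_vectors, coeffs, b, linear_kernel))
```

Swapping `linear_kernel` for another kernel changes the shape of the decision surface without changing this code.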

SVM classifier for OCR

We have used SVM classifiers for labeling each segment of a word. As explained earlier, we
have trained a number of two-class classifiers (SVMs), each one for distinguishing one class
from all others. Thus each of our class labels has an associated SVM. A test example is assigned
the label of the class whose SVM gives the largest positive output. If no SVM gives a positive
output then the example is rejected. The output of the SVM gives a measure of the distance of
the example from the separating hyper-plane in the $\phi$-space. Hence the higher the value of the
(positive) output for a given pattern, the higher is the confidence in classifying the pattern.
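The labeling rule just described (largest positive output wins, otherwise reject) can be sketched as follows. The function name `assign_label` and the example labels are hypothetical; the scores stand for the outputs $f(x)$ of the per-class SVMs.

```python
def assign_label(outputs):
    """Pick the class whose SVM output is largest, or None to reject.

    outputs maps each class label to the output f(x) of that class's SVM.
    """
    if not outputs:
        return None
    label, score = max(outputs.items(), key=lambda item: item[1])
    return label if score > 0 else None


print(assign_label({"ka": 0.8, "ga": 1.7, "ma": -0.4}))   # largest positive output
print(assign_label({"ka": -0.2, "ga": -1.1, "ma": -0.4}))  # no positive output
```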

Chapter 4

Implementation
Developing an OCR system for printed Telugu text consists of two stages, pre-processing and
recognition. In the pre-processing phase, thresholding and noise removal are implemented using the
algorithm specified in Section 3.1.1. Skew detection and removal are implemented using a variant
of the Hough transform.

The OCR pipeline starts by taking the document image as input. The image is first
converted into a grayscale image. The grayscale image is then converted to a binary image using the
method described in the section on thresholding. Connected components in the whole document are
found, together with their bounding boxes, using a two-pass algorithm. These connected components are
then used to segment the whole document into lines. Line segmentation takes the array of connected
components as its parameter and returns the top and bottom row numbers of each line with respect
to the image coordinate system.
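A two-pass connected-component pass followed by a simple line grouping can be sketched as below. This is a minimal illustration under stated assumptions, not the project's Java code: it assumes a binary image stored as a list of rows with 1 for ink, 8-connectivity, and a line rule that merges components whose row ranges overlap. The names `connected_components` and `segment_lines` are hypothetical.

```python
def connected_components(img):
    """Two-pass labelling; returns sorted boxes (top, left, bottom, right)."""
    rows, cols = len(img), len(img[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # union-find forest; index 0 is reserved for background

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Pass 1: provisional labels, recording equivalences between labels.
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            neigh = []
            for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if rr >= 0 and 0 <= cc < cols and labels[rr][cc]:
                    neigh.append(labels[rr][cc])
            if not neigh:
                parent.append(len(parent))
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(neigh)
                for n in neigh:
                    union(labels[r][c], n)

    # Pass 2: resolve each label to its root and grow that root's box.
    boxes = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                t, l, b, rt = boxes.get(root, (r, c, r, c))
                boxes[root] = (min(t, r), min(l, c), max(b, r), max(rt, c))
    return sorted(boxes.values())


def segment_lines(boxes):
    """Merge boxes with overlapping row ranges; returns (top, bottom) per line."""
    lines = []
    for top, _, bottom, _ in sorted(boxes):
        if lines and top <= lines[-1][1]:
            lines[-1][1] = max(lines[-1][1], bottom)
        else:
            lines.append([top, bottom])
    return [tuple(line) for line in lines]
```

For example, calling `segment_lines(connected_components(img))` on a small binary image returns one `(top, bottom)` pair per text line in image row coordinates.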

Each text line is given as input to the word segmentation phase. This function segments the
text line into words and returns the left and right column numbers of each word. The connected
components which belong to each word are grouped.
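One simple way to realize this step is to sort the components of a line from left to right and start a new word whenever the blank gap to the previous component reaches a threshold. The sketch below assumes that rule; the threshold `min_gap`, the function name `segment_words` and the dictionary layout are illustrative assumptions, since the exact criterion is not stated in this chapter.

```python
def segment_words(boxes, min_gap):
    """Group one line's component boxes (top, left, bottom, right) into words.

    A new word starts when at least min_gap blank columns separate a
    component from the word built so far.
    """
    words = []
    for box in sorted(boxes, key=lambda bx: bx[1]):
        left, right = box[1], box[3]
        if words and left - words[-1]["right"] - 1 < min_gap:
            words[-1]["right"] = max(words[-1]["right"], right)
            words[-1]["components"].append(box)
        else:
            words.append({"left": left, "right": right, "components": [box]})
    return words


line_boxes = [(0, 20, 5, 25), (0, 0, 5, 3), (0, 5, 5, 8)]
for word in segment_words(line_boxes, min_gap=4):
    print(word["left"], word["right"], len(word["components"]))
```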

Each component is normalized into a 48 × 48 image. This image is given as input
to the feature extraction function, which returns a feature vector of 192 dimensions computed
using the Sobel operator and the adaptive threshold gradient. The feature
vector is then given as input to the SVM classifier, which is trained in the SVM training phase
described in later sections.
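The first two ingredients of this step, size normalization and the Sobel gradient, can be sketched as below. Nearest-neighbour resampling is an assumption made for this illustration, and the sketch stops at the raw gradients: how they are thresholded and pooled into the 192-dimensional vector is not specified here and is not reproduced. The names `resize_nearest` and `sobel` are hypothetical.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def resize_nearest(img, size=48):
    """Resample a 2-D list to size x size using nearest-neighbour sampling."""
    rows, cols = len(img), len(img[0])
    return [[img[r * rows // size][c * cols // size] for c in range(size)]
            for r in range(size)]


def sobel(img):
    """Horizontal and vertical Sobel responses; border pixels are left at 0."""
    rows, cols = len(img), len(img[0])
    gx = [[0] * cols for _ in range(rows)]
    gy = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            sx = sy = 0
            for i in range(3):
                for j in range(3):
                    v = img[r + i - 1][c + j - 1]
                    sx += SOBEL_X[i][j] * v
                    sy += SOBEL_Y[i][j] * v
            gx[r][c], gy[r][c] = sx, sy
    return gx, gy


# A tiny component with a vertical edge: left half background, right half ink.
component = [[0, 0, 1, 1] for _ in range(4)]
normalized = resize_nearest(component)
gx, gy = sobel(normalized)
print(gx[10][24], gy[10][24])
```

On this input the horizontal response is non-zero only along the vertical edge and the vertical response is zero there, which is the directional information a gradient-based feature builds on.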

All these functions are implemented using the Java Advanced Imaging (JAI) package from Sun
Microsystems (now Oracle) in the NetBeans 6.8 IDE. LibSVM is the package used for training and
using the SVM classifier.
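LibSVM's command-line tools read training data as text, one example per line, in the form `<label> <index>:<value> ...` with 1-based feature indices in ascending order and zero-valued features optionally omitted. The sketch below writes a feature vector in that format; the helper name `to_libsvm_line` and the sample values are assumptions for illustration, and the Java API of LibSVM can instead be fed node arrays directly.

```python
def to_libsvm_line(label, features):
    """Format one example in LIBSVM's sparse text format (1-based indices)."""
    parts = [str(label)]
    parts += [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join(parts)


# Hypothetical class label 3 with a short, made-up feature vector.
print(to_libsvm_line(3, [0.5, 0, 2]))
```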

Chapter 5

Results

Figure 5.1: Home page of the tool

Figure 5.2: Displaying the original image

Figure 5.3: Bounding Connected Components

Figure 5.4: Line Segmentation

Figure 5.5: Word Segmentation

Chapter 6

Conclusion and Future work


Conclusion

The main aim of this project is to develop an Optical Character Recognition system for
printed Telugu text. Telugu script has a complex structure and has thousands of combinations of
vowels, consonants and consonant modifiers. Hence detection and recognition of basic symbols helps
in reducing the number of classes. This project develops a tool that takes a document image
as input and displays the Unicode value of each character. This Unicode output can then be used
to display the corresponding Telugu text.

Future work

The recognition accuracy can be further increased by post-processing that
makes use of the association between the basic symbols. For example, it is known that some
modifiers occur very frequently with some characters and some modifiers occur very infrequently.
The feature vector used here can also be applied to recognizing handwritten Telugu script. The final
output of the proposed system can be used further for text-to-speech conversion.

Bibliography
[1] Histogram modification for threshold selection. IEEE Transactions on Systems, Man and
Cybernetics, 9(1):38-52, Jan. 1979.
[2] T. V. Ashwin and P. S. Sastry. Font and size-independent OCR system for printed Kannada
documents using support vector machines. Sadhana, 27:35-58, 2002.
[3] B. B. Chaudhuri and U. Pal. A complete printed Bangla OCR system. Pattern Recognition,
31(5):531-549, 1998.
[4] Huei-Fen Jiang, Chin-Chuan Han, and Kuo-Chin Fan. A fast approach to the detection
and correction of skew documents. Pattern Recognition Letters, 18(7):675-686, 1997.
[5] C. Vasantha Lakshmi and C. Patvardhan. An optical character recognition system for
printed Telugu text. Pattern Analysis and Applications, 7:190-204, 2004.
doi:10.1007/s10044-004-0217-2.
[6] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development.
Proceedings of the IEEE, 80(7):1029-1058, Jul. 1992.
[7] G. Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(1):38-62, Jan. 2000.
[8] L. O'Gorman. Binarization and multithresholding of document images using connectivity.
CVGIP: Graphical Models and Image Processing, 56(6):494-506, 1994.
[9] N. Otsu. A threshold selection method from grey-level histograms. IEEE Transactions on
Systems, Man and Cybernetics, 9(1):62-66, Jan. 1979.
[10] Nallapareddy Priyanka, Srikanta Pal, and Ranju Manda. Line and word segmentation
approach for printed documents. IJCA, Special Issue on RTIPPR, (1):30-36, 2010.
Published by Foundation of Computer Science.
[11] Victor Wu and R. Manmatha. Document image clean-up and binarization. In Proc.
SPIE Symposium on Electronic Imaging, pages 263-273, 1998.
[12] Hong Yan. Skew correction of document images using interline cross-correlation. CVGIP:
Graphical Models and Image Processing, 55(6):538-543, 1993.
