
PRESENTATION

on
HANDWRITTEN URDU SCRIPT NUMERALS RECOGNITION
By
Sartaj Khan
M Tech (Sequential) IV Sem Roll No : 6410110020

Supervisor: Mr Hitendra Garg, MCA, MS (BITS, Pilani), Ph.D.*

CONTENTS
INTRODUCTION
AIM
OCR
DATA SET COLLECTION
PREPROCESSING
    NORMALIZATION
    NOISE REMOVAL
FEATURE EXTRACTION
    ZONING
    DENSITY
    CONCAVITY
    CONTOUR
RESULTS
GENETIC ALGORITHM
COMPARING RESULTS
CONCLUSION
REFERENCES

INTRODUCTION
Handwritten numeral recognition is a benchmark problem in Pattern Recognition and Artificial Intelligence. Compared with printed numeral recognition, handwritten numeral recognition is harder because of variations in the shapes and sizes of handwritten characters. The present work addresses this problem for handwritten Urdu numerals.

AIM
Density, concavity and contour features are extracted; the best results are reported using a combination of density and concavity. To find the optimal feature subset, a genetic algorithm is suggested, so as to reduce computational effort and increase recognition accuracy. In experiments on a database of 20000 samples, the technique yields an average recognition rate of 97.8%, evaluated after three-fold cross validation. It is useful for applications related to OCR of handwritten Urdu numerals and can also be extended to OCR of handwritten characters of the Urdu alphabet.

OCR
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text.

In on-line character recognition systems, the computer recognizes symbols as they are drawn, while off-line recognition is performed after writing or printing is completed.

APPLICATIONS
Assigning ZIP codes to letter mail.
Reading data entered in forms, e.g. tax forms.
Automatic accounting procedures used in processing utilities bills.
Verification of account numbers and courtesy amounts on bank checks.
Automatic accounting of airline passenger tickets.
Automatic validation of passports.

DATA SET COLLECTION

Our objective is to obtain a set of handwritten samples of Urdu numerals that capture variations in handwriting between and within writers.
Therefore, we need numeral samples from multiple writers, as well as multiple samples from each writer.

CRITERIA FOR SELECTION OF NUMERALS


The numerals are written in the specified blocks, as shown below. Writers are free to use pens of different quality, different ink colours, etc. Each numeral should be written entirely within its grid cell without touching the grid lines; if it fails this criterion, the algorithm removes the parts that lie outside the boundary.

Each person writes the numerals 1-10 (in Urdu script), ten times in each row. Thus each numeral is written 10 times, and one person writes 100 numerals.
On the above criteria we collected samples from 200 persons (samples of one numeral: 200 x 10 = 2000).

This resulted in 2000 samples. Out of these, 1000 samples (10 x 100) were randomly selected and stored in the database, and 1000 were used as test images.

BLANK FORMAT FOR COLLECTING HAND WRITTEN URDU NUMERAL

SAMPLE DATA SHEET AFTER BEING DULY FILLED

PREPROCESSING

The images of the samples collected as described above are preprocessed and made suitable for further processing.
Steps for Preprocessing
Normalization

Noise Removal

NORMALIZATION
Normalization is the process of standardizing the size of each image.

Steps in the normalization process:
Start with the image frame of Xsize0 x Ysize0 pixels that fits the isolated numeral, obtained by removing blank rows and columns.
Rescale the image to Xsize x Ysize pixels, where both sides equal the larger of Xsize0 and Ysize0, i.e.
Xsize = max(Xsize0, Ysize0)
Ysize = max(Xsize0, Ysize0)
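The two normalization steps can be sketched as follows. This is only an illustrative sketch: the image is assumed to be a list of rows with 1 for a black pixel and 0 for a white pixel, the function names are hypothetical, and nearest-neighbour resampling is an assumption, since the slides do not say how the rescaling is done.

```python
def crop_to_content(img):
    """Remove blank rows and columns around the numeral (1 = black, 0 = white)."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    if not rows:
        return [[0]]  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def normalize(img):
    """Crop to the numeral, then rescale to a square of side max(Xsize0, Ysize0)."""
    cropped = crop_to_content(img)
    ysize0, xsize0 = len(cropped), len(cropped[0])
    size = max(xsize0, ysize0)
    # Nearest-neighbour resampling (assumed; not specified in the slides).
    return [[cropped[r * ysize0 // size][c * xsize0 // size] for c in range(size)]
            for r in range(size)]
```

For a numeral that is 2 pixels wide and 3 pixels tall after cropping, the result is a 3 x 3 image.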


Image before normalization

Image after normalization

NOISE REMOVAL
During the scanning process some noise is introduced. Possible causes are specks of dust on the scanner, poor quality of the paper on which the numerals are written, etc.
If there is a single black pixel, or a run of two or three connected black pixels, it is treated as noise and removed, i.e. the noise pixels are converted to white pixels.
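The noise rule above can be sketched as a small connected-component filter. This is a sketch under assumptions: the image is a list of rows with 1 for black and 0 for white, pixels are treated as connected when they touch horizontally, vertically or diagonally (the slides do not state the connectivity), and the function name is hypothetical.

```python
def remove_small_specks(img, max_size=3):
    """Turn black groups of at most max_size connected pixels white."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Flood fill to collect one connected group of black pixels.
            stack, component = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            # Groups of 1, 2 or 3 pixels are treated as noise.
            if len(component) <= max_size:
                for y, x in component:
                    out[y][x] = 0
    return out
```

Strokes of four or more connected pixels are left untouched, so the numeral itself survives as long as its parts are larger than the specks.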

With noise

Without Noise

FEATURE EXTRACTION
In the feature extraction stage, each character is represented as a feature vector, which becomes its identity. The major goal of feature extraction is to extract a set of features that maximizes the recognition rate with the least number of elements.

REQUIREMENTS OF A GOOD FEATURE SET


It should have good discriminating power, to enable correct identification even among very similar symbols.
It should not be too time-consuming to compute.
As far as possible, the feature set should be rotation, scaling and translation invariant, so that recognition is independent of font, size and pitch.
The feature set should accord some immunity to noise.
The feature set must offer a complete description of the character set to be recognized.

DIFFERENT FEATURES EXTRACTION


ZONING
DENSITY
CONCAVITY
CONTOUR

ZONING
The character image is divided into NxM zones. From each zone features are extracted to form the feature vector. The goal of zoning is to obtain the local characteristics instead of global characteristics.
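A minimal sketch of zoning, assuming the image is a list of rows whose height and width divide evenly by the number of zones; the function name is hypothetical.

```python
def split_into_zones(img, n_rows, n_cols):
    """Return the n_rows x n_cols zones of img, listed row by row."""
    zone_h = len(img) // n_rows
    zone_w = len(img[0]) // n_cols
    zones = []
    for zr in range(n_rows):
        for zc in range(n_cols):
            # Slice out the rows, then the columns, that belong to this zone.
            zones.append([row[zc * zone_w:(zc + 1) * zone_w]
                          for row in img[zr * zone_h:(zr + 1) * zone_h]])
    return zones
```

Each zone can then be passed to any local feature extractor, so the resulting feature vector describes local rather than global characteristics.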

DENSITY
The number of dark pixels in each cell is considered a feature.

Darker squares indicate higher density of zone pixels.

Density Feature is Calculated as

Steps for the density feature:
Break the box into thirty-six equal zones, each of size 6 x 6.
Compute the density of pixels in each zone.
Store these features in a file.
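The density steps can be sketched as below. The sketch assumes a 36 x 36 normalized image stored as a list of rows (1 = black, 0 = white) and takes the number of dark pixels in each zone as the feature, as stated earlier; any further scaling of that count is not shown here, and the function name is hypothetical.

```python
def density_features(img, zones_per_side=6):
    """Count the black pixels in each zone of a square image (1 = black)."""
    cell = len(img) // zones_per_side  # zone side in pixels, 6 for a 36 x 36 image
    features = []
    for zr in range(zones_per_side):
        for zc in range(zones_per_side):
            count = sum(img[r][c]
                        for r in range(zr * cell, (zr + 1) * cell)
                        for c in range(zc * cell, (zc + 1) * cell))
            features.append(count)
    return features
```

For a 36 x 36 image this gives a vector of 36 values, one per zone.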


CONCAVITY
These features are used to highlight the topological and geometrical properties of the digit classes. Each concavity feature represents the number of white pixels that belong to a specific concavity configuration. The label for each white pixel is chosen based on the Freeman code with four directions. Each direction is explored until a black pixel is encountered or the limits imposed by the digit bounding box are reached. A white pixel is labeled if at least two consecutive directions find black pixels. Thus, we have 9 possible concavity configurations. Moreover, we consider four more configurations, in order to detect the presence of loops more precisely. The total length of this feature vector is then 13.
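The labelling of white pixels into the 9 basic configurations can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the image is assumed to be already cropped to the digit bounding box and stored as a list of rows (1 = black, 0 = white), the order of the 9 configurations in the output is arbitrary, the names are hypothetical, and the 4 extra loop configurations are not covered.

```python
import itertools

# Four Freeman directions as (row step, column step): up, right, down, left.
DIRECTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

# Hit patterns in which at least two cyclically consecutive directions
# find a black pixel; there are exactly 9 of them.
CONFIGS = [p for p in itertools.product((False, True), repeat=4)
           if any(p[i] and p[(i + 1) % 4] for i in range(4))]


def hits_black(img, r, c, dr, dc):
    """Walk from (r, c) in one direction until a black pixel or the box edge."""
    r, c = r + dr, c + dc
    while 0 <= r < len(img) and 0 <= c < len(img[0]):
        if img[r][c] == 1:
            return True
        r, c = r + dr, c + dc
    return False


def concavity_features(img):
    """Count the white pixels that fall into each of the 9 configurations."""
    counts = [0] * len(CONFIGS)
    for r, row in enumerate(img):
        for c, pixel in enumerate(row):
            if pixel == 0:
                pattern = tuple(hits_black(img, r, c, dr, dc)
                                for dr, dc in DIRECTIONS)
                if pattern in CONFIGS:
                    counts[CONFIGS.index(pattern)] += 1
    return counts
```

A white pixel that sees black only in two opposite directions (for example left and right) is not labeled, because those directions are not consecutive.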

Showing the 9 concavity configurations and also 4 configurations for false loop


CONTOUR
The number of interior and exterior contours is extracted from the chain code representation of the image.

Connectivity features extracted for a line
To extract the direction of the numeral's contour, the normalized image (36 x 36 pixels) is divided into 6 x 6 cells. The size of each cell is 6 x 6 pixels. There are 4 feature windows of 3 x 3 pixels, one for each of 4 directions: horizontal (A), vertical (B), left diagonal (C) and right diagonal (D).

XC = (X1A, X1B, X1C, X1D, X2A, X2B, ..., X36A, X36B, X36C, X36D)
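One possible way to fill the 144-element vector XC is sketched below. The slides do not say how the 3 x 3 windows are matched, so this sketch makes assumptions that should be read as such: a window matches at a black pixel when both of its neighbours along the window's direction are also black, "left diagonal" is taken as top-left to bottom-right, and the counts are taken on the image as given (list of rows, 1 = black) rather than on a separately traced contour. All names are hypothetical.

```python
# Offsets of the two neighbours along each direction, relative to the centre
# of the 3 x 3 window. The diagonal naming is an assumption.
WINDOWS = {
    "A": ((0, -1), (0, 1)),    # horizontal
    "B": ((-1, 0), (1, 0)),    # vertical
    "C": ((-1, -1), (1, 1)),   # left diagonal (assumed top-left to bottom-right)
    "D": ((-1, 1), (1, -1)),   # right diagonal (assumed top-right to bottom-left)
}


def is_black(img, r, c):
    """True if (r, c) lies inside the image and is a black pixel."""
    return 0 <= r < len(img) and 0 <= c < len(img[0]) and img[r][c] == 1


def direction_features(img, cells_per_side=6):
    """Return (X1A, X1B, X1C, X1D, ..., X36A, X36B, X36C, X36D) as a list."""
    cell = len(img) // cells_per_side
    features = []
    for zr in range(cells_per_side):
        for zc in range(cells_per_side):
            for name in "ABCD":
                (dr1, dc1), (dr2, dc2) = WINDOWS[name]
                count = 0
                for r in range(zr * cell, (zr + 1) * cell):
                    for c in range(zc * cell, (zc + 1) * cell):
                        if (img[r][c] == 1
                                and is_black(img, r + dr1, c + dc1)
                                and is_black(img, r + dr2, c + dc2)):
                            count += 1
                features.append(count)
    return features
```

With 36 cells and 4 directions per cell, the vector has 36 x 4 = 144 elements.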

TEST RESULTS
DENSITY

CONCAVITY
CONTOUR

DENSITY & CONCAVITY


DENSITY & CONTOUR

CONCAVITY & CONTOUR


DENSITY , CONCAVITY & CONTOUR

DENSITY FEATURE'S RESULT 86.9%

NUMERAL / Recognized As

99   0   0   0   0   2   0   3   0
 0  93   0   0   3   1   0   1   1
 0   0  86   1   1   1   0   0   1
 0   0   2  89   2   5   0   1   1
 0   3   2   4  70   6   0   4   2
 0   0   2   1   7  80   0   1   0
 0   4   0   1   0   0  89   5   0
 1   0   2   0   7   0   0  80   0
 0   0   1   1   1   0   1   1  93
 0   0   5   3  12   5  10   4   2  90
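The 86.9% figure is consistent with the diagonal of the matrix above. The short check below assumes 100 test images per numeral (1000 test images spread over 10 numerals); that split is an assumption, the diagonal values are copied from the matrix.

```python
# Correctly recognized samples per numeral (the diagonal of the matrix).
diagonal = [99, 93, 86, 89, 70, 80, 89, 80, 93, 90]

samples_per_numeral = 100  # assumption: 1000 test images over 10 numerals
total = samples_per_numeral * len(diagonal)

recognition_rate = 100 * sum(diagonal) / total
print(round(recognition_rate, 1))  # prints 86.9
```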

SUMMARY OF THE RESULTS USING VARIOUS FEATURES


Name of Features                 Size of Feature Vector    Results
CONCAVITY + DENSITY              90                        92.6%
CONCAVITY + CONTOUR              144                       88.6%
CONTOUR + DENSITY                180                       88.6%
CONCAVITY                        54                        87%
DENSITY                          36                        86.9%
CONTOUR
CONTOUR + CONCAVITY + DENSITY    234                       88.7%

GENETIC ALGORITHM
The GA is a search process based on the laws of natural selection and genetics. Usually, a simple GA consists of three operations: Selection, Genetic Operation, and Replacement.


GENETIC OPERATION
Crossover is a recombination operator that combines subparts of two parent chromosomes to produce offspring that contain some parts of both parents' genetic material.
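A minimal sketch of crossover on bit-string chromosomes. Single-point crossover is assumed here; the slides do not say which crossover variant is used, and the function name is hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length parents at a random cut point."""
    point = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b
```

Each child starts with one parent's bits and ends with the other's, so both parents contribute genetic material.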

Mutation is an operator that introduces variations into the chromosome. It randomly alters the value of a string position. Each bit of a bitstring is replaced by a randomly generated bit.
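A minimal sketch of mutation on a bit string. It assumes that each position is replaced by a randomly generated bit with a small probability (0.01 is used as the default here); the function name is hypothetical.

```python
import random


def mutate(bits, p_mutation=0.01, rng=random):
    """With probability p_mutation, replace a bit by a randomly generated bit."""
    return [rng.randint(0, 1) if rng.random() < p_mutation else bit
            for bit in bits]
```

Because the replacement bit is random, a selected position keeps its old value about half of the time.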


FEATURES SUBSET SELECTION USING GA


Representation of chromosome: A string of 90 binary digits, at positions 1 to 90, represents the subset of features selected.

If a 1 appears in the string at position i, the feature corresponding to position i is selected for the subset being formed; if it is a 0, the corresponding feature is not selected.
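Applying a chromosome to a feature vector can be sketched as a simple mask. The function name and the example values are hypothetical; a short vector is used in place of the 90 features for readability.

```python
def select_features(features, chromosome):
    """Keep the features whose chromosome bit is 1."""
    if len(features) != len(chromosome):
        raise ValueError("chromosome must have one bit per feature")
    return [value for value, bit in zip(features, chromosome) if bit == 1]


# Toy example with 6 features instead of 90.
subset = select_features([12, 0, 7, 3, 9, 1], [1, 0, 0, 1, 1, 0])
print(subset)  # prints [12, 3, 9]
```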


SELECTION OF PARAMETERS
Population size: 10
Number of generations: 1000
Probability of crossover: 0.9
Probability of mutation: 0.01
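A GA loop with these parameters could look like the sketch below. Several parts are assumptions rather than the method of the slides: binary tournament selection, single-point crossover, replacement that always keeps the best string, and the `fitness` argument, which stands in for whatever score is used to rate a feature subset (for example the recognition rate obtained with it). All names are hypothetical.

```python
import random


def run_ga(fitness, n_bits=90, pop_size=10, generations=1000,
           p_crossover=0.9, p_mutation=0.01, seed=0):
    """Search for a bit string with high fitness; returns the best string found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def pick():
        # Selection: binary tournament (an assumed scheme).
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # replacement keeps the best string so far
        while len(children) < pop_size:
            mother, father = pick(), pick()
            # Genetic operation 1: single-point crossover.
            if rng.random() < p_crossover:
                point = rng.randint(1, n_bits - 1)
                child = mother[:point] + father[point:]
            else:
                child = mother[:]
            # Genetic operation 2: mutation by random bit replacement.
            child = [rng.randint(0, 1) if rng.random() < p_mutation else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best
```

In practice the fitness evaluation dominates the running time, since each call would require classifying the test samples with the selected feature subset.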


COMPARISON OF RESULTS
Numerals (one row each)

Result (90 Features)    Best String out of 90:    Best String out of 90:
                        No. of Features           % Results
98%                     44                        100
96%                     50                        99
92%                     53                        91
96%                     49                        97
87%                     48                        93
92%                     53                        93
98%                     47                        100
83%                     40                        93
97%                     48                        99
87%                     52                        95

Average: 92.6%                                    96%

CONCLUSION
To improve the recognition of Urdu script numerals, we developed a number of features, namely density, concavity and contour. We applied these features, and combinations of them, to the sample data. We also used a genetic algorithm for this purpose and compared its accuracy with that of the other feature sets. The work presented a GA-based method for the optimal selection of a subset of features, to increase the accuracy and speed of recognition of Urdu script numerals.


Thanks
