International Journal of Wisdom Based Computing, Vol. 1 (2), August 2011

A Note on Document Recognition System


K. Priya
CMS College of Science and Commerce Coimbatore, India priya.sks@gmail.com

Abstract- Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. Document recognition is a task in which a document in its physical presentation format is transformed into a structured author-oriented model of the document. This paper presents a note on Document Recognition System.

Keywords- Segmentation, Document Recognition, Character Recognition

I. INTRODUCTION

A Document Recognition System (DRS) performs a task in which a document in its physical presentation format is transformed into a structured, author-oriented model of the document. Segmentation is an important pre-processing step for document recognition, and the size and shape of characters generally play a vital role in it. Handwritten documents drastically reduce both segmentation accuracy and recognition rate: because the size and shape of handwritten characters cannot be controlled, segmenting handwritten text is very difficult. Many techniques (algorithms) have been implemented and have shown encouraging results.

The field of document recognition has seen many improvements during the last decade. However, automatic recognition of handwritten words remains a challenging task, especially for scripts such as Arabic, Chinese, Japanese and Korean. Handwriting recognition, whether on-line or off-line, is difficult because of the high variability and uncertainty of human writing. Segmentation may take the form of text segmentation, word segmentation, character segmentation, sentence segmentation or speech segmentation, and it is an important process in document recognition.

A document recognition system takes a scanned document image as input and passes it to a pre-processing stage. The pre-processed image is then given to the segmentation phase, where it is fragmented into lines, words and characters. Once the image has been segmented, features are extracted and classified, and finally the desired output is obtained.

Line segmentation, word segmentation and character segmentation are the most critical pre-processing steps for any handwritten document recognition/retrieval task. The goal is to extract all the word images from a full page of a handwritten document [1]. This matters for two reasons. First, word recognition methods for handwriting fall into two categories, segmentation-based and non-segmentation-based, and both need to work on pre-extracted word images. Second, content-based image retrieval techniques, such as word spotting, also require all the word images in the documents to be properly pre-segmented. Wrongly segmented word images will cause most techniques in a handwritten document recognition/retrieval system to fail.

Techniques for document segmentation and layout analysis are traditionally subdivided into three main categories [6]: bottom-up, top-down and hybrid techniques. Recent progress in this area has introduced other methods that extend this categorization.

Bottom-up techniques progressively merge evidence at increasing scales to form, e.g., words from characters, lines from words, and columns from text lines. They are usually more flexible than top-down methods, but they may suffer from the accumulation of mistakes when going from the small-scale details up to the large-scale features.

Top-down techniques start by detecting the large-scale features of the image (e.g., columns) and proceed by successive splitting until they reach the smallest-scale features (i.e., text lines or individual characters). For the procedure to be effective, a priori knowledge about the structure of the page is necessary. These techniques are therefore particularly useful when the layout is constrained, as is often the case for pages from scientific journals.

Most methods do not fit into either of these two categories and are therefore called hybrid. Among these are methods based on texture analysis and methods based on background analysis.

In methods based on texture analysis, reconstructing the document layout is treated as a texture segmentation problem. The document page is subdivided into small regions, each of which is classified as belonging to one of a few categories (text, drawing, image, etc.) [1] according to an analysis of its texture. Once each region in the image has been tentatively classified, a globally consistent segmentation is carried out using the usual techniques of machine vision.

This paper presents a note on document recognition systems.

II. LITERATURE REVIEW

In [1], the authors describe a new approach to distinguish and extract text from images with various objects and complex backgrounds. The goal is to present the characters in an image against a clear background and without other objects. The approach has two main steps. First, a density-based clustering method segments candidate characters by integrating the spatial connectivity and color features of character pixels. Because the colors of the pixels in one character are commonly non-uniform due to noise, a new histogram segmentation method is proposed in this step to obtain the color thresholds of characters. Second, prior knowledge and a texture-based method are applied to the candidate characters to filter out non-characters.

Paper [3] presents an efficient and computationally fast method to extract text regions from documents. The authors use the Haar discrete wavelet transform (DWT) [4], which they describe as the fastest of all wavelets because its coefficients are either 1 or -1; this is one of the reasons they employ it to detect the edges of candidate text regions. First, edges are detected; then a line feature vector graph is generated from the edge map and the stroke information is extracted. Finally, text regions are generated and filtered according to line features.

In [5], the authors review previous work on text line segmentation in handwritten documents, which can generally be categorized as bottom-up or top-down. In the bottom-up approach, connected-component-based methods merge neighboring connected components using simple rules on the geometric relationship between neighboring blocks. Projection-based methods, on the other hand, may be among the most successful top-down algorithms for machine-printed documents: the gap between two neighboring text lines in such documents is typically significant, so the text lines are easily separable.
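The projection-based idea for machine-printed pages can be illustrated with a minimal sketch. It is not the implementation of any of the cited papers. It assumes a binarized page stored as a list of rows in which 1 marks an ink pixel, and the function names and the `min_ink` parameter are hypothetical. Rows whose ink count falls below `min_ink` are treated as gaps between text lines.

```python
from typing import List, Tuple


def horizontal_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    A text line is a maximal run of consecutive rows whose ink count
    is at least min_ink (an assumed, tunable threshold).
    """
    lines = []
    start = None
    for y, ink in enumerate(horizontal_profile(image)):
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # a gap row closes the current line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(image)))
    return lines


# Toy page: two text lines separated by two blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (5, 6)]
```

The sketch relies entirely on blank rows between lines, which is why it suits machine-printed text with clear inter-line gaps.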
However, these projection-based methods cannot be used directly on handwritten documents unless the gaps between lines are significant or the handwritten lines are straight. After a brief description of the characteristics of text line structures in handwritten documents, the review in [5] describes the challenges in text line segmentation, such as line fluctuation, line proximity and writing fragmentation. It then reviews the different approaches to segmenting a handwritten document into text lines and proposes a taxonomy.

Optical character recognition [6] of cursive scripts presents a number of challenging problems in both the segmentation and recognition processes in different languages, including Persian. To overcome these problems, the authors of [6] use a newly developed Persian word segmentation method together with a recognition-based segmentation technique. The method is robust as well as flexible, and it increases the system's tolerance to font variations. Implementation results on a comprehensive database show a high degree of accuracy which, according to the authors, meets the requirements for commercial use. Extended with suitable pre- and post-processing, the method offers a simple and fast framework for developing a full OCR system.

Character segmentation [7] is an important pre-processing step for text recognition. The size and shape of characters generally play an important role in the process of segmentation. For any optical character recognition (OCR) system, the presence of touching characters in printed as well as handwritten documents drastically decreases both segmentation accuracy and recognition rate. Because the size and shape of characters in handwritten documents cannot be controlled, segmenting a handwritten document is very difficult. The authors of [7] propose algorithms to segment handwritten text, including touching characters; the algorithms were implemented and showed encouraging results, with a reasonable improvement in segmenting touching handwritten characters in Gurmukhi script. The method first detects the lines in a scanned document handwritten in Gurmukhi script, then finds the words in each detected line, and finally uses the detected words to segment the characters in each word. With the line detection algorithm, most lines were found correctly, but some were not. The correctly detected lines were passed to the word detection algorithm. Here the results were good, but when the parts of a word were not joined properly, a part was sometimes detected as a separate word. The locations of the detected words were then used to segment the characters. Segmentation was good at some points, but at others it did not meet expectations.

Handwriting recognition [8] has attracted voluminous research in recent times.
The segmentation and recognition of characters from handwritten scripts incurs considerable overhead. Almost all existing handwritten character recognition techniques use a neural network approach, which requires a lot of pre-processing, so solving these problems with neural networks is a tedious task. The authors of [8] propose a novel solution for character recognition in Tamil, the official language of the south Indian state of Tamil Nadu. Following the pre-processing steps of segmentation, normalization and feature extraction, the approach uses octal graph conversion to recognize off-line handwritten Tamil characters, which improves the slant correction. The graph tries to represent the basic form of a letter independently of the style of writing. The written characters are recognized by using the weights of the graphs and appropriate feature matching against predefined characters.

III. DOCUMENT RECOGNITION SYSTEM

Numerous studies have been carried out on document recognition systems. This section surveys document recognition techniques that were proposed earlier in the literature.

Input: The aim of a Document Recognition System (DRS) is to process the image of a scanned document page containing characters and render the information in a form suitable for modification and manipulation. It is a process of converting scanned images back into the original text.

Pre-processing: The scanned image is pre-processed to prepare it for the later stages. Pre-processing increases the accuracy of the recognition algorithms by enhancing some features and eliminating some inconsistencies. The raw input from the digitizer typically contains noise due to erratic hand movements and inaccuracies in the digitization of the actual input. Original documents are often dirty owing to smearing and smudging of the text and to aging. In some cases, documents are of very poor quality because ink has seeped through from the other side of the page and the paper and ink have degraded. Pre-processing is mainly concerned with reducing these kinds of noise and variability in the input. The number and type of pre-processing algorithms applied to the scanned image depend on many factors, such as paper quality, the resolution of the scanned image, the amount of skew in the image and the layout of the text. Pre-processing operates at the image-to-image transformation level: it compensates for a poor-quality original and/or poor-quality scanning. The pre-processing operations performed prior to recognition are thresholding, skeletonization, line segmentation, character segmentation, slant removal and normalization. The image is then ready for segmentation.

Segmentation: After the images are scanned, the segmentation process begins. Segmentation plays a vital role in a DRS and is an important pre-processing step for document recognition. The size and shape of characters generally play a vital role in the process of segmentation.
For any DRS, however, handwritten documents drastically reduce both segmentation accuracy and recognition rate. Because the size and shape of characters in handwritten documents cannot be controlled, segmenting handwritten text is very difficult. Many techniques (algorithms) have been implemented and have shown encouraging results. Segmentation may be classified into:
a) Sentence segmentation
b) Line segmentation
c) Word segmentation
d) Character segmentation
e) Topic segmentation

Word segmentation is the most critical pre-processing step for any handwritten document recognition/retrieval system. Different approaches are used to separate a line of unconstrained handwritten text (text written in a natural manner) into words. When the writing style is unconstrained, recognition of individual components may be unreliable, so they must be grouped together into word hypotheses before recognition algorithms can be used. Line segmentation, word segmentation and character segmentation are the most critical pre-processing steps for any handwritten document recognition/retrieval task. The goal is to extract all the word images from a full page of a handwritten document. This matters for two reasons. First, word recognition methods for handwriting fall into two categories, segmentation-based and non-segmentation-based, and both need to work on pre-extracted word images. Second, content-based image retrieval techniques, such as word spotting, also require all the word images in the documents to be properly pre-segmented. Wrongly segmented word images will cause most techniques in a handwritten document recognition/retrieval system to fail.

The purpose of this step is to produce a sequence of tentative character segmentation lines. Although these lines do not necessarily segment the characters of a word correctly, their location and number directly affect the accuracy. It is important to note that this step only produces a sequence of fragments; the segmentation of characters is confirmed at the classification stage.

Feature Extraction: After an image has been segmented into regions, it is ready to enter the next stage, feature extraction. The end result of image acquisition, pre-processing, segmentation and fragmentation is a matrix of numbers that represents a character fragment in some way. In this method, structural features of each character fragment are extracted. Feature extraction is the problem of extracting from the raw data the information that is most relevant for classification, in the sense of minimizing the within-class pattern variability while enhancing the between-class pattern variability. Feature extraction consists of three steps: measuring the extreme coordinates, fitting the character into a grid, and digitizing the character. The handwritten character is captured by its extreme coordinates from left/right and top/bottom and is subdivided into a rectangular grid of specific rows and columns.
The algorithm automatically adjusts the size of the grid and its cells according to the dimensions of the character. It then searches for character pixels in every box of the grid. Boxes in which character pixels are found are marked On and the rest are marked Off.

Classification: Classification is carried out at the final stage to recognize the character. It assigns an input character to one of many pre-specified classes on the basis of the extracted features and their analysis. Neural network classifiers exhibit powerful discriminative properties and have been used in handwriting recognition, particularly for digits, isolated characters and words in small vocabularies. To recognize characters, sub-words are first fragmented into a sequence of character fragments. Each fragment is numbered from right to left, left to right, top to bottom, and bottom to top. During the recognition process, the first fragment is fed into the feature extraction stage to determine the concentrated codes. These codes are then passed to any suitable neural network approach to find the best match.
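The three feature extraction steps described above (extreme coordinates, grid fitting, digitization into On/Off boxes) can be sketched as follows. This is an illustrative reading of the description, not code from any cited work. It assumes a binary character image stored as a list of rows with 1 for a character pixel, and the function names and default grid dimensions are hypothetical choices.

```python
from typing import List, Tuple


def bounding_box(image: List[List[int]]) -> Tuple[int, int, int, int]:
    """Extreme coordinates (top, bottom, left, right), inclusive, of the character pixels."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for row in image for x, value in enumerate(row) if value]
    if not rows:
        raise ValueError("image contains no character pixels")
    return rows[0], rows[-1], min(cols), max(cols)


def grid_features(image: List[List[int]], grid_rows: int = 3, grid_cols: int = 3) -> List[List[int]]:
    """Digitize a character into a grid of On (1) / Off (0) boxes.

    The grid is stretched over the character's bounding box, so the box
    size adapts to the dimensions of the character.
    """
    top, bottom, left, right = bounding_box(image)
    height = bottom - top + 1
    width = right - left + 1
    grid = [[0] * grid_cols for _ in range(grid_rows)]
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            if image[y][x]:
                # Map the pixel to the grid box that contains it.
                r = (y - top) * grid_rows // height
                c = (x - left) * grid_cols // width
                grid[r][c] = 1
    return grid


# Toy "L"-shaped character with a blank margin around it.
glyph = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(grid_features(glyph, grid_rows=2, grid_cols=3))  # [[1, 0, 0], [1, 1, 1]]
```

Flattening the resulting grid gives a fixed-length binary vector, which is one plausible form for the codes passed to a classifier.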

Some of the methods and algorithms used include, but are not limited to, the following:
(a) The histogram segmentation method [1] eliminates the influence of noise by determining the grayscale thresholds of a character. A DBSCAN-based [2] algorithm is proposed to ensure that all connected parts are clustered into one region.
(b) The Haar Discrete Wavelet Transform (DWT) [3] is employed to detect the edges of candidate text regions. A bottom-up technique is used for document segmentation and layout analysis.
(c) Projection-based [5], smearing, grouping, Hough-based, graph-based and Cut Text Minimization (CTM) approaches have been used.
(d) A horizontal histogram [6] is used to separate lines and a vertical histogram to separate sub-words. An Artificial Neural Network (ANN) is used for classification in this method.
(e) Techniques such as projection-based [8], smearing, grouping, Hough-based, graph-based and cut text minimization (CTM) have been used. Normalization is used for slant removal, width normalization and vertical scaling. The octal graph algorithm is used to convert the image into an octal graph.

IV. CONCLUSION
This paper has presented a note on document recognition systems.

REFERENCES
[1] Fang Liu, Xiang Peng, Tianjiang Wang, and Songfeng Lu, "A Density-based Approach for Text Extraction in Images," IEEE, 2008.
[2] M. Ester, H.P. Kriegel, J. Sander et al., "A Density Based algorithm for discovering clusters in large spatial databases," Proceedings of International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 226-231.
[3] S. Audithan and RM. Chandrasekaran, "Document Text Extraction from Document Images Using Haar Discrete Wavelet Transform," European Journal of Scientific Research, ISSN 1450-216X, Vol. 36, No. 4 (2009), pp. 502-512, EuroJournals Publishing, Inc.
[4] Tinku Acharya and Po-Yueh Chen, "VLSI implementation of DWT architecture," Proceedings of the IEEE International Symposium on Circuits and Systems, 2: 272-275, 1998.
[5] Zaidi Razak, Khansa Zulkiflee, Mohd Yamani Idna Idris, Emran Mohd Tamil, Mohd Noorzaily Mohamed Noor, Rosli Salleh, Mohd Yaakob @ Zulkifli Mohd Yusof, and Mashkuri Yaacob, "Off-line Handwriting Text Line Segmentation: A Review," IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 7, July 2008.
[6] Mohsen Zand, Ahmadreza Naghsh Nilchi, and S. Amirhassan Monadjemi, "Recognition-based Segmentation in Persian Character Recognition," International Journal of Computer and Information Science and Engineering 2:1, 2008.
[7] Rajiv K. Sharma and Amardeep Singh, "Segmentation of Handwritten Text in Gurmukhi Script," International Journal of Computer Science and Security, Volume 2, Issue 3.
[8] R. Jagadeesh Kannan and R. Prabhakar, "An Improved Handwritten Tamil Character Recognition System using Octal Graph," Journal of Computer Science 4 (7): 509-516, 2008.