
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 5, September - October (2013), pp. 224-231
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

IDENTIFICATION OF DEVANAGARI SCRIPT FROM IMAGE DOCUMENT


SATISH R. DAMADE1, K. P. ADHIYA2 and RANJANA S. ZINJORE3

1 Computer Engineering, KBSs College of Engineering & Technology, North Maharashtra Knowledge city, Jalgaon.
2 Computer Engineering, SSBTs College of Engineering & Technology, Bambhori, Jalgaon.
3 Computer Application, KCESs Institute of Management and Research, Jalgaon.

ABSTRACT

Text that appears in images carries useful and important information. Conventional Optical Character Recognition (OCR) technology is restricted to text printed against clean backgrounds and cannot handle text printed against shaded or textured backgrounds or embedded in images. Extracting text from images is therefore necessary, and it is particularly helpful to blind and visually impaired people when a voice synthesizer is attached to the system. In this paper, we present a methodology for extracting text from printed image documents and then identifying Devanagari script (Hindi language) in the extracted text. First, a morphological approach is used to extract the text from image documents. The resulting text image is passed to Optical Character Recognition for identification. Projection profiles are used for segmentation, followed by a visual discriminating approach for feature extraction. Finally, a heuristic search is used for classification. The text extraction results of the proposed method are compared with an edge based approach and a connected component approach with projection profiles. The comparison, based on precision and recall rates, shows that the proposed algorithm works well.

Keywords: Area, Bounding Box, Canny edge detector, Heuristic Search, Projection Profile, Visual Discriminating feature.

I. INTRODUCTION

In recent years, the escalating use of physical documents has driven progress towards the creation of electronic documents to facilitate easy communication and storage. Nowadays, information is increasingly enriched by multimedia components containing images and video in addition to text. The extraction of text in an image is a classical

problem in the computer vision research area. Text extraction from images and video finds many applications in document processing, vehicle license plate detection, mobile robot navigation, object identification, text in WWW images, content based image retrieval from image databases and video content analysis [1]. There are basically three kinds of images: document images, scene text images and caption text images. Two approaches are mainly used for extracting text from these images: the region based approach and the texture based approach [2].

After the text has been extracted from an image document, script identification plays a vital role in the design of an Optical Character Recognition system. Script identification is a key step in document image analysis, especially in a multi-script environment, while language identification is required to distinguish the different languages written in the same script. In India, script identification facilitates many important applications such as sorting images, selecting the appropriate script-specific text understanding system and searching online archives of document images containing a particular script [3].

In this paper we chose Hindi for identification because Hindi is the third most spoken language in the world after Chinese and English, with approximately 500 million people all over the world who speak and write it. In addition, many forms and applications are available in a combination of a state official language and English. We therefore use printed images containing both Hindi and English text and identify the Hindi text in such image documents. Hindi is written in the Devanagari script, which consists of 12 vowels and 34 consonants; its characters carry a horizontal line at the top called the Shirorekha. The English alphabet is a Latin-based alphabet consisting of 26 letters, each in upper and lower case.
The structure of the English alphabet contains more vertical and slanted strokes.

II. CHALLENGES AND RELATED WORK

Text extraction from complex images is one of the most useful and difficult applications of pattern recognition and computer vision. Identifying the script of extracted image text is also a very difficult task because characters of different scripts can have similar shapes.

Leon et al. [4] presented a technique for detecting caption text for indexing purposes. Caption text objects are detected by combining texture and geometric features, and textured areas are detected using wavelet analysis. Zhong et al. [5] located text in complex images such as compact disc covers, book covers and traffic scenes. To find the text location, the authors exploited the higher spatial variance of the image intensity along horizontal text lines. Wu et al. [6] proposed a four-step system that automatically detects and extracts text in images: texture segmentation, in which the image is filtered using a bank of linear filters, followed by stroke extraction, drawing a rectangular box around the text and finally detecting the text. Samarabandu and Liu [7] used an edge based approach for extracting text, generating a feature map from three important properties of edges: edge strength, density and variance of orientation. Neha Gupta [8] proposed an image segmentation method for text extraction based on the 2D Discrete Wavelet Transform, which decomposes the image into four sub-components. The edges of three sub-bands are then fused to create a candidate text region, followed by a projection profile approach, and the text is extracted based on a threshold.

C. V. Jawahar et al. [9] proposed a technique to distinguish between Hindi and Telugu script. For Hindi, segmentation involves the removal of the shirorekha. For Telugu, component extraction implies the separation of connected components. S. Basavaraj Patil [10] presented an approach for identification of Hindi, English and Kannada language scripts. For feature extraction, the input image is dilated using 3x3 masks in the horizontal, vertical, left diagonal and right diagonal directions, the average pixel distribution of the resulting images is computed, and a neural network is used for classification. Pal and Chaudhuri [11] proposed an automatic technique for separating the text lines of 12 Indian scripts (English, Devanagari, Bangla, Gujarati, Tamil, Kashmiri, Malayalam, Oriya, Punjabi, Telugu and Urdu) using ten triplets formed by grouping English and Devanagari with

any one of the other scripts. Santanu Choudhuri et al. [12] proposed a method for identification of Indian languages that combines a Gabor filter based technique with a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu.

III. PROPOSED ARCHITECTURE AND METHODOLOGY

3.1 Proposed Architecture: The proposed architecture for identification of Hindi script from image documents is shown in Fig. 1.

3.2 Methodology

3.2.1) Preprocessing:
i) In this step the input image in RGB color space (Fig. 2) is converted into a gray scale image. The gray scale image is then converted into a binary image using Otsu's thresholding.
[Figure 1 flowchart: Input (color image with complex background) -> Preprocessing -> Detection of text region -> Display the text region and remove the non-text region (resultant image) -> Improve the quality of the text by subtracting the resultant image with the input image -> Final image containing text -> Segmentation of text into lines and words -> Pass the extracted text for script identification -> Feature extraction -> Heuristic search -> Output (identified Hindi script)]

Figure 1: Proposed Architecture for Identification of Hindi script from Image Document

ii) Edge detection and morphological dilation: The Canny method is used for edge extraction. The edge image is then dilated using a square structuring element. The dilation of A by B is defined as:

$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \emptyset \,\}$ ----------------- (1.1)

where $\hat{B}$ is the reflection of B about its origin and $(\hat{B})_z$ is its translation by z.

iii) Hole filling: Hole filling is determined by the selection of marker and mask images.

In the standard formulation, the marker image F is obtained from the input image I as:

$F(x, y) = \begin{cases} 1 - I(x, y) & \text{if } (x, y) \text{ is on the border of } I \\ 0 & \text{otherwise} \end{cases}$ ------------------- (1.2)
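The preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: the luminance weights in `rgb_to_gray`, the default 3x3 structuring element (the source does not state the size) and all function names are assumptions, and Canny edge detection is left out for brevity. Binary images are lists of rows with 1 for foreground.

```python
from collections import deque


def rgb_to_gray(pixel):
    """Convert one (R, G, B) pixel to a gray level (assumed BT.601 weights)."""
    r, g, b = pixel
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def otsu_threshold(gray_values, levels=256):
    """Return the threshold t that maximises the between-class variance.

    Class 0 holds gray levels <= t, class 1 holds gray levels > t.
    """
    hist = [0] * levels
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def dilate(image, size=3):
    """Binary dilation with a size x size square structuring element."""
    rows, cols = len(image), len(image[0])
    r = size // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if image[y][x]:
                # Stamp the (symmetric) structuring element around the pixel.
                for ny in range(max(0, y - r), min(rows, y + r + 1)):
                    for nx in range(max(0, x - r), min(cols, x + r + 1)):
                        out[ny][nx] = 1
    return out


def fill_holes(image):
    """Fill background regions that are not connected to the image border."""
    rows, cols = len(image), len(image[0])
    reached = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the flood fill with every background pixel on the border.
    for y in range(rows):
        for x in range(cols):
            on_border = y in (0, rows - 1) or x in (0, cols - 1)
            if on_border and not image[y][x]:
                reached[y][x] = True
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < rows and 0 <= nx < cols
                    and not image[ny][nx] and not reached[ny][nx]):
                reached[ny][nx] = True
                queue.append((ny, nx))
    # Anything the flood fill could not reach is foreground or a hole.
    return [[0 if reached[y][x] else 1 for x in range(cols)]
            for y in range(rows)]
```

The flood fill from the border gives the same result as reconstructing the background from a border marker, which is why no explicit marker image appears in the sketch.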

3.2.2) Detection of Text Region
i) The dilated image is labeled using bwlabel in MATLAB with 8-connectivity. To obtain measurements (Bounding Box and Area) of the image regions we used regionprops in MATLAB (Fig. 3).
ii) To extract the text regions, we compute a new-value by multiplying the height and width of the Bounding Box and then divide this new-value by the Area. By experimentation it was found that if the ratio (new-value/Area) is less than 1.78 and the height is greater than 9, then the region is a text region (the specific condition).
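The labeling and the ratio test can be illustrated with a short Python sketch. It stands in for MATLAB's bwlabel and regionprops rather than reproducing them; the function names are hypothetical, while the thresholds 1.78 and 9 are the values reported above.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for y in range(rows):
        for x in range(cols):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def is_text_region(pixels, max_ratio=1.78, min_height=9):
    """Apply the condition: (height * width) / area < 1.78 and height > 9."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1   # bounding box height
    width = max(xs) - min(xs) + 1    # bounding box width
    area = len(pixels)               # number of foreground pixels
    return (height * width) / area < max_ratio and height > min_height
```

A ratio close to 1 means the component fills most of its bounding box, which is what a dilated block of text tends to do, whereas thin diagonal or sparse structures give much larger ratios.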

Figure 2: Input color Image

Figure 3: Bounding box

3.2.3) Displaying the Text Region and Removing the Non-text Region
i) We find the connected components (CC) of the binary image using the bwconncomp function in MATLAB.
ii) We obtain the size of the dilated image and set all of its values to false to make a blank background.
iii) For each connected component that satisfies the specific condition above, its pixels are set to true using PixelIdxList.
iv) The visual quality of the resultant image is very poor. To improve it, we perform a subtraction between the resultant image and the input image (Fig. 4).
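Steps ii) to iv) can be sketched in Python as below. The function names are hypothetical, components are assumed to be given as lists of (row, col) pixels, and `keep_text_pixels` is only one plausible reading of the subtraction step (keeping the input gray levels where the mask is true), not necessarily the authors' exact operation.

```python
def build_text_mask(rows, cols, components, keep):
    """Start from an all-false image and switch on the accepted components."""
    mask = [[False] * cols for _ in range(rows)]
    for pixels, is_text in zip(components, keep):
        if is_text:
            for y, x in pixels:
                mask[y][x] = True
    return mask


def keep_text_pixels(gray, mask, background=255):
    """Copy input gray levels under the mask; paint everything else white."""
    return [[g if m else background for g, m in zip(gray_row, mask_row)]
            for gray_row, mask_row in zip(gray, mask)]
```

For example, with two components of which only the first is accepted, the mask is true on the first component alone and the output image shows the original gray levels only there.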

Figure 4: Final Result

Figure 5: Segmentation of text into line



3.2.4) Segmentation of Text into Lines and Words
i) The document image is segmented into text lines using the horizontal projection profile, computed as a row-wise sum of black pixels. We then find the valleys (minimum points) between the peaks (maximum points) of the histogram and draw a cut line at each valley across the full width of the document, as shown in Fig. 5.
ii) After line segmentation, the vertical projection profile is used for word segmentation of the bilingual (Devanagari and English) text, with a threshold value chosen so that inter-character gaps are preserved. The words obtained are thinned for feature extraction, and a bounding box is fitted to each word by locating its first left, right, top and bottom pixels. The extracted words are inverted for feature extraction, as shown in Fig. 6.
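A minimal Python sketch of projection-profile segmentation is given below. It assumes a binary image stored as a list of rows with 1 for a black (text) pixel, treats empty rows or columns as valleys, and uses a hypothetical `word_gap` parameter in place of the threshold mentioned above; thinning and inversion are not shown.

```python
def horizontal_profile(image):
    """Row-wise sum of black pixels."""
    return [sum(row) for row in image]


def vertical_profile(image):
    """Column-wise sum of black pixels."""
    return [sum(col) for col in zip(*image)]


def runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero values, end exclusive.

    Gaps of fewer than min_gap zeros are bridged, so small inter-character
    gaps do not split a word.
    """
    segments = []
    start, gap = None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                segments.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        segments.append((start, len(profile) - gap))
    return segments


def segment_words(image, word_gap=2):
    """Split the image into lines, then each line into words."""
    result = []
    for top, bottom in runs(horizontal_profile(image)):
        line = image[top:bottom]
        words = runs(vertical_profile(line), min_gap=word_gap)
        result.append(((top, bottom), words))
    return result
```

Each entry of the result pairs the row span of a line with the column spans of its words, which is enough to crop a bounding box for every word.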

Figure 6: Word segmented Image

3.2.5) Script Identification
The distinct features used for script identification are:
i) Feature 1: Top_profile and Bottom_profile: The top_profile (bottom_profile) of a text line is the set of black pixels obtained by scanning each column of the text line from the top (bottom) until the first black pixel is reached. Thus, a component of width N yields N such pixels.
ii) Feature 2: Top-max-row: The row number of the top_profile on which the maximum number of black pixels lies (black pixels with value 0 correspond to the object and white pixels with value 1 correspond to the background).
iii) Feature 3: Bottom-max-row: The row number of the bottom_profile on which the maximum number of black pixels lies (same pixel convention as for Feature 2).
iv) Feature 4: Top-horizontal-line:
(i) Obtain the top-max-row from the top-profile.
(ii) Find the components whose number of black pixels is greater than threshold1 (threshold1 = half of the height of the bounding box) and store the number of such components in the attribute horizontal-lines.
(iii) Compute the feature top-horizontal-line using equation (1.3) below:

Top-horizontal-line = (hlines * 100) / tc -----------------(1.3)

where hlines represents the number of horizontal lines and tc represents the total number of components of the top-max-row.
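The four features can be illustrated with the following Python sketch. It makes two assumptions that are not in the text: word images use 1 for ink (the text above uses 0 for black), and a "component" of the top-max-row is read as a run of consecutive columns whose top-profile pixel lies on that row. All function names are hypothetical.

```python
from collections import Counter


def top_profile(word):
    """Row of the first ink pixel in each column, scanning from the top."""
    rows = range(len(word))
    return [next((y for y in rows if word[y][x]), None)
            for x in range(len(word[0]))]


def bottom_profile(word):
    """Row of the first ink pixel in each column, scanning from the bottom."""
    rows = range(len(word) - 1, -1, -1)
    return [next((y for y in rows if word[y][x]), None)
            for x in range(len(word[0]))]


def max_row(profile):
    """Row holding the largest number of profile pixels (word must have ink)."""
    counts = Counter(y for y in profile if y is not None)
    return counts.most_common(1)[0][0]


def top_horizontal_line(word):
    """Equation (1.3): percentage of long runs on the top-max-row."""
    profile = top_profile(word)
    row = max_row(profile)
    threshold1 = len(word) / 2          # half of the bounding box height
    run_lengths, current = [], 0
    for y in profile:
        if y == row:
            current += 1
        elif current:
            run_lengths.append(current)
            current = 0
    if current:
        run_lengths.append(current)
    hlines = sum(1 for n in run_lengths if n > threshold1)
    tc = len(run_lengths)
    return hlines * 100 / tc
```

For a word with a continuous headline across its top row, the top profile lies entirely on that row, so the feature evaluates to 100; for a Latin-like word made of separate strokes the runs are short and the feature stays low.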

3.2.6) Heuristic script identification algorithm (the result is shown in Fig. 7)
Input: Pre-processed text words of Devanagari and English scripts
Output: Script type of each word (Hindi or Others)
1. Compute top-profile
2. Compute bottom-profile
3. Compute the features top-max-row, bottom-max-row and top-horizontal-line
4. Identify the script type as follows:
   If Top_max_row = Bottom_max_row OR Top_horizontal_line >= 60 then Script = Hindi else Script = Others
5. Return Script
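The decision rule in step 4 amounts to a single condition, sketched here in Python. The function and parameter names are hypothetical; the threshold of 60 is the value given in the algorithm, and the three feature values are assumed to have been computed beforehand.

```python
def identify_script(top_max_row, bottom_max_row, top_horizontal_line,
                    line_threshold=60):
    """Return "Hindi" when either heuristic condition holds, else "Others"."""
    if top_max_row == bottom_max_row or top_horizontal_line >= line_threshold:
        return "Hindi"
    return "Others"
```

For instance, a word whose top and bottom profiles peak on the same row is labeled Hindi regardless of the line feature, while a word with different peak rows and a line feature below 60 is labeled Others.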

Figure 7: Hindi Identified words

IV) RESULTS AND DISCUSSION

We compared the proposed algorithm with the edge based algorithm and the connected component algorithm using projection profiles. Precision and recall rates are used for the comparison:

$\text{Precision rate} = \dfrac{\text{correctly detected words}}{\text{correctly detected words} + \text{false positives}} \times 100$ --- (1.4)

$\text{Recall rate} = \dfrac{\text{correctly detected words}}{\text{correctly detected words} + \text{false negatives}} \times 100$ ---- (1.5)

The precision rate takes into consideration the false positives, which are non-text regions in the image that the algorithm has detected as text regions. The recall rate takes into consideration the false negatives, which are text words in the image that the algorithm has not detected. Precision and recall rates are therefore useful measures of how accurately each algorithm locates correct text regions and eliminates non-text regions, as shown in Table 1.1.
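As a worked illustration, the two rates can be computed as in the Python sketch below. The function names are hypothetical, the counts used in the example are invented for illustration and are not taken from the experiments, and returning 0 when nothing is detected is an assumed convention.

```python
def precision_rate(correct, false_positives):
    """Percentage of detected regions that are really text."""
    detected = correct + false_positives
    return 100.0 * correct / detected if detected else 0.0


def recall_rate(correct, false_negatives):
    """Percentage of text words in the image that were detected."""
    present = correct + false_negatives
    return 100.0 * correct / present if present else 0.0


# Hypothetical counts: 16 words found correctly, 1 false alarm, 4 misses.
example_precision = precision_rate(16, 1)   # 16 / 17, about 94.1 %
example_recall = recall_rate(16, 4)         # 16 / 20 = 80 %
```

A method can score high on one rate and low on the other, which is why both are reported for every image.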

Table 1.1: Comparison of Results of Three Algorithms (all rates in %)

Input     Edge based Algorithm     Connected component      Proposed Algorithm
Image                              Based Algorithm
          Precision   Recall       Precision   Recall       Precision   Recall
          Rate        Rate         Rate        Rate         Rate        Rate
Image1    76.19       80.00        83.33       99.00        94.11       80.00
Image2    68.42       76.47        77.27       99.00        86.66       76.47
Image3    61.53       100.00       66.66       100.00       72.72       100.00
Image4    0.00        0.00         53.84       100.00       63.63       100.00
Image5    83.33       99.90        66.66       80.00        90.90       98.00
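The average rates reported for the proposed algorithm follow directly from its two columns in Table 1.1. The short Python check below uses only the table values; the variable names are hypothetical.

```python
# (precision, recall) of the proposed algorithm, copied from Table 1.1
proposed = {
    "Image1": (94.11, 80.00),
    "Image2": (86.66, 76.47),
    "Image3": (72.72, 100.00),
    "Image4": (63.63, 100.00),
    "Image5": (90.90, 98.00),
}

avg_precision = sum(p for p, _ in proposed.values()) / len(proposed)
avg_recall = sum(r for _, r in proposed.values()) / len(proposed)

print(f"average precision: {avg_precision:.2f}")  # 81.60
print(f"average recall:    {avg_recall:.2f}")     # 90.89
```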

Table 1.2: Results of Identified Devanagari (Hindi) script from Images

Sr. No   Dataset Name   Hindi Words
                        Correct Classification   Misclassification   Rejection
1        Image1         100%                     0%                  0%
2        Image2         80.00%                   20.00%              0%
3        Image3         66.66%                   00.01%              33.33%
4        Image4         100%                     0%                  0%
5        Image5         100%                     0%                  0%

V) CONCLUSION

In this paper, we have presented an efficient and simple algorithm, based on connected components, for extracting text from image documents. A morphological approach is applied first, after which a ratio is computed for each region by multiplying the height and width of its bounding box and dividing the result by its area. By experimentation we fixed a threshold on this ratio to remove non-text regions from the image. The proposed algorithm was tested on five images with the same font size and obtained an average precision rate of 81.60% and an average recall rate of 90.89%. The text extracted from the image is then passed on for script identification. Using the heuristic search classifier we obtained a correct classification accuracy of 89.33%. In future we will test the algorithm on images with variable font sizes.

REFERENCES

1. Keechul Jung, Kwang In Kim and Anil K. Jain, Text information extraction in images and video: a survey, The journal of the Pattern Recognition society, Vol. 37, Issue 5, pp. 977-997, May 2004.
2. Chitrakala Gopalan and D. Manjula, Contourlet Based Approach for Text Identification and Extraction from Heterogeneous Textual Images, International Journal of Electrical and Electronics Engineering 2(8), pp. 491-500, 2008.
3. M. C. Padma and P. A. Vijaya, Script Identification from Trilingual Documents using Profile Based Features, International Journal of Computer Science and Applications, Vol. 7, No. 4, pp. 16-33, 2010.
4. Leon, M., Vilaplana, V., Gasull, A. and Marques, F., Caption text extraction for indexing purposes using a hierarchical region-based image model, 16th IEEE International Conference on Image Processing (ICIP), Nov. 2009.
5. Yu. Zhong, K. Karu and A. K. Jain, Locating text in complex color images, 3rd International Conference on Document Analysis and Recognition, Vol. 1, pp. 146-149, 1995.
6. Victor Wu, R. Manmatha and E. M. Riseman, Text Finder: an automatic system to detect and recognize text in images, IEEE Transactions on PAMI, Vol. 21, pp. 1224-1228, 1999.
7. Jagath Samarabandu and Xiaoqing Liu, An Edge-based Text Region Extraction Algorithm for Indoor Mobile Robot Navigation, World Academy of Science, Engineering and Technology, pp. 382-389, 2007.
8. Neha Gupta and V. K. Banga, Image Segmentation for Text Extraction, 2nd International Conference on Electrical, Electronics and Civil Engineering (ICEECE'2012), Singapore, April 28-29, 2012.
9. C. V. Jawahar, Pavan Kumar and S. S. Ravi Kiran, A Bilingual OCR for Hindi-Telugu Documents and its applications, Proceedings of 7th International Conference on Document Analysis and Recognition (ICDAR), Aug 2003, Vol. 1, pp. 408-412, 2003.
10. S. Basavaraj Patil and N. V. Subbareddy, Neural network based system for script identification in Indian documents, Sadhana, Academy Proceedings in Engineering Sciences, Vol. 27, Part 1, pp. 83-97, February 2002.
11. K. Roy, U. Pal and B. B. Chaudhuri, Neural Network based Word wise Handwritten Script Identification System for Indian Postal Automation, Proceedings of ICISIP, International Conference on IEEE, pp. 240-245, 2005.
12. Santanu Choudhury, Gaurav Harit, Shekar Madnani and R. B. Shet, Identification of Scripts of Indian Languages by Combining Trainable Classifiers, ICVGIP, Dec. 20-22, Bangalore, India, 2000.
13. M. Swamy Das, D. Sandhya Rani, C. R. K. Reddy and A. Govardhan, Script identification from Multilingual Telugu, Hindi and English Text Documents, International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011.
14. M. M. Kodabagi and S. R. Karjol, Script Identification from Printed Document Images using Statistical Features, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 2, 2013, pp. 607-622, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
15. R. Edbert Rajan and Dr. K. Prasadh, Spatial and Hierarchical Feature Extraction Based on Sift for Medical Images, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 308-322, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
16. M. M. Kodabagi, S. A. Angadi and Chetana R. Shivanagi, Character Recognition of Kannada Text in Scene Images using Neural Network, International Journal of Graphics and Multimedia (IJGM), Volume 4, Issue 1, 2013, pp. 9-19, ISSN Print: 0976-6448, ISSN Online: 0976-6456.
17. Patange V. V. and Prof. Deshmukh B. T., Visual Acknowledgement [O.C.R.] A Method to Identify the Printed Characters, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 108-114, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
18. M. M. Kodabagi, S. A. Angadi and Anuradha R. Pujari, Text Region Extraction from Low Resolution Display Board Images using Wavelet Features, International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 4, Issue 1, 2013, pp. 38-49, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
