
Proceedings of the 11th IASTED International Conference Computer Graphics and Imaging (CGIM 2010) February 17 - 19, 2010

Innsbruck, Austria

TEXT EXTRACTION USING DOCUMENT STRUCTURE FEATURES AND SUPPORT VECTOR MACHINES
Konstantinos Zagoris and Nikos Papamarkos
Image Processing and Multimedia Laboratory
Department of Electrical & Computer Engineering
Democritus University of Thrace, 67100 Xanthi, Greece
kzagoris@ee.duth.gr, papamark@ee.duth.gr

ABSTRACT
In order to successfully locate and retrieve document images such as technical articles and newspapers, a text localization technique must be employed. The proposed method detects and extracts homogeneous text areas in document images, irrespective of font type and size, by using connected component analysis to detect blocks of foreground objects. Next, a descriptor that consists of a set of structural features is extracted from the merged blocks and used as input to a trained Support Vector Machine (SVM). Finally, the output of the SVM classifies the block as text or not.

KEY WORDS
Page Layout, Text Extraction, Support Vector Machines, Document Structure Elements, Connected Component Analysis

1. Introduction
Nowadays, there is an abundance of document images such as technical articles, business letters, faxes and newspapers, driven by the ease of creating them with scanners or digital cameras. Therefore, the need to locate and retrieve this kind of document quickly arises. In order to exploit them successfully, a text localization technique must be employed to determine the location of the text inside them. In the literature, there are top-down approaches [1,2] employing recursive algorithms to segment the whole page into small regions. On the other hand, the most frequently used approaches are the bottom-up methods, which segment the page into small regions and then merge them based on some criteria. Such a method is proposed by Strouthopoulos et al. [3]: a technique that automatically detects and extracts text in mixed-type color documents using a combination of an adaptive color reduction technique and a page layout analysis approach. Jain et al. [4] presented a geometric layout analysis of technical journal pages using connected component extraction to efficiently implement page segmentation and region identification.

The proposed method detects and extracts homogeneous text in document images, irrespective of font type and size, by using connected component analysis to detect the objects, document structure features to construct a descriptor, and Support Vector Machines to tag the appropriate objects as text. The proposed technique has the ability to adapt to the peculiarities of each document image database, since the features are adjusted to it each time.

2. Text Extraction Algorithm

Figure 1 depicts the overall structure of the proposed algorithm. After applying preprocessing techniques (binarization etc.), the initial blocks are identified using the Connected Component Analysis (CCA) method. Then, these blocks are expanded and merged to model lines of text. Next, a descriptor that consists of a set of structural features (determined by a procedure which we call Feature Standard Deviation Analysis of Structure Elements) is extracted from the merged blocks and used as input to a trained Support Vector Machine (SVM). Finally, the output of the SVM classifies the block as text or not.

Figure 1. The steps of the proposed text-extraction algorithm: (1) locate, merge and extract blocks; (2) extract the features from the blocks; (3) find the blocks which contain text using Support Vector Machines; (4) extract or locate the text blocks and present them to the user.
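As a rough sketch of this pipeline, the following Python fragment strings the stages together. It is purely illustrative: the stage implementations are assumed to be supplied as callables corresponding to Sections 3-5, and none of the names below come from the paper.

```python
# Illustrative outline of the Figure 1 pipeline; the three stage
# implementations are passed in as callables (see Sections 3-5).
def find_text_blocks(image, extract_blocks, extract_descriptor, svm):
    text_blocks = []
    for block in extract_blocks(image):          # locate, merge and extract blocks
        descriptor = extract_descriptor(block)   # 32-element DSE descriptor
        if svm.predict([descriptor])[0] == 1:    # +1 = text, -1 = non-text
            text_blocks.append(block)
    return text_blocks
```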

3. Block Extraction

The primary goal of the block extraction method is to detect and extract the objects of the document. This is accomplished by using the Connected Components Labeling and Filtering technique. After applying a binarization method appropriate for the document (e.g. Otsu [5]) and identifying all the Connected Components (CCs) (Figure 2(b)), the most common height of the CCs ($CC_h$) is calculated. The next step is the expansion of the left and right sides of the CCs by 50% of $CC_h$, as Figure 2(c) depicts. Finally, to locate the text lines, the overlapping CCs are merged (Figure 2(d)).

Figure 2. The block extraction steps: (a) the original document, (b) the connected components, (c) the expanded connected components, (d) the final blocks after the merging of the connected components.
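The block extraction stage can be sketched as follows. This is an illustrative Python rendering (using numpy/scipy) rather than the paper's C# implementation; the function name, the use of bounding boxes and the details of the merging loop are our own choices.

```python
# Sketch of Section 3: CC labeling, expansion by 50% of the most common
# CC height, and merging of overlapping boxes. Assumes a binarized page
# with foreground pixels equal to 1. Names are illustrative.
import numpy as np
from scipy import ndimage

def extract_blocks(binary_page):
    labels, count = ndimage.label(binary_page)        # Figure 2(b)
    slices = ndimage.find_objects(labels)
    if not slices:
        return []

    # Most common CC height, CC_h.
    heights = [s[0].stop - s[0].start for s in slices]
    cc_h = int(np.bincount(heights).argmax())

    # Expand each CC box to the left and right by 50% of CC_h (Figure 2(c)).
    pad = cc_h // 2
    boxes = [[s[0].start, max(0, s[1].start - pad), s[0].stop, s[1].stop + pad]
             for s in slices]                          # [top, left, bottom, right]

    # Merge overlapping boxes into text-line blocks (Figure 2(d)).
    merged = True
    while merged:
        merged = False
        out = []
        for b in boxes:
            for m in out:
                overlap = not (b[2] <= m[0] or m[2] <= b[0] or
                               b[3] <= m[1] or m[3] <= b[1])
                if overlap:
                    m[0], m[1] = min(m[0], b[0]), min(m[1], b[1])
                    m[2], m[3] = max(m[2], b[2]), max(m[3], b[3])
                    merged = True
                    break
            else:
                out.append(list(b))
        boxes = out
    return boxes
```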

4. Creation of the Block Descriptor

The next step involves the feature extraction stage of the blocks. The extracted features construct a descriptor of each block that maximizes the separability between the blocks. The spatial features are constructed from a number of suitable Document Structure Elements (DSEs) that the blocks contain.

Analytically, a DSE is any 3x3 binary block, as Figure 3 depicts, so there are in total $2^9 = 512$ DSEs. An integer label $L_j$ is assigned to each DSE as

$$L_j = \sum_{i=0}^{8} b_{ji}\, 2^i$$

with the bit positions ordered inside the 3x3 window as shown in Figure 3(a).

Figure 3. (a) The pixel order of the DSEs:
b8 b5 b2
b7 b4 b1
b6 b3 b0
(b) The DSE of $L_{142}$.

For a block $B$, if $C$ is the number of its columns and $R$ the number of its rows, then the block $B$ contains $(C-2)(R-2)$ DSEs. The initial descriptor of the block $B$ is the histogram $H(L_j)$ of the DSEs that the block contains, calculated by the following equation:

$$H(L_n) = \begin{cases} H(L_n) + 1, & \text{if } L_j = L_n \\ H(L_n), & \text{if } L_j \neq L_n \end{cases} \qquad (1)$$

for $n = 1, 2, \ldots, (C-2)(R-2)$, where $L_j, L_n \in [1, 510]$. Note that the DSEs 0 and 511 are removed because they correspond to pure background and pure document objects, respectively. According to the above analysis, a normalized histogram is constructed by the following equation:

$$X(L_n) = \frac{H(L_n)}{\sum_{i=1}^{510} H(L_i)} \qquad (2)$$

where $X(L_n)$ is a vector of 510 elements. Next, a feature reduction algorithm is applied which reduces the number of features from 510 to 32. We call this algorithm Feature Standard Deviation Analysis of Structure Elements (FSDASE).
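As an illustration, a minimal Python sketch of the DSE histogram of Eqs. (1)-(2) is given below; the function name, the numpy-based implementation and the assumption that the block is a binary array with foreground pixels equal to 1 are ours, not the paper's code.

```python
# Sketch of the 510-bin normalized DSE histogram X(L_n) of a binary block.
import numpy as np

# Bit position of each pixel in the 3x3 window, laid out as in Figure 3(a):
#   b8 b5 b2
#   b7 b4 b1
#   b6 b3 b0
BIT_ORDER = np.array([[8, 5, 2],
                      [7, 4, 1],
                      [6, 3, 0]])

def dse_descriptor(block):
    rows, cols = block.shape
    hist = np.zeros(512)
    for r in range(rows - 2):                       # (R-2)(C-2) windows
        for c in range(cols - 2):
            window = block[r:r + 3, c:c + 3]
            label = int(np.sum(window * (2 ** BIT_ORDER)))
            hist[label] += 1                        # Eq. (1)
    hist = hist[1:511]                              # drop DSEs 0 and 511
    total = hist.sum()
    return hist / total if total > 0 else hist      # Eq. (2)
```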


If there are T text blocks and P non-text blocks in the training set, then the stages of the FSDASE algorithm are:

1. Find the Standard Deviation (SD) $SDXT(L_n)$ of $X(L_n)$ over the T text blocks, for each DSE $L_n$.
2. Do the same for the P non-text blocks: find the SD $SDXP(L_n)$ of $X(L_n)$ for each DSE $L_n$.
3. Normalize $SDXT(L_n)$ and $SDXP(L_n)$:
$$SDXT(L_n) = \frac{SDXT(L_n)}{\sum_{i=1}^{510} SDXT(L_i)}, \qquad SDXP(L_n) = \frac{SDXP(L_n)}{\sum_{i=1}^{510} SDXP(L_i)}$$
4. Then define the vector $O(L_n)$ as:
$$O(L_n) = \left| SDXT(L_n) - SDXP(L_n) \right|$$
5. Finally, take those 32 DSEs that correspond to the 32 maximum values of $O(L_n)$.

The goal of the FSDASE is to find those DSEs that have maximum SD for the text blocks and minimum SD for the non-text blocks, or the opposite. Obviously, a training dataset is required to determine the optimal DSEs. However, this does not cause a problem, because such a dataset is already required for the training of the SVMs. Therefore, the final block descriptor is a vector with 32 elements, corresponding to the frequencies of the 32 selected DSEs that the block contains. This descriptor is used to train the Support Vector Machines. Note that the descriptor has the ability to adapt to the demands of each set of document images. The advantages are twofold: a noisy document has a different set of DSEs than a clean document, and, if more computational power is available, the descriptor can easily be increased in size above 32.
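A compact sketch of the FSDASE selection follows, assuming the 510-element vectors $X(L_n)$ of the training blocks are already available as arrays (e.g. from the histogram sketch above); the names and array layout are our own choices, not the paper's implementation.

```python
# Sketch of FSDASE feature selection over precomputed DSE histograms.
import numpy as np

def fsdase_select(text_hists, nontext_hists, k=32):
    """text_hists: (T, 510) array, nontext_hists: (P, 510) array.
    Returns the indices of the k selected DSEs."""
    sdxt = np.std(text_hists, axis=0)        # step 1: SD over text blocks
    sdxp = np.std(nontext_hists, axis=0)     # step 2: SD over non-text blocks
    sdxt = sdxt / sdxt.sum()                 # step 3: normalize
    sdxp = sdxp / sdxp.sum()
    o = np.abs(sdxt - sdxp)                  # step 4: separation measure O(L_n)
    return np.argsort(o)[::-1][:k]           # step 5: top-k DSEs

# The final 32-element block descriptor is then X(L_n) restricted to the
# selected DSEs: descriptor = hist[selected_indices].
```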

5. Support Vector Machines


The Support Vector Machines (SVMs), introduced in 1992 [6,7], are based on statistical learning theory and have recently been applied to many and various classification problems. Let $D$ be a given training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in [0,1]$, $y_i \in \{-1,+1\}$, $i \in [1,n]$, where $x_i$ is the $i$-th input vector and $y_i$ is the label corresponding to $x_i$. The original linear SVM classifier satisfies the following conditions:

$$\begin{cases} w^T x_i + b \ge +1, & \text{when } y_i = +1 \\ w^T x_i + b \le -1, & \text{when } y_i = -1 \end{cases} \;\Longleftrightarrow\; y_i \left( w^T x_i + b \right) - 1 \ge 0 \qquad (3)$$

If the training data are not linearly separable (as in our case), they are mapped from the input space X to a feature space F using the kernel method, which is defined as:

$$k(x, x') = \phi(x)^T \phi(x') \qquad (4)$$

where $\phi(x)$ is the feature map mapping the input space to a high-dimensional feature space in which the training data become linearly separable. The most commonly used kernels are the Polynomial kernel $(x^T x' + 1)^p$, the Radial Basis Function $\exp\left(-\gamma \| x - x' \|^2\right)$ and the Sigmoid kernel $\tanh\left(\kappa\, x^T x'\right)$. Our experiments showed the Radial Basis Function to be the most robust kernel.

If $w = \sum_{i=1}^{n} \alpha_i x_i$, the SVM conditions of Eq. (3) transform to:

$$y_i \left( \sum_{j=1}^{n} \alpha_j\, k(x, x_j) + b \right) - 1 \ge 0 \qquad (5)$$

In practice, the classifier must sometimes misclassify some data points (for instance, to overcome the overfitting problem). This is achieved using the slack variables $\xi_i \ge 0$, so Eq. (5) is changed to:

$$y_i \left( \sum_{j=1}^{n} \alpha_j\, k(x, x_j) + b \right) - 1 + \xi_i \ge 0 \qquad (6)$$

Finally, the maximum margin classifier is calculated by solving the following constrained optimization problem, expressed in terms of the variables $\alpha_i$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j\, x_i^T x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C \qquad (7)$$

The constant $C > 0$ defines the trade-off between the training error and the margin. The training data $x_i$ for which $\alpha_i > 0$ are called support vectors. One of the difficulties of the SVM lies in finding the correct parameters to train it. In our case, there are two parameters: $C$ from the maximum margin classifier and $\gamma$ from the Radial Basis Function kernel. The goal is to find the values of $C$ and $\gamma$ for which the classifier can accurately predict unknown data. This is achieved through a cross-validation procedure using a grid search over the two parameters. For our document image database the values are calculated as $C = 8$, $\gamma = 8$. Finally, the output of the SVM classifies each block as text or not (Figure 4).
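For illustration, the grid-search training described above can be reproduced with scikit-learn instead of the paper's LIBSVM/C# setup; the descriptor arrays, the parameter grid and all names below are our own assumptions, not the paper's code.

```python
# Sketch of RBF-SVM training with a cross-validated grid search over C and
# gamma. Uses scikit-learn; the training data here are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: (num_blocks, 32) array of block descriptors, y: +1 for text, -1 for non-text
X = np.random.rand(200, 32)          # placeholder training descriptors
y = np.random.choice([-1, 1], 200)   # placeholder labels

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [2 ** k for k in range(-3, 7)],
                "gamma": [2 ** k for k in range(-3, 7)]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)             # the paper reports C = 8, gamma = 8
svm = grid.best_estimator_           # used to classify new blocks as text or not
```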

6. Implementation and Evaluation


The proposed technique is implemented in a visual environment (Figure 5) with the help of Visual Studio 2008 and LIBSVM [8], and is based on the Microsoft .NET Framework 3.5. The programming language used is C#/XAML. The program can be downloaded from the following web address:
http://orpheus.ee.duth.gr/download/TextFinder_1.0.9.zip

To evaluate the proposed text extraction technique, the Document Image Database from the University of Oulu [9,10] is employed, which includes 198 documents of various types. In our experiments we used a set of 48 article documents. These document images contain a mixture of text and pictures. From this database, five images were selected and their extracted blocks were used to determine the proper DSEs and to serve as training samples for the SVMs. The overall results are presented in Table 1.

Acknowledgement
This work is co-funded by the project "PENED 200303679".

References
[1] R. Ingold and D. Armangil, A Top-Down Document Analysis Method for Logical Structure Recognition, Proc. First Int'l Conf. on Document Analysis and Recognition, Saint-Malo, France, 1991, 41-49.
[2] Y. Chenevoy and A. Belaid, Hypothesis Management for Structured Document Recognition, Proc. First Int'l Conf. on Document Analysis and Recognition, Saint-Malo, France, 1991, 121-129.
[3] C. Strouthopoulos, N. Papamarkos, and A. E. Atsalakis, Text extraction in complex colour documents, Pattern Recognition, 35, 2002, 1743-1758.
[4] A. K. Jain and B. Yu, Document Representation and Its Application to Page Decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), March 1998.
[5] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Systems, Man, and Cybernetics, 9, 1979, 62-66.
[6] B. E. Boser, I. Guyon, and V. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM Press, 1992, 144-152.
[7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, 20, 1995, 273-297.
[8] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[9] University of Oulu, Finland, Document Image Database, http://www.ee.oulu.fi/research/imag/document/.
[10] J. Sauvola and H. Kauniskangas, MediaTeam Document Database II, a CD-ROM collection of document images, University of Oulu, Finland, 1999.

Figure 4. The final text blocks extracted by the SVM.

Table 1. Experimental results.

Document Images   Blocks   Success Rate
       48         25958      98.453%
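For perspective (our own back-of-the-envelope calculation, not stated in the paper), a 98.453% success rate over 25,958 blocks corresponds to roughly 25,556 correctly classified blocks, i.e. about 400 misclassified blocks.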

7. Conclusion
In this paper a bottom-up text localization technique is proposed that detects and extracts homogeneous text from document images. A connected component analysis technique is applied to detect the objects of the document. Then a powerful descriptor is extracted based on structural elements. Finally, a trained SVM classifies the objects as text or non-text. The proposed technique is implemented in a visual environment and the experimental results are very promising.

Figure 5. The application of the proposed method.

