
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 10, OCTOBER 2011, ISSN 2151-9617 HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING WWW.JOURNALOFCOMPUTING.ORG

Exploring textural analysis for historical documents characterization


Anis Kricha, Najoua Essoukri Ben Amara
Abstract: In this paper, we propose a new approach to characterize images of historical documents based on textural analysis. Our contribution fits into the wider context of the digitization of historical documents provided by the National Library of Tunisia. It mainly explores the correlation between the different sub-bands of the wavelet decomposition. However, the choice of a set of non-redundant and relevant primitives remains delicate. The features chosen in our approach stem from a study based on both the ReliefF algorithm, which eliminates irrelevant features, and factor analysis, which excludes redundant features. The whole system is evaluated on a set of historical documents, both to separate text from graphics and to separate different types of alphabet (Arabic, Latin and Hebrew).

Index Terms: Correlation, wavelet transform, ReliefF, factor analysis, historical documents.

1 INTRODUCTION

The digitization of historical documents is defined as a set of steps that usually starts with the digitization itself and then includes several stages, essentially pre-processing, segmentation, analysis and recognition. Each step involves several problems, each with a specific degree of difficulty [14, 16, 26, 27]. In our research, we address the problem of characterizing images drawn from historical documents using textural analysis. This phase is crucial for several applications such as physical and logical segmentation, Optical Character Recognition, indexing and content-based image retrieval. Indeed, it can extract information that describes the document without prior knowledge of its semantics or structural content. The content of a document image can thus be viewed as a set of different textures: text, background, graphics, title, etc. Texture characterization methods can be divided into four families [21].

1.1 Statistical methods


One of the defining qualities of texture is the spatial distribution of grey values. Statistical approaches do not attempt to understand explicitly the hierarchical structure of the texture. Instead, they represent the texture indirectly by the non-deterministic properties that govern the distributions and relationships between the grey levels of an image. The higher the order of the statistic, the higher the number of pixels (1 to N) involved. Among the most widely used methods is the co-occurrence matrix [11].

1.2 Geometrical methods


These approaches, also known as structural methods, describe texture by defining primitives (textons) and the placement rules that connect them. These methods are suited to the study of periodic or regular textures [22].

1.3 Model based methods


Model-based texture analysis methods rely on the construction of an image model, which can be used not only to describe texture but also to synthesize it. The model parameters capture the essential perceived qualities of texture. Markov models and fractals are the two most widely used tools to model and generate textures [11].

1.4 Signal processing methods


Psychophysical research has given evidence that the human brain performs a frequency analysis of the image [17]. Because of this property, frequency methods are well suited to texture characterization. In recent years, frequency methods have been widely used to characterize documents; some approaches extract features from the Fourier transform or the wavelet transform, or by means of Gabor filters [18, 19, 20].

Faced with this multitude of characterization methods, the choice of features is still an open problem; in fact, the characterization is generally non-generic and strongly dependent on the application studied [11]. The remainder of this paper is organized as follows: in Section 2, we propose our approach to characterize historical documents based on the wavelet transform. An analysis step is developed in Section 3 to eliminate irrelevant and dependent features. In Section 4, we describe the experiments carried out to validate our approach.

A. Kricha is with the National Engineering School of Monastir, University of Monastir, Tunisia, UR: SAGE (Systèmes Avancés en Génie Electrique). N. Essoukri Ben Amara is with the National Engineering School of Sousse, University of Sousse, Tunisia, UR: SAGE (Systèmes Avancés en Génie Electrique).


2 PROPOSED CHARACTERIZATION APPROACH


The wavelet transform has been used in several studies to characterize texture [5, 23, 24, 25]. It has proven effective for the characterization and analysis of document images thanks to its ability to localize frequencies in space at several resolutions [11]. Several features have been proposed in the literature to characterize different kinds of images, and document images in particular. In [5], the authors proposed 384 features derived from the relationship between the approximation and each detail sub-band. In [2], the authors calculated the energy at each resolution to characterize the texture of a document. In [9], the authors also used the wavelet transform at different resolutions, with moments calculated from the high-resolution coefficients. In [13], the authors characterized the texture by the energy of each sub-band at each level of decomposition; the computed features were then fed to a fuzzy C-means classifier. On the one hand, many researchers characterize texture with primitives derived from first-order statistics, which ignores a very important property of the wavelet transform: the spatial localization of frequencies. Figure 1 illustrates this limitation of first-order wavelet features. It shows three images that have the same first-order characteristics (mean and standard deviation of each sub-band) in the wavelet domain. Images 1.b and 1.c are generated by permuting the positions of the wavelet coefficients of image 1.a.
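The permutation argument can be checked numerically. The following is a minimal sketch, not the authors' experiment: it assumes a hypothetical 8x8 striped array standing in for one sub-band, shuffles its coefficients, and compares first-order statistics with a simple horizontal lag-1 autocorrelation. The names `first_order_stats`, `lag1_autocorr` and `permute` are illustrative.

```python
import random
import statistics


def first_order_stats(band):
    """Mean and standard deviation of all coefficients (position-independent)."""
    flat = [v for row in band for v in row]
    return statistics.fmean(flat), statistics.pstdev(flat)


def lag1_autocorr(band):
    """Normalized autocorrelation between horizontally adjacent coefficients."""
    flat = [v for row in band for v in row]
    mean = statistics.fmean(flat)
    energy = sum((v - mean) ** 2 for v in flat)
    pairs = sum((row[c] - mean) * (row[c + 1] - mean)
                for row in band for c in range(len(row) - 1))
    return pairs / energy


def permute(band, seed=0):
    """Shuffle coefficient positions while keeping the same set of values."""
    flat = [v for row in band for v in row]
    random.Random(seed).shuffle(flat)
    width = len(band[0])
    return [flat[r * width:(r + 1) * width] for r in range(len(band))]


# Hypothetical regular "sub-band": rows alternate between +1 and -1.
band = [[1.0 if r % 2 == 0 else -1.0] * 8 for r in range(8)]
shuffled = permute(band)

print(first_order_stats(band), first_order_stats(shuffled))  # identical pairs
print(lag1_autocorr(band))      # 0.875 for the striped band
print(lag1_autocorr(shuffled))  # much closer to 0 after shuffling
```

The first-order statistics cannot distinguish the two arrays, whereas the autocorrelation value changes, which is the motivation for correlation-based features.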

Fig. 1 Images (a, b, c) having the same first order features extracted from the wavelet transform.

On the other hand, the majority of features used in the literature are extracted from each sub-band separately, which ignores the correlation between sub-bands at the same level of decomposition. Indeed, different studies show a relationship between sub-bands of the same level [12]. This relationship has proven essential for texture reconstruction and especially for texture characterization [5]. To remedy these shortcomings and exploit the dependency between sub-bands of the same level of decomposition, while taking the spatial location of frequencies into account, we decided to exploit the correlation between the approximation and detail sub-bands and the autocorrelation of each sub-band.

2.1 Correlation


Recall that the correlation of two images measures their mutual dependence; the autocorrelation of an image then measures its internal dependencies. For example, a strongly regular image will have a high autocorrelation. The autocorrelation of each sub-band gives an idea of the different patterns present in the analysis window. As there is some dependency between the different sub-bands, we decided to exploit the correlation between the approximation and detail sub-bands at each level of decomposition.

2.2 Analysis windows


Considering the great irregularity and variety of historical documents, the choice of analysis window is critical and can be decisive for the characterization results. The analysis can be pixel-wise or block-wise [6, 15]. In the first case, each pixel is assigned an analysis window, which yields overlapping blocks; in the second case, the image is divided into non-overlapping blocks. The first approach is more accurate but too expensive in terms of memory and computing time [13]. In this work, we opt for a block-wise strategy, given the good compromise it offers between speed and accuracy of the segmentation results. However, the size of the analysis window remains an open issue. In the literature, several researchers have proposed a multi-resolution approach to overcome the window-size problem. In his thesis, N. Journet proposed to choose between two multi-resolution techniques [7]: (i) fix the window size and resize the image; (ii) keep the original image size and vary the window size. In this work, we chose to keep the original image size and to vary the window size. We therefore perform a multi-resolution analysis by selecting concentric windows of increasing size, centered on the studied block, with sizes 16x16, 32x32, 64x64 and 128x128. Figure 2 illustrates an example of concentric windows and shows that every window provides additional information on the studied texture.

Fig. 2 Example of concentric windows of increasing size.

2.3 Proposed features


To characterize the images of historical documents, we propose primitives derived from the correlation function computed in each analysis window. To describe each correlation matrix, we retain 4 characteristics: the mean, the standard deviation, the moment of order 3 and the moment of order 4. Since the analysis windows are relatively small, we limit our study to 3 levels of decomposition.


For each window $j = 1..4$ and each level of decomposition $i = 1..3$, we calculate the matrices $(X^{k}_{i,j})_{i=1..3,\ j=1..4,\ k=1..7}$:
$X^{1}_{i,j} = (AH)_{i,j}$: correlation between the approximation and the horizontal details.
$X^{2}_{i,j} = (AV)_{i,j}$: correlation between the approximation and the vertical details.
$X^{3}_{i,j} = (AD)_{i,j}$: correlation between the approximation and the diagonal details.
$X^{4}_{i,j} = (AA)_{i,j}$: autocorrelation of the approximation.
$X^{5}_{i,j} = (HH)_{i,j}$: autocorrelation of the horizontal details.
$X^{6}_{i,j} = (VV)_{i,j}$: autocorrelation of the vertical details.
$X^{7}_{i,j} = (DD)_{i,j}$: autocorrelation of the diagonal details.
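The seven matrices and their four descriptors can be sketched as follows for a single analysis window. This is only an illustration under stated assumptions: the paper does not name the wavelet, so a Haar transform (averaging variant) is assumed; the correlation is taken as the full, unnormalized 2D cross-correlation; and the moments of order 3 and 4 are assumed to be central moments. The function names and the H/V naming convention are hypothetical.

```python
import statistics


def haar_level(img):
    """One level of a 2D Haar transform (assumed wavelet): returns A, H, V, D."""
    A, H, V, D = [], [], [], []
    for r in range(0, len(img) - 1, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, len(img[0]) - 1, 2):
            p, q = img[r][c], img[r][c + 1]
            s, t = img[r + 1][c], img[r + 1][c + 1]
            a_row.append((p + q + s + t) / 4)  # approximation
            h_row.append((p + q - s - t) / 4)  # differences between rows
            v_row.append((p - q + s - t) / 4)  # differences between columns
            d_row.append((p - q - s + t) / 4)  # diagonal differences
        A.append(a_row); H.append(h_row); V.append(v_row); D.append(d_row)
    return A, H, V, D


def cross_correlation(x, y):
    """Full 2D cross-correlation of two equally sized matrices."""
    rows, cols = len(x), len(x[0])
    out = []
    for dr in range(-(rows - 1), rows):
        line = []
        for dc in range(-(cols - 1), cols):
            total = 0.0
            for r in range(max(0, -dr), min(rows, rows - dr)):
                for c in range(max(0, -dc), min(cols, cols - dc)):
                    total += x[r][c] * y[r + dr][c + dc]
            line.append(total)
        out.append(line)
    return out


def four_stats(matrix):
    """Mean, standard deviation, and central moments of order 3 and 4."""
    flat = [v for row in matrix for v in row]
    mean = statistics.fmean(flat)
    m2, m3, m4 = (statistics.fmean((v - mean) ** p for v in flat) for p in (2, 3, 4))
    return mean, m2 ** 0.5, m3, m4


def window_features(window, levels=3):
    """28 features per level (7 matrices x 4 statistics) for one window."""
    features, approx = [], window
    for _ in range(levels):
        A, H, V, D = haar_level(approx)
        for x, y in [(A, H), (A, V), (A, D), (A, A), (H, H), (V, V), (D, D)]:
            features.extend(four_stats(cross_correlation(x, y)))
        approx = A  # the next level decomposes the approximation
    return features


# Hypothetical 16x16 window with arbitrary grey values.
window = [[float((r * c) % 7) for c in range(16)] for r in range(16)]
print(len(window_features(window)))  # 84 = 3 levels x 7 matrices x 4 statistics
```

A pure-Python cross-correlation is slow for the larger windows; an FFT-based implementation would be a natural replacement in practice.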

Thus, each block is associated with four windows, each window with 3 decomposition levels, each level with 7 matrices, and 4 features are extracted from each matrix, that is, 4 x 3 x 7 x 4 = 336 features in total. Figure 3 illustrates the proposed features and the methodology adopted to retain the most discriminating primitives. In what follows, we call main features the indices corresponding to one analysis window at a single level of decomposition, i.e. 28 features (7 matrices x 4 statistics).
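The feature count and the window geometry can be made concrete with a short sketch. The naming scheme and the clipping of windows at the image border are assumptions made for illustration; the paper does not specify how windows near the border are handled.

```python
from itertools import product

WINDOW_SIZES = (16, 32, 64, 128)
LEVELS = (1, 2, 3)
MATRICES = ("AH", "AV", "AD", "AA", "HH", "VV", "DD")
STATISTICS = ("mean", "std", "moment3", "moment4")


def feature_names():
    """One hypothetical label per feature: 4 windows x 3 levels x 7 matrices x 4 statistics."""
    return [f"{stat}({mat}, level={lvl}, window={size}x{size})"
            for size, lvl, mat, stat in product(WINDOW_SIZES, LEVELS, MATRICES, STATISTICS)]


def concentric_windows(center_row, center_col, height, width):
    """Bounds (top, left, bottom, right) of the windows centered on a block.

    Windows are clipped to the image (an assumption, not stated in the paper).
    """
    bounds = []
    for size in WINDOW_SIZES:
        half = size // 2
        top, left = max(0, center_row - half), max(0, center_col - half)
        bottom, right = min(height, center_row + half), min(width, center_col + half)
        bounds.append((top, left, bottom, right))
    return bounds


names = feature_names()
main = [n for n in names if "level=1, window=16x16)" in n]
print(len(names), len(main))  # 336 28
print(concentric_windows(200, 200, 1000, 1000))
```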

Fig. 3 Scheme of the proposed approach. [Flow diagram: database of labelled historical documents; features extraction (wavelet transform of each window $j = 1..4$ into sub-bands $Y_{i,j}$, with $Y$ in {A, H, V, D} and scale $i = 1..3$; correlation matrices $X^{1}_{i,j} = Cor(A_{i,j}, H_{i,j})$, $X^{2}_{i,j} = Cor(A_{i,j}, V_{i,j})$, $X^{3}_{i,j} = Cor(A_{i,j}, D_{i,j})$, $X^{4}_{i,j} = Cor(A_{i,j}, A_{i,j})$, $X^{5}_{i,j} = Cor(H_{i,j}, H_{i,j})$, $X^{6}_{i,j} = Cor(V_{i,j}, V_{i,j})$, $X^{7}_{i,j} = Cor(D_{i,j}, D_{i,j})$; average, standard deviation, moment of order 3 and moment of order 4 of each $X^{k}_{i,j}$, giving the features $(C_h)_{h=1..336}$); features selection (selection of relevant features with ReliefF using the labels, then factor analysis); retained features.]


3 ANALYSIS OF PROPOSED CHARACTERISTICS


After computing the feature matrices, selecting the relevant primitives is essential, not only to reduce storage space but also to improve classification performance. In what follows, we propose a data-analysis methodology that can be applied to any type of feature in order to select the most relevant indices for a given application. First, we assess the relevance of the features and retain the most discriminating ones. Then we study the dependence between the features so as to retain only those that are both independent and relevant.
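The two-stage methodology (relevance first, then dependence) can be sketched on toy data. This is a simplified stand-in, not the method used in the paper: the relevance stage uses the basic Relief update with a single nearest hit and miss rather than full ReliefF, and the dependence stage uses a plain pairwise Pearson correlation filter in place of factor analysis. The data, thresholds and function names are hypothetical.

```python
import statistics


def relief_weights(samples, labels):
    """Basic Relief: a feature gains weight when it differs on the nearest
    sample of another class and loses weight when it differs on the nearest
    sample of the same class. Needs at least two samples per class."""
    n_feat = len(samples[0])
    span = [(max(s[f] for s in samples) - min(s[f] for s in samples)) or 1.0
            for f in range(n_feat)]

    def diff(a, b, f):
        return abs(a[f] - b[f]) / span[f]

    def distance(a, b):
        return sum(diff(a, b, f) for f in range(n_feat))

    weights = [0.0] * n_feat
    for i, (x, y) in enumerate(zip(samples, labels)):
        others = [j for j in range(len(samples)) if j != i]
        hit = min((j for j in others if labels[j] == y),
                  key=lambda j: distance(x, samples[j]))
        miss = min((j for j in others if labels[j] != y),
                   key=lambda j: distance(x, samples[j]))
        for f in range(n_feat):
            weights[f] += (diff(x, samples[miss], f) - diff(x, samples[hit], f)) / len(samples)
    return weights


def pearson(u, v):
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    norm = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return cov / norm if norm else 0.0


def select_features(samples, labels, min_weight=0.0, max_corr=0.95):
    """Keep relevant features, then drop those correlated with a kept one."""
    weights = relief_weights(samples, labels)
    relevant = sorted((f for f, w in enumerate(weights) if w > min_weight),
                      key=lambda f: (-weights[f], f))
    kept = []
    for f in relevant:
        column = [s[f] for s in samples]
        if all(abs(pearson(column, [s[g] for s in samples])) < max_corr for g in kept):
            kept.append(f)
    return kept, weights


# Toy data: feature 0 separates the classes, feature 1 is a linear copy of
# feature 0 (redundant), feature 2 is unrelated to the labels (irrelevant).
samples = [(0.0, 1.0, 0.3), (0.1, 1.2, 0.9), (0.2, 1.4, 0.1),
           (0.9, 2.8, 0.2), (1.0, 3.0, 0.8), (0.8, 2.6, 0.5)]
labels = ["text", "text", "text", "graphic", "graphic", "graphic"]

kept, weights = select_features(samples, labels)
print(weights)  # feature 2 gets a negative weight
print(kept)     # a single feature among the redundant pair 0 and 1
```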

3.1 Selection of relevant features


In the literature there are two types of selection algorithms: supervised and unsupervised [1, 3, 8]. Since some features may degrade the results, we chose a supervised selection method. The principle is to select the subset of features that best discriminates the different classes of data. We chose the ReliefF algorithm [10]. Rather than merely eliminating redundancy, this algorithm defines a relevance criterion, which measures the ability of each feature to group data with the same label and to discriminate data with different labels. The weight of a feature is larger when data from the same class have similar values and data from different classes are well separated. Figure 4 illustrates the relevance of the 28 main features in every window, for all levels of decomposition, for an image from our test database.

Fig. 4 Features pertinence in each window. [Four panels, one per window (16x16, 32x32, 64x64 and 128x128); x-axis: Features (0 to 30); y-axis: Pertinence (0 to 100); one curve per level of decomposition (level 1, level 2, level 3).]

The analysis of the degrees of relevance of the different features shows that the moments of order 3 and order 4 are not discriminating, for any analysis window or any level of decomposition. The mean and standard deviation of the autocorrelation of the approximation provide a high degree of relevance in all the windows and at all levels of decomposition. We can also notice in Figure 4 that the features keep almost the same relevance in all the windows and at each level of decomposition, which shows the invariance and robustness of the proposed features. After this step, we keep only 168 primitives (the mean and standard deviation of each matrix).

3.2 Study of the correlation between features


Once the irrelevant features are eliminated, we study the dependence between the remaining features. To this end, we conduct a factor analysis of the features using maximum likelihood estimation [4]. Figure 5 shows that, for a given window, the averages of (AH) are correlated across all levels of decomposition. However, the averages of (AH) are not correlated between different windows. Based on these results (generalized to all the features and applied to multiple images), we retain only one level of decomposition. Following this study, we keep only 56 features.

Fig. 5 Features correlation. [Two factor-loading plots, Component 1 versus Component 2, both axes from -1 to 1. In each plot, C1 to C3 are moy(AH, j, 16x16) for j = 1..3; C4 to C6 are moy(AH, j, 128x128) in one plot and moy(AH, j, 32x32) in the other. C1, C2 and C3 form one group and C4, C5 and C6 another.]

After this study, we retain the following features: the standard deviation and the average of $(X^{k}_{i,j})_{i=1,\ j=1..4,\ k=1..7}$, with $X^{1}_{i,j} = corr(A_{i,j}, H_{i,j})$, $X^{2}_{i,j} = corr(A_{i,j}, V_{i,j})$, $X^{3}_{i,j} = corr(A_{i,j}, D_{i,j})$, $X^{4}_{i,j} = corr(A_{i,j}, A_{i,j})$, $X^{5}_{i,j} = corr(H_{i,j}, H_{i,j})$, $X^{6}_{i,j} = corr(V_{i,j}, V_{i,j})$ and $X^{7}_{i,j} = corr(D_{i,j}, D_{i,j})$, where $i$ is the level of decomposition and $j = 1, 2, 3, 4$ corresponds to the windows 16x16, 32x32, 64x64 and 128x128.

4 RESULTS AND EXPERIMENTS

The characterization step allowed us to select the discriminating primitives that separate text from graphics. A classification applied in the feature space then identifies the different classes present in a document. Two types of classifiers can be used: supervised and unsupervised. In our work, we chose an unsupervised k-means classification. We applied it to about twenty historical documents from the National Library of Tunisia to separate text, background and image (Figure 6). Although the classifier is simple and unsupervised, it separated the three major classes (text, background, image) present in the studied documents, which supports the relevance of the retained features.
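As an illustration of the classification stage, here is a minimal k-means sketch applied to hypothetical 2D feature vectors, one per block. The initialization (farthest-point seeding starting from the first sample) is an assumption chosen to keep the example deterministic; the paper does not describe how the centers were initialized, and the real feature space has 56 dimensions.

```python
import statistics


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding (assumed): start from the first point, then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iterations=50):
    centers = farthest_point_init(points, k)
    for _ in range(iterations):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the previous center if a cluster becomes empty.
            updated.append(tuple(map(statistics.fmean, zip(*members))) if members
                           else centers[c])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers


# Hypothetical block features forming three groups (text, background, image).
blocks = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
          (10.0, 0.2), (9.8, 0.0), (10.1, 0.1)]
labels, centers = kmeans(blocks, k=3)
print(labels)  # blocks of the same group share a label
```

Mapping each cluster index to a class name (text, background or image) still requires inspection, since k-means is unsupervised.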


Fig. 6 Historical documents segmentation using selected features (original images and classification results).

In order to evaluate our approach on other types of images, we used the retained features to separate textures from the Brodatz database; the results are illustrated in Figure 7.

Fig. 7 Texture separation using selected features.

We have also verified the relevance of our features for separating different kinds of text. Figure 8 shows the segmentation result for a document with mixed text including Latin, Arabic and Hebrew characters.

Fig. 8 Performance of the selected features for the separation of Arabic, Latin and Hebrew texts (a. Document; b. Result of classification).

Figure 8 confirms the good discriminatory power of the proposed features. Indeed, by considering each alphabet as a texture, we could separate the Arabic, Latin and Hebrew texts. Figures 9 and 10 show the robustness of our features to noise: after adding noise (Gaussian or salt and pepper), the classification results remain virtually unchanged.

Fig. 9 Robustness of the selected features for the separation of Arabic, Latin and Hebrew texts in the presence of Gaussian noise (a. Document with Gaussian noise; b. Result of classification).

Fig. 10 Robustness of the selected features for the separation of Arabic, Latin and Hebrew texts in the presence of salt and pepper noise (a. Document with salt and pepper noise; b. Result of classification).

5 CONCLUSION
In this work, we focused mainly on the characterization of images of historical documents with a view to physical segmentation. First, we proposed features derived from the wavelet transform that make the most of the properties of this technique. Then we studied the relevance and the dependence of the features through the ReliefF algorithm and factor analysis, respectively. A classification stage then identifies the different components of a document image using a simple classifier such as k-means. To eliminate classification noise, we proposed a post-processing stage based on operators from mathematical morphology. The proposed method was applied to about twenty images of historical documents from the National Library of Tunisia, and the results are considered encouraging. We are currently studying the relevance of these features in other applications, such as font identification in a multi-font OCR context, writer identification, and the separation of multi-alphabet text.

ACKNOWLEDGMENT
We would like to thank the National Library of Tunisia for providing images of historical documents.

REFERENCES
[1] Avrim L. Blum, Pat Langley, "Selection of relevant features and examples in machine learning", Artificial Intelligence, special issue on relevance, vol. 97, pp. 245-271, 1997.
[2] P. Gupta, N. Vohra, S. Chaudhury, S. Joshi, "Wavelet based page segmentation", Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), pp. 51-56, 2000.
[3] I. Guyon, A. Elisseeff, "An introduction to feature and variable selection", Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[4] Harry H. Harman, Modern Factor Analysis, 3rd Edition, University of Chicago Press, Chicago, 1976.
[5] P. S. Hiremath, S. Shivashankar, "Wavelet based co-occurrence histogram features for texture classification with an application to script identification in a document image", Pattern Recognition Letters, vol. 29, issue 9, pp. 1182-1189, 2008.
[6] Jia Li, James Ze Wang, Gio Wiederhold, "Classification of textured and non-textured images using region segmentation", International Conference on Image Processing, pp. 754-757, September 2000.
[7] N. Journet, "Analyse d'images de documents anciens : une approche texture", PhD thesis, Université de La Rochelle, 2006.
[8] R. Kohavi, G. H. John, "Wrappers for feature subset selection", Artificial Intelligence, special issue on relevance, vol. 97, issue 1-2, pp. 273-324, December 1997.
[9] H. Li, D. Doermann, O. Kia, "Automatic Text Detection and Tracking in Digital Video", IEEE Transactions on Image Processing, vol. 9, issue 1, pp. 147-156, January 2000.
[10] Marko Robnik-Sikonja, Igor Kononenko, "Theoretical and Empirical Analysis of ReliefF and RReliefF", Machine Learning, vol. 53, issue 1-2, pp. 23-69, October-November 2003.
[11] Mihran Tuceryan, Anil K. Jain, "Texture Analysis", The Handbook of Pattern Recognition and Computer Vision (2nd Edition), C. H. Chen, L. F. Pau, P. S. P. Wang (eds.), World Scientific Publishing Co., pp. 207-248, 1998.
[12] Javier Portilla, E. P. Simoncelli, "A parametric texture model based on joint statistics of complex wavelet coefficients", International Journal of Computer Vision, vol. 40, issue 1, pp. 49-70, 2000.
[13] Sahbani Mahersia Hela, Hamrouni Kamel, "Segmentation d'images texturées par transformées en ondelettes et classification C-moyenne floue", International Conference: Sciences of Electronic, Technologies of Information and Telecommunications (SETIT), March 2005.
[14] Mohamed Kricha, "Contribution à l'indexation des documents anciens", Master's thesis in Systèmes Intelligents et Communicants, Ecole Nationale d'Ingénieurs de Sousse, 2011.
[15] Ying Liu, "Texture segmentation based on features in wavelet domain for image retrieval", Visual Communications and Image Processing, Lugano, Switzerland, vol. 5150, issue 3, pp. 2026-2034, July 2003.
[16] Amina Ghardallou Lasmar, "Prétraitement des documents anciens arabes par ondelettes", Master's thesis, Faculté des Sciences de Monastir, 2005-2006.
[17] F. W. Campbell, J. G. Robson, "Application of Fourier Analysis to the Visibility of Gratings", Journal of Physiology, pp. 551-566, 1968.
[18] W. Chan, G. Coghill, "Text analysis using local energy", Pattern Recognition, 34(12), pp. 2523-2532, December 2001.
[19] S. S. Raju, P. B. Pati, A. G. Ramakrishnan, "Text localization and extraction from complex color images", ISVC, vol 380, pp. 486-493, 2005.
[20] J. Li, R. M. Gray, "Context-based multiscale classification of document images using wavelet coefficient distributions", IEEE Transactions on Image Processing, vol. 9, pp. 1604-1616, September 2000.
[21] M. Tuceryan, A. K. Jain, "Texture analysis", The Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 207-248, 1998.
[22] Bela Julesz, "Textons, the elements of texture perception, and their interaction", Nature, no. 290, pp. 91-97, 12 March 1981.
[23] Yongsheng Dong, Jinwen Ma, "Wavelet-Based Image Texture Classification Using Local Energy Histograms", IEEE Signal Processing Letters, vol. 18, no. 4, pp. 247-250, April 2011.
[24] M. R. Islam, Yin Chai Wang, A. Khatun, "Partial iris image recognition using wavelet based texture features", International Conference on Intelligent and Advanced Systems (ICIAS), pp. 1-6, 15-17 June 2010.
[25] L. Xavier, B. M. I. Thusnavis, D. R. W. Newton, "Content based image retrieval using textural features based on pyramid-structure wavelet transform", International Conference on Electronics Computer Technology (ICECT), pp. 79-83, 2011.
[26] S. S. El-etriby, K. M. Amin, "Detection and correction of deformed historical arabic manuscripts", International Conference on Computer and Communication Engineering (ICCCE), 11-12 May 2010.
[27] Anis Kricha, Amina Ghardallou Lasmar, Najoua Essoukri Ben Amara, "Exploration des Ondelettes en Prétraitement des Documents Anciens", Colloque International Francophone sur l'Ecrit et le Document (CIFED), Fribourg, Switzerland, 18-21 September 2006.

Anis Kricha is a PhD student at the Department of Electrical Engineering of the National Engineering School of Tunis, University El Manar, Tunisia. He received his Electrical Engineer diploma from the National Engineering School of Tunis, University El Manar, Tunisia. Since 2006, he has been working as an assistant professor in the Department of Electrical Engineering of the National Engineering School of Monastir, University of Monastir, Tunisia. His research interests are in the areas of image processing.

Najoua Essoukri Ben Amara received the B.Sc., M.S., Ph.D. and HDR degrees in Electrical Engineering, Signal Processing, System Analysis and Pattern Recognition from the National Engineering School of Tunis, University El Manar, Tunisia, in 1985, 1986, 1999 and 2004, respectively. From 1985 to 1989, she was a researcher at the Regional Institute of Informatics Sciences and Telecommunications, Tunis, Tunisia. In September 1989, she joined the Electrical Engineering Department of the National Engineering School of Monastir, University of Monastir, Tunisia, as an assistant professor. She became a senior lecturer in July 2004 and a Professor of Electrical Engineering in October 2009 at the National School of Engineers of Sousse (ENISo), University of Sousse, Tunisia. Between July 2008 and July 2011, she was the Director of the ENISo. Her research interests mainly include pattern recognition applied to Arabic documents, ancient image processing, compression, watermarking, segmentation, biometrics, and the use of stochastic models and hybrid approaches in the above domains. She is the head of the research unit SAGE: Systèmes Avancés en Génie Electrique.
