
Pattern Recognition Letters 31 (2010) 742–749


Unsupervised writer adaptation of whole-word HMMs with application to word-spotting


José A. Rodríguez-Serrano a,b,*,1, Florent Perronnin a, Gemma Sánchez b, Josep Lladós b

a Textual and Visual Pattern Analysis, Xerox Research Centre Europe (XRCE), 38240 Meylan, France
b Centre de Visió per Computador (CVC), Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain

Article history: Available online 14 January 2010
Keywords: Word-spotting; Handwriting recognition; Writer adaptation; Hidden Markov model; Document analysis

Abstract

In this paper we propose a novel approach for writer adaptation in a handwritten word-spotting task. The method exploits the fact that the semi-continuous hidden Markov model separates the word model parameters into (i) a codebook of shapes and (ii) a set of word-specific parameters. Our main contribution is to employ this property to derive writer-specific word models by statistically adapting an initial universal codebook to each document. This process is unsupervised and does not even require the appearance of the keyword(s) in the searched document. Experimental results show an increase in performance when this adaptation technique is applied. To the best of our knowledge, this is the first work dealing with adaptation for word-spotting. The preliminary version of this paper obtained an IBM Best Student Paper Award at the 19th International Conference on Pattern Recognition.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Handwritten word-spotting is the pattern recognition task which consists in detecting words in handwritten documents (Rath and Manmatha, 2007). The key aspect of word-spotting is that the search is performed without a full-blown handwriting recognition (HWR) system. Indeed, a straightforward strategy for keyword detection would be to apply a HWR system to recognize all words in a document and then search for the keyword in the obtained text. However, the influential work by Manmatha et al. (1996) revealed that such a strategy is too cumbersome and that in practice an image matching approach is sufficient for certain applications.

As opposed to previous works that have employed word-spotting for historical document retrieval (Rath and Manmatha, 2007; Adamek et al., 2007; Edwards et al., 2004; Chan et al., 2006; Kolcz et al., 2000; Terasawa and Tanaka, 2007; Van der Zant et al., 2008), the application of interest in this work is the filtering of modern mail documents. Our system is applied to a flow of incoming mail documents (customer letters), where documents containing a particular keyword (such as "cancellation") have to be flagged.

In that scenario, our word-spotting system can be confronted with documents produced by a huge variety of writers (normally, one different writer per document). In HWR, a set of techniques known as writer adaptation has been proposed to improve the performance of a writer-independent system by customizing the model to the current writer. However, these techniques have not been investigated for word-spotting before. Moreover, we consider the case of whole-word models, in which case the direct application of these techniques is not possible.

The main contribution of this article is to propose a novel writer style adaptation method for word-spotting. The proposed method is unsupervised, i.e. a keyword model is adapted using only unlabeled data. Moreover, examples of the keyword are not even required to be present in the adaptation set. The rest of the introduction provides a more detailed picture of the background, existing adaptation techniques and, finally, our approach.

1.1. Handwritten word-spotting

The term spotting refers to search without explicit recognition. Originally, word-spotting was formulated for detecting words or phrases in speech messages (Myers et al., 1981; Rose and Paul, 1990; Knill and Young, 1994), and it was then extended to locate words in typed text documents (Kuo and Agazzi, 1994; Cho and Kim, 2004; Chen et al., 1993). The work of Manmatha et al. (1996) pioneered the application of word-spotting to off-line handwritten documents, as a way to automatically index historical document collections. This enabled the paradigm of search engines for

* Corresponding author. Address: School of Computing, University of Leeds, LS2 9JT Leeds, United Kingdom. E-mail addresses: scsjars@leeds.ac.uk (J.A. Rodríguez-Serrano), Florent.Perronnin@xrce.xerox.com (F. Perronnin), gemma@cvc.uab.es (G. Sánchez), josep@cvc.uab.es (J. Lladós).
1 When this work was carried out, he was a Ph.D. student at the CVC and a visitor at XRCE. He is now with the University of Leeds, UK.
0167-8655/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2010.01.007


handwritten document images (Saykol et al., 2004; Rath et al., 2003; Srihari et al., 2005). The important contribution of the work of Manmatha et al. (1996) is to deliberately avoid the use of a HWR system for the search and indexing of word images. Instead, these authors showed that an image matching approach allows easy indexing without any training requirements (as opposed to the costly training phase of a HWR system). Since then, word-spotting has been posed as a content-based image retrieval problem (Kolcz et al., 2000; Srihari et al., 2004; Adamek et al., 2007; Terasawa et al., 2005).

Assuming that the words of a document collection have been segmented, word-spotting can be formulated as an image database search application: given an exemplary image (the query), the goal is to retrieve all word images that are close enough to the example, as determined by a similarity measure of choice. This paradigm is also referred to as query-by-example (QBE). A typical similarity measure is dynamic time warping (Rath and Manmatha, 2003).

Our previous work (Rodríguez-Serrano and Perronnin, 2009) showed that the QBE accuracy can be boosted by using more than one example image for querying and combining these using whole-word HMMs. We also demonstrated that the particular choice of a semi-continuous HMM provides competitive performance with reduced training sets: with as little as a single training sample, the accuracy is higher than with a traditional DTW approach. This is thanks to the prior information incorporated by the Gaussian codebook. We take this approach as the baseline for the current work; more details about it can be found in Section 2.

The described approach involves little training compared to HWR systems based on character models, and it is much simpler to set up. In our application we estimate N whole-word models with little training material (typically of the order of 10–100 positive samples per keyword). In contrast, a sophisticated HWR system requires of the order of 10K or 100K labeled word images (El-Yacoubi et al., 1999; Knerr et al., 1998). This fact, however, impedes the use of traditional adaptation techniques, as discussed next.

1.2. Adaptation techniques in handwriting recognition

Personalization of handwriting models, known as writer adaptation, has been a subject of interest in the handwriting recognition community (Connell and Jain, 2002; Kienzle and Chellapilla, 2006; Brakensiek et al., 2001; Mouchère et al., 2007). Owing to the practical limitation of maintaining well-trained models for each possible individual, recognition systems are trained with large amounts of varied data so that an overall good performance is obtained for all styles. Adaptation techniques go a step further and modify the parameters of a writer-independent system such that the new parameters are optimal on a (relatively small) set of data of a particular writer.

We concentrate on statistical adaptation techniques, successful in speech recognition (Gauvain and Lee, 1994; Leggetter and Woodland, 1995) and handwriting recognition (Vinciarelli and Bengio, 2002; Brakensiek et al., 2001), since they are especially suited to HMM-based frameworks. Here, the speaker/writer-independent set of parameters $\theta$ is transformed into $\theta_{ad}$ using a (relatively) small amount of data from the corresponding speaker/writer. This new data set is referred to as the adaptation set, to make an explicit distinction from the training set, which refers to the set of samples used to train the writer-independent model.
Two types of adaptation techniques are common in the literature: supervised adaptation and self-adaptation. In supervised adaptation techniques, a labeled set of samples from the writer is available. The writer-independent parameters can be updated using learning techniques such as Maximum-A-Posteriori (MAP) (Gauvain and Lee, 1994) or Maximum Likelihood Linear Regression (MLLR) (Leggetter and Woodland, 1995). While this can be useful for a system which allows an enrollment phase, it is not applicable to our scenario.

Self-adaptation techniques allow adaptation in an unsupervised way. Here, a document is recognized using a HWR system, and the output can be treated as writer-dependent labeled material and exploited to retrain the model (Ball and Srihari, 2008). When using this technique, one must assume that there will be errors in the transcription. Therefore, sample selection criteria must be imposed, e.g. considering only the samples whose recognition confidence is above a threshold. Nevertheless, self-adaptation cannot be used in our case either, for a more subtle reason. The output of a HWR system on a new document is a set of characters together with their labels. One assumes that on a sufficiently long document there will be enough character-label pairs to retrain the original model accurately. However, when we apply a word detector to the typical mail document, we may have few or even zero occurrences of the search word, which is probably not enough for retraining.

1.3. Our approach

The limitations of supervised and self-adaptation approaches in the explained scenario raise the need for an unsupervised adaptation method. Given a new document image and a keyword model, the challenge is to extract writer style information from the words of the document and use it to improve the keyword model for that document. Here, word labels are not available, which means that all the information for the adaptation must come from unlabeled image data.

We propose the use of a semi-continuous hidden Markov model (SC-HMM) (Huang and Jack, 1990) to achieve such an unsupervised adaptation method. In this type of model, first the input space is clustered using a GMM, which is usually referred to as the universal background model or universal GMM. Then the means and covariances of all the states are fixed to the values of the means and covariances of this GMM. Finally, the remaining HMM parameters (transition probabilities and mixture weights) are estimated from the data.

The main contribution of this article is to exploit this parameter separation to propose a novel adaptation technique for tasks in which a whole handwritten word is modeled with an HMM, such as word-spotting. In writer-independent scenarios, a word model is obtained by training a universal GMM with many different words from many different writers. But for a new input document, the models of the keywords to spot can be adapted by replacing the universal GMM with a document-specific GMM, and leaving the word-dependent parameters unchanged. To obtain the document-specific GMM, one can apply standard adaptation techniques (such as MAP or MLLR) to the universal GMM. To the best of the authors' knowledge (and except for the preliminary conference version (Rodríguez et al., 2008)), this adaptation technique is novel, and this is also the first work to consider adaptation in a word-spotting problem.

The rest of the article is structured as follows. Section 2 describes the writer-independent scenario based on SC-HMMs which is the baseline of this work. Section 3 describes how this is modified by the proposed writer style adaptation. Section 4 explains the particular adaptation techniques that are implemented to obtain personalized GMMs. Section 5 reports the experimental validation. Finally, in Section 6 conclusions are drawn.

2. Word modeling

We use a statistical approach that builds on (Rodríguez-Serrano and Perronnin, 2009) to model handwritten words. A word image


is described as a sequence $X = x_1, x_2, \ldots, x_T$, where a frame $x_t$ is a vector of features extracted at different positions of a sliding window. Each keyword to be searched is modeled by a SC-HMM (Huang and Jack, 1990) which is trained using several sequences of the keyword. The main property of a SC-HMM is that all the states of all keywords share a common pool of Gaussians $\{p_k, k = 1, \ldots, K\}$. We denote by $\theta$ the parameters of this pool of Gaussians (mean vectors $\mu_k$ and covariance matrices $\Sigma_k$). Let $p_{n,s}$ be the emission probability in state $s$ of keyword $W_n$. Then, the probability of emitting the frame $x$ in this state can be written as:

$$p_{n,s}(x) = \sum_{k=1}^{K} w_{n,s,k} \, p_k(x). \qquad (1)$$

The mixture weights $w_{n,s,k}$ are the only word- and state-specific parameters. If the parameters $\theta$ of the Gaussian pool were learned from the training set, then the only advantage of the SC-HMM would be a significant reduction in the number of Gaussian computations (which represent the majority of the cost in a typical speech/handwriting recognition system). However, we use the SC-HMM for another reason: the common pool of Gaussians models a vocabulary of shapes. In our case, the pool of Gaussians is obtained by estimating a GMM from unlabeled word samples and discarding the mixture weights. Here, the Gaussians represent soft clusters of similar frames, so they can be interpreted as codewords of a vocabulary, as in the computer vision literature (Sivic and Zisserman, 2009). We shall refer to this GMM as the shape vocabulary, since each codeword typically represents a part of a character, a connector, a whole character, etc. Then, the SC-HMM for a keyword $W_n$ is composed of the parameters $\theta$ of the GMM and the remaining HMM parameters (the set of mixture weights and transition probabilities), denoted compactly $\lambda_n$. Note that a weight $w_{n,s,k}$ is the frequency of the codeword shape $k$ in state $s$ of word $n$.

The key point in this model is that the HMM parameters can be decomposed into:

• The parameters in $\lambda_n$, which only depend on the keyword $W_n$ but not on the writer styles, and
• The Gaussian parameters in $\theta$, which are independent of the keywords (indeed, common to all keywords) and implicitly depend on the writer styles of the samples on which the GMM was estimated.

This concept is illustrated in Fig. 1.

Fig. 1. Illustration of the proposed word models. A universal vocabulary (GMM) is built using samples from many different words and writers. The means and covariances of this GMM are retained. For each new word, a HMM is built by fixing the previous means and covariances in all the states and only optimizing the transition probabilities and mixture weights. There is a single set of means and covariances $\theta$ but a different set of transition probabilities and mixture weights $\lambda_n$ for each word.

Given a set of training sequences $\{X_j\}$ for a given keyword $W_n$, the parameters $\lambda_n$ are estimated by maximizing the log-likelihood

$$\sum_j \log p(X_j \mid \lambda_n, \theta). \qquad (2)$$

At test time, a new sequence $X$ is scored against the keyword model of $W_n$ by using a score $S_n(X)$ which is proportional to the posterior $p(W_n \mid X)$. This score can be obtained as (Rodríguez-Serrano and Perronnin, 2009)

$$S_n(X) = \frac{p(X \mid \lambda_n, \theta)}{p(X \mid \theta)}, \qquad (3)$$

where $p(X \mid \lambda_n, \theta)$ is the likelihood on the SC-HMM and $p(X \mid \theta) = \prod_{t=1}^{T} p(x_t \mid \theta)$ is the likelihood computed on the shape vocabulary GMM. A binary decision about the word class can be made by thresholding the score. In the following section we address how to modify these ideas to adapt the models to the particular writer styles.
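To make these quantities concrete, the following Python sketch (ours, not the authors' implementation) evaluates Eqs. (1)–(3) for one sequence. The `forward_loglik` argument stands for a standard left-to-right HMM forward pass, which is assumed given, and `gmm_weights` are the mixture weights of the universal vocabulary GMM.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative sketch of Eqs. (1)-(3); ours, not the authors' code.
# mu[k], sigma[k]: shared pool of K Gaussians (the parameters theta).
# w[s, k]: word-specific mixture weights (part of lambda_n).
# forward_loglik: assumed standard HMM forward pass turning the
# (T, S) emission matrix into log p(X | lambda_n, theta).

def pool_likelihoods(X, mu, sigma):
    """p_k(x_t) for every frame t and pool Gaussian k; shape (T, K)."""
    return np.stack([multivariate_normal.pdf(X, mu[k], sigma[k])
                     for k in range(len(mu))], axis=1)

def emission_probs(P, w):
    """Eq. (1): p_{n,s}(x_t) = sum_k w[s, k] * p_k(x_t); shape (T, S)."""
    return P @ w.T

def score(X, mu, sigma, w, gmm_weights, forward_loglik):
    """Eq. (3) in the log domain: log p(X|lambda_n, theta) - log p(X|theta)."""
    P = pool_likelihoods(X, mu, sigma)       # Gaussians evaluated only once
    log_num = forward_loglik(emission_probs(P, w))
    log_den = np.log(P @ gmm_weights).sum()  # p(X|theta) = prod_t p(x_t|theta)
    return log_num - log_den
```

Note that the pool Gaussians are computed once per frame and shared across all states and all keywords, which is the computational advantage of the semi-continuous parameterization mentioned above.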

3. Proposed writer style adaptation

The main contribution of this work is to provide an adaptation method that exploits the separation of the parameters $\lambda_n$ and $\theta$ explained in the previous section. The procedure is as follows. First, a universal shape vocabulary is built by training the GMM $p(\cdot \mid \theta)$ using frames from a large amount of samples of different writers. Then, for a new document, we apply a statistical adaptation technique to make the vocabulary specific to that document. The parameters $\lambda_n$ remain unchanged.

This implies modifications to the writer-independent SC-HMM (explained in the previous section) both at testing and training time. At test time, the score of a sample $X_j$, previously given by Eq. (3), is now

$$S_n(X_j) = \frac{p(X_j \mid \lambda_n, \theta_{d_j})}{p(X_j \mid \theta_{d_j})}, \qquad (4)$$

with $d_j$ indicating the index of the document $X_j$ belongs to. At training time, the log-likelihood function that must be maximized is

$$\sum_j \log p(X_j \mid \lambda_n, \theta_{d_j}), \qquad (5)$$

where the index $j$ runs over all training samples. Note the difference with Eq. (2) for the writer-independent case. For clarity, the steps of the proposed writer style adaptation are summarized below.

Training process:
• Learn the parameters $\theta$ of a universal GMM (shape vocabulary) on a large set of varied data containing many words and writing styles.
• For each document $i$, adapt the universal GMM using all the data available in the document to obtain the writer-dependent parameters $\theta_i$. The particular adaptation methods are discussed below.



• Estimate the mixture weights and transition probabilities $\lambda_n$ of the SC-HMM of word $W_n$ by maximizing the log-likelihood function in Eq. (5) over all training samples $\{X_j\}$, where $d_j$ indicates the index of the document $X_j$ belongs to.

Testing process:
• Adapt the universal GMM parameters $\theta$ to the current document $i$ using all samples of the document to obtain $\theta_i$.
• Score each sample $X$ using Eq. (4).

The proposed method is able to adapt word models only from unlabeled data, i.e. even if no image of the modeled keyword is in the adaptation set. This property is an advantage with respect to classical adaptation methods. Indeed, when applying classical adaptation techniques to whole-word HMMs, one would need a new set of samples of the modeled word for adaptation. However, as discussed in Section 1.2, in our application there would be no way of obtaining such an adaptation set.

While we believe this adaptation method is novel, we identified some common points with the work of Fink and Plötz (2006). These authors also apply adaptation to a global GMM to obtain document-dependent GMMs. However, these GMMs are not used directly for adaptation. Instead, the GMMs are employed to cluster the documents, and then self-adaptation is applied in each partition independently. There are two main differences with respect to the present work. First, they make use of self-adaptation, which is not directly applicable in our case (as discussed in Section 1.2). Secondly, while they also model writer styles with GMMs and exploit these for adaptation, this is done in an indirect way. In contrast, in our approach the writer style information is directly included in the keyword models, which we believe is more principled.

It may not be surprising that writer style parameters can be modeled with adapted GMMs. Adapted GMMs have been used successfully in speaker identification (Reynolds et al., 2000) and more recently in writer identification (Schlapbach et al., 2008). However, a fundamental difference in the present article is that we embed these adapted GMMs into the word HMMs.

One could ask why the personalized GMMs are obtained by adapting a universal background model and not, for instance, by training a new GMM only from frames of the document using traditional maximum-likelihood estimation (MLE). There are several reasons that support this choice. First, the frames contained in a single document might be insufficient to train a GMM that comprehensively models the writer style, which is usually a good reason for using adaptation. Secondly, and more importantly, adaptation imposes a correspondence between the original and adapted Gaussian components; see, for instance, Liu and Perronnin (2008). We interpret the Gaussians as codewords of a vocabulary. Therefore, Gaussian $k$ in an adapted GMM indicates how codeword $k$ of the vocabulary is expressed for that particular writer style. If one trains independent GMMs for each document using MLE, the Gaussians of different writer styles bear no relationship to each other. In that case, the Gaussians cannot be interpreted as codewords and the proposed framework would probably not be applicable. We shall come back to this point in the experimental section.
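As a hypothetical outline (all names are ours), the test-time procedure above can be summarized as follows, with `adapt` standing for any of the MAP/MLLR updates of Section 4 and `sn_score` for the score of Eq. (4):

```python
import numpy as np

# Hedged sketch of the testing process; all names are ours.
# universal_theta : means/covariances of the universal shape vocabulary
# adapt           : a MAP or MLLR update (Section 4) -> document-specific theta_d
# sn_score        : the SC-HMM score of Eq. (4), evaluated under theta_d

def spot_in_document(doc_word_frames, universal_theta, adapt, sn_score, threshold):
    """Adapt the vocabulary to one document, then flag keyword candidates."""
    all_frames = np.vstack(doc_word_frames)       # unsupervised: every frame of
    theta_d = adapt(universal_theta, all_frames)  # the page is used, no labels
    return [i for i, frames in enumerate(doc_word_frames)
            if sn_score(frames, theta_d) > threshold]
```

The key design point, per the text above, is that the adaptation data are simply all the word frames of the document, regardless of whether the keyword itself appears in it.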

4. Adaptation techniques

We have experimented with the two most popular statistical adaptation techniques for obtaining a set of source-dependent (SD) parameters $\theta_{ad}$ from a source-independent (SI) set $\theta$: MAP and MLLR.

4.1. Maximum-A-Posteriori (MAP)

In MAP adaptation (Gauvain and Lee, 1994; Reynolds et al., 2000), one assumes the existence of a prior distribution $p(\theta)$ over the SD parameters $\theta$, and the adapted parameters $\theta_{MAP}$ are given by:

$$\theta_{MAP} = \arg\max_{\theta} \, p(D \mid \theta) \, p(\theta), \qquad (6)$$

where $D$ are the SD samples. It was shown that a convenient form of the prior $p(\theta)$ is the product of a Dirichlet density (accounting for the mixture weight parameters) and a Normal-Wishart density (accounting for the Gaussian parameters). One can then apply the EM algorithm as in the case of MLE, but substituting the auxiliary function $Q(\theta, \hat{\theta})$ by $R(\theta, \hat{\theta}) = Q(\theta, \hat{\theta}) + \log p(\theta)$ (Gauvain and Lee, 1994). The maximization of this expression leads to the equations for updating the parameters. For instance, the adapted means are transformed as

$$\mu_k^{MAP} = \alpha_k \mu_k + (1 - \alpha_k) \, \mu_k^{MLE}, \qquad (7)$$

which basically expresses the new means $\mu_k^{MAP}$ as a weighted average between the SI value $\mu_k$ and the MLE estimate $\mu_k^{MLE}$ computed on the adaptation data. Since $\alpha_k$ depends on the occupancy of Gaussian $k$, MAP does not adapt Gaussian components for which frames have not been observed in the adaptation set. Therefore, a large amount of adaptation data is usually needed.
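For illustration, a minimal numpy sketch of the mean update of Eq. (7) follows. Since the paper does not spell out $\alpha_k$, we assume the common relevance-factor form of Reynolds et al. (2000); this is our sketch, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_means(X, gmm_weights, mu, sigma, r=16.0):
    """MAP update of the means, Eq. (7).

    The paper only states that alpha_k depends on the occupancy of
    Gaussian k; here we assume the relevance-factor form
    alpha_k = r / (n_k + r) of Reynolds et al. (2000)."""
    K = len(gmm_weights)
    # E-step: responsibilities gamma[t, k] of each Gaussian for each frame.
    lik = np.stack([gmm_weights[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                    for k in range(K)], axis=1)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    n_k = gamma.sum(axis=0)                                   # occupancy
    mu_mle = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # MLE means
    alpha = (r / (n_k + r))[:, None]
    # Unobserved Gaussians (n_k ~ 0) keep their writer-independent means.
    return alpha * mu + (1.0 - alpha) * mu_mle
```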

4.2. Maximum likelihood linear regression (MLLR)

In MLLR adaptation (Leggetter and Woodland, 1995), the adapted means are linearly transformed versions of the original ones:

$$\mu_k^{MLLR} = A \mu_k + b. \qquad (8)$$

After this transformation, the Gaussian $k$ becomes

$$p_k(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - A\mu_k - b)^T \Sigma_k^{-1} (x - A\mu_k - b) \right), \qquad (9)$$

where $D$ is the dimensionality of the feature vectors and $T$ indicates the transpose operator. In this case, the $\mu_k$ are the values in the writer-independent model. The EM algorithm is also employed to estimate the matrix $A$ and offset vector $b$. This reduces to solving multiple systems of linear equations (Leggetter and Woodland, 1995). Usually, the means are the only adapted parameters. Since MLLR transforms all the means even if the corresponding Gaussians have no support, it tends to give better results than MAP when little SD data is available.

4.3. Multiple transforms

Eqs. (8) and (9) assume that there is a single transform applied to all Gaussians. However, this expression can be extended to apply different transforms to different groups of Gaussians, also known as regression classes. Although more advanced methods such as regression class trees (Gales, 1996) could be employed, in this article we use a simple hierarchical clustering with the log-likelihood loss as the distance between Gaussians. Having multiple transforms increases the flexibility of the adaptation process. This may improve performance but may also lead to overfitting in the case of small adaptation sets, since the number of parameters to estimate grows linearly with the number of Gaussian clusters.
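The sketch below illustrates Eq. (8) together with a deliberately simplified way to estimate a single transform. The actual MLLR solution is the EM procedure of Leggetter and Woodland (1995), which solves per-row linear systems; this occupancy-weighted least-squares fit is our stand-in, not the method of the paper.

```python
import numpy as np

def apply_mllr(mu, A, b):
    """Eq. (8): mu_k^MLLR = A mu_k + b, applied to every pool mean."""
    return mu @ A.T + b

def fit_mllr_ls(mu, mu_mle, n_k):
    """Occupancy-weighted least-squares fit of a single transform (A, b).

    Simplified stand-in for the EM solution of Leggetter and Woodland
    (1995). For the multiple transforms of Section 4.3, the same fit
    would simply be repeated per regression class (cluster of Gaussians)."""
    K, D = mu.shape
    Xa = np.hstack([mu, np.ones((K, 1))])  # augmented means [mu_k; 1]
    W = np.sqrt(n_k)[:, None]              # weight rows by occupancy
    sol, *_ = np.linalg.lstsq(W * Xa, W * mu_mle, rcond=None)
    A, b = sol[:D].T, sol[D]
    return A, b
```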



5. Experiments

5.1. Experimental conditions

Adaptation is evaluated in the context of a word detection task. Experiments are conducted on word images extracted from 630 scanned letters (written in French), which contain unconstrained handwriting from approximately the same number of writers. Some letter examples are shown in Fig. 2.

Given a new document, the pipeline of the baseline system is as follows. First, a segmentation process produces word images from the documents. Each resulting word image is checked against each of the keywords in a cascade of classifiers, the goal of which is to discard unlikely hypotheses early and only pass the more difficult samples to the next, more accurate block. The first block of the cascade is a threshold on the width of the image, followed by a linear classifier which works on holistic features. The surviving samples are normalized with respect to slant, skew and text height. Fig. 3

shows four examples of the normalized word Monsieur. Note that, even if the images are normalized, they still present different writing styles, as one can see e.g. in the different allographs for the characters M and s. After normalization, local gradient histogram (LGH) features are extracted using a sliding window, and the samples are sent to the last block of the cascade, namely the SC-HMM classifier. The reader is referred to (Rodríguez-Serrano and Perronnin, 2009) for details about the segmentation, pruning, normalization and feature extraction processes. The experiments presented in this section focus on the SC-HMM block of the cascade, since this is where adaptation is applied.

Fig. 2. Examples of business letters of the benchmark dataset (personal and confidential information has been removed).

For these experiments, the set of letters is divided into 6 folds (0–5). Samples from fold 0 are employed to construct the universal vocabulary GMM, as explained in Section 2. Ten keywords relevant to this type of documents are selected to be searched: Monsieur, Madame, contrat, résiliation, salutation, résilier, demande, abonnement, the company name (not shown for obvious confidentiality reasons), and veuillez (these words translate into English as Sir, Madam, contract, cancellation, greeting, to cancel, request, subscription, and if you please). The number of available examples of these keywords is between 250 and 750, approximately. Folds 1–5 are used for cross-validation. The 10 SC-HMMs are trained with positive samples from 4 folds and tested on positive and negative samples of the remaining fold, as explained in Section 3. The SC-HMMs use 10 states per character, and the GMM contains 512 Gaussians, as optimized in separate experiments.

For evaluation, test samples are ranked according to the obtained scores, and then the average precision (AP) is computed. This figure is obtained as the average of the precision scores in a precision/recall curve. Such a curve is obtained by setting a threshold on the score in order to yield accept/reject decisions and then varying the threshold. A single overall figure is obtained by averaging the APs across all words and folds, which is referred to as the mean AP or, shortly, mAP.
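For reference, the AP of a ranked list can be computed as in the following sketch (ours, not the authors' evaluation code):

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: rank samples by score and average the
    precision at the rank of each positive sample."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    ranks = np.flatnonzero(labels == 1) + 1  # 1-based ranks of positives
    hits = np.cumsum(labels)
    return float(np.mean(hits[ranks - 1] / ranks))

# The overall mAP is then the mean of the per-word, per-fold APs, e.g.:
# map_score = np.mean([average_precision(s, y) for (s, y) in all_runs])
```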

5.2. Results

We experimented with MAP and three variations of MLLR. These are obtained by assuming different forms of the matrix $A$ in Eq. (8):

1. $A$ being the identity matrix, which reduces to an offset-only transform (MLLR-b),
2. $A$ being a diagonal matrix (MLLR-diag), and
3. $A$ being a full matrix (MLLR-full).

This allows testing settings with different numbers of parameters to estimate. In all these experiments, the only adapted parameters are the means, since the adaptation of other parameters did not produce a significant increase in accuracy.

Table 1
Mean average precision for the different adaptation techniques.

Adaptation method    Mean AP (%)
None                 79.3
MAP                  80.3
MLLR-b               80.2
MLLR-diag            80.4
MLLR-full            80.5

Table 1 shows the mean of the AP across the 10 keywords for each adaptation method. Adaptation improves the AP in all cases. The best case is MLLR-full, where an increase of 1.2% is obtained on average. Looking at the individual words, the largest observed increase is 2.8%.

In our experiments, MLLR performs better than MAP. It is a well-known result in speech and handwriting recognition (see, for instance, Vinciarelli and Bengio (2002)) that MLLR performs better for smaller adaptation sets, while MAP outperforms MLLR for larger adaptation sets. So our results may indicate that the data in a single document constitutes a rather small adaptation set. It would be interesting to investigate whether an increase in performance could be obtained by working with larger documents or by grouping documents; this is left as future work.

For completeness, in Fig. 4 we plot the detection-error tradeoff (DET) curves (Martin et al., 1997) of the unadapted and adapted (MLLR-full) models for two particular keywords (abonnement and demande), where the error reduction of the adapted models can be appreciated.

The following analysis can help understand the importance of the performance increase. For the writer-independent model, the working point in the SC-HMM block corresponds to a false rejection rate of FR = 40% and an average false acceptance rate of FA = 0.32%. Using MLLR-full adaptation, the average FA is reduced to 0.26%. This means that the number of errors is reduced by about 19%, a significant reduction of the number of misclassified words, especially in systems processing a large number of documents per day. (Note that these performance values refer only to the SC-HMM block of the cascade, and thus the final accuracy of the system is higher. However, we focus on the SC-HMM block since the other steps of the cascade are not relevant for this study.)

Regarding the MLLR-b and MLLR-diag methods, the best results are obtained when using 32 transforms. For the case of MLLR-full, the introduction of multiple transforms degrades performance. This clearly reveals an overfitting effect due to the higher number of parameters with a full matrix.

Finally, we also experimented with directly training a GMM for each document, without adaptation, as discussed at the end of Section 3. That is, we train a GMM directly from the features appearing in the document using MLE, without making use of a universal vocabulary. However, we obtain performances below the baseline. This result supports our discussion of why adaptation is the more correct methodology in this case.

Fig. 3. Examples of normalized images of the word Monsieur.

Fig. 4. Comparison of DET plots (False Rejection probability vs. False Acceptance probability, in %) for unadapted vs. adapted models of the keywords abonnement and demande.


6. Conclusions

In the proposed system, the detection results of a writer-independent word-spotting system are improved by adapting to the writer style of each input page. To the best of our knowledge, this is the first work to apply writer adaptation in a word-spotting task.

Traditional HMM adaptation techniques cannot be used for problems where whole words are modeled with an HMM (such as word-spotting or small-vocabulary word recognition) because (i) the writer is not present to provide adaptation samples, and (ii) the amount of output data is insufficient for performing self-adaptation. Therefore, a novel HMM adaptation method is proposed. It is based on the separation of the word-dependent parameters from the writer-dependent (shape vocabulary) parameters in a SC-HMM and the adaptation of the latter for each writer at training and test time. Experiments show that this adaptation technique improves the performance of a detection task. Adaptation can be performed in a fully unsupervised way, even without requiring the presence of the modeled word in the adaptation set.

Let us discuss future developments which we believe could further improve the adaptation performance. These are motivated by two experimental observations: (i) the adaptation material contained in a single document might be insufficient, and (ii) while (ideally) the Gaussian codewords in the universal vocabulary should be writer-independent, it is possible that in practice this assumption does not completely hold.

The first observation might be addressed by employing very fast adaptation techniques such as eigenvoices (Kuhn et al., 2000), successfully applied to speaker adaptation. The eigenvoices technique constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers. This effectively reduces the number of parameters to be estimated and guarantees that the obtained eigenvoice basis vectors represent the most important components of variation between the reference speakers. This allows one to obtain fairly good adaptation performance with a very small adaptation set (even a few characters or a single one). However, for larger amounts of adaptation material the performance usually reaches a plateau.

As for the second problem, iterative techniques such as speaker-adaptive training (SAT) (Anastasakos et al., 1996) could be applied. Indeed, SAT starts from an idea similar to ours: one assumes that the variability in the models is a combination of phonetic variability and speaker variability. The SAT framework, however, works at the sub-word level, and all models are adapted simultaneously. This means that, for instance, the data from one character can be meaningful for adapting the model of another character. This idea could be imported at the whole-word level and applied to remove the writer variability from the GMM shape vocabulary.

Acknowledgments

The work of the CVC authors was partially supported by the Spanish projects TIN2006-15694-C02-02 and CONSOLIDER-INGENIO 2010 (CSD2007-00018).

References

Adamek, T., Connor, N.E., Smeaton, A.F., 2007. Word matching using single closed contours for indexing handwritten historical documents. Internat. J. Doc. Anal. Recognition 9 (2), 153–165.
Anastasakos, T., McDonough, J., Schwartz, R., Makhoul, J., 1996. A compact model for speaker-adaptive training. In: Internat. Conf. on Spoken Language Processing, vol. 2, pp. 1137–1140.
Ball, G., Srihari, S.R., 2008. Writer adaptation in off-line Arabic handwriting recognition. In: Document Recognition and Retrieval XV.
Brakensiek, A., Kosmala, A., Rigoll, G., 2001. Writer adaptation for online handwriting recognition. In: Proc. 23rd DAGM Symposium on Pattern Recognition, Munich, Germany, September 12–14. Springer, Berlin, Heidelberg, p. 32.
Chan, J., Ziftci, C., Forsyth, D., 2006. Searching off-line Arabic documents. In: Proc. 2006 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 1455–1462.
Chen, F.R., Wilcox, L.D., Bloomberg, D.S., 1993. Word spotting in scanned images using hidden Markov models. In: IEEE Conf. on Audio, Speech and Signal Processing, vol. 5.
Cho, B.-J., Kim, J.H., 2004. Print keyword spotting with dynamically synthesized pseudo 2d HMMs. Pattern Recognition Lett. 25 (9), 999–1011.
Connell, S.D., Jain, A.K., 2002. Writer adaptation for online handwriting recognition. IEEE Trans. Pattern Anal. Machine Intell. 24 (3), 329–346.
Edwards, J., Teh, Y.W., Forsyth, D.A., Bock, R., Maire, M., Vesom, G., 2004. Making Latin manuscripts searchable using gHMMs. In: Neural Information Processing Systems.
El-Yacoubi, A., Sabourin, R., Suen, C.Y., Gilloux, M., 1999. An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Trans. Pattern Anal. Machine Intell. 21 (8), 752–760.
Fink, G.A., Plötz, T., 2006. Unsupervised estimation of writing style models for improved unconstrained off-line handwriting recognition. In: Proc. 10th Internat. Workshop on Frontiers in Handwriting Recognition. IEEE, La Baule, France.
Gales, M., 1996. The generation and use of regression class trees for MLLR adaptation. Tech. Rep. CUED/F-INFENG/TR.263, Cambridge University, Cambridge, UK.
Gauvain, J.-L., Lee, C.-H., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2 (2), 291–298.
Huang, X.D., Jack, M.A., 1990. Semi-continuous hidden Markov models for speech signals. In: Readings in Speech Recognition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 340–346.
Kienzle, W., Chellapilla, K., 2006. Personalized handwriting recognition via biased regularization. In: ICML'06: Proc. 23rd Internat. Conf. on Machine Learning, pp. 457–464.
Knerr, S., Augustin, E., Baret, O., Price, D., 1998. Hidden Markov model based word recognition and its application to legal amount reading on French checks. Comput. Vision Image Understand. 70 (3), 404–419.
Knill, K., Young, S., 1994. Speaker dependent keyword spotting for accessing stored speech. Tech. Rep. CUED/F-INFENG/TR 193, Cambridge University Engineering Department.
Kolcz, A., Alspector, J., Augusteijn, M., Carlson, R., Popescu, G.V., 2000. A line-oriented approach to word spotting in handwritten documents. Pattern Anal. Appl. 3 (2), 153–168.
Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N., 2000. Rapid speaker adaptation in the eigenvoice space. IEEE Trans. Speech Audio Process. 8.
Kuo, S.S., Agazzi, O.E., 1994. Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models. IEEE Trans. Pattern Anal. Machine Intell. 16 (8), 842–848.
Leggetter, C., Woodland, P., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Language 9, 171–185.
Liu, Y., Perronnin, F., 2008. A similarity measure between unordered vector sets with application to image categorization. In: IEEE Conf. on Computer Vision and Pattern Recognition.
Manmatha, R., Han, C., Riseman, E.M., 1996. Word spotting: A new approach to indexing handwriting. In: Proc. 1996 IEEE Conf. on Computer Vision and Pattern Recognition, p. 631.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M., 1997. The DET curve in assessment of detection task performance. In: Proc. EuroSpeech'97, pp. 1895–1898.
Mouchère, H., Anquetil, E., Ragot, N., 2007. Writer style adaptation in online handwriting recognizers by a fuzzy mechanism approach: The ADAPT method. Internat. J. Pattern Recognition Artif. Intell. 21 (1), 99–116.
Myers, C.S., Rabiner, L.R., Rosenberg, A.E., 1981. On the use of dynamic time warping for word spotting and connected word recognition. Bell System Technol. J. 60, 303–325.
Rath, T.M., Manmatha, R., 2003. Word image matching using dynamic time warping. In: Proc. 2003 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 521–527.
Rath, T.M., Manmatha, R., 2007. Word spotting for historical documents. Internat. J. Doc. Anal. Recognition 9, 139–152.
Rath, T., Lavrenko, V., Manmatha, R., 2003. A statistical approach to retrieving historical manuscript images. CIIR Technical Report MM-42.
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 10, 19–41.
Rodríguez, J.A., Perronnin, F., Sánchez, G., Lladós, J., 2008. Unsupervised writer style adaptation for handwritten word spotting. In: Internat. Conf. on Pattern Recognition.
Rodríguez-Serrano, J.A., Perronnin, F., 2009. Handwritten word-spotting using hidden Markov models and universal vocabularies. Pattern Recognition 42, 2106–2116.
Rose, R.C., Paul, D.B., 1990. A hidden Markov model based keyword recognition system. In: Internat. Conf. on Acoustics, Speech, and Signal Processing, pp. 129–132.
Saykol, E., Sinop, A., Gudukbay, U., Ulusoy, O., Cetin, A., 2004. Content-based retrieval of historical Ottoman documents stored as textual images. IEEE Trans. Image Process. 13 (3), 314–325.
Schlapbach, A., Liwicki, M., Bunke, H., 2008. A writer identification system for on-line whiteboard data. Pattern Recognition 41 (7), 2381–2397.
Sivic, J., Zisserman, A., 2009. Efficient visual search of videos cast as text retrieval. IEEE Trans. Pattern Anal. Machine Intell. 31 (4), 591–606.
Srihari, S.N., Huang, C., Srinivasan, H., 2004. Content-based retrieval of handwritten document images. In: Knowledge Based Computer Systems (KBCS 2004).
Srihari, S.N., Huang, C., Srinivasan, H., 2005. A search engine for handwritten documents. In: Document Recognition and Retrieval XIII, pp. 66–75.
Terasawa, K., Nagasaki, T., Kawashima, T., 2005. Eigenspace method for text retrieval in historical document images. In: Proc. 8th Internat. Conf. on Document Analysis and Recognition, pp. 436–441.
Terasawa, K., Tanaka, Y., 2007. Locality sensitive pseudo-code for document images. In: Proc. 9th Internat. Conf. on Document Analysis and Recognition. IEEE Computer Society, Washington, DC, USA, pp. 73–77.
Van der Zant, T., Schomaker, L., Haak, K., 2008. Handwritten-word spotting using biologically inspired features. IEEE Trans. Pattern Anal. Machine Intell. 30, 1945–1957.
Vinciarelli, A., Bengio, S., 2002. Writer adaptation techniques in HMM based off-line cursive script recognition. Pattern Recognition Lett. 23 (8), 905–916.
