
Text Segmentation via Topic Modeling: An Analytical Study

Hemant Misra
Dept. of Computing Science, University of Glasgow
hemant@dcs.gla.ac.uk

François Yvon
Univ Paris-Sud 11 and LIMSI-CNRS, Orsay, France
yvon@limsi.fr

Joemon Jose
Dept. of Computing Science, University of Glasgow
jj@dcs.gla.ac.uk

Olivier Cappé
TELECOM ParisTech and LTCI/CNRS, Paris, France
cappe@telecomparistech.fr

ABSTRACT
In this paper, the task of text segmentation is approached from a topic modeling perspective. We introduce a new application of the latent Dirichlet allocation (LDA) topic model: segmenting a text into semantically coherent segments. A major benefit of the proposed approach is that, along with the segment boundaries, it outputs the topic distribution associated with each segment. This information is of potential use in applications like segment retrieval and discourse analysis. The new approach consistently outperforms a standard baseline method on three different datasets and yields significantly better performance than most of the available unsupervised methods on a benchmark dataset. However, the proposed approach suffers from an unrealistically high computational cost. Based on an analysis of the dynamic programming (DP) algorithm used for segmentation, we suggest an optimization of DP that dramatically speeds up the process with no loss in performance.

Keywords
text segmentation, topic modeling, latent Dirichlet allocation, semantic information, dynamic programming

1. INTRODUCTION

On several occasions, the information needs of a user can be reasonably satisfied by presenting only the relevant part(s) of a document, whereas presenting the complete document(s) may result in information overload. In information retrieval (IR), passage retrieval [25] is a step in this direction: instead of retrieving a set of documents in response to a query, a set of passages is retrieved. In this context, estimating passage boundaries reliably in an unsupervised manner is a key step in performing IR at the segment level.


Linear text segmentation is the task of dividing a given text into topically coherent segments [15, 11, 20, 21, 2, 3, 8, 24, 17]. Text segmentation is a fundamental requirement for many IR applications; for example, segmenting a news broadcast transcription into stories (if possible, with a topic tag) could be very useful for browsing and retrieval. In a normal setting (no text segmentation performed), if a user needs to access a particular story in a news broadcast, he may have to search the entire broadcast to get the story. In contrast, if the news is segmented (either manually or automatically) into stories and labeled, the relevant story can be retrieved directly. Text segmentation can also improve a user's retrieval experience by segmenting a document into topics and subtopics, and presenting only the relevant parts of the document during a search operation. Text segmentation is also important in text summarization and discourse analysis (the detection of topic changes).

Several approaches have been proposed in the past to perform this task. Most of the unsupervised approaches exploit lexical chain information, i.e., the fact that related or similar words tend to be repeated in topically coherent segments and that segment boundaries often correspond to a change in vocabulary [15, 11, 8, 24, 17]. Such approaches do not require a training phase (or training data), and can be directly applied to any text from any domain, subject to the (only) constraint that word boundaries can be identified. These approaches may fail to produce reliable segment boundaries if word repetition is avoided by design (1). A notable variation of these approaches (to overcome the issue of non-repetition caused by words being replaced by synonyms or hypernyms) is to project the data onto some latent space before performing the similarity match [7, 6, 4]. Nevertheless, a potential drawback of all these approaches is that even when the segment boundaries are estimated correctly, the segments are not associated (labeled) with any topic information.

In order to overcome these limitations of the lexical chain approaches, a new methodology for text segmentation is proposed in this paper which builds upon the well-established latent Dirichlet allocation (LDA) model [5] and some of its unexplored properties. LDA is a generative and unsupervised topic model; for the text segmentation task, it is first trained and then used to segment running texts.
(1) See, for example, http://www.thinkage.ca/~jim/prose/repetitionvselegantvariation.htm

During training, the model learns semantic information from the dataset and hence does not rely on mere word repetition to segment the text. This is a departure from the lexical chain approaches, which are typically knowledge-free. The proposed LDA based approach also differs from the lexical chain approaches in that it jointly performs segmentation and topic labeling (it outputs the topic distribution associated with each segment). An expected benefit of this approach is its ability to identify the topic of each segment, thus allowing topics to be tracked within a long document or within a collection. However, a side effect of the topic estimation process is that it increases the computational cost of the entire process by several orders of magnitude.

The main objective of this study is to investigate whether text segmentation can be achieved from a topic modeling perspective. The empirical contributions of this paper to the text segmentation task are manifold:

- We demonstrate that the LDA based text segmentation approach proposed in this paper yields better performance than the standard baseline technique proposed in [24], provided that in-domain data is available for training (or adapting) the LDA model. Our study also provides new insights regarding the strengths and weaknesses of both methods.

- We introduce a heuristic for the dynamic programming (DP) algorithm that brings a substantial reduction in the computational cost associated with the LDA based text segmentation approach. This makes topic modeling a viable alternative even when a document has to be segmented online. The proposed heuristic is not specific to the LDA based method and can be directly applied to any method that uses DP for the text segmentation task.

- The proposed LDA based method is first evaluated on a standard dataset. Later, a bigger and more realistic dataset is designed to allow an in-depth investigation of the issues associated with the text segmentation task. Finally, we validate and reconfirm the findings of this paper on a third dataset which was part of the TRECVid 2003 evaluations. The LDA based method achieves better performance than the baseline method on all three datasets. In particular, on the standard dataset typically used for benchmarking, it outperforms all the unsupervised methods previously reported in the literature.

The rest of the paper is organized as follows. In Section 2, we recall the principles of dynamic programming (DP) for text segmentation, first reviewing the method proposed by Utiyama and Isahara [24] (one of the most cited baseline approaches to date) and then explaining how to adapt these principles when fragments are scored under the LDA topic model. In Section 3.3, we compare and analyze the performance of the two methods on Choi's benchmark [8]. We then investigate the two methods on a very large dataset derived from the Reuters News Corpus Volume 1 (RCV1) [16]; Section 3.4 presents the results obtained on this data and discusses the merits and weaknesses of both approaches. We revisit the DP algorithm and analyze it in Section 4, where an optimization procedure is proposed which brings a significant reduction in the computational cost.

Finally, in Section 5, the two methods are evaluated on a real application, the TRECVid 2003 news story segmentation task, to revalidate the main findings of this paper. Conclusions of this study are drawn and future prospects are discussed in Section 6.

2. ALGORITHMICS OF SEGMENTATION
In order to understand typical dynamic programming (DP) based approaches to the text segmentation task, in Section 2.1 we explain DP, the concept of the score of a segment, and the process of finding the path which maximizes the cumulative score. Subsequently, in Section 2.2 we explain the score computation process of the often-quoted baseline method of Utiyama and Isahara [24]. Next, in Section 2.3, we propose our LDA based approach to the text segmentation task, its motivation, and the procedure for estimating the score of a segment using LDA. Finally, in Section 2.4, we discuss the computational cost associated with typical DP based approaches and its effect on the proposed LDA based approach.

2.1 Dynamic Programming (DP) with Probabilistic Segment Scores


Text segmentation can be efficiently implemented with DP techniques [24, 14]. Assuming that the text is represented as a linear graph, a segment is defined by two nodes, the begin (B) and end (E) nodes. For instance, in Fig. 1, segment Seg_1^5 (dotted line in red) runs from begin node B1 (excluding B1) to end node E5 (including E5). Node 0 is treated as a null node for convenience. In the standard DP approach, scores for all possible node pairs are computed; therefore, if the graph contains N nodes, one has to consider N(N+1)/2 node pairs, as shown in Fig. 2.

[Figure 1: Nodes and segments in dynamic programming.]
[Figure 2: Node pairs (segments) in standard DP.]

As described in [24], text segmentation then proceeds as follows. We denote by d = w_1 w_2 ... w_{l_d} a document of length l_d, and by S = S_1 S_2 ... S_m a particular segmentation made up of m segments. The likelihood of S is

$$P(S|d) = \frac{P(d|S)\,P(S)}{P(d)} \qquad (1)$$

In (1), P(d|S) is the probability of d under segmentation S and P(S) is a prior over segmentations, which corresponds to a penalty factor. Assuming S_i contains n_i word tokens, and that w_i^j denotes the j-th word token in S_i, we denote W_i = w_i^1 ... w_i^{n_i}. Therefore, d = W_1 ... W_m with l_d = \sum_{i=1}^m n_i. Under these assumptions, W_i and S_i are in one-to-one correspondence. Further, assuming that segments are independent of each other, (1) can be rewritten (dropping P(d), which is constant across all segmentations of a given document d) as

$$P(S|d) \;\propto\; \left[\prod_{i=1}^{m} P(W_i|S_i)\right] P(S) \;\propto\; \left[\prod_{i=1}^{m}\prod_{j=1}^{n_i} P(w_i^j|S_i)\right] P(S) \qquad (2)$$

The most likely segmentation is defined as \hat{S} = \arg\max_S P(S|d), and can be recovered using DP in a manner similar to the resolution of shortest path problems. During the forward pass, for each pair of nodes (B, E), the score of Seg_B^E is computed. The path that maximizes the cumulative score from the first to the last node is searched, and for each E node the value of the best start node B is stored. The information about the best start node is used during trace-back to find the path that maximizes the score, and in turn, the segment boundaries.
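To make the forward pass and trace-back concrete, here is a minimal sketch of the dynamic program described above (an illustration, not the authors' implementation); the `score(b, e)` callback is assumed to return the log-score of the candidate segment spanning nodes b..e, including any per-segment penalty, and is a placeholder for the models of Sections 2.2 and 2.3.

def segment(num_nodes, score):
    """Exact DP over a linear graph with nodes 0..num_nodes (node 0 is the null node).

    score(b, e) must return the log-score of the segment that starts just after
    node b and ends at node e (including any per-segment penalty).
    Returns the end-node positions of the highest-scoring segmentation.
    """
    NEG_INF = float("-inf")
    best = [NEG_INF] * (num_nodes + 1)   # best cumulative score ending at node e
    back = [0] * (num_nodes + 1)         # best begin node stored for each end node
    best[0] = 0.0
    for e in range(1, num_nodes + 1):
        for b in range(e):               # all N(N+1)/2 node pairs of Fig. 2
            cand = best[b] + score(b, e)
            if cand > best[e]:
                best[e], back[e] = cand, b
    boundaries, e = [], num_nodes        # trace back from the last node
    while e > 0:
        boundaries.append(e)
        e = back[e]
    return sorted(boundaries)

The double loop makes the N(N+1)/2 segment scorings of Fig. 2 explicit; Section 4 shows how this inner loop can be pruned.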

2.2 Scoring Segments with the Method of Utiyama and Isahara

The method proposed in [24] models each segment with a conventional multinomial model, whose segment-specific parameters are estimated using the usual maximum likelihood estimates with Laplace smoothing. In the literature, this approach has often been used as a standard baseline and has been shown to deliver competitive results on several datasets [24, 6, 9, 4, 17]. The factors in (2) are thus computed as

$$P(w_i^j|S_i) = \frac{C_i^j + 1}{n_i + V_d} \qquad (3)$$

where C_i^j is the frequency of word w_i^j in S_i and V_d is the number of unique words in d. Changing to the log-domain to avoid underflow, the log of P(W_i|S_i) can be written as

$$\log P(W_i|S_i) = \sum_{j=1}^{n_i} \log\frac{C_i^j + 1}{n_i + V_d} = \sum_{v=1}^{V_{W_i}} C_{W_i v}\,\log(C_{W_i v} + 1) \;-\; n_i\,\log(n_i + V_d) \qquad (4)$$

In (4), C_{W_i v} is the count of word v in W_i. The second term intervening in the probability of a segmentation is the penalty factor; in [24], it was optimized to log P(S) = -m log(l_d) to yield the best performance.
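As an illustration (a sketch under stated assumptions, not the authors' code), the segment score of (4) and the penalty of [24] can be computed directly from token counts; `tokens` is assumed to be the document as a list of word tokens, and `vocab_size_d` is V_d, the number of unique words in the document.

import math
from collections import Counter

def segment_log_score(tokens, b, e, vocab_size_d):
    """log P(W_i | S_i) of (4) for the segment tokens[b:e], with Laplace smoothing."""
    counts = Counter(tokens[b:e])
    n_i = e - b
    return (sum(c * math.log(c + 1) for c in counts.values())
            - n_i * math.log(n_i + vocab_size_d))

def log_penalty(num_segments, doc_length):
    """Prior log P(S) = -m log(l_d) used as the penalty factor in [24]."""
    return -num_segments * math.log(doc_length)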

2.3 Scoring Segments with LDA

LDA is a generative, unsupervised topic model [5, 10]. In [5], the authors showed that the model can capture semantic information from a collection of documents, and demonstrated its superiority over several other models, including the multinomial mixture model [19, 22] and probabilistic latent semantic analysis [13]. They also investigated the use of LDA for text modeling, text classification and collaborative filtering. In [10], the LDA model was used to identify hot topics by analyzing the temporal dynamics of topics. Lately, the LDA model has been investigated in several applications where topic detection plays a key role, such as unsupervised language model adaptation for automatic speech recognition (ASR) [23, 12], fraud detection in telecommunications [26], and detecting semantically incoherent documents [18]. This paper explores the topic modeling properties of LDA for yet another important application, the task of text segmentation. To the best of our knowledge, LDA has not been used previously for this application. Our approach is based on the premise that a topic model may allow better detection of segment boundaries, because a segment change should be associated with a significant change in the topic distribution.

LDA adopts the traditional view that texts are represented as word count vectors, and relies upon a two-step generation process for these vectors. A key assumption is that each document is represented by a specific topic distribution and each topic has an underlying word distribution. The standard Gibbs sampling methodology used in this paper to train the LDA model is explained in detail in [10]. In the LDA model, the probabilistic generative story of a document is as follows. Assuming a fixed and known number of topics T, for each topic t a word distribution φ_t is drawn from a Dirichlet distribution of order V, where V is the vocabulary size of the training corpus. The first step in generating a document is to draw a topic distribution θ = {θ_t, t = 1...T} from a Dirichlet distribution over the T-dimensional simplex. Next, assuming that the document length is fixed, for each word occurrence in the document a topic z is chosen from θ and a word is drawn from the word distribution associated with topic z. Given the topic distribution, each word is thus drawn independently of every other word using a document-specific mixture model. Given θ, the probability of w_i, the i-th word token in a document, is thus

$$P(w_i|\theta,\phi) = \sum_{t=1}^{T} P(z_i = t|\theta)\,P(w_i|z_i = t,\phi) \qquad (5)$$
$$\qquad\qquad\; = \sum_{t=1}^{T} \theta_t\,\phi_{t,w_i} \qquad (6)$$

where P(z_i = t|θ) is the probability that the t-th topic was chosen for the i-th token and P(w_i|z_i = t, φ) is the probability of w_i given topic t. The likelihood of a document, represented as a count vector C, is then a product of terms such as (6):

$$P(C|\theta,\phi) = \prod_{v=1}^{V}\left[\sum_{t=1}^{T}\theta_t\,\phi_{tv}\right]^{C_v} \qquad (7)$$

where C_v is the count of word v in the document.

Training LDA on a text collection reveals the thematic structure of the collection, and this has been the primary application of LDA in, e.g., [5, 10]. Being a generative model, LDA can also be used to make predictions regarding novel documents, assuming they use the same vocabulary as the training corpus (the vocabulary mismatch issue in the LDA model is illustrated in Table 2). As the topic distribution of a test document gives its representation along the latent semantic dimensions, computing this distribution is important in many contexts, including the present task of text segmentation. This computation can be performed using the iterative procedure suggested in [12, 18], which relies on the following update rule:

$$\theta_{dt} \leftarrow \frac{1}{l_d}\sum_{v=1}^{V} C_{dv}\,\frac{\theta_{dt}\,\phi_{tv}}{\sum_{t'=1}^{T}\theta_{dt'}\,\phi_{t'v}} \qquad (8)$$

where l_d is the document length, computed as the number of running words. As discussed in [18], this update rule converges monotonically towards a local optimum of the likelihood, and convergence is typically reached in fewer than 10-15 iterations. Once θ has been obtained for a document, the likelihood of the document can be computed with (7). This recently proposed step for computing θ for unseen documents is key to computing the likelihood of a document. In this paper, we extend this idea to compute the likelihood of a segment and use the estimated segment likelihoods as scores for performing the text segmentation task.

The LDA based method proposed in this paper rests on the following premise: if a segment is made up of only one story it will have only a few active topics, whereas if a segment is made up of more than one story it will have a comparatively higher number of active topics. Extending this reasoning further, if a segment is coherent (its topic distribution has only a few active topics), the log-likelihood for that segment is typically high compared to the log-likelihood obtained when the segment is not coherent [18]. This observation is of critical importance to the success of the proposed LDA based approach to text segmentation, and it had been left unexplored except for its original use in detecting the coherence of a document [18]. It is thus tempting to use the log-likelihood of each possible segment as a score in the DP algorithm and to recover the segmentation from the path that yields the highest log-likelihood.

The proposed LDA based approach to text segmentation works as follows:

1. For each possible segment S_i:

   (a) Compute its θ^{S_i} by performing 20 iterations of (8):
       $$\theta_t^{S_i} \leftarrow \frac{1}{n_i}\sum_{v=1}^{V} C_{W_i v}\,\frac{\theta_t^{S_i}\,\phi_{tv}}{\sum_{t'=1}^{T}\theta_{t'}^{S_i}\,\phi_{t'v}}$$

   (b) Compute its log-likelihood using (7):
       $$P(W_i|\theta^{S_i},\phi) = \prod_{v=1}^{V}\left[\sum_{t=1}^{T}\theta_t^{S_i}\,\phi_{tv}\right]^{C_{W_i v}}$$

   (c) Treat this likelihood as the segment score: log P(W_i|S_i) = log P(W_i|θ^{S_i}, φ).

2. Substitute the segment scores into (2), and use DP to find the segmentation which maximizes the cumulative score. The penalty factor we used is defined as log P(S) = -p · m · log(l_d), where p = 3 was empirically found to yield the best performance on one of the conditions (Method LDA, Nodes Word; Table 1, fifth row) and was used throughout. (Though a small change in performance was observed when changing p, the effect was not significant.)
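To make steps 1(a)-(c) concrete, here is a minimal sketch (illustrative only, not the authors' implementation), assuming a trained topic-word matrix `phi` of shape (T, V) with strictly positive entries and a segment represented by its length-V count vector over the training vocabulary.

import numpy as np

def estimate_theta(counts, phi, iters=20):
    """Step 1(a): iterate the update rule (8) to estimate the segment's topic distribution.

    counts: length-V array of word counts C_{W_i v} for the segment
    phi:    T x V matrix of topic-word probabilities
    """
    T = phi.shape[0]
    theta = np.full(T, 1.0 / T)                 # uniform initialization
    n_i = counts.sum()
    for _ in range(iters):
        weighted = theta[:, None] * phi         # theta_t * phi_tv, shape T x V
        resp = weighted / (weighted.sum(axis=0, keepdims=True) + 1e-12)
        theta = (resp * counts).sum(axis=1) / n_i
    return theta

def segment_log_likelihood(counts, phi, iters=20):
    """Steps 1(b)-(c): log P(W_i | theta^{S_i}, phi), used as the segment's DP score."""
    theta = estimate_theta(counts, phi, iters)
    word_probs = theta @ phi                    # mixture probability of each word, as in (7)
    mask = counts > 0
    return float((counts[mask] * np.log(word_probs[mask])).sum())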

2.4 Time Complexity of DP

If there are N nodes in a document, the scores of N(N+1)/2 node pairs (segments) need to be computed in a typical DP algorithm. If the cost of computing a score is negligible, as in (4), the time complexity of the DP algorithm is not an issue. However, in order to compute the score of a segment in our LDA based approach, we first need to estimate the topic distribution (θ) of the segment using (8). This iterative procedure, which is required for every possible segment, is highly resource demanding. In Section 4, we propose an optimization of the DP algorithm and estimate the number of relevant node pairs for which θ must actually be computed during the forward pass.
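To put this quadratic growth in perspective, a quick back-of-the-envelope count (assuming word-level nodes and the 20 iterations of (8) per segment used above):

def num_segments(n_nodes):
    """Number of node pairs (segments) scored by exact DP."""
    return n_nodes * (n_nodes + 1) // 2

# A 1,000-word document with word-level nodes yields 500,500 candidate segments,
# each requiring 20 iterations of (8) under the LDA based approach.
print(num_segments(1000))  # 500500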

3. EXPERIMENTAL RESULTS

3.1 Databases


The first dataset used in this study is Choi's dataset (available at http://www.freddychoi.co.uk/), which has been used repeatedly for benchmarking text segmentation algorithms. Choi's dataset is derived from the Brown corpus, which consists of running text of edited English prose printed in the US during the calendar year 1961. Choi's dataset is divided into 4 subsets (3-5, 6-8, 9-11 and 3-11) depending upon the number of sentences in a segment/story. For example, in subset X-Y, a segment is derived by (randomly) choosing a story from the Brown corpus, followed by selecting the first N (a random number between X and Y) sentences from that story. Exactly 10 such segments are concatenated to make a document, and each subset contains 100 documents to be segmented. By design, the segments are not complete stories.

To study performance on a more realistic dataset, we created a dataset similar to Choi's dataset from the Reuters Corpus Volume 1 (RCV1) [16]. RCV1 is a collection of over 800,000 news items in English from August 1996 to August 1997. We selected a set of 23,326 news items from RCV1 for generating the Reuters dataset (RDS) for text segmentation, which is more comprehensive than Choi's dataset. As in Choi's dataset, 4 subsets were created in RDS (3-5, 6-8, 9-11 and 3-11) depending upon the number of sentences in a segment. In addition, a 5th subset was created in RDS which contains complete news items (Fulltext). Again, as in Choi's dataset, there are 100 documents to be segmented in each subset of RDS, and each document has 10 segments in it (Set 1). However, realizing that 10 segments per document is too restrictive and may not be the only condition to occur in practice, we created two more sets varying the number of segments in a document: in RDS, Set 2 and Set 3 contain 50 and 100 segments per document, respectively.

Though similar to Choi's dataset, RDS covers more situations. The number of segments in a document is not fixed to 10; it is 10, 50 or 100. Further, the Fulltext subset of RDS contains complete news items, randomly chosen and then concatenated to form the documents. From the RCV1 collection, we selected another 27,672 news items for training the LDA model (ReutersTrain). In these experiments, the number of topics (T) and the Dirichlet priors (α and β) are set to the following values: T = 50, α = 1 and β = 0.01. RCV1 was used for the following reasons: a) it is a large database, so we can have separate train (ReutersTrain) and test (RDS) sets; b) RCV1 was built over a period of one year, using real news stories, so biases towards a particular domain or vocabulary are limited; c) ReutersTrain had enough training documents to cover a minimum vocabulary size and to allow reliable estimation of the LDA parameters; d) in RDS, the Fulltext subset simulates realistic conditions whereas the other subsets emulate Choi's dataset; e) in RDS, the number of segments in a document is varied (10, 50 and 100) to simulate a wider spectrum of the text segmentation problem.

3.2 Test Conditions

In this study, we consider two basic test conditions encountered in the text segmentation task:

1. No information about sentence or paragraph boundaries is used: each word is a possible B or E node.

2. Information about sentence ends is used: each sentence beginning is a possible B node and each sentence end is a possible E node. This is the condition studied in [24].

Condition 2 yields much faster decoding times, as fewer nodes have to be evaluated. In addition, considering a smaller number of nodes may also lead to less confusion and better performance. However, in the absence of explicit sentence or paragraph markers, for example in the case of an ASR output, condition 1 is the only possibility for segmenting the text.

3.3 Choi's Data: Results and Analysis

In this section, we compare the results of the following segmentation systems (Table 1): the results reported in [24] (using Choi's implementation of the Porter stemmer); our own implementation of Utiyama's method, with and without sentence boundary information; and LDA based segmentation with and without sentence boundary information (in the case of the LDA model, the number of topics is 50 and the model is trained on the ReutersTrain dataset). In the latter case, due to the extremely high computational cost, it was not possible to compute the scores of all possible segments; therefore an approximation was made and only segments of 10 to 100 words, at intervals of 10 words, were considered (last row in Table 1). The results are presented in terms of Pk, the probabilistic error metric introduced in [3]: Pk is the probability that two randomly drawn sentences which are k sentences apart are classified inconsistently. A lower value of Pk indicates a higher accuracy in text segmentation.

Method     Nodes     Porter Stemmer   3-5            6-8             9-11            3-11
Baseline   Sentence  Choi's           13             6               6               11
Baseline   Sentence  None             14 (0.332)     7 (1.399)       7 (3.457)       11 (1.472)
Baseline   Word      None             21 (42.1)      13 (200.7)      11 (500.3)      16 (199.2)
LDA        Sentence  None             22.5 (310.2)   15.4 (1560.6)   13.1 (4031.1)   15.5 (1527.1)
LDA        Word      None             33             25              21              25

Table 1: The performance (Pk, in %) of text segmentation algorithms for various setups on Choi's dataset, with the average time taken per document (in seconds) in parentheses. The first row reproduces the results reported in [24]. Times are not reported for the last row because the scores of all possible segments were not computed during DP.

The results reported in Table 1 suggest that the performance of both approaches improves with an increase in segment size. This is an expected result: longer segments allow a better estimation of the multinomial parameters for the baseline method and of the topic distribution for LDA. Compared to the LDA based approach, the baseline method is (i) more accurate and (ii) an order of magnitude faster. An inspection of the outputs (beyond the segment boundaries provided by the two methods) gives a possible explanation for (i): there is a serious mismatch in vocabulary between the ReutersTrain dataset (used for LDA training) and Choi's dataset used for testing. Similar issues of semantic mismatch were highlighted, for example, in [4], where the authors used a generic latent semantic space while performing text segmentation. This issue is further explained by the statistics presented in Table 2.

            Vocabulary Size                  Number of Content Words
            3-5     6-8     9-11    3-11     3-5     6-8     9-11     3-11
Baseline    395.3   640.3   831.1   640.6    500.8   867.1   1199.4   851.3
LDA         347.7   566.1   733.0   566.0    444.1   775.9   1074.6   762.0

Table 2: The average vocabulary size of documents and the average number of content words per document when the two methods are used for computing scores.

The baseline method utilizes the full available vocabulary and all the content words (except the stop words, which are removed before segmenting the text) for computing the score of a segment. In contrast, the vocabulary of an LDA model is defined by its training data: at test time, the content words of the test data that are not present in the training vocabulary are not used for computing the score of a segment. Comparing the baseline and LDA methods, the approximate loss in vocabulary and content words for LDA is 11.6% and 10.5%, respectively.

To overcome this problem of vocabulary mismatch, we divided Choi's dataset into two parts. The first 50 documents from each subset were put in SET A (50 documents * 10 segments/document * 4 subsets = 2000 segments) and the last 50 documents from each subset were put in SET B. SET A was used along with ReutersTrain to train the LDA model, and SET B was used for testing. The results for the baseline and adapted LDA models on SET B are given in Table 3.

Method                                     3-5    6-8    9-11   3-11
Baseline                                   14.9   8.1    7.7    11.2
Unadapted LDA (ReutersTrain)               23.0   15.8   14.4   16.1
Adapted LDA (ReutersTrain + Choi SET A)    2.2    2.3    4.1    2.3
C99(b), Choi et al. [8]                    12     9      9      12
U00(b), Utiyama and Isahara [24]           9      7      5      10
CWM_1, Choi et al. [7]                     10     7      5      9
Brants et al. [6]                          7.4    8.0    6.8    10.7
Fragkou et al. [9]                         5.5    3      1.3    7

Table 3: The text segmentation performance (Pk, in %) on SET B of Choi's dataset for the baseline and LDA (unadapted and adapted) methods. For comparison, the performance of several other methods on the complete Choi dataset is also reported from their respective papers. All these methods used the information about the number of segments present in a document to obtain their best performance.

The results in Table 3 show that the vocabulary mismatch was the main reason for the poor performance of the LDA based method. The performance of the adapted LDA model is significantly better than that of the unadapted LDA model, and it even performs better than the baseline method. In fact, the performance of the adapted LDA model is better than that of any other unsupervised model previously reported in the literature for the text segmentation task on this benchmark [8, 24, 7, 6, 9]. Noticeably, all these results are for the case where the algorithms used the information about the number of segments to obtain their best performance, and in certain cases some data was used for adaptation [24, 9]. In the table, [9] has better performance than the adapted LDA model for subset 9-11; this particular method introduced some extra variables in the DP algorithm to capture local as well as global dynamics, and these variables were first adapted on a training dataset before the best performance was reported. The results of [17] are not reported in the table because the authors presented only the average result across all the subsets (a Pk value of 21.2). Lastly, though our baseline method is also from [24], it does not utilize the information about the number of segments; that is why the results in Row 1 and Row 5 of Table 3 differ.

In the next section, we present results on a bigger and more realistic dataset (RDS) to investigate other issues related to the text segmentation task which were not observed while working with Choi's dataset.
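For reference, a minimal sketch of the Pk computation used throughout these tables (an illustration under common conventions, not the authors' evaluation script: segmentations are given as lists of segment lengths in sentences, and k is set to half the mean reference segment length).

def to_segment_ids(seg_lengths):
    """Map segment lengths to a per-sentence segment id, e.g. [2, 3] -> [0, 0, 1, 1, 1]."""
    ids = []
    for seg_id, length in enumerate(seg_lengths):
        ids.extend([seg_id] * length)
    return ids

def pk(reference, hypothesis):
    """Probability that two sentences k apart are classified inconsistently [3]."""
    ref, hyp = to_segment_ids(reference), to_segment_ids(hypothesis)
    assert len(ref) == len(hyp), "both segmentations must cover the same text"
    k = max(1, int(round(len(ref) / (2.0 * len(reference)))))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(len(ref) - k))
    return errors / (len(ref) - k)

print(pk([5, 5], [5, 5]))  # 0.0, a perfect segmentation
print(pk([5, 5], [2, 8]))  # 0.5, a badly misplaced boundary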

3.4 RDS: Results and Analysis


As explained earlier, RDS includes sets that contain a larger number of segments (Set 2 and Set 3 contain 50 and 100 segments per document, respectively). In addition, a subset consisting of complete Reuters news stories was also considered (Fulltext). The results on this dataset are presented in Table 4, and their analysis tells a different story. On RDS, the performance of the baseline method drops significantly, especially when it is used to segment longer documents. In comparison, in spite of its high computational cost, the topic based approach yields better results, at least for the test documents that could be segmented. This result shows that LDA can produce segmentations that compare favorably with the baseline, provided the LDA model is trained on a corpus that is similar in diversity to the test documents to be segmented.

The poor performance of Utiyama's method on longer pieces of text was also reported in [24, 17]. A possible explanation is as follows. The maximum-likelihood (ML) estimator in [24], even with smoothing, is not robust for small segments: based on the observation of a handful of words, the method of [24] estimates a distribution over the entire vocabulary, whose size increases rapidly as the number of segments in a document increases and can easily amount to a few thousand parameters for reasonably sized documents. By comparison, the LDA based method only attempts to estimate the topic distribution θ, which corresponds to a much smaller number of parameters, and the estimation can thus be performed more robustly, even for small segments.

In Appendix A, we show the segmentation estimated by the proposed LDA based approach for a sample document. Along with the segmentation, we also show the topic assigned by the LDA model to each segment (only the topic with the highest probability is printed). A comparison of the top words in a topic and the topic assigned to a segment based on the words in that segment gives a very promising picture. In the near future, we will study these results in more detail for applications like discourse analysis. Encouraged by these results, we then analyzed the DP algorithm to alleviate the issue of the high computational cost associated with the topic based segmentation algorithm.

4. SPARSE PATH AND MODIFIED DP


The main difficulty of DP based segmentation stems from its requirement to score all possible node pairs as segments, whose number grows quadratically with the number of words or sentences in a document. In LDA, scoring a segment additionally requires an iterative estimation of its topic distribution.

Method    Segments      3-5            6-8            9-11           3-11           Fulltext
Baseline  10 (Set 1)    16 (0.208)     14 (0.776)     14 (1.541)     14 (0.665)     17 (10.619)
Baseline  50 (Set 2)    36 (21.85)     34 (92.77)     33 (191.3)     34 (85.82)     29 (906.2)
Baseline  100 (Set 3)   41 (210.0)     41 (625.7)     40 (1285.6)    40 (400.0)     38 (5382.9)
LDA       10 (Set 1)    6.5 (200.3)    4.5 (801.9)    5.6 (2309.9)   5.9 (734.2)    11.1 (8750.8)
LDA       50 (Set 2)    9.6 (19364)    5.8 (80486)    HCC            HCC            HCC
LDA       100 (Set 3)   6.6 (202085)   HCC            HCC            HCC            HCC

Table 4: The text segmentation performance (Pk, in %) on RDS, with the average time taken per document (in seconds) in parentheses. * marks a partial result; HCC indicates runs that could not be completed because of the high computation cost.

An obvious remedy to this problem lies in computing the scores of only a subset of the most relevant segments. In [17], for instance, the authors used a threshold to prune the segments that were too long (their B and E nodes were too far apart), a strategy that proved effective in reducing the computational cost of their technique.

Our proposed optimization is based on the observation that, during the forward pass, most nodes are never active as B nodes: for most nodes, we cannot find a single E node whose best starting node is among them. As a consequence, these nodes are never on the maximum score path. This observation is reflected in Fig. 3, where we plot, for two example test documents, the frequency count of active nodes during the forward pass. More precisely, we display the number of times each node B_i is an active begin node for any of the possible nodes E_j, with j > i. The figure also plots the actual segment boundaries and the segment boundaries obtained by the baseline method after back-tracing.

[Figure 3: Frequency of a node as an active B node for two example test documents. Both panels plot node frequency against word number, showing the active nodes together with the boundaries found by Utiyama's method and the actual boundaries.]

As shown in Fig. 3, only a few nodes are active B nodes. Conversely, those B nodes that are highly active (meaning that they are the best starting point of many segments) are very often good candidates for segment boundaries. This also means that the most likely candidates for segment boundaries can be found during the forward pass itself, which makes it possible to segment in a streamwise fashion. Moreover, once an active B node is crossed, the segments starting before this B node are mostly not on the maximum score path. This suggests a way to reduce the computational cost of DP, which consists of disregarding segments whose starting node lies before the last active B node. This heuristic, which holds irrespective of the underlying probabilistic model, does not require defining a threshold on the segment size as suggested in [17]. In fact, this sparsity of the DP path for the text segmentation task is visible in many previous publications, though in a different format, usually called a similarity plot (for example, Fig. 1 in [17], Fig. 1 in [9], and Figs. 1 and 3 in [8]). However, it was never recognized, discussed and put to use. This information is available during the forward pass of the DP itself, and the saving in cost can be significant if it is used wisely.

As explained above, the DP path is sparse and most nodes are never active B nodes for any E node. If an active node (B = X) is found while doing the forward pass, the paths originating from the nodes before this active node (B < X) are not considered for score computation.

[Figure 4: Relevant node pairs (segments) in optimized DP.]

In Fig. 4, for example, if B = 3 is an active node (it gives the maximum cumulative score for E = 4), then for E = 5 there is no need to compute the scores of segments 2-5, 1-5, and 0-5. As the DP progresses, the last active node keeps getting updated; for example, if towards the end of the document B = N-4 is an active node, the scores of only a few segments need to be computed. In order to make the algorithm more robust, a node is considered active only when it has been active for at least two consecutive segments. In the discussion above, B = 3 is considered an active node only if it is the best starting point for both E = 4 and E = 5; if B = 3 has been active at least twice, then for E = 6 the scores of segments 2-6, 1-6, and 0-6 need not be computed. This simple scheme, which trades off exactness of search for speed, proves to be very efficient in reducing the computational load, especially for longer texts, with almost no loss in performance in most cases.
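A minimal sketch of the modified forward pass (illustrative only; it reuses the conventions of the `segment` routine sketched in Section 2.1 and declares a node active, pruning earlier begin nodes, once it has been the best begin node for two consecutive end nodes):

def segment_pruned(num_nodes, score):
    """DP forward pass that disregards begin nodes lying before the last active B node."""
    NEG_INF = float("-inf")
    best = [NEG_INF] * (num_nodes + 1)
    back = [0] * (num_nodes + 1)
    best[0] = 0.0
    earliest_b = 0                       # begin nodes before this index are skipped
    streak_node, streak = -1, 0
    for e in range(1, num_nodes + 1):
        for b in range(earliest_b, e):
            cand = best[b] + score(b, e)
            if cand > best[e]:
                best[e], back[e] = cand, b
        # Count how many consecutive end nodes chose the same best begin node.
        if back[e] == streak_node:
            streak += 1
        else:
            streak_node, streak = back[e], 1
        if streak >= 2:                  # active for at least two consecutive segments
            earliest_b = max(earliest_b, streak_node)
    boundaries, e = [], num_nodes        # trace back as in the exact algorithm
    while e > 0:
        boundaries.append(e)
        e = back[e]
    return sorted(boundaries)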

Method    Segments      3-5            6-8            9-11           3-11           Fulltext
LDA ODP   10 (Set 1)    6.5 (33.3)     4.4 (57.5)     5.4 (91.9)     5.9 (58.3)     11.9 (349.6)
LDA ODP   50 (Set 2)    11.2 (172.8)   6.0 (320.0)    5.5 (545.0)    7.3 (332.4)    8.4 (2054.7)
LDA ODP   100 (Set 3)   13.6 (548.9)   7.8 (1009.8)   6.4 (1540.1)   8.4 (1030.6)   8.4 (7950.1)

Table 5: The text segmentation performance (Pk, in %) on RDS with the optimized DP (ODP), with the average time taken per document (in seconds) in parentheses.

This is confirmed by the results achieved with the optimized DP algorithm (Table 5). The results in Table 5 reflect the effectiveness of this simple heuristic, both in terms of speed and accuracy. For all segment sizes, the time needed to segment documents is reduced by more than a factor of 10, with even larger gains for long documents. For test Set 1, which contains 10 segments per document, we can actually compare the performance of exact and approximate search: both yield comparable results. In fact, the heuristic method proves slightly better in a couple of situations, as it discards solutions that are optimal for the model but would yield very long segments. For test Set 2 and Set 3, which contain more segments (50 and 100 segments, respectively), our method outperforms the baseline by a wide margin. Compared to the baseline, the performance of the topic based model degrades very gracefully as the number of segments increases.

5. A REAL APPLICATION: NEWS STORY SEGMENTATION TASK

In TRECVid 2003 [1], news story segmentation was defined as a separate evaluation task. In this section, we present the performance of the proposed LDA based method on the dataset proposed for the TRECVid 2003 news story segmentation evaluation task. The dataset consists of 120 hours of news video collected from the ABC and CNN news channels in 1998. Approximately 60 hours of data were set aside for training or development (TRECVidTrain) and the rest for testing (TRECVidTest). The closed caption text for these news broadcasts was also provided. The participants were expected to utilize multi-modal features from the text, audio and video streams to perform the segmentation task. In the actual evaluation, the participants were allowed to use specific cues from the text stream, such as "good evening", "xyz (person) reporting from abc (place)" or "coming next", to facilitate the task. Such cues are closely associated with news broadcasts and typically signal a story change; they were obtained from the TRECVidTrain data.

The TRECVid results are typically reported in terms of precision, recall and F1 measures, defined as

precision = (number of estimated boundaries that are actual boundaries) / (number of estimated boundaries)
recall = (number of estimated boundaries that are actual boundaries) / (number of actual boundaries)
F1 = (2 x precision x recall) / (precision + recall)

In this paper, we approach the news story segmentation task as a text segmentation problem: only the closed caption text was utilized for performing this task. It is worth mentioning that no cues specific to news broadcasts were employed to facilitate the task. We show the performance of the baseline and two LDA based methods in Table 6. The first LDA based method is trained on the ReutersTrain dataset, while the second utilizes the ReutersTrain and TRECVidTrain datasets for training the LDA model. The results are presented in terms of two measures, Pk and F1. It must be noted that a lower value of Pk indicates better performance, whereas a higher value of the F1 measure indicates better performance.

Method          Pk, in %   F1
Baseline        22.3       0.39
Unadapted LDA   22.7       0.40
Adapted LDA     20.7       0.44

Table 6: The text segmentation performance in terms of the Pk and F1 measures on the TRECVidTest dataset for the baseline, unadapted LDA (ReutersTrain) and adapted LDA (ReutersTrain + TRECVidTrain) methods.

A comparison of the results shows that the baseline and unadapted LDA methods yield similar performances, and that the performance of the LDA method can be improved by adaptation. These results validate the earlier findings on Choi's dataset, namely that adaptation improves the performance of the LDA based method. These encouraging results on a real application suggest that text segmentation from an unsupervised topic modeling perspective (the proposed LDA based method in our case) is an alternative and competitive tool that gives reasonable performance on several different datasets. Still, the text segmentation problem is far from solved; below are some observations from the TRECVid 2003 dataset.

In this dataset, a typical CNN or ABC news broadcast started with short main headlines, followed by the actual news in which the headlines were presented in detail. In the main headlines, each story was represented by a single sentence. In most cases, it was not possible to estimate the boundaries correctly in the main headlines section. This highlights a limitation of unsupervised approaches, which need some minimum amount of data to produce reliable estimates of the score.

We expect our LDA based approach to be complementary to the existing approaches. To give a glimpse of future avenues of improvement, in our last experiment we investigated the complementarity of the two approaches by combining the segment boundaries estimated by the baseline and the LDA methods with a simple OR operation.

That is, if either method suggested a segment boundary, we accepted it as a segment boundary. This combination was performed because an analysis of the segment boundaries estimated by the two methods revealed that: a) when a boundary was estimated by these methods, it was mostly an actual boundary; b) both methods undersample, that is, the number of boundaries estimated by each method is always less than the number of actual boundaries; and c) the boundaries estimated by the two methods are not the same. These observations suggest that the outputs of the two methods are complementary, which is expected because the methods rely on different techniques to compute the segment scores and consequently produce different boundary estimates. This simple combination yields an F1 measure of 0.52 for Baseline + unadapted LDA and 0.55 for Baseline + adapted LDA.
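As a minimal sketch of this combination (assuming boundaries are represented as sets of sentence indices; names are illustrative):

def combine_or(baseline_boundaries, lda_boundaries):
    """Accept a boundary if either method proposes it (set union)."""
    return sorted(set(baseline_boundaries) | set(lda_boundaries))

print(combine_or({5, 12, 30}, {5, 18, 30, 44}))  # [5, 12, 18, 30, 44]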

6. CONCLUSIONS

In this study, we proposed an LDA based method for the text segmentation task. The proposed method computes topic distributions (Appendix A) jointly with the segmentation, thus allowing information about the thematic content of each segment to be collected; this information can be used to keep track of recurring topics. We investigated and compared the performance of our method with a standard approach often used as a baseline for the text segmentation task [24], and analyzed its potential strengths and weaknesses. The LDA based method gave better performance than the baseline in matched conditions (train and test data from the same domain) and in adapted conditions (a small amount of data from the test domain is used along with a large amount of out-of-domain data to train the LDA model). This trend was observed on all three datasets studied in this paper. In fact, the proposed LDA based method outperformed every other unsupervised approach on Choi's dataset and was far superior to the baseline method on the RDS dataset.

One drawback of the proposed LDA based approach was its high computational cost. We proposed an optimized DP algorithm that effectively reduces this cost: the unsupervised methodology for discarding unimportant segments from the score computation brings a substantial reduction in computational cost without affecting the performance significantly. The proposed optimization of the DP algorithm is not only useful for the text segmentation task but can also be used in many other situations where a similar sparsity is encountered. The LDA based approach nevertheless remains an order of magnitude slower than the baseline, indicating that there is scope for improvement. One possible solution would be to combine both techniques, using the baseline method to locate the most promising boundaries and then rescoring them with the topic based method. A second issue with the topic based approach is the mismatch between the training and the test data. This could be alleviated, for instance, by adapting the parameters of the LDA model via a first pass on the test data, during which the topic assignments for the whole document would be computed and used to revise the parameters.

7. REFERENCES

[1] TREC Video Retrieval Evaluation (2003). http://www-nlpir.nist.gov/projects/tv2003/tv2003.html, Nov. 2003.
[2] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report, 1998.
[3] D. Beeferman, A. Berger, and J. Lafferty. Statistical models for text segmentation. Machine Learning, 34(1-3):177-210, 1999.
[4] Y. Bestgen. Improving text segmentation using latent semantic analysis: A reanalysis of Choi, Wiemer-Hastings, and Moore (2001). Computational Linguistics, 32(1):5-12, 2006.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), volume 14, pages 601-608, Cambridge, MA, 2002. MIT Press.
[6] T. Brants, F. Chen, and I. Tsochantaridis. Topic-based document segmentation with probabilistic latent semantic analysis. In Proceedings of the International Conference on Information and Knowledge Management, pages 211-218. ACM, 2002.
[7] F. Choi, P. Wiemer-Hastings, and J. Moore. Latent semantic analysis for text segmentation. In Proceedings of EMNLP, pages 109-117, 2001.
[8] F. Y. Y. Choi. Advances in domain independent linear text segmentation. In Proceedings of the Conference of the North American Chapter of the ACL, Seattle, WA, 2000.
[9] P. Fragkou, V. Petridis, and A. Kehagias. A dynamic programming algorithm for linear text segmentation. Journal of Intelligent Information Systems, 23(2):179-197, 2004.
[10] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228-5235, 2004.
[11] M. Hearst. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33-64, 1997.
[12] A. Heidel, H.-a. Chang, and L.-s. Lee. Language model adaptation using latent Dirichlet allocation and an efficient topic inference algorithm. In Proceedings of EuroSpeech, Antwerp, Belgium, 2007.
[13] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1):177-196, 2001.
[14] A. Kehagias, F. Pavlina, and V. Petridis. Linear text segmentation using a dynamic programming algorithm. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, pages 171-178, Morristown, NJ, USA, 2003.
[15] H. Kozima. Text segmentation based on similarity between words. In Meeting of the Association for Computational Linguistics, pages 286-288, 1993.
[16] D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
[17] I. Malioutov and R. Barzilay. Minimum cut model for spoken lecture segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 25-32, Sydney, Australia, July 2006.
[18] H. Misra, O. Cappé, and F. Yvon. Using LDA to detect semantically incoherent documents. In Proceedings of CoNLL, Manchester, UK, 2008.
[19] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134, 2000.
[20] J. M. Ponte and W. B. Croft. Text segmentation by topic. In European Conference on Digital Libraries, pages 113-125, 1997.
[21] J. C. Reynar. Topic Segmentation: Algorithms and Applications. PhD thesis, University of Pennsylvania, 1998.
[22] L. Rigouste, O. Cappé, and F. Yvon. Inference and evaluation of the multinomial mixture model for text clustering. Information Processing and Management, 43(5):1260-1280, Sept. 2007.
[23] Y.-C. Tam and T. Schultz. Unsupervised language model adaptation using latent semantic marginals. In Proceedings of ICSLP, Pittsburgh, Pennsylvania, 2006.
[24] M. Utiyama and H. Isahara. A statistical model for domain-independent text segmentation. In Meeting of the Association for Computational Linguistics, pages 491-498, 2001.
[25] R. Wilkinson. Effective retrieval of structured documents. In Proceedings of ACM SIGIR, pages 311-317, Dublin, Ireland, 1994.
[26] D. Xing and M. Girolami. Employing latent Dirichlet allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13):1727-1734, Oct. 2007.

APPENDIX
A. TEXT-SEGMENTATION BY LDA: SAMPLE OUTPUT

A.1 An analysis of LDA output

In this section, we show the following two complementary outputs of the LDA model:

Training phase output [Section A.2]: the top ten words of the topics that appear in the example segmented document of Section A.3.

Text segmentation phase output [Section A.3]: for a sample document segmented by the LDA based approach, the top topic label assigned by the LDA model to each segment. To save space, long sentences were terminated by "..." to show continuation beyond the printed words.

A quick analysis of the segmented output in Section A.3 shows that the topic associated with a segment is mostly relevant to the words present in that segment. In particular, the MISSED BOUNDARY occurs because both segments share a common theme, India.

A.2 φ Matrix: Top 10 words of some topics

TOPIC 46: rupees india indian pakistan bombay new million sri indias market
TOPIC 50: plc stg group london pounds british million plus investment uk
TOPIC 35: iraq iraqi u.s. united kurdish northern military states iran gulf
TOPIC 32: union workers strike government percent italian wage state pay sao
TOPIC 25: government billion budget tax state minister year finance economic ministry
TOPIC 42: percent point year july june billion month rose growth august
TOPIC 19: german marks tobacco car germany industry sales ag million new
TOPIC 26: court u.s. lloyds case plan judge names legal law investors
TOPIC 16: people officials police plane miles killed passengers spokesman flight airport

A.3 Segmented Text Output


indian finance minister p chidambaram said on friday india was concerned over the impact ... look at what happened over the last two or three ... what happens to our oil import bill
MISSED BOUNDARY
pakistani officials said on saturday a war like situation existed ... they accused indian forces of starting the firing in the ... public rallies and marches were held in azad kashmir to ... at least one civilian was killed and two were wounded ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 46 has highest probability (0.28)
shares in industrial conglomerate btr plc rose in early trading ... the stock was up 5p at two hundred sixty five ... british newspaper reports over the weekend had anticipated the sale ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 50 has highest probability (0.26)
the washington post carried the following stories on its front ... dukan iraq government backed kurdish guerrillas overran sulaimaniya and captured ... washington more than hundred iraqi dissidents and military officers associated ... little rock a defiant susan mcdougal reported to jail this ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 35 has highest probability (0.44)
south africas chamber of mines said on wednesday it had agreed wage increases ... the national union of mineworkers num said earlier it had ... minimum wage rates for different mining houses have been adjusted ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 32 has highest probability (0.41)
the czech government plans a balanced thousand nine hundred ninety ... the cabinet did not find bigger room for tax cuts ... the budget plan is expected to go to a final vote ...
ESTIMATED BOUNDARY is INCORRECT: TOPIC 25 has highest probability (0.54)
the thousand nine hundred ninety seven budget assumes growth in gross domestic ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 42 has highest probability (0.70)
deutsche bahn chairman heinz duerr said on wednesday that the current building ... the construction industry must be creative costs have to be reduced by at least fifty percent duerr told ... duerr cited infrastructure as one of a number of investment challenges facing the german railway ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 19 has highest probability (0.19)
california state treasurer matt fong has called on the securities and exchange commission ... fong called for the investigations in letters to sec chairman arthur ... the letters were released thursday at a meeting of the california ...


ESTIMATED BOUNDARY is CORRECT: TOPIC 26 has highest probability (0.60)
about fifty containers of aluminium phosphate have been found on the ... weve found about fifty small tubes with the chemical inside them none were open but theres always a risk that some of the poison ... aluminium phosphate is used as rat poison and produces a toxic gas on ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 16 has highest probability (0.38)
contract talks continued saturday between ford motor co and the united auto workers ... bargainers were optimistic according to ford spokesman jon harmon but ... theyre not there yet on all the issues its going to take a while to get through harmon told reporters ...
ESTIMATED BOUNDARY is CORRECT: TOPIC 32 has highest probability (0.29)
