ument collection more frequently than s2, or if both appear equally frequently and s1 is alphabetically smaller than s2. s1 and s2 are never lexicographically identical because sentence instances are mapped to an entity with a frequency count. This ensures a canonical ordering of the sentences in the postings lists.

[Figure: each term t1, ..., tn heads a postings list index1, ..., indexn; list index_j holds the sentences s_indexj,1, s_indexj,2, ... in canonical order.]

Figure 1: Inverted index structure over the data.

When searching for the best sentence, it is only necessary to look through the lists that belong to the terms that occur in the fragment, as at least one correct term should be in the best sentence.

A further reduction can be achieved by utilizing an upper bound on the similarity – which is this algorithm's main novel aspect. Since we have normalized TF-IDF vectors, we can calculate and bound the similarity between f and s as

    sim(f, s) = Σ_j f_j s_j ≤ Σ_{j: s_j > 0} f_j.

If this estimate is smaller than the best similarity already found, the sentence can be skipped and the algorithm continues with the next sentence. Only otherwise does the real similarity have to be calculated.

The algorithm can also terminate once a sentence with the greatest possible similarity of one has been found, because the sentences are sorted by decreasing frequency. The complete retrieval algorithm is displayed in Table 1.

Table 1: Retrieval algorithm.

Input: Inverted index, sentence fragment q of length ℓ.

1. Construct an inverted index for the query by copying the sentence lists for the terms in q from the global inverted index structure. Let index_1, ..., index_ℓ be the postings lists for terms t_1, ..., t_ℓ.

2. Let f be the TF-IDF representation of q.

3. Let s_best be the best sentence found so far. Set s_best = NULL.

4. Let sim_best be the similarity between f and s_best. Set sim_best = −1.

5. Do Until break

   (a) Determine the upper bound simUB_all = Σ_{j: index_j ≠ ∅} f_j for all sentences in the index structure.

   (b) Determine the first sentence in the index structure s_first, which is the smallest first sentence over all postings lists according to the canonical sorting criterion "<" (s_first = min_j {s_indexj,1}).

   (c) If s_first = NULL or simUB_all ≤ sim_best Then break.

   (d) Determine the upper bound simUB_first = Σ_{j: s_first = s_indexj,1} f_j for s_first.

   (e) If simUB_first > sim_best Then

       i. Determine the true similarity sim_first = sim(s_first, f).

       ii. If sim_first > sim_best Then

           A. Set s_best = s_first, sim_best = sim_first.

           B. If sim_best = 1 Then break.

   (f) Move the pointers to the next element in all index lists that include s_first, thereby removing all postings of s_first from the index. Let s_indexj,1 be the new first posting of term t_j.

Return s_best.

Theorem 1. The algorithm in Table 1 accepts an initial fragment (query) q of length ℓ and an indexed collection with TF-IDF vectors of initial fragments of length ℓ, and returns the most similar and, among equally similar ones, most frequent sentence s = argmax_{s ∈ S_ms} freq(s) with S_ms = {s | sim(f, s′) is maximal}, where f is the vector representation of q and s′ refers to the initial ℓ words of s.

Proof (sketch). The proof can be split into two parts:

1. The algorithm always stores the best sentence s_best seen so far.

2. When the algorithm terminates, there is no better sentence than s_best in the remaining sentences.

First, we want to show (1). In the first iteration of the loop, the first sentence in the index structure becomes the best sentence s_best. This follows from the initialization of the best similarity with −1. As the smallest possible value for the similarity is zero, the first sentence is always better than −1.

According to Equation 3, the similarity between f and any indexed sentence can be bounded by simUB_all = Σ_{j: index_j ≠ ∅} f_j. When sim_best ≥ simUB_all, no remaining sentence contains sufficiently many query terms to outperform s_best. When sim_best = simUB_all, there may be another sentence equally similar to s_best; however, as our canonical ordering prefers frequent sentences, there can be no equally similar sentence more frequent than s_best.

Equation 5 bounds the similarity for the smallest indexed sentence. Therefore, if simUB_first < sim_best, then s_first is strictly worse than s_best. If sim_best = simUB_first, then we have already found a more frequent sentence s_best which is at least as good as s_first; in both cases, s_first does not need to be investigated. Only if neither criterion excludes the current sentence does the real similarity have to be determined. If it is higher than the best similarity, the current sentence is better and will be stored. Otherwise, the best sentence is not changed. This proves that the algorithm always stores the best fitting sentence.

Now, we show (2). There are three situations in which the algorithm terminates. In the first case, the index structure is empty. This implies that the best sentence has already been found, because all sentences that have not been ruled out have been checked. In the second case, simUB_all ≤ sim_best holds; then all sentences remaining in the index structure are worse than the best sentence found so far, by definition of simUB_all. In the third case, the similarity between the best sentence and the fragment is one. In this case, the algorithm is allowed to terminate because one is the highest possible similarity, and the sentence has the highest possible frequency of all sentences with a similarity of one because of the sorting criterion "<" (frequency sorting). This proves that the algorithm only terminates if there is no better sentence than s_best in the remaining sentences. This completes the proof.

The worst-case runtime is still O(n), since there is the possibility that the index includes all sentences and the best one is the last. However, this case is very rare. The best case has a time complexity of O(1); it occurs when already the first sentence has a similarity of 1. Nevertheless, the most interesting question is how this algorithm behaves on practical collections. We will explore this issue in Section 6.

5. DATA COMPRESSION BY CLUSTERING

The algorithm presented in the last section has to store vector space representations of the whole training data and access the most similar sentence in order to retrieve it. An inverted index improves access time but leaves us with the problem of having to store and access a potentially huge amount of data.

When trying to address this problem, one can exploit the fact that rare sentences are less likely to contribute to predictions than clusters of frequently occurring similar sentences. Hence, removing these elements from the training data appears to be a reasonable heuristic.

In the following, we describe the clustering algorithm for finding groups of semantically equivalent sentences which we use in our experiments. Our problem cannot easily be solved with an agglomerative hierarchical clustering method – such as the matrix update algorithm – because the distance matrix has a memory requirement of O(n²), which is prohibitive for any reasonably large collection.

Since we have no prior knowledge of the number of clusters, we run the EM algorithm with mixtures of two Gaussian models recursively. Each data element is assigned to the cluster with the higher likelihood, and the clustering is started recursively. We end the recursion when the number of instances in a cluster falls below a threshold, or the variance within the cluster falls below a variance threshold.

The Gaussians that describe the clusters are characterized by a mean vector and a scalar variance. We refrain from characterizing the Gaussians with their covariance matrix since this would be quadratic in the dictionary length.

The result of the clustering algorithm is a tree of clusters. In the leaf nodes, it contains the groups of – ideally semantically equivalent – sentences. The similarity of the sentences is quantified by the variance. When the variance of a cluster falls below a certain threshold, it is assumed to only contain semantically equivalent sentences. The data is reduced by pruning all leaves with fewer than a minimum number of sentences (in our experiments, 10).
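The recursive splitting described above can be sketched as follows – a minimal NumPy sketch under our own assumptions (the function names, the mean ± standard deviation initialization, and the concrete threshold values are ours, not the authors'); each component is a spherical Gaussian with a mean vector and a scalar variance, as in the text:

```python
import numpy as np

def em_two_gaussians(X, n_iter=30):
    """Fit a mixture of two spherical Gaussians (mean vector plus a
    scalar variance each) to X and return hard cluster assignments."""
    n, d = X.shape
    # deterministic initialization: two means on opposite sides of the data
    mu = np.stack([X.mean(0) - X.std(0), X.mean(0) + X.std(0)])
    var = np.full(2, X.var() + 1e-9)          # one scalar variance per component
    pi = np.array([0.5, 0.5])                 # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities from the spherical Gaussian log-densities
        logp = np.stack([np.log(pi[k]) - 0.5 * d * np.log(var[k])
                         - 0.5 * ((X - mu[k]) ** 2).sum(1) / var[k]
                         for k in range(2)])
        logp -= logp.max(0)
        r = np.exp(logp)
        r /= r.sum(0)
        # M-step: re-estimate weights, means, and scalar variances
        nk = r.sum(1) + 1e-9
        pi = nk / n
        mu = (r @ X) / nk[:, None]
        var = np.array([(r[k] * ((X - mu[k]) ** 2).sum(1)).sum()
                        / (nk[k] * d) + 1e-9 for k in range(2)])
    return r.argmax(0)

def cluster_tree(X, idx=None, min_size=10, var_threshold=0.05):
    """Recursively split the collection; a cluster becomes a leaf when it
    is too small to split further or its variance falls below the
    threshold. Leaves with fewer than min_size members are pruned."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) < 2 * min_size or X[idx].var() < var_threshold:
        return [idx] if len(idx) >= min_size else []
    z = em_two_gaussians(X[idx])
    if z.min() == z.max():  # EM failed to split; keep the cluster whole
        return [idx]
    return (cluster_tree(X, idx[z == 0], min_size, var_threshold)
            + cluster_tree(X, idx[z == 1], min_size, var_threshold))
```

Calling cluster_tree on the (dense) vector representations yields the leaf clusters; discarding leaves with fewer than min_size sentences mirrors the pruning step described above.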
The tree can also be used to access the data more quickly. When predicting a sentence, the algorithm first decides on a cluster by greedily moving through the tree from the root until a leaf node is reached. Then it chooses a sentence by the presented algorithm from the data in this cluster.

Table 2 shows characteristic sentences (translated from German into English) from the ten largest clusters found in collections of emails sent by the service centers of an online provider of education services (left hand side) and a large online shop (right hand side). The clusters contain variations of sentences, most of which we would intuitively expect to be used frequently by service centers. Surprisingly, the salutation lines for two individuals ("Dear Mr. X", "Dear Mr. Y") appear in the set of most frequent sentences. However, a closer look at the data reveals that, in the period in which the data were recorded, these customers did in fact ask substantially many questions.

6. EMPIRICAL STUDIES

In this section, we want to investigate the relative benefits of the retrieval and the clustering approach to sentence completion. We also want to shed light on the question whether sentence completion can have a significant beneficial impact in practical applications.

We use two collections: collection A has been provided by an online education provider and contains emails that have been sent to students – or prospective students – by the service center, in response to inquiries. Collection B has been provided by a large online shop and contains emails that have been sent in reply to customer requests. Both collections have about the same size, which is around 10,000 sentences. They differ in the distribution over text; collection A includes more diverse sentences than collection B.

Note that both data sets contain only answer messages, not the corresponding questions. It is intuitively clear that the questions contain some information about the sentences that will most likely occur in the answer. However, the algorithms that we compare base their prediction only on the initial fragment of the sentence that the user is currently typing.

If the predicted sentence x and the sentence y that was actually entered lie in the same equivalence class, we count a correct prediction; if x and y lie in distinct classes, we count an incorrect prediction. We count the precision of a method as the number of correct predictions over the number of both correct and incorrect predictions. Recall is calculated as the number of correct predictions over the number of queries.

Figure 2 shows the precision for collection A for the retrieval algorithm using our index structure and for the algorithm extended with clustering, depending on the initial fragment length. Both algorithms are investigated for two different confidence (similarity) threshold values: a prediction is made only if the similarity is greater than or equal to 0.6 or 1.0, respectively. Figure 3 shows the corresponding recall, again depending on the fragment length. Figures 4 and 5 show the same information for collection B.

[Figure: precision over the fragment length (0–25), with curves for Clustering and the Retrieval algorithm at thresholds 0.6 and 1.0.]

Figure 2: Precision for collection A
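The precision and recall counts just defined can be sketched as follows (a minimal illustration with hypothetical label names; "none" stands for a query where no prediction exceeded the confidence threshold):

```python
def precision_recall(outcomes):
    """outcomes holds one label per query: 'correct' when the prediction
    and the typed sentence fall into the same equivalence class,
    'incorrect' when they fall into distinct classes, and 'none' when
    no prediction exceeded the confidence threshold."""
    correct = outcomes.count("correct")
    incorrect = outcomes.count("incorrect")
    predictions = correct + incorrect
    # precision: correct predictions over all predictions made
    precision = correct / predictions if predictions else 0.0
    # recall: correct predictions over all queries, including withheld ones
    recall = correct / len(outcomes) if outcomes else 0.0
    return precision, recall
```

For instance, four queries with two correct predictions, one incorrect prediction, and one withheld prediction give a precision of 2/3 and a recall of 1/2.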
[Figures 3–5: recall for collection A and precision/recall for collection B over the fragment length (0–25), with curves for Clustering and the Retrieval algorithm at thresholds 0.6 and 1.0.]
[Figure: prediction time in ms over the number of sentences in the model (0–7000).]

Figure 6: Prediction times for collection A

[Figure: prediction times comparing Clustering, Index Structure, and Sequential Search.]

Acknowledgment

This work was supported by the German Science Foundation DFG under grants SCHE 540/10-1 and NU 131/1-1.