
A Classification Approach to Word Prediction*

Yair Even-Zohar and Dan Roth

Department of Computer Science
University of Illinois at Urbana-Champaign
{evenzoha, danr}@uiuc.edu

Abstract

The eventual goal of a language model is to accurately predict the value of a missing word given its context. We present an approach to word prediction that is based on learning a representation for each word as a function of words and linguistic predicates in its context. This approach raises a few new questions that we address. First, in order to learn good word representations it is necessary to use an expressive representation of the context. We present a way that uses external knowledge to generate expressive context representations, along with a learning method capable of handling the large number of features generated this way that can, potentially, contribute to each prediction. Second, since the number of words "competing" for each prediction is large, there is a need to "focus the attention" on a smaller subset of these. We exhibit the contribution of a "focus of attention" mechanism to the performance of the word predictor. Finally, we describe a large scale experimental study in which the approach presented is shown to yield significant improvements in word prediction tasks.

* This research is supported by NSF grants IIS-9801638 and SBR-987345.

1 Introduction

The task of predicting the most likely word based on properties of its surrounding context is the archetypical prediction problem in natural language processing (NLP). In many NLP tasks it is necessary to determine the most likely word, part-of-speech (POS) tag or any other token, given its history or context. Examples include part-of-speech tagging, word-sense disambiguation, speech recognition, accent restoration, word choice selection in machine translation, context-sensitive spelling correction and identifying discourse markers. Most approaches to these problems are based on n-gram-like modeling. Namely, the learning methods make use of features which are conjunctions of typically (up to) three consecutive words or POS tags in order to derive the predictor.

In this paper we show that incorporating additional information into the learning process is very beneficial. In particular, we provide the learner with a rich set of features that combine the information available in the local context along with shallow parsing information. At the same time, we study a learning approach that is specifically tailored for problems in which the potential number of features is very large but only a fairly small number of them actually participates in the decision. Word prediction experiments that we perform show significant improvements in error rate relative to the use of the traditional, restricted, set of features.

Background

The most influential problem in motivating statistical learning applications in NLP tasks is that of word selection in speech recognition (Jelinek, 1998). There, word classifiers are derived from a probabilistic language model which estimates the probability of a sentence s using Bayes rule as the product of conditional probabilities,

    Pr(s) = Pr(w_1, w_2, ..., w_n)
          = ∏_{i=1}^{n} Pr(w_i | w_1, ..., w_{i-1})
          = ∏_{i=1}^{n} Pr(w_i | h_i)

where h_i is the relevant history when predicting w_i. Thus, in order to predict the most likely word in a given context, a global estimation of the sentence probability is derived which, in turn, is computed by estimating the probability of each word given its local context or history. Estimating terms of the form Pr(w | h) is done by assuming some generative probabilistic model, typically using Markov or other independence assumptions, which gives rise to estimating conditional probabilities of n-gram type features (in the word or POS space). Machine learning based classifiers and maximum entropy models which, in principle, are not restricted to features of these forms have used them nevertheless, perhaps under the influence of probabilistic methods (Brill, 1995; Yarowsky, 1994; Ratnaparkhi et al., 1994).
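To make the estimation this formula implies concrete, the following is a minimal sketch assuming a first-order Markov (bigram) approximation with raw maximum-likelihood counts; the function names and the toy corpus are illustrative and are not from the paper, and a real system would smooth or back off rather than use raw counts.

```python
from collections import defaultdict

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigram, bigram = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent
        for w in tokens:
            unigram[w] += 1
        for prev, cur in zip(tokens, tokens[1:]):
            bigram[(prev, cur)] += 1
    return unigram, bigram

def sentence_prob(sent, unigram, bigram):
    """Pr(s) ~= prod_i Pr(w_i | w_{i-1}) under the Markov assumption.

    The history h_i is truncated to the previous word and each term is
    estimated by MLE, which assigns zero probability to unseen bigrams.
    """
    prob = 1.0
    tokens = ["<s>"] + sent
    for prev, cur in zip(tokens, tokens[1:]):
        if unigram[prev] == 0:
            return 0.0
        prob *= bigram[(prev, cur)] / unigram[prev]
    return prob

corpus = [["john", "looked", "at", "the", "clock"],
          ["john", "looked", "at", "the", "board"]]
uni, bi = train_bigram_counts(corpus)
print(sentence_prob(["john", "looked", "at", "the", "clock"], uni, bi))
```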
It has been argued that the information available in the local context of each word should be augmented by global sentence information and even information external to the sentence in order to learn
better classifiers and language models. Efforts in this direction consist of (1) directly adding syntactic information, as in (Chelba and Jelinek, 1998; Rosenfeld, 1996), and (2) indirectly adding syntactic and semantic information, via similarity models; in this case n-gram type features are used whenever possible, and when they cannot be used (due to data sparsity), additional information compiled into a similarity measure is used (Dagan et al., 1999). Nevertheless, the efforts in this direction so far have shown very insignificant improvements, if any (Chelba and Jelinek, 1998; Rosenfeld, 1996). We believe that the main reason for that is that incorporating information sources in NLP needs to be coupled with a learning approach that is suitable for it.

Studies have shown that both machine learning and probabilistic learning methods used in NLP make decisions using a linear decision surface over the feature space (Roth, 1998; Roth, 1999). In this view, the feature space consists of simple functions (e.g., n-grams) over the original data so as to allow for expressive enough representations using a simple functional form (e.g., a linear function). This implies that the number of potential features that the learning stage needs to consider may be very large, and may grow rapidly when increasing the expressivity of the features. Therefore a feasible computational approach needs to be feature-efficient: it needs to tolerate a large number of potential features in the sense that the number of examples required for it to converge should depend mostly on the number of features relevant to the decision, rather than on the number of potential features.

This paper addresses the two issues mentioned above. It presents a rich set of features that is constructed using information readily available in the sentence along with shallow parsing and dependency information. It then presents a learning approach that can use this expressive (and potentially large) intermediate representation and shows that it yields a significant improvement in word error rate for the task of word prediction.

The rest of the paper is organized as follows. In section 2 we formalize the problem, discuss the information sources available to the learning system and how we use those to construct features. In section 3 we present the learning approach, based on the SNoW learning architecture. Section 4 presents our experimental study and results. In section 4.4 we discuss the issue of deciding on a set of candidate words for each decision. Section 5 concludes and discusses future work.

2 Information Sources and Features

Our goal is to learn a representation for each word in terms of features which characterize the syntactic and semantic context in which the word tends to appear. Our features are defined as simple relations over a collection of predicates that capture (some of) the information available in a sentence.

2.1 Information Sources

Definition 1 Let s = <w_1, w_2, ..., w_n> be a sentence in which w_i is the i-th word. Let 𝓘 be a collection of predicates over a sentence s. IS(s)[1], the Information Source(s) available for the sentence s, is a representation of s as a list of predicates I ∈ 𝓘,

    IS(s) = {I_1(w_{1_1}, ..., w_{1_{j_1}}), ..., I_k(w_{k_1}, ..., w_{k_{j_k}})},

where j_i is the arity of the predicate I_i.

[1] We denote IS(s) as IS wherever it is obvious which sentence is referred to, or whenever we want to indicate an Information Source in general.

Example 2 Let s be the sentence <John, X, at, the, clock, to, see, what, time, it, is>. Let 𝓘 = {word, pos, subj-verb}, with the interpretation that word is a unary predicate that returns the value of the word in its domain; pos is a unary predicate that returns the value of the POS of the word in its domain, in the context of the sentence; subj-verb is a binary predicate that returns the value of the two words in its domain if the second is a verb in the sentence and the first is its subject; it returns φ otherwise. Then,

    IS(s) = {word(w_1) = John, ..., word(w_3) = at, ..., word(w_11) = is, pos(w_4) = DET, ..., subj-verb(w_1, w_2) = {John, X}, ...}.

The IS representation of s consists only of the predicates with non-empty values. E.g., pos(w_6) = modal is not part of the IS for the sentence above. subj-verb might not exist at all in the IS even if the predicate is available, e.g., in "The ball was given to Mary."

Clearly the IS representation of s does not contain all the information available to a human reading s; it captures, however, all the input that is available to the computational process discussed in the rest of this paper. The predicates could be generated by any external mechanism, even a learned one. This issue is orthogonal to the current discussion.
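As a minimal sketch of how an information source like the one in Example 2 could be held in code, each predicate instance below is stored as a (name, argument positions, value) tuple and only non-empty instances are kept. The helper name and the POS tags other than DET are illustrative choices of ours, not taken from the paper.

```python
def information_source(sentence, pos_tags, subj_verb_pairs):
    """Build IS(s): the set of predicate instances with non-empty values.

    Each instance is a (predicate_name, argument_positions, value) tuple;
    positions are 1-based, matching the paper's w_1 ... w_n notation.
    """
    IS = set()
    for i, w in enumerate(sentence, start=1):
        IS.add(("word", (i,), (w,)))
        IS.add(("pos", (i,), (pos_tags[i - 1],)))
    # the binary subj-verb predicate is emitted only where it is non-empty
    for subj_i, verb_i in subj_verb_pairs:
        IS.add(("subj-verb", (subj_i, verb_i),
                (sentence[subj_i - 1], sentence[verb_i - 1])))
    return IS

s = ["John", "X", "at", "the", "clock", "to", "see", "what", "time", "it", "is"]
tags = ["NNP", "VB", "IN", "DET", "NN", "TO", "VB", "WP", "NN", "PRP", "VBZ"]
IS = information_source(s, tags, subj_verb_pairs=[(1, 2)])
print(("word", (3,), ("at",)) in IS)                # True
print(("subj-verb", (1, 2), ("John", "X")) in IS)   # True
```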
2.2 Generating Features

Our goal is to learn a representation for each word of interest. Most efficient learning methods known today and, in particular, those used in NLP, make use of a linear decision surface over their feature space (Roth, 1998; Roth, 1999). Therefore, in order to learn expressive representations one needs to compose complex features as a function of the information sources available. A linear function expressed directly in terms of those will not be expressive enough. We now define a language that allows one to define "types" of features[2] in terms of the information sources available to it.

[2] We note that we do not define here the features that will be used in the learning process. These are going to be defined in a data-driven way given the definitions discussed here and the input ISs. The importance of formally defining the "types" is due to the fact that some of these are quantified. Evaluating them on a given sentence might be computationally intractable, and a formal definition helps to flesh out the difficulties and aid in designing the language (Cumby and Roth, 2000).
Definition 3 (Basic Features) Let I ∈ 𝓘 be a k-ary predicate with range R. Denote w^k = (w_{j_1}, ..., w_{j_k}). We define two basic binary relations as follows. For a ∈ R we define

    f(I(w^k), a) = 1 if I(w^k) = a, and 0 otherwise.    (1)

An existential version of the relation is defined by

    f(I(w^k), x) = 1 if there exists a ∈ R such that I(w^k) = a, and 0 otherwise.    (2)

Features, which are defined as binary relations, can be composed to yield more complex relations in terms of the original predicates available in IS.

Definition 4 (Composing Features) Let f_1, f_2 be feature definitions. Then f_and(f_1, f_2), f_or(f_1, f_2) and f_not(f_1) are defined and given the usual semantics:

    f_and(f_1, f_2) = 1 iff f_1 = f_2 = 1, and 0 otherwise;
    f_or(f_1, f_2)  = 1 iff f_1 = 1 or f_2 = 1, and 0 otherwise;
    f_not(f_1)      = 1 iff f_1 = 0, and 0 otherwise.

In order to learn with features generated using these definitions as input, it is important that features generated when applying the definitions on different ISs are given the same identification. In this presentation we assume that the composition operator along with the appropriate IS elements (e.g., Ex. 2, Ex. 9) are written explicitly as the identification of the features. Some of the subtleties in defining the output representation are addressed in (Cumby and Roth, 2000).
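A sketch of one way Definitions 3 and 4 can be realized: a basic feature tests whether a predicate instance takes a particular value (or, for the existential version, any value), and Boolean operators build composed features from basic ones. The representation of an IS as a set of tuples and all function names are our own illustrative choices, not the feature-generation code used in the paper.

```python
# An IS is represented here as a set of (predicate_name, args, value) tuples.
IS_example = {
    ("word", (1,), ("John",)), ("word", (2,), ("X",)), ("word", (3,), ("at",)),
    ("pos", (4,), ("DET",)), ("subj-verb", (1, 2), ("John", "X")),
}

def basic_feature(pred, args, value):
    """f(I(w^k), a): 1 iff the predicate takes value a on args (Def. 3, Eq. 1)."""
    return lambda IS: int((pred, args, value) in IS)

def existential_feature(pred, args):
    """f(I(w^k), x): 1 iff the predicate takes some value on args (Def. 3, Eq. 2)."""
    return lambda IS: int(any(p == pred and a == args for p, a, _ in IS))

def f_and(f1, f2): return lambda IS: int(f1(IS) == 1 and f2(IS) == 1)
def f_or(f1, f2):  return lambda IS: int(f1(IS) == 1 or f2(IS) == 1)
def f_not(f1):     return lambda IS: int(f1(IS) == 0)

# "w_3 is 'at' AND some subj-verb relation holds on (w_1, w_2)"
feat = f_and(basic_feature("word", (3,), ("at",)),
             existential_feature("subj-verb", (1, 2)))
print(feat(IS_example))  # -> 1
```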
2.3 Structured Features

So far we have presented features as relations over IS(s) and allowed for Boolean composition operators. In most cases more information than just a list of active predicates is available. We abstract this using the notion of a structural information source (SIS(s)) defined below. This allows a richer class of feature types to be defined.

2.4 Structured Instances

Definition 5 (Structural Information Source) Let s = <w_1, w_2, ..., w_n>. SIS(s), the Structural Information Source(s) available for the sentence s, is a tuple (s, E_1, ..., E_k) of directed acyclic graphs with s as the set of vertices and the E_i's as sets of edges in s.

Example 6 (Linear Structure) The simplest SIS is the one corresponding to the linear structure of the sentence. That is, SIS(s) = (s, E) where (w_i, w_j) ∈ E iff the word w_i occurs immediately before w_j in the sentence (Figure 1, bottom left part).

In a linear structure (s = <w_1, w_2, ..., w_n>, E), where E = {(w_i, w_{i+1}); i = 1, ..., n-1}, we define the chain

    C(w_j, [l, r]) = {w_{j-l}, ..., w_j, ..., w_{j+r}} ⊆ s.

We can now define a new set of features that makes use of the structural information. Structural features are defined using the SIS. When defining a feature, the naming of nodes in s is done relative to a distinguished node, denoted w_p, which we call the focus word of the feature. Regardless of the arity of the features we sometimes denote the feature f defined with respect to w_p as f(w_p).

Definition 7 (Proximity) Let SIS(s) = (s, E) be the linear structure and let I ∈ 𝓘 be a k-ary predicate with range R. Let w_p be a focus word and C = C(w_p, [l, r]) the chain around it. Then, the proximity features for I with respect to the chain C are defined as:

    f_C(I(w), a) = 1 if I(w) = a, a ∈ R, w ∈ C, and 0 otherwise.    (3)

The second type of feature composition defined using the structure is a collocation operator.

Definition 8 (Collocation) Let f_1, ..., f_k be feature definitions. colloc_C(f_1, f_2, ..., f_k) is a restricted conjunctive operator that is evaluated on a chain C of length k in a graph. Specifically, let C = {w_{j_1}, w_{j_2}, ..., w_{j_k}} be a chain of length k in SIS(s). Then, the collocation feature for f_1, ..., f_k with respect to the chain C is defined as

    colloc_C(f_1, ..., f_k) = 1 if f_i(w_{j_i}) = 1 for all i = 1, ..., k, and 0 otherwise.    (4)

The following example defines features that are used in the experiments described in Sec. 4.
Example 9 Let s be the sentence in Example 2. We define some of the features with respect to the linear structure of the sentence. The word X is used as the focus word and a chain [-10, 10] is defined with respect to it. The proximity features are defined with respect to the predicate word. We get, for example:

    f_C(word) = John;  f_C(word) = at;  f_C(word) = clock.

Collocation features are defined with respect to a chain [-2, 2] centered at the focus word X. They are defined with respect to two basic features f_1, f_2, each of which can be either f(word, a) or f(pos, a). The resulting features include, for example:

    colloc_C(word, word) = {John-X};
    colloc_C(word, word) = {X-at};
    colloc_C(word, pos) = {at-DET}.
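A sketch of the proximity (Def. 7) and collocation (Def. 8) feature types over the linear chain, instantiated in a data-driven way: for a focus position we enumerate the active feature names rather than fixing them in advance. The chain windows [-10, 10] and [-2, 2] follow Example 9; the function names, the string naming of features, and the POS tags other than DET are assumptions of ours.

```python
def chain(sentence, focus, left, right):
    """C(w_focus, [left, right]): positions in the window around the focus (1-based)."""
    lo = max(1, focus - left)
    hi = min(len(sentence), focus + right)
    return list(range(lo, hi + 1))

def proximity_features(sentence, focus, window=10):
    """Active f_C(word) features: one per word occurring in the chain (Def. 7)."""
    return {f"prox_word={sentence[i - 1]}" for i in chain(sentence, focus, window, window)}

def collocation_features(sentence, pos_tags, focus, window=2):
    """Active colloc_C(f1, f2) features over adjacent chain positions (Def. 8).

    Each basic feature is either the word or the pos of a position, so every
    adjacent pair contributes word-word, word-pos, pos-word and pos-pos features.
    """
    feats = set()
    positions = chain(sentence, focus, window, window)
    for a, b in zip(positions, positions[1:]):
        wa, wb = sentence[a - 1], sentence[b - 1]
        pa, pb = pos_tags[a - 1], pos_tags[b - 1]
        feats |= {f"colloc_word_word={wa}-{wb}", f"colloc_word_pos={wa}-{pb}",
                  f"colloc_pos_word={pa}-{wb}", f"colloc_pos_pos={pa}-{pb}"}
    return feats

s = ["John", "X", "at", "the", "clock", "to", "see", "what", "time", "it", "is"]
tags = ["NNP", "VB", "IN", "DET", "NN", "TO", "VB", "WP", "NN", "PRP", "VBZ"]
active = proximity_features(s, focus=2) | collocation_features(s, tags, focus=2)
print("colloc_word_word=John-X" in active)   # True, as in Example 9
print("colloc_word_pos=at-DET" in active)    # True, as in Example 9
```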
2.5 Non-Linear Structure

So far we have described feature definitions which make use of the linear structure of the sentence and yield features which are not too different from standard features used in the literature; e.g., n-grams with respect to pos or word can be defined as colloc for the appropriate chain. Consider now that we are given a general directed acyclic graph G = (s, E) with the sentence s as its nodes. Given a distinguished focus word w_p ∈ s we can define a chain in the graph as we did above for the linear structure of the sentence. Since the definitions given above, Def. 7 and Def. 8, were given for chains, they apply for any chain in any graph. This generalization becomes interesting if we are given a graph that represents a more involved structure of the sentence.

Consider, for example, the graph DG(s) in Figure 1. DG(s) describes the dependency graph of the sentence s. An edge (w_i, w_j) in DG(s) represents a dependency between the two words. In our feature generation language we separate the information provided by the dependency grammar[3] into two parts. The structural information, provided in the left side of Figure 1, is used to generate SIS(s). The labels on the edges are used as predicates and are part of IS(s). Notice that some authors (Yuret, 1998; Berger and Printz, 1998) have used the structural information, but have not used the information given by the labels on the edges as we do.

The following example defines features that are used in the experiments described in Sec. 4.

Example 10 Let s be the sentence in Figure 1 along with its IS that is defined using the predicates word, pos, subj, obj, aux_vrb. A subj-verb feature, f_subj-verb, can be defined as a collocation over chains constructed with respect to the focus word join. Moreover, we can define f_subj-verb to be active also when there is an aux_vrb between the subj and verb, by defining it as a disjunction of two collocation features, the subj-verb and the subj-aux_vrb-verb. Other features that we use are conjunctions of words that occur before the focus verb (here: join) along all the chains it occurs in (here: will, board, as) and collocations of obj and verb.

[3] This information can be produced by a functional dependency grammar (FDG), which assigns each word a specific function, and then structures the sentence hierarchically based on it, as we do here (Tapanainen and Järvinen, 1997), but can also be generated by an external rule-based parser or a learned one.

As a final comment on feature generation, we note that the language presented is used to define "types" of features. These are instantiated in a data driven way given input sentences. A large number of features is created in this way, most of which might not be relevant to the decision at hand; thus, this process needs to be followed by a learning process that can learn in the presence of these many features.

3 The Learning Approach

Our experimental investigation is done using the SNoW learning system (Roth, 1998). Earlier versions of SNoW (Roth, 1998; Golding and Roth, 1999; Roth and Zelenko, 1998; Munoz et al., 1999) have been applied successfully to several natural language related tasks. Here we use SNoW for the task of word prediction; a representation is learned for each word of interest, and these compete at evaluation time to determine the prediction.

3.1 The SNoW Architecture

The SNoW architecture is a sparse network of linear units over a common pre-defined or incrementally learned feature space. It is specifically tailored for learning in domains in which the potential number of features might be very large but only a small subset of them is actually relevant to the decision made. Nodes in the input layer of the network represent simple relations on the input sentence and are used as the input features. Target nodes represent words that are of interest; in the case studied here, each of the word candidates for prediction is represented as a target node. An input sentence, along with a designated word of interest in it, is mapped into a set of features which are active in it; this representation is presented to the input layer of SNoW and propagates to the target nodes. Target nodes are linked via weighted edges to (some of) the input features. Let A_t = {i_1, ..., i_m} be the set of features that are active in an example and are linked to the target node t. Then the linear unit corresponding to t is active iff

    Σ_{i ∈ A_t} w_i^t > θ_t,

where w_i^t is the weight on the edge connecting the i-th feature to the target node t, and θ_t is the threshold for the target node t. In this way, SNoW provides a collection of word representations rather than just discriminators.
[Figure 1: A sentence with a linear and a dependency grammar structure. The example sentence is "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."]

A given example is treated autonomously by each target subnetwork; an example labeled t may be treated as a positive example by the subnetwork for t and as a negative example by the rest of the target nodes. The learning policy is on-line and mistake-driven; several update rules can be used within SNoW. The most successful update rule is a variant of Littlestone's Winnow update rule (Littlestone, 1988), a multiplicative update rule that is tailored to the situation in which the set of input features is not known a priori, as in the infinite attribute model (Blum, 1992). This mechanism is implemented via the sparse architecture of SNoW. That is, (1) input features are allocated in a data driven way - an input node for the feature i is allocated only if the feature i was active in any input sentence, and (2) a link (i.e., a non-zero weight) exists between a target node t and a feature i if and only if i was active in an example labeled t.

One of the important properties of the sparse architecture is that the complexity of processing an example depends only on the number of features active in it, n_a, and is independent of the total number of features, n_t, observed over the life time of the system. This is important in domains in which the total number of features is very large, but only a small number of them is active in each example.

Once target subnetworks have been learned and the network is being evaluated, a decision support mechanism is employed, which selects the dominant active target node in the SNoW unit via a winner-take-all mechanism to produce a final prediction. SNoW is available publicly at http://L2R.cs.uiuc.edu/~cogcomp.html.
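To make the architecture concrete, here is a minimal sketch of a SNoW-style collection of units: one sparse weight map per target word, activation by summing the weights of active features against a threshold, winner-take-all prediction over a confusion set, and a multiplicative, mistake-driven (Winnow-like) update. The class name, the promotion/demotion constants, the default weight and the threshold are illustrative choices of ours, not the actual SNoW implementation or its parameter values.

```python
class SparseWinnowPredictor:
    """A SNoW-like set of linear units over a sparse, growing feature space."""

    def __init__(self, targets, theta=1.0, promote=1.5, demote=0.5, init_w=0.1):
        self.theta = theta
        self.promote = promote
        self.demote = demote
        self.init_w = init_w
        # one sparse weight map per target word; links are created lazily,
        # only for features seen in examples labeled with that target
        self.weights = {t: {} for t in targets}

    def _score(self, target, features):
        w = self.weights[target]
        return sum(w.get(f, 0.0) for f in features)

    def predict(self, features, candidates=None):
        """Winner-take-all over the (possibly restricted) confusion set."""
        candidates = candidates or list(self.weights)
        return max(candidates, key=lambda t: self._score(t, features))

    def update(self, features, label, candidates=None):
        """Mistake-driven training: promote the true target, demote wrong winners."""
        candidates = candidates or list(self.weights)
        w_true = self.weights[label]
        for f in features:                      # allocate links for the true target
            w_true.setdefault(f, self.init_w)
        if self._score(label, features) <= self.theta:
            for f in features:                  # promotion: multiplicative increase
                w_true[f] *= self.promote
        for t in candidates:
            if t != label and self._score(t, features) > self.theta:
                wt = self.weights[t]
                for f in features:              # demotion of over-active competitors
                    if f in wt:
                        wt[f] *= self.demote

clf = SparseWinnowPredictor(targets=["make", "sell"])
clf.update({"prox_word=paper", "colloc_word_word=X-the"}, label="make")
print(clf.predict({"prox_word=paper"}, candidates=["make", "sell"]))  # "make"
```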
1999; Lee, 1999). There, a list of the confusion sets is
One of the important properties of the sparse ar- constructed first, each consists of two different verbs.
chitecture is that the complexity of processing an The verb vl is coupled with v2 provided that they
example depends only on the number of features ac- occur equally likely in the corpus. In the test set,
tive in it, na, and is independent of the total num- every occurrence of vl or v2 was replaced by a set
ber of features, nt, observed over the life time of the {vl, v2} and the classification task was to predict the
system. This is important in domains in which the correct verb. For example, if a confusion set is cre-
total number of features is very large, but only a ated for the verbs "make" and "sell", then the data
small number of them is active in each example. is altered as follows:

Once target subnetworks have been learned and make the paper --+ {make,sell} the paper
the network is being evaluated, a decision sup- sell sensitive data --~ {make,sell} sensitive data
port mechanism is employed, which selects the
dominant active target node in the SNoW unit The evaluated predictor chooses which of the two
via a winner-take-all mechanism to produce a fi- verbs is more likely to occur in the current sentence.
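A sketch of the data transformation just described: every occurrence of either verb in a confusion pair becomes a labeled example whose input is the sentence with the verb masked and whose candidate set is the pair. The tokenization and the example format are our own simplifications.

```python
def make_confusion_examples(sentences, confusion_set):
    """Turn raw sentences into (masked_sentence, candidates, correct_verb) triples.

    e.g. "make the paper" with {make, sell} ->
         (['{make,sell}', 'the', 'paper'], ('make', 'sell'), 'make')
    """
    examples = []
    for sent in sentences:
        for i, token in enumerate(sent):
            if token in confusion_set:
                masked = sent[:i] + ["{" + ",".join(sorted(confusion_set)) + "}"] + sent[i + 1:]
                examples.append((masked, tuple(sorted(confusion_set)), token))
    return examples

corpus = [["make", "the", "paper"], ["sell", "sensitive", "data"]]
for ex in make_confusion_examples(corpus, {"make", "sell"}):
    print(ex)
# (['{make,sell}', 'the', 'paper'], ('make', 'sell'), 'make')
# (['{make,sell}', 'sensitive', 'data'], ('make', 'sell'), 'sell')
```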
In choosing the prediction task in this way, we make sure the task is difficult by choosing between competing words that have the same prior probabilities and the same part of speech. A further advantage of this paradigm is that in future experiments we may choose the candidate verbs so that they have the same sub-categorization, phonetic transcription, etc., in order to imitate the first phase of language modeling used in creating candidates for the prediction task. Moreover, the pre-transformed data provides the correct answer, so that (i) it is easy to generate training data, since no supervision is required, and (ii) it is easy to evaluate the results, assuming that the most appropriate word is provided in the original text.

Results are evaluated using word-error rate (WER). Namely, every time we predict the wrong word it is counted as a mistake.
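The evaluation itself is just the fraction of wrong predictions; a small helper, with names of our choosing:

```python
def word_error_rate(predictions, gold):
    """WER as used here: the share of predictions that differ from the correct word."""
    assert len(predictions) == len(gold)
    mistakes = sum(p != g for p, g in zip(predictions, gold))
    return 100.0 * mistakes / len(gold)

print(word_error_rate(["make", "sell", "make"], ["make", "make", "make"]))  # 33.33...
```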
4.2 Data

We used the Wall Street Journal (WSJ) of the years 88-89. The size of our corpus is about 1,000,000 words. The corpus was divided into 80% training and 20% test. The training and the test data were processed by the FDG parser (Tapanainen and Järvinen, 1997). Only verbs that occur at least 50 times in the corpus were chosen. This resulted in 278 verbs that we split into 139 confusion sets as above. After filtering the examples of verbs which were not in any of the sets we use 73,184 training examples and 19,852 test examples.
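A sketch of the confusion-set construction this section describes: keep verbs with at least 50 occurrences and pair each verb with another verb of roughly equal corpus frequency. The exact pairing procedure is not spelled out in the paper, so the greedy pairing of frequency-sorted neighbors below is only one plausible reading.

```python
from collections import Counter

def build_verb_confusion_sets(verb_occurrences, min_count=50):
    """Pair verbs of (roughly) equal frequency into two-word confusion sets.

    verb_occurrences: iterable of verb tokens from the corpus.
    Returns a list of (v1, v2) pairs; an odd verb left over is dropped.
    """
    counts = Counter(verb_occurrences)
    frequent = [v for v, c in counts.items() if c >= min_count]
    # sort by frequency so adjacent verbs occur about equally often, then pair
    # them greedily (one plausible reading of "coupled provided that they
    # occur equally likely in the corpus")
    frequent.sort(key=lambda v: counts[v])
    return [(frequent[i], frequent[i + 1]) for i in range(0, len(frequent) - 1, 2)]

# toy corpus statistics; the real setup used about 1M words of WSJ text
toy_verbs = ["make"] * 60 + ["sell"] * 58 + ["join"] * 55 + ["buy"] * 54 + ["rare"] * 3
print(build_verb_confusion_sets(toy_verbs))  # [('buy', 'join'), ('sell', 'make')]
```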
4.3 Results

4.3.1 Features

In order to test the advantages of different feature sets we conducted experiments using the following feature sets:

1. Linear features: proximity of window size ±10 words, conjunction of size 2 using window size ±2. The conjunction combines words and parts of speech.

2. Linear + non-linear features: using the linear features defined in (1) along with non-linear features that use the predicates subj, obj, word, pos, the collocations subj-verb, verb-obj linked to the focus verb via the graph structure, and conjunctions of 2 linked words.

The overall number of features we have generated for all 278 target verbs was around 400,000. In all tables below the NB column represents results of the naive Bayes algorithm as implemented within SNoW and the SNoW column represents the results of the sparse Winnow algorithm within SNoW.

                 Bline   NB      SNoW
    Linear       49.6    13.54   11.56
    Non-Linear   49.6    12.25    9.84

    Table 1: Word Error Rate results for linear and non-linear features.

Table 1 summarizes the results of the experiments with the feature sets (1), (2) above. The baseline experiment uses MLE, the majority predictor. In addition, we conducted the same experiment using a trigram with backoff, and the WER is 29.3%. From these results we conclude that using more expressive features helps significantly in reducing the WER. However, one can use those types of features only if the learning method handles a large number of possible features. This emphasizes the importance of the new learning method.

                 Similarity   NB       SNoW
    WSJ data                  54.6%    59.1%
    AP news      47.6%

    Table 2: Comparison of the improvement achieved using similarity methods (Dagan et al., 1999) and using the methods presented in this paper. Results are shown in percentage of improvement in accuracy over the baseline.

Table 2 compares our method to methods that use similarity measures (Dagan et al., 1999; Lee, 1999). Since we could not use the same corpus as in those experiments, we compare the ratio of improvement and not the WER. The baseline in these studies is different, but other than that the experiments are identical. We show an improvement over the best similarity method. Furthermore, we train using only 73,184 examples while (Dagan et al., 1999) train using 587,833 examples. Given our experience with our approach on other data sets we conjecture that we could have improved the results further had we used that many training examples.

4.4 Focus of attention

SNoW is used in our experiments as a multi-class predictor - a representation is learned for each word in a given set and, at evaluation time, one of these is selected as the prediction. The set of candidate words is called the confusion set (Golding and Roth, 1999). Let C be the set of all target words. In previous experiments we artificially generated subsets of C of size 2 in order to evaluate the performance of our methods. In general, however, the question of determining a good set of candidates is interesting in its own right. In the absence of a good method, one might end up choosing a verb from among a larger set of candidates. We would like to study the effects this issue has on the performance of our method.

In principle, instead of working with a single large confusion set C, it might be possible to split C into subsets of smaller size. This process, which we call the focus of attention (FOA), would be beneficial only if we can guarantee that, with high probability,
given a prediction task, we know which confusion set to use, so that the true target belongs to it. In fact, the FOA problem can be discussed separately for the training and test stages.

1. Training: Given our training policy (Sec. 3) every positive example serves as a negative example to all other targets in its confusion set. For a large set C training might become computationally infeasible.

2. Testing: Considering only a small set of words as candidates at evaluation time increases the baseline and might be significant from the point of view of accuracy and efficiency.

To evaluate the advantage of reducing the size of the confusion set in the training and test phases, we conducted the following experiments using the same feature set (linear features as in Table 1).

                            Bline   NB      SNoW
    Train All, Test All     87.44   65.22   65.05
    Train All, Test 2       49.6    13.54   13.15
    Train 2,   Test 2       49.6    13.54   11.55

    Table 3: Evaluating Focus of Attention: Word Error Rate for training and testing using all the words together against using pairs of words.

"Train All" means training on all 278 targets together. "Test All" means that the confusion set is of size 278 and includes all the targets. The results shown in Table 3 suggest that, in terms of accuracy, the significant factor is the confusion set size in the test stage. The effect of the confusion set size on training is minimal (although it does affect training time). We note that for the naive Bayes algorithm the notion of negative examples does not exist, and therefore regardless of the size of the confusion set in training, it learns exactly the same representations. Thus, in the NB column, the confusion set size in training makes no difference.

The application in which a word predictor is used might give a partial solution to the FOA problem. For example, given a prediction task in the context of speech recognition, the phonemes that constitute the word might be known and thus suggest a way to generate a small confusion set to be used when evaluating the predictors.

Tables 4 and 5 present the results of using an artificially simulated speech recognizer based on a method of general phonetic classes. That is, instead of transcribing a word by its phonemes, the word is transcribed by phoneme classes (Jurafsky and Martin, 2000). Specifically, these experiments deviate from the task definition given above: the confusion sets used are of different sizes and they consist of verbs with different prior probabilities in the corpus. Two sets of experiments were conducted that use the phonetic transcription of the words to generate confusion sets.

                            Bline   NB     SNoW
    Train All, Test PC      19.84   11.6   12.3
    Train PC,  Test PC      19.84   11.6   11.3

    Table 4: Simulating Speech Recognizer: Word Error Rate for training and testing with confusion sets determined based on phonetic classes (PC) from a simulated speech recognizer.

In the first experiment (Table 4), the transcription of each word is given by the broad phonetic groups to which the phonemes belong, i.e., nasals, fricatives, etc.[4] For example, the word "b_u_y" is transcribed using phonemes as "b_Y" and here we transcribe it as "P_V1", which stands for "Plosive_Vowel1". This partition results in a partition of the set of verbs into several confusion sets. A few of these confusion sets consist of a single word and therefore have a 100% baseline, which explains the high baseline.

[4] In this experiment, the vowel phonemes were divided into two different groups to account for different sounds.
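A sketch of the phonetic-class grouping: map each verb's phoneme string to a sequence of broad classes and group verbs whose class sequences collide into one confusion set. The phoneme transcriptions and the phoneme-to-class table below are hypothetical stand-ins, since the paper does not reproduce its class inventory (beyond splitting the vowels into two groups); only the grouping logic is the point.

```python
from collections import defaultdict

# hypothetical broad-class table; the paper's actual inventory differs
# (it splits the vowels into two groups, here "V1"/"V2")
PHONEME_CLASS = {
    "b": "P", "p": "P", "t": "P", "d": "P",      # plosives
    "s": "F", "z": "F", "f": "F",                # fricatives
    "m": "N", "n": "N",                          # nasals
    "Y": "V1", "I": "V1", "E": "V2", "O": "V2",  # vowels, two groups
}

def class_transcription(phonemes):
    """e.g. ['b', 'Y'] -> 'P_V1' (cf. "b_u_y" -> "b_Y" -> "P_V1" in the text)."""
    return "_".join(PHONEME_CLASS[p] for p in phonemes)

def phonetic_confusion_sets(verb_phonemes):
    """Group verbs whose broad-class transcriptions coincide."""
    groups = defaultdict(list)
    for verb, phonemes in verb_phonemes.items():
        groups[class_transcription(phonemes)].append(verb)
    return dict(groups)

# hypothetical transcriptions, for illustration only
verbs = {"buy": ["b", "Y"], "tie": ["t", "Y"], "die": ["d", "Y"], "see": ["s", "I"]}
print(phonetic_confusion_sets(verbs))
# {'P_V1': ['buy', 'tie', 'die'], 'F_V1': ['see']}
```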
                            Bline   NB      SNoW
    Train All, Test PC      45.63   26.36   27.54
    Train PC,  Test PC      45.63   26.36   25.55

    Table 5: Simulating Speech Recognizer: Word Error Rate for training and testing with confusion sets determined based on phonetic classes (PC) from a simulated speech recognizer. In this case only confusion sets that have less than a 98% baseline are used, which explains the overall lower baseline.

Table 5 presents the results of a similar experiment in which only confusion sets with multiple words were used, resulting in a lower baseline. As before, Train All means that training is done with all 278 targets together while Train PC means that the PC confusion sets were used also in training. We note that for the case of SNoW, used here with the sparse Winnow algorithm, the size of the confusion set in training has some, although small, effect. The reason is that when training is done with all the target words, all the examples in which a target word does not occur are used as negative examples for its representation. When a smaller confusion set is used, the negative examples are more likely to be "true" negatives.

5 Conclusion

This paper presents a new approach to word prediction tasks. For each word of interest, a word representation is learned as a function of a common, but
potentially very large, set of expressive (relational) features. Given a prediction task (a sentence with a missing word) the word representations are evaluated on it and compete for the most likely word to complete the sentence.

We have described a language that allows one to define expressive feature types and have exhibited experimentally the advantage of using those on the word prediction task. We have argued that the success of this approach hinges on the combination of a large set of expressive features with a learning approach that can tolerate it and converges quickly despite the large dimensionality of the data. We believe that this approach would be useful for other disambiguation tasks in NLP.

We have also presented a preliminary study of a reduction in the confusion set size and its effects on the prediction performance. In future work we intend to study ways to determine the appropriate confusion set that make use of the properties of the task at hand.

Acknowledgments

We gratefully acknowledge helpful comments and programming help from Chad Cumby.

References

A. Berger and H. Printz. 1998. Recognition performance of a large-scale dependency-grammar language model. In Int'l Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia.

A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386.

E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.

C. Chelba and F. Jelinek. 1998. Exploiting syntactic structure for language modeling. In COLING-ACL '98.

C. Cumby and D. Roth. 2000. Relational representations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To appear.

I. Dagan, L. Lee, and F. Pereira. 1999. Similarity-based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69.

A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language.

F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.

D. Jurafsky and J. H. Martin. 2000. Speech and Language Processing. Prentice Hall.

L. Lee and F. Pereira. 1999. Distributional similarity models: Clustering vs. nearest neighbors. In ACL 99, pages 33-40.

L. Lee. 1999. Measures of distributional similarity. In ACL 99, pages 25-32.

N. Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.

M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, June.

A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, NJ, March.

R. Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer, Speech and Language, 10.

D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142.

D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. National Conference on Artificial Intelligence, pages 806-813.

D. Roth. 1999. Learning in natural language. In Proc. of the International Joint Conference on Artificial Intelligence, pages 898-904.

P. Tapanainen and T. Järvinen. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, Washington DC.

D. Yarowsky. 1994. Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. In Proc. of the Annual Meeting of the ACL, pages 88-95.

D. Yuret. 1998. Discovery of Linguistic Relations Using Lexical Attraction. Ph.D. thesis, MIT.
