…tree kernel when a vector composition function with specific ideal properties is used. We then propose two approximations of the ideal vector composition function and we study their properties. Finally, we empirically investigate the execution time of DTKs and how well these new kernels approximate original tree kernels. We show that DTKs are faster, correlate with tree kernels, and obtain a statistically similar performance in two natural language processing tasks.

The rest of the paper is organized as follows. Section 2 introduces the notation, the basic idea, and the expected properties for DTKs. Section 3 introduces the DTKs and proves their properties. Section 4 compares the complexity of DTKs with other tree kernels. Section 5 empirically investigates these new kernel algorithms. Finally, Section 6 draws some conclusions.
2. Challenges for Distributed Tree Kernels

2.1. Notation and Basic Idea

Tree kernels (TK) (Collins & Duffy, 2001) have been proposed as efficient methods to implicitly compute dot products in feature spaces R^m of tree fragments. A direct computation in these high-dimensional spaces is impractical. Given two trees T1 and T2 in T, tree kernels TK(T1, T2) perform weighted counts of the common subtrees. By construction, these counts are the dot products of the vectors representing the trees, T~1 and T~2 in R^m, i.e.:

    TK(T1, T2) = T~1 · T~2    (1)

Vectors T~ encode trees T as forests of active tree fragments F(T). Each dimension e~_i of R^m corresponds to a tree fragment τ_i. The trivial weighting scheme assigns ω_i = 1 to dimension e~_i if tree fragment τ_i is a subtree of the original tree T, and ω_i = 0 otherwise. Different weighting schemes are possible and used. Function I, that maps trees in T to vectors in R^m, is:

    T~ = I(T) = Σ_{i=1..m} ω_i ι(τ_i) = Σ_{i=1..m} ω_i e~_i    (2)

where ι maps tree fragments into the related vectors of the standard orthogonal basis of R^m, i.e., e~_i = ι(τ_i).

To reduce the computational complexity of tree kernels, we want to explore the possibility of embedding vectors T~ ∈ R^m into smaller vectors T⇀ ∈ R^d, with d ≪ m, to allow for an approximated but faster and explicit computation of these kernel functions. The direct embedding f : R^m → R^d is, in principle, possible with techniques like singular value decomposition or random indexing (Sahlgren, 2005), but it is again impractical due to the huge dimension of R^m.

Then, our basic idea is to look for a function F̂ : T → R^d that directly maps trees T into small vectors T⇀. We call these latter distributed trees (DT), in line with Distributed Representations (Hinton et al., 1986). The computation of similarity over distributed trees is the distributed tree kernel (DTK):

    DTK(T1, T2) = T⇀1 · T⇀2 = F̂(T1) · F̂(T2)    (3)

As the two distributed trees are in the low-dimensional space R^d, the dot product computation, having constant complexity, is extremely efficient. The computation of function F̂ is more expensive than the actual DTK, but it is done once for each tree and outside of the learning algorithms. We also propose a recursive algorithm with linear complexity to perform this computation.

2.2. Distributed Trees, Distributed Tree Fragments, and Expected Properties

Distributed tree kernels are faster than tree kernels. We here examine the properties required of F̂ so that DTKs are also approximated computations of TKs, i.e.:

    DTK(T1, T2) ≈ TK(T1, T2)    (4)

To derive these properties and describe function F̂, we show the relations between: the traditional function I : T → R^m that maps trees into forests of tree fragments in the tree fragment feature space; ι : T → R^m, that maps tree fragments into the standard orthogonal basis of R^m; the linear embedding function f : R^m → R^d that maps T~ into a smaller vector T⇀ = f(T~); and our newly defined function F̂.

Equation 2 presents vectors T~ with respect to the standard orthonormal basis E = {e~_1 ... e~_m} = {ι(τ_1) ... ι(τ_m)} of R^m. Then, according to this reading, we can rewrite the distributed tree T⇀ ∈ R^d as:

    T⇀ = f(T~) = f(Σ_i ω_i e~_i) = Σ_i ω_i f(e~_i) = Σ_i ω_i τ⇀_i

where each τ⇀_i represents tree fragment τ_i in the new space.

The linear function f works as a sort of approximated basis transformation, mapping the vectors e~ of the standard basis E into approximated vectors τ⇀ that should represent them. As τ⇀_i represents a single tree fragment τ_i, we call it a distributed tree fragment (DTF). The set of vectors Ẽ = {τ⇀_1 ... τ⇀_m} should be the approximated orthonormal basis of R^m embedded in R^d. Then, these two properties should hold:

Property 1 (Nearly Unit Vectors) A distributed tree fragment τ⇀ representing a tree fragment τ is a nearly unit vector: 1 − ε < ||τ⇀|| < 1 + ε
Proof The proof is done by induction on the structure of τ_a and τ_b.

Basic step. [...]

¹ We assume the degree of commutativity k as the lowest number such that ⊗ is non-commutative, i.e., a⇀ ⊗ b⇀ ≠ b⇀ ⊗ a⇀, and, for any j < k, a⇀ ⊗ c⇀_1 ⊗ ... ⊗ c⇀_j ⊗ b⇀ ≠ b⇀ ⊗ c⇀_1 ⊗ ... ⊗ c⇀_j ⊗ a⇀.

3.2.1. Recursive Function

The structural recursive formulation for the computation of distributed trees T⇀ is the following:

    T⇀ = Σ_{n∈N(T)} s(n)    (6)
where N(T) is the node set of tree T and s(n) represents the sum of the distributed vectors for the subtrees of T rooted in node n. Function s(n) is recursively defined as follows:

    s(n) = 0⇀ if n is a terminal node;
    s(n) = √λ (n⇀ ⊗ (c⇀_1 + s(c_1)) ⊗ ... ⊗ (c⇀_m + s(c_m))) if n is a node with children c_1 ... c_m.

As for the classic TK, the decay factor λ decreases the weight of large tree fragments in the final kernel value. With dynamic programming, the time complexity of this function is linear, O(|N(T)|), and the space complexity is d (where d is the size of the vectors in R^d).

3.2.2. The Recursive Function Computes Distributed Trees

The overall theorem we need is the following.

Theorem 3 Given the ideal vector composition function ⊗, the equivalence between equation (5) and equation (6) holds, i.e.:

    T⇀ = Σ_{n∈N(T)} s(n) = Σ_{τ_i∈F(T)} ω_i f̂(τ_i)

Step. Let n be a node with children c_1, ..., c_m. The inductive hypothesis is then s(c_i) = Σ_{τ∈R(c_i)} √(λ^|τ|) f̂(τ). Applying the inductive hypothesis, the definition of s(n), and Property 2.3, we have

    s(n) = √λ (n⇀ ⊗ (c⇀_1 + s(c_1)) ⊗ ... ⊗ (c⇀_m + s(c_m)))
         = √λ (n⇀ ⊗ (c⇀_1 + Σ_{τ_1∈R(c_1)} √(λ^|τ_1|) f̂(τ_1)) ⊗ ... ⊗ (c⇀_m + Σ_{τ_m∈R(c_m)} √(λ^|τ_m|) f̂(τ_m)))
         = √λ (n⇀ ⊗ Σ_{τ_1∈T_1} √(λ^|τ_1|) f̂(τ_1) ⊗ ... ⊗ Σ_{τ_m∈T_m} √(λ^|τ_m|) f̂(τ_m))
         = Σ_{(n,τ_1,...,τ_m)∈{n}×T_1×...×T_m} √(λ^(1+|τ_1|+...+|τ_m|)) n⇀ ⊗ f̂(τ_1) ⊗ ... ⊗ f̂(τ_m)

where T_i is the set R(c_i) ∪ {c_i}. Thus, by means of Lemma 4 and the definition of f̂, we can conclude that s(n) = Σ_{τ∈R(n)} √(λ^|τ|) f̂(τ).
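To make the recursion concrete, here is a runnable sketch in Python. The node vectors, the toy trees, and the choice of ⊗ are assumptions for illustration: a shuffled circular convolution stands in for the ideal composition function, whereas the paper proposes and analyzes its own approximations of ⊗.

```python
# Runnable sketch of the recursion of Section 3.2.1. Assumptions: node vectors
# are random nearly orthonormal vectors; the ideal composition function ⊗ is
# stood in by a shuffled circular convolution (the paper proposes its own
# approximations of ⊗); the trees are invented for the demo.
import numpy as np

d = 4096                               # dimension of R^d
lam = 0.4                              # decay factor λ
rng = np.random.default_rng(0)
p1, p2 = rng.permutation(d), rng.permutation(d)  # fixed shuffles: ⊗ non-commutative

def compose(a, b):
    """Stand-in for ⊗: circular convolution of independently shuffled inputs."""
    return np.real(np.fft.ifft(np.fft.fft(a[p1]) * np.fft.fft(b[p2])))

_node_vec = {}
def vec(label):
    """Nearly unit, nearly orthogonal random vector for a node label."""
    if label not in _node_vec:
        _node_vec[label] = rng.standard_normal(d) / np.sqrt(d)
    return _node_vec[label]

def s(node, acc):
    """Returns s(node); accumulates s(n) for every node n of the subtree into acc."""
    label, children = node
    if not children:                   # terminal node: s(n) = 0
        return np.zeros(d)
    v = vec(label)
    for child in children:
        v = compose(v, vec(child[0]) + s(child, acc))
    v *= np.sqrt(lam)                  # decay: larger fragments weigh less
    acc += v
    return v

def distributed_tree(tree):
    """F̂(T) = Σ_{n∈N(T)} s(n), one bottom-up pass (eq. 6)."""
    acc = np.zeros(d)
    s(tree, acc)
    return acc

def dtk(t1, t2):
    """DTK(T1, T2) = F̂(T1) · F̂(T2) (eq. 3)."""
    return distributed_tree(t1) @ distributed_tree(t2)

# Example: two parse trees sharing the NP subtree.
t1 = ("S", [("NP", [("DT", [("the", [])]), ("NN", [("dog", [])])]),
            ("VP", [("VB", [("barks", [])])])])
t2 = ("NP", [("DT", [("the", [])]), ("NN", [("dog", [])])])
print(dtk(t1, t2))
```

Each s(n) is computed exactly once in the bottom-up pass, matching the linear O(|N(T)|) time bound stated above.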
…values. Test trees were taken from the QC corpus in table (a) and the RTE corpus in table (b).

For QC, we used a standard question classification training and test set², where the test set consists of the 500 TREC 2001 test questions. To measure the task performance, we used a question multi-classifier built by combining n binary SVMs according to the ONE-vs-ALL scheme, where the final output class is the one associated with the most probable prediction.

² The QC set is available at http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/

[Figure 3: Accuracy on the Question Classification task as a function of λ, comparing TK with the two DTK variants (the DTKs rely on vectors of d = 8192).]

For RTE, we considered the corpora ranging from the first challenge to the fifth (Dagan et al., 2006), except for the fourth, which has no training set. These sets are referred to as RTE1-5. The dev/test distribution for RTE1-3 and RTE5 is respectively 567/800, 800/800, 800/800, and 600/600 T-H pairs. We used these sets for the traditional task of pair-based entailment recognition, where a text-hypothesis pair p = (t, h) is assigned a positive or negative entailment class. For our comparative analysis, we use the syntax-based approach described in (Moschitti & Zanzotto, 2007) with two kernel function schemes: (1) P_KS(p1, p2) = KS(t1, t2) + KS(h1, h2); and (2) P_KS+Lex(p1, p2) = Lex(t1, h1) Lex(t2, h2) + KS(t1, t2) + KS(h1, h2). Lex is a standard similarity feature between the text and the hypothesis, and KS is realized with TK or with one of the two DTK variants. In the plots, the different P_KS kernels are referred to as TK and the two DTK variants, whereas the different P_KS+Lex kernels are referred to as TK+Lex and the two DTK+Lex variants.

[Figure 4: Accuracy on the Recognizing Textual Entailment task as a function of λ, comparing TK(+Lex) with the two DTK(+Lex) variants (the DTKs rely on vectors of d = 8192). Each point is the average accuracy over the 4 data sets.]
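The two kernel schemes translate directly into code. A small sketch follows; K_S and Lex are assumed callables (K_S is TK or a DTK over parse trees, Lex a standard text-hypothesis similarity), and each pair p is a (t, h) tuple:

```python
# Direct transcription of the two pair-kernel schemes (1) and (2). K_S and Lex
# are assumed callables: K_S is TK or a DTK over parse trees, Lex a standard
# text-hypothesis similarity; each pair p is a (t, h) tuple.
def p_ks(p1, p2, K_S):
    (t1, h1), (t2, h2) = p1, p2
    return K_S(t1, t2) + K_S(h1, h2)

def p_ks_lex(p1, p2, K_S, Lex):
    (t1, h1), (t2, h2) = p1, p2
    return Lex(t1, h1) * Lex(t2, h2) + K_S(t1, t2) + K_S(h1, h2)
```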
5.2.2. Correlation between TK and DTK

As a first measure of the ability of DTK to emulate the classic TK, we considered the Spearman's correlation of their values, computed on the parse trees of the sentences contained in the QC and RTE corpora. Table 2 reports the results and shows that DTK does not adequately approximate TK for λ = 1. This highlights the difficulty of DTKs in correctly handling pairs of large active forests, i.e., trees with many subtrees whose weights are around 1. The correlation improves dramatically when the parameter λ is reduced. We can conclude that DTKs efficiently approximate TK for small values of λ.

We performed both the QC and RTE experiments for different values of the parameter λ. Results are shown in Fig. 3 and 4 for the QC and RTE tasks respectively.

For QC, DTK leads to worse performance with respect to TK, but the gap is narrower for small values λ ≤ 0.4 (with one of the two DTK variants better than the other). These values produce better performance for the task. For RTE, for λ ≤ 0.4, both DTK variants are similar to TK. Differences are not statistically significant, except for λ = 0.4, where one DTK variant behaves better than TK (with p < 0.1). Statistical significance is computed using the two-sample Student's t-test. Both DTK+Lex variants are statistically similar to TK+Lex for any value of λ. DTKs are a good approximation of TKs for λ ≤ 0.4, which are the values where TKs have the best performances in the tasks.
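The correlation measurement of Section 5.2.2 can be sketched as follows (the pairing of trees and the kernel callables are assumptions; the paper does not spell this code out):

```python
# Sketch of the Section 5.2.2 measurement: score the same tree pairs with TK
# and with a DTK, then rank-correlate the two score lists.
from scipy.stats import spearmanr

def tk_dtk_correlation(tree_pairs, tk, dtk):
    """tree_pairs: iterable of (T1, T2); tk, dtk: kernel callables."""
    tk_scores = [tk(t1, t2) for t1, t2 in tree_pairs]
    dtk_scores = [dtk(t1, t2) for t1, t2 in tree_pairs]
    rho, pvalue = spearmanr(tk_scores, dtk_scores)
    return rho, pvalue
```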
5.3. Average computation time