
Distributed Tree Kernels

Fabio Massimo Zanzotto    FABIO.MASSIMO.ZANZOTTO@UNIROMA2.IT
University of Rome Tor Vergata, Viale del Politecnico, 1, 00133 Rome, Italy

Lorenzo Dell'Arciprete    LORENZO.DELLARCIPRETE@GMAIL.COM
University of Rome Tor Vergata, Viale del Politecnico, 1, 00133 Rome, Italy

Abstract

In this paper, we propose the distributed tree kernels (DTK) as a novel method to reduce the time and space complexity of tree kernels. Using a linear-complexity algorithm to compute vectors for trees, we embed feature spaces of tree fragments in low-dimensional spaces where the kernel computation is directly done with the dot product. We show that DTKs are faster, correlate with tree kernels, and obtain a statistically similar performance in two natural language processing tasks.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

1. Introduction

Trees are fundamental data structures used to represent very different objects such as proteins, HTML documents, or interpretations of natural language utterances. Thus, many areas, for example biology (Vert, 2002; Hashimoto et al., 2008), computer security (Dussel et al., 2008), and natural language processing (Collins & Duffy, 2001; Gildea & Jurafsky, 2002; Pradhan et al., 2005; MacCartney et al., 2006), have fostered extensive research in methods for learning classifiers that leverage these data structures.

Tree kernels (TK), first introduced in (Collins & Duffy, 2001) as specific convolution kernels (Haussler, 1999), are widely used to fully exploit tree-structured data when learning classifiers. Different tree kernels modeling different feature spaces have been proposed (see (Shin et al., 2011) for a survey), but a primary research focus is the reduction of their execution time. Kernel machines compute TK functions many times during learning and classification. The original tree kernel algorithm (Collins & Duffy, 2001), which relies on dynamic programming techniques, has a quadratic time and space complexity with respect to the size of the input trees. Execution time and space occupation are still affordable for parse trees of natural language sentences, which hardly go beyond hundreds of nodes (Rieck et al., 2010). But these tree kernels hardly scale to large training and application sets.

As the worst-case complexity of TKs is hard to improve, the biggest effort has been devoted to controlling the average execution time of TK algorithms. Three directions have mainly been explored. The first direction is the exploitation of some specific characteristics of trees. For example, it is possible to demonstrate that the execution time of the original algorithm becomes linear on average for parse trees of natural language sentences (Moschitti, 2006). Yet, the tree kernel still has to be computed over the full underlying feature space, and the space occupation is still quadratic. The second explored direction is the reduction of the underlying feature space of tree fragments to control the execution time by approximating the kernel function. The feature selection is done in the learning phase. Then, for classification, either the selection is directly encoded in the kernel computation by selecting subtrees headed by specific node labels (Rieck et al., 2010), or the smaller selected space is made explicit (Pighin & Moschitti, 2010). In these cases, the beneficial effect is only during classification, and learning is overloaded with feature selection. The third direction exploits dynamic programming on the whole training and application sets of instances (Shin et al., 2011). Kernel functions are reformulated to be computed using partial kernel computations done for other pairs of trees. As with any dynamic programming technique, this approach transfers time complexity into space complexity.

In this paper, we propose the distributed tree kernels (DTK), introduced in (Zanzotto & Dell'Arciprete, 2011), as a novel method to reduce the time and space complexity of tree kernels. The idea is to embed feature spaces of tree fragments in low-dimensional spaces, where the computation is approximated but its worst-case complexity is linear with respect to the dimension of the space. As a direct embedding is impractical, we propose a recursive algorithm with linear complexity to compute reduced vectors for trees in the low-dimensional space. We formally show that the dot product among reduced vectors approximates the original tree kernel when a vector composition function with specific ideal properties is used. We then propose two approximations of the ideal vector composition function and we study their properties. Finally, we empirically investigate the execution time of DTKs and how well these new kernels approximate the original tree kernels. We show that DTKs are faster, correlate with tree kernels, and obtain a statistically similar performance in two natural language processing tasks.
The rest of the paper is organized as follows. Section 2 introduces the notation, the basic idea, and the expected properties for DTKs. Section 3 introduces the DTKs and proves their properties. Section 4 compares the complexity of DTKs with that of other tree kernels. Section 5 empirically investigates these new kernel algorithms. Finally, Section 6 draws some conclusions.

2. Challenges for Distributed Tree Kernels

2.1. Notation and Basic Idea

Tree kernels (TK) (Collins & Duffy, 2001) have been proposed as efficient methods to implicitly compute dot products in feature spaces R^m of tree fragments. A direct computation in these high-dimensional spaces is impractical. Given two trees, T_1 and T_2 in 𝕋, tree kernels TK(T_1, T_2) perform weighted counts of the common subtrees. By construction, these counts are the dot products of the vectors representing the trees, T⃗_1 and T⃗_2 in R^m, i.e.:

  TK(T_1, T_2) = T⃗_1 · T⃗_2    (1)

Vectors T⃗ encode trees T as forests of active tree fragments F(T). Each dimension e⃗_i of R^m corresponds to a tree fragment τ_i. The trivial weighting scheme assigns ω_i = 1 to dimension e⃗_i if tree fragment τ_i is a subtree of the original tree T and ω_i = 0 otherwise. Different weighting schemes are possible and used. Function I, that maps trees in 𝕋 to vectors in R^m, is:

  T⃗ = I(T) = Σ_{i=1}^{m} ω_i ι(τ_i) = Σ_{i=1}^{m} ω_i e⃗_i    (2)

where ι maps tree fragments into the related vectors of the standard orthogonal basis of R^m, i.e., e⃗_i = ι(τ_i).

To reduce the computational complexity of tree kernels, we want to explore the possibility of embedding vectors T⃗ ∈ R^m into smaller vectors T⃑ ∈ R^d, with d ≪ m, to allow for an approximated but faster and explicit computation of these kernel functions. The direct embedding f : R^m → R^d is, in principle, possible with techniques like singular value decomposition or random indexing (Sahlgren, 2005), but it is again impractical due to the huge dimension of R^m.

Then, our basic idea is to look for a function F̂ : 𝕋 → R^d that directly maps trees T into small vectors T⃑. We call these latter distributed trees (DT), in line with Distributed Representations (Hinton et al., 1986). The computation of similarity over distributed trees is the distributed tree kernel (DTK):

  DTK(T_1, T_2) ≜ T⃑_1 · T⃑_2 = F̂(T_1) · F̂(T_2)    (3)

As the two distributed trees are in the low-dimensional space R^d, the dot product computation, having constant complexity, is extremely efficient. Computation of function F̂ is more expensive than the actual DTK, but it is done once for each tree and outside of the learning algorithms. We also propose a recursive algorithm with linear complexity to perform this computation.
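To make the intended workflow concrete, the following minimal sketch (ours, not from the paper; the function names and the stand-in implementation of F̂ are illustrative assumptions) shows how Equation 3 is meant to be used: each tree is mapped once to a vector in R^d, and every subsequent kernel evaluation reduces to a constant-time dot product.

```python
import numpy as np

d = 8192  # dimension of the reduced space R^d used in the paper's experiments


def distributed_tree(tree, dim=d):
    """Stand-in for F-hat of Equation 3: one tree in, one vector of R^d out.

    A real implementation is the recursive algorithm of Section 3.2; here we
    only fix the interface with a deterministic pseudo-random placeholder.
    """
    seed = sum(ord(c) for c in tree)           # deterministic, illustrative only
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


def dtk(dt1, dt2):
    """Distributed tree kernel: a dot product between two precomputed vectors."""
    return float(np.dot(dt1, dt2))


# F-hat is applied once per tree, outside of the learning algorithm ...
corpus = ["(A (B W1)(C (D W2)(E W3)))", "(A (B W1)(C W2))"]
dts = [distributed_tree(t) for t in corpus]

# ... so the kernel machine only pays a constant-cost dot product per pair.
gram = [[dtk(u, v) for v in dts] for u in dts]
print(np.round(gram, 3))
```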
2.2. Distributed Trees, Distributed Tree Fragments, and Expected Properties

Distributed tree kernels are faster than tree kernels. We here examine the properties required of F̂ so that DTKs are also approximated computations of TKs, i.e.:

  DTK(T_1, T_2) ≈ TK(T_1, T_2)    (4)

To derive these properties and describe function F̂, we show the relations between the traditional function I : 𝕋 → R^m that maps trees into forests of tree fragments in the tree fragment feature space, the function ι : 𝕋 → R^m that maps tree fragments into the standard orthogonal basis of R^m, the linear embedding function f : R^m → R^d that maps T⃗ into a smaller vector T⃑ = f(T⃗), and our newly defined function F̂.

Equation 2 presents vectors T⃗ with respect to the standard orthonormal basis E = {e⃗_1 ... e⃗_m} = {τ⃗_1 ... τ⃗_m} of R^m. Then, according to this reading, we can rewrite the distributed tree T⃑ ∈ R^d as:

  T⃑ = f(T⃗) = f(Σ_i ω_i τ⃗_i) = Σ_i ω_i f(τ⃗_i) = Σ_i ω_i τ⃑_i

where each τ⃑_i represents tree fragment τ_i in the new space. The linear function f works as a sort of approximated basis transformation, mapping vectors τ⃗ of the standard basis E into approximated vectors τ⃑ that should represent them. As τ⃑_i represents a single tree fragment τ_i, we call it a distributed tree fragment (DTF). The set of vectors Ẽ = {τ⃑_1 ... τ⃑_m} should be the approximated orthonormal basis of R^m embedded in R^d. Then, these two properties should hold:

Property 1 (Nearly Unit Vectors) A distributed tree fragment τ⃑ representing a tree fragment τ is a nearly unit vector: 1 − ε < ||τ⃑|| < 1 + ε
Property 2 (Nearly Orthogonal Vectors) Given two different tree fragments τ_1 and τ_2, their distributed vectors are nearly orthogonal: if τ_1 ≠ τ_2, then |τ⃑_1 · τ⃑_2| < ε

As the vectors in Ẽ represent the basic tree fragments τ, the idea is that τ⃑ can be obtained directly from tree fragment τ by means of a function f̂(τ) = f(ι(τ)) that composes f and ι. Using this function to obtain distributed tree fragments τ⃑, distributed trees T⃑ can be obtained as follows:

  T⃑ = F̂(T) = Σ_{τ_i ∈ F(T)} ω_i f̂(τ_i)    (5)

This latter equation is presented with respect to the active tree fragment forest F(T) of T, neglecting vectors where ω_i = 0. It is easy to show that, if Properties 1 and 2 hold for function f̂, distributed tree kernels approximate tree kernels (see Equation 4).

3. Computing Distributed Tree Fragments and Distributed Trees

The Johnson-Lindenstrauss Lemma (JLL) (Johnson & Lindenstrauss, 1984) guarantees that the embedding function f : R^m → R^d exists. It also points out the relation between the desired approximation ε of Property 2 (Nearly Orthogonal Vectors) and the required dimension d of the target space, for a certain value of the dimension m. This relation affects how well DTKs approximate TKs (Equation 4).
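For reference, the quantitative form of this relation is not spelled out here; the bound proved in (Dasgupta & Gupta, 1999), which the paper cites, reads as follows (our addition), with N the number of points to be embedded — here the tree fragments actually occurring in the data:

```latex
% Johnson-Lindenstrauss bound in the form proved by Dasgupta & Gupta (1999).
% A random linear map f : R^m -> R^d preserves all pairwise squared distances
% among N points up to a factor (1 +/- epsilon) whenever
\[
  d \;\ge\; \frac{4 \ln N}{\epsilon^2/2 \, - \, \epsilon^3/3},
\]
% so the target dimension grows with log N and 1/epsilon^2, but is
% independent of the original dimension m.
```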
Knowing that f exists, we are presented with the following issues:

- building a function f̂ that directly computes the distributed tree fragment τ⃑_i from tree fragment τ_i (Sec. 3.1);
- showing that distributed trees T⃑ = F̂(T) can be computed efficiently (Sec. 3.2).

Once the above issues are solved, we need to empirically show that Equation (4) is satisfied and that computing DTKs is more efficient than computing TKs. These latter points are discussed in the experimental section.

3.1. Computing Distributed Tree Fragments from Trees

This section introduces function f̂ for distributed tree fragments and shows that, using an ideal vector composition function ⊗, the proposed function f̂(τ_i) satisfies Properties 1 and 2.

3.1.1. Representing Trees as Vectors

The basic blocks needed to represent trees are their nodes. We then start from a set N ⊂ R^d of nearly orthonormal vectors representing nodes. Each node n is mapped to a vector n⃑ ∈ N. To ensure that these basic vectors are statistically nearly orthonormal, their elements (n⃑)_i are randomly drawn from a normal distribution N(0, 1) and they are normalized so that ||n⃑|| = 1 (cf. the demonstration of the Johnson-Lindenstrauss Lemma in (Dasgupta & Gupta, 1999)). Actual node vectors depend on the node labels, so that n⃑_1 = n⃑_2 if L(n_1) = L(n_2), where L() is the node label.

Figure 1. A sample tree: (A (B W1)(C (D W2)(E W3))).

Tree structure can be univocally represented in a flat format using a parenthetical notation. For example, the tree in Fig. 1 is represented by the sequence (A (B W1)(C (D W2)(E W3))). This notation corresponds to a depth-first visit of the tree, augmented with parentheses so that the tree structure is determined as well.
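As an illustration of this setup (our sketch, not the authors' code), trees can be encoded as nested tuples mirroring the parenthetical notation, and each distinct label can be lazily assigned a normalized random vector, so that equal labels share a vector and different labels get nearly orthogonal vectors for large d:

```python
import numpy as np

d = 8192
rng = np.random.default_rng(0)
_node_vectors = {}  # one nearly orthonormal vector per node label


def node_vector(label):
    """Random unit vector for a node label; identical labels reuse the same vector."""
    if label not in _node_vectors:
        v = rng.standard_normal(d)                     # elements drawn from N(0, 1)
        _node_vectors[label] = v / np.linalg.norm(v)   # normalized so that ||v|| = 1
    return _node_vectors[label]


# The tree of Figure 1, (A (B W1)(C (D W2)(E W3))), as nested tuples:
# a node is (label, child_1, ..., child_k) and a leaf is (label,).
FIG1 = ("A", ("B", ("W1",)), ("C", ("D", ("W2",)), ("E", ("W3",))))

# Two random unit vectors in R^8192 are nearly orthogonal with high probability.
print(abs(node_vector("A") @ node_vector("B")))   # close to 0
print(node_vector("A") @ node_vector("A"))        # 1 (same label, same vector)
```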
Replacing the nodes with their corresponding vectors and introducing a vector composition function ⊗ : R^d × R^d → R^d, the above formulation can be seen as a mathematical expression that defines a representative vector for a whole tree. The example tree would then be represented by the vector τ⃑ = (A⃑ ⊗ (B⃑ ⊗ W⃑1) ⊗ (C⃑ ⊗ (D⃑ ⊗ W⃑2) ⊗ (E⃑ ⊗ W⃑3))).

Then, we formally define function f̂(τ) as follows:

Definition 1 Let τ be a tree and N the set of nearly orthogonal vectors for node labels. We recursively define f̂(τ) as:

- f̂(n) = n⃑ if n is a terminal node, where n⃑ ∈ N
- f̂(τ) = (n⃑ ⊗ f̂(c_1 ... c_k)) if n is the root of τ and c_i are its children subtrees
- f̂(c_1 ... c_k) = (f̂(c_1) ⊗ f̂(c_2 ... c_k)) if c_1 ... c_k is a sequence of trees
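Definition 1 translates directly into a short recursion. The sketch below is ours: it reuses the nested-tuple encoding and node vectors from the previous snippet and plugs in a stand-in composition function (a shuffled, rescaled element-wise product along the lines of Section 5.1); any function with the properties of Definition 2 could take its place.

```python
import numpy as np

d = 8192
rng = np.random.default_rng(0)
_node_vectors = {}


def node_vector(label):
    if label not in _node_vectors:
        v = rng.standard_normal(d)
        _node_vectors[label] = v / np.linalg.norm(v)
    return _node_vectors[label]


def compose(a, b):
    """Stand-in for the ideal composition of Definition 2 (see also Section 5.1).

    Two fixed permutations (here simple rotations) break commutativity and
    associativity; the sqrt(d) rescaling keeps the result near unit norm,
    since the element-wise product of two random unit vectors has norm
    about 1/sqrt(d).
    """
    return (np.roll(a, 1) * np.roll(b, -1)) * np.sqrt(d)


def f_hat(tree):
    """Distributed tree fragment of Definition 1; trees are nested tuples."""
    label, children = tree[0], tree[1:]
    if not children:                                   # terminal node
        return node_vector(label)
    return compose(node_vector(label), f_hat_seq(children))


def f_hat_seq(trees):
    """f_hat on a sequence of trees: a right-branching chain of compositions."""
    if len(trees) == 1:
        return f_hat(trees[0])
    return compose(f_hat(trees[0]), f_hat_seq(trees[1:]))


FIG1 = ("A", ("B", ("W1",)), ("C", ("D", ("W2",)), ("E", ("W3",))))
v = f_hat(FIG1)
print(np.linalg.norm(v))                           # close to 1 (Property 1)
print(abs(v @ f_hat(("A", ("B", ("W1",))))))       # close to 0 (Property 2)
```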
3.1.2. The Ideal Vector Composition Function

We here introduce the ideal properties of the vector composition function ⊗, such that function f̂(τ_i) has the two desired properties.

The definition of the ideal composition function follows:

Definition 2 The ideal composition function is ⊗ : R^d × R^d → R^d such that, given a⃑, b⃑, c⃑, d⃑ ∈ N, a scalar s,
and a vector t⃑ obtained by composing an arbitrary number of vectors in N by applying ⊗, the following properties hold:

2.1 Non-commutativity with a very high degree k¹

2.2 Non-associativity: a⃑ ⊗ (b⃑ ⊗ c⃑) ≠ (a⃑ ⊗ b⃑) ⊗ c⃑

2.3 Bilinearity:
  I)   (a⃑ + b⃑) ⊗ c⃑ = a⃑ ⊗ c⃑ + b⃑ ⊗ c⃑
  II)  c⃑ ⊗ (a⃑ + b⃑) = c⃑ ⊗ a⃑ + c⃑ ⊗ b⃑
  III) (s a⃑) ⊗ b⃑ = a⃑ ⊗ (s b⃑) = s (a⃑ ⊗ b⃑)

Approximation Properties

2.4 ||a⃑ ⊗ b⃑|| = ||a⃑|| ||b⃑||

2.5 |a⃑ · t⃑| < ε if t⃑ ≠ a⃑

2.6 |(a⃑ ⊗ b⃑) · (c⃑ ⊗ d⃑)| < ε if |a⃑ · c⃑| < ε or |b⃑ · d⃑| < ε

¹ We assume the degree of commutativity k as the lowest number such that ⊗ is non-commutative, i.e., a⃑ ⊗ b⃑ ≠ b⃑ ⊗ a⃑, and, for any j < k, a⃑ ⊗ c⃑_1 ⊗ ... ⊗ c⃑_j ⊗ b⃑ ≠ b⃑ ⊗ c⃑_1 ⊗ ... ⊗ c⃑_j ⊗ a⃑.

The ideal function ⊗ cannot exist. Property 2.5 can only be statistically valid and never formally, as it opens to an infinite set of nearly orthogonal vectors. But this function can be approximated (see Sec. 5.1).

3.1.3. Properties of Distributed Tree Fragments

Having defined the ideal basic composition function ⊗, we can now focus on the two properties needed to have DTFs as a nearly orthonormal basis of R^m embedded in R^d, i.e., Property 1 and Property 2.

For Property 1 (Nearly Unit Vectors), we need the following lemma:

Lemma 1 Given a tree τ, the vector f̂(τ) has norm equal to 1.

This lemma can be easily proven using Property 2.4 and knowing that the vectors in N are versors.

For Property 2 (Nearly Orthogonal Vectors), we first need to observe that, due to Properties 2.1 and 2.2, a tree generates a unique sequence of applications of function ⊗ in f̂(τ) representing its structure. We can now address the following lemma:

Lemma 2 Given two different trees τ_a and τ_b, the corresponding DTFs are nearly orthogonal: |f̂(τ_a) · f̂(τ_b)| < ε.

Proof The proof is done by induction on the structure of τ_a and τ_b.

Basic step. Let τ_a be the single node a. Two cases are possible. Either τ_b is the single node b ≠ a; then, by the properties of the vectors in N, |f̂(τ_a) · f̂(τ_b)| = |a⃑ · b⃑| < ε. Otherwise, by Property 2.5, |f̂(τ_a) · f̂(τ_b)| = |a⃑ · f̂(τ_b)| < ε.

Induction step.

Case 1. Let τ_a be a tree with root production a → a_1 ... a_k and τ_b be a tree with root production b → b_1 ... b_h. The expected property becomes |f̂(τ_a) · f̂(τ_b)| = |(a⃑ ⊗ f̂(a_1 ... a_k)) · (b⃑ ⊗ f̂(b_1 ... b_h))| < ε. We have two cases. If a ≠ b, then |a⃑ · b⃑| < ε and |f̂(τ_a) · f̂(τ_b)| < ε by Property 2.6. Else, if a = b, then a_1 ... a_k ≠ b_1 ... b_h as τ_a ≠ τ_b. Then, as |f̂(a_1 ... a_k) · f̂(b_1 ... b_h)| < ε is true by the inductive hypothesis, |f̂(τ_a) · f̂(τ_b)| < ε by Property 2.6.

Case 2. Let τ_a be a tree with root production a → a_1 ... a_k and τ_b = b_1 ... b_h be a sequence of trees. The expected property becomes |f̂(τ_a) · f̂(τ_b)| = |(a⃑ ⊗ f̂(a_1 ... a_k)) · (f̂(b_1) ⊗ f̂(b_2 ... b_h))| < ε. Since |a⃑ · f̂(b_1)| < ε is true by the inductive hypothesis, |f̂(τ_a) · f̂(τ_b)| < ε by Property 2.6.

Case 3. Let τ_a = a_1 ... a_k and τ_b = b_1 ... b_h be two sequences of trees. The expected property becomes |f̂(τ_a) · f̂(τ_b)| = |(f̂(a_1) ⊗ f̂(a_2 ... a_k)) · (f̂(b_1) ⊗ f̂(b_2 ... b_h))| < ε. We have two cases. If a_1 ≠ b_1, then |f̂(a_1) · f̂(b_1)| < ε by the inductive hypothesis and |f̂(τ_a) · f̂(τ_b)| < ε by Property 2.6. Else, if a_1 = b_1, then a_2 ... a_k ≠ b_2 ... b_h as τ_a ≠ τ_b. Then, as |f̂(a_2 ... a_k) · f̂(b_2 ... b_h)| < ε is true by the inductive hypothesis, |f̂(τ_a) · f̂(τ_b)| < ε by Property 2.6.

3.2. Recursive Algorithm for Distributed Trees

This section discusses how to efficiently compute DTs. We focus on the space of tree fragments implicitly defined in (Collins & Duffy, 2001). This feature space refers to subtrees as any subgraph which includes more than one node, with the restriction that entire (not partial) rule productions must be included. We want to show that the related distributed trees can be recursively computed using a dynamic programming algorithm, without enumerating the subtrees. We first define the recursive function and then we show that it exactly computes DTs.

3.2.1. Recursive Function

The structural recursive formulation for the computation of distributed trees T⃑ is the following:

  T⃑ = Σ_{n ∈ N(T)} s(n)    (6)
where N(T) is the node set of tree T and s(n) represents the sum of the distributed vectors for the subtrees of T rooted in node n. Function s(n) is recursively defined as follows:

- s(n) = 0⃗ if n is a terminal node;
- s(n) = n⃑ ⊗ √λ(c⃑_1 + s(c_1)) ⊗ ... ⊗ √λ(c⃑_m + s(c_m)) if n is a node with children c_1 ... c_m.

As for the classic TK, the decay factor λ decreases the weight of large tree fragments in the final kernel value. With dynamic programming, the time complexity of this function is linear, O(|N(T)|), and the space complexity is O(d) (where d is the size of the vectors in R^d).
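The following sketch (ours, under the same nested-tuple encoding and stand-in composition function as the earlier snippets) implements Equation 6 with the recursion above; we place the √λ factor so that expanding the products reproduces the weights √(λ^(|τ|−1)) of Theorem 5. Each node triggers one chain of compositions, so the whole computation is linear in the number of nodes.

```python
import numpy as np

d, LAMBDA = 8192, 0.4          # vector size and decay factor lambda
rng = np.random.default_rng(0)
_node_vectors = {}


def node_vector(label):
    if label not in _node_vectors:
        v = rng.standard_normal(d)
        _node_vectors[label] = v / np.linalg.norm(v)
    return _node_vectors[label]


def compose(a, b):
    # Stand-in for the ideal composition (Definition 2; concrete versions in Sec. 5.1).
    return (np.roll(a, 1) * np.roll(b, -1)) * np.sqrt(d)


def distributed_tree(tree):
    """Equation 6: the distributed tree is the sum of s(n) over all nodes,
    accumulated during a single bottom-up visit."""
    parts = []

    def s(node):
        label, children = node[0], node[1:]
        if not children:                      # terminal node: no rooted fragment
            return np.zeros(d)
        chain = None
        for child in reversed(children):      # right-branching, as in f_hat
            factor = np.sqrt(LAMBDA) * (node_vector(child[0]) + s(child))
            chain = factor if chain is None else compose(factor, chain)
        s_n = compose(node_vector(label), chain)
        parts.append(s_n)                     # contribute s(n) to the sum of Eq. 6
        return s_n

    s(tree)
    return np.sum(parts, axis=0) if parts else np.zeros(d)


FIG1 = ("A", ("B", ("W1",)), ("C", ("D", ("W2",)), ("E", ("W3",))))
dt = distributed_tree(FIG1)
print(dt.shape, float(np.linalg.norm(dt)))
```

For a pair of trees, the DTK value of Equation 3 is then simply the dot product of the two vectors returned by this function.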
3.2.2. The Recursive Function Computes Distributed Trees

The overall theorem we need is the following.

Theorem 3 Given the ideal vector composition function ⊗, the equivalence between Equation (5) and Equation (6) holds, i.e.:

  T⃑ = Σ_{n ∈ N(T)} s(n) = Σ_{τ_i ∈ F(T)} ω_i f̂(τ_i)

According to (Collins & Duffy, 2001), the contribution of tree fragment τ to the TK is λ^(|τ|−1), where |τ| is the number of nodes in τ. Thus, we consider ω_i = √(λ^(|τ_i|−1)). We demonstrate Theorem 3 by showing that s(n) computes the weighted sum of the vectors for the subtrees rooted in n (see Theorem 5).

Definition 3 Let n be a node of tree T. We define R(n) = {τ | τ is a subtree of T rooted in n}.

We need to introduce a simple lemma, whose proof is trivial.

Lemma 4 Let τ be a tree with root node n. Let c_1, ..., c_m be the children of n. Then R(n) is the set of all trees τ' = (n, τ_1, ..., τ_m) such that τ_i ∈ R(c_i) ∪ {c_i}.

Now we can show that function s(n) computes exactly the weighted sum of the distributed tree fragments for all the subtrees rooted in n.

Theorem 5 Let n be a node of tree T. Then s(n) = Σ_{τ ∈ R(n)} √(λ^(|τ|−1)) f̂(τ).

Proof The theorem is proved by structural induction.

Basis. Let n be a terminal node. Then we have R(n) = ∅. Thus, by its definition, s(n) = 0⃗ = Σ_{τ ∈ R(n)} √(λ^(|τ|−1)) f̂(τ).

Step. Let n be a node with children c_1, ..., c_m. The inductive hypothesis is then s(c_i) = Σ_{τ ∈ R(c_i)} √(λ^(|τ|−1)) f̂(τ). Applying the inductive hypothesis, the definition of s(n), and Property 2.3, we have:

  s(n) = n⃑ ⊗ √λ(c⃑_1 + s(c_1)) ⊗ ... ⊗ √λ(c⃑_m + s(c_m))
       = n⃑ ⊗ √λ(c⃑_1 + Σ_{τ_1 ∈ R(c_1)} √(λ^(|τ_1|−1)) f̂(τ_1)) ⊗ ... ⊗ √λ(c⃑_m + Σ_{τ_m ∈ R(c_m)} √(λ^(|τ_m|−1)) f̂(τ_m))
       = n⃑ ⊗ (Σ_{τ_1 ∈ T_1} √(λ^(|τ_1|)) f̂(τ_1)) ⊗ ... ⊗ (Σ_{τ_m ∈ T_m} √(λ^(|τ_m|)) f̂(τ_m))
       = Σ_{(n,τ_1,...,τ_m) ∈ {n}×T_1×...×T_m} √(λ^(|τ_1|+...+|τ_m|)) n⃑ ⊗ f̂(τ_1) ⊗ ... ⊗ f̂(τ_m)

where T_i is the set R(c_i) ∪ {c_i}. Thus, by means of Lemma 4 and the definition of f̂, we can conclude that s(n) = Σ_{τ ∈ R(n)} √(λ^(|τ|−1)) f̂(τ).

4. Comparative Analysis of Computational Complexity

DTKs have an attractive constant computational complexity. We here compare their complexity with respect to the traditional tree kernels (TK) (Collins & Duffy, 2001), the fast tree kernels (FTK) (Moschitti, 2006), the fast tree kernels plus feature selection (FTK+FS) (Pighin & Moschitti, 2010), and the approximate tree kernels (ATK) (Rieck et al., 2010). We discussed the basic features of these kernels in the introduction.

Table 1 reports the time and space complexity of the kernels in learning and in classification. DTK is clearly competitive with respect to the other methods, since both complexities are constant, according to the size d of the reduced feature space. In these two phases, kernels are applied many times by the learning algorithms. Then, a constant complexity is extremely important. Clearly, there is a trade-off between the chosen d and the average size of trees n. A comparison among execution times is done by applying these algorithms to actual trees (see Section 5.3).

            Learning              Classification
            Time       Space      Time       Space
  TK        O(n²)      O(n²)      O(n²)      O(n²)
  FTK       A(n)       O(n²)      A(n)       O(n²)
  FTK+FS    A(n)       O(n²)      k          k
  ATK       O(n²/q)    O(n²)      O(n²/q)    O(n²)
  DTK       d          d          d          d

Table 1. Computational time and space complexities for several tree kernel techniques: n is the tree dimension, q is a speed-up factor, k is the size of the selected feature set, d is the dimension of space R^d, O() is the worst-case complexity, and A() is the average-case complexity.

5. Empirical Analysis and Experimental Evaluation

In this section, we propose two approximations of the ideal composition function ⊗, we investigate their appropriateness with respect to the ideal properties, we evaluate whether these concrete basic composition functions yield effective DTKs, and, finally, we evaluate the computational efficiency by comparing the average execution times of TKs and DTKs. For the following experiments, we focus on a reduced space R^d with d = 8192.
5.1. Approximating the Ideal Basic Composition Function ⊗

5.1.1. Concrete Composition Functions

We consider two possible approximations for the ideal composition function ⊗: the shuffled γ-product ⋄ and the shuffled circular convolution ⊛. These functions are defined as follows:

  a⃑ ⋄ b⃑ = (p1(a⃑) ⊙ p2(b⃑)) / γ
  a⃑ ⊛ b⃑ = p1(a⃑) ∗ p2(b⃑)

where ⊙ is the element-wise product between vectors and ∗ is the circular convolution between vectors (as for the distributed representations in (Plate, 1995)); p1 and p2 are two different permutations of the vector elements; and γ is a normalization scalar parameter, computed as the average norm of the element-wise product of two vectors.
5.1.2. Empirical Evaluations of the Properties

Properties 2.1, 2.2, and 2.3 hold by construction. The two permutation functions, p1 and p2, guarantee Prop. 2.1, for a high degree k, and Prop. 2.2. Property 2.3 is inherited from the element-wise product ⊙ and the circular convolution ∗.

Properties 2.4, 2.5 and 2.6 can only be approximated. Thus, we performed tests to evaluate the appropriateness of the two considered functions.

Property 2.4 approximately holds for ⊛, since approximate norm preservation already holds for circular convolution, whereas ⋄ uses the factor γ to preserve the norm. We empirically evaluated this property. Figure 2(a) shows the average norm for the composition of an increasing number of basic vectors (i.e., vectors with unitary norm) with the two basic composition functions. Function ⊛ behaves much better than ⋄.

Properties 2.5 and 2.6 were tested by measuring similarities between some combinations of vectors. The first experiment compared a single vector a⃑ to a combination t⃑ of several other vectors, as in Property 2.5. Both functions resulted in average similarities below 1%, independently of the number of vectors in t⃑, satisfying Property 2.5. To test Property 2.6, we compared two compositions of vectors a⃑ ⊗ t⃑ and b⃑ ⊗ t⃑, where all the vectors are in common except for the first one. The average similarity fluctuates around 0, with ⊛ performing better than ⋄; this is mostly notable observing that the variance grows with the number of vectors in t⃑, as shown in Fig. 2(b). A similar test was performed with all the vectors in common except for the last one, yielding similar results.

Figure 2. Statistical properties for vectors on 100 samples (d = 8192): (a) average norm of the vector obtained as a combination of different numbers of basic random vectors; (b) variance of the dot product between two combinations of basic random vectors with one common vector.

In light of these results, ⊛ seems to be a better choice than ⋄, although it should be noted that, for vectors of dimension d, ⋄ is computed in O(d) time, while ⊛ takes O(d log d) time.

5.2. Evaluating Distributed Tree Kernels: Direct and Task-based Comparison

In this section, we evaluate whether DTKs with the two concrete composition functions, DTK_⋄ and DTK_⊛, approximate the original TK (as in Equation 4). We perform two sets of experiments: (1) a direct comparison, where we directly investigate the correlation between DTK and TK values; and (2) a task-based comparison, where we compare the performance of DTK against that of TK on two natural language processing tasks, i.e., question classification (QC) and recognizing textual entailment (RTE).

5.2.1. Experimental Set-up

For the experiments, we used standard datasets for the two NLP tasks of QC and RTE.
For QC, we used a standard question classification training and test set², where the test set consists of the 500 TREC 2001 test questions. To measure the task performance, we used a question multi-classifier obtained by combining n binary SVMs according to the ONE-vs-ALL scheme, where the final output class is the one associated with the most probable prediction.

² The QC set is available at http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/

For RTE, we considered the corpora ranging from the first challenge to the fifth (Dagan et al., 2006), except for the fourth, which has no training set. These sets are referred to as RTE1-5. The dev/test distribution for RTE1-3 and RTE5 is respectively 567/800, 800/800, 800/800, and 600/600 T-H pairs. We used these sets for the traditional task of pair-based entailment recognition, where a pair of text-hypothesis p = (t, h) is assigned a positive or negative entailment class. For our comparative analysis, we use the syntax-based approach described in (Moschitti & Zanzotto, 2007) with two kernel function schemes: (1) P_KS(p_1, p_2) = KS(t_1, t_2) + KS(h_1, h_2); and (2) P_KS+Lex(p_1, p_2) = Lex(t_1, h_1)·Lex(t_2, h_2) + KS(t_1, t_2) + KS(h_1, h_2). Lex is a standard similarity feature between the text and the hypothesis, and KS is realized with TK, DTK_⋄, or DTK_⊛. In the plots, the different P_KS kernels are referred to as TK, DTK_⋄, and DTK_⊛, whereas the different P_KS+Lex kernels are referred to as TK+Lex, DTK_⋄+Lex, and DTK_⊛+Lex.

5.2.2. Correlation between TK and DTK

As a first measure of the ability of DTK to emulate the classic TK, we considered the Spearman's correlation of their values computed on the parse trees for the sentences contained in the QC and RTE corpora. Table 2 reports the results and shows that DTK does not approximate TK adequately for λ = 1. This highlights the difficulty of DTKs in correctly handling pairs of large active forests, i.e., trees with many subtrees with weights around 1. The correlation improves dramatically when the parameter λ is reduced. We can conclude that DTKs efficiently approximate TK for λ ≤ 0.6. These values are relevant for the applications, as we will also see in the next section.

          QC                    RTE
  λ       DTK_⋄     DTK_⊛      DTK_⋄     DTK_⊛
  0.2     0.993     0.994      0.997     0.998
  0.4     0.980     0.989      0.990     0.961
  0.6     0.908     0.880      0.890     0.350
  0.8     0.644     0.377      0.469     0.039
  1.0     0.316     0.107      0.169     0.000

Table 2. Spearman's correlation between DTK values and TK values. Test trees were taken from the QC corpus (left columns) and the RTE corpus (right columns).

5.2.3. Task-based Comparison

We performed both QC and RTE experiments for different values of the parameter λ. Results are shown in Fig. 3 and Fig. 4 for the QC and RTE tasks respectively.

Figure 3. Performance (accuracy vs. λ) on the Question Classification task for TK, DTK_⋄, and DTK_⊛ (DTK_⋄ and DTK_⊛ rely on vectors of d = 8192).

Figure 4. Performance (accuracy vs. λ) on the Recognizing Textual Entailment task for TK, DTK_⋄, DTK_⊛, TK+Lex, DTK_⋄+Lex, and DTK_⊛+Lex (DTK_⋄ and DTK_⊛ rely on vectors of d = 8192). Each point is the average accuracy on the 4 data sets.

For QC, DTK leads to worse performances with respect to TK, but the gap is narrower for small values of λ ≤ 0.4 (with one of the two DTK variants slightly better than the other). These values produce better performance for the task. For RTE, for λ ≤ 0.4, DTK_⋄ and DTK_⊛ are similar to TK. Differences are not statistically significant, except for λ = 0.4, where one of the DTK variants behaves better than TK (with p < 0.1). Statistical significance is computed using the two-sample Student's t-test. DTK_⋄+Lex and DTK_⊛+Lex are statistically similar to TK+Lex for any value of λ. DTKs are a good approximation of TKs for λ ≤ 0.4, which are the values where TKs have the best performances in the tasks.
5.3. Average Computation Time

We measured the average computation time of FTK (Moschitti, 2006) and DTK (with vector size 8192) on trees from the Question Classification corpus. Figure 5 shows the relation between the computation time and the size of the trees, computed as the total number of nodes in the two trees. As expected, DTK has a constant computation time, since it is independent of the size of the trees. On the other hand, the computation time for FTK, while being lower for smaller trees, grows very quickly with the tree size. The larger the trees considered, the higher the computational advantage offered by using DTK instead of FTK.

Figure 5. Computation time (ms) of FTK and DTK (with d = 8192) for tree pairs with an increasing total number of nodes, on a 1.6 GHz CPU.

6. Conclusion

In this paper, we proposed the distributed tree kernels (DTKs) as an approach to reduce the computational complexity of tree kernels. Given an ideal function for vector composition, we have formally shown that high-dimensional spaces of tree fragments can be embedded in low-dimensional spaces where tree kernels can be directly computed with dot products. We have empirically shown that we can approximate the ideal function for vector composition. The resulting DTKs correlate with the original tree kernels, obtain similar results in two natural language processing tasks, and, finally, are faster.

References

Collins, Michael and Duffy, Nigel. Convolution kernels for natural language. In NIPS, pp. 625-632, 2001.

Dagan, Ido, Glickman, Oren, and Magnini, Bernardo. The PASCAL recognising textual entailment challenge. In Quinonero-Candela et al. (eds.), LNAI 3944: MLCW 2005, pp. 177-190. Springer-Verlag, 2006.

Dasgupta, Sanjoy and Gupta, Anupam. An elementary proof of the Johnson-Lindenstrauss lemma. Tech. Rept. TR-99-006, ICSI, Berkeley, California, 1999.

Dussel, Patrick, Gehl, Christian, Laskov, Pavel, and Rieck, Konrad. Incorporation of application layer protocol syntax into anomaly detection. In ICISS, pp. 188-202, 2008.

Gildea, Daniel and Jurafsky, Daniel. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288, 2002.

Hashimoto, Kosuke, Takigawa, Ichigaku, Shiga, Motoki, Kanehisa, Minoru, and Mamitsuka, Hiroshi. Mining significant tree patterns in carbohydrate sugar chains. Bioinformatics, 24:167-173, 2008.

Haussler, David. Convolution kernels on discrete structures. Tech. Rept., Univ. of California at Santa Cruz, 1999.

Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. Distributed representations. In Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.

Johnson, W. and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189-206, 1984.

MacCartney, Bill, Grenager, Trond, de Marneffe, Marie-Catherine, Cer, Daniel, and Manning, Christopher D. Learning to recognize features of valid textual entailments. In NAACL, New York, NY, pp. 41-48, 2006.

Moschitti, Alessandro. Making tree kernels practical for natural language learning. In EACL, Trento, Italy, 2006.

Moschitti, Alessandro and Zanzotto, Fabio Massimo. Fast and effective kernels for relational learning from texts. In ICML, Corvallis, Oregon, 2007.

Pighin, Daniele and Moschitti, Alessandro. On reverse feature engineering of syntactic tree kernels. In CoNLL, Uppsala, Sweden, 2010.

Plate, T. A. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623-641, 1995.

Pradhan, Sameer, Ward, Wayne, Hacioglu, Kadri, Martin, James H., and Jurafsky, Daniel. Semantic role labeling using different syntactic views. In ACL, pp. 581-588, NJ, USA, 2005.

Rieck, Konrad, Krueger, Tammo, Brefeld, Ulf, and Muller, Klaus-Robert. Approximate tree kernels. J. Mach. Learn. Res., 11:555-580, March 2010.

Sahlgren, Magnus. An introduction to random indexing. In Workshop of Methods and Applications of Semantic Indexing at TKE, Copenhagen, Denmark, 2005.

Shin, Kilho, Cuturi, Marco, and Kuboyama, Tetsuji. Mapping kernels for trees. In ICML, pp. 961-968, NY, 2011.

Vert, Jean-Philippe. A tree kernel to analyse phylogenetic profiles. Bioinformatics, 18(suppl 1):S276-S284, 2002.

Zanzotto, Fabio Massimo and Dell'Arciprete, Lorenzo. Distributed structures and distributional meaning. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pp. 10-15, Portland, Oregon, USA, 2011.