Professional Documents
Culture Documents
VOL. 27,
NO. 2,
FEBRUARY 2015
INTRODUCTION
1041-4347 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
559
RELATED WORK
560
HYBRIDSEG FRAMEWORK
VOL. 27,
NO. 2,
FEBRUARY 2015
s1 ;...;sm
m
X
Csi :
(1)
i1
2
:
1 eSCP s
(2)
1
jsj1
Pjsj1
i1
Prs2
Prw1 . . . wi Prwi1 . . . wjsj
: (3)
561
Illustrated in Fig. 2, the segment phraseness Prs is computed based on both global and local contexts. Based on
Observation 1, Prs is estimated using the n-gram probability provided by Microsoft Web N-Gram service, derived
from English Web pages. We now detail the estimation of
Prs by learning from local context based on Observations
2 and 3. Specifically, we propose learning Prs from the
results of using off-the-shelf Named Entity Recognizers
(NERs), and learning Prs from local word collocation in a
batch of tweets. The two corresponding methods utilizing
the local context are denoted by HybridSegNER and
HybridSegNGram respectively.
^ r s
Pr
i
a
fri ;s
fs
fri ;s
(5)
:
(6)
562
(7)
VOL. 27,
NO. 2,
FEBRUARY 2015
(11)
(12)
563
(13)
(14)
^ 0 s fc;s f 0 : (16)
Pr
s
s2N \
In this paper, we select named entity recognition as a downstream application to demonstrate the benefit of tweet
segmentation. We investigate two segment-based NER
564
TABLE 1
Three POS Tags as the Indicator of a Segment Being a Noun
Phrase, Reproduced from [17]
Tag
N
$
Definition
common noun (NN, NNS)
proper noun (NNP, NNPS)
numeral (CD)
Examples
books; someone
lebron; usa; iPad
2010; four; 9:30
(17)
s
1e
This equation considers two factors. The first factor estimates the probability as the percentage of the constituent
words being labeled with an NP tag for all the occurrences
of segment s, where w is 1 if w is labeled as one of the three
POS tags in Table 1, and 0 otherwise; For example, chiam
see tong, the name of a Singaporean politician and lawyer,6
6. http://en.wikipedia.org/wiki/Chiam_See_Tong.
VOL. 27,
NO. 2,
FEBRUARY 2015
TABLE 2
The Annotated Named Entities in SIN and SGE Data Sets,
Where fsg Denotes the Frequency of Named Entity s in the
Annotated Ground Truth
P g
Data set #NEs minfsg maxfsg
#NEs s.t.fsg > 1
fs
SIN
SGE
746
413
1
1
49
1,644
1,234
4,073
136
161
EXPERIMENT
We report two sets of experiments. The first set of experiments (Sections 6.1 to 6.3) aims to answer three questions: (i)
does incorporating local context improve tweet segmentation quality compared to using global context alone? (ii)
between learning from weak NERs and learning from local
collocation, which one is more effective, and (iii) does iterative learning further improves segmentation accuracy? The
second set of experiments (Section 6.4) evaluates segmentbased named entity recognition.
565
TABLE 3
Recall of the Four Segmentation Methods
Method
SIN
HybridSegWeb
HybridSegNGram
HybridSegNER
HybridSegIter
0:758
0:806
0:857
0:858
SGE
0:874
0:907
0:942
0:946
(ii)
(iii)
566
VOL. 27,
NO. 2,
FEBRUARY 2015
Fig. 5. Re and normalized mNER values of HybridSegNER with varying in the range of 0; 0:95:
567
TABLE 4
Numbers of the Occurrences of Named Entities that Are Correctly Detected by HybridSegWeb , HybridSegNER , and HybridSegNGram ,
and the Percentage of Change against HybridSegWeb
SIN data set
n
HybridSegWeb
1
2
3
4
694
232
7
2
HybridSegNER
79314:3%
2466%
1271:4%
6200%
HybridSegNGram
#Overlap
HybridSegWeb
82018:2%
17225:9%
528:6%
150%
767
158
4
0
2889
519
149
1
HybridSegNER
30064%
58011:8%
23859:7%
4300%
HybridSegNGram
#Overlap
29321:5%
60015:6%
1618:1%
0100%
2895
524
143
N.A
#Overlap: number of the occurrences that are both detected by HybridSegNER and HybridSegNGram .
TABLE 5
HybridSegIter up to Four Iterations
SIN data set
Iteration
Re
JSD
Re
0
1
2
3
0:857
0:857
0:858
0:858
0:0059
0:0001
0
0:942
0:946
0:946
0:946
JSD
0:0183
0:0003
0
12. http://en.wikipedia.org/wiki/National_Solidarity_Party_
(Singapore).
568
VOL. 27,
NO. 2,
FEBRUARY 2015
TABLE 6
Fully Detected, Missed, and Partially Detected Segments for HybridSegIter (Three Iterations) and HybridSegWeb
SIN data set
Data set
Method/
Iteration
0
1
2
HybridSegWeb
Fully
detected
Missed
Partially detected
#NE
#Occ
#NE
#Occ
#NE
#Det
581
581
583
504
944
959
961
647
146
154
152
195
152
163
161
214
19
11
11
47
113
98
98
113
#Miss
25
14
14
85
Fully
detected
Missed
Partially detected
#NE
#Occ
#NE
#Occ
#NE
295
291
289
234
1464
1858
1856
708
94
110
112
140
168
191
193
336
24
12
12
40
#Det
#Miss
2374
1996
1996
2850
67
28
28
179
#NE: number of distinct segments, #Occ: number of occurrences, #Det: number of detected occurrences, #Miss: number of missed occurrences.
TABLE 7
Accuracy of GlobalSeg and HybridSeg with RW and POS
SIN dataset
Method
UnigramPOS
GlobalSegRW
HybridSegRW
GlobalSegPOS
HybridSegPOS
P
0:516
0:576
0:618
0:647
0:685
R
0:190
0:335
0:343
0:306
0:352
SGE dataset
F1
0:278
0:423
0:441
0:415
0:465
F1
0:845
0:929
0:907
0:903
0:911
0:333
0:646
0:683
0:657
0:686
0:478
0:762
0:779
0:760
0:783
The best results are in boldface. *indicates the difference against the best F1 is statistically significant by one-tailed paired t-test with p < 0:01.
Election Rally is detected by weak NERs with strong confidence (i.e., all its occurrences are detected). Due to its
strong confidence, NSP therefore cannot be separated
from this longer segment in next iteration regardless setting. Although NSP Election Rally is not annotated as a
named entity, it is indeed a semantically meaningful
phrase. On the other hand, a large portion of the occurrences for the 12 partially detected segments have been
successfully detected from other tweets.
Compared to the baseline HybridSegWeb which does not
take local context, HybridSegIter significantly reduces the
number of missed segments, from 195 to 152 or 22 percent
reduction on SIN data set, and 20 percent reduction on SGE
data set from 140 to 112. Many of these segments are fully
detected in HybridSegIter .
569
ACKNOWLEDGMENTS
This paper was an extended version of two SIGIR conference papers [1], [2]. Jianshu Wengs contribution was done
while he was with HP Labs Singapore.
REFERENCES
[1]
[2]
[3]
[4]
TABLE 8
Accuracy of the Three Weak NERs, Where *Indicates the
Difference against the Best F1 is Statistically Significant by
One-Tailed Paired t-Test with p < 0:01
SIN data set
Method
LBJ-NER
T-NER
Stanford-NER
P
0.335
0.273
0.447
R
0.357
0.523
0.448
0:346
0:359
0:447
P
0.674
0.696
0.685
[5]
[6]
[7]
F1
0.402 0:504
0.341 0:458
0.623 0:653
CONCLUSION
[8]
570
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
VOL. 27,
NO. 2,
FEBRUARY 2015