Professional Documents
Culture Documents
iSentenizer-
I +
iSentenizer-
2 SBD Punkt
1.
SBD
NLPPOS[1][2][3]
MT[4]IR[5] SBD
SBD
[6]
SBD
WSJ 42
11
58 WSJ 89
; subsentences 19
14
0.5
SBD
SBD
[7][7][8][9]
[10-12]
SBD
SBD
[13]
SBD
iSentenizer-[14]1 +[15]
2.
nonboundary
[17]
[7] MxTerminator[9] C4.5
MxTerminator
[18]
Punkt
[14] SBD
11
Europarl[13]
3.
SBD I + i+
POFC-DT
IONR-DT
;
3.1
[19]
[20] Kolmogorov-
KS[2122]
3.2
IONR-DT /
POFC-DT
IONR-DT
ITI[23]
IONR-DT
3.3+ LRA
[20]
1 + LRA
1 + LRA
1
TAB1
1
3.4
;;
;
4.
iSentenizer-
SBD [91424]
;
SBD
4.1
;
SAR
[9,25];
4.2
/ lowercasing
4.3
[8,926]
nonlexical
WSJ
[27]
4.4
figbox1
1
figbox2
2
5.
5.1
Punkt NLTK Python Punkthttp://www.nltk.org /
MxTerminator Apache OpenNLP
http://opennlp.apache.org/
iSentenizer- BSD
-ScoreROC PR DET [2829]
2
TAB2
2 BSD
[8];
[1827]
misdetects
5.2
iSentenizer-
5.2.1WSJ
[30] SBD
15 subcorpora 500
2500
POS
[27] SBD
5.2.2
[31] 52 20000
- [32] POS
14 19
5.2.3 Europarl
11 DADEEL
enESFIFRITNL
PTSV Europarl
[13] 30 100
BSD
4 5
10
5.3
4 SBD [6]
SBD [8]
[18]
1;
0
4 ;;
iSentenizer-;
3 WSJ
SBD
7 8
iSentenizer- 1-Score
Score97.86 9 94
67
TAB3
3
TAB4
4
tab5
5Europarl
5.4
iSentenizer-
SBD Punkt
MxTerminator
5.4.1 PUNKT
[18] SBD
5.4.2 MxTerminator
Reynar Ratnaparkhi[9]
5.5
iSentenizer- Punkt
MxTerminator 4 5 11
WSJ Europarl
6 10 iSentenizer-
-Scores iSentenizer- -0.11 Europarl-1.25
MxTerminator +19.57 63.36+
MxTerminator Punkt WSJ
5.6
iSentenizer-
;
; SBD 11
12 -Score SBD
; iSentenizer- Europarl 11
[13]iSentenizer-
-Score97.47 13
tab11
11
tab12
12
TAB13
13 Europarl
5.7
iSentenizer-
iSententizer- iSentenizer-
Europarl iSent_ED;
iSentenizer- iSent_E+DD;
Europarl iSent_E+DG
iSent_E+ D +GG 14 4
8-Score2
tab14
14 Europarl
6.
iSentenizer-
SBD iSentenizer-
iSentenizer-
iSentenizer-
http://nlp2ct.cis.umac.mo/views/utility.html
MYRG076Y 1-L2-FST13-WF
MYRG070Y 1-L2-FST12-CS
References
1.
12.
K. Tomanek, J. Wermter, and U. Hahn, Sentence and token
splitting based on conditional random fields, in Proceedings of the
10th Conference of the Pacific Association for Computational
Linguistics, pp. 4957, 2007.
13.
P. Koehn, Europarl: a parallel corpus for statistical machine
translation, in MT Summit, vol. 5, 2005.
14.
F. Wong and S. Chao, iSentenizer: an incremental sentence
boundary classifier, in Proceedings of the 6th International Conference
on Natural Language Processing and Knowledge Engineering (IEEE NLPKE '10), August 2010. View at PublisherView at Google ScholarView
at Scopus
15.
S. Chao and F. Wong, An incremental decision tree learning
methodology regarding attributes in medical data mining,
in Proceedings of the 8th International Conference on Machine Learning
and Cybernetics, pp. 16941699, chn, July 2009. View at
PublisherView at Google ScholarView at Scopus
16.
C. N. Silla Jr., J. Dalla Valle Jr., and C. A. A. Kaestner, Detec ao
automtica de sentenas com o uso de expressoes regulares,
in Proceedings of the Brazilian Congress of Computation (CBComp '03),
pp. 548560, 2003.
17.
C. N. Silla and C. A. A. Kaestner, An analysis of sentence
boundary detection systems for English and Portuguese documents,
in Lecture Notes in Computer Science, pp. 135141, 2004.
18.
T. Kiss and J. Strunk, Unsupervised multilingual sentence
boundary detection, Computational Linguistics, vol. 32, no. 4, pp.
485525, 2006. View at PublisherView at Google ScholarView at
Scopus
19.
H. R. Bittencourt and R. T. Clarke, Use of classification and
regression trees (CART) to classify remotely-sensed digital images,
in Proceedings of the IEEE International Geoscience and Remote
Sensing Symposium (IGARSS '03), vol. 6, pp. 37513753, July
2003. View at Scopus
20.
S. R. Safavian and D. Landgrebe, A survey of decision tree
classifier methodology, IEEE Transactions on Systems, Man and
Cybernetics, vol. 21, no. 3, pp. 660674, 1991. View at PublisherView
at Google ScholarView at Scopus
21.
J. H. Friedman, A recursive partitioning decision rule for
nonparametric classification, IEEE Transactions on Computers, vol.
100, no. 4, pp. 404408, 1977. View at Scopus
22.
P. Utgo and J. Clouse, A Kolmogorov-Smirnoff metric for decision
tree induction, Technical Report96-3, University of Massachusetts,
Department of Computer Science, Amherst, Mass, USA, 1996.
23.
P. E. Utgoff, N. C. Berkman, and J. A. Clouse, Decision tree
induction based on efficient tree restructuring, Machine Learning, vol.
29, no. 1, pp. 544, 1997. View at Scopus
24.
E. Stamatatos, N. Fakotakis, and G. Kokkinakis, Automatic
extraction of rules for sentence boundary disambiguation,
in Proceedings of the Workshop on Machine Learning in Human
Language Technology, pp. 8892, 1999.
25.
D. G. Lee and H. C. Rim, Towards language-independent
sentence boundary detection, inComputational Linguistics and
Intelligent Text Processing, pp. 142145, 2004.
26.
M. Stevenson and R. Gaizauskas, Experiments on sentence
boundary detection, in Proceedings of the 6th Conference on Applied
Natural Language Processing, pp. 8489, 2000.
27.
N. Agarwal, K. H. Ford, and M. Shneider, Sentence boundary
detection using a maxEnt classifier, 2005.
28.
A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M.
Przybocki, The DET curve in assessment of detection task
performance, in Proceedings of the 5th European Conference on
Speech Communication and Technology, 1997.
29.
Y. Liu and E. Shriberg, Comparing evaluation metrics for
sentence boundary detection, in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP '07,
vol. 4, pp. 185188, usa, April 2007. View at PublisherView at Google
ScholarView at Scopus
30.
M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, Building a
large annotated corpus of English: the Penn Treebank, Computational
Linguistics, vol. 19, no. 2, pp. 313330, 1993.
31.
C. Galves and H. Britto, A construo do corpus anotado do
Portugus histrico tycho Brahe: o sistema de anotao morfolgica,
in Proceedings of the 4th International Conference on Computational
Processing of Portuguese Language (PROPOR '99), pp. 5567,
Universidade de vora, vora, Portuguese, 1999.
32.
H. Britto, C. Galves, I. Ribeiro, M. Augusto, and A. P. Scher,
Morphological annotation system for automatic tagging of electronic
textual corpora: from English to Romance languages, in Proceedings
of the 6th International Symposium of Social Communication, pp. 582
589, 1999.