Research Report on Corpus Construction Techniques (語料庫建構技術研究報告)

NAER-101-12-F-2-03-00-3-02
101 1 101 12
101 12

Concordancer
Lemmatizer
Part-of-speech tagger
Parser
Corpus Processing
  tr
  sort
  uniq
  grep
  awk
  perl
Mutual information
t-score
Entropy
N-gram language model
Champollion Tool Kit (CTK)
NP Chunking
Lucene
LDC, Sketch Engine
Support Vector Machine (SVM)
Bayesian Classification

Figures: AntConc, Stanford Parser, Lund Mate-tool, E-HowNet, Sketch Engine, CTK, Yamcha, Lucene

1947 Shannon (Noisy Channel Model) (information theory)
1961
Francis and Kučera, Brown Corpus
Brown Corpus
Noam Chomsky
1960 1980 20
1980
(Hidden Markov Model)
John Sinclair, Geoffrey Leech, Sidney Greenbaum,
Jan Svartvik, Randolph Quirk.
Sinclair (1987)1, Quirk, Greenbaum, Leech, Svartvik (1985)2, Garside,
Leech, Sampson (1987)3 John Sinclair Harper Collins
1980 Collins
Cobuild Collins Cobuild
1 Looking Up: An Account of the Cobuild Project.
2 A Comprehensive Grammar of the English Language.
3 The Computational Analysis of English: A Corpus-Based Approach.

1990
(automatic
lexical knowledge acquisition)
1990
Brown Corpus, LOB (Lancaster-Oslo-Bergen) Corpus, BNC (British National
Corpus), Project Gutenberg
Penn Corpus, Susanne Corpus
( 500 )
Penn Treebank
Hansard Corpus
Sinica Treebank (http://www.aclclp.org.tw/use_stb_c.php)


Penn Chinese Treebank
(http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T05)
Sinica Treebank Penn Chinese Treebank
PP, NP
FRAG (sentence)( IP )
Sinica Treebank ( Head-Driven Principle )
(Head)
SBJ OBJ
Sinica Treebank
Penn Treebank Sinica Treebank
Sinica Treebank Penn Chinese
Treebank Linguistic Data Consortium (LDC) Penn
Chinese Treebank Chinese Proposition Bank
(concordancer)
Antconc
Antconc concordancer

AntConc
http://www.antlab.sci.waseda.ac.jp/antconc_index.html
Antconc concordancer Concordance hits
Antconc Wordlist
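The two AntConc functions named above, the concordance (keyword in context) view and the word list, can be illustrated with a minimal Python sketch. This is not AntConc's implementation; the tokenizer, the function names `kwic` and `wordlist`, the context width and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

def wordlist(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()

sample = "John saw the man with a telescope. The man saw John."
toks = tokenize(sample)
for left, key, right in kwic(toks, "saw", width=2):
    print(f"{left:>12} [{key}] {right}")
```

Each hit is printed with the keyword aligned in the middle, which is the usual layout of a concordance line.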
AntConc

http://www.antlab.sci.waseda.ac.jp/antconc_index.html
(lemmatizer)
90 97
LingPipe
http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html
Chinese Word Segmenter http://nlp.stanford.edu/software/segmenter.shtml
(part-of-speech tagger):
98, Stanford Parser
Chinese Word Segmenter
(tagset)

http://ckipsvr.iis.sinica.edu.tw/
(parser)
(partial parse)
, Minipar Stanford Parser Lund
Mate-tool
Sinica Treebank Penn Chinese Treebank
PP, NP
FRAG (sentence)( IP )
Sinica Treebank ( Head-Driven Principle )
(Head)
SBJ OBJ
terminal node
head(node)
Stanford Parser Stanford Parser
NP VP IP

Stanford Parser
http://nlp.stanford.edu:8080/parser/

Stanford Parser
http://nlp.stanford.edu:8080/parser/

Lund Mate-tool
http://barbar.cs.lth.se:8091/
Lund Mate-tool
http://barbar.cs.lth.se:8091/

http://dict.revised.moe.edu.tw/

http://dict.revised.moe.edu.tw/
Hownet
( Dong and Dong (2006) Hownet

Hownet doctor, surgeon,
doctor {human|:HostOf={Occupation|},domain={medical|},{doctor|:agent={~}}}
meta language,
{doctor|}
HowNet E-HowNet
E-HowNet
http://ehownet.iis.sinica.edu.tw/
Ae142 Ae151
Ae142","" "Ae142","" "Ae142","" "Ae142",""
"Ae142"," " "Ae142"," " "Ae142"," " "Ae142"," "
"Ae151","" "Ae151","" "Ae151","" "Ae151","" "Ae151",""
"Ae151","" "Ae151","" "Ae151","" "Ae151","" "Ae151","

" "Ae151","" "Ae151","" "Ae151","" "Ae151","" "Ae151","
" "Ae151","" "Ae151","" "Ae151","" "Ae151","" "Ae151",""
"Ae151","" "Ae151","" "Ae151","" "Ae151","" "Ae151",""
"Ae151"," " "Ae151"," " "Ae151"," " "Ae151"," "
"Ae151","" "Ae151","" "Ae151","" "Ae151",""
(Hidden Markov Model HMM)
John saw the man with a telescope.
with a telescope the man () saw () PP attachment

UNIX LINUX
PC LINUX UNIX
PERL 4
tr
datafile output
tr 'A-Z' 'a-z' < datafile > output
tr -sc 'A-Za-z' '\012' < datafile > output
sort
ASCII: sort datafile > output
sort -n datafile > output
sort -nr datafile > output
sort -k 3 -nr datafile > output
uniq
uniq datafile > output
uniq -c datafile > output
sort datafile | uniq > output
sort datafile | uniq -c > output
4
LINUX PERL (Interpreter) PERL
Window

tr -sc 'A-Za-z' '\012' < datafile | sort | uniq -c > output
wc
wc -l datafile
wc -w datafile
wc -c datafile
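For readers without a UNIX shell, the word-frequency pipeline and the wc counts above can be approximated in Python. This is only a sketch: the function names and the sample text are assumptions, and the tokenization (maximal runs of ASCII letters) mirrors the tr command rather than any linguistic segmentation.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count maximal runs of ASCII letters, like the tr | sort | uniq -c pipeline."""
    return Counter(re.findall(r"[A-Za-z]+", text))

def wc(text):
    """Return (lines, words, bytes), roughly what wc -l, wc -w and wc -c report."""
    return text.count("\n"), len(text.split()), len(text.encode("utf-8"))

text = "the cat and the hat\nThe end\n"
freqs = word_frequencies(text)
for word, count in sorted(freqs.items()):
    print(count, word)
print(wc(text))
```

Case is preserved here, so "the" and "The" are counted separately unless the text is lowercased first, as the first tr command does.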
grep
pattern: grep pattern datafile > output
awk
pattern
awk '/pattern/ { print }' datafile > output
datafile tab
awk '{ print $2 "\t" $1 }' datafile > output
datafile 1.65 tab
awk '{ if ($3 >= 1.65) { print $2 "\t" $3 } }' datafile > output
perl C awk sed shell programming awk
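The two awk one-liners above (swapping the first two columns, and keeping rows whose third column is at least 1.65) have a direct Python counterpart. The sketch below assumes whitespace-separated fields, as awk does by default; the function names and sample rows are made up for illustration.

```python
def swap_first_two(line):
    """Print-equivalent of: awk '{ print $2 "\t" $1 }'."""
    fields = line.split()
    return fields[1] + "\t" + fields[0]

def filter_by_third_column(lines, threshold=1.65):
    """Keep columns 2 and 3 of rows whose third column is >= threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and float(fields[2]) >= threshold:
            kept.append(fields[1] + "\t" + fields[2])
    return kept

rows = ["strong\ttea\t3.20", "powerful\ttea\t0.70", "heavy\train\t1.65"]
print(filter_by_third_column(rows))
```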

(mutual information):
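(mutual information)(information theory)
$$ MI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} $$
Here N is the size of the corpus, f(x, y) is the number of co-occurrences of x and y, and f(x) and f(y) are the frequencies of x and y:
$$ MI(x, y) = \log_2 \frac{f(x, y) / (N - 1)}{\bigl(f(x)/N\bigr)\bigl(f(y)/N\bigr)} \approx \log_2 \frac{N\, f(x, y)}{f(x)\, f(y)} $$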
Ken Church (1991)
(word association)5
(collocations)
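The count-based approximation of mutual information, log2(N f(x,y) / (f(x) f(y))), is easy to check on small numbers. The function name and the example counts below are assumptions chosen so that the result is a round number.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Approximate MI(x, y) = log2( N * f(x,y) / (f(x) * f(y)) ) from raw counts."""
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(n * f_xy / (f_x * f_y))

# x occurs 16 times, y 32 times, together 8 times, in a corpus of 1024 words.
print(mutual_information(8, 16, 32, 1024))
```

A value of 0 means x and y co-occur exactly as often as chance predicts; large positive values suggest an association worth inspecting as a candidate collocation.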
T-(t-score):
T-T-
5 Using Statistics in Lexical Acquisition. In Zernik, U. (ed.) (1991) Lexical Acquisition: Exploiting
On-Line Resources to Build a Lexicon.

(t-score)(statistical significance test)
Ken Church (1991)
T-(confidence interval)
A t-score above 1.65 corresponds to a confidence level of 95%.
$$ t \approx \frac{f(x, y) - \frac{f(x)\, f(y)}{N}}{\sqrt{f(x, y)}} $$
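The t-score approximation above can be computed directly from counts. The sketch below uses an assumed function name and made-up counts; it simply subtracts the expected co-occurrence count f(x)f(y)/N from the observed count and divides by the square root of the observed count.

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """t ~ (f(x,y) - f(x) f(y) / N) / sqrt(f(x,y))."""
    if f_xy <= 0:
        raise ValueError("f(x, y) must be positive")
    return (f_xy - f_x * f_y / n) / math.sqrt(f_xy)

# Observed 16 co-occurrences; expected 40 * 100 / 1000 = 4 by chance.
t = t_score(16, 40, 100, 1000)
print(t, t >= 1.65)
```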
(entropy):
(entropy)1960 1990
$$ \text{entropy} = -k \sum_i p_i \log(p_i) $$
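As a small worked example of the entropy formula, the following sketch estimates the probabilities p_i from counts. Taking the logarithm to base 2 (so that the result is in bits) and setting k = 1 by default are assumptions; the source formula does not fix either.

```python
import math

def entropy(counts, k=1.0):
    """-k * sum(p_i * log2(p_i)), with p_i estimated as count_i / total."""
    total = sum(counts)
    return -k * sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([1, 1, 1, 1]))  # four equally likely outcomes
print(entropy([9, 1]))        # a skewed distribution has lower entropy
```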
N (ngram)(language model)
N N
N-1
(bigram)(trigram)
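An n-gram model predicts each word from the N-1 words before it. For the bigram case this can be sketched with maximum-likelihood estimates as below. The boundary markers `<s>` and `</s>`, the toy corpus and the absence of smoothing are assumptions of this sketch, so unseen bigrams simply get probability 0.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect history counts and bigram counts from tokenized sentences."""
    history, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        history.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return history, bigrams

def bigram_prob(prev, word, history, bigrams):
    """Maximum-likelihood estimate P(word | prev) = f(prev, word) / f(prev)."""
    if history[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / history[prev]

def sentence_prob(sent, history, bigrams):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(prev, word, history, bigrams)
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
history, bigrams = train_bigram(corpus)
print(sentence_prob(["the", "cat", "sat"], history, bigrams))
```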

Sketch Engine 1 4
5 Sketch Engine

http://db1x.sinica.edu.tw/kiwi/mkiwi/
Sketch Engine
http://www.sketchengine.co.uk/
Sketch Engine
http://www.sketchengine.co.uk/

Resnik, Olsen, and Diab (1999)
(Open Source)
(1)

In recent years calls for democratization of campuses have grown more
insistent. Traditional Chinese concepts of the proper ethical relationship between
students and teachers, in which students accorded teachers the same level of respect
they accorded their own fathers, are dissolving.
(2)
Stories like these are less and less exceptional on university campuses.
Many professors have come to believe that a number of factors have laid hidden bones
of contention in teacher-student relations in recent years: politicisation of all aspects of
life, democratisation in society, the promulgation of the new
"University Law" three years ago, rising feminism. . . . Relations have
become much more subtle and complex.
Gao (1998)
(discourse unit)

(Brown et al. (1991), Gale and Church (1993) )
(dynamic programming)
Brown et al. (1991) Gale and Church (1993)
(Hansard) 93
Gao (1998)
(Hansard) 89
53 35
Gale and Church (1993)

Category        Frequency   Prob(match)
1-1             1167        0.89
1-0 or 0-1      13          0.0099
2-1 or 1-2      117         0.089
2-2             15          0.011
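The length-based alignment idea of Gale and Church (1993) can be sketched as a dynamic program over sentence lengths. The code below is a simplified illustration, not their program: the priors are taken from the category table, but splitting "1-0 or 0-1" and "2-1 or 1-2" into equal halves, the parameter values c = 1 and s2 = 6.8, the use of the mean length in the variance term, and all function names are assumptions.

```python
import math

# Prior probability of each bead type (number of source, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099 / 2, (0, 1): 0.0099 / 2,
          (2, 1): 0.089 / 2, (1, 2): 0.089 / 2, (2, 2): 0.011}

def length_cost(len_src, len_tgt, c=1.0, s2=6.8):
    """-log of the two-tailed probability of the observed length mismatch."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / c) / 2
    delta = (len_tgt - len_src * c) / math.sqrt(mean * s2)
    p = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(p, 1e-300))

def align(src_lens, tgt_lens):
    """Return the cheapest sequence of beads as (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = -math.log(prior) + length_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]

# Two short source sentences are translated by one target sentence.
print(align([10, 10, 30], [20, 30]))
```

Lengths here are abstract units (characters or words); only their ratios and differences matter to the cost.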
Bead Types   (1,1)   (1,2)   (1,3)   (1,4)   (1,5)
frequency    0.53    0.32    0.06    0.06    0.03

Bead Types   (1,1)   (1,2)   (1,3)   (1,4)   (1,5)
frequency    0.35    0.38    0.17    0.06    0.04
Kay and Roscheisen (1993), Fung and Church (1994)

,,
Gao (1998) Fung Church (1994) Fung Church
(1994) K-vec (mutual information) t (t-score)
(mutual information)
(information theory)
N
f(x,y) x y
f(x)f(y) x y
$$ MI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \approx \log_2 \frac{N\, f(x, y)}{f(x)\, f(y)} $$
Ken Church (1991)
(word association)
1.65
T-T-(t-score)
(statistical significance test) Ken Church (1991)
T-

(confidence interval) T- 1.65 95
T
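$$ t = \frac{P(x \mid y) - P(x \mid z)}{\sqrt{\sigma^2\bigl(P(x \mid y)\bigr) + \sigma^2\bigl(P(x \mid z)\bigr)}} $$
Here P(x | y) is the probability of x given y. With N the size of the corpus, f(x, y) the number of co-occurrences of x and y, and f(x) and f(y) the frequencies of x and y, the t-score can be approximated as:
$$ t \approx \frac{f(x, y) - \frac{f(x)\, f(y)}{N}}{\sqrt{f(x, y)}} $$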
Fung Church (1994)
Fung Church K
K K
1 0
Fung Church
5 10 K K
Fung Church (1994)
()
a = k(A, B)      b = k(~A, B)
c = k(A, ~B)     d = k(~A, ~B)

    a   b
    c   d

t P(Vc) P(Ve)
$$ MI(V_c, V_e) = \log_2 \frac{P(V_c, V_e)}{P(V_c)\,P(V_e)} $$
$$ P(V_c) = \frac{a + b}{a + b + c + d} $$
$$ P(V_e) = \frac{a + c}{a + b + c + d} $$
$$ t(V_c, V_e) = \frac{P(V_c, V_e) - P(V_c)\,P(V_e)}{\sqrt{P(V_c, V_e) / K}} $$
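The contingency-table formulas above can be evaluated with a few lines of Python. One quantity is not spelled out in the surviving text: the joint probability P(Vc, Ve). The sketch assumes P(Vc, Ve) = a / (a + b + c + d) and K = a + b + c + d (the number of text pieces), following the usual K-vec formulation; the function name and example counts are also assumptions.

```python
import math

def kvec_scores(a, b, c, d):
    """Return (MI, t) for a candidate word pair from its 2x2 contingency table."""
    k = a + b + c + d
    if a <= 0 or k <= 0:
        raise ValueError("the pair must co-occur in at least one piece")
    p_vc = (a + b) / k
    p_ve = (a + c) / k
    p_joint = a / k  # assumption: joint probability estimated from cell a
    mi = math.log2(p_joint / (p_vc * p_ve))
    t = (p_joint - p_vc * p_ve) / math.sqrt(p_joint / k)
    return mi, t

# Both words occur in the same 4 of 16 pieces and nowhere else.
print(kvec_scores(4, 0, 0, 12))
```

In this example the mutual information is high, yet the t-score stays below 1.65, which shows why the two measures are used together: MI alone overrates pairs supported by very few observations.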
Gao (1998)
70 30
Gao (1998) Fung Church (1994)
University of Pennsylvania
Summer Institute of Linguistics
Fung Church
(1994) 90
Fung and Church (1994) K-vec

1
2exact string match
teacher

-partial string match
(proximity)
(translation equivalents) 90
Champollion Tool Kit (CTK)
CTK
CTK
Perl
CTK
CTK

CTK
Lucene

80 (rule-based)
(data-driven)
(Sinica Chinese Treebank)
(toolkit)Bird,
Klein, and Loper (2009) NLTK
(http://nltk.sourceforge.net/index.php/Main_Page)
(context-free parser)(finite-state
machine) (context-sensitive grammar)
(context-free parser)
Turing Machine
(context-free
parser)(phrase structure rules)
Bird, Klein, and Loper (forthcoming) NLTK python
NLTK NLP (Natural language processing)

NLP NLTK
NLTK PCFG
(probabilistic context-free parser)
Stanford parser (Klein and Manning (2002))
2002/12/05 parser 2006/6/11 1.5.1

Stanford parser PCFG JAVA
parser Dan Klein Christopher Manning
I/O Roger Levy, Christopher Manning,
Teg Grenager, Galen Andrew Stanford parser
Penn Treebank
Penn Chinese Treebank Penn
Treebank Xue and Xia (2000)
Stanford parser Klein and Manning (2002) factored
parsing maximum
likelihood-estimated PCFG constituent-free dependency parse
Stanford Parser Penn Chinese Treebank
Chiang (2003) Bikel (2004)
(parsing)
context free grammar ambiguous
S → VP
S → NP
VP → V NN
NP → V NN
NN


[S [VP [V ] [NN ] ] ]
[S [NP [V ] [NN ] ] ]
CFG parser
(probabilistic context free grammar)
weighted grammar derivation()
PCFG CFG parser
PCFG
derivation
PCFG
NLTK
Viterbi-style Viterbi PCFG parser
(bottom-up parser)(dynamic programming)
most likely constituent
table
1.
2.
3.
4.

Most Likely Constituents Table
Span    Node   Tree                                     Prob
[0:1]   NP     (NP )                                    0.3
[2:3]   NP     (NP )                                    0.3
[5:6]   NP     (NP )                                    0.3
[1:4]   PP     (PP (NP ) )                              0.05
[5:6]   VP     (VP (NP ))                               0.03
[0:4]   NP     (NP (NP ) (PP (NP ) ))                   0.01
[0:6]   S      (S (NP (NP ) (PP (NP ) )) (VP (NP )))    0.0001
parser S
Viterbi parser PCFG
Viterbi parser 1
234
Inserting tokens into the most likely constituents table...
Insert: |=.....|
Insert: |.=....|
Insert: |..=...|
Insert: |...=..|
Insert: |....=.|
Insert: |.....=|

Finding the most likely constituents spanning 1 text elements...
Insert: |=.....| NP -> '' (p=0.15) 0.1500000000
Insert: |.=....| P -> '' (p=0.61) 0.6100000000
Insert: |.=....| NP -> '' (p=0.5) 0.500000000
Insert: |..=...| LC -> '' (p=0.1) 0.5000000000
Insert: |...=..| V -> '' (p=0.61) 0.6500000000
Insert: |....=.| VP -> V (p=0.2) 0.1300000000
Insert: |.....=| NP -> '' (p=0.5) 0.5000000000
Insert: |....==| VP -> V NP (p=0.7) 0.0455000000
Insert: |.===..| PP -> P NP LC (p=1.0) 0.1525000000
Insert: |====..| NP -> NP PP (p=0.25) 0.0057187500
Insert: |.=====| VP -> PP VP (p=0.1) 0.0069387500
Discard: |.=====| VP -> PP VP (p=0.1) 0.0069387500
Insert: |======| S -> NP VP (p=1.0) 0.0002602031
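The bottom-up, dynamic-programming search shown in the trace can be sketched in plain Python for a grammar in Chomsky normal form. This is not NLTK's Viterbi parser, and the grammar is not the one behind the trace: the English toy grammar, its probabilities and the function name `viterbi_parse` are all hypothetical. The example sentence is the PP-attachment case mentioned earlier in the report.

```python
from collections import defaultdict

# Hypothetical toy PCFG. Binary rules: (parent, left child, right child) -> prob.
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.2, ("NP", "Det", "N"): 0.5, ("PP", "P", "NP"): 1.0}
# Lexical rules: word -> [(tag, prob)].
LEXICAL = {"John": [("NP", 0.3)], "saw": [("V", 1.0)], "with": [("P", 1.0)],
           "the": [("Det", 0.5)], "a": [("Det", 0.5)],
           "man": [("N", 0.5)], "telescope": [("N", 0.5)]}

def viterbi_parse(words):
    """Return (probability, tree) of the most likely S over all words, or None."""
    n = len(words)
    best = defaultdict(dict)  # (start, end) -> {label: (prob, tree)}
    for i, word in enumerate(words):
        for tag, p in LEXICAL.get(word, []):
            best[(i, i + 1)][tag] = (p, (tag, word))
    for span in range(2, n + 1):           # shorter spans are filled first
        for i in range(n - span + 1):
            j = i + span
            cell = best[(i, j)]
            for k in range(i + 1, j):      # split point
                for (parent, left, right), p in BINARY.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        lp, lt = best[(i, k)][left]
                        rp, rt = best[(k, j)][right]
                        prob = p * lp * rp
                        # keep only the most likely constituent per label
                        if prob > cell.get(parent, (0.0, None))[0]:
                            cell[parent] = (prob, (parent, lt, rt))
    return best[(0, n)].get("S")

prob, tree = viterbi_parse("John saw the man with a telescope".split())
print(prob)
print(tree)
```

With these made-up probabilities the parser attaches the prepositional phrase to the verb phrase, because that derivation has the higher product of rule probabilities; the competing noun-phrase attachment is discarded in the same way as the "Discard" line in the trace.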

NP Chunking
NP Chunking NLP
(Ramshaw and Marcus (1995), Kudo and Matsumoto (2000, 2001))
(parsing), semantic role labeling,
(co-reference), coherence,
(information retrieval), information extraction, text mining,
named entity

base NP
(NP conjunction)
94(Kudo and Matsumoto (2000,
2001))
finite state machines pattern (Voutilainen
(1993)) Church (1988),
Chen and Chen (1994)
(Penn Treebank) ( Marcus, Santorini and Marcinkiewicz (1993)),
machine learning
HMM (hidden Markov model), transformation-based (Ramshaw and Marcus
1995), memory-based (Veenstra 1998; Tjong Kim Sang and Veenstra 1999;
Argamon, Dagan and Krymolowski 1998), maximum entropy (Skut and Brants
1998), SVM (Kudo and Matsumoto, 2000, 2001)
HMM (hidden Markov model) finite state
machine transition function transformation-based
learning transformational rules,
parse HMM, transformation-based learning, memory-based learning,
SVM machine learning,
Wall Street Journal Corpus,
(precision) (recall) 90%, SVM (Kudo and Matsumoto
2001) (precision) (recall) 94%
Zhao and Huang (1998)
minimum description length principle (MDL)
quasi-dependency strength base NP
unsupervised learning(close test)(open test)
91.5% 88.7%
SVM (treebank)
argument
16

2
NP, VP ,
(:

http://godel.iis.sinica.edu.tw/CKIP/treebank/apposition.htm
terminal node
head(node)
NP, , , , ,
base NP
,,
(2005) Kudo and Matsumoto (2000, 2001)
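(I,O,B):
class I,O,B:
I: the current token is inside a chunk.
O: the current token is outside any chunk.
B: the current token begins a chunk that immediately follows another chunk.
Tjong Kim Sang IOB1 Start/End
(Uchimoto et al. 2000)
S, E, I, O, B class:
B: the current token is the first token of a chunk of two or more tokens.
E: the current token is the last token of a chunk of two or more tokens.
I: the current token is in the middle of a chunk of three or more tokens.
S: the current token forms a chunk by itself.
O: the current token is outside any chunk.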
Inside/Outside   Start/End
I                S
O                O
I                B
I                I
I                E
B                S
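The correspondence between the two tag sets in the table can be made explicit with a small conversion routine. The sketch assumes the Inside/Outside column follows the IOB1 convention (B is used only when a chunk immediately follows another chunk); the function name is hypothetical.

```python
def iob1_to_start_end(tags):
    """Convert an IOB1 tag sequence into Start/End (B, I, E, S, O) tags."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append("O")
            continue
        prev_tag = tags[i - 1] if i > 0 else "O"
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        starts = tag == "B" or prev_tag == "O"   # first token of a chunk
        ends = next_tag in ("O", "B")            # last token of a chunk
        if starts and ends:
            converted.append("S")
        elif starts:
            converted.append("B")
        elif ends:
            converted.append("E")
        else:
            converted.append("I")
    return converted

# The tag sequence from the table above.
print(iob1_to_start_end(["I", "O", "I", "I", "I", "B"]))
```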
, 7 , Wordi i
POSi i :
Wordi POS(i-2) POS(i-1) POSi POS(i+1) POS(i+2)

I 1: 2:0 3:0 4:N 5:S 6:N
O 1: 2:0 3:N 4:S 5:N 6:V
I 1: 2:N 3:S 4:N 5:V 6:N
I 1: 2:S 3:N 4:V 5:N 6:V
I 1: 2:N 3:V 4:N 5:V 6:0
B 1: 2:V 3:N 4:V 5:0 6:0
nominalization

VD(), VK(), VG()
(D) (VK) (Nc) (Na) (P) (Na) (Na)
(VD) (Na)
NP chunks : , ;
NP chunk , SVM ,
( SVM (V)),
VH , NP , V
.,
,

23, 6,
SVM
SVM
       Precision   Recall
(1)    54.99%      53.17%
(2)    78.18%      59.33%

Kudo and Matsumoto (2000,2001)
(94%) ; :
IOB tag, I/O tag, chunk
( chunk ) tag, IOB
Start/End
SVM
kernel function SVM ,
linear, polynomial, radial basis function, sigmoid... ,
cross validation
: . 8,000
4 , SVM time complexity O(n^2),
300,000 , , ,
cross validation , scaling
YAMCHA (http://chasen.org/~taku/software/YamCha/) Taku Kudo NP
Chunking SVM SVM SVM Tool: LIBSVM


(Chih-Chung Chang and Chih-Jen Lin, 2004)YAMCHA libsvm
a) Dynamic programming
b) Kernel Function
libsvm , chunking
chunking . , :
Inside/Outside
(B)
SVM tag ,
; IOB tag, (
) , IOB tag ,
YAMCHA Kudo and Matsumoto (2000,2001) IOB tag IO tag ,
B tag NP-chunk , NP-chunk
Kudo and Matsumoto (2000,2001) voting voting
, parsing
(backward ),
parsing SVM , Accuracy

. parsing ,
SVM
Kudo and Matsumoto (2000) F measure
F = (2* precision * recall) / (precision + recall) precision recall
recall precision F measure precision recall
YAMCHA Base-NP chunking
parsing
                           Precision              Recall                 F measure
(Forward)                  86.48% (10360/11980)   88.41% (10360/11716)   87.43%
(Backward)                 86.29% (9983/11569)    85.21% (9983/11716)    85.74%
(Forward)                  87.34% (8789/10063)    75.02% (8789/11716)    80.71%
(Backward)                 84.88% (8651/10192)    73.84% (8651/11716)    78.98%
Vote using Accuracy Rate   88.71% (10048/11327)   85.76% (10048/11716)   87.21%
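Precision, recall and the F measure defined above follow directly from three counts: the number of correctly identified chunks, the number of chunks the system proposed, and the number of chunks in the gold standard. The sketch below uses the counts of one (Backward) row as input; the function name is an assumption.

```python
def precision_recall_f(correct, proposed, gold):
    """Return (precision, recall, F) with F = 2PR / (P + R)."""
    precision = correct / proposed
    recall = correct / gold
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = precision_recall_f(correct=9983, proposed=11569, gold=11716)
print(f"precision={p:.2%} recall={r:.2%} F={f:.2%}")
```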
F measure forward parsing
voting F measure
forward parsing backward
parsing votingrecall
12 16

, 10 ,
26 dynamic programming IOB Start/End
95 F measure
(Support Vector Machine)
(dependency parser)
,,

Yamcha
,(Supervised Machine Learning)
, l
(terms), l , O(l)
O(l^2) l(l-1) ,
(feature), R, L, O
, 3(3-1) / 2 = 3 :
()

R ()
O ()
L ()

(Support Vector Machine),,
:,
l_LO , l_LR , l_OR L O, L R, O R ;
n ,
O(l^2 (l_LO + l_LR + l_OR) n)
(binary classifier)
(multiclass classifier)
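Building a three-way decision (L, R, O) from binary classifiers can be sketched as pairwise voting: one binary classifier per pair of classes, 3(3-1)/2 = 3 in total, and the class with the most wins is chosen. The helper names below are hypothetical and the binary decisions are supplied by hand rather than by a trained SVM.

```python
from collections import Counter
from itertools import combinations

def pairwise_tasks(labels):
    """List every pair of classes that needs its own binary classifier."""
    return list(combinations(labels, 2))

def vote(decisions):
    """decisions maps each class pair to the class its binary classifier chose."""
    return Counter(decisions.values()).most_common(1)[0][0]

tasks = pairwise_tasks(["L", "R", "O"])
print(len(tasks), tasks)

# Hand-written outcomes of the three binary classifiers for one word pair.
decisions = {("L", "R"): "L", ("L", "O"): "L", ("R", "O"): "O"}
print(vote(decisions))
```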
,,
,,
,,
,,
,(HEADER);,
2008
(Unsupervised Machine Learning)
2008 Google

2008
TinySVM YamCha (Kudo and Matsumoto (2000))
(Kudo and Matsumoto (2002))
(L, R, O)
YamCha
Perl
(2-degree polynomial kernel L1-loss support vector machine),
1(C = 1)( 125 )
()
43253
882708
L O (lLO) 77161
L R (lLR) 34490
O R (lOR) 130692
, 12492 , 246054 ,74850
0.4

l ,,
l , l(l-1) (O),
,:
()
57109 (76.298%)
6724 (53.826%)
bank
bank
Senseval-2 English lexical sample 2001 73
(thesaurus) Lesk (1986)

Walker (1987)(thesaurus)
(supervised learning)
(unsupervised learning)
Purandare and Pedersen (2004)
Wordnet
(co-occurrence matrix) Singular
Value Decomposition (SVD) 100 Latent Semantic Indexing
(LSI)Jurafsky and Martin (2000)
(collocational features)
bag of words information
Semantic Concordancer Senseval
pseudoword pseudoword Gale et
al. (1992) Schütze (1992)
banana-door banana door
banana-door
duty
Brown et al. (1991) Gale et al.
(1992)
50 (context vector), Bayesian classification
Bayesian classification Bayesian classification
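Bayesian classification over context words can be illustrated with a tiny naive Bayes sense classifier. This is a sketch under stated assumptions, not the method of any cited paper: the sense labels, the training contexts for the ambiguous word "bank", the add-one smoothing and the function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sense, context words). Returns the counts needed later."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def classify(context, model):
    """Pick the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[sense].values()) + len(vocab)  # add-one smoothing
        for word in context:
            score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train([
    ("money", ["deposit", "loan", "money"]),
    ("money", ["loan", "interest"]),
    ("river", ["water", "shore", "fishing"]),
])
print(classify(["loan", "deposit"], model))
print(classify(["water", "shore"], model))
```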
Yarowsky (1995)
(One sense per discourse)
(One sense per collocation)
Yarowsky (1995)Lin (1997)
(classifier)
(knowledge source)
MINIPAR (dependency relations)
(local context)Lin (1997)
Le and Shimazu (2004)
Forward Sequential Selection Algorithm
mutual
information Flip-Flop algorithm (Brown et al. (1991)), decision list
(Yarowsky (1994))
Naive Bayes, Maximum Entropy, Support Vector Machine,
Conditional Random Field

Lucene
Lucene Java
(Inverted File) API Lucene
(open source)Lucene
Lucene
API
Lucene
Lucene
Lucene
Identification of Fields in Texts to be Indexed: doc (field1, field2 ...)
  -> Indexer
  -> Lucene Index
  -> Searcher
  -> Output of Lucene: Hits (doc (field1, field2 ...), doc (field1 ...) ...)
Lucene
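The indexer and searcher stages of the flow above rest on an inverted file: a map from each term to the documents that contain it. The sketch below shows that idea in plain Python. It does not use or imitate the Lucene API; the class name, the whitespace tokenization and the sample documents are assumptions.

```python
from collections import defaultdict

class InvertedIndex:
    """A toy inverted file keyed by (field, term)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of document ids
        self.docs = {}

    def add(self, doc_id, fields):
        """Index a document given as a dict of field name -> text."""
        self.docs[doc_id] = fields
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose field contains every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            hits &= self.postings.get((field, term), set())
        return sorted(hits)

index = InvertedIndex()
index.add(1, {"title": "Corpus processing", "body": "sort and uniq count words"})
index.add(2, {"title": "Parsing", "body": "the parser builds trees from words"})
print(index.search("body", "words"))
print(index.search("body", "count words"))
```

Searching only touches the posting sets of the query terms, which is why lookup stays fast as the collection grows.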

LDC Sketch Engine
Linguistic Data Consortium (LDC)
Chinese Gigaword
LDC
Sketch Engine
1 2 Sketch Engine 10
50 Sketch Engine
1950
1980
,1990

1990 (BNC)
Collins
Collins Cobuild Project
1990
1960 Brown Corpus
IBM AT&T
Google
ETS
ETS

DARPA
NIST
1988
24
HowNet
2. 3.
4.
5.

A     A       Nb    N       VB    Vi
Caa   C       Nc    N       VC    Vt
Cab   POST    Ncd   N       VCL   Vt
Cba   POST    Nd    N       VD    Vt
Cbb   C       Nep   DET     VE    Vt
D     ADV     Neqa  DET     VF    Vt
DE    T       Neqb  POST    VG    Vt
Da    ADV     Nes   DET     VH    Vi
Dfa   ADV     Neu   DET     VHC   Vt
Dfb   ADV     Nf    M       VI    Vi
Di    ASP     Ng    POST    VJ    Vt
Dk    ADV     Nh    N       VK    Vt
FW    FW      SHI   Vt      VL    Vt
I     T       T     T       V_2   Vt
NAV   NAV     VA    Vi
Na    N       VAC   Vi

Support Vector Machine (SVM)
SVM machine learning (Boser, Guyon, and Vapnik (1992),
Cortes and Vapnik (1995))
, SVM
(Joachims 1998; Taira and Haruno 1999), (Kudo and
Matsumoto 2000, 2001),
unknown word
guessing (Nakagawa, Kudo, and Matsumoto (2001)), (part of speech
tagging) (Nakagawa, Kudo, and Matsumoto (2002); Giménez and Màrquez
(2004)), (dependency analysis) (Kudo and Matsumoto
(2000)), (word sense disambiguation and sense tagging)
(Cabezas, Resnik, and Stevens (2001)), (semantic parsing) (Pradhan
et al. (2004); Sun and Jurafsky (2004))
SVM machine,
SVM () ,
, , (
), SVM

$$ \min_{w} \; \frac{1}{2} \lVert w \rVert^2 $$
(, ),
Lagrange multiplier
(), SVM
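$\Phi : \mathbb{R}^d \to H$ maps x into the space H.
kernel function: $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$
, kernel function .
SVM
$(x_i, y_i),\ i = 1, 2, \ldots, l;\ x_i \in \mathbb{R}^n;\ y_i \in \{1, -1\}$
Here l is the number of training examples, x_i is an n-dimensional vector, and y_i is its class label (1 or -1). The SVM solves:
$$ \min_{w, b, e} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} e_i $$
$$ \text{subject to} \quad y_i \bigl( w^T \Phi(x_i) + b \bigr) \ge 1 - e_i, \quad e_i \ge 0 $$
x_i is mapped into H by $\Phi$; $K(x_i, x_j)$ is the kernel function.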
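The kernel functions listed earlier (linear, polynomial, radial basis function, sigmoid) can be written out in a few lines, and the identity K(x_i, x_j) = Φ(x_i)^T Φ(x_j) can be checked numerically for the 2-degree polynomial kernel. The parameter names `gamma`, `coef0` and `degree` and their default values are assumptions of this sketch, as is the explicit feature map `phi_poly2`, which is written for 2-dimensional inputs only.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

def phi_poly2(x):
    """Explicit feature map for (x . y + 1)^2 with 2-dimensional x."""
    x1, x2 = x
    s = math.sqrt(2)
    return (1.0, s * x1, s * x2, x1 * x1, s * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(x, y))          # kernel evaluated in the input space
print(dot(phi_poly2(x), phi_poly2(y)))  # same value via the mapped vectors
```

The kernel gives the same number as the inner product in the 6-dimensional mapped space without ever constructing the mapped vectors, which is what makes non-linear SVMs practical.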

Bayesian Classification
Bayesian Classification
k
c

()
http://ckipsvr.iis.sinica.edu.tw/
AntConc http://www.antlab.sci.waseda.ac.jp/software.html
Hownet http://www.keenage.com/
MateParser http://barbar.cs.lth.se:8091/

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T05
Sinica Treebank 3.0. http://www.aclclp.org.tw/use_stb_c.php
Sketch Engine http://www.sketchengine.co.uk/
Stanford NLP Software http://nlp.stanford.edu/software/index.shtml
Stanford Parser http://www-nlp.stanford.edu/downloads/lexparser.shtml
Tools for Natural Language Analysis, Generation and Machine Learning
http://code.google.com/p/mate-tools/
YamCha: Yet Another Multipurpose CHunk Annotator
http://chasen.org/~taku/software/YamCha/

Bikel, Daniel. (2004). On the Parameter Space of Generative Lexicalized Statistical
Parsing Models. Ph.D. Dissertation. University of Pennsylvania.
Bird, Steven, Klein, Ewan and Loper, Edward (2009) Natural Language Processing
with Python. O'Reilly.
Chiang, David. (2003) Statistical parsing with an automatically extracted tree adjoining
grammar. In Data Oriented Parsing, CSLI Publications, pages 299-316.
Levy, Roger and Manning, Christopher D. (2003). Is it harder to parse Chinese, or the
Chinese Treebank? ACL 2003.
Manning, Christopher, and Schütze, Hinrich. (1999) Foundations of Statistical Natural
Language Processing. MIT Press.
Xue, Nianwen and Xia, Fei. (2000) "The Bracketing Guidelines for the Penn Chinese
Treebank (3.0)", IRCS Report 00-08, University of Pennsylvania, Oct 2000.
Argamon, Shlomo, Dagan, Ido, and Krymolowski, Yuval (1998). A Memory-Based
Approach to Learning Shallow Natural Language Patterns. In Proceedings of the
17th International Conference on Computational Linguistics, Vol. 1, pp. 67-73,
Montreal, Quebec, Canada.
Brill, Eric and Ngai, Grace (1999), Man vs. Machine: A Case Study in Base Noun
Phrase Learning. In Proceedings of ACL'99, pp. 65-72, University of Maryland, MD,
USA.
Boser, Bernhard E., Guyon, Isabelle, and Vapnik, Vladimir. (1992). A Training
Algorithm for Optimal Margin Classifiers. COLT: pp. 144-152.
Cabezas, Clara, Resnik, Philip, and Stevens,Jessica. (2001). Supervised Sense Tagging
using Support Vector Machines. Proceedings of the Second International Workshop
on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2), Toulouse,
France, 5-6 July 2001.
Cardie, Claire and Pierce, David (1998). Error-Driven Pruning of Treebank Grammars
for Base Noun Phrase Identification. In Proceedings of COLING-ACL'98, pp.
218-224, Montreal, Canada.
Chang, Chih-Chung and Lin, Chih-Jen. (2004) LIBSVM -- A Library for Support
Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Chen, Kuang-hua and Chen, Hsin-Hsi (1994). Extracting Noun Phrases from
Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation. In Proceedings
of ACL-94, Las Cruces, NM, USA.
Church, K. (1988) A Stochastic Parts Program and Noun Phrase Parser for Unrestricted
Text. Second Conference on Applied Natural Language Processing, Austin, Texas,
pp. 136-143.
Cortes, Corinna, and Vapnik, Vladimir (1995). Support-Vector Networks. Machine
Learning 20(3), pp. 273-297.
Giménez, Jesús and Màrquez, Lluís (2004). SVMTool: A general POS tagger generator
based on Support Vector Machines. Proceedings of the 4th International Conference
on Language Resources and Evaluation (LREC'04). Lisbon, Portugal.
Joachims, Thorsten. (1998) Text Categorization with Support Vector Machines:
Learning with Many Relevant Features. Proceedings of the European Conference on
Machine Learning (ECML), Springer, 1998.
Hsu, Chih-Wei, Chang, Chih-Chung, and Lin, Chih-Jen. (2004). A Practical Guide to
Support Vector Classification.
Kudo, Taku, and Matsumoto, Yuji. (2000). Use of Support Vector Learning for Chunk
Identification. In Proceedings of CoNLL-2000, pp. 142-144.
Kudo, Taku, and Matsumoto, Yuji (2000). Japanese Dependency Analysis Based on
Support Vector Machines, EMNLP/VLC 2000
Kudo, Taku, and Matsumoto, Yuji. (2001). Chunking with Support Vector Machine. In
Proceedings of NAACL 2001, pp. 192-199.
Marcus, Mitchell P., Santorini, Beatrice and Marcinkiewicz, Mary Ann. (1993)
Building a large annotated corpus of English: the Penn Treebank. Computational
Linguistics, vol. 19, no. 2, pp. 313-330.
Pradhan, Sameer, Ward, Wayne, Hacioglu, Kadri, Martin, James H. and Jurafsky,
Daniel. (2004). Shallow Semantic Parsing Using Support Vector Machines. In
Proceedings of NAACL-HLT 2004, pp. 233-240.
Nakagawa, Tetsuji, Kudo, Taku, and Matsumoto, Yuji. (2001). Unknown Word
Guessing and Part-of-Speech Tagging Using Support Vector Machines. NLPRS, pp.
325-331
Nakagawa, Tetsuji, Kudo, Taku, and Matsumoto, Yuji. (2002). Revision Learning and
its Application to Part-of-Speech Tagging. In Proceedings of ACL 2002, pp.
497-504.
Ramshaw, Lance A., and Marcus, Mitchell P.. (1995). Text Chunking Using
Transformation-based Learning. In Proceedings of the Third ACL Workshop on Very
Large Corpora, pp. 82-94, Cambridge MA, USA.
Skut, Wojciech and Brants, Thorsten. (1998) A Maximum-Entropy Partial Parser for
Unrestricted Text. In Proceedings of the Sixth Workshop on Very Large Corpora, pp.
143-151, Montreal, Canada.
Sun, Honglin and Jurafsky, Daniel. (2004). Shallow Semantic Parsing of Chinese. In
Proceedings of NAACL-HLT 2004, pp. 192-199.
Taira, Hirotoshi and Haruno, Masahiko (1999). Feature Selection in SVM Text
Categorization. AAAI/IAAI 1999, pp. 480-486.
Tjong Kim Sang, Erik F. and Veenstra, Jorn (1999). Representing Text Chunks. In
Proceedings of EACL'99, 173-179, Bergen, Norway.
Tjong Kim Sang, Erik F. (2002) Memory-Based Shallow Parsing. Journal of Machine
Learning Research, Vol. 2, pp. 559-594.
Uchimoto, Kiyotaka, Ma, Qing, Murata, Masaki, Ozaku, Hiromi, Isahara, Hitoshi. (2000)
Named entity extraction based on a maximum entropy model and transformation
rules. Proceedings of the 38th Annual Meeting on Association for Computational
Linguistics, Hong Kong, pp. 326-335.
Veenstra, Jorn. (1998). Fast NP chunking using memory-based learning techniques. In
F. Verdenius and W. van den Broek (eds.), Proceedings of BENELEARN-98, pp. 71-79,
Wageningen, The Netherlands.
Voutilainen, A. (1993) NPtool, a Detector of English Noun Phrases. In Proceedings of
the First Annual Workshop on Very Large Corpora, pp. 48-57.
Berners-Lee, Tim. (2000) Weaving the Web: the original design and ultimate destiny of the World
Wide Web by its inventor. New York: HarperBusiness.
Boguraev, Branimir and Briscoe, Ted. (1989) Computational Lexicography for Natural
Language Processing. Longman: Harlow.
Boguraev, Branimir and Pustejovsky, James (eds.) (1996) Corpus Processing for Lexical
Acquisition, MIT Press.
Chaffin, Roger and Herrmann, Douglas. (1988) The Nature of Semantic Relations: a
Comparison of Two Approaches. In Evens (ed.) (1988), pp. 289-334.
Church, K. and Hanks, P. (1990) Word Association Norms, Mutual Information, and
Lexicography. Computational Linguistics, Vol. 16, No. 1, pp. 22-29.
Church, K. et al. (1991) Parsing, Word Associations, and Typical
Predicate-Argument Relations. In Tomita (ed.) Current Issues in Parsing Technology,
Kluwer.
Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. (1994) Lexical
Substitutability, in Atkins and Zampolli (eds.) Computational Approaches to the
Lexicon, pp. 153-177. Oxford: Oxford University Press.
Cruse, Alan. (1986) Lexical Semantics. Cambridge: Cambridge University Press.
Dong, Zhendong and Dong, Qiang. (2006) Hownet and the Computation of Meaning.
World Scientific.
Evens, Martha. (ed.) (1988) Relational Models of the Lexicon: Representing
Knowledge in Semantic Networks. Cambridge University Press.
Fillmore, Charles. (1968) The Case for Case. In E. Bach and R. T. Harms, eds., Universals
in Linguistic Theory, Holt, Rinehart and Winston, New York, 1-88.
Koenig, Jean-Pierre. (1999) Lexical Relations. CSLI, Stanford University.

Girju, R., Nakov, P., Nastase, V., Szpakowicz, S., Turney, P., and Yuret, D. (2007),
SemEval-2007 Task 04: Classification of Semantic Relations between Nominals,
Proceedings of the Fourth International Workshop on Semantic Evaluations
(SemEval-2007), Prague, Czech Republic, pp. 13-18.
Grefenstette, Gregory. (1994) Explorations in Automatic Thesaurus Discovery.
Kluwer Academic Publishers.
Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In
Proceedings of the Fourteenth International Conference on Computational
Linguistics, pages 539-545, Nantes, France.
Levin, Beth. (1985) Introduction, in B. Levin (ed.) Lexical Semantics in Review, Lexicon
Project Working Papers 1, Center for Cognitive Science, MIT, pp. 1-62.
Mel'čuk, Igor. (1988) The Explanatory Combinatory Dictionary, in M. Evens (ed.)
(1988), pp. 41-74.
Pustejovsky, James, Sabine Bergler, and Peter Anick (1993) Lexical Semantic
Techniques for Corpus Analysis, Computational Linguistics, Vol. 19, No. 2, pp. 331-358.
Pustejovsky, James. (1995) The Generative Lexicon. The MIT Press.
Pustejovsky, James. (2000) Syntagmatic Processes. In Handbook of Lexicology and
Lexicography, de Gruyter, 2000.
Jackendoff, Ray. (1983) Semantics and Cognition. Cambridge, Mass.: MIT Press.
Jackendoff, Ray. (1990) Semantic Structures. Cambridge, Mass.: MIT Press.
Jones, Steven. (2002). Antonymy: A Corpus-based Perspective. London; New York:
Routledge, 2002.
Pedersen, Patwardhan, and Michelizzi (2004) WordNet::Similarity - Measuring the
Relatedness of Concepts. Appears in the Proceedings of the Nineteenth National
Conference on Artificial Intelligence (AAAI-04), pp. 1024-1025, July 25-29, 2004,
San Jose, CA (Intelligent Systems Demonstration)
Resnik, Philip. (1992) WordNet and Distributional Analysis: A Class-based Approach
to Lexical Discovery, in Workshop Notes, Statistically-Based NLP Techniques,
American Association for Artificial Intelligence, pp. 109-113.
Schank, Roger. (1975) Conceptual Information Processing. Amsterdam: North-Holland.
Sinclair, John. (ed.) (1987) Looking Up. Glasgow: Collins.
Turney, P. D. (2006), Expressing implicit semantic relations without supervision,
Proceedings of the 21st International Conference on Computational Linguistics and
44th Annual Meeting of the Association for Computational Linguistics
(Coling/ACL-06), Sydney, Australia, pp. 313-320.
Wilks, Yorick A. (1968) On-line Semantic Analysis of English Texts. Machine Translation,
Vol. 11, pp. 59-72.
Brown, Peter et al. (1991) Word sense disambiguation using
statistical methods. In ACL-29, pp. 264-270.
Dong, Zhendong and Dong, Qiang. (2006) Hownet and the Computation of Meaning.
World Scientific.
Gale, William, Church, Kenneth, and Yarowsky, David. (1992) A method of
disambiguating word senses in a large corpus. Computers and the Humanities
26: 415-439.
Jurafsky, Daniel, and James H. Martin. (2000) Speech and Language Processing: An
Introduction to Natural Language Processing, Speech Recognition, and
Computational Linguistics. Prentice Hall.
Klein, Dan and Manning, Christopher. (2003) Accurate Unlexicalized Parsing.
Proceedings of the 41st Meeting of the Association for Computational Linguistics,
pp. 423-430.
Le, Cuong Anh and Shimazu, Akira. (2004) High WSD Accuracy Using Naive Bayesian
Classifier with Rich Features. PACLIC 18, Tokyo.
http://dspace.wul.waseda.ac.jp/dspace/bitstream/2065/564/1/oral8.pdf
Lesk, Michael. (1986) Automatic Sense Disambiguation: How to tell a pine cone from
an ice cream cone. In Proceedings of the 1986 SIGDOC Conference, pp. 24-26,
New York. Association for Computing Machinery.
Lin, Dekang. (1997). Using Syntactic Dependency as Local Context to Resolve Word
Sense Ambiguity. In Proceedings of ACL-97, Madrid, Spain. July, 1997.
Manning, Christopher, and Schütze, Hinrich. (1999) Foundations of Statistical Natural
Language Processing. MIT Press.
Patwardhan, Banerjee, and Pedersen (2005) SenseRelate::TargetWord - A Generalized
Framework for Word Sense Disambiguation. Appears in the Proceedings of the
Twentieth National Conference on Artificial Intelligence, July 12, 2005, Pittsburgh,
PA. (Intelligent Systems Demonstration)
Purandare and Pedersen (2004) Improving Word Sense Discrimination with Gloss
Augmented Feature Vectors. Appears in the Proceedings of the Workshop on Lexical
Resources for the Web and Word Sense Disambiguation, November 22, 2004,
Puebla, Mexico.
Yarowsky, D. (1994) Decision Lists for Lexical Ambiguity Resolution: Application to
Accent Restoration in Spanish and French. In Proceedings of the 32nd Annual
Meeting of the Association for Computational Linguistics. Las Cruces, NM, pp.
88-95.
Zhao, Jun and Huang, Changning. (1998). A Quasi-Dependency Model for Structural
Analysis of Chinese BaseNPs. In Proceedings of COLING-ACL'98, pp. 1-7, Montreal,
Canada.
(2007)
, pp. 257-272
(2007) :
, pp. 131-144
2005
pp. 317-332
2005
pp. 385-396.
2004
(2001)
(Sinica Treebank Version 3.0).

http://www.aclclp.org.tw/use_stb_c.php
(1988). ,


1947 Shannon (Noisy Channel Model) (information theory)

Francis Kucera Brown Corpus

(Hidden Markov Model)

John Sinclair, Geoffrey Leech, Sidney Greenbaum, Jan Svartvik, Randolph Quirk.

Sinclair (1987)1, Quirk, Greenbaum, Leech, Svartvik (1985)2, Garside,

Leech, Sampson (1987)3 John Sinclair Harper Collins

Cobuild Collins Cobuild

(lexical knowledge acquisition)

Brown Corpus, LOB (Lancaster-Oslo-Bergen) Corpus, BNC (British National Corpus), Project Gutenberg

Penn Corpus, SUSANNE Corpus

Sinica Treebank (http://www.aclclp.org.tw/use_stb_c.php)

Sinica Treebank Penn Chinese Treebank

Sinica Treebank ( Head-Driven Principle )

Penn Chinese Treebank

Penn Treebank Sinica Treebank

Sinica Treebank Penn Chinese

Treebank Linguistic Data Consortium (LDC) Penn

Chinese Treebank Chinese Proposition Bank

AntConc concordancer Concordance hits

Chinese Word Segmenter http://nlp.stanford.edu/software/segmenter.shtml

98, Stanford Parser

Chinese Word Segmenter

, Minipar Stanford Parser Lund

Stanford Parser Stanford Parser

( Dong and Dong (2006) Hownet

doctor {human| :HostOf={Occupation|

"Ae142",""

"Ae151",""

(Hidden Markov Model, HMM)

saw the man with a telescope. with a telescope the

tr 'A-Z' 'a-z' < datafile > output

tr -sc 'A-Za-z' '\012' < datafile > output

ASCII order: sort datafile > output

sort -n datafile > output

sort -nr datafile > output

sort -k 3 -nr datafile > output

uniq datafile > output

uniq -c datafile > output
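
The tr, sort, and uniq commands listed above are usually chained into a single pipeline to produce a word-frequency list. The following is a minimal sketch of that idea, not necessarily the exact pipeline used in the report; the function name word_freq and the sample sentences are hypothetical and introduced only for illustration.

```shell
#!/bin/sh
# word_freq FILE: print each distinct word of FILE with its count,
# most frequent first. (Hypothetical helper name, for illustration only.)
word_freq() {
    # 1. tr -sc: turn every run of non-letters into one newline (one word per line)
    # 2. tr:     fold upper case to lower case
    # 3. sort:   bring identical words together so uniq can count them
    # 4. uniq -c: prefix each distinct word with its number of occurrences
    # 5. sort -nr: order numerically by count, largest first
    tr -sc 'A-Za-z' '\012' < "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
}

# Hypothetical sample input standing in for a real corpus file.
datafile=$(mktemp)
printf 'The cat saw the dog.\nThe dog saw the cat and the bird.\n' > "$datafile"

word_freq "$datafile"
rm -f "$datafile"
```

The sort step before uniq -c is needed because uniq only merges adjacent identical lines. Note that the character class A-Za-z treats anything outside unaccented Latin letters as a separator, so this sketch is suitable for English text only.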