(3.0 http://www.sinica.edu.tw/ftms-bin/kiwi.sh)
tag
1.~
2.
3.
4.+prop 30 32
1998.8.19
Sinica Corpus
1994
1997 3.0
1.1
corpus-based
Svartvik 1992, Church and Mercer 1993, 1994, 1995
LOB (Lancaster-Oslo/Bergen), London-Lund
tag
versatility
1995
1.2
Huang & Chen 1992
Huang 1994
1991, Chen 1994
()
consortium
()
transcribe
() BBS
1.3
Design Features
()
1997
() text
()
monitor corpus
1.4
1.2
1994
infrastructure LOB London-Lund
2.1
%% =
%% =
%% =written
%% =
%% =
%% =
%% =
%% =
%% =
%% =
%% =
%% =
%%
%% =
2.1
LANCASTER-OSLO/BERGEN (LOB)
COBUILD Project
written ()
written-to-be-read ()
written-to-be-spoken ()
spoken ()
spoken-to-be-written ()
2.1.1
..
2.1.2
written-to-be-spoken
2.1.3
2.1.4
2.1.5
written written-to-be-read written-to-be-spoken spoken spoken-to-be-written
written
written-to-be-read
written-to-be-spoken
spoken
spoken-to-be-written
2.2
3.0
3.0
CORPUS 443.94 78.97 4.68 10.17 5.79 79.85 66.93 3.94 2.31 0.23 15.98 0.43 64.61 10.57 0.84
CORPUS 292.64 52.06 3.08 6.71 3.82 52.64 44.12 2.60 1.52 0.15 10.54 0.29 42.60 6.97 0.56
% 56.25 10.01 0.59 1.29 0.73 10.12 8.48 0.50 0.29 0.03 2.03 0.05 8.19 1.34 0.11
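The third row of the table above appears to give each column's share of the total across all columns. The following is a minimal Python sketch, assuming that reading, which recomputes the shares from the first row and compares them with the printed percentages. The values are copied from the table; the names `counts`, `reported` and `shares` are illustrative only, and the unit of the counts is not assumed.

```python
# Values copied from the first and third rows of the table above.
counts = [443.94, 78.97, 4.68, 10.17, 5.79, 79.85, 66.93, 3.94,
          2.31, 0.23, 15.98, 0.43, 64.61, 10.57, 0.84]
reported = [56.25, 10.01, 0.59, 1.29, 0.73, 10.12, 8.48, 0.50,
            0.29, 0.03, 2.03, 0.05, 8.19, 1.34, 0.11]


def shares(values):
    """Return each value as a percentage of the sum of all values."""
    total = sum(values)
    return [100.0 * v / total for v in values]


computed = shares(counts)
# Largest difference between the recomputed and the printed percentages.
max_gap = max(abs(c - r) for c, r in zip(computed, reported))

for c, r in zip(computed, reported):
    print(f"computed {c:6.2f}   reported {r:6.2f}")
print(f"largest gap: {max_gap:.3f} percentage points")
```

Small differences in the last digit are expected, since the printed counts are themselves rounded.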
3.0
CORPUS 246.89 230.28 5.49 32.23 1.06 10.71 66.70 180.20 12.90 2.00 0.81
CORPUS 162.57 151.80 3.62 21.25 0.70 7.06 43.96 118.80 8.50 1.32 0.53
% 31.28 29.18 0.70 4.08 0.13 1.36 8.45 22.83 1.63 0.25 0.10
3.0
3.0
3.1
(1)
1
(2)
(1)
() 19956
BBS user txt
2/28330
AB-8888
(2)
/
1
(3)
(4)
(5)
(6)
1.
2.
3.
4.
5.
(1)
(2)
(1)
(2)
(3)
(4)
(5)
(6)
3.2
1.
(1)(2)(2)(3)
(2)
(1)(2)(2)(3)(4)
(4)
(3)
(1)(2)(2)(3)
(1)(2)(2)(3)
(1)
(3)
(1)(6)
(1)(2)
(6)(1)
(1)
(1)(1)
(1)
(1)
(1)
(1)
(1)
(2)
(2)
(6)
(2)
(1)(2)
(6)
(2)
(6)
(6)
(1)(1)
(1)
2.
(1)(2)(2)(3)
(3)
(1)(2)(2)(3)
(3)
(6)
(1)
/(2)(5)
(6)
(1)
(1)(1)
(1)
3. (1)(6)
4.
(1)
,
(1)
(1)
(1)
(1)
(1)
(1)
(1)
5.(1)(2)(2)(3)
(1)
(3)
(1)
(2)
(1) (2)
6.(1)(6)
96% (Chen et al. 1994)
4.1
178 1993
43346
4 DaaDabDa
Ne
Neu
Nes Specific
Nep
Neqa
Neqb
CKIP5
A A /**/
Caa Caa /**/
Cab Cab /**/
Cba Cbab /**/
Cbb Cbaa, Cbba, Cbbb, Cbca, Cbcb /**/
Da Daa /**/
Dfa Dfa /**/
Dfb Dfb /**/
Di Di /**/
Dk Dk /**/
D Dab, Dbaa, Dbab, Dbb, Dbc, Dc, Dd, Dg, Dh, Dj /**/
Na Naa, Nab, Nac, Nad, Naea, Naeb /**/
Nb Nba, Nbc /**/
Nc Nca, Ncb, Ncc, Nce /**/
Ncd Ncda, Ncdb /**/
Nd Ndaa, Ndab, Ndc, Ndd /**/
Neu Neu /**/
Nes Nes /**/
Nep Nep /**/
Neqa Neqa /**/
Neqb Neqb /**/
Nf Nfa, Nfb, Nfc, Nfd, Nfe, Nfg, Nfh, Nfi /**/
Ng Ng /**/
Nh Nhaa, Nhab, Nhac, Nhb, Nhc /**/
I I /**/
P P* /**/
T Ta, Tb, Tc, Td /**/
VA VA11,12,13,VA3,VA4 /**/
VAC VA2 /**/
VB VB11,12,VB2 /**/
VC VC2, VC31,32,33 /**/
VCL VC1 /**/
VD VD1, VD2 /**/
VE VE11, VE12, VE2 /**/
VF VF1, VF2 /**/
VG VG1, VG2 /**/
VH VH11,12,13,14,15,17,VH21 /**/
VHC VH16, VH22 /**/
VI VI1,2,3 /**/
VJ VJ1,2,3 /**/
VK VK1,2 /**/
VL VL1,2,3,4 /**/
V_2 V_2 /**/
DE /*, , , */
SHI /**/
FW /**/
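The table above maps each simplified tag to the CKIP tags it covers, using a shorthand in which a bare number inherits the letters of the preceding tag (for example, `VA11,12,13` stands for VA11, VA12, VA13). The following is a minimal Python sketch of expanding that shorthand and building the reverse lookup from a CKIP tag to its simplified tag. Only a few rows are copied from the table; the names `ROWS`, `expand` and `CKIP_TO_SIMPLE` are hypothetical and not part of any CKIP tool.

```python
import re

# A few rows copied from the table above: simplified tag -> CKIP tag list.
ROWS = {
    "VA": "VA11,12,13,VA3,VA4",
    "VAC": "VA2",
    "VC": "VC2, VC31,32,33",
    "VCL": "VC1",
    "VH": "VH11,12,13,14,15,17,VH21",
    "VHC": "VH16, VH22",
    "Cbb": "Cbaa, Cbba, Cbbb, Cbca, Cbcb",
}


def expand(spec):
    """Expand shorthand such as 'VA11,12,13' into ['VA11', 'VA12', 'VA13']."""
    tags, prefix = [], ""
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        letters = re.match(r"[A-Za-z_]+", item)
        if letters:
            # A full tag: remember its leading letters for later bare numbers.
            prefix = letters.group(0)
            tags.append(item)
        else:
            # A bare number: reuse the letters of the previous full tag.
            tags.append(prefix + item)
    return tags


# Reverse lookup: fine-grained CKIP tag -> simplified tag.
CKIP_TO_SIMPLE = {
    fine: simple for simple, spec in ROWS.items() for fine in expand(spec)
}

print(expand("VC2, VC31,32,33"))
print(CKIP_TO_SIMPLE["VC1"], CKIP_TO_SIMPLE["VH16"])
```

Rows such as `P P*`, which use a wildcard rather than an explicit list, are left out of this sketch.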
5 #93-05
4.2
46
VH
tagging VH
VH
Caa
Cab
Cba
Cbb
Dfa
Dfb
Di
Dk
D
Nf
Ng
Neu
Nes
Neqb
P
I
T
VADV
N-modifier
1991a, bDE
/
Da ADV, N-modifier (Da)(Da)
N*-Nf-Ng N, N-modifier N*
(Na)(Na)
Ncd N, N-modifier, locative marker
(Ncd)(Ncd)(Ncd)
Nd N, N-modifier, ADV
(Nd)(Nd)(Nd)
Nep N(h), N-modifier Nh
(Nep)(Nep)
Neqa N, N-modifier, ADV
(Neqa)(Neqa)(Neqa)
V* V, N-modifier (VC)(VC)
VH V, N-modifier, ADV [6]
(VH)(VH)(VH)
V_2 (V_2)(V_2)
SHI (SHI)(SHI)(SHI)
A N-modifier, ADV (A)(A)
DE nominal marker, adverbial marker, complement marker
(DE)(DE)(DE)
Na
VHNa Na
Na(D)
NaD
6 VH
4.3
VE VJ(VE)(VJ)
8
7 4.2
8
(Ncd, Ng, Nes)
(8)(Neu)(Nf) (Ng) (Nc)(Ncd)
(9)(Ncd)(V_2)(VH)(Na)(Ncd)(V_2)(VH)(Na)
(10)(Ncd)(Na);(Ncd)(Nc)
(11)(Nes)(Neu)(Nf)
NcdNg
Ncd
Ng(8)(Ncd)
(Ng)
(10)
(Nes)
(11)
(P, Caa)
(12) (Nh)(VK)(Na)(Caa)(Na)
(13) (Nh) (Caa) (Nh) (D) (VA)
(14) (Nh) (Nd) (P) (Nh) (D) (VA)
(15) (Nep)(Nf)(Na) (P) (Nh) (VH)
(16) (Nh)(P)(Nh)(VH)(VH)
(Caa)(P)
(12)
(13)Caa
PCaaP
Caa
P(14)
(15)(16)
(15)Caa
(18) (P) (VC)(Na)(Na)((P) =)
(19) (Nb)(VG)(Na)((VG) =)
(20) (Nh)(VC)(Nep)(Nf)(Na)(SHI)(VJ)(Di)(Nh)
(18)(20)(20)VJ
(18)VJ
(17)(18)P
(Di,Dfa,VH,VCL)
(21) (Di)
(22) (Dfa)
(23) (VH)
(24) (VCL)
(25) (VCL)
(24) (25) (25)
(24)(23)
VCL
(Neqa, VH)
(26) (Neqa)(Nf)(Na)(D)(VC)
(27) (Neqa)(D)(P)(Nh)(VC)
(28) (Na)(VA)(Di)(Neqa)
(29) (Nb)(Neqa)(VCL)(Nc)
(30) (Nh)(VC)(DE)(VH)
(31) (Nh)(VH)(Neqa)
(32) (Nh)(D)(VA)
(33) (Na)(VH)
VH
Neqa (26)(29) (Neqa)
Neqa(31)D
(32)
(33)(30)VH
(30)
(34) (Nh)(V_2)(Nf)(Na)
(35) (Nh)(Dfa)(VH)
(36) (V_2)(Neqa)(Na)(D)(VK)
(37) (Nh)(Dfa)(VH)
(38) (Nh)(Neqa)(D)(D)(VK)
(34)Nf
Some 9
(38)Neqa
T, Di; T, DE
(39) (Nh)(VC)(VH)(Na)(T)
(40) (Nh)(VC)(Di)
(41) (Nh)(SHI)(Dfa)(VK)(T)
(42) (Nh)(SHI)(VD)(Na)(DE)
T
Di(40)
T (39) T
T
(42)
4.4
9
separable
1995
+vrv V of a separable VR compound Vc[+vrv]
+vrr R of a separable VR compound Vc[+vrr]
+spv V of a separable V N compound Vc[+spv]
+spo N of a separable V N compound Na[+spo]
+p1 the first part of a separated compound (Nc)[+p1](Nc)
+p2 the second part of a separated compound (Nd)(Nd) [+p2]
+fw the feature of a foreign word OK(Na)[+fw]
+nom the feature for verbal nominalization (VA)[+nom]
+prop the feature for proper nouns (A)[+prop](Nc)
A.
(43)
(44)
[+vrv]
[+vrr]
(43)
(44) (Nh) (VC)[+vrv] (D) (VC)[+vrr] (Nh)
(45)
(45) (Nh)(VCL)(Na)
(46) (VH)
B.
(47)(48)
[+spv] Na
[+spo](47)-(51)
(47) (Nh) (VC)[+spv](Di) (Dfa) (VH) (DE) (Na)[+spo]
(48) (VC)[+spv](Nh)(DE)(Na)[+spo]
(49) (VA)[+spv](Di)(Neu)(Na)[+spo]
(50) (VA)[+spv](Di)(Na)[+spo](T)
(51)-(53)
(51)
[+vrv] [+vrr][+spv] [+spo]
(51) (VC)[+spv][+vrv](DE)(VC)[+vrr](Na)[+spo]
(52) (VC)[+spv][+vrv](D)(VC)[+vrr](Na)[+spo]
(53) (VC)[+spv][+vrv](D)(VC)[+vrr](Na)[+spo](T)
C.
19953
[+p1][+p2]
(54)-(56)
(54) (Na)[+p1](Na)[+p1](Na)
(55) (Nc)[+p1](Nc)
(56) (Nc)(Nc)[+p2]
(Neu)(Neu)(Na)
(Nes)(Nes)(Neqa)
D.
FW +fw FW[+fw]
FW
[+fw] KTV OK
Nc
Na +fw KTV(Nc)[+fw] OK(Na)[+fw]
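The annotation used throughout this section attaches a tag in parentheses to each word and then zero or more bracketed features, as in KTV(Nc)[+fw] or (VC)[+spv][+vrv]. The following is a minimal Python sketch of reading such a line into structured tokens. The names `Token`, `parse_tagged`, `TOKEN_RE` and `FEATURE_RE` are hypothetical, and the sketch assumes that features follow the tag with no intervening space.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str            # surface form; empty when only the tag is given
    tag: str             # part-of-speech tag, e.g. "VC" or "Na"
    features: List[str]  # feature names without "+", e.g. ["spv", "vrv"]


# A tag must start with a letter, so example numbers such as "(51)" are skipped.
TOKEN_RE = re.compile(r"([^\s()\[\]]*)\(([A-Za-z]\w*)\)((?:\[\+\w+\])*)")
FEATURE_RE = re.compile(r"\[\+(\w+)\]")


def parse_tagged(line: str) -> List[Token]:
    """Split a tagged line into (word, tag, features) tokens."""
    return [
        Token(m.group(1), m.group(2), FEATURE_RE.findall(m.group(3)))
        for m in TOKEN_RE.finditer(line)
    ]


for token in parse_tagged("KTV(Nc)[+fw] OK(Na)[+fw]"):
    print(token)

# Tag sequence of example (51); the leading example number is ignored.
for token in parse_tagged("(51) (VC)[+spv][+vrv](DE)(VC)[+vrr](Na)[+spo]"):
    print(token.tag, token.features)
```

With tokens in this form, a check such as "every [+vrv] is followed by a [+vrr] in the same sentence" reduces to a scan over the `features` lists.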
E.
1992
(57)-
(58)[+nom]
[+nom]
(59)-(62)
(57) (VC)(VE)
(58) (VJ)(VH)
(59) (Na)(DE)(D)(VH)[+nom]
(60) (Nh)(P)(Na)(DE)(VJ)[+nom]
(61) (VC)[+nom](Nc)
(62) (VA)[+nom](Na)
(63)[+nom]
[+nom]
(63) (Nh)(DE)(D)(VJ)(Na)
F.
Nb
(64)-(68)
[+prop]
(64) (Nc)(Na)[+prop](Nc)
(65) (Nb)(Nd)[+prop](Na)(DE)(Na)
(66) (Nd)[+prop](Na)(Na)(Nb)
(67) (A)[+prop](Nc)
(68) (VH)[+prop](A)(Nc)
(A)
(B)
Huang (1992)
*
VA
VA1 theme
VA11
VA12
VA13
VA2 theme causer
VA3 theme
VA4 agent
VB
VB1 goal
VB11
VB12
VB2 theme
VC
VC2 agent goal
VC3 agent theme
VC31
VC32
VC33
VD
VD1 agent+source
goal
VD2 agent+goal
source
VE
VE1
VE11 agent goal
theme
VE12 VE11
VE12
VE2 agent goal
VF
VF1 agent goal
VF2 agent goal
theme
VG
VG1 agent theme range
VG2 theme range
VH
VH1 theme
VH11
VH12
VH13
VH14
VH15
VH16 causer
theme
VH2 experiencer
VH21
VH22 causer
experiencer
VI
VI2 theme goal
VI3 theme source
VJ
VJ1 theme goal
VJ2 experiencer goal
VJ3 theme range
VK
VK1 experiencer goal
VK2 theme goal
VL
VL1 experiencer goal
VL2 theme goal
VL4 causer goal theme
Na
Naa
Nab
Nac
Nad
Nae
Naea
Naeb
Nb
Nba
Nbc
Nc
Nca
Ncb
Ncc
Ncd
Ncda
Ncdb
Nce
Nd
Nda
Ndaa
Ndaaa
Ndaab
Ndaac
Ndaad
Ndab
Ndaba
Ndabb
Ndabc
Ndabd
Ndabe
Ndabf
Ndc
Ndd
Ndca
Ndcb
Ndcc
Ne
Nf
Nfa
Nfb
Nfc
Nfd
Nfe
Nff
Nfg
Nfh
Nfi
Ng
Nh
Nha
Nhaa
Nhab
Nhac
Nhb
Nhc
P
Da :
Db
Dba :
Dbaa :
Dbab :
Dbb :
Dbc "-":""
Dc :
Dd :
Df
Dfa :
Dfb :
Dg :
Dh :
Di :
Dj :
Dk :
Ca
Caa
Cab
Cb
Cba
Cbaa :
Cbab
Cbb
Cbba :
Cbbb :
Cbc
Cbca :
Cbcb :
Ta
Tb
Tc
Td
Ta Tb Tc Td
I:
A:
WWW
Sinica Corpus
http://www.sinica.edu.tw/ftms-bin/kiwi.sh
Sinica Corpus
Sinica Corpus
3.0
95-02/98-04
(http://rocling.iis.sinica.edu.tw)
Sinica Corpus
collocation
and or
1. 2. 3. 4.
1.
2.
3.
95-02
93-05
*
*
4.
95-2
vrvvrv
1. and
vrrvrr
2. or
collocation
0 -1
+1
00
-x0
0x
-xy
0 -1
+1
collocation
collocation mutual information value
collocation
collocation collocation
collocation
collocation
0
0
0 -1
+1
collocation
= /
= / .
//
mutual information
probability
size of the corpus
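A minimal Python sketch of how a mutual information value can be computed from the frequency of each word, the frequency of the pair, and the size of the corpus. It assumes the commonly used pointwise definition, MI(x, y) = log2(P(x, y) / (P(x) P(y))), with each probability estimated as a frequency divided by the corpus size; this is an assumption and may differ from what the search tool actually computes. The function name and the sample counts are hypothetical.

```python
import math


def mutual_information(freq_x, freq_y, freq_xy, corpus_size):
    """Pointwise mutual information (base 2) of x and y.

    Assumes P(x) = freq_x / corpus_size, P(y) = freq_y / corpus_size,
    and P(x, y) = freq_xy / corpus_size.
    """
    if corpus_size <= 0 or min(freq_x, freq_y, freq_xy) <= 0:
        raise ValueError("all counts and the corpus size must be positive")
    p_x = freq_x / corpus_size
    p_y = freq_y / corpus_size
    p_xy = freq_xy / corpus_size
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: the pair co-occurs twice as often as chance predicts.
print(mutual_information(freq_x=4, freq_y=4, freq_xy=2, corpus_size=16))
# Hypothetical counts: co-occurrence exactly at chance level.
print(mutual_information(freq_x=4, freq_y=4, freq_xy=1, corpus_size=16))
```

Under this definition a positive value means the two words co-occur more often than independence would predict, a value near zero means roughly chance co-occurrence, and a negative value means they co-occur less often than chance.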
freqx
freqy
freqxy
/
//
/
//
CKIP
A A /**/
Caa Caa /**/
Cab Cab /**/
Cba Cbab /**/
Cbb Cbaa, Cbba, Cbbb, Cbca, Cbcb /**/
D Dab, Dbaa, Dbab, Dbb, Dbc, Dc, Dd, Dg, Dh, Dj /**/
Da Daa /**/
DE /*, , , */
Dfa Dfa /**/
Dfb Dfb /**/
Di Di /**/
Dk Dk /**/
FW /**/
I I /**/
Na Naa, Nab, Nac, Nad, Naea, Naeb /**/
Nb Nba, Nbc /**/
Nc Nca, Ncb, Ncc, Nce /**/
Ncd Ncda, Ncdb /**/
Nd Ndaa, Ndab, Ndc, Ndd /**/
Nep Nep /**/
Neqa Neqa /**/
Neqb Neqb /**/
Nes Nes /**/
Neu Neu /**/
Nf Nfa, Nfb, Nfc, Nfd, Nfe, Nfg, Nfh, Nfi /**/
Ng Ng /**/
Nh Nhaa, Nhab, Nhac, Nhb, Nhc /**/
P P* /**/
SHI /**/
T Ta, Tb, Tc, Td /**/
VA VA11,12,13,VA3,VA4 /**/
VAC VA2 /**/
VB VB11,12,VB2 /**/
VC VC2, VC31,32,33 /**/
VCL VC1 /**/
VD VD1, VD2 /**/
VE VE11, VE12, VE2 /**/
VF VF1, VF2 /**/
VG VG1, VG2 /**/
VH VH11,12,13,14,15,17,VH21 /**/
VHC VH16, VH22 /**/
VI VI1,2,3 /**/
VJ VJ1,2,3 /**/
VK VK1,2 /**/
VL VL1,2,3,4 /**/
V_2 V_2 /**/
1993 # 93-05
1996:
#96-01
1991
pp.19-37
1994ICCL-3
1995
1997
92-100
1991a
1991b
1991cICG
pp. 79-95
1992
pp.177-193
1995NACCL
1993
1994
Chang, Li-ping and Keh-jiann Chen, 1995. The CKIP Part-of-speech Tagging System for
Modern Chinese Texts. Proceedings of ICCPOL'95. Hawaii.
Chen, Keh-jiann and Shing-huan Liu, 1992. Word Identification for Mandarin Chinese Sentences.
Proceedings of COLING'92, pp.54-59.
Chen, Keh-jiann, Shing-huan Liu, Li-ping Chang and Yeh-Hao Chin, 1994. A Practical Tagger
for Chinese Corpora. Proceedings of ROCLING VII, pp.111-126.
Church, K. W. and R. L. Mercer, 1993. Introduction to the Special Issue on Computational
Linguistics Using Large Corpora. Computational Linguistics, Vol.19, No.1, pp.1-24.
Huang, Chu-Ren, Keh-jiann Chen and Li-Li Chang, 1996. Segmentation Standard for Chinese
Natural Language Processing. Proceedings of the 1996 International Conference on
Computational Linguistics (COLING 96). August. Copenhagen, Denmark.
Huang, Chu-Ren, 1994. Corpus-based Studies of Mandarin Chinese: Foundational Issues and
Preliminary Results. In Matthew Chen and Ovid Tzeng Eds. In Honor of William S-Y.
Wang: Interdisciplinary Studies on Language and Language Change. pp. 165-186. Taipei:
Pyramid.
Huang, Chu-Ren, and Ruo-ping Mo, 1992. Mandarin Ditransitive Constructions and the
Category of Gei. In the Proceedings of the Berkeley Linguistics Society Annual Meeting
(BLS 18), pp. 109-122. Berkeley: BLS.
Huang, Chu-Ren and Keh-jiann Chen, 1992. A Chinese Corpus for Linguistics Research. In the
Proceedings of the 1992 International Conference on Computational Linguistics
(COLING-92). pp.1214-1217. Nantes, France.
Huang, Chu-Ren, 1987. Mandarin Chinese NP de: A Comparative Study of Current Grammatical
Theories. Special Publications No.93 of the Institute of History & Philology, Academia
Sinica, Taipei.
Hsu, Hui-li and Chu-Ren Huang, 1995. Design Criteria for a Balanced Modern Chinese Corpus.
Proceedings of ICCPOL'95, Hawaii.
Kucera, H. and W. N. Francis, 1967. Computational Analysis of Present-Day American English.
Providence: Brown University Press.
Sproat, R. and C. Shih, 1990. A Statistical Method for Finding Word Boundaries in Chinese Text.
Computer Processing of Chinese & Oriental Languages, Vol. 4.
Svartvik, Jan, Ed., 1992. Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82,
4-8 August 1991. Trends in Linguistics Studies and Monographs 65. Berlin: Mouton.