
(3.0 http://www.sinica.edu.tw/ftms-bin/kiwi.sh)


tag


Sinica Corpus 3.0 1997 10


(http://www.sinica.edu.tw/ftms-bin/kiwi.sh)

1.~
2.

3.

4.+prop 30 32

1998.8.19


Sinica Corpus

1994

19973.0

1.1

corpus-based
Svartvik 1992, Church and Mercer 1993, 1994, 1995

Brown Corpus (Kucera and Francis 1967)

LOB (Lancaster-Oslo/Bergen), London-Lund

tag

versatility
1995

1.2


Huang & Chen 1992
Huang 1994
1991, Chen 1994

()
consortium

()



transcribe

() BBS

1.3


Design Features

()

1997

() text

()

(1) (2) (3) (4) (5)

Hsu and Huang


1995
versatility
sub-corpora


monitor corpus

1.4

1.2

1994





infrastructure LOB London-Lund


2.1

%% =
%% =
%% =written
%% =
%% =
%% =
%% =
%% =
%% =
%% =
%% =
%% =
%%
%% =


2.1

LANCASTER-OSLO/BERGEN (LOB)
COBUILD Project




















written ()
written-to-be-read ()
written-to-be-spoken ()
spoken ()
spoken-to-be-written ()



















2.1.1














..



2.1.2

written-to-be-spoken

2.1.3

2.1.4

2.1.5


written written-to-be-read written-to-be-spoken spoken spoken-to-be-written
written, written-to-be-read

written-to-be-spoken
spoken
spoken-to-be-written

2.2

3.0

10% 10% 35% 5% 20% 20% 100%


CORPUS 68.53 102.4 276.13 73.22 141.20 127.85 789.27
CORPUS 45.17 67.50 182.03 48.27 93.08 84.28 520.28
% 8.68 12.97 34.99 9.28 17.89 16.20 100
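The "%" row can be read as each column's share of the total in the row above it. The row labels did not survive in this copy, so that reading is an assumption. A minimal sketch that recomputes the shares from the second CORPUS row under that assumption:

```python
# Values copied from the second CORPUS row of the table above.
# Assumption: the last value in that row (520.28) is the row total.
counts = [45.17, 67.50, 182.03, 48.27, 93.08, 84.28]
total = 520.28

# Share of each column in percent, rounded to two decimals like the "%" row.
shares = [round(100 * c / total, 2) for c in counts]
print(shares)
```

Under this reading the recomputed shares agree with the "%" row to two decimals.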

3.0

CORPUS 443.94 78.97 4.68 10.17 5.79 79.85 66.93 3.94 2.31 0.23 15.98 0.43 64.61 10.57 0.84

CORPUS 292.64 52.06 3.08 6.71 3.82 52.64 44.12 2.60 1.52 0.15 10.54 0.29 42.60 6.97 0.56

% 56.25 10.01 0.59 1.29 0.73 10.12 8.48 0.50 0.29 0.03 2.03 0.05 8.19 1.34 0.11

3.0

CORPUS 246.89 230.28 5.49 32.23 1.06 10.71 66.70 180.20 12.90 2.00 0.81

CORPUS 162.57 151.80 3.62 21.25 0.70 7.06 43.96 118.80 8.50 1.32 0.53

% 31.28 29.18 0.70 4.08 0.13 1.36 8.45 22.83 1.63 0.25 0.10

3.0

CORPUS 711.47 10.93 6.45 57.55 2.87


CORPUS 469.00 7.20 4.25 37.94 1.89
% 90.14 1.38 0.82 7.29 0.36

3.0

CORPUS 557.72 96.60 112.62 22.34

CORPUS 367.64 63.68 74.24 14.72


% 70.66 12.24 14.27 2.83


99% (Chen & Liu 1992)

3.1

(1)


1

(2)

(1)


() 19956
BBS user txt

2/28330
AB-8888

(2)







/
1




(3)

(4)

(5)

(6)

1.
2.
3.




4.
5.

(1)

(2)

(1)
(2)
(3)
(4)
(5)
(6)

3.2

1.
(1)(2)(2)(3)

(2)

(1)(2)(2)(3)(4)
(4)
(3)
(1)(2)(2)(3)

(1)(2)(2)(3)
(1)
(3)
(1)(6)
(1)(2)
(6)(1)






(1)
(1)(1)
(1)
(1)
(1)

(1)

(1)
(2)
(2)
(6)
(2)
(1)(2)
(6)
(2)
(6)
(6)
(1)(1)

(1)

2.
(1)(2)(2)(3)
(3)
(1)(2)(2)(3)
(3)
(6)

(1)

/(2)(5)
(6)

(1)

(1)(1)



(1)

3. (1)(6)







4.
(1)

,
(1)

(1)
(1)

(1)
(1)
(1)

(1)

5.(1)(2)(2)(3)
(1)

(3)
(1)
(2)
(1) (2)

6.(1)(6)



96% (Chen et al. 1994)

Chang & Chen 1995

4.1

178 1993
43346

Daa, Dab, Neu, Nep, Nes, Neqa, Neqb, 4 #93-05 C V N A D P I T 46

4 DaaDabDa
Ne
Neu

Nes Specific

Nep
Neqa


Neqb


Simplified tag   CKIP tags (see note 5)
A                A
Caa              Caa
Cab              Cab
Cba              Cbab
Cbb              Cbaa, Cbba, Cbbb, Cbca, Cbcb
Da               Daa
Dfa              Dfa
Dfb              Dfb
Di               Di
Dk               Dk
D                Dab, Dbaa, Dbab, Dbb, Dbc, Dc, Dd, Dg, Dh, Dj
Na               Naa, Nab, Nac, Nad, Naea, Naeb
Nb               Nba, Nbc
Nc               Nca, Ncb, Ncc, Nce
Ncd              Ncda, Ncdb
Nd               Ndaa, Ndab, Ndc, Ndd
Neu              Neu
Nes              Nes
Nep              Nep
Neqa             Neqa
Neqb             Neqb
Nf               Nfa, Nfb, Nfc, Nfd, Nfe, Nfg, Nfh, Nfi
Ng               Ng
Nh               Nhaa, Nhab, Nhac, Nhb, Nhc
I                I
P                P*
T                Ta, Tb, Tc, Td
VA               VA11, VA12, VA13, VA3, VA4
VAC              VA2
VB               VB11, VB12, VB2
VC               VC2, VC31, VC32, VC33
VCL              VC1
VD               VD1, VD2
VE               VE11, VE12, VE2
VF               VF1, VF2
VG               VG1, VG2
VH               VH11, VH12, VH13, VH14, VH15, VH17, VH21
VHC              VH16, VH22
VI               VI1, VI2, VI3
VJ               VJ1, VJ2, VJ3
VK               VK1, VK2
VL               VL1, VL2, VL3, VL4
V_2              V_2
DE               /* 的, 之, 得, 地 */
SHI              /* 是 */
FW               /* foreign word */
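The table above maps each fine-grained CKIP tag onto one of the simplified tags. A minimal sketch of that lookup follows, using only a handful of rows copied from the table. The names `SIMPLIFIED_TO_CKIP`, `CKIP_TO_SIMPLIFIED` and `simplify` are illustrative and not part of any CKIP tool, and the fall-through behaviour for unlisted tags is an assumption.

```python
# A few rows of the table: simplified tag -> fine-grained CKIP tags.
SIMPLIFIED_TO_CKIP = {
    "Cba": ["Cbab"],
    "Cbb": ["Cbaa", "Cbba", "Cbbb", "Cbca", "Cbcb"],
    "Da": ["Daa"],
    "D": ["Dab", "Dbaa", "Dbab", "Dbb", "Dbc", "Dc", "Dd", "Dg", "Dh", "Dj"],
    "Ncd": ["Ncda", "Ncdb"],
    "VA": ["VA11", "VA12", "VA13", "VA3", "VA4"],
    "VAC": ["VA2"],
    "VC": ["VC2", "VC31", "VC32", "VC33"],
    "VCL": ["VC1"],
    "VH": ["VH11", "VH12", "VH13", "VH14", "VH15", "VH17", "VH21"],
    "VHC": ["VH16", "VH22"],
}

# Invert the table so a fine-grained tag can be looked up directly.
CKIP_TO_SIMPLIFIED = {
    fine: simple
    for simple, fines in SIMPLIFIED_TO_CKIP.items()
    for fine in fines
}


def simplify(ckip_tag: str) -> str:
    """Return the simplified tag for a fine-grained CKIP tag."""
    if ckip_tag in CKIP_TO_SIMPLIFIED:
        return CKIP_TO_SIMPLIFIED[ckip_tag]
    if ckip_tag.startswith("P"):
        # The table lists "P*": every preposition subtag collapses to P.
        return "P"
    # Assumption: tags not listed here (e.g. Nes, Di) map to themselves.
    return ckip_tag


print(simplify("VC1"), simplify("VA2"), simplify("Dab"))
```

Note that the split is not purely by prefix: VC1 goes to VCL and VA2 to VAC, so an explicit table is safer than truncating the tag string.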

5 #93-05

4.2

46

VH
tagging VH


VH

Caa
Cab
Cba
Cbb
Dfa
Dfb
Di
Dk
D
Nf
Ng
Neu

Nes
Neqb

P
I
T

VADV
N-modifier

1991a, bDE
/

Da ADV, N-modifier (Da)(Da)
N*-Nf-Ng N, N-modifier N*
(Na)(Na)
Ncd N, N-modifier, locative marker
(Ncd)(Ncd)(Ncd)
Nd N, N-modifier, ADV
(Nd)(Nd)(Nd)
Nep N(h), N-modifier Nh
(Nep)(Nep)
Neqa N, N-modifier, ADV
(Neqa)(Neqa)(Neqa)
V* V, N-modifier (VC)(VC)
VH V, N-modifier, ADV (note 6)
(VH)(VH)(VH)
V_2 (V_2)(V_2)
SHI (SHI)(SHI)(SHI)
A N-modifier, ADV (A)(A)
DE nominal marker, adverbial marker, complement marker
(DE)(DE)(DE)


Na
VHNa Na
Na(D)
NaD

6 VH

4.3


VE VJ(VE)(VJ)
8

VA,Ng,D,T (VCL, VC, D, T)


(1) (Nh)(VCL)(Nc)(D) (VC)
(2) (Nh)(VCL)(Nc)(VC)
(3) (Nh)(VCL)(Nc)(T)
(4) (Nh)(P)(Na)(VC)(Na)(T)
(5) (Nh)(VA)(T)
(6) (Neu)(Nf)(Ng)
(7) (Nh)(VC)(Di)(Na)(T)(T)
(V)(D)
(1)(2)(T)VN

7 4.2


8

(Ncd, Ng, Nes)
(8)(Neu)(Nf) (Ng) (Nc)(Ncd)
(9)(Ncd)(V_2)(VH)(Na)(Ncd)(V_2)(VH)(Na)
(10)(Ncd)(Na);(Ncd)(Nc)
(11)(Nes)(Neu)(Nf)
NcdNg

Ncd
Ng(8)(Ncd)
(Ng)

(10)
(Nes)

(11)

(P, Caa)
(12) (Nh)(VK)(Na)(Caa)(Na)
(13) (Nh) (Caa) (Nh) (D) (VA)
(14) (Nh) (Nd) (P) (Nh) (D) (VA)
(15) (Nep)(Nf)(Na) (P) (Nh) (VH)
(16) (Nh)(P)(Nh)(VH)(VH)
(Caa)(P)
(12)
(13)Caa
PCaaP
Caa
P(14)

(15)(16)
(15)Caa

(P, VG, P, VJ)


(17) (Nh)(P)(Na) (D)(VC)((P) =)

(18) (P) (VC)(Na)(Na)((P) =)
(19) (Nb)(VG)(Na)((VG) =)
(20) (Nh)(VC)(Nep)(Nf)(Na)(SHI)(VJ)(Di)(Nh)

(18)(20)(20)VJ
(18)VJ
(17)(18)P

(Di,Dfa,VH,VCL)
(21) (Di)
(22) (Dfa)
(23) (VH)
(24) (VCL)
(25) (VCL)

(24) (25) (25)
(24)(23)


VCL

(Neqa, VH)
(26) (Neqa)(Nf)(Na)(D)(VC)
(27) (Neqa)(D)(P)(Nh)(VC)
(28) (Na)(VA)(Di)(Neqa)
(29) (Nb)(Neqa)(VCL)(Nc)
(30) (Nh)(VC)(DE)(VH)
(31) (Nh)(VH)(Neqa)
(32) (Nh)(D)(VA)
(33) (Na)(VH)

VH
Neqa (26)(29) (Neqa)

Neqa(31)D
(32)
(33)(30)VH

(30)

(34) (Nh)(V_2)(Nf)(Na)
(35) (Nh)(Dfa)(VH)
(36) (V_2)(Neqa)(Na)(D)(VK)
(37) (Nh)(Dfa)(VH)
(38) (Nh)(Neqa)(D)(D)(VK)
(34)Nf
Some 9
(38)Neqa

T, DiT, DE
(39) (Nh)(VC)(VH)(Na)(T)
(40) (Nh)(VC)(Di)
(41) (Nh)(SHI)(Dfa)(VK)(T)
(42) (Nh)(SHI)(VD)(Na)(DE)
T
Di(40)
T (39) T
T
(42)

4.4
9
separable
1995



+vrv V of a separable VR compound VC[+vrv]
+vrr R of a separable VR compound VC[+vrr]
+spv V of a separable V N compound VC[+spv]
+spo N of a separable V N compound Na[+spo]
+p1 the first part of a separated compound (Nc)[+p1](Nc)
+p2 the second part of a separated compound (Nd)(Nd) [+p2]
+fw the feature of a foreign word OK(Na)[+fw]
+nom the feature for verbal nominalization (VA)[+nom]
+prop the feature for proper nouns (A)[+prop](Nc)

A.


(43)
(44)
[+vrv]
[+vrr]
(43)
(44) (Nh) (VC)[+vrv] (D) (VC)[+vrr] (Nh)
(45)


(45) (Nh)(VCL)(Na)
(46) (VH)

B.


(47)(48)
[+spv] Na
[+spo](47)-(51)

(47) (Nh) (VC)[+spv](Di) (Dfa) (VH) (DE) (Na)[+spo]
(48) (VC)[+spv](Nh)(DE)(Na)[+spo]
(49) (VA)[+spv](Di)(Neu)(Na)[+spo]
(50) (VA)[+spv](Di)(Na)[+spo](T)
(51)-(53)
(51)
[+vrv] [+vrr][+spv] [+spo]

(51) (VC)[+spv][+vrv](DE)(VC)[+vrr](Na)[+spo]
(52) (VC)[+spv][+vrv](D)(VC)[+vrr](Na)[+spo]
(53) (VC)[+spv][+vrv](D)(VC)[+vrr](Na)[+spo](T)
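The tag sequences in examples (44) and (47)-(53) attach features such as [+spv], [+spo], [+vrv] and [+vrr] to individual tags. A minimal sketch of how such a sequence could be read and the two halves of a separated compound paired up is given below. It works on the tag-only strings shown above (the words themselves are not present), and the names `parse_tags` and `pair_separated` are hypothetical helpers, not part of the corpus tools.

```python
import re

# One tag in parentheses, optionally followed by bracketed features.
TOKEN = re.compile(r"\((\w+)\)((?:\s*\[\+\w+\])*)")
FEATURE = re.compile(r"\[\+(\w+)\]")


def parse_tags(annotation):
    """Turn '(VC)[+spv](D)' into [('VC', ['spv']), ('D', [])]."""
    return [(tag, FEATURE.findall(feats))
            for tag, feats in TOKEN.findall(annotation)]


def pair_separated(tokens, first, second):
    """Pair each token carrying `first` with the next later token
    carrying `second`; returns (index_of_first, index_of_second) pairs."""
    pairs, pending = [], []
    for i, (_tag, feats) in enumerate(tokens):
        if first in feats:
            pending.append(i)
        if second in feats and pending:
            pairs.append((pending.pop(0), i))
    return pairs


# Tag sequence of example (52).
tokens = parse_tags("(VC)[+spv][+vrv](D)(VC)[+vrr](Na)[+spo]")
print(tokens)
print(pair_separated(tokens, "vrv", "vrr"))  # V and R of the VR compound
print(pair_separated(tokens, "spv", "spo"))  # V and N of the V N compound
```

The first-in, first-out pairing is a simplifying assumption; nested or crossing separated compounds would need a more careful matching rule.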

C.
19953



[+p1][+p2]

(54)-(56)
(54) (Na)[+p1](Na)[+p1](Na)
(55) (Nc)[+p1](Nc)
(56) (Nc)(Nc)[+p2]

(Neu)(Neu)(Na)
(Nes)(Nes)(Neqa)

D.

FW+fwFW[+fw]
FW
[+fw] KTVOK
Nc
Na+fwKTV(Nc)[+fw]OK(Na)[+fw]

E.
1992
(57)-
(58)[+nom]
[+nom]

(59)-(62)
(57) (VC)(VE)
(58) (VJ)(VH)
(59) (Na)(DE)(D)(VH)[+nom]
(60) (Nh)(P)(Na)(DE)(VJ)[+nom]
(61) (VC)[+nom](Nc)
(62) (VA)[+nom](Na)
(63)[+nom]

[+nom]

(63) (Nh)(DE)(D)(VJ)(Na)

F.

Nb
(64)-(68)
[+prop]

(64) (Nc)(Na)[+prop](Nc)
(65) (Nb)(Nd)[+prop](Na)(DE)(Na)
(66) (Nd)[+prop](Na)(Na)(Nb)
(67) (A)[+prop](Nc)
(68) (VH)[+prop](A)(Nc)


Huang (1992)





*


VA

VA1 theme

VA11

VA12

VA13

VA2 theme causer

VA3 theme

VA4 agent

VB

VB1 goal

VB11
VB12

VB2 theme
VC

VC1 theme goal


VC2 agent goal

VC3 agent theme
VC31

VC32

VC33

VD

VD1 agent+source
goal

VD2 agent+goal
source

VE
VE1
VE11 agent goal
theme

VE12 VE11
VE12

VE2 agent goal

VF
VF1 agent goal

VF2 agent goal
theme

VG
VG1 agent theme range

VG2 theme range

VH

VH1 theme

VH11
VH12

VH13
VH14

VH15

VH16 causer
theme

VH17 recipient theme


VH2 experiencer
VH21

VH22 causer
experiencer

VI

VI1 experiencer goal


VI2 theme goal

VI3 theme source

VJ
VJ1 theme goal

VJ2 experiencer goal

VJ3 theme range

VK
VK1 experiencer goal

VK2 theme goal

VL

VL1 experiencer goal

VL2 theme goal

VL3 goal theme



VL4 causer goal theme

Na
Naa

Nab

Nac

Nad

Nae

Naea

Naeb

Nb
Nba

Nbc

Nc
Nca

Ncb

Ncc

Ncd

Ncda

Ncdb
Nce

Nd
Nda
Ndaa

Ndaaa
Ndaab
Ndaac
Ndaad

Ndab
Ndaba
Ndabb
Ndabc
Ndabd
Ndabe
Ndabf

Ndc

Ndd

Ndca
Ndcb
Ndcc

Ne

Nf
Nfa

Nfb

Nfc

Nfd

Nfe

Nff

Nfg

Nfh

Nfi

Ng

Nh
Nha
Nhaa
Nhab
Nhac

Nhb

Nhc

P

Da :

Db
Dba :
Dbaa :
Dbab :

Dbb :

Dbc "-":""

Dc :

Dd :

Df

Dfa :
Dfb :

Dg :

Dh :

Di :

Dj :

Dk :

Ca

Caa

Cab ,

Cb
Cba

Cbaa :
Cbab """"
Cbb

Cbba :
Cbbb :

Cbc
Cbca :
Cbcb :

Ta

Tb

Tc

Td

TaTbTcTd

I:

A:

WWW



Sinica Corpus

http://www.sinica.edu.tw/ftms-bin/kiwi.sh

Sinica Corpus

Sinica Corpus

3.0

95-02/98-04

(http://rocling.iis.sinica.edu.tw)

Sinica Corpus

collocation



and or
1. 2. 3. 4.

1.













2.


3.



95-02
93-05


*


*

4.


95-2

vrvvrv

1. and


vrrvrr


2. or


collocation






0 -1
+1


00

-x0

0x

-xy




0 -1
+1

collocation
collocation mutual information value

collocation
collocation collocation

collocation

collocation

0
0

0 -1
+1

collocation

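\[
\mathrm{MI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}, \qquad
P(x) = \frac{\mathrm{freq}_x}{N}, \quad
P(y) = \frac{\mathrm{freq}_y}{N}, \quad
P(x, y) = \frac{\mathrm{freq}_{xy}}{N}
\]
MI: mutual information
P: probability
N: size of the corpus
freqx: frequency of x
freqy: frequency of y
freqxy: frequency of x and y occurring together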
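The collocation score described here is a mutual information value built from freqx, freqy, freqxy and the size of the corpus. A minimal sketch follows, assuming the standard pointwise definition MI = log2(P(x, y) / (P(x) P(y))) with each probability estimated as a frequency divided by the corpus size. The function name and the sample counts are illustrative only and are not taken from the Sinica Corpus.

```python
import math


def mutual_information(freq_x, freq_y, freq_xy, corpus_size):
    """Pointwise mutual information (in bits) of x and y.

    Assumes probabilities are relative frequencies: P = freq / corpus_size.
    """
    if min(freq_x, freq_y, freq_xy) <= 0 or corpus_size <= 0:
        raise ValueError("all counts must be positive")
    p_x = freq_x / corpus_size
    p_y = freq_y / corpus_size
    p_xy = freq_xy / corpus_size
    return math.log2(p_xy / (p_x * p_y))


# Made-up counts: x occurs 100 times, y 200 times, together 50 times,
# in a corpus of one million words.
score = mutual_information(100, 200, 50, 1_000_000)
print(round(score, 2))
```

A score near zero means x and y co-occur about as often as chance would predict; large positive scores indicate a strong collocation. Scores built on very small counts are unstable, which is one reason tools of this kind usually let the user set a minimum frequency.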




/



//




/


//


Simplified tag   CKIP tags
A                A
Caa              Caa
Cab              Cab
Cba              Cbab
Cbb              Cbaa, Cbba, Cbbb, Cbca, Cbcb
D                Dab, Dbaa, Dbab, Dbb, Dbc, Dc, Dd, Dg, Dh, Dj
Da               Daa
DE               /* 的, 之, 得, 地 */
Dfa              Dfa
Dfb              Dfb
Di               Di
Dk               Dk
FW               /* foreign word */
I                I
Na               Naa, Nab, Nac, Nad, Naea, Naeb
Nb               Nba, Nbc
Nc               Nca, Ncb, Ncc, Nce
Ncd              Ncda, Ncdb
Nd               Ndaa, Ndab, Ndc, Ndd
Nep              Nep
Neqa             Neqa
Neqb             Neqb
Nes              Nes
Neu              Neu
Nf               Nfa, Nfb, Nfc, Nfd, Nfe, Nfg, Nfh, Nfi
Ng               Ng
Nh               Nhaa, Nhab, Nhac, Nhb, Nhc
P                P*
SHI              /* 是 */
T                Ta, Tb, Tc, Td
VA               VA11, VA12, VA13, VA3, VA4
VAC              VA2
VB               VB11, VB12, VB2
VC               VC2, VC31, VC32, VC33
VCL              VC1
VD               VD1, VD2
VE               VE11, VE12, VE2
VF               VF1, VF2
VG               VG1, VG2
VH               VH11, VH12, VH13, VH14, VH15, VH17, VH21
VHC              VH16, VH22
VI               VI1, VI2, VI3
VJ               VJ1, VJ2, VJ3
VK               VK1, VK2
VL               VL1, VL2, VL3, VL4
V_2              V_2


+fw the feature of a foreign word OK (Na [+fw])

+nom the feature for verbal nominalization (VA[+nom])

+p1 the first part of a separated compound Nc[+p1](Nc)

+p2 the second part of a separated compound (Nd)(Nd[+p2])

+prop the feature for proper nouns (Na[+prop])

+spv V of a separable V N compound VC[+spv]

+spo N of a separable V N compound Na[+spo]

+vrv V of a separable VR compound VC[+vrv]

+vrr R of a separable VR compound VC[+vrr]


1993 # 93-05

1996:
#96-01
1991
pp.19-37
1994 ICCL-3
1995

1997
92-100
1991a
1991b
1991c ICG
pp. 79-95
1992
pp.177-193
1995 NACCL
1993

1994
Chang, Li-ping and Keh-jiann Chen, 1995. The CKIP Part-of-speech Tagging System for
Modern Chinese Texts. Proceedings of ICCPOL'95. Hawaii.
Chen, Keh-jiann and Shing-huan Liu, 1992. Word Identification for Mandarin Chinese Sentences.
Proceedings of COLING'92, pp.54-59.
Chen, Keh-jiann, Shing-huan Liu, Li-ping Chang and Yeh-Hao Chin, 1994. A Practical Tagger
for Chinese Corpora. Proceedings of ROCLING VII, pp.111-126.
Church, K. W. and R. L. Mercer, 1993. Introduction to the Special Issue on Computational
Linguistics Using Large Corpora. Computational Linguistics, Vol.19, No.1, pp.1-24.
Huang, Chu-Ren, Keh-jiann Chen and Li-Li Chang, 1996. Segmentation Standard for Chinese
Natural Language Processing. Proceedings of the 1996 International Conference on
Computational Linguistics (COLING 96). August. Copenhagen, Denmark.
Huang, Chu-Ren, 1994. Corpus-based Studies of Mandarin Chinese: Foundational Issues and
Preliminary Results. In Matthew Chen and Ovid Tzeng Eds. In Honor of William S-Y.
Wang: Interdisciplinary Studies on Language and Language Change. pp. 165-186. Taipei:
Pyramid.
Huang, Chu-Ren, and Ruo-ping Mo, 1992. Mandarin Ditransitive Constructions and the
Category of Gei. In the Proceedings of the Berkeley Linguistics Society Annual Meeting
(BLS 18), pp. 109-122. Berkeley: BLS.
Huang, Chu-Ren and Keh-jiann Chen, 1992. A Chinese Corpus for Linguistics Research. In the
Proceedings of the 1992 International Conference on Computational Linguistics
(COLING-92). pp.1214-1217. Nantes, France.
Huang, Chu-Ren, 1987. Mandarin Chinese NP de: A Comparative Study of Current Grammatical
Theories. Special Publications No.93 of the Institute of History & Philology, Academia
Sinica, Taipei.
Hsu, Hui-li and Chu-Ren Huang, 1995. Design Criteria for a Balanced Modern Chinese Corpus.
Proceedings of ICCPOL'95, Hawaii.
Kucera, H. and W. N. Francis, 1967. Computational Analysis of Present-Day American English.
Providence: Brown University Press.
Sproat, R. and C. Shih, 1990. A Statistical Method for Finding Word Boundaries in Chinese Text.
Computer Processing of Chinese & Oriental Languages, Vol. 4.
Svartvik, Jan, Ed., 1992. Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82,
4-8 August 1991. Trends in Linguistics Studies and Monographs 65. Berlin: Mouton.
