
A Novel Approach for Co-occurrence Clustering Analysis: Maximal Frequent Itemset Mining

Xu Shuo1, Qiao Xiaodong1, Zhu Lijun1, Zhang Yunliang1, Xue Chunxiang2
(1 Institute of Scientific and Technical Information of China, Beijing 100038;
2 School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094)

Abstract

For a literature collection in a given field, if two research objects co-occur frequently,
one usually assumes that an underlying link between them is more likely to exist. This
assumption accounts for the popularity of co-occurrence analysis methods such as co-word
analysis, co-citation analysis, and co-authorship analysis. Traditional co-occurrence analysis
usually consists of three steps. However, the first two steps are problematic and may lead to
misleading co-occurrence clustering results. This paper therefore introduces a new method for
co-occurrence clustering analysis, maximal frequent itemset mining, taken from the association
rule mining domain.
_______________________
2011-01-23; 2011-08-26
Xu Shuo, 1979, E-mail: xush@istic.ac.cn
Qiao Xiaodong, 1965, E-mail: qiaox@istic.ac.cn
Zhu Lijun, 1973, E-mail: zhulj@istic.ac.cn
Zhang Yunliang, 1979, E-mail: zhangyl@istic.ac.cn
Xue Chunxiang, 1979, E-mail: xuechunxiang@gmail.com
1) 2011BAH10B04; YY-20112909TQC011

This approach compresses the three steps of traditional co-occurrence clustering into one,
which greatly simplifies the process. One of its most appealing characteristics is that it
makes full use of all available information, which overcomes the problem in traditional
co-occurrence analysis. Experimental results show that satisfactory clustering results can
generally be obtained by setting a proper minimal support threshold.

Keywords

Co-occurrence Analysis, Co-word Analysis, Clustering Analysis, Maximal Frequent Itemset, Hierarchical Clustering

2.1

k ( 3)

k ( 3)

[4]

Zipf Donohue

[11]

[5]

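b. D is the document collection and I = {T1, T2, ..., TN} is the set of N items. C is the
N × N co-occurrence matrix: Ci,j = Cj,i (i ≠ j; i, j = 1, 2, ..., N) is the co-occurrence
frequency of the pair <Ti, Tj>, and Ci,i (i = 1, 2, ..., N) is the frequency of Ti. C can be
normalized with the inclusion index [6], the proximity index [6], or the equivalence index [7].

c. Pajek [8], NetDraw [9], KNN, k-Means [10] [11], AP (affinity propagation, published in
Science) [12], multidimensional scaling (MDS).

Single-linkage, complete-linkage, and average-linkage [11]. Let S1 = {T1,1, T1,2, ..., T1,m}
and S2 = {T2,1, T2,2, ..., T2,n} be two clusters, and let Sim(T1,i, T2,j) be the similarity
between their elements. The similarity between S1 and S2 is then:

$$\mathrm{Sim}_{\mathrm{single}}(S_1, S_2) = \max_{e_1 \in S_1} \max_{e_2 \in S_2} \{\mathrm{Sim}(e_1, e_2)\} \tag{1}$$

$$\mathrm{Sim}_{\mathrm{complete}}(S_1, S_2) = \min_{e_1 \in S_1} \min_{e_2 \in S_2} \{\mathrm{Sim}(e_1, e_2)\} \tag{2}$$

$$\mathrm{Sim}_{\mathrm{average}}(S_1, S_2) = \frac{1}{m \cdot n} \sum_{e_1 \in S_1} \sum_{e_2 \in S_2} \mathrm{Sim}(e_1, e_2) \tag{3}$$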
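The three linkage criteria of Eqs. (1)-(3) can be illustrated with a minimal Python sketch. The function name `linkage_similarities` and the helper `cooccurrence_sim` are hypothetical names chosen for this illustration; the three pair counts are assumed to be the co-occurrence frequencies of A-C, A-D and C-D counted by hand from Table 1.

```python
from itertools import product


def linkage_similarities(s1, s2, sim):
    """Return (single, complete, average) similarity between clusters s1 and s2.

    `sim` is any symmetric pairwise similarity function, as in Eqs. (1)-(3).
    """
    values = [sim(e1, e2) for e1, e2 in product(s1, s2)]
    single = max(values)                         # Eq. (1)
    complete = min(values)                       # Eq. (2)
    average = sum(values) / (len(s1) * len(s2))  # Eq. (3)
    return single, complete, average


# Assumption: co-occurrence counts for three pairs, counted by hand from Table 1.
PAIR_COUNTS = {frozenset("AC"): 4, frozenset("AD"): 5, frozenset("CD"): 6}


def cooccurrence_sim(e1, e2):
    """Hypothetical pairwise similarity: the raw co-occurrence count."""
    return PAIR_COUNTS[frozenset((e1, e2))]


single, complete, average = linkage_similarities({"A"}, {"C", "D"}, cooccurrence_sim)
print(single, complete, average)  # 5 4 4.5
```

With S1 = {A} and S2 = {C, D}, the pairwise values are 4 and 5, so the three criteria already disagree on a two-element cluster.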

Table 1 (12 records)

ID   Itemset              ID   Itemset
1    {A, C, D}            7    {B, C}
2    {A, B, C, D, E}      8    {A, B, D, E}
3    {D, E}               9    {B, C, D}
4    {B, C}               10   {A, C, D, E}
5    {B, C, D}            11   {A, C, D}
6    {B, D}               12   {D, E}
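As a rough illustration of how a co-occurrence matrix can be derived from the twelve records of Table 1, the following Python sketch counts item frequencies (the diagonal entries Ci,i) and pair frequencies (the off-diagonal entries Ci,j). The names `TRANSACTIONS` and `cooccurrence` are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations

# The twelve records of Table 1, in ID order.
TRANSACTIONS = [
    {"A", "C", "D"}, {"A", "B", "C", "D", "E"}, {"D", "E"},
    {"B", "C"}, {"B", "C", "D"}, {"B", "D"},
    {"B", "C"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"A", "C", "D", "E"}, {"A", "C", "D"}, {"D", "E"},
]


def cooccurrence(transactions):
    """Return (item frequencies, pair co-occurrence frequencies)."""
    freq = Counter()
    pairs = Counter()
    for record in transactions:
        freq.update(record)
        # Each unordered pair inside a record co-occurs once.
        pairs.update(frozenset(p) for p in combinations(sorted(record), 2))
    return freq, pairs


freq, pairs = cooccurrence(TRANSACTIONS)
for a, b in combinations("ABCDE", 2):
    print(a, b, pairs[frozenset((a, b))])
```

The pair C-D has the highest count (6), which is why an agglomerative procedure based on raw co-occurrence would merge C and D first.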

[13]

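2.2

Sim(S1, S2) denotes the similarity between the clusters S1 and S2.

Co-occurrence frequencies of the items in Table 1:

        B      C      D      E
A       2      4      5      3
B              5      5      2
C                     6      2
D                            5

After merging C and D (average-linkage similarity; the value in parentheses is the number of
records in Table 1 that contain all items of both clusters):

        B      CD         E
A       2      4.5 (4)    3
B              5 (3)      2
CD                        3.5 (2)

After merging B and CD:

        BCD         E
A       3.67 (1)    3
BCD                 3 (1)

After merging A and BCD:

        E
ABCD    3 (1)

For two clusters S1 and S2, let Simtrue(S1, S2) be the true co-occurrence frequency of S1 and
S2 (the values in parentheses above). Together with Simsingle(S1, S2), Simcomplete(S1, S2), and
Simaverage(S1, S2), it satisfies:

$$\mathrm{Sim}_{\mathrm{true}}(S_1, S_2) \le \mathrm{Sim}_{\mathrm{complete}}(S_1, S_2) \le \mathrm{Sim}_{\mathrm{average}}(S_1, S_2) \le \mathrm{Sim}_{\mathrm{single}}(S_1, S_2) \tag{4}$$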
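The ordering in Eq. (4) can be checked numerically on Table 1. The sketch below is only an illustration: it assumes that the true similarity of two clusters is the number of records containing every item of both clusters, and that the pairwise similarity is the raw co-occurrence count. The function names are hypothetical. The first inequality holds because any record that contains all items of S1 and S2 also contains every cross pair, so the true count cannot exceed the smallest pairwise count.

```python
from itertools import product

# The twelve records of Table 1, in ID order.
TRANSACTIONS = [
    {"A", "C", "D"}, {"A", "B", "C", "D", "E"}, {"D", "E"},
    {"B", "C"}, {"B", "C", "D"}, {"B", "D"},
    {"B", "C"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"A", "C", "D", "E"}, {"A", "C", "D"}, {"D", "E"},
]


def pair_sim(a, b):
    """Raw co-occurrence count of two items."""
    return sum(1 for record in TRANSACTIONS if a in record and b in record)


def sim_true(s1, s2):
    """Number of records that contain every item of both clusters."""
    union = set(s1) | set(s2)
    return sum(1 for record in TRANSACTIONS if union <= record)


def linkages(s1, s2):
    """Return (complete, average, single) linkage similarity."""
    values = [pair_sim(a, b) for a, b in product(s1, s2)]
    return min(values), sum(values) / len(values), max(values)


# Clusters are written as strings of item labels, e.g. "CD" for {C, D}.
for s1, s2 in [("A", "CD"), ("B", "CD"), ("E", "CD"), ("A", "BCD"), ("E", "ABCD")]:
    true_value = sim_true(s1, s2)
    complete, average, single = linkages(s1, s2)
    assert true_value <= complete <= average <= single  # Eq. (4)
    print(s1, s2, true_value, complete, round(average, 2), single)
```

For A and BCD, for example, the average-linkage value is about 3.67 while only one record contains all four items, which shows how far the linkage criteria can overestimate the true co-occurrence.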

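Association rule mining was introduced by Agrawal et al. [14]. The mining of maximal frequent
itemsets was proposed by Bayardo in 1998 [15].

3.1

Let I be the set of all items. A subset of I is called an itemset.

Definition 1. For a database D, the support of an itemset X, written Sup(X), is the number of
transactions in D that contain X.

Definition 2. Given D and a minimum support threshold min_sup, an itemset X ⊆ I with
Sup(X) ≥ min_sup is a frequent itemset in D. The set of all frequent itemsets is denoted FI.

Definition 3. Given D and min_sup, an itemset X ⊆ I is a maximal frequent itemset in D if
Sup(X) ≥ min_sup and Sup(Y) < min_sup for every Y ⊃ X. The set of all maximal frequent
itemsets is denoted MFI.

Algorithms for mining MFI include MaxMiner [15], MAFIA [16], GenMax [17][18],
Pincer-Search [19], Diffset [20], and HBMFI [21].

Figure 3. {A, B, C, D, E} [17]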

3.2

For Table 1 with min_sup = 2 and the item set {A, B, C, D, E} (Figure 3), the maximal frequent
itemsets are {A, B, D, E}, {A, C, D, E}, and {B, C, D} (Figure 4).

[22]-[24]
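The maximal frequent itemsets of Table 1 at min_sup = 2 can be reproduced with a brute-force Python sketch. This is not GenMax or any of the cited algorithms; it simply enumerates every itemset, which is feasible only for a handful of items. Support is assumed to be an absolute count, and the function names are hypothetical.

```python
from itertools import combinations

# The twelve records of Table 1, in ID order.
TRANSACTIONS = [
    {"A", "C", "D"}, {"A", "B", "C", "D", "E"}, {"D", "E"},
    {"B", "C"}, {"B", "C", "D"}, {"B", "D"},
    {"B", "C"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"A", "C", "D", "E"}, {"A", "C", "D"}, {"D", "E"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for record in transactions if itemset <= record)


def maximal_frequent_itemsets(transactions, min_sup):
    """Brute-force MFI: frequent itemsets with no frequent proper superset."""
    items = sorted(set().union(*transactions))
    frequent = [
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if support(frozenset(candidate), transactions) >= min_sup
    ]
    # `x < y` on frozensets tests for a proper subset.
    return [x for x in frequent if not any(x < y for y in frequent)]


mfi = maximal_frequent_itemsets(TRANSACTIONS, min_sup=2)
print(sorted("".join(sorted(m)) for m in mfi))  # ['ABDE', 'ACDE', 'BCD']
```

The three resulting itemsets overlap (D appears in all of them), so treating each maximal frequent itemset as a cluster yields overlapping clusters rather than a partition.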

4.2 FAO-780

overlapping

case

countries

studies developing

Figure 4. {A, B, D, E}, {A, C, D, E}, {B, C, D}


4.1

Table 2. FAO-780

Value   Count   Share       Value   Count   Share
3       72      9.23%       10      61      7.82%
4       79      10.13%      11      46      5.90%
5       74      9.49%       12      43      5.51%
6       77      9.87%       13      41      5.26%
7       73      9.36%       14      30      3.85%
8       87      11.15%      15      18      2.31%
9       68      8.72%       …       11      1.41%
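The percentages in Table 2 can be rechecked with a few lines of Python. The sketch assumes only that each percentage is the count divided by the total of all counts; the key `"unlabelled"` is a placeholder for the final row, whose label is not available.

```python
# Counts from Table 2; the last row's label is unknown, so a placeholder key is used.
COUNTS = {
    3: 72, 4: 79, 5: 74, 6: 77, 7: 73, 8: 87, 9: 68,
    10: 61, 11: 46, 12: 43, 13: 41, 14: 30, 15: 18,
    "unlabelled": 11,
}

total = sum(COUNTS.values())
shares = {key: round(100 * count / total, 2) for key, count in COUNTS.items()}

print(total)  # 780
for key, share in shares.items():
    print(key, f"{share:.2f}%")
```

The counts sum to 780, which matches the size suggested by the dataset name FAO-780.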

25
C C
5

min_sup = 6 GenMax [17]

FAO-780

FAO-780 MFI

[25]

780

320

FAO

179 MFI

1560

23 7.98
2

4.2
2.1

141
3 141
106 25 23
25

case studiesdeveloping

countries

25 C

min_sup

2 min_sup = 5 A

min_sup

106 23 =
83 23

[26]

525
3

ID   Itemset
1    {agricultural resources, agricultural sector, labour, social conditions, national
     planning, decision making, legislation, extension activities, training, women, role of
     women, rural development}
2    {choice of species, selection, genetic resources, resource conservation, forestry
     policies, forest resources}
3    {forest management, forestry development, forestry policies, forest resources}
4    {female labour, role of women, rural development}
5    {forest management, sustainability, forest resources}
6    {shellfish culture, infrastructure, markets, food consumption, fishery production,
     surveys, fish culture, fishery management}
7    {wood industry, wood products, Asia, forestry policies, forest resources}
8    {wood industry, supply balance, wood products, Asia}
9    {social change, women, role of women, rural development}
10   {wood, Asia, forest resources}
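How much the ten itemsets in the table above overlap can be gauged with a short Python sketch that counts, for each term, the number of itemsets it belongs to. The variable names are hypothetical, and the list order is assumed to follow the table from top to bottom.

```python
from collections import Counter

# The ten itemsets of the table above, in order.
ITEMSETS = [
    {"agricultural resources", "agricultural sector", "labour", "social conditions",
     "national planning", "decision making", "legislation", "extension activities",
     "training", "women", "role of women", "rural development"},
    {"choice of species", "selection", "genetic resources", "resource conservation",
     "forestry policies", "forest resources"},
    {"forest management", "forestry development", "forestry policies", "forest resources"},
    {"female labour", "role of women", "rural development"},
    {"forest management", "sustainability", "forest resources"},
    {"shellfish culture", "infrastructure", "markets", "food consumption",
     "fishery production", "surveys", "fish culture", "fishery management"},
    {"wood industry", "wood products", "Asia", "forestry policies", "forest resources"},
    {"wood industry", "supply balance", "wood products", "Asia"},
    {"social change", "women", "role of women", "rural development"},
    {"wood", "Asia", "forest resources"},
]

# Number of itemsets each term appears in.
term_counts = Counter(term for itemset in ITEMSETS for term in itemset)
shared = {term: n for term, n in term_counts.items() if n > 1}

for term, n in sorted(shared.items(), key=lambda kv: (-kv[1], kv[0])):
    print(n, term)
```

Terms such as "forest resources" belong to several itemsets at once, which a hard partition of the terms could not express.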


References

[1] … [J]. …, 2010, 29(4): 723-731.
[2] … 5[R]. …, 2009.
[3] Salton G. Experiments in Automatic Thesaurus Construction for Information Retrieval [C] //
Freiman C V, Griffith J E, Rosenfeld J L. Proceedings of the IFIP Congress, Volume 1.
Amsterdam: North Holland Publishing Co., 1971: 115-123.
[4] Booth A D. A Law of Occurrences for Words of Low Frequency [J]. Information and Control,
1967, 10(4): 386-393.
[5] Donohue J C. Understanding Scientific Literature: A Bibliographic Approach [M].
Cambridge: MIT Press, 1973.
[6] Callon M, Law J, Rip A. Qualitative Scientometrics [M] // Mapping the Dynamics of Science
and Technology. London: Macmillan Publishers Limited, 1986: 103-123.
[7] Callon M, Courtial J P, Laville F. Co-word Analysis as a Tool for Describing the Network
of Interactions between Basic and Technological Research: the Case of Polymer Chemistry [J].
Scientometrics, 1991, 22(1): 155-205.
[8] Batagelj V, Mrvar A. Pajek Program for Large Network Analysis [OL]. [2010-10-12].
http://pajek.imfm.si/doku.php?id=pajek.
[9] Borgatti S. NetDraw Network Visualization [OL]. [2010-12-12].
http://www.analytictech.com/netdraw/netdraw.htm.
[10] Duda R O, Hart P E, Stork D G. Pattern Classification, 2nd ed. [M]. New York: John
Wiley & Sons, Inc, 2001.
[11] Jain A K, Dubes R C. Algorithms for Clustering Data [M]. New Jersey: Prentice-Hall, 1988.
[12] Frey B J, Dueck D. Clustering by Passing Messages between Data Points [J]. Science, 2007,
315: 972-976.
[13] … Proximity … [J]. …, 2011, 30(11): 1163-1170.
[14] Agrawal R, Imielinski T, Swami A. Mining Association Rules between Sets of Items in Large
Databases [C] // Buneman P, Jajodia S. Proceedings of the ACM SIGMOD International Conference
on Management of Data. New York: ACM Press, 1993: 207-216.
[15] Bayardo Jr. R J. Efficiently Mining Long Patterns from Databases [C] // Proceedings of
the 1998 ACM SIGMOD International Conference on Management of Data. New York: ACM Press,
1998: 85-93.
[16] Burdick D, Calimlim M, Gehrke J. MAFIA: A Maximal Frequent Itemset Algorithm for
Transactional Databases [C] // Proceedings of the 17th International Conference on Data
Engineering. Washington: IEEE Computer Society, 2001: 433-442.
[17] Gouda K, Zaki M J. GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets
[J]. Data Mining and Knowledge Discovery, 2005, 11(3): 1-20.
[18] Gouda K, Zaki M J. Efficiently Mining Maximal Frequent Itemsets [C] // Proceedings of the
1st IEEE International Conference on Data Mining (ICDM). Washington: IEEE Computer Society,
2001: 163-170.
[19] Lin D-I, Kedem Z M. Pincer-Search: An Efficient
Algorithm for Discovering the Maximum Frequent
Set [J]. IEEE Transactions on Knowledge and Data
Engineering, 2002, 14(3): 553-566.
[20] Zaki M J, Gouda K. Fast Vertical Mining using
Diffsets [C] // Proceedings of the 9th ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining (KDD). New York: ACM Press,
2003: 326-335.
[21] Zubair Rahman A M J M, Balasubramanie P. An
Efficient Algorithm for Mining Maximal Frequent
Item Sets [J]. Journal of Computer Science, 2008,
4(8): 638-645.
[22] Whittaker J. Creativity and Conformity in Science:
Titles, Keywords and Co-word Analysis [J]. Social
Studies of Science, 1989, 19(3): 473-496.
[23] He Q. Knowledge Discovery through Co-word
Analysis [J]. Library Trends, 1999, 48(1): 133-159.
[24] Law J, Whittaker J. Mapping Acidification Research:
A Test of the Co-word Method [J]. Scientometrics,
1992, 23(3): 417-461.
[25] Medelyan O, Witten I H, Frank E. FAO-780 [OL].
[2010-10-05]. http://code.google.com/p/maui-indexer/downloads/list.
[26] Perfetti C A. The Limits of Co-occurrence: Tools and
Theories in Language Research [J]. Discourse
Processes, 1998, 25(2&3): 363-377.
