Mfi

A Novel Approach for Co-occurrence Clustering Analysis: Maximal Frequent Itemset Mining
Xu Shuo^1, Qiao Xiaodong^1, Zhu Lijun^1, Zhang Yunliang^1, Xue Chunxiang^2
(1. Institute of Scientific and Technical Information of China, Beijing 100038;
2. School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094)
Abstract
For a literature collection in a given field, if two research objects co-occur frequently, one usually assumes that an underlying link between them is more probable. This assumption explains the popularity of co-occurrence analysis methods such as co-word analysis, co-citation analysis, and co-authorship analysis. Traditional co-occurrence analysis usually consists of three steps. However, the first two steps are problematic and may lead to misleading co-occurrence clustering results. This paper therefore introduces a new method for co-occurrence clustering analysis, maximal frequent itemset mining, from the association rule mining domain. The approach compresses the three steps of traditional co-occurrence clustering into one, which greatly simplifies the process. One of its most appealing characteristics is that it makes full use of all available information, which overcomes the problem in traditional co-occurrence analysis. Experimental results show that satisfactory clustering results can generally be obtained by setting a proper minimal support threshold.
_______________________
Dates: 2011-01-23; 2011-08-26.
Xu Shuo (1979), E-mail: xush@istic.ac.cn; Qiao Xiaodong (1965), E-mail: qiaox@istic.ac.cn; Zhu Lijun (1973), E-mail: zhulj@istic.ac.cn; Zhang Yunliang (1979), E-mail: zhangyl@istic.ac.cn; Xue Chunxiang (1979), E-mail: xuechunxiang@gmail.com.
1) 2011BAH10B04; YY-20112909TQC011.
Keywords: Co-occurrence Analysis, Co-word Analysis, Clustering Analysis, Maximal Frequent Itemset, Hierarchical Clustering
[1]
//
/[2]
Salton
[3]
20 70 ,
1971
20
2.1
k ( 3)
k ( 3)
[4]
Zipf Donohue
[11]
[5]
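b. Given a collection D with N terms $I = \{T_1, T_2, \ldots, T_N\}$, the $N \times N$ co-occurrence matrix $C$ is symmetric: $C_{i,j} = C_{j,i}$ ($i \neq j$; $i, j = 1, 2, \ldots, N$) is the co-occurrence frequency of the pair $\langle T_i, T_j \rangle$, and $C_{i,i}$ ($i = 1, 2, \ldots, N$) is the frequency of $T_i$. $C$ is usually converted with the inclusion index [6], the proximity index [6], or the equivalence index [7].

c. Tools and methods used for clustering and visualization include Pajek [8], NetDraw [9], KNN, k-Means [10][11], affinity propagation (AP, published in Science) [12], and multidimensional scaling (MDS).

For two clusters $S_1 = \{T_{1,1}, T_{1,2}, \ldots, T_{1,m}\}$ and $S_2 = \{T_{2,1}, T_{2,2}, \ldots, T_{2,n}\}$ with pairwise similarity $\mathrm{Sim}(T_{1,i}, T_{2,j})$, the single-linkage, complete-linkage, and average-linkage similarities between $S_1$ and $S_2$ are:

$$\mathrm{Sim}_{single}(S_1, S_2) = \max_{e_1 \in S_1} \max_{e_2 \in S_2} \{\mathrm{Sim}(e_1, e_2)\} \tag{1}$$

$$\mathrm{Sim}_{complete}(S_1, S_2) = \min_{e_1 \in S_1} \min_{e_2 \in S_2} \{\mathrm{Sim}(e_1, e_2)\} \tag{2}$$

$$\mathrm{Sim}_{average}(S_1, S_2) = \frac{1}{m \, n} \sum_{e_1 \in S_1} \sum_{e_2 \in S_2} \mathrm{Sim}(e_1, e_2) \tag{3}$$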
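The three linkage similarities in Eqs. (1)-(3) can be sketched in a few lines of Python. This is only an illustration: the function name `linkage_similarities`, the lookup table `pair_sim`, and the numeric values are assumptions made for the example, and `sim` stands for any pairwise similarity.

```python
from itertools import product


def linkage_similarities(s1, s2, sim):
    """Return (single, complete, average) linkage similarity of clusters s1, s2.

    sim(e1, e2) is any pairwise similarity function (hypothetical here).
    """
    values = [sim(e1, e2) for e1, e2 in product(s1, s2)]
    single = max(values)                          # Eq. (1): best pair
    complete = min(values)                        # Eq. (2): worst pair
    average = sum(values) / (len(s1) * len(s2))   # Eq. (3): mean over m*n pairs
    return single, complete, average


# Illustrative pairwise similarities between term A and terms C, D.
pair_sim = {frozenset("AC"): 4, frozenset("AD"): 5}


def sim(e1, e2):
    return pair_sim[frozenset((e1, e2))]


result = linkage_similarities({"A"}, {"C", "D"}, sim)
print(result)  # (5, 4, 4.5)
```

Collecting all pairwise values once and reducing them three ways keeps the correspondence with the three equations explicit.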
Table 1. Example data set with 12 records

ID   Itemset              ID   Itemset
1    {A, C, D}            7    {B, C}
2    {A, B, C, D, E}      8    {A, B, D, E}
3    {D, E}               9    {B, C, D}
4    {B, C}               10   {A, C, D, E}
5    {B, C, D}            11   {A, C, D}
6    {B, D}               12   {D, E}
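As a rough illustration, the co-occurrence counts for the 12 records listed above can be tallied as follows. The variable names (`records`, `cooc`) are chosen for this sketch only; diagonal entries hold the frequency of each single term.

```python
from collections import Counter
from itertools import combinations

# The 12 records of the example data set (IDs 1 to 12, in order).
records = [
    {"A", "C", "D"}, {"A", "B", "C", "D", "E"}, {"D", "E"},
    {"B", "C"}, {"B", "C", "D"}, {"B", "D"},
    {"B", "C"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"A", "C", "D", "E"}, {"A", "C", "D"}, {"D", "E"},
]

cooc = Counter()
for rec in records:
    for term in rec:
        cooc[(term, term)] += 1              # diagonal: term frequency
    for a, b in combinations(sorted(rec), 2):
        cooc[(a, b)] += 1                    # upper triangle: pair frequency

terms = sorted({t for rec in records for t in rec})
for a in terms:
    # Look up each pair in sorted order so the matrix prints symmetrically.
    print(a, [cooc[tuple(sorted((a, b)))] for b in terms])
```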
[13]
2.2
Sim(S1, S2) S1S2
1 1
S1 S2
2 2 ab
S1S2
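Figure: step-by-step merging of the terms in Table 1. Entries are average pairwise co-occurrence frequencies between clusters; values in parentheses are the actual co-occurrence frequencies of the merged term set.

Step 0       B     C     D     E
A            2     4     5     3
B                  5     5     2
C                        6     2
D                              5

Step 1       B     CD         E
A            2     4.5 (4)    3
B                  5 (3)      2
CD                            3.5 (2)

Step 2       BCD        E
A            3.67 (1)   3
BCD                     3 (1)

Step 3       E
ABCD         3 (1)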
S1S2
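The true similarity $\mathrm{Sim}_{true}(S_1, S_2)$ and the linkage similarities $\mathrm{Sim}_{single}(S_1, S_2)$, $\mathrm{Sim}_{complete}(S_1, S_2)$, and $\mathrm{Sim}_{average}(S_1, S_2)$ satisfy:

$$\mathrm{Sim}_{true}(S_1, S_2) \le \mathrm{Sim}_{complete}(S_1, S_2) \le \mathrm{Sim}_{average}(S_1, S_2) \le \mathrm{Sim}_{single}(S_1, S_2) \tag{4}$$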
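Relation (4) can be checked numerically on the 12-record example. The sketch below assumes that Sim is the raw co-occurrence frequency of a term pair and that the true similarity is the number of records containing every term of both clusters; the helper names `cofreq` and `compare` are hypothetical.

```python
from itertools import product

records = [
    {"A", "C", "D"}, {"A", "B", "C", "D", "E"}, {"D", "E"},
    {"B", "C"}, {"B", "C", "D"}, {"B", "D"},
    {"B", "C"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"A", "C", "D", "E"}, {"A", "C", "D"}, {"D", "E"},
]


def cofreq(terms):
    """Number of records that contain all of the given terms."""
    terms = set(terms)
    return sum(1 for rec in records if terms <= rec)


def compare(s1, s2):
    """Return (true, complete, average, single) similarity of clusters s1, s2."""
    pair = [cofreq({e1, e2}) for e1, e2 in product(s1, s2)]
    true = cofreq(set(s1) | set(s2))
    return true, min(pair), sum(pair) / len(pair), max(pair)


# Cluster pairs written as strings of single-letter terms.
for s1, s2 in [("A", "CD"), ("B", "CD"), ("E", "CD"),
               ("A", "BCD"), ("E", "BCD"), ("ABCD", "E")]:
    t, c, a, s = compare(s1, s2)
    assert t <= c <= a <= s          # relation (4)
    print(f"{s1}-{s2}: true={t} complete={c} average={a:.2f} single={s}")
```

Under these assumptions the average-linkage value can overstate the true co-occurrence considerably, for example 3.67 against 1 for A and {B, C, D}.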
Association rule mining was proposed by Agrawal et al. [14]. Maximal frequent itemset mining was introduced by Bayardo in 1998 [15].
3.1
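Let $I$ be the set of all items. Any subset of $I$ is called an itemset.

Definition 1. Given a transaction database $D$, the support of an itemset $X$, denoted $\mathrm{Sup}(X)$, is the number of transactions in $D$ that contain $X$.

Definition 2. Given $D$ and a minimal support threshold min_sup, an itemset $X \subseteq I$ with $\mathrm{Sup}(X) \ge$ min_sup is a frequent itemset (FI) in $D$.

Definition 3. Given $D$ and min_sup, an itemset $X \subseteq I$ is a maximal frequent itemset (MFI) in $D$ if $\mathrm{Sup}(X) \ge$ min_sup and $\mathrm{Sup}(Y) <$ min_sup for every $Y \supset X$.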
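The notions of support, frequent itemset, and maximal frequent itemset translate directly into code. The following is a minimal sketch, assuming support is an absolute count (consistent with the integer min_sup values used in this paper); the function names and the four-transaction database `toy_db` are made up for illustration.

```python
def support(itemset, database):
    """Number of transactions in the database that contain the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in database if itemset <= t)


def is_frequent(itemset, database, min_sup):
    """An itemset is frequent when its support reaches min_sup."""
    return support(itemset, database) >= min_sup


def is_maximal_frequent(itemset, database, min_sup):
    """Frequent, and no proper superset is frequent."""
    itemset = frozenset(itemset)
    if not is_frequent(itemset, database, min_sup):
        return False
    all_items = frozenset().union(*database)
    # Support never increases when items are added, so it is enough to
    # test the supersets that add exactly one item.
    return all(
        not is_frequent(itemset | {x}, database, min_sup)
        for x in all_items - itemset
    )


# Hypothetical database of four transactions over items a, b, c.
toy_db = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]

print(support("ab", toy_db))                 # 2
print(is_maximal_frequent("ab", toy_db, 2))  # True
print(is_maximal_frequent("a", toy_db, 2))   # False
```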
Algorithms for mining MFIs include MaxMiner [15], MAFIA [16], GenMax [17][18], Pincer-Search [19], Diffset [20], and HBMFI [21].
[17]
Figure 3. Itemsets over {A, B, C, D, E}
3.2
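Take the data in Table 1 as an example, with the itemsets over {A, B, C, D, E} shown in Figure 3. With min_sup = 2, the maximal frequent itemsets are {A, B, D, E}, {A, C, D, E}, and {B, C, D} (Figure 4).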
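The three itemsets above can be reproduced with a small brute-force miner. This sketch is not GenMax or any of the cited algorithms; it simply enumerates frequent itemsets depth-first and keeps those with no frequent proper superset. The function name `mine_mfi` is an assumption of the example.

```python
def mine_mfi(database, min_sup):
    """Return all maximal frequent itemsets by exhaustive depth-first search."""
    items = sorted({i for t in database for i in t})

    def support(itemset):
        return sum(1 for t in database if itemset <= t)

    frequent = []

    def extend(current, start):
        # Grow the current itemset with items in lexicographic order;
        # an infrequent candidate is not extended further.
        for k in range(start, len(items)):
            candidate = current | {items[k]}
            if support(candidate) >= min_sup:
                frequent.append(candidate)
                extend(candidate, k + 1)

    extend(frozenset(), 0)
    # Keep only itemsets that are not a proper subset of another frequent one.
    return [x for x in frequent if not any(x < y for y in frequent)]


# The 12 records of the example data set, written as strings of items.
database = [frozenset(s) for s in (
    "ACD", "ABCDE", "DE", "BC", "BCD", "BD",
    "BC", "ABDE", "BCD", "ACDE", "ACD", "DE",
)]

mfis = mine_mfi(database, min_sup=2)
print([sorted(x) for x in mfis])
```

With min_sup = 1 the same function returns the single itemset {A, B, C, D, E}, since one record contains all five items. Exhaustive enumeration is only practical for small examples like this one.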
[22]-[24]
1-
3 4 2
4.2 FAO-780
overlapping
case studies, developing countries
Figure 4. {A, B, D, E}, {A, C, D, E}, {B, C, D}

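4.1

Table 2. FAO-780: distribution of documents by number of terms

Terms   Documents   Share        Terms   Documents   Share
3       72          (9.23%)      10      61          (7.82%)
4       79          (10.13%)     11      46          (5.90%)
5       74          (9.49%)      12      43          (5.51%)
6       77          (9.87%)      13      41          (5.26%)
7       73          (9.36%)      14      30          (3.85%)
8       87          (11.15%)     15      18          (2.31%)
9       68          (8.72%)      […]     11          (1.41%)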
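As a quick consistency check on the counts and percentages listed above, the shares can be recomputed from the counts. The list below simply copies the fourteen counts in order; the variable names are chosen for this sketch.

```python
# Counts in the order they appear (rows 3 to 15, then the final row).
counts = [72, 79, 74, 77, 73, 87, 68, 61, 46, 43, 41, 30, 18, 11]

total = sum(counts)
shares = [f"{100 * c / total:.2f}%" for c in counts]

print(total)   # 780
print(shares)
```

The counts sum to 780, the size of the FAO-780 collection, and each recomputed share matches the printed percentage to two decimals.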
25
C C
5
min_sup = 6 GenMax [17]
FAO-780
FAO-780 MFI
[25]
780
320
FAO
179 MFI
1560
23 7.98
2
4.2
2.1
141
3 141
106 25 23
25
case studies, developing countries
25 C
min_sup
2 min_sup = 5 A
min_sup
106 23 =
83 23
[26]
525
Table 3. Example maximal frequent itemsets

ID   Itemset
1    {agricultural resources, agricultural sector, labour, social conditions, national planning, decision making, legislation, extension activities, training, women, role of women, rural development}
2    {choice of species, selection, genetic resources, resource conservation, forestry policies, forest resources}
3    {forest management, forestry development, forestry policies, forest resources}
4    {female labour, role of women, rural development}
5    {forest management, sustainability, forest resources}
6    {shellfish culture, infrastructure, markets, food consumption, fishery production, surveys, fish culture, fishery management}
7    {wood industry, wood products, Asia, forestry policies, forest resources}
8    {wood industry, supply balance, wood products, Asia}
9    {social change, women, role of women, rural development}
10   {wood, Asia, forest resources}
30
References

[1] […] [J]. […], 2010, 29(4): 723-731.
[2] […] 5 [R]. […], 2009.
[3] Salton G. Experiments in Automatic Thesaurus Construction for Information Retrieval [C] // Freiman C V, Griffith J E, Rosenfeld J L. Proceedings of the IFIP Congress, Volume 1. Amsterdam: North Holland Publishing Co., 1971: 115-123.
[4] Booth A D. A Law of Occurrences for Words of Low Frequency [J]. Information and Control, 1967, 10(4): 386-393.
[5] Donohue J C. Understanding Scientific Literature: A Bibliographic Approach [M]. Cambridge: MIT Press, 1973.
[6] Callon M, Law J, Rip A. Qualitative Scientometrics [M] // Mapping the Dynamics of Science and Technology. London: Macmillan Publishers Limited, 1986: 103-123.
[7] Callon M, Courtial J P, Laville F. Co-word Analysis as a Tool for Describing the Network of Interactions between Basic and Technological Research: the Case of Polymer Chemistry [J]. Scientometrics, 1991, 22(1): 155-205.
[8] Batagelj V, Mrvar A. Pajek Program for Large Network Analysis [OL]. [2010-10-12]. http://pajek.imfm.si/doku.php?id=pajek.
[9] Borgatti S. NetDraw Network Visualization [OL]. [2010-12-12]. http://www.analytictech.com/netdraw/netdraw.htm.
[10] Duda R O, Hart P E, Stork D G. Pattern Classification, 2nd ed. [M]. New York: John Wiley & Sons, Inc, 2001.
[11] Jain A K, Dubes R C. Algorithms for Clustering Data [M]. New Jersey: Prentice-Hall, 1988.
[12] Frey B J, Dueck D. Clustering by Passing Messages between Data Points [J]. Science, 2007, 315: 972-976.
[13] […] Proximity […] [J]. […], 2011, 30(11): 1163-1170.
[14] Agrawal R, Imielinski T, Swami A. Mining Association Rules between Sets of Items in Large Databases [C] // Buneman P, Jajodia S. Proceedings of the ACM SIGMOD International Conference on Management of Data. New York: ACM Press, 1993: 207-216.
[15] Bayardo Jr. R J. Efficiently Mining Long Patterns from Databases [C] // Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. New York: ACM Press, 1998: 85-93.
[16] Burdick D, Calimlim M, Gehrke J. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases [C] // Proceedings of the 17th International Conference on Data Engineering. Washington: IEEE Computer Society, 2001: 433-442.
[17] Gouda K, Zaki M J. GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets [J]. Data Mining and Knowledge Discovery, 2005, 11(3): 1-20.
[18] Gouda K, Zaki M J. Efficiently Mining Maximal Frequent Itemsets [C] // Proceedings of the 1st IEEE International Conference on Data Mining (ICDM). Washington: IEEE Computer Society, 2001: 163-170.
[19] Lin D-I, Kedem Z M. Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set [J]. IEEE Transactions on Knowledge and Data Engineering, 2002, 14(3): 553-566.
[20] Zaki M J, Gouda K. Fast Vertical Mining using Diffsets [C] // Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). New York: ACM Press, 2003: 326-335.
[21] Zubair Rahman A M J M, Balasubramanie P. An Efficient Algorithm for Mining Maximal Frequent Item Sets [J]. Journal of Computer Science, 2008, 4(8): 638-645.
[22] Whittaker J. Creativity and Conformity in Science: Titles, Keywords and Co-word Analysis [J]. Social Studies of Science, 1989, 19(3): 473-496.
[23] He Q. Knowledge Discovery through Co-word Analysis [J]. Library Trends, 1999, 48(1): 133-159.
[24] Law J, Whittaker J. Mapping Acidification Research: A Test of the Co-word Method [J]. Scientometrics, 1992, 23(3): 417-461.
[25] Medelyan O, Witten I H, Frank E. FAO-780 [OL]. [2010-10-05]. http://code.google.com/p/maui-indexer/downloads/list.
[26] Perfetti C A. The Limits of Co-occurrence: Tools and Theories in Language Research [J]. Discourse Processes, 1998, 25(2&3): 363-377.
