Graph Databases (Co So Du Lieu Do Thi)

Mining, Indexing and Searching Graph Databases
Presenter: A/Prof. Do Phuc
Source: Jiawei Han, Vladimir Lipets

July 22, 2010
Graph, Graph, Everywhere

[Figures: aspirin; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); an Internet web; co-author network]
Why Graph Mining and Searching?

Graphs are ubiquitous
  Chemical compounds (Cheminformatics)
  Protein structures, biological pathways/networks (Bioinformatics)
  Program control flow, traffic flow, and workflow analysis
  XML databases, Web, and social network analysis
Graph is a general model
  Trees, lattices, sequences, and items are degenerated graphs
Diversity of graphs
  Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
Complexity of algorithms: many problems are of high complexity!
Outline
Graph Isomorphism, Subgraph Isomorphism
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Motivation
Graph and subgraph isomorphism are important and very general forms of pattern matching that find practical application in areas such as:
  pattern recognition and computer vision
  image processing
  computer-aided design
  graph grammars
  graph transformation
  biocomputing
  search operations in chemical databases
  numerous others
A hierarchy of pattern matching problems
  Graph isomorphism
  Subgraph isomorphism
  Maximum common subgraph
  Approximate subgraph isomorphism
  Graph edit distance
Isomorphic Graphs
Graph Isomorphism
Subgraph of a given graph
Subgraph Isomorphism
Subgraph Isomorphism and Related Problems

Given a pattern graph G and a target graph H
  Decision problem: Answer whether H contains a subgraph isomorphic to G
  Search problem: Return an occurrence of G as a subgraph of H
  Counting problem: Return a count of the number of subgraphs of H that are isomorphic to G
  Enumeration problem: Return all occurrences of G as a subgraph of H
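
The four problem variants above can be sketched with one backtracking enumerator. This is a minimal illustration, not an algorithm taken from the slides: it assumes undirected, unlabeled graphs stored as adjacency sets, matches non-induced subgraphs, and runs in exponential time in the worst case. The function names (embeddings, contains, find_one, occurrences, count_occurrences) are hypothetical.

```python
from typing import Dict, Hashable, Iterator, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def embeddings(pattern: Graph, target: Graph) -> Iterator[dict]:
    """Enumerate injective vertex mappings that send every pattern edge to a target edge."""
    order = list(pattern)

    def extend(mapping: dict) -> Iterator[dict]:
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        u = order[len(mapping)]
        for v in target:
            if v in mapping.values():
                continue  # keep the mapping injective
            # each already-mapped neighbour of u must map to a neighbour of v
            if all(mapping[w] in target[v] for w in pattern[u] if w in mapping):
                mapping[u] = v
                yield from extend(mapping)
                del mapping[u]

    return extend({})


def contains(pattern: Graph, target: Graph) -> bool:
    """Decision problem: does the target contain the pattern?"""
    return next(embeddings(pattern, target), None) is not None


def find_one(pattern: Graph, target: Graph):
    """Search problem: one mapping, or None if there is none."""
    return next(embeddings(pattern, target), None)


def occurrences(pattern: Graph, target: Graph) -> set:
    """Enumeration problem: distinct subgraphs as (vertex set, edge set) pairs."""
    found = set()
    for m in embeddings(pattern, target):
        edges = frozenset(frozenset((m[a], m[b])) for a in pattern for b in pattern[a])
        found.add((frozenset(m.values()), edges))
    return found


def count_occurrences(pattern: Graph, target: Graph) -> int:
    """Counting problem: number of distinct subgraphs isomorphic to the pattern."""
    return len(occurrences(pattern, target))
```

Several mappings can describe the same subgraph (one per automorphism of the pattern), which is why the counting variant deduplicates by vertex and edge set instead of counting mappings.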
Graph Pattern Mining

Frequent subgraphs
  A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
  Mining biochemical structures
  Program control flow analysis
  Mining XML structures or Web communities
  Building blocks for graph classification, clustering, comparison, and correlation analysis
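
The support definition above can be illustrated on the smallest patterns, single labeled edges. The sketch below is only the first level of a frequent subgraph miner, under the assumption that each graph is given as a list of edges written as pairs of vertex labels; the name frequent_edges is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # labels of the two endpoints of an undirected edge


def frequent_edges(dataset: Iterable[List[Edge]], min_support: int) -> Dict[Edge, int]:
    """Return the single-edge patterns whose support is at least min_support."""
    support = Counter()
    for graph in dataset:
        # a pattern counts once per graph, however often it occurs inside it
        patterns = {tuple(sorted(edge)) for edge in graph}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}
```

For example, with three graphs [("C", "O"), ("C", "N"), ("O", "C")], [("C", "O"), ("C", "S")] and [("N", "C")] and a minimum support of 2, the C-O and C-N edges are frequent and the C-S edge is not.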
Example: Frequent Subgraphs

[Figure: a graph dataset of three chemical compounds (A), (B), (C), and two frequent patterns (1), (2) found with a minimum support of 2]
Frequent Subgraph Mining Approaches

Apriori-based approach
  AGM/AcGM: Inokuchi, et al. (PKDD00)
  FSG: Kuramochi and Karypis (ICDM01)
  PATH: Vanetik and Gudes (ICDM02, ICDM04)
  FFSM: Huan, et al. (ICDM03)
Pattern growth-based approach
  MoFa: Borgelt and Berthold (ICDM02)
  gSpan: Yan and Han (ICDM02)
  Gaston: Nijssen and Kok (KDD04)
Properties of Graph Mining Algorithms

Search order: breadth vs. depth
Generation of candidate subgraphs: apriori vs. pattern growth
Elimination of duplicate subgraphs: passive vs. active
Support calculation: embedding store or not
Discover order of patterns: path, tree, graph
Outline
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Graph Search: Querying Graph Databases

Querying graph databases: Given a graph database and a query graph, find all graphs containing this query graph
[Figure: a query graph and a graph database of chemical structures]
Scalability Issue
Sequential scan
  Disk I/O
  Subgraph isomorphism testing
An indexing mechanism is needed
  DayLight: Daylight.com (commercial)
  GraphGrep: Dennis Shasha, et al. PODS'02
  Grace: Srinath Srinivasa, et al. ICDE'03
[Figure: a sample database of graphs (a), (b), (c) and a query graph]
Indexing Strategy
If graph G contains query graph Q, G should contain any substructure of Q
[Figure: graph (G), query graph (Q), and a substructure of Q]
Remarks
  Index substructures of a query graph to prune graphs that do not contain these substructures
Framework
Two steps in processing graph queries
Step 1. Index Construction
Enumerate structures in the graph database, build an inverted index between structures and graphs
Step 2. Query Processing
Enumerate structures in the query graph
Calculate the candidate graphs containing these structures
Prune the false positive answers by performing subgraph isomorphism test
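
The two steps can be sketched as a filter-then-verify pipeline. This is an illustration under simplifying assumptions, not the design of any particular system: each graph is assumed to be already reduced to the set of structures it contains (for example labeled edges or paths, written here as strings), and the exact subgraph isomorphism test is passed in as a callable named verify. All names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Set


def build_index(database: Dict[Hashable, Iterable[Hashable]]) -> Dict[Hashable, Set[Hashable]]:
    """Step 1: inverted index from each structure to the ids of graphs containing it."""
    index = defaultdict(set)
    for graph_id, structures in database.items():
        for s in structures:
            index[s].add(graph_id)
    return index


def candidates(index, query_structures: Iterable[Hashable], all_ids: Iterable[Hashable]) -> Set[Hashable]:
    """Step 2a: keep only graphs that contain every structure of the query."""
    result = set(all_ids)
    for s in query_structures:
        result &= index.get(s, set())
    return result


def answer(index, database, query_structures, verify: Callable[[Hashable], bool]) -> Set[Hashable]:
    """Step 2b: prune false positives with an exact test supplied by the caller."""
    return {g for g in candidates(index, query_structures, database) if verify(g)}
```

The candidate set can contain false positives, because containing every indexed structure of the query does not guarantee containing the query itself; it never loses a true answer, which is what makes the final verification step sufficient.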
Outline
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Some recent progress on graph mining
Graph Clustering
Graph similarity measure
Feature-based similarity measure
  Each graph is represented as a feature vector
  The similarity is defined by the distance of their corresponding vectors
  Frequent subgraphs can be used as features
Structure-based similarity measure
  Maximal common subgraph
  Graph edit distance: insertion, deletion, and relabel
  Graph alignment distance
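
The feature-based measure can be sketched in a few lines. The example assumes that the features occurring in each graph have already been extracted (here as strings naming small substructures) and uses Euclidean distance, which is one possible choice of vector distance; the names feature_vector and euclidean are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, List, Sequence


def feature_vector(graph_features: Iterable[Hashable], vocabulary: Sequence[Hashable]) -> List[int]:
    """Count how often each vocabulary feature occurs in one graph."""
    counts = Counter(graph_features)
    return [counts[f] for f in vocabulary]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

With the vocabulary ["C-C", "C-O", "C-N"], a graph with features ["C-C", "C-C", "C-O"] maps to [2, 1, 0] and a graph with features ["C-C", "C-N"] maps to [1, 0, 1]. Two graphs with identical vectors are at distance 0 even if they are not isomorphic, so the measure is fast but rough.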
Graph Classification
Local structure based approach
  Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length
Graph pattern-based approach
  Subgraph patterns from domain knowledge
  Subgraph patterns from data mining
Kernel-based approach
  Random walk (Gärtner 02, Kashima et al. 02, ICML03, Mahé et al. ICML04)
  Optimal local assignment (Fröhlich et al. ICML05)
Structure Similarity Search

[Figure: chemical compounds (a) caffeine, (b) diurobromine, (c) viagra, and a query graph]
Some Straightforward Methods

Method 1: Directly compute the similarity between the graphs in the DB and the query graph
  Sequential scan
  Subgraph similarity computation
Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search
  Costly: if we allow 3 edges to be missed in a 20-edge query graph, it may generate C(20, 3) = 1,140 subgraphs
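
The figure of 1,140 is the number of ways to choose which 3 of the 20 edges to drop. A small sketch makes the count and the blow-up concrete; the helper relaxed_edge_sets is hypothetical, and it does not check whether a relaxed query stays connected.

```python
from itertools import combinations
from math import comb
from typing import Hashable, Iterator, List, Sequence


def relaxed_edge_sets(edges: Sequence[Hashable], missing: int) -> Iterator[List[Hashable]]:
    """Yield every edge set obtained by deleting exactly `missing` edges."""
    for dropped in combinations(range(len(edges)), missing):
        skip = set(dropped)
        yield [e for i, e in enumerate(edges) if i not in skip]


# 20 edges, 3 allowed to be missed: one exact subgraph query per choice
print(comb(20, 3))  # 1140
```

Each of these relaxed queries would then need its own exact subgraph search, which is why this method is described as costly.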
Index: Precise vs. Approximate Search

Precise Search
  Use frequent patterns as indexing features
  Select features in the database space based on their selectivity
  Build the index
Approximate Search
  Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases
  Idea: (1) keep the index structure; (2) select features in the query space
Substructure Similarity Measure

Query relaxation measure
  The number of edges that can be relabeled or missed; the positions of these edges are not fixed
[Figure: query graph]
Substructure Similarity Measure

Feature-based similarity measure
  Each graph is represented as a feature vector X = {x1, x2, ..., xn}
  The similarity is defined by the distance of their corresponding vectors
Advantages
  Easy to index
  Fast
Rough measure
Query Processing Framework

Three steps in processing approximate graph queries
Step 1. Index Construction
Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database
Framework (cont.)
Step 2. Feature Miss Estimation
Determine the indexed features belonging to the query graph
Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J
The bound is calculated on the query graph, not the graph database
Framework (cont.)
Step 3. Query Processing
Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG - FQ
If FG - FQ > J, discard G
The remaining graphs constitute a candidate answer set
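
The filtering step can be sketched as follows. The sketch rests on two assumptions that the slides do not spell out: the feature difference is read as the number of feature occurrences of the query that the graph lacks, and the bound J from Step 2 is simply passed in as max_miss instead of being derived from the query. The names feature_miss and filter_candidates are hypothetical.

```python
from typing import Dict, Hashable, List

Counts = Dict[Hashable, int]  # feature -> number of occurrences


def feature_miss(query_counts: Counts, graph_counts: Counts) -> int:
    """Number of query feature occurrences that the graph cannot account for."""
    return sum(max(0, q - graph_counts.get(f, 0)) for f, q in query_counts.items())


def filter_candidates(matrix: Dict[Hashable, Counts], query_counts: Counts, max_miss: int) -> List[Hashable]:
    """Keep the graphs whose feature miss does not exceed the bound J (max_miss)."""
    return [g for g, counts in matrix.items()
            if feature_miss(query_counts, counts) <= max_miss]
```

Here matrix plays the role of the feature-graph matrix, stored per graph. The surviving graphs are only candidates; an approximate match still has to be confirmed by a structural comparison.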
Biological Networks
Protein-protein interaction network
Metabolic network
Transcriptional regulatory network
Co-expression network
Genetic interaction network
Data Mining Across Multiple Networks

[Figure: several networks over the same set of vertices a to k]
Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets

[Figure: expression matrices (genes g1, g2, ... by conditions c1, c2, ..., cm) and the co-expression networks over genes a to k derived from them]
CODENSE: Mine Coherent Dense Subgraphs

(1) Builds a summary graph by eliminating infrequent edges
[Figure: graphs G1 to G6 and their summary graph]
CODENSE: Mine Coherent Dense Subgraphs
(2) Identify dense subgraphs of the summary graph

[Figure: Step 2, MODES applied to the summary graph]
Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true.
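
The first step and the density notion behind the observation can be sketched as below. This is an illustration only: it assumes every graph is a list of undirected edges over a shared vertex set, takes density to be the fraction of possible vertex pairs that are connected, and does not implement the dense-subgraph search itself. The names summary_graph and density are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, Hashable, Iterable, List, Set, Tuple

EdgeSet = Set[FrozenSet[Hashable]]


def summary_graph(graphs: Iterable[List[Tuple[Hashable, Hashable]]], min_support: int) -> EdgeSet:
    """Keep the edges that occur in at least min_support of the input graphs."""
    support = Counter()
    for edges in graphs:
        support.update({frozenset(e) for e in edges})  # count once per graph
    return {e for e, s in support.items() if s >= min_support}


def density(vertices: Iterable[Hashable], edges: EdgeSet) -> float:
    """Fraction of vertex pairs within `vertices` that are joined by an edge."""
    vs = set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0
    inside = sum(1 for e in edges if e <= vs)
    return 2 * inside / (n * (n - 1))
```

A dense vertex set in the summary graph is only a candidate: its edges may each be frequent while never occurring together in the same input graphs, which is why the reverse of the observation does not hold.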
Applying CoDense to 39 Yeast Microarray Data Sets

[Figure: expression matrices (genes g1, g2, ... by conditions c1, c2, ..., cm) and the co-expression networks derived from the yeast microarray data sets]
Discovery of New Genes Based on Similar Genes
[Figure: network of the genes YDR115W, MRP49, PHB1, PET100, ATP17, MRPL51, ATP12, MRPL37, MRPL38, ACN9, MRPL32, MRPS18, MRPL39, FMC1]
Network of Known Similar Genes

[Figure: the gene network with one group of genes highlighted in brown]
Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18
GO:0019538 (protein metabolism; p-value = 0.001122)
Network Involved in the New Genes

[Figure: the gene network with one group of genes highlighted in red]
Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100
GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)
Conclusions
Graph mining has wide applications
Frequent and closed subgraph mining methods
  gSpan and CloseGraph: pattern-growth depth-first search approach
Graph indexing techniques
  Frequent and discriminative subgraphs as indexing features
Similarity search in graph databases
  Indexing and approximate matching help similar subgraph search
Biological network analysis
  Mining coherent, dense, multiple biological networks
Many new developments along the line of graph pattern mining
Thanks and Questions