
Mining, Indexing and Searching Graph Databases

Presenter: A/Prof. Do Phuc
Source: Jiawei Han, Vladimir Lipets



July 22, 2010

Graph, Graph, Everywhere


[Figures: Aspirin; yeast protein interaction network, from H. Jeong et al., Nature 411, 41 (2001); an Internet Web; co-author network]

Why Graph Mining and Searching?


- Graphs are ubiquitous
  - Chemical compounds (cheminformatics)
  - Protein structures, biological pathways/networks (bioinformatics)
  - Program control flow, traffic flow, and workflow analysis
  - XML databases, Web, and social network analysis
- Graph is a general model
  - Trees, lattices, sequences, and items are degenerate graphs
- Diversity of graphs
  - Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
- Complexity of algorithms: many problems are of high complexity!

Outline
- Graph Isomorphism, Subgraph Isomorphism
- Mining frequent graph patterns
- Graph indexing methods
- Similarity search in graph databases
- Biological network analysis

Motivation
Graph and subgraph isomorphism are important and very general forms of pattern matching that find practical application in areas such as:
- pattern recognition and computer vision
- image processing
- computer-aided design
- graph grammars
- graph transformation
- biocomputing
- search operations in chemical databases
- numerous others

A hierarchy of pattern matching problems
- Graph isomorphism
- Subgraph isomorphism
- Maximum common subgraph
- Approximate subgraph isomorphism
- Graph edit distance

Isomorphic Graphs


Graph Isomorphism


Subgraph of a given graph


Subgraph Isomorphism


Subgraph Isomorphism and Related Problems


Given a pattern graph G and a target graph H:
- Decision problem: Answer whether H contains a subgraph isomorphic to G
- Search problem: Return an occurrence of G as a subgraph of H
- Counting problem: Return a count of the number of subgraphs of H that are isomorphic to G
- Enumeration problem: Return all occurrences of G as a subgraph of H
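The four problem variants above can be illustrated with a minimal brute-force sketch. It assumes undirected, unlabeled graphs stored as adjacency dictionaries and the non-induced notion of subgraph; the function names (`embeddings`, `occurrences`, `find_one`, `contains`, `count`) are hypothetical and chosen only for this illustration. It is exponential in the pattern size and is meant for tiny graphs, not as a practical algorithm.

```python
from itertools import permutations

def embeddings(pattern, target):
    """Yield each injective vertex mapping that sends every pattern edge to a target edge.

    Graphs are dicts {vertex: set of neighbours} (undirected, unlabeled).
    Brute force over ordered vertex selections, so only suitable for tiny graphs.
    """
    p_nodes = list(pattern)
    for image in permutations(list(target), len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if all(m[v] in target[m[u]] for u in pattern for v in pattern[u]):
            yield m

def occurrences(pattern, target):
    """Enumeration problem: distinct subgraphs of target isomorphic to pattern."""
    found = set()
    for m in embeddings(pattern, target):
        nodes = frozenset(m.values())
        edges = frozenset(frozenset((m[u], m[v]))
                          for u in pattern for v in pattern[u])
        found.add((nodes, edges))  # different mappings onto the same subgraph collapse
    return found

def find_one(pattern, target):
    """Search problem: one embedding, or None if there is none."""
    return next(embeddings(pattern, target), None)

def contains(pattern, target):
    """Decision problem: does target contain a subgraph isomorphic to pattern?"""
    return find_one(pattern, target) is not None

def count(pattern, target):
    """Counting problem: number of distinct occurrences."""
    return len(occurrences(pattern, target))

# Pattern G: a triangle.
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
# Target H: the 4-cycle a-b-c-d plus the chord a-c.
target = {"a": {"b", "c", "d"}, "b": {"a", "c"},
          "c": {"a", "b", "d"}, "d": {"a", "c"}}
```

With these example graphs, `contains(triangle, target)` is true and `count(triangle, target)` finds the two triangles a-b-c and a-c-d.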


Outline
- Graph Isomorphism, Subgraph Isomorphism
- Mining frequent graph patterns
- Graph indexing methods
- Similarity search in graph databases
- Biological network analysis

Graph Pattern Mining


Frequent subgraphs
- A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
- Mining biochemical structures
- Program control flow analysis
- Mining XML structures or Web communities
- Building blocks for graph classification, clustering, comparison, and correlation analysis
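The support definition can be made concrete for the simplest possible patterns, single labeled edges. The sketch below is an illustration under stated assumptions: each graph is reduced to a list of edges given as pairs of vertex labels, support is counted as the number of graphs containing the pattern, and the dataset and the names `edge_patterns`, `edge_support`, and `frequent_edges` are hypothetical. Larger patterns would additionally need a subgraph isomorphism test.

```python
from collections import Counter

def edge_patterns(graph):
    """Single-edge patterns of one graph, as order-independent label pairs."""
    return {tuple(sorted(edge)) for edge in graph}

def edge_support(dataset):
    """For each single-edge pattern, the number of graphs that contain it."""
    support = Counter()
    for graph in dataset:
        # A set is used so that each graph adds at most 1 per pattern.
        support.update(edge_patterns(graph))
    return support

def frequent_edges(dataset, min_support):
    """Patterns whose support is no less than the minimum support threshold."""
    return {p for p, s in edge_support(dataset).items() if s >= min_support}

# Hypothetical dataset: three small molecules as lists of labeled edges.
dataset = [
    [("C", "O"), ("C", "N"), ("C", "C")],
    [("C", "O"), ("C", "S")],
    [("N", "C"), ("O", "C"), ("C", "O")],
]
frequent = frequent_edges(dataset, min_support=2)
```

Here the C-O edge has support 3 and the C-N edge has support 2, so both are frequent at a threshold of 2, while C-C and C-S (support 1 each) are not.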

Example: Frequent Subgraphs


[Figure: graph dataset of three chemical structures (A), (B), (C), and the frequent patterns (1), (2) found with min support 2]

Frequent Subgraph Mining Approaches


Apriori-based approach
- AGM/AcGM: Inokuchi, et al. (PKDD'00)
- FSG: Kuramochi and Karypis (ICDM'01)
- PATH: Vanetik and Gudes (ICDM'02, ICDM'04)
- FFSM: Huan, et al. (ICDM'03)

Pattern growth-based approach
- MoFa: Borgelt and Berthold (ICDM'02)
- gSpan: Yan and Han (ICDM'02)
- Gaston: Nijssen and Kok (KDD'04)

Properties of Graph Mining Algorithms


- Search order: breadth vs. depth
- Generation of candidate subgraphs: apriori vs. pattern growth
- Elimination of duplicate subgraphs: passive vs. active
- Support calculation: embedding store or not
- Discovery order of patterns: path, then tree, then graph

Outline
- Mining frequent graph patterns
- Graph indexing methods
- Similarity search in graph databases
- Biological network analysis

Graph Search: Querying Graph Databases


Querying graph databases: Given a graph database and a query graph, find all graphs containing this query graph

[Figure: a query graph and a graph database of chemical structures]

Scalability Issue
- Sequential scan
  - Disk I/O
  - Subgraph isomorphism testing
- An indexing mechanism is needed
  - DayLight: Daylight.com (commercial)
  - GraphGrep: Dennis Shasha, et al. PODS'02
  - Grace: Srinath Srinivasa, et al. ICDE'03

[Figure: sample database of chemical structures (a), (b), (c) and a query graph]

Indexing Strategy
If graph G contains query graph Q, G should contain any substructure of Q.

[Figure: query graph (Q), a substructure of it, and a graph (G)]

Remarks
- Index substructures of a query graph to prune graphs that do not contain these substructures

Framework
Two steps in processing graph queries

Step 1. Index Construction
- Enumerate structures in the graph database, build an inverted index between structures and graphs

Step 2. Query Processing
- Enumerate structures in the query graph
- Calculate the candidate graphs containing these structures
- Prune the false positive answers by performing subgraph isomorphism test
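The two steps above can be sketched as a small filter-and-verify pipeline. This is an illustration under simplifying assumptions, not the indexing scheme of any particular system named in these slides: the indexed "structures" are just labeled edges, a graph is a pair `(labels, edges)`, verification is a brute-force labeled subgraph isomorphism test, and all names and the example database are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def features(graph):
    """Structures to index; here only labeled edges (a simplifying assumption)."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def build_index(database):
    """Step 1: inverted index from each structure to the ids of graphs containing it."""
    index = defaultdict(set)
    for gid, graph in database.items():
        for f in features(graph):
            index[f].add(gid)
    return index

def is_subgraph(query, graph):
    """Brute-force labeled subgraph isomorphism test, used for verification."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    g_edge_set = {frozenset(e) for e in g_edges}
    q_nodes = list(q_labels)
    for image in permutations(list(g_labels), len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if (all(q_labels[v] == g_labels[m[v]] for v in q_nodes)
                and all(frozenset((m[u], m[v])) in g_edge_set for u, v in q_edges)):
            return True
    return False

def search(query, database, index):
    """Step 2: intersect posting lists, then prune false positives."""
    candidates = set(database)
    for f in features(query):
        candidates &= index.get(f, set())
    answers = {gid for gid in candidates if is_subgraph(query, database[gid])}
    return candidates, answers

# Hypothetical database of three small labeled graphs.
database = {
    "g1": ({1: "C", 2: "O", 3: "N"}, [(1, 2), (1, 3)]),                   # O-C-N
    "g2": ({1: "C", 2: "O", 3: "C", 4: "N"}, [(1, 2), (1, 3), (3, 4)]),   # O-C-C-N
    "g3": ({1: "C", 2: "O"}, [(1, 2)]),                                   # C-O
}
query = ({"a": "O", "b": "C", "c": "N"}, [("a", "b"), ("b", "c")])        # O-C-N
index = build_index(database)
candidates, answers = search(query, database, index)
```

In this example the index prunes `g3`, which has no C-N edge. `g2` survives filtering because it contains both a C-O and a C-N edge, but verification rejects it since the two edges do not share a carbon; only `g1` is a true answer.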

Outline
- Mining frequent graph patterns
- Graph indexing methods
- Similarity search in graph databases
- Biological network analysis
- Some recent progress on graph mining

Graph Clustering
Graph similarity measure
- Feature-based similarity measure
  - Each graph is represented as a feature vector
  - The similarity is defined by the distance of their corresponding vectors
  - Frequent subgraphs can be used as features
- Structure-based similarity measure
  - Maximal common subgraph
  - Graph edit distance: insertion, deletion, and relabel
  - Graph alignment distance

Graph Classification
- Local structure based approach
  - Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length
- Graph pattern-based approach
  - Subgraph patterns from domain knowledge
  - Subgraph patterns from data mining
- Kernel-based approach
  - Random walk (Gärtner '02, Kashima et al. '02, ICML'03, Mahé et al. ICML'04)
  - Optimal local assignment (Fröhlich et al. ICML'05)

Structure Similarity Search


[Figure: chemical compounds (a) caffeine, (b) diurobromine, (c) viagra, and a query graph]

Some Straightforward Methods


- Method 1: Directly compute the similarity between the graphs in the DB and the query graph
  - Sequential scan
  - Subgraph similarity computation
- Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search
  - Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
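The figure of 1,140 is the number of ways to choose which 3 of the 20 edges to drop, C(20, 3) = 20·19·18 / 6. A short sketch confirms the count by generating the relaxed queries explicitly; the 20-edge path used as the query and the function name `relaxed_queries` are hypothetical, picked only for the illustration.

```python
from itertools import combinations
from math import comb

def relaxed_queries(edges, missing):
    """Yield every sub-query obtained by deleting exactly `missing` edges."""
    for removed in combinations(edges, missing):
        dropped = set(removed)
        yield [e for e in edges if e not in dropped]

# Hypothetical 20-edge query: a simple path 0-1-2-...-20.
query_edges = [(i, i + 1) for i in range(20)]

n_generated = sum(1 for _ in relaxed_queries(query_edges, 3))
n_expected = comb(20, 3)  # 20 * 19 * 18 / 6
```

Both `n_generated` and `n_expected` equal 1140. Some of the generated sub-queries may be disconnected or isomorphic to one another, so this is an upper bound on the number of distinct exact searches.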

Index: Precise vs. Approximate Search


- Precise Search
  - Use frequent patterns as indexing features
  - Select features in the database space based on their selectivity
  - Build the index
- Approximate Search
  - Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases
  - Idea: (1) keep the index structure, (2) select features in the query space

Substructure Similarity Measure


- Query relaxation measure
  - The number of edges that can be relabeled or missed; the positions of these edges are not fixed

[Figure: query graph]

Substructure Similarity Measure


- Feature-based similarity measure
  - Each graph is represented as a feature vector X = {x1, x2, ..., xn}
  - The similarity is defined by the distance of their corresponding vectors
- Advantages
  - Easy to index
  - Fast
  - Rough measure
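A feature-based measure of this kind can be sketched as follows. The slide does not say which distance is used, so the L1 (Manhattan) distance here is an assumption, as are the feature names and counts; each xi is taken to be the number of occurrences of feature i in the graph.

```python
def feature_vector(feature_counts, feature_list):
    """Vector X = (x1, ..., xn): occurrences of each indexed feature in one graph."""
    return [feature_counts.get(f, 0) for f in feature_list]

def l1_distance(x, y):
    """Manhattan distance between two feature vectors (an assumed choice)."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical indexed features and per-graph occurrence counts.
feature_list = ["C-C", "C-N", "C-O"]
graph_counts = {"C-C": 3, "C-O": 1}
query_counts = {"C-C": 2, "C-N": 1, "C-O": 1}

x_graph = feature_vector(graph_counts, feature_list)   # [3, 0, 1]
x_query = feature_vector(query_counts, feature_list)   # [2, 1, 1]
distance = l1_distance(x_graph, x_query)               # |3-2| + |0-1| + |1-1| = 2
```

The measure is cheap because it never looks at the graph structure again once the vectors are built, which is also why it is only a rough measure: two structurally different graphs can share the same feature vector.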

Query Processing Framework


Three steps in processing approximate graph queries

Step 1. Index Construction
- Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database

Framework (cont.)
Step 2. Feature Miss Estimation
- Determine the indexed features belonging to the query graph
- Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J
- On the query graph, not the graph database
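One way to read Step 2 is as a worst-case count: if up to k query edges may be relaxed, J is the largest number of feature occurrences in the query that any choice of k edges can destroy. The sketch below computes that bound exhaustively. It is an assumption-laden illustration (the slides do not give the estimation procedure, and exhaustive search is only feasible for small queries); the representation of feature occurrences as sets of query edges and the name `max_feature_misses` are hypothetical.

```python
from itertools import combinations

def max_feature_misses(occurrences, query_edges, k):
    """Largest number of feature occurrences hit by relaxing any k query edges.

    `occurrences` holds, for each occurrence of an indexed feature in the
    query graph, the set of query edges that occurrence uses.
    """
    best = 0
    for removed in combinations(query_edges, k):
        removed = set(removed)
        # An occurrence is lost as soon as one of its edges is relaxed.
        hit = sum(1 for occ in occurrences if occ & removed)
        best = max(best, hit)
    return best

# Hypothetical query: a path with edges e1, e2, e3.
query_edges = ["e1", "e2", "e3"]
# Indexed features occurring in it: three single edges and two 2-edge paths.
occurrences = [{"e1"}, {"e2"}, {"e3"}, {"e1", "e2"}, {"e2", "e3"}]

J1 = max_feature_misses(occurrences, query_edges, 1)
J2 = max_feature_misses(occurrences, query_edges, 2)
```

For this query, relaxing the middle edge `e2` destroys three occurrences, so `J1` is 3; any two edges destroy four, so `J2` is 4. Note that the computation uses only the query graph, matching the last bullet above.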

Framework (cont.)
Step 3. Query Processing
- Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG - FQ
- If FG - FQ > J, discard G. The remaining graphs constitute a candidate answer set
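The filtering in Step 3 can be sketched with a feature-graph matrix stored as a dictionary of per-graph feature counts. The slide does not spell out how the difference is computed; this sketch assumes it counts query feature occurrences that graph G cannot account for, summed over features. The matrix contents, the threshold, and the names `feature_misses` and `filter_candidates` are hypothetical.

```python
def feature_misses(query_counts, graph_counts):
    """Query feature occurrences missing from the graph (assumed difference measure)."""
    return sum(max(0, n - graph_counts.get(f, 0))
               for f, n in query_counts.items())

def filter_candidates(matrix, query_counts, J):
    """Keep graphs whose feature misses do not exceed the bound J."""
    return [gid for gid, counts in matrix.items()
            if feature_misses(query_counts, counts) <= J]

# Hypothetical feature-graph matrix: rows are graphs, columns are features.
matrix = {
    "g1": {"C-C": 3, "C-O": 1},
    "g2": {"C-C": 1},
    "g3": {"C-C": 2, "C-N": 1, "C-O": 1},
}
query_counts = {"C-C": 2, "C-N": 1, "C-O": 1}

candidate_set = filter_candidates(matrix, query_counts, J=1)
```

With J = 1, `g1` misses one feature occurrence and is kept, `g2` misses three and is discarded, and `g3` misses none. The surviving graphs are only candidates and would still need a structural similarity check.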

Outline
- Mining frequent graph patterns
- Graph indexing methods
- Similarity search in graph databases
- Biological network analysis

Biological Networks
- Protein-protein interaction network
- Metabolic network
- Transcriptional regulatory network
- Co-expression network
- Genetic interaction network

Data Mining Across Multiple Networks


[Figure: multiple networks over nodes a to k]

Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets


[Figure: expression matrices (genes g1, g2 by conditions c1, c2, ..., cm) and networks over nodes a to k]

CODENSE: Mine Coherent Dense Subgraphs


(1) Builds a summary graph by eliminating infrequent edges
[Figure: graphs G1 to G6 over nodes a to i, and the resulting summary graph]
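Building the summary graph in step (1) amounts to counting, for each edge, how many input graphs contain it and keeping the edges that meet a frequency threshold. The sketch below assumes all graphs share one vertex set and are given as edge lists; the threshold, the example graphs, and the name `summary_graph` are hypothetical.

```python
from collections import Counter

def summary_graph(graphs, min_support):
    """Edges that occur in at least `min_support` of the input graphs."""
    counts = Counter()
    for edges in graphs:
        # frozenset makes (u, v) and (v, u) the same undirected edge,
        # and the set ensures each graph counts an edge at most once.
        counts.update({frozenset(e) for e in edges})
    return {e for e, c in counts.items() if c >= min_support}

# Hypothetical graphs over the shared vertex set {a, b, c, d}.
G1 = [("a", "b"), ("b", "c")]
G2 = [("b", "c"), ("a", "c")]
G3 = [("a", "b"), ("a", "c"), ("c", "d")]

summary = summary_graph([G1, G2, G3], min_support=2)
```

Edges a-b, b-c, and a-c each occur in two graphs and are kept; c-d occurs once and is eliminated as infrequent.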

CODENSE: Mine Coherent Dense Subgraphs

(2) Identify dense subgraphs of the summary graph


[Figure: Step 2 applies MODES to the summary graph]

Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true.
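A small counterexample shows why the reverse direction of the observation fails. In the sketch below (hypothetical data; density is taken to be the fraction of possible edges present, which is an assumed definition), every edge of a triangle is frequent, so the triangle is fully dense in the summary graph, yet no single input graph contains the whole triangle.

```python
def edge_set(pairs):
    """Undirected edges as a set of frozensets."""
    return {frozenset(p) for p in pairs}

def density(vertices, edges):
    """Fraction of the possible edges among `vertices` that are present."""
    n = len(vertices)
    inside = sum(1 for e in edges if e <= vertices)
    return 2 * inside / (n * (n - 1))

# Three hypothetical graphs; each contains only two sides of the triangle a-b-c.
graphs = [
    edge_set([("a", "b"), ("b", "c")]),
    edge_set([("b", "c"), ("a", "c")]),
    edge_set([("a", "b"), ("a", "c")]),
]
min_support = 2

# Summary graph: edges present in at least min_support graphs.
all_edges = set().union(*graphs)
summary = {e for e in all_edges if sum(e in g for g in graphs) >= min_support}

triangle = edge_set([("a", "b"), ("b", "c"), ("a", "c")])
density_in_summary = density({"a", "b", "c"}, summary)
triangle_support = sum(triangle <= g for g in graphs)  # graphs containing all 3 edges
```

Here `density_in_summary` is 1.0 while `triangle_support` is 0, so a dense subgraph of the summary graph is not necessarily a frequent dense subgraph, which is why a further check against the original graphs is needed.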

Applying CoDense to 39 Yeast Microarray Data Sets


[Figure: expression matrices (genes g1, g2 by conditions c1, c2, ..., cm) and networks over nodes a to k]

Discovery of New Genes Based on Similar Genes

[Figure: network over the genes YDR115W, MRP49, PHB1, PET100, ATP17, MRPL51, ATP12, MRPL37, MRPL38, ACN9, MRPL32, MRPS18, MRPL39, FMC1]

Network of Known Similar Genes


[Figure: network over the genes ATP17, MRP49, MRPL51, PHB1, PET100, ATP12, YDR115W, MRPL38, ACN9, MRPL32, MRPS18, MRPL39, FMC1]

Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18. GO:0019538 (protein metabolism; p-value = 0.001122)

Network Involved in the New Genes


[Figure: network over the genes YDR115W, MRP49, PHB1, PET100, MRPL51, ATP12, ATP17, MRPL37, MRPL38, ACN9, MRPL32, MRPS18, FMC1, MRPL39]

Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100. GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)

Outline
- Mining frequent graph patterns
- Graph indexing methods
- Similarity search in graph databases
- Biological network analysis

Conclusions
- Graph mining has wide applications
- Frequent and closed subgraph mining methods
  - gSpan and CloseGraph: pattern-growth depth-first search approach
- Graph indexing techniques
  - Frequent and discriminative subgraphs as indexing features
- Similarity search in graph databases
  - Indexing and approximate matching help similar subgraph search
- Biological network analysis
  - Mining coherent, dense, multiple biological networks
- Many new developments along the line of graph pattern mining

Thanks and Questions
