gSpan: Graph-Based
Substructure Pattern Mining
- Algorithm -
Presented By: Sadik Mussah
University of Vermont
CS 332 – Data Mining
Outline
• Background
• Problem Definition
• Authors' Contribution
• Concepts Behind gSpan
• Experimental Result
• Conclusion
Background
• Frequent Subgraph Mining Is An Extension Of Existing
Frequent Pattern Mining Algorithms
• A Major Challenge Is To Count How Many Instances Of A
Pattern Are In The Dataset
• Counting Instances Is Easy For Sets, But Subtle For
Graphs
• This Leads To The Graph Isomorphism Problem
Background
Theorem
Given two graphs G and G’, G is isomorphic to G’ iff min(G)
= min(G’), where min(·) denotes the minimum DFS code of a graph.
04/12/16 Sadik Mussah
Background
[Figure: two isomorphic graphs (a) and (b), G1=(V1,E1,L1) and G2=(V2,E2,L2), each with five vertices numbered 1 to 5 and labeled X, W, U, Y, V, with their mapping function (c)]
• Two Graphs Are Isomorphic If One Can Find A One-To-One Mapping Of The Nodes Of
The First Graph To The Nodes Of The Second Graph Such That Adjacency And The Labels
On Nodes And Edges Are Preserved.
(c) Mapping function:
f(V1.1) = V2.2
f(V1.2) = V2.5
f(V1.3) = V2.3
f(V1.4) = V2.4
f(V1.5) = V2.1
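The mapping-based definition of isomorphism can be checked mechanically once a candidate mapping f is given. The sketch below is only an illustration: the graph representation (a dict of vertex labels plus a dict of labeled edges), the function name `is_isomorphism`, and the three-vertex example graphs are all assumptions, not taken from the slides.

```python
def is_isomorphism(g1, g2, f):
    """Return True if f maps g1 onto g2 while preserving vertex and edge labels.

    Each graph is a pair (vertex_labels, edge_labels), where edge_labels maps
    an undirected edge (u, v) to its label. This layout is an assumption.
    """
    v1, e1 = g1
    v2, e2 = g2
    # f must be a bijection between the two vertex sets.
    if set(f) != set(v1) or set(f.values()) != set(v2):
        return False
    if len(set(f.values())) != len(f):
        return False
    # Vertex labels must be preserved.
    if any(v1[u] != v2[f[u]] for u in v1):
        return False
    # The mapped edge set (with labels) must equal the edge set of g2.
    mapped = {frozenset((f[a], f[b])): lab for (a, b), lab in e1.items()}
    target = {frozenset(edge): lab for edge, lab in e2.items()}
    return mapped == target


# Hypothetical example: a labeled path X -a- Y -b- X, numbered two different ways.
g1 = ({1: "X", 2: "Y", 3: "X"}, {(1, 2): "a", (2, 3): "b"})
g2 = ({1: "X", 2: "X", 3: "Y"}, {(1, 3): "b", (3, 2): "a"})
f = {1: 2, 2: 3, 3: 1}

print(is_isomorphism(g1, g2, f))                   # True
print(is_isomorphism(g1, g2, {1: 1, 2: 2, 3: 3}))  # False
```

Verifying a given mapping is cheap; the hard part, which gSpan sidesteps with canonical codes, is finding such a mapping in the first place.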
Problem: Finding Frequent Subgraphs
• Problem Setting: Similar To Finding Frequent Itemsets For
Association Rule Discovery
• Input: Database Of Graph Transactions
• Undirected Simple Graphs (No Multiple Edges)
• Each Graph Transaction Has Labeled Edges/Vertices.
• Transactions May Not Be Connected
• Minimum Support Threshold
• Output: Frequent Subgraphs That Satisfy The Support Threshold,
Where Each Frequent Subgraph Is Connected.
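To make the problem setting concrete, the sketch below counts the support of a pattern, that is, the number of graph transactions that contain it as a subgraph. It is a brute-force illustration that tries every injective vertex assignment, so it is only usable on tiny graphs; it is not how gSpan counts support. The graph layout, the names `contains` and `support`, the toy database, and the threshold value are assumptions.

```python
from itertools import permutations


def contains(graph, pattern):
    """True if pattern occurs in graph as a label-preserving subgraph."""
    gv, ge = graph
    pv, pe = pattern
    g_edges = {frozenset(edge): lab for edge, lab in ge.items()}
    p_nodes = list(pv)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(gv, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        labels_ok = all(pv[u] == gv[f[u]] for u in p_nodes)
        edges_ok = all(
            g_edges.get(frozenset((f[a], f[b]))) == lab
            for (a, b), lab in pe.items()
        )
        if labels_ok and edges_ok:
            return True
    return False


def support(database, pattern):
    """Number of graph transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in database)


# Hypothetical database of three small labeled graphs.
db = [
    ({0: "X", 1: "Y", 2: "Z"}, {(0, 1): "a", (1, 2): "b"}),
    ({0: "X", 1: "Y"}, {(0, 1): "a"}),
    ({0: "X", 1: "Z"}, {(0, 1): "a"}),
]
pattern_xy = ({0: "X", 1: "Y"}, {(0, 1): "a"})
pattern_xz = ({0: "X", 1: "Z"}, {(0, 1): "a"})
min_support = 2  # assumed threshold

print(support(db, pattern_xy) >= min_support)  # True: occurs in 2 transactions
print(support(db, pattern_xz) >= min_support)  # False: occurs in 1 transaction
```

Each transaction contributes at most one to the support, no matter how many times the pattern occurs inside it.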
Finding Frequent Subgraphs
Authors' Contribution
• Representing Graphs As Strings (Like TreeMiner)
• No Candidate Generation!
• “It Combines The Growing And Checking Of Frequent Subgraphs
Into One Procedure, Thus Accelerates The Mining Process.”
• Fast In Practice; Still A Standard Baseline That Most Competing
Systems Are Compared Against.
Concepts Behind gSpan
• The Idea Is To Produce A Depth-First Search (DFS) Code For
Each Graph, With One Entry Per Edge
• Edges Are Sorted According To The Lexicographic Order Of Their Codes
• Yan And Han Proved That Graph Isomorphism Can Be Tested
By Comparing The Minimum DFS Codes Of Two Graphs
• Starting With Small Graph Patterns Containing 1 Edge, Patterns
Are Expanded Systematically By The DFS Search
• Employs The Anti-Monotonic Property Of Graph Frequency: A Graph
Cannot Be More Frequent Than Any Of Its Subgraphs
Lexicographic Ordering In Graph
• It Can Tell Us The Order Of Two Graphs.
• The Design Can Help Us Build A Similar Hierarchy.
• The Design Should Guarantee Easy Growing From One Level To
The Next Lower Level And Easy Rolling Up From A Lower Level To A
Higher Level.
• For Graphs, It May Be Difficult To Design A Hierarchy In Which No
Two Tree Nodes Represent The Same Graph.
• It Can Tell Us Whether A Graph Has Already Been Discovered.
• Most Importantly, If A Graph Has Been Discovered,
All Its Child Nodes In The Hierarchy Must Have Been
Discovered Too.
Lexicographic Ordering In Graph
[Figure: search hierarchy with one level each for 1-edge, 2-edge, 3-edge, ... graphs]
DFS Code And Minimum DFS Code
• We Use A 5-Tuple (vi, vj, L(vi), L(vj), L(vi,vj)) To Represent An Edge. (It May Be
Redundant, But It Is Much Easier To Understand.)
• Turn A Graph Into A Sequence Whose Basic Element Is A 5-Tuple. Form The
Sequence In This Order:
• To Extend By One New Node, Add The Forward Edge
That Connects One Node In The Old Graph With This
New Node.
• Add All Backward Edges That Connect This New Node
To Other Nodes In The Old Graph.
• Repeat This Procedure.
DFS code
[Figure: example graph with vertices v0 (X), v1 (Y), v2 (X), v3 (Z), v4 (Z) and edge labels a, b, c, d, grown one edge at a time]
e0: (0,1,x,y,a)
e1: (1,2,y,x,b)
e2: (2,0,x,x,a)
e3: (2,3,x,z,c)
e4: (3,1,z,y,b)
e5: (1,4,y,z,d)
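The construction procedure (forward edge to a new node, then all backward edges from that node, repeated) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. The function name `dfs_code`, the vertex names p to t, and the choice to visit neighbors in sorted order are assumptions; a different visiting order gives a different DFS code of the same graph. The example graph has the vertex and edge labels of the five-vertex example, where v1 is labeled Y and v3 is labeled Z.

```python
def dfs_code(vlabels, edges, start):
    """Build one DFS code as a list of 5-tuples (i, j, L(vi), L(vj), L(vi,vj)).

    vlabels maps vertex -> label; edges maps an undirected edge (u, v) -> label.
    Neighbors are visited in sorted order, which is an arbitrary assumption.
    """
    elabel = {frozenset(e): lab for e, lab in edges.items()}
    adj = {u: [] for u in vlabels}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    index = {start: 0}  # vertex -> discovery index
    code = []

    def visit(u):
        for v in sorted(adj[u]):
            if v in index:
                continue  # this edge was already emitted as a forward or backward edge
            index[v] = len(index)
            # Forward edge: connects the old graph to the new vertex v.
            code.append((index[u], index[v], vlabels[u], vlabels[v],
                         elabel[frozenset((u, v))]))
            # Backward edges: from v to already discovered vertices other than u.
            back = [w for w in adj[v] if w in index and w != u]
            for w in sorted(back, key=index.get):
                code.append((index[v], index[w], vlabels[v], vlabels[w],
                             elabel[frozenset((v, w))]))
            visit(v)

    visit(start)
    return code


# Vertex names p..t are hypothetical stand-ins for the five example vertices.
vlabels = {"p": "x", "q": "y", "r": "x", "s": "z", "t": "z"}
edges = {("p", "q"): "a", ("q", "r"): "b", ("r", "p"): "a",
         ("r", "s"): "c", ("s", "q"): "b", ("q", "t"): "d"}

for i, e in enumerate(dfs_code(vlabels, edges, "p")):
    print(f"e{i}: {e}")
# e0: (0, 1, 'x', 'y', 'a')
# e1: (1, 2, 'y', 'x', 'b')
# e2: (2, 0, 'x', 'x', 'a')
# e3: (2, 3, 'x', 'z', 'c')
# e4: (3, 1, 'z', 'y', 'b')
# e5: (1, 4, 'y', 'z', 'd')
```

In an undirected DFS, every already discovered neighbor of a newly discovered vertex is one of its ancestors, which is why emitting the backward edges right after the forward edge covers every edge exactly once.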
DFS Code And Minimum DFS Code
[Figure: Depth First Tree And Forward/Backward Edge Set]
Minimum DFS code
Each graph may have many DFS codes (why?):
the lexicographically smallest one is its Minimum DFS Code.
Edge no.  (B)            (C)            (D)
0         (0,1,x,y,a)    (0,1,y,x,a)    (0,1,x,x,a)
1         (1,2,y,x,b)    (1,2,x,x,a)    (1,2,x,y,b)
2         (2,0,x,x,a)    (2,0,x,y,b)    (2,0,y,x,a)
3         (2,3,x,z,c)    (2,3,x,z,c)    (2,3,y,z,b)
4         (3,1,z,y,b)    (3,0,z,y,b)    (3,1,z,x,c)
5         (1,4,y,z,d)    (0,4,y,z,d)    (2,4,y,z,d)
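Graph Parent And Its Children
[Figure: a graph with vertices labeled X, Y, X, Z, Z and edge labels a, b, c, a, with its possible one-edge extensions marked "?"]
Given a DFS code c0 = (e0, e1, …, en),
if c1 = (e0, e1, …, en, ex) and c0 < c1, then
c0 is c1’s parent and
c1 is c0’s child.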
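The parent and child relation between DFS codes amounts to a one-edge prefix extension, which a short sketch can make explicit. The function name `is_child` and the example tuples are assumptions for illustration. Python's built-in sequence comparison is used here only because it also ranks a proper prefix before its extension.

```python
def is_child(c0, c1):
    """True if c1 extends c0 by exactly one edge tuple (c0 is then c1's parent)."""
    return len(c1) == len(c0) + 1 and c1[:len(c0)] == c0


# Hypothetical codes: c1 adds one backward edge to c0.
c0 = ((0, 1, "x", "y", "a"), (1, 2, "y", "x", "b"))
c1 = c0 + ((2, 0, "x", "x", "a"),)

print(is_child(c0, c1))  # True
print(is_child(c1, c0))  # False
print(c0 < c1)           # True: a proper prefix sorts before its extension
```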
Theorem
• 1. Given Two Graphs G0 And G1, G0 Is Isomorphic To G1 Iff
min_dfs_code(G0) = min_dfs_code(G1).
• 2. The DFS Code Tree Covers All Graphs, Although Some Tree Nodes May
Represent The Same Graph.
• 3. Given A Node In The DFS Code Tree, If Its DFS Code Is Not The Minimum DFS
Code Of Its Graph, Pruning This Node And All Its Descendants Does Not Change The “Covering”.
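The first statement of the theorem can be tried out on small graphs by brute force: enumerate every DFS code of a connected graph, take the smallest, and compare. The sketch below is an assumption-laden illustration, not gSpan itself. It uses Python's built-in tuple comparison as a stand-in for the DFS lexicographic order defined by Yan and Han, so the minimum it returns need not coincide with gSpan's minimum DFS code; any fixed total order still yields a canonical form that is equal exactly for isomorphic graphs with at least one edge. The enumeration is exponential, and the names `all_dfs_codes` and `min_dfs_code` and the example graphs are hypothetical.

```python
def all_dfs_codes(vlabels, edges):
    """Yield every DFS code (tuple of 5-tuples) of a connected labeled graph."""
    elabel = {frozenset(e): lab for e, lab in edges.items()}
    adj = {u: set() for u in vlabels}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def grow(stack, index, code):
        if len(index) == len(vlabels):
            yield tuple(code)
            return
        # Backtrack to the deepest vertex that still has an undiscovered neighbor.
        while all(v in index for v in adj[stack[-1]]):
            stack = stack[:-1]
        u = stack[-1]
        for v in adj[u]:
            if v in index:
                continue
            new_index = {**index, v: len(index)}
            # Forward edge to the new vertex, then its backward edges.
            new_code = code + [(index[u], new_index[v], vlabels[u], vlabels[v],
                                elabel[frozenset((u, v))])]
            back = sorted((w for w in adj[v] if w in index and w != u),
                          key=index.get)
            new_code += [(new_index[v], index[w], vlabels[v], vlabels[w],
                          elabel[frozenset((v, w))]) for w in back]
            yield from grow(stack + [v], new_index, new_code)

    for start in vlabels:
        yield from grow([start], {start: 0}, [])


def min_dfs_code(vlabels, edges):
    """Smallest DFS code under plain tuple order (a simplifying assumption)."""
    return min(all_dfs_codes(vlabels, edges))


# The same hypothetical graph under two different vertex namings.
ga = ({"p": "x", "q": "y", "r": "x", "s": "z", "t": "z"},
      {("p", "q"): "a", ("q", "r"): "b", ("r", "p"): "a",
       ("r", "s"): "c", ("s", "q"): "b", ("q", "t"): "d"})
gb = ({3: "x", 1: "y", 5: "x", 2: "z", 4: "z"},
      {(3, 1): "a", (1, 5): "b", (5, 3): "a",
       (5, 2): "c", (2, 1): "b", (1, 4): "d"})
# A third graph that differs from ga in one edge label.
gc = (ga[0], {**ga[1], ("q", "t"): "e"})

print(min_dfs_code(*ga) == min_dfs_code(*gb))  # True: isomorphic
print(min_dfs_code(*ga) == min_dfs_code(*gc))  # False: not isomorphic
```

Comparing two canonical codes is a plain sequence comparison, which is the practical payoff of the theorem: no vertex mapping between the two graphs has to be searched for at comparison time.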
DFS Code Tree
[Figure: DFS code tree with one level each for 1-edge, 2-edge, 3-edge, ... graphs; one subtree is marked as pruned]
FSG: two substructure patterns and their potential candidates.
AGM: two substructures joined by two chains
Algorithm
Algorithm: AprioriGraph
Algorithm: gSpan
04/12/16 Xifeng Yan
Experimental Result
Conclusion
• No Candidate Generation And No False Test
• Space Saving From Depth-First Search
• Good Performance: Using A “Memory Pool” And One Major
Counting Improvement, Performance Appears To Improve About
5 Times Further (More Testing Is Needed).
Questions
Q1) What Two Major Costs Of Apriori-Like Frequent
Substructure Mining Algorithms Did gSpan Aim To
Reduce/Avoid?
• Answer:
1) The Creation Of Size K+1 Candidate Subgraphs From Size K
Frequent Subgraphs Is More Complicated And Costly Than
Standard Apriori Large Itemset Generation.
2) Pruning False Positives Is An Expensive Process. The Subgraph
Isomorphism Problem Is NP-Complete.
Security Graph 3D Visualization
• https://www.youtube.com/watch?v=JsEm-CDj4qM
Questions (cont.)
• Q2) Which DFS Tree Does The DFS Code Below Belong To?
[Figure: candidate DFS trees over vertices v0 to v4, with vertex labels x, y, z and edge labels a, b, c, d]
Answer: tree (c)
Questions
• Q3) What Does gSpan Compare When Testing For
Isomorphism Between Two Graphs, And Why?
• Answer: gSpan Compares The Minimum DFS Codes Of The Two
Graphs. Given Two Graphs G And G’, G Is Isomorphic To G’ Iff
min(G) = min(G’). This Theorem Allows A Simple String
Comparison In Place Of A More Complicated Graph Comparison. If Two
Nodes Of The DFS Code Tree Represent The Same Graph But Have Different
DFS Codes, We Can Prune The Sub-Branch Of The Node Whose Code Is Not
Minimum (The Rightmost Of The Two). This
Greatly Decreases The Problem Size.
Questions?

Editor's Notes

  • #4: Isomorphism: The graph isomorphism problem is the computational problem of determining whether two finite graphs are isomorphic. It is in NP, and it is one of a very small number of problems in NP that are neither known to be solvable in polynomial time nor known to be NP-complete.