gSpan: Graph-Based Substructure Pattern Mining
- Algorithm -
Presented By: Sadik Mussah
University of Vermont
CS 332 – Data Mining
Outline
• Background
• Problem Definition
• Authors' Contribution
• Concepts Behind gSpan
• Experimental Results
• Conclusion
Background
• Frequent Subgraph Mining Is An Extension Of Existing Frequent Pattern Mining Algorithms
• A Major Challenge Is To Count How Many Instances Of A Pattern Are In The Dataset
• Counting Instances Might Be Easy For Sets, But Is Subtle For Graphs
• This Leads To The Graph Isomorphism Problem
Background
Theorem
Given two graphs G and G’ (G prime), G is isomorphic to G’ iff min(G) = min(G’), where min(·) is the minimum DFS code of a graph.
04/12/16 Sadik Mussah
Background
[Figure: two isomorphic graphs (a) and (b), each with five nodes numbered 1 to 5 and labeled X, W, U, Y, V, with their mapping function (c)]
G1=(V1,E1,L1) G2=(V2,E2,L2)
• Two Graphs Are Isomorphic If One Can Find A Mapping Of Nodes Of The First Graph To The Second Graph Such That Labels On Nodes And Edges Are Preserved.
(c) Mapping function:
f(V1.1) = V2.2
f(V1.2) = V2.5
f(V1.3) = V2.3
f(V1.4) = V2.4
f(V1.5) = V2.1
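The definition of isomorphism above can be checked mechanically for a given mapping. The following is a minimal Python sketch and is not taken from the slides: the function name `is_isomorphism`, the assignment of the labels X, W, U, Y, V to node numbers, and the ring of edges with the single edge label "e" are all assumptions made for illustration, since the original figure is not recoverable. Only the mapping f comes from part (c).

```python
def is_isomorphism(f, g1, g2):
    """Return True if f maps graph g1 onto g2 preserving node and edge labels.

    Each graph is a pair (node_labels, edge_labels): node_labels maps a node
    id to its label, edge_labels maps an undirected edge (u, v) to its label.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    # f must be a bijection from the nodes of g1 onto the nodes of g2.
    if set(f) != set(nodes1) or sorted(f.values()) != sorted(nodes2):
        return False
    # Node labels must be preserved.
    if any(nodes1[u] != nodes2[f[u]] for u in nodes1):
        return False
    # Edges, taken as unordered pairs, must map onto each other with equal labels.
    mapped = {frozenset((f[u], f[v])): lab for (u, v), lab in edges1.items()}
    target = {frozenset(e): lab for e, lab in edges2.items()}
    return mapped == target


# Hypothetical example graphs (label placement and edges are assumed).
g1 = ({1: "X", 2: "W", 3: "U", 4: "Y", 5: "V"},
      {(1, 2): "e", (2, 3): "e", (3, 4): "e", (4, 5): "e", (5, 1): "e"})
g2 = ({2: "X", 5: "W", 3: "U", 4: "Y", 1: "V"},
      {(2, 5): "e", (5, 3): "e", (3, 4): "e", (4, 1): "e", (1, 2): "e"})
f = {1: 2, 2: 5, 3: 3, 4: 4, 5: 1}   # mapping (c) from the slide
identity = {v: v for v in g1[0]}     # a mapping that ignores the labels

print(is_isomorphism(f, g1, g2), is_isomorphism(identity, g1, g2))
```

Verifying one mapping is cheap. The hard part of the graph isomorphism problem is finding such a mapping among all possible ones.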
Problem: Finding Frequent Subgraphs
• Problem Setting: Similar To Finding Frequent Itemsets For Association Rule Discovery
• Input: A Database Of Graph Transactions And A Minimum Support Threshold
• Each Transaction Is An Undirected Simple Graph (No Multiple Edges)
• Each Graph Transaction Has Labeled Edges And Vertices
• Transactions Need Not Be Connected
• Output: Frequent Subgraphs That Satisfy The Support Threshold, Where Each Frequent Subgraph Is Connected
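To make the notion of support concrete, here is a small Python sketch that counts only the simplest patterns, single labeled edges, across a database of graph transactions. It is an illustration under assumptions: the function name `frequent_edges`, the data layout, and the three example transactions are hypothetical, and larger patterns would need a subgraph isomorphism test that this sketch does not attempt.

```python
from collections import Counter


def frequent_edges(database, min_support):
    """Count 1-edge patterns by the number of graph transactions containing them."""
    support = Counter()
    for node_labels, edge_labels in database:
        patterns = set()  # each pattern counts at most once per transaction
        for (u, v), edge_label in edge_labels.items():
            # Sort the endpoint labels so X-a-Y and Y-a-X are the same pattern.
            lo, hi = sorted((node_labels[u], node_labels[v]))
            patterns.add((lo, edge_label, hi))
        support.update(patterns)
    return {p: n for p, n in support.items() if n >= min_support}


# Hypothetical database of three small graph transactions.
database = [
    ({0: "X", 1: "Y", 2: "Z"}, {(0, 1): "a", (1, 2): "b"}),
    ({0: "Y", 1: "X", 2: "X"}, {(0, 1): "a", (1, 2): "a"}),
    ({0: "Z", 1: "Y"}, {(0, 1): "b"}),
]
result = frequent_edges(database, min_support=2)
print(result)
```

With a minimum support of 2, the X-a-X edge is dropped because it occurs in only one transaction.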
Finding Frequent Subgraphs
Authors' Contribution
• Representing Graphs As Strings (Like TreeMiner)
• No Candidate Generation!
• “It Combines The Growing And Checking Of Frequent Subgraphs Into One Procedure, Thus Accelerates The Mining Process.”
• Really Fast; Still A Standard Baseline System That Most Rivals Compare Their Systems To
Concepts Behind gSpan
• The Idea Is To Produce A Depth-First Search (DFS) Code For Each Graph, With One Entry Per Edge
• Edges Are Sorted According To The Lexicographic Order Of Their Codes
• Yan And Han Proved That Graph Isomorphism Can Be Tested For Two Graphs Annotated With DFS Codes
• Starting With Small Graph Patterns Containing One Edge, Patterns Are Expanded Systematically By The DFS Search
• Employs The Anti-Monotonic Property Of Graph Frequency: A Pattern Can Be Frequent Only If All Of Its Subgraphs Are Frequent
Lexicographic Ordering In Graph
• It Can Tell Us The Order Of Two Graphs.
• The Design Can Help Us Build A Similar Hierarchy.
• The Design Should Guarantee Easy Growing From One Level To The Next Lower Level And Easy Rolling Up From A Lower Level To A Higher Level.
• For Graphs, It May Be Difficult To Design The Hierarchy So That No Two Nodes In The Tree Represent The Same Graph.
• It Can Tell Us Whether A Graph Has Already Been Discovered.
• Most Importantly, If A Graph Has Been Discovered, All Its Child Nodes In The Hierarchy Must Have Been Discovered.
Lexicographic Ordering In Graph
[Figure: search hierarchy with one level each for 1-edge, 2-edge, and 3-edge graphs]
DFS Code And Minimum DFS Code
• We Use A 5-Tuple (vi, vj, L(vi), L(vj), L(vi,vj)) To Represent An Edge. (It May Be Redundant, But It Is Much Easier To Understand.)
• Turn A Graph Into A Sequence Whose Basic Element Is This 5-Tuple. Form The Sequence In The Following Order:
• To Extend By One New Node, Add The Forward Edge That Connects One Node In The Old Graph With This New Node.
• Add All Backward Edges That Connect This New Node To Other Nodes In The Old Graph.
• Repeat This Procedure.
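The construction order described above (one forward edge to the new node, then all backward edges from that node) can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `dfs_code`, the data layout, and the tie-break of visiting the smallest node id first are assumptions. The example graph is the deck's five-vertex example (v0 to v4 labeled x, y, x, z, z), reconstructed from its edge tuples.

```python
def dfs_code(node_labels, edge_labels, start):
    """Build one DFS code as a list of (i, j, L(vi), L(vj), L(vi,vj)) tuples."""
    adj = {v: set() for v in node_labels}
    for u, v in edge_labels:
        adj[u].add(v)
        adj[v].add(u)

    def edge_label(u, v):
        return edge_labels[(u, v)] if (u, v) in edge_labels else edge_labels[(v, u)]

    index = {start: 0}  # node id -> DFS discovery index
    code = []

    def emit(u, v):
        code.append((index[u], index[v], node_labels[u], node_labels[v], edge_label(u, v)))

    def visit(u):
        for w in sorted(adj[u]):  # assumed tie-break: smallest node id first
            if w in index:
                continue
            index[w] = len(index)
            emit(u, w)  # forward edge to the new node
            # Backward edges from the new node to earlier nodes other than its parent.
            for x in sorted((x for x in adj[w] if x in index and x != u), key=index.get):
                emit(w, x)
            visit(w)

    visit(start)
    return code


# Example graph reconstructed from the slides' edge list.
node_labels = {0: "x", 1: "y", 2: "x", 3: "z", 4: "z"}
edge_labels = {(0, 1): "a", (1, 2): "b", (2, 0): "a",
               (2, 3): "c", (3, 1): "b", (1, 4): "d"}
code = dfs_code(node_labels, edge_labels, start=0)
for number, edge in enumerate(code):
    print(f"e{number}: {edge}")
```

A different start vertex or a different visiting order gives a different code for the same graph, which is why a minimum code is needed as a canonical choice.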
DFS Code
[Figure: example graph with vertices v0 (X), v1 (Y), v2 (X), v3 (Z), v4 (Z) and edge labels a, a, b, b, c, d, grown one edge at a time]
e0: (0,1,x,y,a)
e1: (1,2,y,x,b)
e2: (2,0,x,x,a)
e3: (2,3,x,z,c)
e4: (3,1,z,y,b)
e5: (1,4,y,z,d)
DFS Code And Minimum DFS Code
[Figure: depth-first tree and forward/backward edge set]
Minimum DFS Code
Each graph may have many DFS codes (why?): every choice of start vertex and visiting order yields a different code.
The lexicographically smallest one is its minimum DFS code.

Edge no.  (B)            (C)            (D)
0         (0,1,x,y,a)    (0,1,y,x,a)    (0,1,x,x,a)
1         (1,2,y,x,b)    (1,2,x,x,a)    (1,2,x,y,b)
2         (2,0,x,x,a)    (2,0,x,y,b)    (2,0,y,x,a)
3         (2,3,x,z,c)    (2,3,x,z,c)    (2,3,y,z,b)
4         (3,1,z,y,b)    (3,0,z,y,b)    (3,1,z,x,c)
5         (1,4,y,z,d)    (0,4,y,z,d)    (2,4,y,z,d)
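As a rough illustration of picking the smallest code, the Python sketch below compares the three codes with plain tuple comparison. Two caveats apply. First, the codes are transcribed from the table with a few entries adjusted so that each column is a consistent DFS code of the same graph, which is an assumption about what the slide intended. Second, plain tuple comparison is a simplification: the DFS lexicographic order in gSpan has extra rules, for example ranking backward edges before forward edges. The helper `labeled_edges` is hypothetical and only checks that the three codes describe the same labeled edges.

```python
# DFS codes as lists of (i, j, L(vi), L(vj), L(vi,vj)) tuples.
codes = {
    "B": [(0, 1, "x", "y", "a"), (1, 2, "y", "x", "b"), (2, 0, "x", "x", "a"),
          (2, 3, "x", "z", "c"), (3, 1, "z", "y", "b"), (1, 4, "y", "z", "d")],
    "C": [(0, 1, "y", "x", "a"), (1, 2, "x", "x", "a"), (2, 0, "x", "y", "b"),
          (2, 3, "x", "z", "c"), (3, 0, "z", "y", "b"), (0, 4, "y", "z", "d")],
    "D": [(0, 1, "x", "x", "a"), (1, 2, "x", "y", "b"), (2, 0, "y", "x", "a"),
          (2, 3, "y", "z", "b"), (3, 1, "z", "x", "c"), (2, 4, "y", "z", "d")],
}


def labeled_edges(code):
    """Return the sorted list of (endpoint labels, edge label), ignoring vertex ids."""
    return sorted((tuple(sorted((li, lj))), le) for _, _, li, lj, le in code)


# All three codes should describe the same multiset of labeled edges.
same_graph = len({tuple(labeled_edges(c)) for c in codes.values()}) == 1

# Python compares lists of tuples element by element, i.e. lexicographically.
smallest = min(codes, key=codes.get)
print(same_graph, smallest)
```

Here the comparison is already decided by edge 0: (0,1,x,x,a) sorts before (0,1,x,y,a) and (0,1,y,x,a).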
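Graph Parent And Its Children
[Figure: example graph with vertex labels X, Y, X, Z, Z and edge labels a, b, c, a; candidate extensions are marked "?"]
Given a DFS code c0 = (e0, e1, …, en) and a code c1 = (e0, e1, …, en, ex) that extends it by one edge:
if c0 < c1, then c0 is c1’s parent and c1 is c0’s child.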
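The parent and child relation amounts to a one-edge prefix check. A minimal Python sketch, assuming DFS codes are stored as lists of 5-tuples and using a hypothetical helper name `is_parent`:

```python
def is_parent(c0, c1):
    """True if c1 equals c0 plus exactly one extra edge at the end."""
    return len(c1) == len(c0) + 1 and c1[:len(c0)] == c0


# Prefixes of the example code e0..e5 shown earlier in the deck.
c0 = [(0, 1, "x", "y", "a"), (1, 2, "y", "x", "b")]
c1 = c0 + [(2, 0, "x", "x", "a")]
c2 = c1 + [(2, 3, "x", "z", "c")]

print(is_parent(c0, c1), is_parent(c0, c2), c0 < c1)
```

In Python's list comparison a proper prefix always sorts before its extension, so c0 < c1 holds whenever c0 is the parent of c1.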
Theorem
• 1. Given Two Graphs G0 And G1, G0 Is Isomorphic To G1 Iff min_dfs_code(G0) = min_dfs_code(G1).
• 2. The DFS Code Tree Covers All Graphs, Although Some Tree Nodes May Represent The Same Graph.
• 3. Given A Node In The DFS Code Tree, If Its DFS Code Is Not The Minimum DFS Code Of Its Graph, Pruning This Node And All Its Descendants Does Not Change The “Covering”.
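Point 1 of the theorem can be tried out on small graphs by brute force: enumerate every DFS code of a graph, keep the smallest, and compare the results for two graphs. The Python sketch below does this under stated assumptions. The names `all_dfs_codes` and `min_dfs_code` are hypothetical, the graphs must be connected, and plain tuple comparison stands in for gSpan's DFS lexicographic order. Any fixed total order gives a canonical code, so the isomorphism test still works, but the minimum found here need not match the one in the paper. gSpan itself avoids this exhaustive enumeration.

```python
def all_dfs_codes(node_labels, edge_labels):
    """Yield every DFS code of a connected labeled graph (brute force)."""
    adj = {v: set() for v in node_labels}
    for u, v in edge_labels:
        adj[u].add(v)
        adj[v].add(u)

    def edge(index, u, v):
        lab = edge_labels[(u, v)] if (u, v) in edge_labels else edge_labels[(v, u)]
        return (index[u], index[v], node_labels[u], node_labels[v], lab)

    def extend(index, path, code):
        if not path:  # the DFS has backtracked past the start vertex
            yield tuple(code)
            return
        u = path[-1]
        fresh = [w for w in adj[u] if w not in index]
        if not fresh:  # nothing new to reach from u: backtrack
            yield from extend(index, path[:-1], code)
            return
        for w in fresh:  # branch over every possible next vertex
            new_index = {**index, w: len(index)}
            back = sorted((x for x in adj[w] if x in index and x != u), key=index.get)
            new_code = code + [edge(new_index, u, w)] + [edge(new_index, w, x) for x in back]
            yield from extend(new_index, path + [w], new_code)

    for start in node_labels:
        yield from extend({start: 0}, [start], [])


def min_dfs_code(node_labels, edge_labels):
    return min(all_dfs_codes(node_labels, edge_labels))


# The deck's example graph, a renumbered copy, and a copy with one label changed.
g0 = ({0: "x", 1: "y", 2: "x", 3: "z", 4: "z"},
      {(0, 1): "a", (1, 2): "b", (2, 0): "a", (2, 3): "c", (3, 1): "b", (1, 4): "d"})
p = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}  # arbitrary renumbering of the vertices
g1 = ({p[v]: lab for v, lab in g0[0].items()},
      {(p[u], p[v]): lab for (u, v), lab in g0[1].items()})
g2 = (g0[0], {**g0[1], (1, 4): "a"})

print(min_dfs_code(*g0) == min_dfs_code(*g1), min_dfs_code(*g0) == min_dfs_code(*g2))
```

The renumbered copy yields the same minimum code, while the graph with a changed edge label does not.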
DFS Code Tree
[Figure: DFS code tree with levels for 1-edge, 2-edge, and 3-edge graphs; one subtree is marked as pruned]
FSG: two substructure patterns and their
potential candidates.
AGM: two substructures joined by two chains
Algorithm
Algorithm: AprioriGraph
Algorithm: gSpan
04/12/16 Xifeng Yan
Experimental Results
Conclusion
• No Candidate Generation And No False-Positive Testing
• Space Saving From Depth-First Search
• Good Performance: Using A “Memory Pool” And One Major Counting Improvement, It Seems The Performance Could Be Improved About Five Times More (But More Testing Is Needed).
Questions
Q1) What Two Major Costs Of Apriori-Like Frequent Substructure Mining Algorithms Did gSpan Aim To Reduce Or Avoid?
• Answer:
1) Creating Size K+1 Candidate Subgraphs From Size K Frequent Subgraphs Is More Complicated And Costly Than Standard Apriori Large-Itemset Generation.
2) Pruning False Positives Is An Expensive Process, Because The Subgraph Isomorphism Problem Is NP-Complete.
Security Graph 3D Visualization
• https://www.youtube.com/watch?v=JsEm-CDj4qM
Questions (cont.)
• Q2) Which DFS Tree Does The DFS Code Below Belong To?
[Figure: candidate DFS trees over vertices v0 to v4 with vertex labels X, Y, Z and edge labels a, a, b, b, c, d]
Answer: tree (c)
Questions
• Q3) What Does gSpan Compare When Testing For Isomorphism Between Two Graphs, And Why?
• Answer: gSpan Compares The Minimum DFS Codes Of The Two Graphs. Given Two Graphs G And G’, G Is Isomorphic To G’ Iff min(G) = min(G’). This Theorem Allows A Simple String Comparison To Stand In For A More Complicated Graph Comparison. If Two Nodes Contain The Same Graph But Different DFS Codes, We Can Prune The Sub-Branch Of The Rightmost Of The Two Nodes. This Greatly Decreases The Problem Size.
Questions?


Editor's Notes

  • #4: Isomorphism: The graph isomorphism problem is the computational problem of determining whether two finite graphs are isomorphic. It belongs to NP and is one of a very small number of problems in NP that are neither known to be solvable in polynomial time nor known to be NP-complete.