Distributed Graph Mining 
Presented By 
Sayeed Mahmud
Motivation
Motivation 
• The reason Big Data is here 
– To make it practical to process data that would be impossible or overwhelming to handle with existing single-machine tools 
• Some graph databases may be too big for a single machine 
– Easier for a distributed system, which shares the load across its nodes 
• The graph database may itself be scattered around the globe 
– e.g., Google search records
Distributed Graph Mining 
• Partition-based 
• Divide the problem into independent sub-problems 
– Each node of the system can process its sub-problem independently 
– Parallel processing 
– Speeds up computation 
– Enhances the scalability of solutions
Techniques 
• MRPF 
• MapReduce 
– We are mainly interested in this
Map Reduce 
• A programming model for distributed 
platforms. 
• Proposed by Google 
• Abundant open source implementations 
– Hadoop 
• Divides the problem into sub-problems to be processed on the nodes 
– the Map step 
• Combines the processing results 
– the Reduce step
Map Reduce Example 
• Problem: Find the frequency of a word in the documents available on a distributed system.
[Diagram: three documents containing the word are processed on the distributed system. The Map step emits a <word, count> pair for each document (<word, 2>, <word, 1>, <word, 2>), and the Reduce step sums them into <word, 2 + 1 + 2 = 5>.]
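This diagram maps directly onto a pair of Hadoop Streaming scripts. Below is a minimal sketch in Perl (the implementation language used in the experiments later); the script names, input layout and whitespace tokenisation are illustrative assumptions, not details from the slides.

```perl
#!/usr/bin/perl
# mapper.pl -- Map step: emit a <word, 1> pair for every word in the input split.
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    print "$_\t1\n" for grep { length } split /\W+/, $line;
}
```

```perl
#!/usr/bin/perl
# reducer.pl -- Reduce step: the framework delivers the pairs sorted by key,
# so all counts for one word arrive together; sum them and emit <word, total>.
use strict;
use warnings;

my ($current, $total) = (undef, 0);
while (my $line = <STDIN>) {
    chomp $line;
    my ($word, $count) = split /\t/, $line;
    if (defined $current && $word ne $current) {
        print "$current\t$total\n";
        $total = 0;
    }
    $current = $word;
    $total  += $count;
}
print "$current\t$total\n" if defined $current;
```

A Hadoop Streaming job wires the two together, along the lines of: hadoop jar hadoop-streaming.jar -input docs -output counts -mapper mapper.pl -reducer reducer.pl -file mapper.pl -file reducer.pl (the jar path and directories are placeholders).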
Graph Mining using Map Reduce 
• Problem: Find the frequent sub-graphs of a graph database using the MapReduce programming model (local support 2) 
[Diagram: the graph dataset is split across the distributed system in the Map step; each node runs gSpan on its own portion, and the Reduce step combines the local results into global sub-graph counts (3, 2 and 5 in the example).]
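A sketch of the reduce side of this picture, assuming each mapper has already run gSpan on its partition and emitted one "subgraph-label <TAB> local-count" line per locally frequent sub-graph; the slides do not specify how sub-graphs are encoded, so the label format is an assumption. The grouping logic mirrors the word-count reducer above, with a frequency filter at the end.

```perl
#!/usr/bin/perl
# subgraph_reducer.pl -- sum the per-partition counts of each sub-graph and
# keep the ones whose combined count reaches the support threshold.
use strict;
use warnings;

my $support = 2;   # threshold from the slide example

my ($current, $total) = (undef, 0);

sub emit {
    my ($label, $count) = @_;
    print "$label\t$count\n" if $count >= $support;
}

while (my $line = <STDIN>) {
    chomp $line;
    my ($label, $count) = split /\t/, $line;
    if (defined $current && $label ne $current) {
        emit($current, $total);
        $total = 0;
    }
    $current = $label;
    $total  += $count;
}
emit($current, $total) if defined $current;
```

The final filter is also where the loss of data discussed later can creep in: if a sub-graph never reaches the local support in some partition, that partition's occurrences are simply missing from the sum.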
Data Partitioning 
• Performance and load balancing depend on the mapping step 
– Termed "Partitioning" 
– Decides which portion of the graph dataset goes to which node 
– Loss of data and load balancing are directly dependent on the partitioning
• Two approaches 
– MRGP (MapReduce Graph Partitioning) 
– DGP (Density-Based Graph Partitioning)
MRGP 
• The default approach followed in common MapReduce jobs 
• Graphs are assigned to partitions sequentially 
• Simple (a sketch follows the example below)
Graph Size (KB) Density 
G1 1 0.25 
G2 2 0.5 
G3 2 0.6 
G4 1 0.25 
G5 2 0.5 
G6 2 0.5 
G7 2 0.5 
G8 2 0.6 
G9 2 0.6 
G10 2 0.7 
G11 3 0.7 
G12 3 0.8 
4 partitions of 6 KB each:
G1, G2, G3, G4 
G5, G6, G7 
G8, G9, G10 
G11, G12
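A minimal sketch of this idea, assuming graphs are simply taken in their original order and a new chunk is started whenever the next graph would overflow the 6 KB budget; the script name and data representation are illustrative. On the table above it reproduces the four partitions shown.

```perl
#!/usr/bin/perl
# mrgp_partition.pl -- sketch of MRGP: graphs are assigned to partitions
# sequentially, by size only, ignoring density.
use strict;
use warnings;

# [graph id, size in KB, density] -- the table from the slide
my @graphs = (
    ['G1', 1, 0.25], ['G2', 2, 0.5],  ['G3',  2, 0.6], ['G4',  1, 0.25],
    ['G5', 2, 0.5],  ['G6', 2, 0.5],  ['G7',  2, 0.5], ['G8',  2, 0.6],
    ['G9', 2, 0.6],  ['G10', 2, 0.7], ['G11', 3, 0.7], ['G12', 3, 0.8],
);
my $budget = 6;    # KB per partition, as in the slide

my (@partitions, @current);
my $size = 0;
for my $g (@graphs) {
    if (@current && $size + $g->[1] > $budget) {    # current chunk is full
        push @partitions, [@current];
        @current = ();
        $size    = 0;
    }
    push @current, $g->[0];
    $size += $g->[1];
}
push @partitions, [@current] if @current;

print 'Partition ', $_ + 1, ': ', join(', ', @{ $partitions[$_] }), "\n"
    for 0 .. $#partitions;
```

Note how the last two partitions end up holding the densest (and slowest to mine) graphs; this is exactly the imbalance DGP tries to avoid.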
DGP 
• Aims for a balanced distribution of graph density across partitions 
• Uses intermediary buckets 
• First, the graphs are sorted by density.
Graph Size (KB) Density 
G1 1 0.25 
G2 2 0.5 
G3 2 0.6 
G4 1 0.25 
G5 2 0.5 
G6 2 0.5 
G7 2 0.5 
G8 2 0.6 
G9 2 0.6 
G10 2 0.7 
G11 3 0.7 
G12 3 0.8 
G1 (0.25) 
G4 (0.25) 
G2 (0.5) 
G5 (0.5) 
G6 (0.5) 
G7 (0.5) 
G3 (0.6) 
G8 (0.6) 
G9 (0.6) 
G10 (0.7) 
G11 (0.7) 
G12 (0.8)
DGP cont.. 
• Let's say the bucket count for this demo is 2 
• Next, we distribute the sorted list equally across the two buckets.
Bucket 1: G1, G2, G4, G5, G6, G7 
Bucket 2: G3, G8, G9, G10, G11, G12 
Divide each bucket into 4 non-empty sub-buckets; these will be combined into 4 partitions in total.
DGP Cont.. 
• Now take one sub-bucket from each bucket and merge them to form the final partitions (see the sketch after the example)
Bucket 1: G1, G2, G4, G5, G6, G7 
Bucket 2: G3, G8, G9, G10, G11, G12 
Final partitions: {G1, G2, G3, G8}, {G4, G5, G9, G10}, {G6, G11}, {G7, G12}
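Putting the three DGP steps together: sort by density, spread the sorted list evenly over the buckets, cut each bucket into as many sub-buckets as there are final partitions, and join the i-th sub-buckets. This is a minimal sketch; the round-robin sub-bucket split is an illustrative choice, since the slides do not spell out exactly how a bucket is cut, so the partition contents can differ from the example above while still mixing densities.

```perl
#!/usr/bin/perl
# dgp_partition.pl -- sketch of Density-Based Graph Partitioning (DGP).
use strict;
use warnings;

# [graph id, size in KB, density] -- the table from the slide
my @graphs = (
    ['G1', 1, 0.25], ['G2', 2, 0.5],  ['G3',  2, 0.6], ['G4',  1, 0.25],
    ['G5', 2, 0.5],  ['G6', 2, 0.5],  ['G7',  2, 0.5], ['G8',  2, 0.6],
    ['G9', 2, 0.6],  ['G10', 2, 0.7], ['G11', 3, 0.7], ['G12', 3, 0.8],
);
my $num_buckets    = 2;   # intermediary buckets, as in the demo
my $num_partitions = 4;   # final partitions

# Step 1: sort the graphs by density.
my @sorted = sort { $a->[2] <=> $b->[2] } @graphs;

# Step 2: distribute the sorted list equally across the buckets.
my @buckets;
my $per_bucket = int((@sorted + $num_buckets - 1) / $num_buckets);
push @{ $buckets[ int($_ / $per_bucket) ] }, $sorted[$_] for 0 .. $#sorted;

# Step 3: cut every bucket into $num_partitions sub-buckets (round-robin here,
# purely for illustration) and build final partition i from the i-th
# sub-bucket of every bucket.
my @partitions;
for my $bucket (@buckets) {
    push @{ $partitions[ $_ % $num_partitions ] }, $bucket->[$_][0] for 0 .. $#{$bucket};
}

print 'Partition ', $_ + 1, ': ', join(', ', @{ $partitions[$_] }), "\n"
    for 0 .. $#partitions;
```

The point of the construction is that every final partition receives graphs from both the low-density and the high-density bucket, so the gSpan workload is spread more evenly across the mappers than with MRGP.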
Support Count 
• There are two types of support counts to be 
considered in distributed graph mining 
– Global Support Count 
– Local Support Count 
• Global Support is the same as in normal graph 
mining 
• When each mapper runs its individual job, it uses the local support count.
Local Support Count 
• Each individual node holds only part of the graph dataset. 
• The support count therefore needs to be adjusted relative to the original dataset. 
• This adjusted support count is the Local Support Count. 
• Local Support Count = Tolerance Rate * Global Support [the tolerance rate is between 0 and 1]
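A worked example of the formula above; the 30% global support matches the experiments reported later, while the tolerance rate and partition size are hypothetical values chosen only for illustration.

```perl
#!/usr/bin/perl
# local_support.pl -- threshold that a single mapper applies to its own partition,
# using the formula from the slide: Local Support = Tolerance Rate * Global Support.
use strict;
use warnings;
use POSIX qw(ceil);

my $global_support = 0.30;   # global support threshold (30%, as in the experiments)
my $tolerance_rate = 0.7;    # hypothetical tolerance rate, between 0 and 1
my $partition_size = 250;    # hypothetical number of graphs in this mapper's partition

my $local_support = $tolerance_rate * $global_support;       # 0.7 * 0.30 = 0.21
# Assumption: the local threshold is applied as a fraction of the partition
# that the mapper actually sees.
my $min_count = ceil($local_support * $partition_size);      # ceil(52.5) = 53

printf "local support = %.2f; a sub-graph must occur in at least %d of %d graphs\n",
       $local_support, $min_count, $partition_size;
```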
Loss of Data 
• Some frequent sub-graphs are lost, because a mapper only reports sub-graphs that meet the local threshold within its own partition 
• The loss can be mitigated by choosing an appropriate tolerance rate 
– Theoretically, a tolerance rate of 1 means there will be no loss of data 
– But it usually means a higher run time
Experiment Environment 
• Language: Perl 
• MapReduce Framework: Hadoop (0.20.1) 
• Cluster Size: 5 nodes 
• Node Specification: 
– Processor: AMD Opteron Quad Core, 2.4 GHz 
– Main memory: 4 GB
Data Sets 
• Synthetic (sizes ranging from 18 MB to 69 GB) 
• Real 
– Chemical compound dataset from the National Cancer Institute
Loss Rate for gSpan (Support 30%) 
Loss Rate for Gaston and FSG (Support 30%)
Runtime
Thank You
