Distributed Graph Mining 
Presented By 
Sayeed Mahmud
Motivation
Motivation 
• The reason Big Data is here 
– To make it practical to process data that would be impossible or overwhelming to handle with existing single-machine tools 
• Some graph databases may be too big for a single machine 
– Easier for a distributed system, which shares the load across its nodes 
• The graph database may itself be scattered around the globe 
– e.g., Google search records
Distributed Graph Mining 
• Partition-based 
• Divide the problem into independent sub-problems 
– Each node of the system can process its sub-problem independently 
– Parallel processing 
– Speeds up computation 
– Enhances the scalability of solutions
Techniques 
• MRPF 
• MapReduce 
– We are mainly interested in this
Map Reduce 
• A programming model for distributed 
platforms. 
• Proposed by Google 
• Abundant open source implementations 
– Hadoop 
• Divides the problem into sub-problems to be processed on the nodes 
– the Map step 
• Combines the processing results 
– the Reduce step
Map Reduce Example 
• Problem: Find the frequency of a word in the documents available on a distributed system.
[Diagram: three documents containing the word are processed on the distributed system. The Map step emits a <word, count> pair for each document (<word, 2>, <word, 1>, <word, 2>), and the Reduce step sums them into <word, 2 + 1 + 2 = 5>.]
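This diagram maps directly onto a pair of Hadoop Streaming scripts. Below is a minimal sketch in Perl (the implementation language used in the experiments later); the script names, input layout and whitespace tokenisation are illustrative assumptions, not details from the slides.

```perl
#!/usr/bin/perl
# mapper.pl -- Map step: emit a <word, 1> pair for every word in the input split.
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    print "$_\t1\n" for grep { length } split /\W+/, $line;
}
```

```perl
#!/usr/bin/perl
# reducer.pl -- Reduce step: the framework delivers the pairs sorted by key,
# so all counts for one word arrive together; sum them and emit <word, total>.
use strict;
use warnings;

my ($current, $total) = (undef, 0);
while (my $line = <STDIN>) {
    chomp $line;
    my ($word, $count) = split /\t/, $line;
    if (defined $current && $word ne $current) {
        print "$current\t$total\n";
        $total = 0;
    }
    $current = $word;
    $total  += $count;
}
print "$current\t$total\n" if defined $current;
```

A Hadoop Streaming job wires the two together, along the lines of: hadoop jar hadoop-streaming.jar -input docs -output counts -mapper mapper.pl -reducer reducer.pl -file mapper.pl -file reducer.pl (the jar path and directories are placeholders).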
Graph Mining using Map Reduce 
• Problem: Find the frequent sub-graphs of a graph database using the MapReduce programming model (local support 2) 
[Diagram: the graph dataset is split across the distributed system in the Map step; each node runs gSpan on its own portion, and the Reduce step combines the local results into global sub-graph counts (3, 2 and 5 in the example).]
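A sketch of the reduce side of this picture, assuming each mapper has already run gSpan on its partition and emitted one "subgraph-label <TAB> local-count" line per locally frequent sub-graph; the slides do not specify how sub-graphs are encoded, so the label format is an assumption. The grouping logic mirrors the word-count reducer above, with a frequency filter at the end.

```perl
#!/usr/bin/perl
# subgraph_reducer.pl -- sum the per-partition counts of each sub-graph and
# keep the ones whose combined count reaches the support threshold.
use strict;
use warnings;

my $support = 2;   # threshold from the slide example

my ($current, $total) = (undef, 0);

sub emit {
    my ($label, $count) = @_;
    print "$label\t$count\n" if $count >= $support;
}

while (my $line = <STDIN>) {
    chomp $line;
    my ($label, $count) = split /\t/, $line;
    if (defined $current && $label ne $current) {
        emit($current, $total);
        $total = 0;
    }
    $current = $label;
    $total  += $count;
}
emit($current, $total) if defined $current;
```

The final filter is also where the loss of data discussed later can creep in: if a sub-graph never reaches the local support in some partition, that partition's occurrences are simply missing from the sum.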
Data Partitioning 
• Performance and load balancing depend on the mapping step 
– Termed "Partitioning" 
– Decides which portion of the graph dataset goes to which node 
– Loss of data and load balancing are directly dependent on the partitioning
• Two approaches 
– MRGP (MapReduce Graph Partitioning) 
– DGP (Density-Based Graph Partitioning)
MRGP 
• The default approach followed in common MapReduce jobs 
• Graphs are assigned to partitions sequentially 
• Simple (a sketch follows the example below)
Graph Size (KB) Density 
G1 1 0.25 
G2 2 0.5 
G3 2 0.6 
G4 1 0.25 
G5 2 0.5 
G6 2 0.5 
G7 2 0.5 
G8 2 0.6 
G9 2 0.6 
G10 2 0.7 
G11 3 0.7 
G12 3 0.8 
4 partitions of 6 KB each:
G1, G2, G3, G4 
G5, G6, G7 
G8, G9, G10 
G11, G12
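A minimal sketch of this idea, assuming graphs are simply taken in their original order and a new chunk is started whenever the next graph would overflow the 6 KB budget; the script name and data representation are illustrative. On the table above it reproduces the four partitions shown.

```perl
#!/usr/bin/perl
# mrgp_partition.pl -- sketch of MRGP: graphs are assigned to partitions
# sequentially, by size only, ignoring density.
use strict;
use warnings;

# [graph id, size in KB, density] -- the table from the slide
my @graphs = (
    ['G1', 1, 0.25], ['G2', 2, 0.5],  ['G3',  2, 0.6], ['G4',  1, 0.25],
    ['G5', 2, 0.5],  ['G6', 2, 0.5],  ['G7',  2, 0.5], ['G8',  2, 0.6],
    ['G9', 2, 0.6],  ['G10', 2, 0.7], ['G11', 3, 0.7], ['G12', 3, 0.8],
);
my $budget = 6;    # KB per partition, as in the slide

my (@partitions, @current);
my $size = 0;
for my $g (@graphs) {
    if (@current && $size + $g->[1] > $budget) {    # current chunk is full
        push @partitions, [@current];
        @current = ();
        $size    = 0;
    }
    push @current, $g->[0];
    $size += $g->[1];
}
push @partitions, [@current] if @current;

print 'Partition ', $_ + 1, ': ', join(', ', @{ $partitions[$_] }), "\n"
    for 0 .. $#partitions;
```

Note how the last two partitions end up holding the densest (and slowest to mine) graphs; this is exactly the imbalance DGP tries to avoid.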
DGP 
• Aims for a balanced distribution of graph density across partitions 
• Uses intermediary buckets 
• First, the graphs are sorted by density.
Graph Size (KB) Density 
G1 1 0.25 
G2 2 0.5 
G3 2 0.6 
G4 1 0.25 
G5 2 0.5 
G6 2 0.5 
G7 2 0.5 
G8 2 0.6 
G9 2 0.6 
G10 2 0.7 
G11 3 0.7 
G12 3 0.8 
G1 (0.25) 
G4 (0.25) 
G2 (0.5) 
G5 (0.5) 
G6 (0.5) 
G7 (0.5) 
G3 (0.6) 
G8 (0.6) 
G9 (0.6) 
G10 (0.7) 
G11 (0.7) 
G12 (0.8)
DGP cont.. 
• Let's say the bucket count for this demo is 2 
• Next, we distribute the sorted list equally across the two buckets.
Bucket 1: G1, G2, G4, G5, G6, G7 
Bucket 2: G3, G8, G9, G10, G11, G12 
Divide each bucket into 4 non-empty sub-buckets; these will be combined into 4 partitions in total.
DGP Cont.. 
• Now take one sub-bucket from each bucket and merge them to form the final partitions (see the sketch after the example)
Bucket 1: G1, G2, G4, G5, G6, G7 
Bucket 2: G3, G8, G9, G10, G11, G12 
Final partitions: {G1, G2, G3, G8}, {G4, G5, G9, G10}, {G6, G11}, {G7, G12}
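Putting the three DGP steps together: sort by density, spread the sorted list evenly over the buckets, cut each bucket into as many sub-buckets as there are final partitions, and join the i-th sub-buckets. This is a minimal sketch; the round-robin sub-bucket split is an illustrative choice, since the slides do not spell out exactly how a bucket is cut, so the partition contents can differ from the example above while still mixing densities.

```perl
#!/usr/bin/perl
# dgp_partition.pl -- sketch of Density-Based Graph Partitioning (DGP).
use strict;
use warnings;

# [graph id, size in KB, density] -- the table from the slide
my @graphs = (
    ['G1', 1, 0.25], ['G2', 2, 0.5],  ['G3',  2, 0.6], ['G4',  1, 0.25],
    ['G5', 2, 0.5],  ['G6', 2, 0.5],  ['G7',  2, 0.5], ['G8',  2, 0.6],
    ['G9', 2, 0.6],  ['G10', 2, 0.7], ['G11', 3, 0.7], ['G12', 3, 0.8],
);
my $num_buckets    = 2;   # intermediary buckets, as in the demo
my $num_partitions = 4;   # final partitions

# Step 1: sort the graphs by density.
my @sorted = sort { $a->[2] <=> $b->[2] } @graphs;

# Step 2: distribute the sorted list equally across the buckets.
my @buckets;
my $per_bucket = int((@sorted + $num_buckets - 1) / $num_buckets);
push @{ $buckets[ int($_ / $per_bucket) ] }, $sorted[$_] for 0 .. $#sorted;

# Step 3: cut every bucket into $num_partitions sub-buckets (round-robin here,
# purely for illustration) and build final partition i from the i-th
# sub-bucket of every bucket.
my @partitions;
for my $bucket (@buckets) {
    push @{ $partitions[ $_ % $num_partitions ] }, $bucket->[$_][0] for 0 .. $#{$bucket};
}

print 'Partition ', $_ + 1, ': ', join(', ', @{ $partitions[$_] }), "\n"
    for 0 .. $#partitions;
```

The point of the construction is that every final partition receives graphs from both the low-density and the high-density bucket, so the gSpan workload is spread more evenly across the mappers than with MRGP.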
Support Count 
• There are two types of support counts to be 
considered in distributed graph mining 
– Global Support Count 
– Local Support Count 
• Global Support is the same as in normal graph 
mining 
• When each mapper runs its individual job, it uses the local support count.
Local Support Count 
• Each individual node holds only part of the graph dataset. 
• The support count therefore needs to be adjusted relative to the original dataset. 
• This adjusted support count is the Local Support Count. 
• Local Support Count = Tolerance Rate * Global Support [the tolerance rate is between 0 and 1]
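A worked example of the formula above; the 30% global support matches the experiments reported later, while the tolerance rate and partition size are hypothetical values chosen only for illustration.

```perl
#!/usr/bin/perl
# local_support.pl -- threshold that a single mapper applies to its own partition,
# using the formula from the slide: Local Support = Tolerance Rate * Global Support.
use strict;
use warnings;
use POSIX qw(ceil);

my $global_support = 0.30;   # global support threshold (30%, as in the experiments)
my $tolerance_rate = 0.7;    # hypothetical tolerance rate, between 0 and 1
my $partition_size = 250;    # hypothetical number of graphs in this mapper's partition

my $local_support = $tolerance_rate * $global_support;       # 0.7 * 0.30 = 0.21
# Assumption: the local threshold is applied as a fraction of the partition
# that the mapper actually sees.
my $min_count = ceil($local_support * $partition_size);      # ceil(52.5) = 53

printf "local support = %.2f; a sub-graph must occur in at least %d of %d graphs\n",
       $local_support, $min_count, $partition_size;
```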
Loss of Data 
• Some frequent sub-graphs are lost, because a mapper only reports sub-graphs that meet the local threshold within its own partition 
• The loss can be mitigated by choosing an appropriate tolerance rate 
– Theoretically, a tolerance rate of 1 means there will be no loss of data 
– But it usually means a higher run time
Experiment Environment 
• Language: Perl 
• MapReduce Framework: Hadoop (0.20.1) 
• Cluster Size: 5 nodes 
• Node Specification: 
– Processor: AMD Opteron Quad Core, 2.4 GHz 
– Main memory: 4 GB
Data Sets 
• Synthetic (sizes ranging from 18 MB to 69 GB) 
• Real 
– Chemical compound dataset from the National Cancer Institute
Loss Rate for gSpan (Support 30%) 
Loss Rate for Gaston and FSG (Support 30%)
Runtime
Thank You
