Cut to Fit: Tailoring the Partitioning to the Computation

Iacovos G. Kolokasis & Polyvios Pratikakis
30 June 2019
Institute of Computer Science (ICS)
Foundation for Research and Technology – Hellas (FORTH) &
Computer Science Department, University of Crete

Outline
1. Motivation & Overview
2. Experimental Methodology
3. Characterizing Partition Strategies
4. Partition Metrics As Performance Predictors
5. Conclusions

Graph Analytics Computation Dependencies
1. Various graph datasets with different properties
• Power-law graphs (e.g. social networks)
• Grid graphs (e.g. road networks)
2. Various graph algorithms with different computational
effort
• Not all algorithms perform a fixed number of operations
per edge (e.g. BFS, Connected Components)
• Many algorithms make passes over the vertices in addition
to passes over the edges
3. Various partition strategies
• Distributed graph computing frameworks operate on
partitioned graphs

Impact of Graph Partitioning
• Data partitioning can have a significant impact on the
performance of the graph computation
• Network traffic
• Memory occupation
• Load balance

Challenges
• There is no single optimal partitioner for all problems
• Complex partitioners result in increased partitioning
time
Our goal is to study these two problems by:
• Characterizing partition strategies using a wide set of
metrics
• Quantifying the correlation of partition metrics with
computation performance

Spark Cluster Configuration
              Instance  Total Cores  Total Memory  Exec./Worker
Master        1         32           256GB         -
Workers       4         32           256GB         6
Per Executor  -         5            29GB          -
• Nodes are connected by a 40Gb network
• We use 240 and 480 partitions in total
• We restart Spark between runs
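
As a rough sanity check on the partition counts, and assuming the table above reads as 4 workers running 6 executors each with 5 cores per executor (an interpretation, not something the slides spell out), the two settings work out to 2 and 4 partitions per executor core:

```python
# Values read off the cluster table; how they combine is an assumption.
workers = 4
executors_per_worker = 6
cores_per_executor = 5

total_executor_cores = workers * executors_per_worker * cores_per_executor

# Partitions per executor core for the two configurations used.
partitions_per_core = {p: p / total_executor_cores for p in (240, 480)}
print(total_executor_cores, partitions_per_core)
```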

Experimental Setup
• Typical Graph Algorithms
• PageRank (PR), Connected Components (CC)
• Triangle Count (TC), Single-Source Shortest Path (SSSP)
• Datasets
Dataset                Vertices  Edges   Size
web-wikipedia-link-fr  4.9M      113.1M  1.6G
soc-twitter-2010       21.2M     265.0M  4.4G
road-road-usa          23.9M     28.8M   469.7M
soc-sinaweibo          58.6M     261.3M  3.8G
socfb-uci-uni          58.7M     92.2M   1.5G

Graph Partitioners: Random Vertex Cut (RVC)
Assigns edges to partitions by hashing together the source and
destination vertex IDs, resulting in a random vertex cut.
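
The hashing step described above can be sketched as follows. This is a minimal illustration, not the framework's code: the function name and the use of Python's built-in tuple hash are assumptions made for the example.

```python
def random_vertex_cut(src: int, dst: int, num_parts: int) -> int:
    """Map an edge to a partition by hashing the ordered (src, dst) pair."""
    # Python's % returns a value in [0, num_parts) for any positive num_parts,
    # even when the hash is negative.
    return hash((src, dst)) % num_parts
```

Because the pair is ordered, the edges (3, 7) and (7, 3) are hashed independently and may land in different partitions.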

Graph Partitioners: Edge Partition 1D (1D)
Assigns edges to partitions by hashing the source vertex ID.
This causes all edges with the same source vertex to be
collocated in the same partition.
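
A minimal sketch of source-only hashing, under stated assumptions: the multiplier below is an arbitrary large odd constant chosen for illustration to spread consecutive IDs, and the function name is hypothetical.

```python
MIXING_CONSTANT = 1125899906842597  # illustrative multiplier, not prescribed by the slides

def partition_by_source_hash(src: int, dst: int, num_parts: int) -> int:
    """Pick the partition from the source vertex ID only; dst is ignored."""
    return abs(src * MIXING_CONSTANT) % num_parts
```

Since the destination never enters the computation, every out-edge of a vertex ends up in the same partition.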

Graph Partitioners: Edge Partition 2D (2D)
Arranges all partitions into a square matrix and picks the
column on the basis of the source vertex’s hash and the row
on the basis of the destination vertex’s hash.
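
The square-matrix layout above can be sketched as below. Two things are assumptions for the example: the matrix side is the ceiling of the square root of the partition count, and cells beyond the partition count are wrapped with a final modulo.

```python
import math

def edge_partition_2d(src: int, dst: int, num_parts: int) -> int:
    """Place an edge in a grid cell: column from the source, row from the destination."""
    side = math.isqrt(num_parts)
    if side * side < num_parts:
        side += 1  # round up so the grid covers every partition
    col = hash(src) % side
    row = hash(dst) % side
    # Wrap around when num_parts is not a perfect square.
    return (col * side + row) % num_parts
```

With 16 partitions the grid is 4 x 4, so a vertex can appear in at most one column plus one row, i.e. 7 partitions, no matter how many edges it has.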

Graph Partitioners: Canonical Random Vertex Cut (CRVC)
Assigns edges to partitions by hashing the source and
destination vertex IDs in a canonical direction, resulting in a
random vertex cut that collocates all edges between two
vertices, regardless of direction.
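
A minimal sketch of hashing in a canonical direction, assuming "canonical" means ordering the two endpoints so that the smaller ID comes first (the function name is hypothetical):

```python
def canonical_random_vertex_cut(src: int, dst: int, num_parts: int) -> int:
    """Hash the endpoint pair after sorting it, so direction does not matter."""
    low, high = (src, dst) if src < dst else (dst, src)
    return hash((low, high)) % num_parts
```

Both directions of an edge now share a partition, which matters for algorithms that treat the graph as undirected.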

Graph Partitioners: Source Cut (SC)
Assigns edges to partitions by taking the source vertex ID
modulo the total number of partitions. We expect this to
preserve any correlation between vertex IDs and locality.

Graph Partitioners: Destination Cut (DC)
Assigns edges to partitions by taking the destination vertex
ID modulo the total number of partitions. We assume that
vertex IDs may capture a metric of locality.
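
The two modulo-based strategies (by source ID and by destination ID) differ from the hashing partitioners only in that the raw vertex ID is used directly. A minimal sketch, with hypothetical function names:

```python
def source_cut(src: int, dst: int, num_parts: int) -> int:
    """Partition index is the source vertex ID modulo the partition count."""
    return src % num_parts

def destination_cut(src: int, dst: int, num_parts: int) -> int:
    """Partition index is the destination vertex ID modulo the partition count."""
    return dst % num_parts
```

Without a hash, whatever structure the dataset's ID numbering carries is passed straight into the partition assignment; for example, consecutive IDs land in consecutive partitions.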

Graph Partitioners
Places edges into partitions using a Destination Cut strategy
when the destination is a hub, or a Source Cut strategy when
it is not.
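
A sketch of the hybrid rule above. The slides do not say how a hub is identified, so this example assumes a hub is a vertex whose in-degree exceeds a caller-supplied threshold; the function name and the threshold parameter are hypothetical.

```python
from collections import Counter

def hybrid_cut(edges, num_parts, hub_threshold):
    """Return one partition index per edge, in input order."""
    # Assumption: "hub" means in-degree above hub_threshold.
    in_degree = Counter(dst for _, dst in edges)
    parts = []
    for src, dst in edges:
        if in_degree[dst] > hub_threshold:
            parts.append(dst % num_parts)  # Destination Cut for hub destinations
        else:
            parts.append(src % num_parts)  # Source Cut otherwise
    return parts
```

For instance, with edges `[(1, 10), (2, 10), (3, 10), (4, 5)]`, 4 partitions and a threshold of 2, vertex 10 is the only hub, so its three in-edges are grouped by destination while the last edge is placed by its source.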

Graph Partitioners
Distributes edges using the Edge Partition 2D strategy when
source and destination vertices are both hubs or both not
hubs; if only one of them is a hub, the algorithm places the
edge near the non-hub vertex.
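
A sketch of this second hybrid rule, under several assumptions that the slides leave open: a hub is a vertex whose total degree exceeds a threshold, the 2D placement uses plain modulo on a square grid, and "near the non-hub vertex" is read as taking the non-hub's ID modulo the partition count.

```python
import math
from collections import Counter

def hybrid_cut_plus(edges, num_parts, hub_threshold):
    """Return one partition index per edge, in input order."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    side = math.isqrt(num_parts)
    if side * side < num_parts:
        side += 1  # grid side, rounded up
    parts = []
    for src, dst in edges:
        src_hub = degree[src] > hub_threshold
        dst_hub = degree[dst] > hub_threshold
        if src_hub == dst_hub:
            # Both hubs or both non-hubs: 2D grid placement.
            part = ((src % side) * side + dst % side) % num_parts
        elif src_hub:
            part = dst % num_parts  # destination is the non-hub
        else:
            part = src % num_parts  # source is the non-hub
        parts.append(part)
    return parts
```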

Characterizing Partition Strategies

Partition Metrics
The ratio of the number of edges in the biggest partition, over
the average number of edges per partition.

Partition Metrics
Normalized Standard Deviation of the number of edges per
partition. An alternative measure of imbalance in the edge
partitioning.
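
The two edge-balance measures can be computed from a list holding the partition index of every edge. This is a sketch: it assumes the standard deviation is normalized by the mean number of edges per partition, which the slides do not state explicitly, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def edges_per_partition(parts, num_parts):
    """Edge count of every partition, including empty ones."""
    counts = Counter(parts)
    return [counts[p] for p in range(num_parts)]

def balance(parts, num_parts):
    """Largest partition size over the average partition size."""
    sizes = edges_per_partition(parts, num_parts)
    return max(sizes) / mean(sizes)

def nstdev(parts, num_parts):
    """Population standard deviation of partition sizes over their mean."""
    sizes = edges_per_partition(parts, num_parts)
    return pstdev(sizes) / mean(sizes)
```

A perfectly even partitioning gives a ratio of 1.0 and a normalized deviation of 0.0; both grow as the partitioning becomes more skewed.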

Partition Metrics: Replication Factor (RF)
The ratio of the total number of vertices summed over all
partitions, including replicated vertices, over the total
number of vertices of the original graph.
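
A minimal sketch of this ratio, assuming the partitioning is given as an edge list plus a parallel list of partition indices (both the representation and the function name are choices made for the example):

```python
def replication_factor(edges, parts):
    """Total vertex copies across partitions divided by distinct vertices."""
    copies = set()    # (vertex, partition) pairs: one entry per copy
    vertices = set()  # distinct vertices of the original graph
    for (src, dst), part in zip(edges, parts):
        copies.add((src, part))
        copies.add((dst, part))
        vertices.update((src, dst))
    return len(copies) / len(vertices)
```

For the triangle `[(1, 2), (2, 3), (3, 1)]` split as `[0, 1, 1]`, vertices 1 and 2 each have two copies and vertex 3 has one, giving 5 copies over 3 vertices.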

Partition Metrics: Cut Vertices (CV)
The number of vertices that exist in more than one partition,
irrespective of how many copies of each cut vertex there are.
These are the unique vertices copied across partitions.

Partition Metrics
The total number of copies of replicated vertices that exist in
more than one partition. Shows the number of messages that
need to be exchanged on every superstep.
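
The last two metrics differ only in whether a replicated vertex is counted once or once per copy. A sketch under the same assumed input representation (edge list plus a parallel list of partition indices), with a hypothetical function name:

```python
from collections import defaultdict

def cut_metrics(edges, parts):
    """Return (number of cut vertices, total copies of those vertices)."""
    homes = defaultdict(set)  # vertex -> partitions that hold a copy
    for (src, dst), part in zip(edges, parts):
        homes[src].add(part)
        homes[dst].add(part)
    cut = [v for v, where in homes.items() if len(where) > 1]
    total_copies = sum(len(homes[v]) for v in cut)
    return len(cut), total_copies
```

For the triangle `[(1, 2), (2, 3), (3, 1)]` split as `[0, 1, 1]`, vertices 1 and 2 are cut, and together they account for four copies.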

Characterization of Partition Metrics
• Almost all partitions produced by the partitioners are quite
balanced
• The exception is web-wikipedia-link-fr, where DC produced
unbalanced partitions

Characterization of Partition Metrics
• Power-law graphs result in a higher RF
• A low number of CV usually means a low RF

Partition Metrics As Performance Predictors

Which Metrics Can Predict Performance?
• RF correlates with PR performance on all datasets except
web-wikipedia-link-fr
• RF is not correlated with TC performance

Which Metrics Can Predict Performance?
• CV correlates with CC performance on all datasets except
road-road-usa
• CV is not a reliable predictor of TC performance
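
The slides do not say which correlation measure was used. As one possible way to quantify such a relationship, the sketch below computes the Pearson correlation coefficient between a metric and a performance measurement taken across partitioners; the choice of Pearson is an assumption made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In use, `xs` would hold one metric value per partitioner for a given dataset and `ys` the matching execution times; a coefficient close to 1 would indicate that the metric tracks performance on that dataset.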

Dynamic Partitioner Selection
Hypothesis
Select a partitioner dynamically based on the properties of the
data (e.g. size of the graph, granularity of partitioning)
Testing
We implemented a very simple dynamic partitioner that selects
between partitioning algorithms based on the granularity of
partitioning
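
The slides do not state which partitioners are combined or where the switch happens, so the sketch below leaves both as parameters. It only illustrates the idea of choosing a strategy from the number of partitions; the function name, the `threshold` parameter and the two strategy arguments are all hypothetical.

```python
from typing import Callable

# A partitioner maps (src, dst, num_parts) to a partition index.
Partitioner = Callable[[int, int, int], int]

def make_dynamic_partitioner(coarse: Partitioner,
                             fine: Partitioner,
                             threshold: int) -> Partitioner:
    """Use `coarse` up to `threshold` partitions and `fine` above it."""
    def partition(src: int, dst: int, num_parts: int) -> int:
        chosen = coarse if num_parts <= threshold else fine
        return chosen(src, dst, num_parts)
    return partition
```

The returned function has the same shape as a static partitioner, so it can be used wherever one is expected.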

Conclusions
• The efficiency of distributed graph analytics frameworks is
highly dependent on the partitioning strategy used
• There is no single optimal partitioner for all problems
• There is no simple way to predict the performance of the
computation
• Dynamic partitioners can achieve better results than
static partitioners across different datasets and
configurations

Q&A
For questions after this session, contact us at:
kolokasis@ics.forth.gr
