Cut to Fit: Tailoring the Partitioning to the Computation
Iacovos G. Kolokasis & Polyvios Pratikakis
30 June 2019
Institute of Computer Science (ICS)
Foundation for Research and Technology – Hellas (FORTH) &
Computer Science Department, University of Crete
Outline
1. Motivation & Overview
2. Experimental Methodology
3. Characterizing Partition Strategies
4. Partition Metrics As Performance Predictors
5. Conclusions
kolokasis@ics.forth.gr 1 of 26
Motivation & Overview
Graph Analytics Computation Dependencies
1. Various graph datasets with different properties
• Power-law graphs (e.g. social networks)
• Grid graphs (e.g. road networks)
2. Various graph algorithms with different computation
effort
• Not all algorithms perform a fixed number of operations
per edge (e.g. BFS, Connected Components)
• Many algorithms make passes over the vertices in addition
to passes over the edges
3. Various partition strategies
• Distributed graph computing frameworks rely on
graph partitioning
Impact of Graph Partitioning
• Data partitioning can have a significant impact on the
performance of the graph computation
• Network Traffic
• Memory occupation
• Load balance
Challenges
• There is no single optimal partitioner for all problems
• Complex partitioners result in increased partitioning
time
Our goal is to study these two problems by:
• Characterizing partition strategies using a wide set of
metrics
• Quantifying the correlation of partition metrics with
computation performance
Experimental Methodology
Spark Cluster Configuration
              Instance  Total Cores  Total Memory  Exec./Worker
Master        1         32           256GB         -
Workers       4         32           256GB         6
Per Executor  -         5            29GB          -
• Nodes are connected by a 40Gb network
• We use 240 and 480 partitions in total
• We restart Spark between runs
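The executor layout in the table implies a few totals that put the two partition counts in context. The arithmetic below is only a sketch: it assumes the "Workers" row describes four identical nodes and that every executor slot is in use, and the variable names are illustrative.

```python
# Values read from the cluster table above.
workers = 4
executors_per_worker = 6
cores_per_executor = 5
memory_per_executor_gb = 29

executors = workers * executors_per_worker                # 24 executors
executor_cores = executors * cores_per_executor           # 120 cores
executor_memory_gb = executors * memory_per_executor_gb   # 696 GB

# Partitions available per executor core for each partition count used.
for partitions in (240, 480):
    print(partitions, "partitions ->", partitions / executor_cores, "per core")
```

Under these assumptions, 240 and 480 partitions correspond to 2 and 4 partitions per executor core.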
Experimental Setup
• Typical Graph Algorithms
• PageRank (PR), Connected Components (CC)
• Triangle Count (TC), Single-Source Shortest Path (SSSP)
• Datasets
Dataset                Vertices  Edges   Size
web-wikipedia-link-fr  4.9M      113.1M  1.6G
soc-twitter-2010       21.2M     265.0M  4.4G
road-road-usa          23.9M     28.8M   469.7M
soc-sinaweibo          58.6M     261.3M  3.8G
socfb-uci-uni          58.7M     92.2M   1.5G
Graph Partitioners
Assigns edges to partitions by hashing together the source and
destination vertex IDs, resulting in a random vertex cut.
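A minimal Python sketch of this strategy follows. The description appears to correspond to the strategy GraphX calls RandomVertexCut; the function name and the use of Python's built-in tuple hash are assumptions made for illustration, not the hash used in the experiments.

```python
def random_vertex_cut(src: int, dst: int, num_parts: int) -> int:
    """Hash the ordered (src, dst) pair to a partition index."""
    # hash() of a tuple of ints is deterministic in CPython, and % with a
    # positive modulus always yields a value in [0, num_parts).
    return hash((src, dst)) % num_parts


edges = [(1, 2), (2, 1), (1, 3), (4, 1)]
assignment = {edge: random_vertex_cut(*edge, num_parts=4) for edge in edges}
print(assignment)
```

Because the pair is ordered, the edges (1, 2) and (2, 1) are hashed independently and may land in different partitions.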
Graph Partitioners
Assigns edges to partitions by hashing the source vertex ID.
This causes all edges with the same source vertex to be
collocated in the same partition.
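The sketch below illustrates source-only hashing in Python. The description appears to correspond to the GraphX strategy EdgePartition1D; the multiplicative constant is an assumption used here only to scramble IDs before the modulo.

```python
MIXING_CONSTANT = 1125899906842597  # large odd constant that scrambles IDs (assumption)


def edge_partition_1d(src: int, dst: int, num_parts: int) -> int:
    """Choose the partition from the source vertex ID only."""
    return (abs(src) * MIXING_CONSTANT) % num_parts


out_edges_of_7 = [(7, d) for d in (1, 2, 3, 40, 500)]
partitions = {edge_partition_1d(s, d, num_parts=8) for s, d in out_edges_of_7}
print(partitions)  # a set holding a single partition index
```

Since the destination never enters the computation, a high-degree source concentrates all of its out-edges in one partition.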
Graph Partitioners
Arranges all partitions into a square matrix and picks the
column on the basis of the source vertex’s hash and the row
on the basis of the destination vertex’s hash.
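The grid placement can be sketched in Python as below. The description appears to correspond to the GraphX strategy EdgePartition2D; the mixing constant and the handling of partition counts that are not perfect squares (round the side up, then fold with a modulo) are assumptions of this sketch.

```python
import math

MIXING_CONSTANT = 1125899906842597  # scrambles IDs before the modulo (assumption)


def edge_partition_2d(src: int, dst: int, num_parts: int) -> int:
    """Map an edge to a cell of a side x side grid of partitions."""
    side = math.isqrt(num_parts)
    if side * side < num_parts:
        side += 1  # round up when num_parts is not a perfect square
    col = (abs(src) * MIXING_CONSTANT) % side
    row = (abs(dst) * MIXING_CONSTANT) % side
    # Fold the cell index back into range for non-square partition counts.
    return (col * side + row) % num_parts


# All out-edges of one source share a column, so they touch at most `side` partitions.
touched = {edge_partition_2d(7, d, num_parts=9) for d in range(100)}
print(len(touched))
```

With 9 partitions the grid is 3 x 3, so a single source vertex is replicated in at most 3 partitions no matter how many out-edges it has.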
Graph Partitioners
Assigns edges to partitions by hashing the source and
destination vertex IDs in a canonical direction, resulting in a
random vertex cut that collocates all edges between two
vertices, regardless of direction.
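A short Python sketch of the canonical ordering follows. The description appears to correspond to the GraphX strategy CanonicalRandomVertexCut; ordering the endpoints as (smaller, larger) and using Python's tuple hash are illustrative assumptions.

```python
def canonical_random_vertex_cut(src: int, dst: int, num_parts: int) -> int:
    """Hash the endpoint pair in a fixed (smaller, larger) order."""
    lo, hi = (src, dst) if src < dst else (dst, src)
    return hash((lo, hi)) % num_parts


forward = canonical_random_vertex_cut(3, 8, num_parts=16)
backward = canonical_random_vertex_cut(8, 3, num_parts=16)
print(forward == backward)  # True
```

Both directions of an edge produce the same ordered pair, so they always hash to the same partition.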
Graph Partitioners
Assigns edges to partitions by a simple modulo of the source
vertex ID with the total number of partitions. We expect this to
exploit any correlation between vertex IDs and locality.
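The modulo rule on the source ID is small enough to state directly in Python. The function name is illustrative, and the sketch assumes integer vertex IDs.

```python
def source_cut(src: int, dst: int, num_parts: int) -> int:
    """Partition index is the source vertex ID modulo the partition count."""
    return src % num_parts


# Sources whose IDs differ by a multiple of num_parts share a partition.
print(source_cut(10, 3, num_parts=4))  # 2
print(source_cut(14, 9, num_parts=4))  # 2
```

Unlike a hash, the modulo keeps the ID structure visible: whether that helps depends on how the dataset assigned its vertex IDs.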
Graph Partitioners
Assigns edges to partitions by a simple modulo of the
destination vertex ID with the total number of partitions.
We assume that vertex IDs may capture a metric of locality.
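The sketch below applies the modulo rule to the destination ID and groups a small edge list by partition, to show that all in-edges of a vertex end up together. Names and the sample edges are illustrative assumptions.

```python
from collections import defaultdict


def destination_cut(src: int, dst: int, num_parts: int) -> int:
    """Partition index is the destination vertex ID modulo the partition count."""
    return dst % num_parts


edges = [(1, 7), (2, 7), (3, 7), (4, 8)]
by_partition = defaultdict(list)
for src, dst in edges:
    by_partition[destination_cut(src, dst, num_parts=4)].append((src, dst))

print(dict(by_partition))  # {3: [(1, 7), (2, 7), (3, 7)], 0: [(4, 8)]}
```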
Graph Partitioners
Places edges into partitions using a Destination Cut strategy
when the destination is a hub, or a Source Cut strategy when
it is not.
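One way to read this rule is sketched below in Python. The slide does not say how a hub is detected, so the sketch assumes a hub is a vertex whose in-degree exceeds a caller-supplied threshold; the function name, the threshold parameter, and the use of plain modulo for both cuts are assumptions.

```python
from collections import Counter


def hybrid_cut(edges, num_parts, hub_threshold):
    """Return one partition index per edge, in input order."""
    in_degree = Counter(dst for _, dst in edges)
    parts = []
    for src, dst in edges:
        if in_degree[dst] > hub_threshold:
            parts.append(dst % num_parts)  # hub destination: Destination Cut
        else:
            parts.append(src % num_parts)  # otherwise: Source Cut
    return parts


edges = [(1, 9), (2, 9), (3, 9), (4, 5)]
print(hybrid_cut(edges, num_parts=4, hub_threshold=2))  # [1, 1, 1, 0]
```

Vertex 9 has three in-edges and counts as a hub here, so its edges follow the destination; the edge into vertex 5 follows its source.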
Graph Partitioners
Distributes edges using the Edge Partition 2D strategy when
source and destination vertices are both hubs or both not
hubs; if only one of them is a hub, the algorithm places the
edge near the non-hub vertex.
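The decision logic above can be sketched in Python as follows. Several details are assumptions of the sketch rather than statements from the slides: a hub is a vertex whose total degree exceeds a threshold, "near the non-hub vertex" is interpreted as the non-hub's ID modulo the partition count, and the 2D placement uses plain modulo instead of a hash.

```python
import math
from collections import Counter


def grid_2d(src, dst, num_parts):
    """Simplified 2D placement: column from src, row from dst."""
    side = math.isqrt(num_parts)
    if side * side < num_parts:
        side += 1
    return ((src % side) * side + (dst % side)) % num_parts


def hybrid_2d(edges, num_parts, hub_threshold):
    """Return one partition index per edge, in input order."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    parts = []
    for src, dst in edges:
        src_hub = degree[src] > hub_threshold
        dst_hub = degree[dst] > hub_threshold
        if src_hub == dst_hub:
            parts.append(grid_2d(src, dst, num_parts))  # both hubs or both non-hubs
        elif src_hub:
            parts.append(dst % num_parts)  # destination is the non-hub
        else:
            parts.append(src % num_parts)  # source is the non-hub
    return parts


edges = [(0, 1), (0, 2), (0, 3), (5, 6)]
print(hybrid_2d(edges, num_parts=4, hub_threshold=2))  # [1, 2, 3, 2]
```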
Characterizing Partition Strategies
Partition Metrics
The ratio of the number of edges in the biggest partition, over
the average number of edges per partition.
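This ratio can be computed from a list holding the partition index of every edge. The Python sketch below assumes that representation and a non-empty edge list; the function name is illustrative.

```python
from collections import Counter


def balance(edge_partitions, num_parts):
    """Largest partition size divided by the average partition size."""
    counts = Counter(edge_partitions)
    average = len(edge_partitions) / num_parts
    return max(counts.values()) / average


# Six edges over four partitions; partition 0 holds three of them.
print(balance([0, 0, 0, 1, 2, 3], num_parts=4))  # 2.0
```

A value of 1.0 means the largest partition is exactly average-sized; larger values indicate a straggler partition.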
Partition Metrics
Normalized Standard Deviation of the number of edges per
partition. An alternative measure of imbalance in the edge
partitioning.
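The slide does not spell out the normalization. The sketch below assumes the common choice of dividing the population standard deviation of the partition sizes by their mean, and counts empty partitions as size zero.

```python
import statistics
from collections import Counter


def normalized_stdev(edge_partitions, num_parts):
    """Population standard deviation of partition sizes over their mean."""
    counts = Counter(edge_partitions)
    sizes = [counts[p] for p in range(num_parts)]  # empty partitions count as 0
    return statistics.pstdev(sizes) / statistics.mean(sizes)


print(normalized_stdev([0, 1, 2, 3], num_parts=4))        # 0.0
print(normalized_stdev([0, 0, 0, 1, 2, 3], num_parts=4))
```

Unlike the max-over-average ratio, this value reflects every partition, not only the largest one.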
Partition Metrics
The ratio of the total number of vertices across all partitions,
including replicated vertices, to the total number of vertices
of the original graph.
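A small Python sketch of this ratio follows. It assumes the partitioning is given as one partition index per edge and that every vertex of the graph appears in at least one edge; the function name is illustrative.

```python
from collections import defaultdict


def replication_factor(edges, edge_partitions):
    """Vertex copies summed over all partitions, divided by distinct vertices."""
    vertices_in = defaultdict(set)
    for (src, dst), part in zip(edges, edge_partitions):
        vertices_in[part].update((src, dst))
    copies = sum(len(vs) for vs in vertices_in.values())
    distinct = len(set().union(*vertices_in.values()))
    return copies / distinct


edges = [(1, 2), (1, 3), (2, 3)]
print(replication_factor(edges, [0, 1, 1]))  # 5 copies / 3 vertices
```

Partition 0 holds vertices {1, 2} and partition 1 holds {1, 2, 3}, giving five copies for three vertices.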
Partition Metrics
The number of vertices that exist in more than one partition,
irrespective of how many copies of each cut vertex there are.
These are the unique vertices copied across partitions.
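Counting these vertices only requires the set of partitions each vertex appears in. The sketch below assumes one partition index per edge; the function name is illustrative.

```python
from collections import defaultdict


def cut_vertices(edges, edge_partitions):
    """Number of distinct vertices that appear in more than one partition."""
    partitions_of = defaultdict(set)
    for (src, dst), part in zip(edges, edge_partitions):
        partitions_of[src].add(part)
        partitions_of[dst].add(part)
    return sum(1 for parts in partitions_of.values() if len(parts) > 1)


edges = [(1, 2), (1, 3), (2, 3)]
print(cut_vertices(edges, [0, 1, 1]))  # 2
```

Vertices 1 and 2 appear in both partitions, while vertex 3 appears only in partition 1.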
Partition Metrics
The total number of copies of replicated vertices that exist in
more than one partition. Shows the number of messages that
need to be exchanged on every superstep.
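This metric differs from the count of cut vertices by summing the copies instead of counting each cut vertex once. A minimal Python sketch, assuming one partition index per edge and an illustrative function name:

```python
from collections import defaultdict


def communication_cost(edges, edge_partitions):
    """Total copies of the vertices that appear in more than one partition."""
    partitions_of = defaultdict(set)
    for (src, dst), part in zip(edges, edge_partitions):
        partitions_of[src].add(part)
        partitions_of[dst].add(part)
    return sum(len(parts) for parts in partitions_of.values() if len(parts) > 1)


edges = [(1, 2), (1, 3), (2, 3)]
print(communication_cost(edges, [0, 1, 1]))  # 4
```

Vertices 1 and 2 each have two copies, so the cost is 4 even though only 2 vertices are cut.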
Characterization of Partition Metrics
• Almost all partitions produced by the partitioners are quite
balanced
• The exception is web-wikipedia-link-fr, where Destination Cut (DC)
produced unbalanced partitions
Characterization of Partition Metrics
• Power-law graphs result in a higher replication factor (RF)
• A low number of cut vertices (CV) usually means a low RF
Partition Metrics As Performance Predictors
Which Metrics can predict the performance?
• RF is correlated with PR performance on almost all datasets; the
only exception is web-wikipedia-link-fr
• RF is not correlated with TC performance
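The slides do not state which correlation measure was used. As one possibility, the sketch below computes the Pearson coefficient between a partition metric and execution time, with one pair of values per partitioner. The input numbers are placeholders chosen to make the example run, not measurements from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder inputs, NOT experimental data: one value per partitioner.
metric_values = [1.0, 2.0, 3.0, 4.0]
run_times = [10.0, 20.0, 30.0, 40.0]
print(pearson(metric_values, run_times))
```

A coefficient near 1 would mean the metric ranks partitioners the same way execution time does; a value near 0 would mean the metric carries little predictive information.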
Which Metrics can predict the performance?
• CV is correlated with CC performance on almost all datasets; the
only exception is road-road-usa
• CV is not a reliable predictor of TC performance
Dynamic Partitioner Selection
Hypothesis
Select a partitioner dynamically based on the properties of the
data (e.g. size of the graph, granularity of partitioning)
Testing
We implemented a very simple dynamic partitioner that selects
between partitioning algorithms based on the granularity of
partitioning
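The slides do not give the selection rule, so the Python sketch below only shows the general shape such a selector could take. The threshold of 256 partitions and the two candidate strategies are hypothetical placeholders, not the authors' choices.

```python
def source_modulo(src, dst, num_parts):
    """Placeholder strategy for coarse partitionings."""
    return src % num_parts


def pair_hash(src, dst, num_parts):
    """Placeholder strategy for fine partitionings."""
    return hash((src, dst)) % num_parts


def choose_partitioner(num_parts, threshold=256):
    """Pick a strategy from the partition count alone (hypothetical rule)."""
    return source_modulo if num_parts <= threshold else pair_hash


for num_parts in (240, 480):
    strategy = choose_partitioner(num_parts)
    print(num_parts, strategy.__name__, strategy(10, 3, num_parts))
```

Keeping each strategy as a plain function with the same signature makes the selector a one-line dispatch, so more strategies or graph properties can be added without touching the callers.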
Conclusions
• The efficiency of distributed graph analytics frameworks is highly
dependent on the partitioning strategy used
• There is no single optimal partitioner for all problems
• There is no simple way to predict the performance of the
computation
• Dynamic partitioners can achieve better results than
static partitioners across different datasets and
configurations
Q&A
For questions after this session, contact us at:
kolokasis@ics.forth.gr