Comparison and Evaluation of Open Source 
Implementations of Pregel and Related Systems 
December 2, 2013 
Joshua Woo, Prashant Raghav, Vishnu Prathish 
David R. Cheriton School of Computer Science 
University of Waterloo
Outline 
● Motivation 
● Our Project 
● Setup 
● Preliminary Results 
● Preliminary Analysis 
● In-Progress 
● References
Motivation 
Recall: Pregel 
● Large-scale graph processing system 
● Fault-tolerant framework for graph 
algorithms 
● MapReduce for graph operations? 
● Vertex-centric model (“think like a vertex”)
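To make "think like a vertex" concrete, here is a minimal single-process sketch of vertex-centric SSSP run in synchronous supersteps. It is not taken from Pregel or from any of the systems compared here; the function name `pregel_sssp` and the dictionary-based graph layout are assumptions made for illustration.

```python
import math
from collections import defaultdict


def pregel_sssp(edges, source):
    """Illustrative vertex-centric SSSP (hypothetical helper, not a real API).

    edges: dict mapping vertex -> list of (neighbor, weight) pairs.
    Returns (distances, number_of_supersteps).
    """
    vertices = set(edges)
    for neighbors in edges.values():
        vertices.update(n for n, _ in neighbors)

    dist = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}  # messages delivered at the start of a superstep
    supersteps = 0

    # A vertex with no incoming messages is inactive; stop when none remain.
    while inbox:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                # Improved distance: record it and notify out-neighbors.
                dist[v] = best
                for w, weight in edges.get(v, []):
                    outbox[w].append(best + weight)
        inbox = outbox  # barrier: messages become visible next superstep
        supersteps += 1

    return dist, supersteps


graph = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 2.0)], 2: []}
distances, steps = pregel_sssp(graph, 0)
```

Each vertex only looks at its own messages and its own out-edges, which is the property that lets real systems distribute vertices across workers.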
Motivation 
● Pregel is proprietary 
● Many open source graph processing 
systems 
○ Pregel clones 
○ Pregel-inspired 
○ BSP
Motivation 
● Apache Hama 
● Signal/Collect 
● Apache Giraph 
● GPS 
● GraphLab 
● Phoebus 
● GoldenOrb 
● HipG 
● Mizan
Motivation 
System           Impl. Language   Type
Apache Hama      Java             Pure BSP framework
Signal/Collect   Scala            Pregel inspired
Apache Giraph    Java             Pregel clone
GPS              Java             Advanced Pregel clone
GraphLab         C++              Pregel inspired
Phoebus          Erlang           Pregel clone
GoldenOrb        Java             Pregel clone
HipG             Java             Advanced Pregel clone
Mizan            C++              Advanced Pregel clone
Motivation 
● How do these systems compare? 
○ In terms of performance (runtime)? 
○ In terms of memory footprint? 
○ In terms of network utilization (num. messages)? 
○ Variables: 
■ Algorithm 
■ Graph size (number of vertices) 
■ Cluster size
Our Project 
● Compare at least 3 systems 
○ Apache Hama - general BSP framework 
○ Apache Giraph - Hadoop Map-only job, Facebook 
○ GPS - +dynamic repartitioning, +multi vertex-centric 
○ Signal/Collect - +edges, +async computations 
○ GraphLab 
○ Mizan
Our Project 
● Measure the runtime of at least two 
algorithms on each system 
○ PageRank 
■ Fixed number of supersteps = 30 
○ Single Source Shortest Path (SSSP) 
○ k-means clustering
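As a rough illustration of PageRank with a fixed number of supersteps, the sketch below runs 30 synchronous rounds on an in-memory graph. The damping factor of 0.85 and the function name `pagerank` are assumptions, not values or code from the benchmarked implementations, and rank held by vertices without out-edges is simply dropped here.

```python
def pagerank(out_edges, supersteps=30, damping=0.85):
    """Illustrative fixed-superstep PageRank (hypothetical helper).

    out_edges: dict mapping vertex -> list of target vertices.
    damping=0.85 is an assumed value, not taken from the slides.
    """
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    n = len(vertices)

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(supersteps):
        # Each vertex sends rank / out_degree along every out-edge.
        incoming = {v: 0.0 for v in vertices}
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                share = rank[v] / len(targets)
                for w in targets:
                    incoming[w] += share
        # Each vertex combines its messages into a new rank.
        rank = {v: (1.0 - damping) / n + damping * incoming[v] for v in vertices}
    return rank


cycle_ranks = pagerank({0: [1], 1: [2], 2: [0]})
```

Fixing the superstep count, rather than iterating to convergence, keeps the amount of work identical across systems for a given graph.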
Setup 
● Experiments on AWS 
○ Ubuntu 12.04 m1.medium EC2 instances 
■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network 
performance 
■ 8 GiB EBS volume per instance 
○ Cluster sizes: 
■ Single-node cluster 
■ 4-node cluster 
■ 8-node cluster
Setup 
● Experiments on AWS 
○ 5 runs per dataset per algorithm per cluster 
■ 35 runs per algorithm per cluster 
■ 70 runs per cluster 
■ 140 runs in total (single-node, 4-node) 
● TODO: another 70 runs (8-node)
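The run counts above follow from simple multiplication. A small sketch of the arithmetic, with variable names chosen here for illustration:

```python
runs_per_configuration = 5   # repetitions of each dataset/algorithm/cluster combination
datasets = 7
algorithms = 2               # PageRank and SSSP
clusters_completed = 2       # single-node and 4-node
clusters_remaining = 1       # 8-node

runs_per_algorithm_per_cluster = runs_per_configuration * datasets
runs_per_cluster = runs_per_algorithm_per_cluster * algorithms
runs_completed = runs_per_cluster * clusters_completed
runs_remaining = runs_per_cluster * clusters_remaining
```

The slide does not say whether these totals are per system or across all three systems; the sketch only reproduces the stated numbers.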
Setup 
● Dataset 
○ 7 datasets 
■ tinyEWD: 8 vertices 15 edges 
■ mediumEWD: 250 vertices 2,546 edges 
■ 1000EWD: 1,000 vertices 16,866 edges 
■ rome99: 3,353 vertices 8,870 edges 
■ 10000EWD: 10,000 vertices 123,462 edges 
■ NYC: 264,346 vertices 733,846 edges 
■ largeEWD: 1,000,000 vertices 15,172,126 edges 
○ Source: http://algs4.cs.princeton.edu/44sp/
Setup 
● Systems 
○ Hama 
■ Hadoop 1.0.3 
■ Hama 0.6.3 
○ Giraph 
■ Hadoop 0.20.203rc1 
■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a) 
○ GPS 
■ Hadoop 0.20.203rc1 
■ GPS (trunk@Revision 112)
Setup 
● Input Graph 
○ Source files converted into format suitable for each 
system 
■ Time for this conversion excluded from results: 
● Conversion done before algorithms are run (pre-processing?) 
● Negligible for largeEWD (1,000,000 vertices, 15,172,126 
edges)
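The conversion step might look like the sketch below. It assumes the edge-weighted digraph layout used by the EWD files (vertex count, edge count, then one `v w weight` triple per edge) and writes one adjacency line per vertex. The output format `vertex<TAB>neighbor:weight ...` and the function name are hypothetical; each system in the comparison expects its own input format.

```python
def ewd_to_adjacency_lines(text):
    """Convert 'V, E, then v w weight triples' text into adjacency lines.

    The output layout 'v<TAB>w1:weight1 w2:weight2' is a made-up example
    format, not the actual input format of Hama, Giraph, or GPS.
    """
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    body = tokens[2:]
    if len(body) != 3 * num_edges:
        raise ValueError("edge count does not match the header")

    adjacency = {v: [] for v in range(num_vertices)}
    for i in range(0, len(body), 3):
        v, w, weight = int(body[i]), int(body[i + 1]), float(body[i + 2])
        adjacency[v].append((w, weight))

    # Emit every vertex, including those without out-edges.
    return [
        f"{v}\t" + " ".join(f"{w}:{weight}" for w, weight in neighbors)
        for v, neighbors in sorted(adjacency.items())
    ]


sample_lines = ewd_to_adjacency_lines("3\n2\n0 1 0.5\n1 2 0.25\n")
```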
Preliminary Results 
Average SSSP runtime on 4-node cluster (in seconds) 
Dataset      Hama       Giraph    GPS
tinyEWD      14.17      41.60     14.40
mediumEWD    16.36      44.00     36.00
1000EWD      18.06      48.80     46.60
rome99       22.95      66.00     50.00
10000EWD     25.32      67.40     55.00
NYC          165.01     267.00    310.00
largeEWD     6,109.20   602.80    618.70
Preliminary Results 
[Chart: SSSP runtime vs. graph size (num. vertices)]
Preliminary Results 
Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds) 
Dataset      Hama       Giraph     GPS
tinyEWD      29.36      49.40      58.57
mediumEWD    30.26      53.40      60.42
1000EWD      37.86      54.60      61.03
rome99       29.35      56.20      61.80
10000EWD     302.33     61.80      64.80
NYC          1,001.24   134.40     68.69
largeEWD     Failed     2,100.00   1,213.56
Preliminary Results 
[Chart: PageRank runtime vs. graph size (num. vertices)]
Preliminary Analysis 
● Each system has a resource-crunch point 
○ Runtime stays roughly flat below it, then climbs sharply 
● Hama does not scale well (degrades at ~10^4 vertices) 
● Giraph and GPS scale better 
● In general, PageRank runtime > SSSP runtime 
● GPS input reader does not guarantee true partitioning 
for large datasets 
● Which ‘knobs’ to keep constant? - Optimization vs. 
Comparability
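One rough way to locate the resource-crunch point in the 4-node results is to find the first dataset whose average runtime exceeds some multiple of the tinyEWD runtime. The sketch below uses the reported averages; the factor of 5 and the function name `crunch_point` are arbitrary choices made for illustration, not part of the original analysis.

```python
DATASETS = ["tinyEWD", "mediumEWD", "1000EWD", "rome99", "10000EWD", "NYC", "largeEWD"]

# Average runtimes in seconds on the 4-node cluster (None = run failed).
SSSP = {
    "Hama": [14.17, 16.36, 18.06, 22.95, 25.32, 165.01, 6109.20],
    "Giraph": [41.60, 44.00, 48.80, 66.00, 67.40, 267.00, 602.80],
    "GPS": [14.40, 36.00, 46.60, 50.00, 55.00, 310.00, 618.70],
}
PAGERANK = {
    "Hama": [29.36, 30.26, 37.86, 29.35, 302.33, 1001.24, None],
    "Giraph": [49.40, 53.40, 54.60, 56.20, 61.80, 134.40, 2100.00],
    "GPS": [58.57, 60.42, 61.03, 61.80, 64.80, 68.69, 1213.56],
}


def crunch_point(runtimes, factor=5.0):
    """Return the first dataset slower than factor x the smallest-graph runtime.

    factor=5.0 is an assumed threshold chosen only for this illustration.
    """
    baseline = runtimes[0]
    for name, seconds in zip(DATASETS, runtimes):
        if seconds is not None and seconds > factor * baseline:
            return name
    return None


sssp_points = {system: crunch_point(times) for system, times in SSSP.items()}
pagerank_points = {system: crunch_point(times) for system, times in PAGERANK.items()}
```

With this threshold, Hama's PageRank runtime crosses the line two datasets earlier than Giraph's or GPS's, which matches the scaling observation above.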
In-Progress 
● Output validation 
● Memory footprint 
● Network utilization (num. messages) 
● GraphLab and Signal/Collect 
● Green-Marl? 
○ (DSL) → [Compiler] → (Giraph, GPS)
Questions?
Extras
Preliminary Results 
Number of supersteps for SSSP 
Dataset      Hama   Giraph   GPS
tinyEWD      10     7        7
mediumEWD    16     13       18
1000EWD      27     25       23
rome99       105    102      18
10000EWD     85     80       64
NYC          671    905      438
largeEWD     806    670      730
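Dividing runtime by superstep count gives a rough per-superstep cost. The sketch below pairs the superstep counts with the 4-node SSSP averages; the slides do not state which cluster size the superstep counts come from, so that pairing is an assumption, and job startup time is folded into the quotient.

```python
DATASETS = ["tinyEWD", "mediumEWD", "1000EWD", "rome99", "10000EWD", "NYC", "largeEWD"]

# Average SSSP runtimes in seconds (4-node cluster).
SSSP_RUNTIME_S = {
    "Hama": [14.17, 16.36, 18.06, 22.95, 25.32, 165.01, 6109.20],
    "Giraph": [41.60, 44.00, 48.80, 66.00, 67.40, 267.00, 602.80],
    "GPS": [14.40, 36.00, 46.60, 50.00, 55.00, 310.00, 618.70],
}
# Reported SSSP superstep counts (cluster size assumed to match).
SSSP_SUPERSTEPS = {
    "Hama": [10, 16, 27, 105, 85, 671, 806],
    "Giraph": [7, 13, 25, 102, 80, 905, 670],
    "GPS": [7, 18, 23, 18, 64, 438, 730],
}


def seconds_per_superstep(runtimes, supersteps):
    """Map each dataset to its average wall-clock seconds per superstep."""
    return {name: t / s for name, t, s in zip(DATASETS, runtimes, supersteps)}


PER_STEP = {
    system: seconds_per_superstep(SSSP_RUNTIME_S[system], SSSP_SUPERSTEPS[system])
    for system in SSSP_RUNTIME_S
}
```

On largeEWD the three systems need a similar number of supersteps, so under this pairing the runtime gap comes mostly from per-superstep cost rather than from superstep count.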
Preliminary Results 
[Chart: Number of supersteps for SSSP]
Really, really Preliminary 
PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated 
Dataset      Native     Green-Marl generated
tinyEWD      58.57      60.20
mediumEWD    60.42      60.11
1000EWD      61.03      62.30
rome99       61.80      62.32
10000EWD     64.80      65.78
NYC          68.69      71.34
largeEWD     1,213.56   -
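The gap between native and Green-Marl generated code is easier to read as a relative overhead. A small sketch over the reported numbers follows; the helper name is hypothetical and the missing largeEWD measurement is kept as `None` rather than estimated.

```python
# PageRank runtimes on GPS in seconds.
NATIVE = {
    "tinyEWD": 58.57, "mediumEWD": 60.42, "1000EWD": 61.03, "rome99": 61.80,
    "10000EWD": 64.80, "NYC": 68.69, "largeEWD": 1213.56,
}
GREEN_MARL = {
    "tinyEWD": 60.20, "mediumEWD": 60.11, "1000EWD": 62.30, "rome99": 62.32,
    "10000EWD": 65.78, "NYC": 71.34, "largeEWD": None,  # not measured
}


def relative_overhead_percent(native, generated):
    """Percentage by which the generated code is slower (negative = faster)."""
    if generated is None:
        return None
    return 100.0 * (generated - native) / native


OVERHEAD = {
    name: relative_overhead_percent(NATIVE[name], GREEN_MARL[name])
    for name in NATIVE
}
measured = [value for value in OVERHEAD.values() if value is not None]
```

For the six measured datasets the difference stays below 4%, and on mediumEWD the generated code is marginally faster.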
Really, really Preliminary 
[Chart: PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated]
References 
● Our Project Proposal 
● http://algs4.cs.princeton.edu/44sp/ 
● https://github.com/apache/hadoop-common 
● https://github.com/apache/giraph 
● https://subversion.assembla.com/svn/phd-projects/gps/trunk/ 
● http://ppl.stanford.edu/main/green_marl.html
