DATO VS. SPARK GRAPHX
KEIRA ZHOU
OCT, 2015
Details: https://github.com/keiraqz/dato-vs-graphx
SETTINGS
• 1 master node and 3 worker nodes on AWS
• m4.large instances, each with 8 GB of RAM and 2 cores
DATO
• A graph-based, asynchronous, high performance, distributed
computation framework written in C++
• 30-day free trial, then a service fee
• Install GraphLab Create on the local machine and Dato
Distributed on a cluster
SPARK GRAPHX
• Comes with Spark
import org.apache.spark._
import org.apache.spark.graphx._
EXPERIMENTS
• Graph Algorithms
• Triangle-counting
• PageRank
• Connected Components
• Datasets: Stanford Large Network Dataset Collection (SNAP)
• Facebook:
• Nodes: 4039 | Edges: 88234 | Number of triangles: 1612010
• YouTube:
• Nodes: 1134890 | Edges: 2987624 | Number of triangles: 3056386
• Pokec:
• Nodes: 1632803 | Edges: 30622564 | Number of triangles: 32557458
• LiveJournal:
• Nodes: 3997962 | Edges: 34681189 | Number of triangles: 177820130
EXPERIMENTS (CONT’D)
• Default settings
• Dato:
• GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY = 4G
• GraphX
• Started with executor memory = 1G
• Changed to 2G later
RESULTS
• Triangle Counting: both Dato and GraphX (when it finishes the job) return the
correct answer as listed on the SNAP website.
• For the Pokec and LiveJournal data, GraphX has trouble finishing the
computation
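
To make the triangle counts above concrete, here is a minimal pure-Python sketch of triangle counting on a small undirected edge list. It is an illustration only, not the implementation used by Dato or GraphX; the function name `count_triangles` and the toy edge list are assumptions made for this example.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = {}
    unique_edges = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
        unique_edges.add((min(u, v), max(u, v)))  # drop duplicate/reversed edges

    total = 0
    for u, v in unique_edges:
        # every common neighbor of u and v closes one triangle on this edge
        total += len(neighbors[u] & neighbors[v])
    return total // 3  # each triangle was counted once per edge


# Hypothetical toy graph: one triangle (1, 2, 3) plus a dangling edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangles = count_triangles(toy_edges)
```

The neighbor-set intersection keeps the whole graph in memory on one machine, which is fine for a toy graph but is exactly the cost a distributed framework has to spread across nodes.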
TAKE-AWAY FOR GRAPHX
• What I observed was that certain stages within the job kept
failing
• A task in Spark operates on one partition of the RDD at a
time (and loads the data in that partition into memory)
• Potential Solutions
• Increase the executor memory
• Increase the number of partitions of the RDD so that each
task processes a smaller amount of data
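
The effect of the second suggestion can be sketched with a few lines of plain Python. This is not Spark code: the modulo-based placement and the name `largest_partition` are assumptions for illustration, and GraphX uses its own partitioning strategies. The point is only that more partitions means fewer edges have to be held in memory per unit of work.

```python
def largest_partition(edges, num_partitions):
    """Return the edge count of the fullest partition.

    Edges are placed by source vertex id modulo num_partitions,
    a stand-in for a real partitioner (assumption for this sketch).
    """
    sizes = [0] * num_partitions
    for src, _dst in edges:
        sizes[src % num_partitions] += 1
    return max(sizes)


# Hypothetical chain graph with 1000 edges.
chain = [(i, i + 1) for i in range(1000)]
with_4 = largest_partition(chain, 4)
with_16 = largest_partition(chain, 16)
```

Raising the partition count lowers the peak memory needed per partition, at the cost of more scheduling overhead.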
RESULTS (CONT’D)
• PageRank: The threshold for PageRank is set to 0.001
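
As a rough illustration of what a convergence threshold of 0.001 means, the sketch below runs PageRank by power iteration and stops once no rank changes by more than the threshold. This is a simplified textbook formulation, not the code used by Dato or GraphX; the damping factor of 0.85, the iteration cap, and the toy edge list are assumptions.

```python
def pagerank(edges, tol=0.001, damping=0.85, max_iter=100):
    """Power-iteration PageRank on directed (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(max_iter):
        # rank held by nodes without out-links is spread evenly
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        # largest per-node change in this iteration
        delta = max(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break  # converged under the threshold
    return rank


# Hypothetical toy graph: "c" receives links from both "a" and "b".
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")], tol=0.001)
```

A smaller threshold gives more accurate ranks but needs more iterations, so the threshold directly affects the measured execution time.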
RESULTS (CONT’D)
• Connected Components
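
For readers unfamiliar with the algorithm, the sketch below finds connected components by repeatedly propagating the smallest vertex id along edges until nothing changes. It is a single-machine illustration and not the code of either framework; the function name and the toy edge list are assumptions.

```python
def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component."""
    label = {}
    for u, v in edges:
        label.setdefault(u, u)  # each vertex starts as its own component
        label.setdefault(v, v)

    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low  # pull both ends to the lower label
                changed = True
    return label


# Hypothetical toy graph with two components: {1, 2, 3} and {4, 5}.
components = connected_components([(1, 2), (2, 3), (4, 5)])
```

The number of passes grows with the longest path in a component, which is one reason this workload stresses iterative frameworks on large graphs.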
CONCLUSIONS
• Both tools were set up quickly, without fine-tuning runtime
parameters, but
• Dato has a clear advantage over GraphX in execution
time when processing large-scale graph data
• However, GraphX is free, while Dato charges a service fee after
the free trial.
• The goal of the GraphX project is to unify graph-parallel and
data-parallel computation in one system with a single composable API.
• Further experiments can be done to compare the overall
performance of a specific task that contains both graph algorithms
and other data-parallel computation
MORE DETAILS
• https://github.com/keiraqz/dato-vs-graphx
REFERENCES
• Dato:
• https://dato.com/
• Spark GraphX:
• https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html
• Stanford Large Network Dataset Collection (SNAP):
• https://snap.stanford.edu/data/
