Dato vs GraphX

DATO VS. SPARK GRAPHX
KEIRA ZHOU
OCT, 2015
Details: https://github.com/keiraqz/dato-vs-graphx

SETTINGS
• 1 master node and 3 worker nodes on AWS
• m4.large instances, each with 8 GB of RAM and 2 cores

DATO
• A graph-based, asynchronous, high performance, distributed
computation framework written in C++
• 30-day free trial, then a service fee
• Install GraphLab Create on the local machine and Dato
Distributed on a cluster

SPARK GRAPHX
• Comes with Spark
import org.apache.spark._
import org.apache.spark.graphx._
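
With those imports in place, loading a SNAP edge list could look roughly like the sketch below. The file path and the object name are assumptions for illustration, not taken from the experiments; the master URL is expected to come from spark-submit.

```scala
import org.apache.spark._
import org.apache.spark.graphx._

object LoadGraphSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical path to an edge-list file with one "srcId dstId" pair per line.
    val path = if (args.nonEmpty) args(0) else "data/facebook_combined.txt"

    // No master is set here; it is assumed to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("LoadGraphSketch"))

    // GraphLoader skips comment lines starting with '#' and sets every
    // vertex and edge attribute to 1.
    val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, path)

    println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
    sc.stop()
  }
}
```

The node and edge counts printed this way can be checked against the figures listed for each dataset on the SNAP website.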

EXPERIMENTS
• Graph Algorithms
• Triangle-counting
• PageRank
• Connected Components
• Datasets: Stanford Large Network Dataset Collection (SNAP)
• Facebook:
• Nodes: 4039 | Edges: 88234 | Number of triangles: 1612010
• YouTube:
• Pokec:
• LiveJournal:

EXPERIMENTS (CONT’D)
• Default settings
• Dato:
• GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY = 4G
• GraphX
• Start with executor memory = 1G
• Increased to 2G later
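
For the GraphX side, one way to set the executor memory is through SparkConf, as in the minimal sketch below. The application and object names are hypothetical; the same value can also be passed to spark-submit with --executor-memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorMemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("GraphXBenchmark")       // hypothetical application name
      .set("spark.executor.memory", "2g")  // the first runs used "1g"

    val sc = new SparkContext(conf)

    // Print the value the context is actually using.
    println(sc.getConf.get("spark.executor.memory"))
    sc.stop()
  }
}
```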

RESULTS
• Triangle Counting: both Dato and GraphX (when it finishes the job) return the
correct answer, as listed on the SNAP website.
• For the Pokec and LiveJournal datasets, GraphX has trouble finishing the
computation
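
A triangle count in GraphX might be run along the lines of the sketch below. It is not the benchmark code itself: the object name and the input path argument are assumptions. In Spark 1.x, triangleCount expects edges in canonical orientation (srcId < dstId) and a graph that has been partitioned with partitionBy.

```scala
import org.apache.spark._
import org.apache.spark.graphx._

object TriangleCountSketch {
  def main(args: Array[String]): Unit = {
    val path = args(0) // edge-list file, passed on the command line
    val sc = new SparkContext(new SparkConf().setAppName("TriangleCountSketch"))

    // Third argument: canonicalOrientation = true, so every edge has srcId < dstId.
    val graph = GraphLoader.edgeListFile(sc, path, true)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Each vertex gets the number of triangles that pass through it.
    val perVertex = graph.triangleCount().vertices

    // Every triangle is counted once at each of its three vertices, so divide by 3.
    val total = perVertex.map { case (_, count) => count.toLong }.reduce(_ + _) / 3

    println(s"total triangles = $total")
    sc.stop()
  }
}
```

The total can then be compared with the number of triangles that SNAP lists for the dataset.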

TAKE-AWAY FOR GRAPHX
• What I observed was that certain stages within the job kept
failing
• Each task in a Spark stage operates on one partition of the RDD at a
time (and loads the data in that partition into memory)
• Potential solutions
• Increase the executor memory
• Increase the number of partitions of the RDD so that each
task processes a smaller amount of data
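
The partitioning suggestion could be tried when the graph is loaded, as sketched below. The partition count of 64 is an arbitrary illustrative value, not a setting from these experiments, and the object name is hypothetical. The fourth argument of edgeListFile is the requested number of edge partitions; it is passed positionally because the parameter name differs between Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object MorePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val path = args(0) // edge-list file, passed on the command line
    // Illustrative default only; tune to the dataset and cluster.
    val numParts = if (args.length > 1) args(1).toInt else 64

    val sc = new SparkContext(new SparkConf().setAppName("MorePartitionsSketch"))

    // Arguments: context, path, canonicalOrientation, requested edge partitions.
    val graph = GraphLoader.edgeListFile(sc, path, false, numParts)

    // Report how many edge partitions the loaded graph actually has.
    println(s"edge partitions = ${graph.edges.partitions.length}")
    sc.stop()
  }
}
```

More partitions mean less data per partition, at the cost of more scheduling and shuffle overhead, so the value needs tuning per dataset.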

RESULTS (CONT’D)
• PageRank: The threshold for PageRank is set to 0.001
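
On the GraphX side, a run with this threshold might look like the sketch below, assuming the 0.001 value corresponds to the tol argument of pageRank (the convergence tolerance). The object name and input path argument are hypothetical.

```scala
import org.apache.spark._
import org.apache.spark.graphx._

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val path = args(0) // edge-list file, passed on the command line
    val sc = new SparkContext(new SparkConf().setAppName("PageRankSketch"))

    val graph = GraphLoader.edgeListFile(sc, path)

    // Iterate until ranks change by less than the tolerance (assumed to be 0.001).
    val ranks = graph.pageRank(0.001).vertices

    // Show the five highest-ranked vertices.
    val top5 = ranks.top(5)(Ordering.by[(VertexId, Double), Double](_._2))
    top5.foreach { case (id, rank) => println(s"$id\t$rank") }

    sc.stop()
  }
}
```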

RESULTS (CONT’D)
• Connected Components
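
A connected-components run in GraphX could be sketched as follows; the object name and input path argument are assumptions. GraphX labels each vertex with the smallest vertex id in its component, so counting distinct labels gives the number of components.

```scala
import org.apache.spark._
import org.apache.spark.graphx._

object ConnectedComponentsSketch {
  def main(args: Array[String]): Unit = {
    val path = args(0) // edge-list file, passed on the command line
    val sc = new SparkContext(new SparkConf().setAppName("ConnectedComponentsSketch"))

    val graph = GraphLoader.edgeListFile(sc, path)

    // Pairs of (vertexId, componentId), where componentId is the lowest
    // vertex id found in that vertex's component.
    val components = graph.connectedComponents().vertices

    val numComponents = components.map(_._2).distinct().count()
    println(s"connected components = $numComponents")
    sc.stop()
  }
}
```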

CONCLUSIONS
• Both tools were set up quickly, without fine-tuning runtime
parameters
• Even so, Dato has a clear advantage over GraphX in execution
time when processing large-scale graph data
• However, GraphX is free while Dato charges a service fee after
the free trial.
• The goal of the GraphX project is to unify graph-parallel and data-
parallel computation in one system with a single composable API.
• Further experiments could compare the overall performance on
a specific task that combines graph algorithms with other
data-parallel computation

MORE DETAILS
• https://github.com/keiraqz/dato-vs-graphx

REFERENCES
• Dato:
• https://dato.com/
• Spark GraphX:
• https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html
• Stanford Large Network Dataset Collection (SNAP):
• https://snap.stanford.edu/data/
