Fishing Graphs in a Hadoop Data Lake
Max Neunhöffer
Munich, 6 April 2017
www.arangodb.com
What is a graph?
[Figure: an example graph with vertices A–F; inset: a plot of sin(x), a different kind of "graph"]
Social networks (edges are friendship)
Dependency chains
Computer networks
Citations
Hierarchies
Indeed any relation
Sometimes directed, sometimes undirected.
Usual approach: data in HDFS, use Spark/GraphFrames
from graphframes import GraphFrame

# spark is a SparkSession; vertices need an "id" column,
# edges "src" and "dst" columns:
v = spark.read.option("header", True).csv("hdfs://...")
e = spark.read.option("header", True).csv("hdfs://...")
g = GraphFrame(v, e)
g.inDegrees.show()
g.outDegrees.groupBy("outDegree").count().sort("outDegree").show(1000)
g.vertices.groupBy("GYEAR").count().sort("GYEAR").show()
# Motif query: two-step citation chains starting at patent 6009536:
g.find("(a)-[e]->(b);(b)-[ee]->(c)").filter("a.id = 6009536").count()
results = g.pageRank(resetProbability=0.01, maxIter=3)
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Want to bring down latency from minutes to seconds or from seconds
to milliseconds. Usually, we would like to run many of them.
Examples:
friends of friends of one person
find all immediate dependencies of one item
find all direct and indirect citations of one article
find all descendants of one member of a hierarchy
IDEA: Use a Graph Database
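The first example above, friends of friends, is a bounded two-step traversal. A minimal Python sketch over a hypothetical friendship adjacency map (the names and edges are invented for illustration):

```python
# Tiny, hypothetical undirected friendship graph as an adjacency dict.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "eve"},
    "dave": {"bob"},
    "eve": {"carol"},
}

def friends_of_friends(graph, person):
    """Vertices exactly two friendship hops away from person."""
    direct = graph[person]
    result = set()
    for f in direct:
        result |= graph[f]  # union of the friends' friend sets
    # Exclude the person themselves and their direct friends:
    return result - direct - {person}

print(friends_of_friends(friends, "alice"))  # {'dave', 'eve'}
```

A graph database answers exactly this kind of query without scanning the whole edge set, which is where the latency gain comes from.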
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is its ability to do graph queries.
Graph queries:
Find paths in graphs according to a pattern.
Find everything reachable from a vertex.
Find shortest paths between two given vertices.
=⇒ Graph Traversals
Crucial: The number of steps is a priori unknown!
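For instance, a shortest-path query is a breadth-first traversal whose depth is not known up front. A sketch in Python over a hypothetical directed graph:

```python
from collections import deque

def shortest_path(adj, source, target):
    """BFS: return one shortest path from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target not reachable

# Hypothetical directed graph:
adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_path(adj, "A", "E"))  # ['A', 'B', 'D', 'E']
```

The loop runs until the target is found or the frontier is exhausted; no fixed number of joins could express this, which is why traversals need first-class support.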
Graph Traversals
[Animation: step-by-step traversal of a graph with vertices A, B, C, D, E, F, G, H, J; the vertices are visited in the order A, B, C, E, D, J, F, G, H]
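The animation above can be sketched as a depth-first traversal. The adjacency below is hypothetical, chosen only so that the visit order reproduces the sequence shown on the slides:

```python
def dfs(adj, start, visited=None):
    """Recursive depth-first traversal; returns vertices in visit order."""
    if visited is None:
        visited = []
    visited.append(start)
    for nxt in adj.get(start, []):
        if nxt not in visited:
            dfs(adj, nxt, visited)
    return visited

# Hypothetical edges reproducing the slide's visit order.
adj = {"A": ["B", "H"], "B": ["C"], "C": ["E"], "E": ["D"],
       "D": ["J"], "J": ["F"], "F": ["G"]}
print("".join(dfs(adj, "A")))  # ABCEDJFGH
```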
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
with a common query language for all three data models.
Important:
Is able to compete with specialised products on their turf.
Allows for polyglot persistence using a single database technology.
In a microservice architecture, there will be several different deployments.
Powerful query language
AQL
The built-in Arango Query Language allows
complex, powerful and convenient queries,
with transaction semantics,
allowing joins,
and graph queries.
AQL is independent of the driver used and
offers protection against injections by design.
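As an illustration (this query is not from the slides; it reuses the patents/citations collections and the GYEAR attribute seen elsewhere in the talk, with a hypothetical year value), a join over the edge collection using a bind parameter, which is what rules out injections by construction:

```
FOR p IN patents
  FILTER p.GYEAR == @year          // bind parameter, never string concatenation
  FOR c IN citations
    FILTER c._from == p._id        // join on the edge collection
    RETURN { patent: p._key, cites: c._to }
```

The driver sends @year as a separate bind-variable value, so user input never becomes query text.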
ArangoDB is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
Cluster resource management is automatic.
This leads to significantly better resource utilization.
Fault tolerance, self-healing and automatic failover are guaranteed.
ArangoDB runs on Apache Mesos and Mesosphere DC/OS clusters.
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deployment
Resource management (increased utilization)
Service discovery
This lets us plug things together!
Consequence: We can easily deploy multiple systems alongside each other.
Example: HDFS, Spark and ArangoDB
Import data into ArangoDB
# Fetch the raw CSV files from HDFS:
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/citations.csv
# Deploy an ArangoDB cluster on DC/OS:
dcos package install arangodb3
# Create the graph (the next two lines are entered inside arangosh):
arangosh \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
var g = require("@arangodb/general-graph");
var G = g._create("G",[g._relation("citations",["patents"],["patents"])]);
# Bulk import the vertex and edge collections:
arangoimp --collection patents --file patents.csv --type csv \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
arangoimp --collection citations --file citations.csv --type csv \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
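A note on file shape (an assumption about the data layout, not shown on the slides): arangoimp takes attribute names from the CSV header, so for the edge collection the rows need _from and _to values referencing the patents collection, and patents.csv needs a column providing the document _key. A hypothetical citations.csv fragment (invented patent numbers):

```csv
_from,_to
patents/6009503,patents/5123456
patents/6009503,patents/5654321
```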
Run a graph traversal
This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
Recursive traversal, 500 results, 317 ms
FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G"
RETURN v
This one finds all patents that cite any of those cited by patents/6009503:
One step forward and one back, 35 results, 59 ms
FOR v IN 1..1 OUTBOUND "patents/6009503" GRAPH "G"
FOR w IN 1..1 INBOUND v._id GRAPH "G"
FILTER w._id != v._id
RETURN w
Run a graph traversal
This query finds all patents that cite patents/3541687 directly or in two steps:
Recursive traversal backwards, 22 results, 15 ms
FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G"
RETURN v._key
This one counts all patents that cite patents/3541687 recursively:
Deep recursion backwards, count 398, 311 ms
FOR v IN 1..10 INBOUND "patents/3541687" GRAPH "G"
COLLECT WITH COUNT INTO c
RETURN c
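The earlier "one step forward and one back" query corresponds to this plain-Python sketch over a hypothetical list of (citing, cited) pairs, which patents cite something that a given patent also cites:

```python
# Hypothetical citation edges: (citing, cited).
edges = [("p1", "p2"), ("p1", "p3"), ("p4", "p2"), ("p5", "p3"), ("p6", "p7")]

def cociting(edges, patent):
    """Patents that cite at least one patent cited by `patent`."""
    cited = {b for a, b in edges if a == patent}   # one step OUTBOUND
    citing = {a for a, b in edges if b in cited}   # one step INBOUND
    return citing - {patent}                       # FILTER w._id != v._id

print(sorted(cociting(edges, "p1")))  # ['p4', 'p5']
```

The AQL version pushes both steps into the database, so only the small result set crosses the network.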
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store
You can turn things around:
Keep and maintain the graph data in a graph database.
Regularly dump to HDFS and run larger analysis jobs there.
Or: Use ArangoDB’s Spark Connector:
https://github.com/arangodb/arangodb-spark-connector
Links
http://hadoop.apache.org/
http://spark.apache.org/
https://graphframes.github.io/
https://www.arangodb.com
https://github.com/arangodb/arangodb-spark-connector
https://docs.arangodb.com/cookbook/index.html
http://mesos.apache.org/
https://mesosphere.com/
https://github.com/dcos/demos