Handling Billions of Edges in a Graph Database

www.arangodb.com
Michael Hackstein
@mchacki
New Technology

Michael Hackstein
‣ ArangoDB Core Team
‣ Web Frontend
‣ Graph visualisation
‣ Graph features
‣ SmartGraphs
‣ Host of cologne.js
‣ Master’s Degree (spec. Databases and Information Systems)

What are Graph Databases
‣ Schema-free Objects (Vertices)
‣ Relations between them (Edges)
‣ Edges have a direction
‣ Edges can be queried in both directions
‣ Easily query a range of edges (2 to 5)
‣ Undefined number of edges (1 to *)
‣ Shortest Path between two vertices

(Figure: example graph. Vertices: { name: "alice", age: 32 }, { name: "bob", age: 35, size: "1.73m" }, { name: "fishing" }, { name: "reading" }, { name: "dancing" }. Edges: one "married" edge and four "hobby" edges.)
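The properties listed above (directed edges, lookups in both directions, shortest path) can be illustrated with a small in-memory sketch. This is only an illustration in Python, not how a graph database stores data. The slide does not say which hobby belongs to whom, so the edge list below is an assumption.

```python
from collections import deque

# Schema-free vertices: each one is just a document with arbitrary attributes.
vertices = {
    "alice": {"name": "alice", "age": 32},
    "bob": {"name": "bob", "age": 35},
    "fishing": {"name": "fishing"},
    "reading": {"name": "reading"},
    "dancing": {"name": "dancing"},
}

# Directed edges as (from, to, label). The assignment of hobbies is assumed.
edges = [
    ("alice", "bob", "married"),
    ("alice", "reading", "hobby"),
    ("alice", "dancing", "hobby"),
    ("bob", "fishing", "hobby"),
    ("bob", "reading", "hobby"),
]


def neighbors(vertex, direction="any"):
    """Return adjacent vertices, following edges outbound, inbound or in any direction."""
    result = []
    for source, target, _label in edges:
        if source == vertex and direction in ("outbound", "any"):
            result.append(target)
        if target == vertex and direction in ("inbound", "any"):
            result.append(source)
    return result


def shortest_path(start, goal):
    """Breadth-first search ignoring edge direction; returns a list of vertices or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for candidate in neighbors(path[-1]):
            if candidate not in seen:
                seen.add(candidate)
                queue.append(path + [candidate])
    return None
```

With this data, `neighbors("reading", "inbound")` lists everyone who has reading as a hobby, and `shortest_path("fishing", "dancing")` links the two hobbies via bob and alice.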

Typical Graph Queries
‣ Give me all friends of Alice
‣ Give me all friends-of-friends of Alice
‣ What is the linking path between Alice and Bob
‣ Which train stations can I reach if my ticket allows me to travel a distance of 6 stations
‣ Pattern Matching:
  ‣ Give me all users that share two hobbies with Alice
  ‣ Give me all products that at least one of my friends has bought together with the products I already own, ordered by how many friends have bought it and the product's rating, but only 20 of them.
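As an illustration of the pattern-matching query "users that share two hobbies with Alice", here is a minimal Python sketch. The data is hypothetical, and "two" is read as "at least two", which is an assumption.

```python
from collections import Counter

# Hypothetical (user, hobby) edges.
hobby_edges = [
    ("alice", "reading"), ("alice", "dancing"), ("alice", "fishing"),
    ("bob", "reading"), ("bob", "fishing"),
    ("carol", "dancing"),
]


def users_sharing_hobbies(user, minimum=2):
    """Return users that share at least `minimum` hobbies with `user`, sorted by name."""
    own_hobbies = {hobby for person, hobby in hobby_edges if person == user}
    shared = Counter(
        person
        for person, hobby in hobby_edges
        if person != user and hobby in own_hobbies
    )
    return sorted(person for person, count in shared.items() if count >= minimum)
```

In graph terms this walks two edges (user to hobby, hobby back to another user) and then counts how often each end vertex was reached.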

Non-Typical Graph Queries
‣ Give me all users that have an age attribute between 21 and 35.
‣ Give me the age distribution of all users
‣ Group all users by their name

Traversal
Iterate down two edges with some filters
‣ We first pick a start vertex (S)
‣ We collect all edges on S
‣ We apply filters on edges
‣ We iterate down one of the new vertices (A)
‣ The next vertex (E) is in desired depth. Return the path S -> A -> E
‣ Go back to the next unfinished vertex (B)
‣ We iterate down on (B)
‣ The next vertex (F) is in desired depth. Return the path S -> B -> F

(Figure: example graph with vertices S, A, B, C, D, E and F, built up step by step.)
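The walk-through above can be sketched as a depth-first traversal in Python. The exact topology and which edges the filters reject are not recoverable from the slides, so the graph below (S to A, B, C; A to D, E; B to F) and the `keep` flags are assumptions chosen to reproduce the two returned paths.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Outbound adjacency: vertex -> list of (target vertex, edge attributes).
# Topology and "keep" flags are assumptions made for illustration.
GRAPH: Dict[str, List[Tuple[str, dict]]] = {
    "S": [("A", {"keep": True}), ("B", {"keep": True}), ("C", {"keep": False})],
    "A": [("D", {"keep": False}), ("E", {"keep": True})],
    "B": [("F", {"keep": True})],
}


def traverse(
    graph: Dict[str, List[Tuple[str, dict]]],
    start: str,
    depth: int,
    edge_filter: Callable[[dict], bool],
) -> Iterator[List[str]]:
    """Depth-first traversal that yields every path with exactly `depth` edges."""

    def walk(path: List[str]) -> Iterator[List[str]]:
        if len(path) - 1 == depth:
            yield list(path)  # desired depth reached: return the path
            return
        for target, attrs in graph.get(path[-1], []):
            # Apply the edge filter and avoid visiting a vertex twice in one path.
            if edge_filter(attrs) and target not in path:
                path.append(target)
                yield from walk(path)
                path.pop()  # go back to the next unfinished vertex

    yield from walk([start])


paths = list(traverse(GRAPH, "S", 2, lambda attrs: attrs["keep"]))
```

Here `paths` contains S -> A -> E followed by S -> B -> F, in the same order as on the slides.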

Traversal - Complexity
‣ Once:
  ‣ Find the start vertex. Depends on indexes: Hash: O(1)
‣ For every depth:
  ‣ Find all connected edges. Edge-Index or Index-Free: O(1)
  ‣ Filter non-matching edges. Linear in edges: O(n)
  ‣ Find connected vertices. Depends on indexes: Hash: O(n * 1)
  ‣ Filter non-matching vertices. Linear in vertices: O(n)
‣ Only one pass: O(3n)

Traversal - Complexity
‣ Linear sounds evil?
‣ NOT linear in all edges, O(E)
‣ Only linear in relevant edges, n < E
‣ Traversals solely scale with their result size.
‣ They are not affected at all by the total amount of data
‣ BUT: Every depth increases the exponent: O(3 * n^d)
‣ "7 degrees of separation": 3 * n^6 < E < 3 * n^7
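To get a feeling for the growth of 3 * n^d, here is a small arithmetic sketch in Python. It assumes n is the number of relevant edges per vertex and treats the formula as a rough estimate, not a measured cost. The function names are made up for this example.

```python
def traversal_cost(n: int, depth: int) -> int:
    """Rough work estimate 3 * n**depth for a traversal of the given depth."""
    return 3 * n ** depth


def depth_covering(n: int, total_edges: int) -> int:
    """Smallest depth at which the estimate reaches the total edge count E."""
    if n < 2:
        raise ValueError("n must be at least 2 for the estimate to grow")
    depth = 1
    while traversal_cost(n, depth) < total_edges:
        depth += 1
    return depth
```

With n = 10, a depth-3 traversal is estimated at 3,000 steps regardless of how many edges the graph holds in total, while in a graph with one billion edges the estimate reaches E at depth 9.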

‣ MULTI-MODEL database
‣ Stores Documents and Graphs
‣ Query language AQL
‣ Document Queries
‣ Graph Queries
‣ Joins
‣ All can be combined in the same statement
‣ ACID support including Multi Collection Transactions

AQL
FOR user IN users
  RETURN user

AQL
FOR user IN users
  FILTER user.name == "alice"
  RETURN user

AQL
FOR user IN users
  FOR product IN OUTBOUND user has_bought
    RETURN product

(Figure: Alice -has_bought-> TV)

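AQL
FOR user IN users
  // walk exactly 3 has_bought edges in any direction:
  // user -> product -> other buyer -> recommended product
  FOR recommendation, action, path IN 3 ANY user has_bought
    // path.vertices[2] is the other buyer; keep buyers within 5 years of the user's age
    FILTER path.vertices[2].age <= user.age + 5
       AND path.vertices[2].age >= user.age - 5
    FILTER recommendation.price < 25
    LIMIT 10
    RETURN recommendation

(Figure: Alice -has_bought-> TV <-has_bought- Bob -has_bought-> Playstation, annotated with the filters alice.age - 5 <= bob.age && bob.age <= alice.age + 5 and playstation.price < 25.)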
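For readers without an ArangoDB instance at hand, the three-step recommendation query can be imitated in plain Python. This is only a sketch: the users, products and prices are hypothetical, the edge directions are hard-coded instead of using ANY, and an edge is never used twice within one path, which is assumed here to mirror the traversal's behaviour.

```python
# Hypothetical data; only the attribute names age and price come from the query.
users = {"alice": {"age": 32}, "bob": {"age": 35}, "carl": {"age": 60}}
products = {"tv": {"price": 400}, "playstation": {"price": 20}, "radio": {"price": 15}}
has_bought = [
    ("alice", "tv"),
    ("bob", "tv"),
    ("bob", "playstation"),
    ("carl", "tv"),
    ("carl", "radio"),
]


def recommend(user, limit=10):
    """Products under 25 bought by similar-aged users who share a purchase with `user`."""
    results = []
    for buyer, owned in has_bought:  # step 1: user -> product
        if buyer != user:
            continue
        for other, shared in has_bought:  # step 2: product -> other buyer
            if shared != owned or other == user:
                continue
            if abs(users[other]["age"] - users[user]["age"]) > 5:
                continue  # other buyer is not within 5 years of the user's age
            for other_again, candidate in has_bought:  # step 3: other buyer -> product
                if other_again != other or candidate == owned:
                    continue
                if products[candidate]["price"] < 25:
                    results.append(candidate)
    return results[:limit]
```

For this data `recommend("alice")` yields the playstation: bob is close enough in age, carl is not.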

First Boost - Vertex Centric Indices
‣ Remember Complexity? O(3 * n^d)
‣ Filtering of non-matching edges is linear for every depth
‣ Index all edges based on their vertices and arbitrary other attributes
‣ Find initial set of edges in identical time
‣ Less / No post-filtering required
‣ This decreases the n
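The idea can be sketched in Python with two dictionaries: a plain edge index keyed by the source vertex, and a vertex-centric index keyed by the source vertex plus one edge attribute. This is an illustration of the concept only, not ArangoDB's implementation. The data and the attribute used (edge type) are assumptions.

```python
from collections import defaultdict

# Hypothetical super-node: "alice" has 1000 has_bought edges and 2 friend edges.
edges = [("alice", f"product{i}", "has_bought") for i in range(1000)]
edges += [("alice", "bob", "friend"), ("alice", "carol", "friend")]

edge_index = defaultdict(list)      # key: source vertex
vertex_centric = defaultdict(list)  # key: (source vertex, edge type)
for edge in edges:
    source, _target, edge_type = edge
    edge_index[source].append(edge)
    vertex_centric[(source, edge_type)].append(edge)


def lookup_with_post_filter(vertex, edge_type):
    """Fetch all edges of the vertex, then filter. Returns (matches, edges inspected)."""
    candidates = edge_index.get(vertex, [])
    matches = [edge for edge in candidates if edge[2] == edge_type]
    return matches, len(candidates)


def lookup_vertex_centric(vertex, edge_type):
    """Fetch only matching edges via the combined key. Returns (matches, edges inspected)."""
    matches = vertex_centric.get((vertex, edge_type), [])
    return list(matches), len(matches)
```

Both lookups return the same two friend edges, but the first one inspects 1002 edges and the second only 2. That smaller candidate set is the decreased n.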

Demo Time
Vertex-Centric Indices

Scaling
‣ Vertex-Centric Indexes help with super-nodes
‣ But what if the graph is too large for one machine?
‣ Distribute graph on several machines (sharding)
‣ How to query it now?
‣ No global view of the graph possible any more
‣ What about edges between servers?

First let's do the cluster thingy

Is Mesosphere required?
‣ ArangoDB can run clusters without it
‣ Setup requires manual effort (can be scripted):
  ‣ Configure IP addresses
  ‣ Correct startup ordering
‣ This works:
  ‣ Automatic Failover (Follower takes over if leader dies)
  ‣ Rebalancing of shards
  ‣ Everything inside of ArangoDB
‣ This is based on Mesos:
  ‣ Complete self healing
  ‣ Automatic restart of ArangoDBs (on new machines)
➡ We suggest you have someone on call

Dangers of Sharding
‣ Only parts of the graph on every machine
‣ Neighboring vertices may be on different machines
‣ Even edges could be on other machines than their vertices
‣ Queries need to be executed in a distributed way
‣ Result needs to be merged locally

Random Distribution
‣ Disadvantages:
  ‣ Neighbors on different machines
  ‣ Probably edges on other machines than their vertices
  ‣ A lot of network overhead is required for querying
‣ Advantages:
  ‣ Every server takes an equal portion of data
  ‣ Easy to realize
  ‣ No knowledge about data required
  ‣ Always works
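A minimal Python sketch of random (hash-based) distribution follows. It assigns each vertex to a shard by hashing its key and measures how many edges end up crossing shard boundaries. The CRC32 hash and the synthetic edge list are assumptions for illustration and say nothing about the hash function ArangoDB uses.

```python
import zlib


def shard_of(key: str, shards: int) -> int:
    """Assign a vertex key to a shard using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % shards


def cross_shard_fraction(edges, shards: int) -> float:
    """Fraction of edges whose two endpoints live on different shards."""
    if not edges:
        return 0.0
    crossing = sum(
        1 for source, target in edges
        if shard_of(source, shards) != shard_of(target, shards)
    )
    return crossing / len(edges)


# Synthetic graph: 100 vertices, one outgoing edge each.
sample_edges = [(f"v{i}", f"v{(i * 7 + 1) % 100}") for i in range(100)]
fraction = cross_shard_fraction(sample_edges, 4)
```

The hash needs no knowledge about the data, which is the advantage named above. The price is that neighbors land on the same shard only by chance, so with more shards a growing share of edges requires a network hop.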

Index-Free Adjacency
‣ Used by most other graph databases
‣ Every vertex maintains two lists of its edges (IN and OUT)
‣ Do not use an index to find edges
‣ How to shard this?

‣ ArangoDB uses a hash-based EdgeIndex (O(1) lookup)
‣ The vertex is independent of its edges
‣ It can be stored on a different machine
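The difference can be sketched in Python: instead of every vertex carrying its own IN and OUT lists, a separate hash-based index maps vertex keys to edges. The class below is a simplified illustration, not ArangoDB's actual EdgeIndex. The class and method names are made up for this example, and the `_from` and `_to` attribute names follow ArangoDB's edge documents.

```python
from collections import defaultdict


class EdgeIndex:
    """Hash-based edge index: average O(1) lookup of edges by source or target key."""

    def __init__(self):
        self._by_from = defaultdict(list)
        self._by_to = defaultdict(list)

    def insert(self, edge: dict) -> None:
        # The edge only refers to vertices by key; it holds no vertex data.
        self._by_from[edge["_from"]].append(edge)
        self._by_to[edge["_to"]].append(edge)

    def outbound(self, vertex_key: str) -> list:
        return list(self._by_from.get(vertex_key, []))

    def inbound(self, vertex_key: str) -> list:
        return list(self._by_to.get(vertex_key, []))


# Vertices are stored separately and know nothing about their edges,
# so in principle they could be stored on another machine than the index.
vertex_store = {"alice": {"age": 32}, "tv": {"price": 400}}

index = EdgeIndex()
index.insert({"_from": "alice", "_to": "tv", "type": "has_bought"})
```

Because the vertex documents contain no edge lists, moving a vertex to another shard does not require rewriting anything inside the vertex itself.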

Domain Based Distribution
‣ Many Graphs have a natural distribution
  ‣ By country/region for People
  ‣ By tags for Blogs
  ‣ By category for Products
‣ Most edges in same group
‣ Rare edges between groups
uses Domain Knowledge for short-cuts
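A minimal Python sketch of domain-based distribution, assuming people are sharded by a country attribute: the shard is derived from the domain value instead of the vertex key, so every edge inside one group stays on one machine. The data, the CRC32 hash and the function names are assumptions for illustration.

```python
import zlib

# Hypothetical people with their country as the domain attribute.
country = {"alice": "DE", "bob": "DE", "carol": "US", "dave": "US", "erin": "FR"}
friendships = [("alice", "bob"), ("carol", "dave"), ("alice", "carol"), ("erin", "bob")]


def domain_shard(person: str, shards: int) -> int:
    """Choose the shard from the person's country, not from the person's key."""
    return zlib.crc32(country[person].encode("utf-8")) % shards


def crossing_edges(edges, shards: int) -> list:
    """Edges whose endpoints end up on different shards."""
    return [
        (a, b) for a, b in edges
        if domain_shard(a, shards) != domain_shard(b, shards)
    ]


crossing = crossing_edges(friendships, 3)
```

Edges between people of the same country can never appear in `crossing`. Only the rare edges between groups may need a network hop.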

Benchmark Comparison
Source: https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/

Thank you
‣ Further questions?
‣ Follow us on twitter: @arangodb
‣ Join our slack: slack.arangodb.com
‣ Follow me on twitter/github: @mchacki
‣ Write me a mail: michael@arangodb.com
