GRADOOP: Scalable Graph Analytics
with Apache Flink
Martin Junghanns @kc1s
Apache Flink and Neo4j Meetup Berlin
About the speaker and the team
André (PhD Student)
Martin (PhD Student)
Kevin (M.Sc. Student)
Niklas (M.Sc. Student)
Prof. Dr. Erhard Rahm (Database Chair)
Motivation
"Graphs are everywhere"
Graph = (Vertices, Edges)
Graph = (Users, Followers)
Graph = (Users, Friendships)
[Figure: example social network with the users Alice, Bob, Carol, Dave, Eve, Mallory, Peggy and Trent]
"Graphs are heterogeneous"
Graph = (Users ∪ Bands, Friendships ∪ Likes)
[Figure: the users Alice, Bob, Carol, Dave, Mallory and Peggy together with the bands AC/DC and Metallica]
"Graphs can be analyzed"
Graph = (Users ∪ Bands, Friendships ∪ Likes)
[Figure: the same graph (Alice, Bob, Carol, Dave, Mallory, Peggy, AC/DC, Metallica) annotated with numeric scores: 0.2, 0.28, 0.26, 0.33, 0.25, 0.26, 3.6, 2.82]
Assuming a social network (heterogeneous data), an analysis might proceed as follows. Each step places a requirement on the system:
1. Determine subgraph
• Apply graph transformation
2. Find communities
• Handle collections of graphs
3. Filter communities
• Aggregation, Selection
4. Find common subgraph
• Apply dedicated algorithm
"And let's not forget ..."
"... Graphs are large"
GRADOOP: "A framework and research platform for efficient, distributed and domain-independent management and analytics of heterogeneous graph data."
High Level Architecture
Components, in the order they are introduced:
• HDFS/YARN Cluster
• Apache HBase: Distributed Graph Store
• Apache Flink: Distributed Operator Execution
• Apache Flink: Operator Implementation
• Extended Property Graph Model (EPGM)
• Graph Analytical Language (GrALa)
Code base:
• Java 7
• 25K (33K) LOC
• GPLv3
Apache Flink Third-party library
[Figure: the Apache Flink stack. Deployment: Local, Cluster (e.g. YARN), Cloud (e.g. EC2). Core: Streaming Dataflow Runtime with the DataSet (Batch) and DataStream (Stream) APIs. Libraries and integrations on top: HadoopMR, Table, Gelly, ML, Zeppelin, Cascading, MRQL, Dataflow, Storm, SAMOA and GRADOOP. Data Storage: e.g. Files, HDFS, S3, JDBC, Kafka, …]
Extended Property Graph Model (EPGM)
Extended Property Graph Model
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
• Properties
[Figure: example graph with vertices 1 to 5 and edges 1 to 5. Vertices: Person (name: Alice, born: 1984), Band (name: Metallica, founded: 1981), Person (name: Bob), Person (name: Eve), Band (name: AC/DC, founded: 1973). Edges: four "likes" edges (since: 2014, 2013, 2015, 2014) and one "knows" edge. Logical graphs: 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock.]
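The model elements listed above can be pictured as plain Java classes. The following is only a minimal sketch: the class names, the `long` identifiers and the helper methods `props` and `graphs` are assumptions made for illustration, not GRADOOP's actual classes. It shows how identifier, type label and properties are shared by all elements, while vertices and edges additionally record the logical graphs they belong to.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: class and field names are assumptions, not GRADOOP's API.
class EpgmModelSketch {

    /** Shared by graph heads, vertices and edges: identifier, type label, properties. */
    static class Element {
        final long id;
        final String label;
        final Map<String, Object> properties;

        Element(long id, String label, Map<String, Object> properties) {
            this.id = id;
            this.label = label;
            this.properties = properties;
        }
    }

    /** A logical graph is described by its head; membership is stored on the elements. */
    static class GraphHead extends Element {
        GraphHead(long id, String label, Map<String, Object> properties) {
            super(id, label, properties);
        }
    }

    /** A vertex may belong to several logical graphs at once. */
    static class Vertex extends Element {
        final Set<Long> graphIds;

        Vertex(long id, String label, Map<String, Object> properties, Set<Long> graphIds) {
            super(id, label, properties);
            this.graphIds = graphIds;
        }
    }

    /** A directed edge from sourceId to targetId. */
    static class Edge extends Element {
        final long sourceId;
        final long targetId;
        final Set<Long> graphIds;

        Edge(long id, String label, long sourceId, long targetId,
             Map<String, Object> properties, Set<Long> graphIds) {
            super(id, label, properties);
            this.sourceId = sourceId;
            this.targetId = targetId;
            this.graphIds = graphIds;
        }
    }

    /** Hypothetical helper: builds a property map from alternating keys and values. */
    static Map<String, Object> props(Object... keyValues) {
        Map<String, Object> map = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            map.put((String) keyValues[i], keyValues[i + 1]);
        }
        return map;
    }

    /** Hypothetical helper: builds a set of logical graph identifiers. */
    static Set<Long> graphs(Long... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        GraphHead hardRock = new GraphHead(2L, "Community", props("interest", "Hard Rock"));
        Vertex bob = new Vertex(3L, "Person", props("name", "Bob"), graphs(1L, 2L));
        Vertex eve = new Vertex(5L, "Person", props("name", "Eve"), graphs(2L));
        Edge knows = new Edge(4L, "knows", bob.id, eve.id, props(), graphs(2L));

        System.out.println(bob.label + " " + bob.properties + " is in graphs " + bob.graphIds);
        System.out.println(knows.label + ": " + knows.sourceId + " -> " + knows.targetId
            + " in graph " + hardRock.id);
    }
}
```

Storing graph membership on the elements, rather than copying elements into each graph, is what lets Bob belong to both communities in the example.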
EPGM Operators
Basic Binary Operators
• Combination
• Overlap
• Exclusion
[Figure: two logical graphs 1 and 2 over the vertices 1 to 5; each operator takes both graphs as input and produces a new logical graph 3]
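The three binary operators can be read as set operations on the elements of two logical graphs. The sketch below is a simplification under stated assumptions: it works on vertex identifiers only (edges would be handled the same way), it is plain Java rather than Flink code, and the class and method names are hypothetical. The two input sets follow the example used in this talk, where vertex 3 belongs to both graphs.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: set semantics on vertex ids, not GRADOOP's implementation.
class BinaryOperatorSketch {

    /** Combination: all elements of both graphs. */
    static Set<Integer> combination(Set<Integer> first, Set<Integer> second) {
        Set<Integer> result = new TreeSet<>(first);
        result.addAll(second);
        return result;
    }

    /** Overlap: only the elements contained in both graphs. */
    static Set<Integer> overlap(Set<Integer> first, Set<Integer> second) {
        Set<Integer> result = new TreeSet<>(first);
        result.retainAll(second);
        return result;
    }

    /** Exclusion: the elements of the first graph that are not in the second graph. */
    static Set<Integer> exclusion(Set<Integer> first, Set<Integer> second) {
        Set<Integer> result = new TreeSet<>(first);
        result.removeAll(second);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> graph1 = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> graph2 = new TreeSet<>(Arrays.asList(3, 4, 5));

        System.out.println("Combination: " + combination(graph1, graph2)); // [1, 2, 3, 4, 5]
        System.out.println("Overlap:     " + overlap(graph1, graph2));     // [3]
        System.out.println("Exclusion:   " + exclusion(graph1, graph2));   // [1, 2]
    }
}
```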
Graph Aggregation
• A user-defined function (UDF) computes a value over a logical graph; the result is stored as a property of that graph.
[Figure: logical graph 3 with the vertices 1 to 5. A first UDF counts the vertices and yields 3 | vertexCount: 5. A second UDF combines the properties revenue:7000, expense:1000 and expense:1000 and yields 3 | profit: 5000.]
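The two aggregations from the figure can be sketched in plain Java. This is an assumption-laden illustration, not GRADOOP code: the `Graph` and `Vertex` classes, the `aggregate` method and the way revenue and expense are attached to vertices are all hypothetical. The arithmetic follows the slide: 7000 - 1000 - 1000 = 5000.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Illustrative sketch only: names and data layout are assumptions.
class AggregationSketch {

    /** A vertex reduced to one named numeric property (hypothetical simplification). */
    static class Vertex {
        final String key;
        final long value;

        Vertex(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /** A logical graph: its vertices plus graph-level properties. */
    static class Graph {
        final List<Vertex> vertices;
        final Map<String, Long> properties = new LinkedHashMap<>();

        Graph(List<Vertex> vertices) {
            this.vertices = vertices;
        }

        /** Evaluates the UDF on this graph and stores the result as a graph property. */
        Graph aggregate(String propertyKey, ToLongFunction<Graph> udf) {
            properties.put(propertyKey, udf.applyAsLong(this));
            return this;
        }
    }

    /** UDF 1: number of vertices. */
    static long vertexCount(Graph g) {
        return g.vertices.size();
    }

    /** UDF 2: sum of revenues minus sum of expenses. */
    static long profit(Graph g) {
        long sum = 0;
        for (Vertex v : g.vertices) {
            if (v.key.equals("revenue")) {
                sum += v.value;
            } else if (v.key.equals("expense")) {
                sum -= v.value;
            }
        }
        return sum;
    }

    static Graph example() {
        // Five vertices as in the figure; "other" marks vertices without financial data.
        return new Graph(Arrays.asList(
            new Vertex("revenue", 7000), new Vertex("expense", 1000),
            new Vertex("expense", 1000), new Vertex("other", 0), new Vertex("other", 0)));
    }

    public static void main(String[] args) {
        Graph g = example()
            .aggregate("vertexCount", AggregationSketch::vertexCount)
            .aggregate("profit", AggregationSketch::profit);
        System.out.println(g.properties); // {vertexCount=5, profit=5000}
    }
}
```

For brevity the sketch updates the graph in place; an operator that returns a new logical graph would copy the graph head instead.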
Graph Transformation
[Figure: a UDF rewrites labels and properties of logical graph 3 (vertices 1 to 5). Input: graph head 3 | vertexCount: 5 with the vertex properties name:Alice and f_name:Bob. Output: graph head 3 | Community | vCount: 5 with the vertex properties f_name:Alice and f_name:Bob.]
Subgraph Extraction
[Figure: UDFs decide which elements of logical graph 3 (vertices 1 to 5) are kept. Three UDF applications are shown; each produces a new logical graph that contains a subset of the input vertices and edges.]
Graph Pattern Matching
[Figure: a pattern is matched against logical graph 3 (vertices 1 to 5). The matches are returned as a Graph Collection containing the new logical graphs 4 and 5.]
Graph Grouping
[Figure: the vertices 1 to 5 of logical graph 3 are grouped by Keys into the vertices 6 and 7 of a new logical graph 4. With "+Aggregate", the input elements carry a property a (a:23, a:84, a:42, a:12, a:13, a:21) and the grouped graph shows count:2, count:2, max(a):42, max(a):84, max(a):13 and max(a):21.]
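Grouping with an aggregate can be sketched as a group-by over the vertices. The sketch is hypothetical throughout: the grouping keys "x" and "y", the assignment of the values of `a` to the groups, and all class names are assumptions chosen so that the result reproduces the counts and maxima shown for the two grouped vertices. Grouping of edges is omitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: keys, data assignment and names are assumptions.
class GroupingSketch {

    static class Vertex {
        final String groupKey;
        final int a;

        Vertex(String groupKey, int a) {
            this.groupKey = groupKey;
            this.a = a;
        }
    }

    /** One vertex of the grouped graph, carrying the aggregates of its group. */
    static class SuperVertex {
        int count = 0;
        int maxA = Integer.MIN_VALUE;

        @Override
        public String toString() {
            return "count:" + count + " max(a):" + maxA;
        }
    }

    /** Groups vertices by key and computes count and max(a) per group. */
    static Map<String, SuperVertex> groupBy(List<Vertex> vertices) {
        Map<String, SuperVertex> groups = new TreeMap<>();
        for (Vertex v : vertices) {
            SuperVertex s = groups.computeIfAbsent(v.groupKey, k -> new SuperVertex());
            s.count++;
            s.maxA = Math.max(s.maxA, v.a);
        }
        return groups;
    }

    static List<Vertex> example() {
        return Arrays.asList(
            new Vertex("x", 23), new Vertex("x", 42),
            new Vertex("y", 84), new Vertex("y", 12));
    }

    public static void main(String[] args) {
        // prints {x=count:2 max(a):42, y=count:2 max(a):84}
        System.out.println(groupBy(example()));
    }
}
```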
Apply (e.g. Aggregation)
[Figure: an Operator is applied to every logical graph of a collection. Input: the logical graphs 1, 2 and 3 over the vertices 0 to 12, with the properties revenue:7000, expense:1000, expense:1000, revenue:2000, revenue:4000, expense:3000, expense:1000. Output: 1 | profit: 5000, 2 | profit: -1000, 3 | profit: 3000.]
Selection
[Figure: a UDF predicate (profit > 0) is evaluated for every logical graph of the collection 1 | profit: 5000, 2 | profit: -1000, 3 | profit: 3000. The result contains only 1 | profit: 5000 and 3 | profit: 3000.]
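Apply and Selection together can be sketched on the three graphs of the example. This is plain Java under assumptions: the `Graph` class and the methods `applyProfit` and `select` are hypothetical names, and the revenue and expense values are assigned to the graphs so that they reproduce the profits shown (5000, -1000, 3000).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToLongFunction;

// Illustrative sketch only: names are assumptions, not GRADOOP's API.
class ApplySelectSketch {

    static class Graph {
        final int id;
        final List<Long> revenues;
        final List<Long> expenses;
        long profit; // filled in by applyProfit

        Graph(int id, List<Long> revenues, List<Long> expenses) {
            this.id = id;
            this.revenues = revenues;
            this.expenses = expenses;
        }
    }

    /** Aggregate function: sum of revenues minus sum of expenses. */
    static long profit(Graph g) {
        long result = 0;
        for (long r : g.revenues) {
            result += r;
        }
        for (long e : g.expenses) {
            result -= e;
        }
        return result;
    }

    /** Apply: evaluates the aggregate on every graph of the collection. */
    static void applyProfit(List<Graph> collection, ToLongFunction<Graph> aggregate) {
        for (Graph g : collection) {
            g.profit = aggregate.applyAsLong(g);
        }
    }

    /** Selection: keeps the graphs for which the predicate holds. */
    static List<Graph> select(List<Graph> collection, Predicate<Graph> predicate) {
        List<Graph> result = new ArrayList<>();
        for (Graph g : collection) {
            if (predicate.test(g)) {
                result.add(g);
            }
        }
        return result;
    }

    static List<Graph> example() {
        return Arrays.asList(
            new Graph(1, Arrays.asList(7000L), Arrays.asList(1000L, 1000L)),
            new Graph(2, Arrays.asList(2000L), Arrays.asList(3000L)),
            new Graph(3, Arrays.asList(4000L), Arrays.asList(1000L)));
    }

    public static void main(String[] args) {
        List<Graph> collection = example();
        applyProfit(collection, ApplySelectSketch::profit);
        // prints "1 | profit: 5000" and "3 | profit: 3000"
        for (Graph g : select(collection, g -> g.profit > 0)) {
            System.out.println(g.id + " | profit: " + g.profit);
        }
    }
}
```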
Call (e.g. Clustering)
[Figure: an Algorithm is called on logical graph 1 (vertices 0 to 12) and returns a collection with the new logical graphs 2, 3 and 4.]
Call (e.g. PageRank)
[Figure: an Algorithm is called on logical graph 1 (vertices 0 to 12) and returns a new logical graph 2 in which every vertex carries a rank property, with values ranging from rank:0.11 to rank:5.12.]
EPGM Operators Overview
Operators on a LogicalGraph
• Unary: Aggregation, Pattern Matching, Transformation, Grouping, Subgraph, Call
• Binary: Combination, Overlap, Exclusion, Equality
Operators on a GraphCollection
• Unary: Limit, Selection, Distinct, Sort, Apply, Reduce, Call
• Binary: Union, Intersection, Difference, Equality
Algorithms
• Flink Gelly Library
• BTG Extraction
• Frequent Subgraphs
• Adaptive Partitioning
EPGM on Apache Flink
Flink DataSet API
• DataSet := Distributed Collection of Data Objects
• Transformation := Operation on DataSets
• Flink Program := Composition of Transformations
[Figure: DataSets are passed through Transformations, which produce new DataSets; the resulting dataflow forms the Flink Program]
Graph Representation
Each EPGM element type is a POJO that is kept in its own DataSet:
• EPGMGraphHead (POJO): Id, Label, Properties; stored in DataSet<EPGMGraphHead>
• EPGMVertex (POJO): Id, Label, Properties, Graphs; stored in DataSet<EPGMVertex>
• EPGMEdge (POJO): Id, Label, Properties, SourceId, TargetId, Graphs; stored in DataSet<EPGMEdge>
Field types:
• Id: GradoopId := UUID (128-bit)
• Label: String
• Properties: PropertyList := List<Property>
  Property := (String, PropertyValue)
  PropertyValue := byte[]
• Graphs: GradoopIdSet := Set<GradoopId>
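The two low-level types named above, a 128-bit UUID as identifier and a `byte[]` as property value, can be illustrated with the JDK alone. The sketch below is not GRADOOP's encoding: the one-byte type tag, the supported types and all method names are assumptions. It only shows that a UUID fits into 16 bytes and how typed values could be packed into and read back from a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative sketch only: the byte layout is an assumption, not GRADOOP's format.
class IdAndPropertySketch {

    static final byte TYPE_INT = 1;
    static final byte TYPE_STRING = 2;

    /** A UUID consists of two 64-bit halves, i.e. 128 bits or 16 bytes. */
    static byte[] idToBytes(UUID id) {
        return ByteBuffer.allocate(16)
            .putLong(id.getMostSignificantBits())
            .putLong(id.getLeastSignificantBits())
            .array();
    }

    static UUID idFromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    /** Encodes an int as one type byte followed by four value bytes. */
    static byte[] encode(int value) {
        return ByteBuffer.allocate(5).put(TYPE_INT).putInt(value).array();
    }

    /** Encodes a String as one type byte followed by its UTF-8 bytes. */
    static byte[] encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + utf8.length).put(TYPE_STRING).put(utf8).array();
    }

    /** Reads the type byte and decodes the remaining bytes accordingly. */
    static Object decode(byte[] raw) {
        ByteBuffer buffer = ByteBuffer.wrap(raw);
        byte type = buffer.get();
        if (type == TYPE_INT) {
            return buffer.getInt();
        }
        if (type == TYPE_STRING) {
            return new String(raw, 1, raw.length - 1, StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("Unknown type tag: " + type);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println("id bytes: " + idToBytes(id).length);            // 16
        System.out.println("round trip ok: " + idFromBytes(idToBytes(id)).equals(id));
        System.out.println("born: " + decode(encode(1984)));                // 1984
        System.out.println("name: " + decode(encode("Alice")));             // Alice
    }
}
```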
Graph Representation
[Figure: the example graph with the logical graphs 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock, stored as three DataSets]
DataSet<EPGMGraphHead>
Id Label Properties
1 Community {interest:Heavy Metal}
2 Community {interest:Hard Rock}
DataSet<EPGMVertex>
Id Label Properties Graphs
1 Person {name:Alice, born:1984} {1}
2 Band {name:Metallica,founded:1981} {1}
3 Person {name:Bob} {1,2}
4 Band {name:AC/DC,founded:1973} {2}
5 Person {name:Eve} {2}
DataSet<EPGMEdge>
Id Label Source Target Properties Graphs
1 likes 1 2 {since:2014} {1}
2 likes 3 2 {since:2013} {1}
3 likes 3 4 {since:2015} {2}
4 knows 3 5 {} {2}
5 likes 5 4 {since:2014} {2}
Flink DataSet Transformations
Hadoop-like Transformations
• map
• flatMap
• mapPartition
• reduce
• reduceGroup
• coGroup
Special Flink Operations
• iterate
• iterateDelta
SQL-like Transformations
• filter
• project
• cross
• union
• distinct
• first-N (limit)
• groupBy
• aggregate
• join
• leftOuterJoin
• rightOuterJoin
• fullOuterJoin
Operator Implementation
[Figure: the example graph with the logical graphs 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock; Exclusion is applied to graph 1 and graph 2]
Exclusion
// input: firstGraph (G[1]), secondGraph (G[2])

// identifier of the graph whose elements are excluded
DataSet<GradoopId> graphId = secondGraph.getGraphHead()
    .map(new Id<G>());

// keep the vertices of the first graph that are not contained in the second graph
DataSet<V> newVertices = firstGraph.getVertices()
    .filter(new NotInGraphBroadCast<V>())
    .withBroadcastSet(graphId, GRAPH_ID);

// keep the edges that are not contained in the second graph
// and whose source and target vertices both remain
DataSet<E> newEdges = firstGraph.getEdges()
    .filter(new NotInGraphBroadCast<E>())
    .withBroadcastSet(graphId, GRAPH_ID)
    .join(newVertices)
    .where(new SourceId<E>()).equalTo(new Id<V>())
    .with(new LeftSide<E, V>())
    .join(newVertices)
    .where(new TargetId<E>()).equalTo(new Id<V>())
    .with(new LeftSide<E, V>());
Operator Implementation – Exclusion (vertices)
graphId = secondGraph.getGraphHead()
Id Label Properties
2 Community {interest:Hard Rock}
.map(new Id<G>());
Id
2
newVertices = firstGraph.getVertices()
Id Label Properties Graphs
1 Person {name:Alice} {1}
2 Band {name:Metallica,founded:1981} {1}
3 Person {name:Bob} {1,2}
.filter(new NotInGraphBroadCast<V>())
.withBroadcastSet(graphId, GRAPH_ID);
Id Label Properties Graphs
1 Person {name:Alice} {1}
2 Band {name:Metallica,founded:1981} {1}
Operator Implementation – Exclusion (edges)
newEdges = firstGraph.getEdges()
Id Label Source Target Properties Graphs
1 likes 1 2 {since:2014} {1}
2 likes 3 2 {since:2013} {1}
.filter(new NotInGraphBroadCast<E>())
.withBroadcastSet(graphId, GRAPH_ID)
Id Label Source Target Properties Graphs
1 likes 1 2 {since:2014} {1}
2 likes 3 2 {since:2013} {1}
.join(newVertices)
.where(new SourceId<E>()).equalTo(new Id<V>())
Id Label Source Target … Id Label …
1 likes 1 2 … 1 Person …
.with(new LeftSide<E, V>())
Id Label Source Target …
1 likes 1 2 …
.join(newVertices)
.where(new TargetId<E>()).equalTo(new Id<V>())
Id Label Source Target … Id Label …
1 likes 1 2 … 2 Band …
.with(new LeftSide<E, V>());
Id Label Source Target …
1 likes 1 2 …
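The walkthrough above can be replayed without a cluster. The following sketch mirrors the filter and join steps with plain Java collections instead of Flink DataSets; the classes `Vertex` and `Edge` and the methods `excludeVertices` and `excludeEdges` are hypothetical stand-ins, and the two joins are replaced by a lookup in the set of remaining vertex ids. The input rows are the vertices and edges of graph 1 from the tables.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: plain collections instead of Flink DataSets.
class ExclusionSketch {

    static class Vertex {
        final int id;
        final Set<Integer> graphs;

        Vertex(int id, Integer... graphs) {
            this.id = id;
            this.graphs = new HashSet<>(Arrays.asList(graphs));
        }
    }

    static class Edge {
        final int id;
        final int source;
        final int target;
        final Set<Integer> graphs;

        Edge(int id, int source, int target, Integer... graphs) {
            this.id = id;
            this.source = source;
            this.target = target;
            this.graphs = new HashSet<>(Arrays.asList(graphs));
        }
    }

    /** Keeps the vertices that are not contained in the second graph. */
    static List<Vertex> excludeVertices(List<Vertex> firstGraphVertices, int secondGraphId) {
        return firstGraphVertices.stream()
            .filter(v -> !v.graphs.contains(secondGraphId))
            .collect(Collectors.toList());
    }

    /** Keeps the edges not in the second graph whose source and target both remain. */
    static List<Edge> excludeEdges(List<Edge> firstGraphEdges, List<Vertex> newVertices,
                                   int secondGraphId) {
        Set<Integer> remaining = newVertices.stream().map(v -> v.id).collect(Collectors.toSet());
        return firstGraphEdges.stream()
            .filter(e -> !e.graphs.contains(secondGraphId))
            .filter(e -> remaining.contains(e.source)) // stands in for the join on SourceId
            .filter(e -> remaining.contains(e.target)) // stands in for the join on TargetId
            .collect(Collectors.toList());
    }

    static List<Vertex> firstGraphVertices() {
        return Arrays.asList(new Vertex(1, 1), new Vertex(2, 1), new Vertex(3, 1, 2));
    }

    static List<Edge> firstGraphEdges() {
        return Arrays.asList(new Edge(1, 1, 2, 1), new Edge(2, 3, 2, 1));
    }

    public static void main(String[] args) {
        List<Vertex> newVertices = excludeVertices(firstGraphVertices(), 2);
        List<Edge> newEdges = excludeEdges(firstGraphEdges(), newVertices, 2);
        // prints "vertices: [1, 2]" and "edges: [1]"
        System.out.println("vertices: "
            + newVertices.stream().map(v -> v.id).collect(Collectors.toList()));
        System.out.println("edges: "
            + newEdges.stream().map(e -> e.id).collect(Collectors.toList()));
    }
}
```

Edge 2 passes the graph filter but is dropped afterwards because its source, vertex 3, also belongs to graph 2 and was removed from the vertex set.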
GrALa API
class LogicalGraph<G extends EPGMGraphHead,
V extends EPGMVertex,
E extends EPGMEdge> {
fromCollections(...) : LogicalGraph<G, V, E>
fromDataSets(...) : LogicalGraph<G, V, E>
fromGellyGraph(...) : LogicalGraph<G, V, E>
getGraphHead() : DataSet<G>
getVertices() : DataSet<V>
getEdges() : DataSet<E>
aggregate(...) : LogicalGraph<G, V, E>
match(...) : GraphCollection<G, V, E>
groupBy(...) : LogicalGraph<G, V, E>
subgraph(...) : LogicalGraph<G, V, E>
combine(...) : LogicalGraph<G, V, E>
// ...
}
class GraphCollection<G extends EPGMGraphHead,
                      V extends EPGMVertex,
                      E extends EPGMEdge> {
  fromCollections(...) : GraphCollection<G, V, E>
  fromDataSets(...)    : GraphCollection<G, V, E>
  getGraphHeads()      : DataSet<G>
  getVertices()        : DataSet<V>
  getEdges()           : DataSet<E>
  select(...)          : GraphCollection<G, V, E>
  distinct()           : GraphCollection<G, V, E>
  sortBy(...)          : GraphCollection<G, V, E>
  union(...)           : GraphCollection<G, V, E>
  difference(...)      : GraphCollection<G, V, E>
  // ...
}
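To give a feel for the set-style collection operators listed above (distinct, union, difference), here is a rough plain-Java sketch. It assumes that graphs in a collection are compared by their identifier, so a collection is modelled simply as a list of graph ids. The class and method names are hypothetical and are not the GrALa signatures, whose parameters are elided on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CollectionOperatorsSketch {

    /** Removes duplicate graph ids, keeping the first occurrence of each. */
    static List<Long> distinct(List<Long> collection) {
        return new ArrayList<>(new LinkedHashSet<>(collection));
    }

    /** Graph ids contained in either collection, without duplicates. */
    static List<Long> union(List<Long> left, List<Long> right) {
        Set<Long> result = new LinkedHashSet<>(left);
        result.addAll(right);
        return new ArrayList<>(result);
    }

    /** Graph ids contained in the left collection but not in the right one. */
    static List<Long> difference(List<Long> left, List<Long> right) {
        Set<Long> result = new LinkedHashSet<>(left);
        result.removeAll(right);
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<Long> a = Arrays.asList(1L, 2L, 2L, 3L);
        List<Long> b = Arrays.asList(3L, 4L);

        System.out.println(distinct(a));      // prints [1, 2, 3]
        System.out.println(union(a, b));      // prints [1, 2, 3, 4]
        System.out.println(difference(a, b)); // prints [1, 2]
    }
}
```

In the distributed setting the same idea would operate on DataSets of graph heads, vertices, and edges rather than on in-memory lists.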
class EPGMDatabase<G extends EPGMGraphHead,
                   V extends EPGMVertex,
                   E extends EPGMEdge> {
  fromCollections(...)   : EPGMDatabase<G, V, E>
  fromDataSets(...)      : EPGMDatabase<G, V, E>
  fromHBase(...)         : EPGMDatabase<G, V, E>
  fromJSON(...)          : EPGMDatabase<G, V, E>
  fromExternalGraph(...) : EPGMDatabase<G, V, E>
  writeAsJSON(...)       : void
  writeToHBase(...)      : void
  getDatabaseGraph()     : LogicalGraph<G, V, E>
  getGraphById(...)      : LogicalGraph<G, V, E>
  getGraphsById(...)     : GraphCollection<G, V, E>
  // ...
}
Performance
Social Network Benchmark
1. Extract the subgraph containing only Person vertices and knows relations
2. Transform Person vertices so that they keep only the necessary information
3. Find communities using Label Propagation
4. Aggregate the vertex count for each community
5. Select communities with more than 50K users
6. Combine the large communities into a single graph
7. Group the graph by the Persons' location and gender
8. Aggregate the vertex and edge counts of the grouped graph
http://www.ldbcouncil.org/
https://git.io/vgozj
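Steps 4 to 6 of the workflow (count vertices per community, select the large communities, combine them) can be sketched in plain Java as follows. This is a minimal illustration under assumptions, not the benchmark code: it takes a hypothetical map from vertex id to community id as the result of Label Propagation, and the class and method names are invented for this example. The threshold is a parameter; the talk uses 50K.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CommunityWorkflowSketch {

    /** Step 4: number of vertices per community id. */
    static Map<Long, Long> vertexCountPerCommunity(Map<Long, Long> communityOfVertex) {
        Map<Long, Long> counts = new TreeMap<>();
        for (Long community : communityOfVertex.values()) {
            counts.merge(community, 1L, Long::sum);
        }
        return counts;
    }

    /** Step 5: ids of communities with more than minExclusive vertices. */
    static Set<Long> selectLarge(Map<Long, Long> counts, long minExclusive) {
        Set<Long> selected = new TreeSet<>();
        for (Map.Entry<Long, Long> entry : counts.entrySet()) {
            if (entry.getValue() > minExclusive) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    /** Step 6: vertex ids of all selected communities, combined into one set. */
    static Set<Long> combine(Map<Long, Long> communityOfVertex, Set<Long> selected) {
        Set<Long> vertices = new TreeSet<>();
        for (Map.Entry<Long, Long> entry : communityOfVertex.entrySet()) {
            if (selected.contains(entry.getValue())) {
                vertices.add(entry.getKey());
            }
        }
        return vertices;
    }

    public static void main(String[] args) {
        // Hypothetical toy assignment: vertex id -> community id.
        Map<Long, Long> communityOfVertex = new HashMap<>();
        communityOfVertex.put(1L, 10L);
        communityOfVertex.put(2L, 10L);
        communityOfVertex.put(3L, 10L);
        communityOfVertex.put(4L, 20L);
        communityOfVertex.put(5L, 30L);
        communityOfVertex.put(6L, 30L);

        Map<Long, Long> counts = vertexCountPerCommunity(communityOfVertex);
        Set<Long> large = selectLarge(counts, 2); // toy threshold instead of 50K
        System.out.println(combine(communityOfVertex, large)); // prints [1, 2, 3]
    }
}
```

In the workflow itself these steps would correspond to the aggregate, select, and combine operators on logical graphs and graph collections, so the vertex counts would live as properties on graph heads rather than in a map.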
Social Network Benchmark
Dataset             | # Vertices | # Edges        | Disk size
Graphalytics.1      | 61,613     | 2,026,082      | 570 MB
Graphalytics.10     | 260,613    | 16,600,778     | 4.5 GB
Graphalytics.100    | 1,695,613  | 147,437,275    | 40.2 GB
Graphalytics.1000   | 12,775,613 | 1,363,747,260  | 372 GB
Graphalytics.10000  | 90,025,613 | 10,872,109,028 | 2.9 TB
• 16x Intel(R) Xeon(R) 2.50GHz 6 (12)
• 16x 48 GB RAM
• 1 Gigabit Ethernet
• Hadoop 2.6.0
• Flink 1.0-SNAPSHOT
• slots (per worker) 12
• jobmanager.heap.mb 2048
• taskmanager.heap.mb 40960
Social Network Benchmark – Runtime
[Figure: runtime in seconds (axis 0 to 1200) versus number of workers (1, 2, 4, 8, 16) for Graphalytics.100]
[Figure: speedup versus number of workers (1, 2, 4, 8, 16) for Graphalytics.100, plotted against linear speedup]
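For reading the speedup chart: the slides do not define speedup, but the usual definition is presumably the one intended. With T(n) denoting the runtime on n workers (a symbol introduced here, not taken from the talk), speedup and the linear reference line are:

```latex
S(n) = \frac{T(1)}{T(n)}, \qquad S_{\mathrm{linear}}(n) = n
```

A measured curve below the linear line means that doubling the number of workers reduces the runtime by less than half.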
Social Network Benchmark – Speedup
[Figure: runtime in seconds on a logarithmic axis (1 to 10,000)]
Social Network Benchmark – Datasets
Demo
https://github.com/s1ck/neo4j-gradoop-demos
Current State and Future Work
Current State – Operator Implementations
Operators
• LogicalGraph
– Unary: Aggregation, Pattern Matching, Transformation, Grouping, Subgraph, Call
– Binary: Combination, Overlap, Exclusion, Equality
• GraphCollection
– Unary: Limit, Selection, Distinct, Sort, Apply, Reduce, Call
– Binary: Union, Intersection, Difference, Equality
• Algorithms: Flink Gelly Library, BTG Extraction, Frequent Subgraphs, Adaptive Partitioning
Release History
• 0.0.1 First Prototype (May 2015)
– Hadoop MapReduce and Giraph for operator implementations
– Too much complexity
– Performance loss through serialization in HDFS/HBase
• 0.0.2 Using Flink as execution layer (June 2015)
– Basic operators
• 0.1 December 2015
– System-side identifiers (UUID)
– Improved property handling
– More operator implementations (e.g., Equality, Bool operators)
– Code refactoring
• 0.2-SNAPSHOT
– Graph Pattern Matching
– Frequent Subgraph Mining
– Memory optimization (96-bit ID, Dictionary Encoding, …)
– Tuple Implementation
Contributions to Flink
• FLINK-2411 Add basic graph summarization algorithm
• FLINK-2590 DataSetUtils.zipWithUniqueID creates duplicate Ids
• FLINK-2905 Add intersect method to Graph class
• FLINK-2910 Combine tests for binary graph operators
• FLINK-2941 Implement a neo4j - Flink/Gelly connector
• FLINK-2981 Update README for building docs
• FLINK-3064 Missing size check in GroupReduceOperatorBase leads to NPE
• FLINK-3118 Check if MessageFunction implements ResultTypeQueryable
• FLINK-3122 Generalize value type in LabelPropagation
• FLINK-3272 Generalize vertex value type in ConnectedComponents
• Flink Forward (October 2015)
• Meetup Big Data Usergroup Saxony (December 2015)
• FOSDEM (January 2016)
Contributions Welcome
• Code
– Operator implementations / improvement
– Performance Tuning
• People
– Bachelor / Master Thesis
– Open PhD positions in Leipzig, Germany
• Use Cases and (Big) Data!
Thank you!
www.gradoop.com
http://flink.apache.org
http://neo4j.com
http://ldbcouncil.org
https://github.com/s1ck/neo4j-gradoop-demos
https://github.com/s1ck/flink-neo4j
https://github.com/s1ck/ldbc-flink-import
https://github.com/s1ck/gdl

Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Berlin

  • 1. GRADOOP: Scalable Graph Analytics with Apache Flink Martin Junghanns @kc1s Apache Flink and Neo4j Meetup Berlin
  • 2. About the speaker and the team Apache Flink and Neo4j Meetup Berlin 2 André PhD Student Martin PhD Student Kevin M.Sc. Student Niklas M.Sc. Student Prof. Dr. Erhard Rahm Database Chair
  • 3. Apache Flink and Neo4j Meetup Berlin 3 Motivation
  • 4. „Graphs are everywhere“ Apache Flink and Neo4j Meetup Berlin 4 𝑮𝑟𝑎𝑝ℎ = (𝑽𝑒𝑟𝑡𝑖𝑐𝑒𝑠, 𝑬𝑑𝑔𝑒𝑠)
  • 5. „Graphs are everywhere“ Apache Flink and Neo4j Meetup Berlin 5 Alice Bob Eve Dave Carol Mallory Peggy Trent 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬, 𝐹𝑜𝑙𝑙𝑜𝑤𝑒𝑟𝑠)
  • 6. „Graphs are everywhere“ Apache Flink and Neo4j Meetup Berlin 6 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠) Alice Bob Eve Dave Carol Mallory Peggy Trent
  • 7. Alice Bob AC/DC Dave Carol Mallory Peggy Metallica „Graphs are heterogeneous“ Apache Flink and Neo4j Meetup Berlin 7 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠)
  • 8. Alice Bob AC/DC Dave Carol Mallory Peggy Metallica „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 8 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠)
  • 9. 0.2 0.28 0.26 0.33 0.25 0.26 Alice Bob AC/DC Dave Carol Mallory Peggy Metallica 3.6 2.82 „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 9 𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠)
  • 10. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 10 Assuming a social network
  • 11. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 11 Assuming a social network 1. Determine subgraph
  • 12. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 12 Assuming a social network 1. Determine subgraph
  • 13. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 13 Assuming a social network 1. Determine subgraph 2. Find communities
  • 14. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 14 Assuming a social network 1. Determine subgraph 2. Find communities
  • 15. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 15 Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities
  • 16. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 16 Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities
  • 17. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 17 Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities 4. Find common subgraph
  • 18. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 18 Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities 4. Find common subgraph
  • 19. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 19 Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  • 20. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 20 Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  • 21. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 21 Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  • 22. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 22 Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  • 23. „Graphs can be analyzed“ Apache Flink and Neo4j Meetup Berlin 23 Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm
  • 24. „And let‘s not forget …“ Apache Flink and Neo4j Meetup Berlin 24
  • 25. “…Graphs are large” Apache Flink and Neo4j Meetup Berlin 25
  • 26. „A framework and research platform for efficient, distributed and domain independent management and analytics of heterogeneous graph data.“ Apache Flink and Neo4j Meetup Berlin 26
  • 27. High Level Architecture Apache Flink and Neo4j Meetup Berlin 27
  • 28. High Level Architecture Apache Flink and Neo4j Meetup Berlin 27 HDFS/YARN Cluster
  • 29. High Level Architecture Apache Flink and Neo4j Meetup Berlin 27 HDFS/YARN Cluster Apache HBase Distributed Graph Store
  • 30. High Level Architecture Apache Flink and Neo4j Meetup Berlin 27 HDFS/YARN Cluster Apache HBase Distributed Graph Store Apache Flink Distributed Operator Execution
  • 31. High Level Architecture Apache Flink and Neo4j Meetup Berlin 27 HDFS/YARN Cluster Apache HBase Distributed Graph Store Apache Flink Operator Implementation Apache Flink Distributed Operator Execution Extended Property Graph Model (EPGM) Graph Analytical Language (GrALa)  Java 7  25K (33K) LOC  GPLv3
  • 32. Apache Flink Third-party library Apache Flink and Neo4j Meetup Berlin 28 Streaming Dataflow Runtime DataSet DataStream HadoopMR Table Gelly ML Table Zeppelin Cascading MRQL Dataflow Storm Dataflow SAMOA GRADOOP Cluster (e.g. YARN)Local Cloud (e.g. EC2) Batch Stream Data Storage (e.g. Files, HDFS, S3, JDBC, Kafka, …)
  • 33. Apache Flink and Neo4j Meetup Berlin 29 Extended Property Graph Model (EPGM)
  • 34. Extended Property Graph Model • Vertices and directed Edges Apache Flink and Neo4j Meetup Berlin 30
  • 35. Extended Property Graph Model • Vertices and directed Edges • Logical Graphs Apache Flink and Neo4j Meetup Berlin 31
  • 36. Extended Property Graph Model • Vertices and directed Edges • Logical Graphs • Identifiers Apache Flink and Neo4j Meetup Berlin 32 1 3 4 5 21 2 3 4 5 1 2
  • 37. Extended Property Graph Model • Vertices and directed Edges • Logical Graphs • Identifiers • Type Labels Apache Flink and Neo4j Meetup Berlin 33 1 3 4 5 21 2 3 4 5 Person Band Person Person Band likes likes likes knows likes 1|Community 2|Community
  • 38. Extended Property Graph Model • Vertices and directed Edges • Logical Graphs • Identifiers • Type Labels • Properties Apache Flink and Neo4j Meetup Berlin 34 1 3 4 5 21 2 3 4 5 Person name : Alice born : 1984 Band name : Metallica founded : 1981 Person name : Bob Person name : Eve Band name : AC/DC founded : 1973 likes since : 2014 likes since : 2013 likes since : 2015 knows likes since : 2014 1|Community|interest:Heavy Metal 2|Community|interest:Hard Rock
  • 39. Apache Flink and Neo4j Meetup Berlin 35 EPGM Operators
  • 40. Basic Binary Operators Apache Flink and Neo4j Meetup Berlin 36
  • 41. Basic Binary Operators Apache Flink and Neo4j Meetup Berlin 36 1 3 4 5 2 1 2
  • 42. Basic Binary Operators Apache Flink and Neo4j Meetup Berlin 36 1 3 4 5 2 1 3 4 5 2 1 2 Combination 3
  • 43. Basic Binary Operators Apache Flink and Neo4j Meetup Berlin 36 1 3 4 5 2 31 3 4 5 2 1 2 3 Combination Overlap 3
  • 44. Basic Binary Operators Apache Flink and Neo4j Meetup Berlin 36 1 3 4 5 2 3 1 2 1 3 4 5 2 1 2 3 3 Combination Overlap Exclusion 3
  • 45. Graph Aggregation Apache Flink and Neo4j Meetup Berlin 37
  • 46. Graph Aggregation Apache Flink and Neo4j Meetup Berlin 37 1 3 4 5 2 3
  • 47. Graph Aggregation Apache Flink and Neo4j Meetup Berlin 37 1 3 4 5 2 3 UDF
  • 48. Graph Aggregation Apache Flink and Neo4j Meetup Berlin 37 1 3 4 5 2 3 1 3 4 5 2 3 | vertexCount: 5 UDF
  • 49. Graph Aggregation Apache Flink and Neo4j Meetup Berlin 37 1 3 4 5 2 3 1 3 4 5 2 3 | vertexCount: 5 1 3 4 5 2 3 revenue:7000 expense:1000 expense:1000 UDF
  • 50. Graph Aggregation Apache Flink and Neo4j Meetup Berlin 37 1 3 4 5 2 3 1 3 4 5 2 3 | vertexCount: 5 1 3 4 5 2 3 revenue:7000 expense:1000 expense:1000 UDF UDF
  • 51. Graph Aggregation Apache Flink and Neo4j Meetup Berlin 37 1 3 4 5 2 3 1 3 4 5 2 3 | vertexCount: 5 1 3 4 5 2 3 revenue:7000 expense:1000 expense:1000 1 3 4 5 2 3 | profit: 5000 revenue:7000 expense:1000 expense:1000 UDF UDF
  • 52. Graph Transformation Apache Flink and Neo4j Meetup Berlin 38
  • 53. Graph Transformation Apache Flink and Neo4j Meetup Berlin 38 3 | vertexCount: 5 name:Alice f_name:Bob1 3 4 5 2
  • 54. Graph Transformation Apache Flink and Neo4j Meetup Berlin 38 UDF 3 | vertexCount: 5 name:Alice f_name:Bob1 3 4 5 2 3 | Community| vCount: 5 f_name:Alice f_name:Bob1 3 4 5 2
  • 55. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39
  • 56. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39 3 1 3 4 5 2
  • 57. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39 3 1 3 4 5 2 UDF
  • 58. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39 3 1 3 4 5 2 3 4 1 2UDF
  • 59. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39 3 1 3 4 5 2 3 4 1 2 UDF UDF
  • 60. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39 3 1 3 4 5 2 3 4 1 2 3 4 1 2UDF UDF
  • 61. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39 3 1 3 4 5 2 3 4 1 2 3 4 1 2 UDF UDF UDF
  • 62. Subgraph Extraction Apache Flink and Neo4j Meetup Berlin 39 3 1 3 4 5 2 3 4 1 2 3 4 1 2 4 3 5 2UDF UDF UDF
  • 63. Graph Pattern Matching Apache Flink and Neo4j Meetup Berlin 40
  • 64. Graph Pattern Matching Apache Flink and Neo4j Meetup Berlin 40 3 1 3 4 5 2
  • 65. Graph Pattern Matching Apache Flink and Neo4j Meetup Berlin 40 3 1 3 4 5 2 Pattern
  • 66. Graph Pattern Matching Apache Flink and Neo4j Meetup Berlin 40 3 1 3 4 5 2 Pattern 4 5 1 3 4 2
  • 67. Graph Pattern Matching Apache Flink and Neo4j Meetup Berlin 40 3 1 3 4 5 2 Pattern 4 5 1 3 4 2 Graph Collection
  • 68. Graph Grouping Apache Flink and Neo4j Meetup Berlin 41
  • 69. Graph Grouping Apache Flink and Neo4j Meetup Berlin 41 3 1 3 4 5 2
  • 70. Graph Grouping Apache Flink and Neo4j Meetup Berlin 41 Keys 3 1 3 4 5 2
  • 71. Graph Grouping Apache Flink and Neo4j Meetup Berlin 41 Keys 3 1 3 4 5 2 4 6 7
  • 72. Graph Grouping Apache Flink and Neo4j Meetup Berlin 41 Keys 3 1 3 4 5 2 4 6 7 3 a:23 a:84 a:42 a:12 1 3 4 5 2 a:13 a:21
  • 73. Graph Grouping Apache Flink and Neo4j Meetup Berlin 41 Keys 3 1 3 4 5 2 4 6 7 +Aggregate 3 a:23 a:84 a:42 a:12 1 3 4 5 2 a:13 a:21
  • 74. Graph Grouping Apache Flink and Neo4j Meetup Berlin 41 Keys 3 1 3 4 5 2 4 6 7 +Aggregate 3 a:23 a:84 a:42 a:12 1 3 4 5 2 a:13 a:21 4 count:2 count:2 max(a):42 max(a):84 max(a):13 max(a):21 6 7
  • 75. Apply (e.g. Aggregation) Apache Flink and Neo4j Meetup Berlin 42
  • 76. Apply (e.g. Aggregation) Apache Flink and Neo4j Meetup Berlin 42 1 2 3 revenue:7000 expense:1000 expense:1000 revenue:2000 revenue:4000 expense:3000 expense:1000 0 2 3 4 1 5 7 86 9 11 1210
  • 77. Apply (e.g. Aggregation) Apache Flink and Neo4j Meetup Berlin 42 Operator 1 2 3 revenue:7000 expense:1000 expense:1000 revenue:2000 revenue:4000 expense:3000 expense:1000 0 2 3 4 1 5 7 86 9 11 1210
  • 78. Apply (e.g. Aggregation) Apache Flink and Neo4j Meetup Berlin 42 Operator 1 2 3 revenue:7000 expense:1000 expense:1000 revenue:2000 revenue:4000 expense:3000 expense:1000 0 2 3 4 1 5 7 86 9 11 1210 1 | profit: 5000 2 | profit: -1000 3 | profit: 3000 revenue:7000 expense:1000 expense:1000 revenue:2000 revenue:4000 expense:3000 expense:1000 0 2 3 4 1 5 7 86 9 11 1210
  • 79. Selection Apache Flink and Neo4j Meetup Berlin 43
  • 80. Selection Apache Flink and Neo4j Meetup Berlin 43 1 | profit: 5000 2 | profit: -1000 3 | profit: 3000 revenue:7000 expense:1000 expense:1000 revenue:2000 revenue:4000 expense:3000 expense:1000 0 2 3 4 1 5 7 86 9 11 1210
  • 81. Selection Apache Flink and Neo4j Meetup Berlin 43 UDF profit > 0 1 | profit: 5000 2 | profit: -1000 3 | profit: 3000 revenue:7000 expense:1000 expense:1000 revenue:2000 revenue:4000 expense:3000 expense:1000 0 2 3 4 1 5 7 86 9 11 1210
  • 82. Selection Apache Flink and Neo4j Meetup Berlin 43 UDF profit > 0 1 | profit: 5000 2 | profit: -1000 3 | profit: 3000 revenue:7000 expense:1000 expense:1000 revenue:2000 revenue:4000 expense:3000 expense:1000 0 2 3 4 1 5 7 86 9 11 1210 1 | profit: 5000 3 | profit: 3000 revenue:7000 expense:1000 expense:1000 revenue:4000 expense:1000 0 2 3 4 1 9 11 1210
  • 83. Call (e.g. Clustering) Apache Flink and Neo4j Meetup Berlin 44
  • 84. Call (e.g. Clustering) Apache Flink and Neo4j Meetup Berlin 44 1 0 2 3 4 1 5 7 86 9 11 1210
  • 85. Call (e.g. Clustering) Apache Flink and Neo4j Meetup Berlin 44 Algorithm 1 0 2 3 4 1 5 7 86 9 11 1210
  • 86. Call (e.g. Clustering) Apache Flink and Neo4j Meetup Berlin 44 Algorithm 1 0 2 3 4 1 5 7 86 9 11 1210 2 3 4 0 2 3 4 1 5 7 86 9 11 1210
  • 87. Call (e.g. PageRank) Apache Flink and Neo4j Meetup Berlin 45
  • 88. Call (e.g. PageRank) Apache Flink and Neo4j Meetup Berlin 45 1 0 2 3 4 1 5 7 86 9 11 1210
  • 89. Call (e.g. PageRank) Apache Flink and Neo4j Meetup Berlin 45 Algorithm 1 0 2 3 4 1 5 7 86 9 11 1210
  • 90. Call (e.g. PageRank) Apache Flink and Neo4j Meetup Berlin 45 Algorithm 2 rank:0.11 rank:0.25 rank:0.11 rank:1.29 rank:1.29 rank:1.58rank:0.11rank:5.12 rank:0.11 rank:0.11 rank:0.26 rank:0.11 rank:2.47 0 2 3 4 1 5 7 86 9 11 1210 1 0 2 3 4 1 5 7 86 9 11 1210
  • 91. EPGM Operators Overview Apache Flink and Neo4j Meetup Berlin 46 Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call Combination Overlap Exclusion Equality Union Intersection Difference Flink Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply Reduce Call Adaptive Partitioning Subgraph
  • 92. EPGM Operators Overview Apache Flink and Neo4j Meetup Berlin 47 Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call Combination Overlap Exclusion Equality Union Intersection Difference Flink Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply Reduce Call Adaptive Partitioning Subgraph
  • 93. EPGM Operators Overview Apache Flink and Neo4j Meetup Berlin 48 Operators Unary Binary GraphCollectionLogicalGraph Algorithms Aggregation Pattern Matching Transformation Grouping Equality Call Combination Overlap Exclusion Equality Union Intersection Difference Flink Gelly Library BTG Extraction Frequent Subgraphs Limit Selection Distinct Sort Apply Reduce Call Adaptive Partitioning Subgraph
  • 94. Apache Flink and Neo4j Meetup Berlin 49 EPGM on Apache Flink
  • 95. Flink DataSet API Apache Flink and Neo4j Meetup Berlin 50
  • 96. Flink DataSet API Apache Flink and Neo4j Meetup Berlin 50 • DataSet := Distributed Collection of Data Objects DataSet DataSet DataSet
  • 97. Flink DataSet API Apache Flink and Neo4j Meetup Berlin 50 • DataSet := Distributed Collection of Data Objects • Transformation := Operation on DataSets DataSet DataSet DataSet Transformation Transformation DataSet DataSet
  • 98. Flink DataSet API Apache Flink and Neo4j Meetup Berlin 50 • DataSet := Distributed Collection of Data Objects • Transformation := Operation on DataSets • Flink Programm := Composition of Transformations DataSet DataSet DataSet Transformation Transformation DataSet DataSet Transformation DataSet Flink Program
  • 99. Flink DataSet API
      • DataSet := distributed collection of data objects
      • Transformation := operation on DataSets
      • Flink program := composition of transformations
  • 108. Graph Representation
      Each element type is a POJO held in its own DataSet:
        EPGMGraphHead (Id, Label, Properties) in DataSet<EPGMGraphHead>
        EPGMVertex (Id, Label, Properties, Graphs) in DataSet<EPGMVertex>
        EPGMEdge (Id, Label, Properties, SourceId, TargetId, Graphs) in DataSet<EPGMEdge>
      Field types:
        Id: GradoopId := UUID (128-bit)
        Label: String
        Properties: PropertyList := List<Property>, Property := (String, PropertyValue), PropertyValue := byte[]
        Graphs: GradoopIdSet := Set<GradoopId>
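As a plain-Java illustration of the layout on this slide, the three element types could be sketched roughly as below. The class names (PropertySketch, ElementSketch, GraphHeadSketch, VertexSketch, EdgeSketch) are hypothetical and are not Gradoop's own classes. Random UUIDs, UTF-8 encoded property values, and UUID-typed source and target ids are assumptions, since the slide only names the fields and their types.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Property := (String, PropertyValue), with PropertyValue := byte[]. */
final class PropertySketch {
  final String key;
  final byte[] value;

  PropertySketch(String key, String value) {
    this.key = key;
    // Assumption: values are stored as UTF-8 bytes; the slide only says byte[].
    this.value = value.getBytes(StandardCharsets.UTF_8);
  }
}

/** Fields shared by graph heads, vertices and edges: Id, Label, Properties. */
class ElementSketch {
  final UUID id;                                             // GradoopId := 128-bit UUID
  final String label;                                        // Label := String
  final List<PropertySketch> properties = new ArrayList<>(); // PropertyList

  ElementSketch(String label) {
    this.id = UUID.randomUUID(); // assumption: ids are generated randomly
    this.label = label;
  }
}

/** A graph head carries only Id, Label and Properties. */
final class GraphHeadSketch extends ElementSketch {
  GraphHeadSketch(String label) {
    super(label);
  }
}

/** A vertex additionally records the ids of the graphs it belongs to. */
final class VertexSketch extends ElementSketch {
  final Set<UUID> graphs = new HashSet<>(); // GradoopIdSet

  VertexSketch(String label) {
    super(label);
  }
}

/** An edge adds source and target vertex ids to the vertex layout. */
final class EdgeSketch extends ElementSketch {
  final UUID sourceId;
  final UUID targetId;
  final Set<UUID> graphs = new HashSet<>(); // GradoopIdSet

  EdgeSketch(String label, UUID sourceId, UUID targetId) {
    super(label);
    this.sourceId = sourceId;
    this.targetId = targetId;
  }
}
```

Because membership is stored on the element as a set of graph ids, one vertex or edge can belong to several logical graphs without being copied.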
  • 113. Graph Representation
      DataSet<EPGMGraphHead>
        Id  Label      Properties
        1   Community  {interest:Heavy Metal}
        2   Community  {interest:Hard Rock}
      DataSet<EPGMVertex>
        Id  Label   Properties                      Graphs
        1   Person  {name:Alice, born:1984}         {1}
        2   Band    {name:Metallica, founded:1981}  {1}
        3   Person  {name:Bob}                      {1,2}
        4   Band    {name:AC/DC, founded:1973}      {2}
        5   Person  {name:Eve}                      {2}
      DataSet<EPGMEdge>
        Id  Label  Source  Target  Properties    Graphs
        1   likes  1       2       {since:2014}  {1}
        2   likes  3       2       {since:2013}  {1}
        3   likes  3       4       {since:2015}  {2}
        4   knows  3       5       {}            {2}
        5   likes  5       4       {since:2014}  {2}
  • 116. Flink DataSet Transformations
      SQL-like transformations: filter, project, cross, union, distinct, first-N (limit), groupBy, aggregate, join, leftOuterJoin, rightOuterJoin, fullOuterJoin
      Hadoop-like transformations: map, flatMap, mapPartition, reduce, reduceGroup, coGroup
      Special Flink operations: iterate, iterateDelta
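The DataSet API needs the Flink libraries, so the following is only a local analogy: a java.util.stream chain composes filter, map, grouping and aggregation on an in-memory list in a similar style. The class name, the method name and the "Label:name" input encoding are made up for illustration, and nothing here is distributed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class TransformationChainSketch {

  /** Counts entries per label; each entry is encoded as "Label:name". */
  static Map<String, Long> countByLabel(List<String> entries) {
    return entries.stream()
        // filter: keep only well-formed entries
        .filter(entry -> entry.contains(":"))
        // map: project each entry to its label
        .map(entry -> entry.substring(0, entry.indexOf(':')))
        // groupBy + aggregate: count the entries of each label
        .collect(Collectors.groupingBy(label -> label, TreeMap::new, Collectors.counting()));
  }
}
```

Each step takes a collection and yields a new one, which is the sense in which the slides call a Flink program a composition of transformations.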
  • 119. Operator Implementation – Exclusion
      // input: firstGraph (G[1]), secondGraph (G[2])
      // id of the graph whose elements are excluded
      DataSet<GradoopId> graphId = secondGraph.getGraphHead()
        .map(new Id<G>());

      // keep the vertices of the first graph that are not in the second graph
      DataSet<V> newVertices = firstGraph.getVertices()
        .filter(new NotInGraphBroadCast<V>())
        .withBroadcastSet(graphId, GRAPH_ID);

      // keep the edges that are not in the second graph and whose
      // source and target vertices are both still present
      DataSet<E> newEdges = firstGraph.getEdges()
        .filter(new NotInGraphBroadCast<E>())
        .withBroadcastSet(graphId, GRAPH_ID)
        .join(newVertices)
        .where(new SourceId<E>()).equalTo(new Id<V>())
        .with(new LeftSide<E, V>())
        .join(newVertices)
        .where(new TargetId<E>()).equalTo(new Id<V>())
        .with(new LeftSide<E, V>());
  • 128. Operator Implementation – Exclusion
      graphId = secondGraph.getGraphHead()
        Id  Label      Properties
        2   Community  {interest:Hard Rock}
      .map(new Id<G>());
        Id
        2
      newVertices = firstGraph.getVertices()
        Id  Label   Properties                     Graphs
        1   Person  {name:Alice}                   {1}
        2   Band    {name:Metallica,founded:1981}  {1}
        3   Person  {name:Bob}                     {1,2}
      .filter(new NotInGraphBroadCast<V>())
      .withBroadcastSet(graphId, GRAPH_ID);
        Id  Label   Properties                     Graphs
        1   Person  {name:Alice}                   {1}
        2   Band    {name:Metallica,founded:1981}  {1}
  • 141. Operator Implementation – Exclusion
      newEdges = firstGraph.getEdges()
        Id  Label  Source  Target  Properties    Graphs
        1   likes  1       2       {since:2014}  {1}
        2   likes  3       2       {since:2013}  {1}
      .filter(new NotInGraphBroadCast<E>())
      .withBroadcastSet(graphId, GRAPH_ID)
        Id  Label  Source  Target  Properties    Graphs
        1   likes  1       2       {since:2014}  {1}
        2   likes  3       2       {since:2013}  {1}
      .join(newVertices)
      .where(new SourceId<E>()).equalTo(new Id<V>())
        Id  Label  Source  Target  …  Id  Label   …
        1   likes  1       2       …  1   Person  …
      .with(new LeftSide<E, V>())
        Id  Label  Source  Target  …
        1   likes  1       2       …
      .join(newVertices)
      .where(new TargetId<E>()).equalTo(new Id<V>())
        Id  Label  Source  Target  …  Id  Label  …
        1   likes  1       2       …  2   Band   …
      .with(new LeftSide<E, V>());
        Id  Label  Source  Target  …
        1   likes  1       2       …
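The walkthrough above can be reproduced on in-memory collections. The sketch below is not the Gradoop implementation: the class ExclusionSketch and its nested types are hypothetical, integer ids replace GradoopId, and a set lookup stands in for the two joins against newVertices.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class ExclusionSketch {

  /** Minimal vertex: an id plus the ids of the graphs that contain it. */
  static final class Vertex {
    final int id;
    final Set<Integer> graphs;

    Vertex(int id, Integer... graphs) {
      this.id = id;
      this.graphs = new HashSet<>(Arrays.asList(graphs));
    }
  }

  /** Minimal edge: id, source and target vertex ids, and containing graphs. */
  static final class Edge {
    final int id;
    final int source;
    final int target;
    final Set<Integer> graphs;

    Edge(int id, int source, int target, Integer... graphs) {
      this.id = id;
      this.source = source;
      this.target = target;
      this.graphs = new HashSet<>(Arrays.asList(graphs));
    }
  }

  /** Vertices of the first graph that are not contained in the excluded graph. */
  static List<Vertex> excludeVertices(List<Vertex> firstGraphVertices, int excludedGraphId) {
    return firstGraphVertices.stream()
        .filter(v -> !v.graphs.contains(excludedGraphId))
        .collect(Collectors.toList());
  }

  /** Edges outside the excluded graph whose source and target both survived. */
  static List<Edge> excludeEdges(List<Edge> firstGraphEdges, List<Vertex> newVertices,
                                 int excludedGraphId) {
    Set<Integer> keptIds = newVertices.stream().map(v -> v.id).collect(Collectors.toSet());
    return firstGraphEdges.stream()
        .filter(e -> !e.graphs.contains(excludedGraphId))
        .filter(e -> keptIds.contains(e.source)) // stands in for the join on SourceId
        .filter(e -> keptIds.contains(e.target)) // stands in for the join on TargetId
        .collect(Collectors.toList());
  }
}
```

With the slide's data (vertices 1, 2 and 3, edges 1 and 2 of the first graph, and graph 2 excluded), vertex 3 is dropped because it also belongs to graph 2, and edge 2 is dropped because its source is vertex 3.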
  • 144. GrALa API
      class LogicalGraph<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
        fromCollections(...) : LogicalGraph<G, V, E>
        fromDataSets(...) : LogicalGraph<G, V, E>
        fromGellyGraph(...) : LogicalGraph<G, V, E>
        getGraphHead() : DataSet<G>
        getVertices() : DataSet<V>
        getEdges() : DataSet<E>
        aggregate(...) : LogicalGraph<G, V, E>
        match(...) : GraphCollection<G, V, E>
        groupBy(...) : LogicalGraph<G, V, E>
        subgraph(...) : LogicalGraph<G, V, E>
        combine(...) : LogicalGraph<G, V, E>
        // ...
      }
      class GraphCollection<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
        fromCollections(...) : GraphCollection<G, V, E>
        fromDataSets(...) : GraphCollection<G, V, E>
        getGraphHeads() : DataSet<G>
        getVertices() : DataSet<V>
        getEdges() : DataSet<E>
        select(...) : GraphCollection<G, V, E>
        distinct() : GraphCollection<G, V, E>
        sortBy(...) : GraphCollection<G, V, E>
        union(...) : GraphCollection<G, V, E>
        difference(...) : GraphCollection<G, V, E>
        // ...
      }
  • 145. GrALa API
      class EPGMDatabase<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
        fromCollections(...) : EPGMDatabase<G, V, E>
        fromDataSets(...) : EPGMDatabase<G, V, E>
        fromHBase(...) : EPGMDatabase<G, V, E>
        fromJSON(...) : EPGMDatabase<G, V, E>
        fromExternalGraph(...) : EPGMDatabase<G, V, E>
        writeAsJSON(...) : void
        writeToHBase(...) : void
        getDatabaseGraph() : LogicalGraph<G, V, E>
        getGraphById(...) : LogicalGraph<G, V, E>
        getGraphsById(...) : GraphCollection<G, V, E>
        // ...
      }
  • 147. Performance
  • 157. Social Network Benchmark (http://www.ldbcouncil.org/)
      1. Extract subgraph containing only Persons and knows relations
      2. Transform Persons to necessary information
      3. Find communities using Label Propagation
      4. Aggregate vertex count for each community
      5. Select communities with more than 50K users
      6. Combine large communities to a single graph
      7. Group graph by Persons' location and gender
      8. Aggregate vertex and edge count of grouped graph
  • 158. Social Network Benchmark (the same eight steps): https://git.io/vgozj
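Steps 4 to 6 of the workflow can be illustrated on plain collections. This is a sketch under stated assumptions, not the benchmark code: the class CommunityWorkflowSketch is hypothetical, the input is assumed to be a map from vertex id to the community id that label propagation assigned, and "combine" is reduced to collecting the vertex ids of the selected communities.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class CommunityWorkflowSketch {

  /** Step 4: count the vertices of each community. */
  static Map<Long, Long> vertexCountPerCommunity(Map<Long, Long> communityOfVertex) {
    return communityOfVertex.values().stream()
        .collect(Collectors.groupingBy(community -> community, Collectors.counting()));
  }

  /** Step 5: keep the communities with more than minSize vertices. */
  static Set<Long> selectLargeCommunities(Map<Long, Long> counts, long minSize) {
    return counts.entrySet().stream()
        .filter(entry -> entry.getValue() > minSize)
        .map(Map.Entry::getKey)
        .collect(Collectors.toCollection(TreeSet::new));
  }

  /** Step 6: combine the selected communities into a single vertex set. */
  static Set<Long> combine(Map<Long, Long> communityOfVertex, Set<Long> selected) {
    return communityOfVertex.entrySet().stream()
        .filter(entry -> selected.contains(entry.getValue()))
        .map(Map.Entry::getKey)
        .collect(Collectors.toCollection(TreeSet::new));
  }
}
```

In the benchmark the threshold in step 5 is 50K users; the sketch takes it as a parameter so that it can be tried on small inputs.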
  • 159. Social Network Benchmark
      Dataset             # Vertices   # Edges         Disk size
      Graphalytics.1      61,613       2,026,082       570 MB
      Graphalytics.10     260,613      16,600,778      4.5 GB
      Graphalytics.100    1,695,613    147,437,275     40.2 GB
      Graphalytics.1000   12,775,613   1,363,747,260   372 GB
      Graphalytics.10000  90,025,613   10,872,109,028  2.9 TB
      Cluster and configuration:
      • 16x Intel(R) Xeon(R) 2.50GHz 6 (12)
      • 16x 48 GB RAM
      • 1 Gigabit Ethernet
      • Hadoop 2.6.0
      • Flink 1.0-SNAPSHOT
      • slots (per worker) 12
      • jobmanager.heap.mb 2048
      • taskmanager.heap.mb 40960
  • 160. Social Network Benchmark – Runtime
      [Plot: runtime in seconds against number of workers (1, 2, 4, 8, 16) for Graphalytics.100]
  • 161. Social Network Benchmark – Speedup
      [Plot: speedup against number of workers (1, 2, 4, 8, 16) for Graphalytics.100, compared with linear speedup]
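The slide does not define speedup. Assuming the usual definition, with T(n) the runtime measured on n workers, the plotted quantity and its linear reference would be:

```latex
S(n) = \frac{T(1)}{T(n)}, \qquad S_{\text{linear}}(n) = n
```

Under that assumption, a curve that stays below the linear reference means that doubling the workers less than halves the runtime.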
  • 162. Social Network Benchmark – Datasets
      [Plot: runtime in seconds across the datasets; tick labels 1, 10, 100, 1000, 10000]
  • 163. Demo: https://github.com/s1ck/neo4j-gradoop-demos
  • 164. Current State and Future Work
  • 165. Current State – Operator Implementations
      Operators on a LogicalGraph
        Unary: Aggregation, Pattern Matching, Transformation, Grouping, Subgraph, Call
        Binary: Combination, Overlap, Exclusion, Equality
      Operators on a GraphCollection
        Unary: Limit, Selection, Distinct, Sort, Apply, Reduce, Call
        Binary: Union, Intersection, Difference, Equality
      Algorithms: Flink Gelly Library, BTG Extraction, Frequent Subgraphs, Adaptive Partitioning
  • 166. Release History
      • 0.0.1 First Prototype (May 2015)
          – Hadoop MapReduce and Giraph for operator implementations
          – Too much complexity
          – Performance loss through serialization in HDFS/HBase
      • 0.0.2 Using Flink as execution layer (June 2015)
          – Basic operators
      • 0.1 (December 2015)
          – System-side identifiers (UUID)
          – Improved property handling
          – More operator implementations (e.g., Equality, Bool operators)
          – Code refactoring
      • 0.2-SNAPSHOT
          – Graph Pattern Matching
          – Frequent Subgraph Mining
          – Memory optimization (96-bit ID, Dictionary Encoding, …)
          – Tuple Implementation
  • 167. Contributions to Flink
      • FLINK-2411 Add basic graph summarization algorithm
      • FLINK-2590 DataSetUtils.zipWithUniqueID creates duplicate Ids
      • FLINK-2905 Add intersect method to Graph class
      • FLINK-2910 Combine tests for binary graph operators
      • FLINK-2941 Implement a neo4j - Flink/Gelly connector
      • FLINK-2981 Update README for building docs
      • FLINK-3064 Missing size check in GroupReduceOperatorBase leads to NPE
      • FLINK-3118 Check if MessageFunction implements ResultTypeQueryable
      • FLINK-3122 Generalize value type in LabelPropagation
      • FLINK-3272 Generalize vertex value type in ConnectedComponents
      • Flink Forward (October 2015)
      • Meetup Big Data Usergroup Saxony (December 2015)
      • FOSDEM (January 2016)
  • 168. Contributions Welcome
      • Code
          – Operator implementations / improvement
          – Performance Tuning
      • People
          – Bachelor / Master Thesis
          – Open PhD positions in Leipzig, Germany
      • Use Cases and (Big) Data!
  • 169. Thank you!
      • www.gradoop.com
      • http://flink.apache.org
      • http://neo4j.com
      • http://ldbcouncil.org
      • https://github.com/s1ck/neo4j-gradoop-demos
      • https://github.com/s1ck/flink-neo4j
      • https://github.com/s1ck/ldbc-flink-import
      • https://github.com/s1ck/gdl