Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes on Flink

Efficient Distributed R Dataframes on Apache Flink
Andreas Kunft, Jens Meiners, Tilmann Rabl, Volker Markl

• R has gained huge traction
• Open source
• Rich support for analytics & statistics
• But standalone R is not well suited for out-of-core workloads
• Multiple extensions for distributed execution
• Hadoop + R
• Spark + R
• SystemML

Our Goals
1. Provide an API with a natural feel
• df$km <- df$miles * 1.6
• df <- select(df, f = df$flights, df$distance)
• df <- apply(df, key = id, aggFunc)
2. Achieve performance comparable to the native dataflow system

General Approach
• R dataframe(T1,T2,…,TN) as DataSet<TupleN<T1,T2,…,TN>>
• Create execution plan
• Map R dataframe functions to the native API whenever possible
e.g., select to projections
• Call user defined R functions within the worker nodes
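The mapping of `select` to a projection can be sketched in plain Java without any Flink dependency. The `Frame` class, its `select` method, and the use of `Object[]` as a stand-in for `TupleN` are assumptions made for illustration; they are not the authors' implementation or the Flink API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a distributed dataframe: a list of column names
// plus rows stored as positional tuples (Object[] plays the role of TupleN).
class Frame {
    final List<String> columns;
    final List<Object[]> rows;

    Frame(List<String> columns, List<Object[]> rows) {
        this.columns = columns;
        this.rows = rows;
    }

    // select(df, "a", "b") becomes a projection onto the matching tuple positions.
    Frame select(String... names) {
        int[] positions = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            positions[i] = columns.indexOf(names[i]);
            if (positions[i] < 0) {
                throw new IllegalArgumentException("Unknown column: " + names[i]);
            }
        }
        List<Object[]> projected = new ArrayList<>();
        for (Object[] row : rows) {
            Object[] out = new Object[positions.length];
            for (int i = 0; i < positions.length; i++) {
                out[i] = row[positions[i]];
            }
            projected.add(out);
        }
        return new Frame(Arrays.asList(names), projected);
    }

    public static void main(String[] args) {
        Frame df = new Frame(
            Arrays.asList("flights", "distance", "carrier"),
            Arrays.asList(
                new Object[] {12, 840.0, "AA"},
                new Object[] {7, 1210.5, "LH"}));
        Frame out = df.select("flights", "distance");
        System.out.println(out.columns + " " + Arrays.toString(out.rows.get(0)));
    }
}
```

Because the column names are resolved to positions once, before any row is touched, the per-row work is a plain array copy, which is the property that makes a native projection cheaper than calling into R.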

Handling user defined R functions

Inter Process Communication
[Figure: Client, Job Manager, and two Task Managers running four tasks in total, alongside four external R processes]

Inter Process Communication
1. Communication + Serialization
2. Java and R compete for memory
[Figure: a filter task in the Task Manager calling an external R process]
filter <- function(df) {
  df$language == 'R'
}
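The communication and serialization cost can be made concrete: every row handed to an external process has to be encoded to bytes and decoded again on the other side. The following is a minimal sketch of that round trip only. The `RowCodec` class and the `(id, language)` row layout are hypothetical, and no actual R process is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical codec for a two-field row; a real setup would pay this cost
// for every record sent to and received from the external process.
class RowCodec {
    // Encode an (id, language) row into the bytes that would cross the process boundary.
    static byte[] encode(int id, String language) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(id);
            out.writeUTF(language);
        }
        return buffer.toByteArray();
    }

    // Decode the bytes back into the row fields on the receiving side.
    static Object[] decode(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int id = in.readInt();
            String language = in.readUTF();
            return new Object[] {id, language};
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = encode(1, "R");
        Object[] row = decode(wire);
        System.out.println(wire.length + " bytes -> " + row[0] + ", " + row[1]);
    }
}
```

During the round trip the data exists both as objects and as a byte buffer, which hints at why two runtimes holding copies of the same rows end up competing for memory.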

Source-to-Source Translation
• Translate a restricted set of operations to the native dataflow API
• Operations are executed natively
df <- filter(
  df,
  df$language == 'R'
)
→ val df = df.filter($"language" === "R")
df$km <- df$miles * 1.6
→ val df = df.withColumn("km", $"miles" * 1.6)
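One way to picture source-to-source translation is a small translator that accepts only a restricted expression tree and turns it into a native function. The sketch below is an assumption-laden illustration: `ExprTranslator`, `Col`, `Lit`, and `BinOp` are hypothetical names, rows are modeled as `Map<String, Object>`, and only `==` and `*` are supported.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical mini-translator: a restricted R-like expression tree is compiled
// into a plain Java function over a row, so no R interpreter runs per record.
class ExprTranslator {
    interface Expr {}

    static final class Col implements Expr {
        final String name;
        Col(String name) { this.name = name; }
    }

    static final class Lit implements Expr {
        final Object value;
        Lit(Object value) { this.value = value; }
    }

    static final class BinOp implements Expr {
        final String op;
        final Expr left;
        final Expr right;
        BinOp(String op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
    }

    // Translate only the supported subset; everything else is rejected.
    static Function<Map<String, Object>, Object> translate(Expr e) {
        if (e instanceof Col) {
            String name = ((Col) e).name;
            return row -> row.get(name);
        }
        if (e instanceof Lit) {
            Object value = ((Lit) e).value;
            return row -> value;
        }
        if (e instanceof BinOp) {
            BinOp b = (BinOp) e;
            Function<Map<String, Object>, Object> l = translate(b.left);
            Function<Map<String, Object>, Object> r = translate(b.right);
            switch (b.op) {
                case "==":
                    return row -> l.apply(row).equals(r.apply(row));
                case "*":
                    return row -> ((Number) l.apply(row)).doubleValue()
                            * ((Number) r.apply(row)).doubleValue();
                default:
                    throw new UnsupportedOperationException("Operator not translated: " + b.op);
            }
        }
        throw new UnsupportedOperationException("Expression not translated");
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("language", "R");
        row.put("miles", 10.0);
        Expr filter = new BinOp("==", new Col("language"), new Lit("R"));
        Expr km = new BinOp("*", new Col("miles"), new Lit(1.6));
        System.out.println(translate(filter).apply(row));
        System.out.println(translate(km).apply(row));
    }
}
```

The `UnsupportedOperationException` branches mark the limit of this approach: anything outside the supported subset cannot be expressed natively and needs another execution path.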

Truffle/Graal
[Figure, built up over four slides: the HotSpot Runtime (Interpreter, GC, …) executes bytecode through its JIT; Graal takes the place of the JIT; Truffle sits on top of Graal, together forming GraalVM. Source code (*.java, *.R, *.js) enters through javac, TruffleR (fastR), and TruffleJS; the Truffle languages run as AST interpreters.]
Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices. Vol. 51. No. 2. ACM, 2015.

Flink + fastR
fastR: R implementation on top of Truffle/Graal
• Allows us to execute R code in the same VM as Flink
• Infer result types of R functions
• Access Java (Flink) data types in R

Client:
1. Dataframe rows to Flink tuples
2. Determine return types of UDFs
3. Create execution plan
[Figure: Client, Job Manager, and two Task Managers, each running two map tasks]
flink.init(SERVER, PORT)
flink.parallelism(DOP)
df <- flink.readdf(SOURCE,
  list("id", "body", …),
  list(character, character, …)
)
df$wordcount <- length(strsplit(df$body, " ")[[1]])
flink.writeAsText(df, SINK)
flink.execute()

df$wordcount <- length(strsplit(df$body, " ")[[1]])
1. Dataframe proxy keeps track of columns, provides efficient access
2. Can be extended with new columns
3. Rewrite to directly use Flink tuples
function(tuple) {
  .fun <- function(tuple) { length(strsplit(tuple[[2]], " ")[[1]]) }
  flink.tuple(tuple[[1]], tuple[[2]], .fun(tuple))
}
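The three steps on this slide (track columns, extend with a new column, rewrite to positional tuple access) can be sketched as follows. `FrameProxy`, `position`, and `withColumn` are hypothetical names chosen for this illustration, and `Object[]` again stands in for a Flink tuple; the actual proxy in the presented system may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical dataframe proxy: it only tracks column names and their tuple
// positions, so column accesses can be rewritten to positional tuple accesses.
class FrameProxy {
    private final List<String> columns;

    FrameProxy(List<String> columns) {
        this.columns = new ArrayList<>(columns);
    }

    // A column reference such as df$body is resolved to a tuple position.
    int position(String name) {
        int idx = columns.indexOf(name);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return idx;
    }

    // Assigning a new column becomes a map function that copies the input
    // tuple and appends the UDF result; the proxy records the new column.
    Function<Object[], Object[]> withColumn(String newColumn, Function<Object[], Object> udf) {
        columns.add(newColumn);
        return tuple -> {
            Object[] out = Arrays.copyOf(tuple, tuple.length + 1);
            out[tuple.length] = udf.apply(tuple);
            return out;
        };
    }

    List<String> columns() {
        return new ArrayList<>(columns);
    }

    public static void main(String[] args) {
        FrameProxy df = new FrameProxy(Arrays.asList("id", "body"));
        int body = df.position("body");
        Function<Object[], Object[]> addWordCount =
            df.withColumn("wordcount", t -> ((String) t[body]).split(" ").length);
        Object[] result = addWordCount.apply(new Object[] {"a1", "efficient R on Flink"});
        System.out.println(df.columns() + " " + Arrays.toString(result));
    }
}
```

The column name is resolved to a position once, on the client side, so the function shipped to the workers touches only tuple positions, mirroring the `tuple[[2]]` access in the rewritten R function.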

• Task Manager:
Evaluate R UDF & Execute
[Figure: Client, Job Manager, and two Task Managers, each running two map tasks]
map { tuple =>
  executeRFunction(func, tuple)
}

fastR + Flink
• R dataframe abstraction for distributed computation
• Performance gains even on single node (local mode)
• Approaches native performance even for R UDFs
• Interesting opportunities for:
• Streaming
• Other dynamic languages
• Dynamic Re-optimization
Thank you for your attention!
