Efficient Distributed R Dataframes on Apache Flink
Andreas Kunft, Jens Meiners, Tilmann Rabl, Volker Markl
• R has gained huge traction
• Open source
• Rich support for analytics & statistics
• But standalone R is not well suited to out-of-core data loads
• Multiple extensions for distributed execution
  • Hadoop + R
  • Spark + R
  • SystemML
Our Goals
1. Provide an API with a natural feel
• df$km <- df$miles * 1.6
• df <- select(df, f = df$flights, df$distance)
• df <- apply(df, key = id, aggFunc)
2. Achieve performance comparable to the native dataflow system
General Approach
• R dataframe(T1,T2,…,TN) as DataSet<TupleN<T1,T2,…,TN>>
• Create execution plan
• Map R dataframe functions to the native API whenever possible, e.g., select to projections
• Call user-defined R functions within the worker nodes
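The mapping from dataframe rows to tuples, and from select to a projection, can be illustrated in plain R. This is only a local sketch under stated assumptions: the helper names `to_tuples` and `project` are hypothetical and are not part of the API shown in the talk, and ordinary R lists stand in for Flink tuples.

```r
# Toy dataframe standing in for a distributed one.
df <- data.frame(id = c("a", "b"), miles = c(10, 20), stringsAsFactors = FALSE)

# Each row becomes a positional tuple (an unnamed list); the column names
# are kept separately, much like a schema. (Hypothetical helper.)
to_tuples <- function(df) {
  lapply(seq_len(nrow(df)), function(i) unname(as.list(df[i, , drop = FALSE])))
}

# select() on column names turns into a projection on tuple positions.
# (Hypothetical helper.)
project <- function(tuples, columns, selected) {
  idx <- match(selected, columns)
  stopifnot(!anyNA(idx))
  lapply(tuples, function(tuple) tuple[idx])
}

schema <- names(df)
tuples <- to_tuples(df)
projected <- project(tuples, schema, "miles")
```

Keeping the schema apart from the data is what lets a name-based operation such as select be resolved to positions once, on the client, instead of once per row.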
Handling user-defined R functions
Inter Process Communication
[Figure: Client, Job Manager, and two Task Managers, each running two tasks; every task is paired with an external R process.]
Inter Process Communication
• Communication + serialization overhead
• Java and R compete for memory
[Figure: a filter task in the Task Manager exchanges data with an external R process, which evaluates the function below.]
filter <- function(df) {
  df$language == 'R'
}
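The cost of the inter-process approach can be made concrete with a small round trip. The sketch below uses R's own `serialize` and `unserialize` as a stand-in for whatever wire format a Java-to-R bridge would use; the slides do not specify that format, so the byte layout here is an assumption and only the shape of the exchange is meant to carry over.

```r
# Data held on the "Task Manager" side.
df <- data.frame(id = 1:3, language = c("R", "Java", "R"),
                 stringsAsFactors = FALSE)

# The UDF that should run in the R process.
filter_fn <- function(df) df$language == "R"

# Step 1: the input is turned into bytes and shipped to the R process.
payload <- serialize(df, connection = NULL)

# Inside the R process: decode, evaluate the UDF, encode the result.
received <- unserialize(payload)
reply <- serialize(filter_fn(received), connection = NULL)

# Step 2: the result is shipped back and decoded again.
mask <- unserialize(reply)
kept <- df[mask, ]

# Both copies of the input exist at the same time, once per process.
bytes_moved <- length(payload) + length(reply)
```

Every record crosses the process boundary twice and is held in memory on both sides, which is the overhead the two bullets on this slide refer to.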
Source-to-Source Translation
• Translate a restricted set of operations to the native dataflow API
• Operations are executed natively
R:
df <- filter(
  df,
  df$language == 'R'
)
Translated (Scala):
val filtered = df.filter($"language" === "R")
R:
df$km <- df$miles * 1.6
Translated (Scala):
val withKm = df.withColumn("km", $"miles" * 1.6)
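A minimal sketch of such a translator, written in R, shows why only a restricted set of operations qualifies. The function `translate_filter` and its table of supported operators are hypothetical; the sketch handles only comparisons of the form `df$column <op> literal` and returns NULL for anything else, which is where a real system would have to fall back to running R code.

```r
# Hypothetical translator: quoted R predicate -> native filter expression.
translate_filter <- function(expr) {
  if (!is.call(expr) || !is.name(expr[[1]])) return(NULL)
  op <- as.character(expr[[1]])
  supported <- c("==" = "===", "<" = "<", ">" = ">")
  if (!op %in% names(supported)) return(NULL)

  lhs <- expr[[2]]
  rhs <- expr[[3]]
  # The left side must be a plain column access, the right side a literal.
  if (!(is.call(lhs) && identical(lhs[[1]], as.name("$")))) return(NULL)
  if (!(is.character(rhs) || is.numeric(rhs))) return(NULL)

  column <- as.character(lhs[[3]])
  literal <- if (is.character(rhs)) sprintf('"%s"', rhs) else format(rhs)
  sprintf('df.filter($"%s" %s %s)', column, supported[[op]], literal)
}

native <- translate_filter(quote(df$language == "R"))
fallback <- translate_filter(quote(nchar(df$body) > 3))
```

The second call returns NULL because `nchar` has no entry in the translation table: arbitrary R functions cannot be expressed this way.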
Flink + fastR
Truffle/Graal
[Figure, built up over four slides. Labels: HotSpot Runtime (Graal, Interpreter, GC, …); Truffle; TruffleR (fastR), TruffleJS, javac; source code *.R, *.js, *.java; AST Interpreter; GraalVM.]
Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices. Vol. 51. No. 2. ACM, 2015.
Flink + fastR
fastR: R implementation on top of Truffle/Graal
• Allows us to execute R code in the same VM as Flink
• Infer result types of R functions
• Access Java (Flink) data types in R
Client:
1. Dataframe rows to Flink tuples
2. Determine return types of UDFs
3. Create execution plan
[Figure: Client, Job Manager, and two Task Managers, each running two map tasks.]
flink.init(SERVER, PORT)
flink.parallelism(DOP)
df <- flink.readdf(SOURCE,
  list("id", "body", …),
  list(character, character, …)
)
df$wordcount <- length(strsplit(df$body, " ")[[1]])
flink.writeAsText(df, SINK)
flink.execute()
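Step 2 of the client-side work, determining the return types of UDFs, is needed because the execution plan is typed. The slides do not say how the types are inferred; one simple possibility, shown here purely as an assumption, is to evaluate the UDF on a sample value and map the R result type to a JVM type name. The helper `infer_flink_type` and the type names it returns are hypothetical.

```r
# The word-count UDF from the example program, applied to one body string.
wordcount <- function(body) length(strsplit(body, " ")[[1]])

# Hypothetical helper: run the UDF once and map R's type to a JVM type name.
infer_flink_type <- function(udf, sample_value) {
  result <- udf(sample_value)
  switch(typeof(result),
    integer = "Integer",
    double = "Double",
    character = "String",
    logical = "Boolean",
    stop("unsupported result type: ", typeof(result))
  )
}

wordcount_type <- infer_flink_type(wordcount, "efficient distributed R dataframes")
```

Sampling is only a heuristic: an R function may return different types for different inputs, so a production system would need a stricter rule or a runtime check.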
df$wordcount <- length(strsplit(df$body, " ")[[1]])
is rewritten to:
function(tuple) {
  .fun <- function(tuple) { length(strsplit(tuple[[2]], " ")[[1]]) }
  flink.tuple(tuple[[1]], tuple[[2]], .fun(tuple))
}
1. Dataframe proxy keeps track of columns and provides efficient access
2. The dataframe can be extended with new columns
3. Column accesses are rewritten to use Flink tuples directly
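The rewrite from column names to tuple positions can be sketched as a walk over the R expression tree. This is an illustration under assumptions, not the authors' implementation: `rewrite_columns` is a hypothetical helper, the column list plays the role of the dataframe proxy, and a plain R list stands in for a Flink tuple.

```r
# Columns tracked by the (hypothetical) dataframe proxy, in tuple order.
columns <- c("id", "body")

# Replace every df$<column> access with tuple[[<position>]].
rewrite_columns <- function(expr, columns, df_name = "df", tuple_name = "tuple") {
  if (!is.call(expr)) return(expr)
  if (identical(expr[[1]], as.name("$")) &&
      identical(expr[[2]], as.name(df_name))) {
    idx <- match(as.character(expr[[3]]), columns)
    if (is.na(idx)) stop("unknown column: ", as.character(expr[[3]]))
    return(call("[[", as.name(tuple_name), as.numeric(idx)))
  }
  as.call(lapply(as.list(expr), rewrite_columns, columns = columns,
                 df_name = df_name, tuple_name = tuple_name))
}

rewritten <- rewrite_columns(quote(length(strsplit(df$body, " ")[[1]])), columns)

# Wrap the rewritten expression in a function of one tuple.
udf <- function(tuple) NULL
body(udf) <- rewritten

count <- udf(list("42", "a b c"))
```

Resolving names to positions once, on the client, means the per-tuple function never has to look up a column by name at run time.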
• Task Manager: evaluate the R UDF and execute it
[Figure: Client, Job Manager, and two Task Managers, each running two map tasks; every map task runs the code below.]
map { tuple =>
  executeRFunction(func, tuple)
}
[Figures: performance charts for Local - 1.4GB, Local - 14GB, Local - 1GB, and Cluster - 10GB]
fastR + Flink
• R dataframe abstraction for distributed computation
• Performance gains even on a single node (local mode)
• Approaches native performance even for R UDFs
• Interesting opportunities for:
  • Streaming
  • Other dynamic languages
  • Dynamic re-optimization
Thank you for your attention!

Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes on Flink
