Introducing Arc
A Common Intermediate Language for Unified Batch and Stream Analytics
Klas Segeljakt, Max Meldrum (KTH)
@FlinkForward
Outline
• Project Introduction
• The Arc Intermediate Representation (IR)
• Arc Examples
• Arc + Flink Integration?
• Conclusions
The Big Picture
• Flink plays an important role in the data science landscape
• Combining Flink with other frameworks can lead to interesting applications
• However, there is a language barrier
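Intuition
[Figure: Performance plotted against #Frameworks, comparing separate frameworks f1, f2, f3 with f1, f2, f3 each expressed in IR and with a single combined f1 + f2 + f3 in IR]
• No cross-optimisation between frameworks is possible, e.g. resource sharing
• Data movement costs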
The Arc IR
High-level
• Streams
• Tables
• Linear algebra
Arc
Abstractions
• Pipelines (Operators/Sources/Sinks)
• User-defined Windows
• Out-of-Order Processing, ...
Optimisations
• Compiler: Partial evaluation, ...
• Dataflow: Operator fusion, fission, reordering, ...
Low-level
• Runners
• Hardware
Compiler Pipeline
Frontends → Arc (High Level IR) → Logical Dataflow IR → Physical Dataflow IR → Binaries
• Flink Frontend
• Flink Backend
What is Arc?
A restrictive language for describing batch and stream transformations
Transformations are modelled through:
• Values: Read-only data types (e.g. Vec[T], Stream[T], i8..i64)
• Builders: Write-only data types (e.g. Appender[T])
• Values are written to builders, and builders are lazily materialised back into values
➡ Dependencies between values and builders form a dataflow graph
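
The value/builder split can be modelled in a few lines of Python. This is only a sketch: `Appender`, `merge`, `result` and `for_each` are hypothetical stand-ins that mirror the names used in the Arc code, the element type is fixed to `int`, and materialisation here is eager rather than lazy.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Appender:
    """Hypothetical write-only builder: merged items stay hidden until result()."""
    _items: Tuple[int, ...] = ()


def merge(builder: Appender, value: int) -> Appender:
    """Write one value into the builder and return the updated builder."""
    return Appender(builder._items + (value,))


def result(builder: Appender) -> List[int]:
    """Materialise the builder back into a read-only value (here, a list)."""
    return list(builder._items)


def for_each(values: Iterable[int], builder: Appender,
             body: Callable[[Appender, int], Appender]) -> Appender:
    """Thread the builder through `body` once per input value."""
    for v in values:
        builder = body(builder, v)
    return builder


# Same shape as the map in the Arc example: add 5 to every element.
mapped = result(for_each([1, 2, 3], Appender(), lambda out, v: merge(out, v + 5)))
print(mapped)
```

Because `mapped` can only be produced once every `merge` into its builder has happened, the dependency from the builder to the value is explicit, which is the edge the dataflow graph is built from.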
Arc Example
[Figure: dataflow graph from source through map(v+5), then filter(v%2==0) into evenSink and filter(v%2!=0) into oddSink]
|source:Stream[i32],
 evenSink:StreamAppender[i32],
 oddSink:StreamAppender[i32]|
let mapped = result(for(source,
                        StreamAppender[i32],
                        |out, v| merge(out, v + 5)));
for(mapped, evenSink, |out, v|
    if (v % 2 == 0, merge(out, v), out));
for(mapped, oddSink, |out, v|
    if (v % 2 != 0, merge(out, v), out))
Arc Example (Fused)
[Figure: dataflow graph from source through "map(v+5) then if(x%2==0)" into evenSink and oddSink]
Unfused:
|source:Stream[i32],
 evenSink:StreamAppender[i32],
 oddSink:StreamAppender[i32]|
let mapped = result(for(source,
                        StreamAppender[i32],
                        |out, v| merge(out, v + 5)));
for(mapped, evenSink, |out, v|
    if (v % 2 == 0, merge(out, v), out));
for(mapped, oddSink, |out, v|
    if (v % 2 != 0, merge(out, v), out))
Fused:
|source:Stream[i32],
 evenSink:StreamAppender[i32],
 oddSink:StreamAppender[i32]|
for(source,
    {evenSink,oddSink},
    |out, v|
      let x = v + 5;
      if (x % 2 == 0,
          {merge(out.$1, x), out.$2},
          {out.$1, merge(out.$2, x)}))
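
The effect of the fusion can be checked with a small Python sketch. It assumes plain lists in place of streams and sinks, and the function names `unfused` and `fused` are illustrative only; it is not how Arc itself executes the program.

```python
from typing import Iterable, List, Tuple


def unfused(source: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Three passes: map into an intermediate list, then filter it twice."""
    mapped = [v + 5 for v in source]
    even_sink = [v for v in mapped if v % 2 == 0]
    odd_sink = [v for v in mapped if v % 2 != 0]
    return even_sink, odd_sink


def fused(source: Iterable[int]) -> Tuple[List[int], List[int]]:
    """One pass: map each element and route it, with no intermediate list."""
    even_sink: List[int] = []
    odd_sink: List[int] = []
    for v in source:
        x = v + 5
        (even_sink if x % 2 == 0 else odd_sink).append(x)
    return even_sink, odd_sink


data = list(range(10))
# Both versions must fill the two sinks with the same elements in the same order.
assert unfused(data) == fused(data)
print(fused(data))
```

The fused version touches every input element once and never materialises `mapped`, which is the saving the single-loop Arc program expresses with its pair of builders.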
Arc + Flink?
• Benefits:
  • Enable stronger optimisations
  • Use your other favourite libraries together with Flink
  • Make life easier for data scientists
The black box problem
UDFs are black boxes
➡ Flink is unaware of what is being executed inside of each black box
stream.map( )
      .filter( )
      .reduce( )
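
One way a frontend could look inside a UDF is to call it once with a symbolic argument that records the operations applied to it. The sketch below assumes a hypothetical `Expr` class and `trace` helper, supports only `+`, `*` and `%` with the symbolic value on the left, and is not taken from Arc or Flink.

```python
from typing import Callable, Union


class Expr:
    """Hypothetical symbolic value that records the operations applied to it."""

    def __init__(self, text: str) -> None:
        self.text = text

    def _binary(self, op: str, other: Union["Expr", int]) -> "Expr":
        rhs = other.text if isinstance(other, Expr) else repr(other)
        return Expr(f"({self.text} {op} {rhs})")

    def __add__(self, other: Union["Expr", int]) -> "Expr":
        return self._binary("+", other)

    def __mul__(self, other: Union["Expr", int]) -> "Expr":
        return self._binary("*", other)

    def __mod__(self, other: Union["Expr", int]) -> "Expr":
        return self._binary("%", other)


def trace(udf: Callable[[Expr], Expr], arg: str = "v") -> str:
    """Run the UDF on a symbolic argument and return the recorded expression."""
    return udf(Expr(arg)).text


udf = lambda v: (v + 5) % 2

print(udf(3))      # as a black box, the engine can only call the function
print(trace(udf))  # traced, the engine sees the operations inside it
```

Once the body is available as an expression rather than an opaque callable, an optimiser has something to analyse and rewrite.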
Fusion Levels
[Figure legend: each box = Flink Task ~ Thread]
1. No Fusion: x + 1, x + 1, x + 1, x + 1
2. Task Fusion: x + 1, x + 1, x + 1, x + 1
3. Invocation-level Fusion: x + 1 in a for-loop, 4X
4. Instruction-level Fusion: x + 4
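
The difference between the levels can be sketched in Python under some simplifying assumptions: the task/thread placement that separates levels 1 and 2 is not modelled, so `no_fusion` merely stands for operators that each make their own pass, and all function names are hypothetical.

```python
from typing import Callable, List

ADDENDS = [1, 1, 1, 1]  # four chained maps, each computing x + 1


def make_maps(addends: List[int]) -> List[Callable[[int], int]]:
    """Build one opaque function per map operator."""
    return [lambda x, k=k: x + k for k in addends]


def no_fusion(data: List[int], maps: List[Callable[[int], int]]) -> List[int]:
    """Each operator makes its own pass and materialises an intermediate list."""
    for f in maps:
        data = [f(x) for x in data]
    return data


def invocation_fusion(data: List[int], maps: List[Callable[[int], int]]) -> List[int]:
    """One pass over the data; per element, a for-loop invokes every map."""
    out = []
    for x in data:
        for f in maps:
            x = f(x)
        out.append(x)
    return out


def instruction_fusion(data: List[int], addends: List[int]) -> List[int]:
    """The additions are folded into one constant, leaving a single x + 4."""
    total = sum(addends)
    return [x + total for x in data]


data = [0, 1, 2]
maps = make_maps(ADDENDS)
print(no_fusion(data, maps), invocation_fusion(data, maps), instruction_fusion(data, ADDENDS))
```

Only `instruction_fusion` needs to know what the maps compute: it takes the addends rather than opaque functions, which is why this level is out of reach while UDFs remain black boxes.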
Experiment Results
[Figure: execution time in seconds (log scale, 10^0 to 10^3; lower is better) per optimisation level (None, Task (Flink), Invocation, Instruction), for 50 maps on 10M elements]
Example Frontend Code (Pandas + Beam)
import arc.beam as beam
import arc.beam.transforms.window as window
import arc.beam.transforms.combiners as combiners
import arc.pandas as pandas

def normalise(elements):
    series = pandas.Series(elements)
    avg = series.sum() / series.count()
    return series / avg

p = beam.Pipeline()
(p
 | beam.io.ReadFromText(path='input.txt').with_output_types(int)
 | beam.WindowInto(window.FixedWindows(size=5))
 | beam.CombineGlobally(normalise)
 | combiners.ToList()
 | beam.io.WriteToText(path='output.txt'))
p.run()
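
What the pipeline computes can be approximated with the standard library alone. The sketch below makes several assumptions: input arrives as (timestamp, value) pairs, windows are fixed-size intervals of the timestamp, and the helper names `fixed_windows` and `run` are invented for illustration. It does not use the `arc.beam` or `arc.pandas` frontends.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalise(elements: List[float]) -> List[float]:
    """Divide every element by the mean of its window (mean must be non-zero)."""
    avg = sum(elements) / len(elements)
    return [e / avg for e in elements]


def fixed_windows(events: Iterable[Tuple[float, float]],
                  size: float) -> Dict[int, List[float]]:
    """Group values by the fixed-size time interval their timestamp falls in."""
    windows: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in events:
        windows[int(timestamp // size)].append(value)
    return dict(windows)


def run(events: Iterable[Tuple[float, float]], size: float = 5) -> Dict[int, List[float]]:
    """Window the events, then normalise each window independently."""
    grouped = fixed_windows(events, size)
    return {w: normalise(values) for w, values in sorted(grouped.items())}


events = [(0, 2.0), (1, 4.0), (6, 10.0)]  # (timestamp, value) pairs, made up
print(run(events))
```

Here the windowing logic and the per-window function sit in one program, which is the situation the frontends aim to recreate when both libraries lower to the same IR.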
One more thing...
Arcon: Native Arc Runner
• Flink-inspired dataflow engine built in Rust
• Goals:
  • Common runtime for Arc applications
  • Support dynamic task execution
  • First-class support for hardware acceleration
Conclusions
• Arc is an IR for batch and stream programming.
• By raising the level of abstraction, Arc is able to optimise both the dataflow and the code within it.
Arc and experiments can be found at https://github.com/cda-group
Contact info: klasseg@kth.se & mmeldrum@kth.se
References
Publications:
• Kroll, L., Segeljakt, K., Carbone, P., Schulte, C. and Haridi, S., 2019, June.
Arc: an IR for batch and stream programming. In Proceedings of the 17th
ACM SIGPLAN International Symposium on Database Programming
Languages (pp. 53-58). ACM.
• Meldrum, M., Segeljakt, K., Kroll, L., Carbone, P., Schulte, C. and Haridi,
S., 2019, August. Arcon: Continuous and Deep Data Stream Analytics.
In Proceedings of Real-Time Business Intelligence and Analytics (p. 3).
ACM.