Apache Flink
Deep Dive
Vasia Kalavri
Flink Committer & KTH PhD student
vasia@apache.org
1st Apache Flink Meetup Stockholm
May 11, 2015
Flink Internals
● Job Life-Cycle
○ what happens after you submit a Flink job?
● The Batch Optimizer
○ how are execution plans chosen?
● Delta Iterations
○ how are Flink iterations special for Graph and ML apps?
2
what happens after you
submit a Flink job?
The Flink Stack
(figure: layered stack, *current Flink master + few PRs)
● Libraries: Python, Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow
● APIs and optimizers: DataSet (Java/Scala) + Batch Optimizer | DataStream (Java/Scala) + Streaming Optimizer
● Flink Runtime
● Deployment: Local, Remote, Yarn, Tez, Embedded
4
Program Life-Cycle
DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
  .flatMap((String value, Collector<Tuple2<String, Integer>> out) -> {
    for (String token : value.split("\\W+")) {
      out.collect(new Tuple2<>(token, 1));
    }
  })
  .groupBy(0).aggregate(SUM, 1);
(figure: numbered callouts 1-5 mark the steps of the program life-cycle)
5
● Flink Client & Optimizer: creates and submits the job graph
● Job Manager: creates the execution graph and deploys tasks
● Task Managers: execute tasks and send status updates
(figure: the WordCount program from the previous slide running on a Flink Client, a Job Manager, and two Task Managers; "O Romeo, Romeo, wherefore art thou Romeo?" yields O, 1 / Romeo, 3 / wherefore, 1 / art, 1 / thou, 1 and "Nor arm, nor face, nor any other part" yields nor, 3 / arm, 1 / face, 1 / any, 1 / other, 1 / part, 1)
6
Series of Transformations
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath); // inputPath: path to the input file
DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.filter(str -> str.length() > 40);
second.print();
env.execute();
(figure: Input → Operator X → First → Operator Y → Second)
7
DataSet Abstraction
Think of it as a collection of data elements that can be
produced/recovered in several ways:
… like a Java collection
… like an RDD
… perhaps it is never fully materialized (because the program does not need it to be)
… implicitly updated in an iteration
→ this is transparent to the user
8
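To make the "several ways" concrete, here is a minimal sketch (paths and the job name are hypothetical, not from the slides) in the Java DataSet API: one DataSet built from a plain Java collection and one backed by a file that downstream operators can consume in a pipelined fashion, so it may never be fully materialized.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import java.util.Arrays;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// ... like a Java collection
DataSet<Integer> numbers = env.fromCollection(Arrays.asList(1, 2, 3));

// ... backed by a file (hypothetical path); the filter can be pipelined,
// so the full "lines" DataSet need never exist in memory at once
DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");
DataSet<String> longLines = lines.filter(line -> line.length() > 40);

longLines.writeAsText("hdfs:///path/to/output");
env.execute("DataSet abstraction sketch");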
Example: grep
(figure: the text "Romeo, Romeo, where art thou Romeo?" is loaded once (Load Log) and fed to three operators, Grep 1, Grep 2, and Grep 3, searching for str1, str2, and str3)
9
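As a rough sketch (paths are placeholders), the grep dataflow above expressed with the DataSet API; all three filters share the single Load Log source.

DataSet<String> log = env.readTextFile("hdfs:///path/to/log");   // Load Log

DataSet<String> grep1 = log.filter(line -> line.contains("str1"));
DataSet<String> grep2 = log.filter(line -> line.contains("str2"));
DataSet<String> grep3 = log.filter(line -> line.contains("str3"));

grep1.writeAsText("hdfs:///out/grep1");
grep2.writeAsText("hdfs:///out/grep2");
grep3.writeAsText("hdfs:///out/grep3");

env.execute("grep");

The next two slides show two different ways the runtime could execute this plan.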
Staged (batch) execution
(same grep dataflow: Load Log feeding Grep 1/2/3 for str1, str2, str3)
● Stage 1: create/cache the Log
● Subsequent stages: grep the Log for matches
● Caching in memory, spilling to disk if needed
10
Pipelined execution
(same grep dataflow, with records streaming through all operators at once)
● Stage 1: deploy and start all operators
● Data transfer in memory, spilling to disk if needed
● Note: the Log DataSet is never "created"!
11
12
how are execution plans
chosen?
Flink Batch Optimizer
Inspired by database optimizers, it creates and
selects the execution plan for a user program
14
DataSet<Tuple5<Integer, String, String, String, Integer>> orders = …
DataSet<Tuple2<Integer, Double>> lineitems = …
DataSet<Tuple2<Integer, Integer>> filteredOrders = orders
.filter(...)
.project(0,4).types(Integer.class, Integer.class);
DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders = filteredOrders
.join(lineitems)
.where(0).equalTo(0)
.projectFirst(0,1).projectSecond(1)
.types(Integer.class, Integer.class, Double.class);
DataSet<Tuple3<Integer, Integer, Double>> priceSums = lineitemsOfOrders
.groupBy(0,1).aggregate(Aggregations.SUM, 2);
priceSums.writeAsCsv(outputPath);
A Simple Program
15
Alternative Execution Plans
Best plan depends on relative sizes of input files.
(figure: two physical plans for the same program —
Plan A: DataSource orders.tbl → Filter → Map, broadcast into a Hybrid Hash Join (buildHT side) with DataSource lineitem.tbl forwarded as the probe side, then Combine → GroupRed (sort);
Plan B: both inputs hash-partitioned on [0] into the Hybrid Hash Join (buildHT/probe), then hash-partitioned on [0,1] into GroupRed (sort))
16
17
● Evaluates physical execution strategies
○ e.g. hash-join vs. sort-merge join
● Chooses data shipping strategies
○ e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in
iterations
Optimization Examples
18
case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)
// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...
// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))
// join data sets
val germanVisits: DataSet[(PageVisit, User)] =
// equi-join condition (PageVisit.userId = User.id)
visits.join(germanUsers).where("userId").equalTo("id")
Example: Distributed Joins
The join operator needs to create all pairs of elements from the two inputs for which the join condition evaluates to true
19
Example: Distributed Joins
● Ship Strategy: The input data is distributed across all
parallel instances that participate in the join
● Local Strategy: Each parallel instance performs a join
algorithm on its local partition
For both steps, there are multiple valid strategies which are
favorable in different situations.
20
Repartition-Repartition Strategy
Partitions both inputs
using the same
partitioning function.
All elements that share
the same join key are
shipped to the same
parallel instance and can
be locally joined.
21
Broadcast-Forward Strategy
Sends one complete data set to each parallel instance that holds a partition of the other data set. The other data set remains local and is not shipped at all.
22
The optimizer computes cost estimates for the candidate execution plans and picks the "cheapest" one, considering for example:
● the amount of data shipped over the network
● whether the data of one input is already partitioned
R-R cost: full shuffle of both data sets over the network
B-F cost: depends on the size of the broadcasted data set and the number of parallel instances
Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
How does the Optimizer choose?
23
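If you know your inputs better than the optimizer's estimates, you can also hint the strategy yourself. A sketch, assuming a Flink version whose DataSet API exposes join hints (older versions offer joinWithTiny()/joinWithHuge() for the same purpose), reusing filteredOrders and lineitems from the earlier program:

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;

// Broadcast-Forward: replicate filteredOrders (assumed small) to every parallel
// instance that holds a partition of lineitems
filteredOrders
    .join(lineitems, JoinHint.BROADCAST_HASH_FIRST)
    .where(0).equalTo(0);

// Repartition-Repartition: hash-partition both inputs on the join key
filteredOrders
    .join(lineitems, JoinHint.REPARTITION_HASH_FIRST)
    .where(0).equalTo(0);

Without a hint, the optimizer falls back to its cost estimates as described above.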
how are Flink iterations
special?
● for/while loop in client submits one job per
iteration step
● Data reuse by caching in memory and/or disk
(figure: the client drives a chain of Step jobs, one per iteration)
Iterate by unrolling
25
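A sketch of what unrolling looks like from the client side (pathFor and step are hypothetical helpers; assume pathFor(0) already holds the initial state): every pass through the loop builds a new plan, writes the result to the file system, and submits a separate job.

String statePath = pathFor(0);

for (int i = 1; i <= numSteps; i++) {
    DataSet<Tuple2<Long, Double>> previous =
        env.readCsvFile(statePath).types(Long.class, Double.class);   // re-read previous result
    DataSet<Tuple2<Long, Double>> next = step(previous);              // hypothetical step function
    statePath = pathFor(i);
    next.writeAsCsv(statePath);
    env.execute("step " + i);                                         // one job per iteration step
}

Each step pays job-submission and scheduling cost, and state travels through the file system between steps, which is exactly what the native iteration operators on the next slide avoid.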
Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically
(figure annotations: caching loop-invariant data, pushing work "out of the loop", maintaining state as an index)
26
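For comparison, a minimal sketch of a native bulk iteration with DataSet.iterate()/closeWith(), in the style of the classic pi-estimation example (the 10000-step count is illustrative):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.operators.IterativeDataSet;

IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);

DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer count) {
        double x = Math.random(), y = Math.random();
        return count + ((x * x + y * y < 1) ? 1 : 0);   // sample one point per step
    }
});

// closeWith() feeds the result back as the next step's input; the operators are
// scheduled once and stay deployed for all steps
DataSet<Integer> samplesInCircle = initial.closeWith(iteration);
samplesInCircle.print();   // depending on the Flink version, print() may need a following env.execute()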
Flink Iteration Operators
(figure: the two operators —
Iterate: Input → Iterative Update Function → Result, with the result replacing the input of the next iteration;
IterateDelta: Workset + Solution Set (the state) → Iterative Update Function → Result, updating the Solution Set and producing the next Workset)
27
Delta Iteration
● Not all the elements of the state are updated
in each iteration.
● The elements that require an update are stored in the workset.
● The step function is applied only to the
workset elements.
28
Partition a graph into components by iteratively
propagating the min vertex ID among neighbors
Example: Connected Components
29
Delta-Connected Components
30
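A sketch of connected components as a delta iteration in the Java DataSet API. Assumptions (not from the slides): verticesWithInitialId is a DataSet<Tuple2<Long, Long>> of (vertexId, componentId) with componentId initialized to the vertex's own ID, and edges is a DataSet<Tuple2<Long, Long>> containing every undirected edge in both directions.

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

int maxIterations = 100;

DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
    verticesWithInitialId.iterateDelta(verticesWithInitialId, maxIterations, 0);

// Send each workset vertex's component ID to its neighbors and keep the minimum per vertex
DataSet<Tuple2<Long, Long>> candidates = iteration.getWorkset()
    .join(edges).where(0).equalTo(0)
    .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
        @Override
        public Tuple2<Long, Long> join(Tuple2<Long, Long> vertex, Tuple2<Long, Long> edge) {
            return new Tuple2<>(edge.f1, vertex.f1);   // (neighbor, candidate component ID)
        }
    })
    .groupBy(0).min(1);

// Compare against the solution set; only vertices whose component ID shrinks
// become the solution-set delta and the next workset
DataSet<Tuple2<Long, Long>> delta = candidates
    .join(iteration.getSolutionSet()).where(0).equalTo(0)
    .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
        @Override
        public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> current,
                         Collector<Tuple2<Long, Long>> out) {
            if (candidate.f1 < current.f1) {
                out.collect(candidate);
            }
        }
    });

DataSet<Tuple2<Long, Long>> components = iteration.closeWith(delta, delta);

Because only updated vertices re-enter the workset, the work per superstep shrinks as components converge.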
31
Performance
32
Read the documentation and our blog posts!
● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance
Want to learn more?
33
Apache Flink
Deep Dive
Vasia Kalavri
Flink Committer & KTH PhD student
vasia@apache.org
1st Apache Flink Meetup Stockholm
May 11, 2015