Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink

trohrmann@apache.org
@stsffap

Better be safe than sorry
• Failures will happen
• EMC estimated $1.7 billion in costs due to data loss and system downtime
• Recovery will save you time and costs
• Switch between algorithms
• Live upgrade of your system

Fault tolerance guarantees
• At most once
  - No guarantees at all
• At least once
  - Sufficient for many applications
• Exactly once
• Flink supports all three guarantees
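
The difference between the three guarantees can be pictured with a toy model. The sketch below is not Flink code: the class `GuaranteeDemo`, its method names, and the way each guarantee is modelled are assumptions made for illustration. It sums a list of records, assumes a checkpoint was taken before index `checkpointAt`, a single failure before index `failAt`, and then computes what the final sum looks like under each guarantee.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical, not Flink API): sum a stream that fails once and recovers. */
class GuaranteeDemo {

    enum Guarantee { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    /** Sum of records in the index range [from, to). */
    static long sumRange(List<Long> records, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += records.get(i);
        }
        return sum;
    }

    /** Final sum after one failure, assuming checkpointAt <= failAt <= records.size(). */
    static long sumWithOneFailure(List<Long> records, int checkpointAt, int failAt, Guarantee g) {
        int end = records.size();
        switch (g) {
            case AT_MOST_ONCE:
                // State is lost and nothing is replayed: records before the failure are dropped.
                return sumRange(records, failAt, end);
            case AT_LEAST_ONCE:
                // State survives as of the failure, but the source rewinds to the checkpoint:
                // records in [checkpointAt, failAt) are counted twice.
                return sumRange(records, 0, failAt) + sumRange(records, checkpointAt, end);
            default:
                // EXACTLY_ONCE: state and source position are restored together.
                return sumRange(records, 0, checkpointAt) + sumRange(records, checkpointAt, end);
        }
    }

    public static void main(String[] args) {
        List<Long> records = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);
        for (Guarantee g : Guarantee.values()) {
            System.out.println(g + ": " + sumWithOneFailure(records, 4, 7, g));
        }
    }
}
```

For the records 1 to 10 with the checkpoint before index 4 and the failure before index 7, the model yields 27 (at most once), 73 (at least once) and 55 (exactly once, the true sum).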

Checkpoints
• Consistent snapshots of distributed data stream and operator state

Barriers
• Markers for checkpoints
• Injected into the data flow
• Alignment for multi-input operators
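
Alignment can be sketched in plain Java. This is a simplified model, not Flink's implementation: the class `BarrierAlignmentDemo`, the sentinel `BARRIER`, and the round-robin reading order are assumptions for illustration. An operator sums records from several input channels. Once a barrier arrives on one channel, that channel is blocked until the barrier has arrived on every channel; only then is the state snapshotted and the channels released.

```java
import java.util.Arrays;

/** Simplified barrier alignment for a multi-input operator (illustrative sketch). */
class BarrierAlignmentDemo {

    /** Hypothetical sentinel value standing in for a checkpoint barrier. */
    static final int BARRIER = Integer.MIN_VALUE;

    /** Returns { state captured at the aligned barrier, final state }; the first is -1 if no snapshot happened. */
    static long[] run(int[][] inputs) {
        int n = inputs.length;
        int[] pos = new int[n];              // next unread index per channel
        boolean[] blocked = new boolean[n];  // channels whose barrier has already arrived
        int barriersSeen = 0;
        long sum = 0;                        // the operator state
        long snapshot = -1;
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int ch = 0; ch < n; ch++) {
                // A blocked channel is not read: its post-barrier records must not
                // reach the state before the snapshot is taken.
                if (blocked[ch] || pos[ch] >= inputs[ch].length) {
                    continue;
                }
                int item = inputs[ch][pos[ch]++];
                progress = true;
                if (item == BARRIER) {
                    blocked[ch] = true;
                    barriersSeen++;
                    if (barriersSeen == n) {
                        snapshot = sum;               // all barriers aligned: take the snapshot
                        Arrays.fill(blocked, false);  // release all channels
                        barriersSeen = 0;
                    }
                } else {
                    sum += item;
                }
            }
        }
        return new long[] { snapshot, sum };
    }

    public static void main(String[] args) {
        int[][] inputs = {
            { 1, 2, BARRIER, 100 },
            { 10, 20, 30, BARRIER, 200 }
        };
        long[] result = run(inputs);
        System.out.println("snapshot=" + result[0] + ", final=" + result[1]);
    }
}
```

With the inputs in `main`, the snapshot is 63, exactly the sum of all records that precede the barriers (1 + 2 + 10 + 20 + 30). Without blocking the first channel, the record 100 would have been added before the second barrier arrived and the snapshot would mix pre-barrier and post-barrier data.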

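Operator State
• Stateless operators, for example a filter:
    ds.filter(_ != 0)
• System state, for example window contents managed by Flink:
    ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
• User-defined state, for example a counter kept inside a reduce function:
    // Imports assume the Flink release of the talk, where OperatorState
    // lives in org.apache.flink.api.common.state.
    import org.apache.flink.api.common.functions.RichReduceFunction;
    import org.apache.flink.api.common.state.OperatorState;
    import org.apache.flink.configuration.Configuration;

    public class CounterSum extends RichReduceFunction<Long> {
        // Counts the calls to reduce(); registered as operator state in open().
        private OperatorState<Long> counter;

        @Override
        public Long reduce(Long v1, Long v2) throws Exception {
            counter.update(counter.value() + 1);
            return v1 + v2;
        }

        @Override
        public void open(Configuration config) throws Exception {
            // Arguments: state name, default value, partitioned flag.
            counter = getRuntimeContext().getOperatorState("counter", 0L, false);
        }
    }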

Advantages
• Separation of app logic from recovery
  - Checkpointing interval is just a config parameter
• High throughput
  - Controllable checkpointing overhead
• Low impact on latency

Without high availability
[Diagram: JobManager, TaskManager]

With high availability
[Diagram: JobManager, TaskManager, stand-by JobManager, Apache ZooKeeper™. Caption: KEEP GOING]

Persisting jobs
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™, Job]
1. Submit job
2. Persist execution graph
3. Write handle to ZooKeeper
4. Deploy tasks
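
The job submission steps above can be sketched as plain Java. This is an illustrative model only: the class `JobPersistenceDemo`, the in-memory maps standing in for durable storage and ZooKeeper, the handle naming scheme, and the `recoverJobs` method of a stand-by JobManager are all assumptions, not Flink's actual classes. The point it shows is that only a small handle goes to ZooKeeper, while the execution graph itself is assumed to live in a separate durable store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model of persisting jobs so that a stand-by JobManager can take over. */
class JobPersistenceDemo {

    static final class JobManager {
        private final Map<String, String> storage;    // handle -> execution graph (durable store stand-in)
        private final Map<String, String> zooKeeper;  // job id -> handle (ZooKeeper stand-in)
        final Map<String, String> running = new TreeMap<>(); // job id -> deployed graph

        JobManager(Map<String, String> storage, Map<String, String> zooKeeper) {
            this.storage = storage;
            this.zooKeeper = zooKeeper;
        }

        /** Step 1 is the call itself: the client submits the job. */
        void submit(String jobId, String executionGraph) {
            String handle = "graph-" + jobId;     // hypothetical handle format
            storage.put(handle, executionGraph);  // 2. persist execution graph
            zooKeeper.put(jobId, handle);         // 3. write handle to ZooKeeper
            running.put(jobId, executionGraph);   // 4. deploy tasks
        }

        /** Assumed recovery path: follow each handle and redeploy the stored graph. */
        void recoverJobs() {
            for (Map.Entry<String, String> entry : zooKeeper.entrySet()) {
                running.put(entry.getKey(), storage.get(entry.getValue()));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> storage = new HashMap<>();
        Map<String, String> zooKeeper = new HashMap<>();

        JobManager leader = new JobManager(storage, zooKeeper);
        leader.submit("job-1", "source -> map -> sink");

        // The leader fails; a stand-by JobManager shares only the storage and ZooKeeper.
        JobManager standBy = new JobManager(storage, zooKeeper);
        standBy.recoverJobs();
        System.out.println(standBy.running);
    }
}
```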

Handling checkpoints
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1. Take snapshots
2. Persist snapshots
3. Send handles to JM (JobManager)
4. Create global checkpoint
5. Persist global checkpoint
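
The checkpoint steps above can be sketched in the same style. Again this is a minimal model under stated assumptions, not Flink's coordinator: the class `CheckpointCoordinatorDemo`, the in-memory `storage` map, and the handle names are hypothetical. Each task persists its snapshot and reports a handle; the global checkpoint is created and persisted only once every expected task has reported.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative model of a JobManager collecting snapshot handles into a global checkpoint. */
class CheckpointCoordinatorDemo {

    private final Set<String> expectedTasks;
    private final Map<String, String> storage;                      // durable store stand-in
    private final Map<String, String> handles = new TreeMap<>();    // task -> snapshot handle
    private String globalCheckpointHandle;                          // null until complete

    CheckpointCoordinatorDemo(Set<String> expectedTasks, Map<String, String> storage) {
        this.expectedTasks = expectedTasks;
        this.storage = storage;
    }

    /** Steps 1 to 3 for one task: take the snapshot, persist it, send the handle. */
    void snapshotTask(String task, String state) {
        String handle = "snapshot/" + task;  // hypothetical handle format
        storage.put(handle, state);
        receiveHandle(task, handle);
    }

    private void receiveHandle(String task, String handle) {
        if (!expectedTasks.contains(task)) {
            throw new IllegalArgumentException("unknown task: " + task);
        }
        handles.put(task, handle);
        if (handles.keySet().containsAll(expectedTasks)) {
            String global = String.join(",", handles.values());  // 4. create global checkpoint
            globalCheckpointHandle = "global/checkpoint";
            storage.put(globalCheckpointHandle, global);         // 5. persist global checkpoint
        }
    }

    /** Handle of the completed global checkpoint, or null while handles are still missing. */
    String globalCheckpointHandle() {
        return globalCheckpointHandle;
    }

    public static void main(String[] args) {
        Map<String, String> storage = new HashMap<>();
        Set<String> tasks = new HashSet<>(Arrays.asList("map-0", "map-1"));
        CheckpointCoordinatorDemo jm = new CheckpointCoordinatorDemo(tasks, storage);

        jm.snapshotTask("map-0", "count=3");
        System.out.println("after first task: " + jm.globalCheckpointHandle());
        jm.snapshotTask("map-1", "count=5");
        System.out.println("after second task: " + storage.get(jm.globalCheckpointHandle()));
    }
}
```

A partial set of handles never becomes a global checkpoint in this model, which mirrors the idea that recovery should only ever restore from a snapshot that covers all tasks.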

TL;DL
• Job recovery mechanism with low latency and high throughput
• Exactly-once processing semantics
• No single point of failure
• Flink will always keep processing your data
