Flink Forward SF 2017: Konstantinos Kloudas - Extending Flink’s Streaming APIs

1
Kostas Kloudas
@KLOUBEN_K
Flink Forward San Francisco
April 11, 2017
Extending Flink’s Streaming APIs

2
Original creators of Apache Flink®
Providers of the dA Platform, a supported Flink distribution

Extensions to the DataStream API
4
 ProcessFunction for Low-level Operations
 Support for Asynchronous I/O

Stream Processing
6
Computations on never-ending “streams” of events

Distributed Stream Processing
7
Computation spread across many machines

Stateful Stream Processing
8
Computation with State
Result depends on the history of the stream

Stream Processing Engines
 Time:
• handle infinite streams
• with out-of-order events
 State:
• guarantee fault-tolerance (distributed)
• guarantee consistency (infinite streams)
9

 Gives access to all basic building blocks:
• Events
• Fault-tolerant, Consistent State
• Timers (event- and processing-time)
• Side Outputs
10
ProcessFunction

Common Use Case Skeleton A
 On each incoming element:
• update some state
• register a callback for a moment in the future
 When that moment comes:
• Check a condition and perform a certain
action, e.g. emit an element
11

 Use built-in windowing:
• +Expressive
• +A lot of functionality out-of-the-box
• - Not always intuitive
• - An overkill for simple cases
 Write your own operator:
• - Too many things to account for
12
Before the ProcessFunction

 Simple yet powerful API:
13
/**
* Process one element from the input stream.
*/
void processElement(I value, Context ctx, Collector<O> out) throws Exception;
/**
* Called when a timer set using {@link TimerService} fires.
*/
void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception;
ProcessFunction

14
Collector<O> out (in processElement and onTimer):
• A collector to emit result values
ProcessFunction

15
Context ctx (in processElement):
1. Get the timestamp of the element
2. Register and use side outputs
3. Interact with the TimerService to:
• query the current time
• register timers
OnTimerContext ctx (in onTimer):
1. Do the above
2. Query if we are on Event or Processing time
ProcessFunction

 Requirements:
• maintain counts per incoming key, and
• emit the key/count pair if no element came for
the key in the last 100 ms (in event time)
16
ProcessFunction: example

17
 Implementation sketch:
• Store the count, key and last mod timestamp in a ValueState (scoped by key)
• For each record:
  • update the counter and the last mod timestamp
  • register a timer 100ms from “now” (in event time)
• When the timer fires:
  • check the timer’s timestamp against the last mod time for that key and
  • emit the key/count pair if they differ by 100ms
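
The sketch above can be tried out without a Flink cluster. The following is a minimal, Flink-free simulation under stated assumptions: the class IdleKeyCounterSketch, the advanceWatermark method and the TreeMap standing in for the TimerService are all hypothetical names invented for illustration, not Flink APIs. It only shows why a timer that was superseded by a newer element emits nothing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Flink-free simulation of the implementation sketch; all names are hypothetical. */
class IdleKeyCounterSketch {
    static final long GAP_MS = 100;

    /** Per-key state: the running count and the last modification timestamp. */
    static final class CounterWithTS {
        long count;
        long lastModified;
    }

    // Stands in for keyed ValueState: one entry per key.
    private final Map<String, CounterWithTS> state = new HashMap<>();
    // Stands in for the TimerService: firing timestamp -> keys with a timer at that time.
    private final TreeMap<Long, List<String>> timers = new TreeMap<>();
    // Stands in for the output Collector.
    final List<String> emitted = new ArrayList<>();

    /** For each record: update the counter and timestamp, then register a timer. */
    void processElement(String key, long timestamp) {
        CounterWithTS current = state.computeIfAbsent(key, k -> new CounterWithTS());
        current.count++;
        current.lastModified = timestamp;
        timers.computeIfAbsent(timestamp + GAP_MS, t -> new ArrayList<>()).add(key);
    }

    /** Fires every timer whose timestamp is at or before the watermark, in time order. */
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, List<String>> entry = timers.pollFirstEntry();
            for (String key : entry.getValue()) {
                onTimer(key, entry.getKey());
            }
        }
    }

    /** Emits only if no newer element moved lastModified after this timer was set. */
    private void onTimer(String key, long timerTimestamp) {
        CounterWithTS result = state.get(key);
        if (timerTimestamp == result.lastModified + GAP_MS) {
            emitted.add(key + "=" + result.count);
        }
    }

    public static void main(String[] args) {
        IdleKeyCounterSketch sketch = new IdleKeyCounterSketch();
        sketch.processElement("a", 0);
        sketch.processElement("a", 50); // supersedes the timer registered for t=100
        sketch.processElement("b", 60);
        sketch.advanceWatermark(200);
        System.out.println(sketch.emitted);
    }
}
```

Timers are never deleted here; a stale timer simply fails the equality check, which mirrors the comparison described in the last bullet.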

18
public class MyProcessFunction extends
    ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {

  // define your state descriptors

  @Override
  public void processElement(Tuple2<String, String> value, Context ctx,
      Collector<Tuple2<String, Long>> out) throws Exception {
    // update our state and register a timer
  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx,
      Collector<Tuple2<String, Long>> out) throws Exception {
    // check the state for the key and emit a result if needed
  }
}

19
// define your state descriptors
private final ValueStateDescriptor<CounterWithTS> stateDesc =
    new ValueStateDescriptor<>("myState", CounterWithTS.class);

20
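@Override
public void processElement(Tuple2<String, String> value, Context ctx,
    Collector<Tuple2<String, Long>> out) throws Exception {
  // fetch the state scoped to the current key
  ValueState<CounterWithTS> state = getRuntimeContext().getState(stateDesc);
  CounterWithTS current = state.value();
  if (current == null) {
    current = new CounterWithTS();
    current.key = value.f0;
  }
  // update the count and the last modification timestamp
  current.count++;
  current.lastModified = ctx.timestamp();
  state.update(current);
  // register an event-time timer 100 ms after this element
  ctx.timerService().registerEventTimeTimer(current.lastModified + 100);
}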

21
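@Override
public void onTimer(long timestamp, OnTimerContext ctx,
    Collector<Tuple2<String, Long>> out) throws Exception {
  CounterWithTS result = getRuntimeContext().getState(stateDesc).value();
  // emit only if no newer element arrived since this timer was registered
  if (timestamp == result.lastModified + 100) {
    out.collect(new Tuple2<String, Long>(result.key, result.count));
  }
}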

22
stream.keyBy("key")
    .process(new MyProcessFunction());

ProcessFunction: Side Outputs
 Additional (to the main) output streams
 No type limitations
• each side output can have its own type
23

 Requirements:
• maintain counts per incoming key, and
• emit the key/count pair if no element came for
the key in the last 100 ms (in event time)
• otherwise, if the count > 10, send the key
to a side-output named gt10
24
ProcessFunction: example+

25
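final OutputTag<String> outputTag = new OutputTag<String>("gt10"){};
SingleOutputStreamOperator<Tuple2<String, Long>> mainStream = input.process(
    new ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>>() {
      // processElement(...) as before
      @Override
      public void onTimer(long timestamp, OnTimerContext ctx,
          Collector<Tuple2<String, Long>> out) throws Exception {
        CounterWithTS result = getRuntimeContext().getState(stateDesc).value();
        if (timestamp == result.lastModified + 100) {
          out.collect(new Tuple2<String, Long>(result.key, result.count));
        } else if (result.count > 10) {
          // send the key to the side output instead of the main output
          ctx.output(outputTag, result.key);
        }
      }
    });
DataStream<String> sideOutputStream = mainStream.getSideOutput(outputTag);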

27
 Applicable to Keyed streams
 For Non-Keyed streams:
 group on a dummy key if you need the timers
 BEWARE: parallelism of 1
 Use it directly without the timers
 CoProcessFunction for low-level joins:
• Applied on two input streams
ProcessFunction

Common Use Case Skeleton B
29
 On each incoming element:
• extract some info from the element (e.g. key)
• query an external storage system (DB or KV-
store) for additional info
• emit an enriched version of the input element

 Write a MapFunction that queries the DB:
• +Simple
• - Slow (synchronous access) and/or
• - Requires high parallelism (more tasks)
30
Before the AsyncIO support

33
Communication delay can
dominate application
throughput and latency
Synchronous Access
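
A back-of-the-envelope calculation shows why the communication delay dominates. The sketch below assumes a hypothetical 10 ms round trip to the external store and compares one request at a time with 100 requests in flight; both numbers and the class name AsyncThroughputSketch are illustrative assumptions, not measurements from the talk.

```java
/** Upper-bound arithmetic only; the latency and in-flight numbers are assumptions. */
class AsyncThroughputSketch {

    /**
     * Upper bound on elements per second for one task, given how many
     * requests may be in flight and the round-trip time of one request.
     */
    static double maxElementsPerSecond(int inFlightRequests, double roundTripMillis) {
        return inFlightRequests * 1000.0 / roundTripMillis;
    }

    public static void main(String[] args) {
        // Synchronous access: the task waits for each response before sending the next request.
        System.out.println(maxElementsPerSecond(1, 10.0));   // prints 100.0
        // Asynchronous access: 100 requests overlap their waiting time.
        System.out.println(maxElementsPerSecond(100, 10.0)); // prints 10000.0
    }
}
```

With synchronous access the only way to raise this bound is to add parallel tasks, which is the "requires high parallelism" drawback of the MapFunction approach.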

 Requirement:
• a client that supports asynchronous requests
 Flink handles the rest:
• integration of async IO with DataStream API
• fault-tolerance
• order of emitted elements
• correct time semantics (event/processing time)
35
AsyncFunction

 Simple API:
/**
* Trigger async operation for each stream input.
*/
void asyncInvoke(IN input, AsyncCollector<OUT> collector) throws Exception;
 API call:
/**
* Example async function call.
*/
DataStream<...> result = AsyncDataStream.(un)orderedWait(stream,
new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100);
36
AsyncFunction

37
AsyncWaitOperator:
• a queue of “Promises”
• a separate thread (Emitter)
[Diagram: an AsyncWaitOperator holding queued promises P1 to P4 and an Emitter thread, with element E5 arriving]
AsyncFunction

38
AsyncWaitOperator, when element E5 arrives:
• Wrap E5 in a “promise” P5
• Put P5 in the queue
• Call asyncInvoke(E5, P5)
AsyncFunction

39
asyncInvoke(value, asyncCollector):
• a user-defined function
• value : the input element
• asyncCollector : the collector of the result (when the query returns)
AsyncFunction

40
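Inside asyncInvoke(E5, P5):
// client stands for any asynchronous client; its future type must
// support completion callbacks (e.g. java.util.concurrent.CompletableFuture)
Future<String> future = client.query(E5);
future.thenAccept((String result) -> {
  // complete the promise P5 with the enriched element
  P5.collect(Collections.singleton(new Tuple2<>(E5, result)));
});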
AsyncFunction

42
Emitter:
• separate thread
• polls queue for completed promises (blocking)
• emits elements downstream
AsyncFunction

43
DataStream<Tuple2<String, String>> result =
AsyncDataStream.(un)orderedWait(stream,
new MyAsyncFunction(),
1000, TimeUnit.MILLISECONDS,
100);
 our asyncFunction
 a timeout: max time until considered failed
 capacity: max number of in-flight requests
AsyncFunction
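
The capacity argument can be pictured as a fixed number of slots for in-flight requests. The following is a minimal, Flink-free sketch: InFlightLimiterSketch and tryAdmit are hypothetical names, and returning an empty Optional merely stands in for the backpressure that the operator applies once its capacity is exhausted. Timeouts are not modeled, beyond the fact that a failed request frees its slot too.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

/** Hypothetical model of the capacity argument; not Flink code. */
class InFlightLimiterSketch {
    private final Semaphore permits;

    InFlightLimiterSketch(int capacity) {
        this.permits = new Semaphore(capacity);
    }

    /**
     * Admits one request if fewer than `capacity` requests are in flight.
     * An empty result stands in for backpressure on the upstream operator.
     */
    Optional<CompletableFuture<String>> tryAdmit() {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        CompletableFuture<String> promise = new CompletableFuture<>();
        // Free the slot when the request finishes, successfully or not.
        promise.whenComplete((result, error) -> permits.release());
        return Optional.of(promise);
    }

    int availableSlots() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        InFlightLimiterSketch limiter = new InFlightLimiterSketch(2);
        CompletableFuture<String> first = limiter.tryAdmit().get();
        limiter.tryAdmit();
        System.out.println(limiter.tryAdmit().isPresent()); // both slots are taken
        first.complete("done");
        System.out.println(limiter.tryAdmit().isPresent()); // the completed request freed a slot
    }
}
```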

45
Ideally...
[Diagram: promises P1 to P4 in the queue, elements E1 to E4 leaving through the Emitter]
AsyncFunction

46
AsyncDataStream.unorderedWait(stream, ..., 100);
Realistically...
...output ordered based on which request finished first
[Diagram: promises P1 to P4 in the queue, elements E1 to E4 leaving through the Emitter]
AsyncFunction

47
 unorderedWait: emit results in order of completion
 orderedWait: emit results in order of arrival
 Always: watermarks never overtake elements and vice versa
AsyncFunction
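
The difference between the two modes can be reproduced with plain CompletableFuture objects. This is a simplified, Flink-free sketch: EmissionOrderSketch, collectUnordered and emitOrdered are hypothetical names, the promises are completed by hand in a fixed order, and watermarks, timeouts and checkpointing are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical illustration of ordered vs. unordered emission; not Flink code. */
class EmissionOrderSketch {

    /** unorderedWait-style: each result is emitted as soon as its promise completes. */
    static List<String> collectUnordered(List<CompletableFuture<String>> promises) {
        List<String> out = Collections.synchronizedList(new ArrayList<>());
        for (CompletableFuture<String> promise : promises) {
            promise.thenAccept(out::add);
        }
        return out;
    }

    /** orderedWait-style: results leave in arrival order; the head blocks those behind it. */
    static List<String> emitOrdered(List<CompletableFuture<String>> promisesInArrivalOrder) {
        List<String> out = new ArrayList<>();
        for (CompletableFuture<String> promise : promisesInArrivalOrder) {
            out.add(promise.join());
        }
        return out;
    }

    public static void main(String[] args) {
        CompletableFuture<String> p1 = new CompletableFuture<>();
        CompletableFuture<String> p2 = new CompletableFuture<>();
        CompletableFuture<String> p3 = new CompletableFuture<>();
        List<CompletableFuture<String>> queue = Arrays.asList(p1, p2, p3);

        List<String> unordered = collectUnordered(queue);
        // The requests finish out of order: third, first, second.
        p3.complete("r3");
        p1.complete("r1");
        p2.complete("r2");

        System.out.println(unordered);          // prints [r3, r1, r2]
        System.out.println(emitOrdered(queue)); // prints [r1, r2, r3]
    }
}
```

In the ordered case a slow request at the head of the queue delays every result behind it, which is the latency cost of preserving arrival order.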

Documentation
 ProcessFunction:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/process_function.html
 AsyncIO:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html
48

4
Thank you!
@KLOUBEN_K
@ApacheFlink
@dataArtisans

50
Stream Processing
and Apache Flink®'s
approach to it
@StephanEwen
Apache Flink PMC
CTO @ data Artisans
Flink Forward is coming back to Berlin
September 11-13, 2017
berlin.flink-forward.org

We are hiring!
data-artisans.com/careers
