Apache Flink® Training
DataStream API Basic
August 26, 2015
DataStream API
 Stream Processing
 Java and Scala
 All examples here in Java
 Documentation available at
flink.apache.org
 Currently labeled as beta – some API
changes are pending
• Noted in the slides with a warning
2
DataStream API by Example
3
Window WordCount: main Method
4
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Stream Execution Environment
5
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Data Sources
6
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Data types
7
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Transformations
8
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
User functions
9
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
DataSinks
10
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Execute!
11
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Window WordCount: FlatMap
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
12
WordCount: FlatMap: Interface
13
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
WordCount: FlatMap: Types
14
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
WordCount: FlatMap: Collector
15
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
DataStream API Concepts
16
(Selected) Data Types
 Basic Java Types
• String, Long, Integer, Boolean,…
• Arrays
 Composite Types
• Tuples
• Many more (covered in the advanced slides)
17
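For instance, both basic types and tuples can be used directly as stream elements. A minimal sketch (assuming the usual Flink imports and the execution environment shown earlier):
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// a stream of a basic Java type
DataStream<String> names = env.fromElements("Max", "Julia", "Anna");
// a stream of a composite tuple type: (name, age)
DataStream<Tuple2<String, Integer>> persons = env.fromElements(
new Tuple2<String, Integer>("Max", 42),
new Tuple2<String, Integer>("Julia", 27));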
Tuples
 The easiest and most lightweight way of
encapsulating data in Flink
 Tuple1 up to Tuple25
Tuple2<String, String> person = new Tuple2<>("Max", "Mustermann");
Tuple3<String, String, Integer> person = new Tuple3<>("Max", "Mustermann", 42);
Tuple4<String, String, Integer, Boolean> person =
new Tuple4<>("Max", "Mustermann", 42, true);
// zero based index!
String firstName = person.f0;
String secondName = person.f1;
Integer age = person.f2;
Boolean fired = person.f3;
18
Transformations: Map
DataStream<Integer> integers = env.fromElements(1, 2, 3, 4);
// Regular Map - Takes one element and produces one element
DataStream<Integer> doubleIntegers =
integers.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) {
return value * 2;
}
});
doubleIntegers.print();
> 2, 4, 6, 8
// Flat Map - Takes one element and produces zero, one, or more elements.
DataStream<Integer> doubleIntegers2 =
integers.flatMap(new FlatMapFunction<Integer, Integer>() {
@Override
public void flatMap(Integer value, Collector<Integer> out) {
out.collect(value * 2);
}
});
doubleIntegers2.print();
> 2, 4, 6, 8
19
Transformations: Filter
// The DataStream
DataStream<Integer> integers = env.fromElements(1, 2, 3, 4);
DataStream<Integer> filtered =
integers.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer value) {
return value != 3;
}
});
filtered.print();
> 1, 2, 4
20
Transformations: Partitioning
 DataStreams can be partitioned by a key
21
// (name, age) of employees
DataStream<Tuple2<String, Integer>> employees = …
// group by the second field (age)
DataStream<Tuple2<String, Integer>> grouped = employees.groupBy(1);
(Figure: the unpartitioned stream Stephan,18 | Fabian,23 | Julia,27 | Anna,18 | Romeo,27 | Ben,25 is repartitioned by age into the groups Anna,18 + Stephan,18 / Julia,27 + Romeo,27 / Fabian,23 / Ben,25)
Warning: Possible renaming in next releases
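To make this concrete, here is a runnable sketch of the same grouping, using the 0.9-era groupBy shown above (sample data taken from the figure):
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// (name, age) of employees
DataStream<Tuple2<String, Integer>> employees = env.fromElements(
new Tuple2<String, Integer>("Stephan", 18),
new Tuple2<String, Integer>("Fabian", 23),
new Tuple2<String, Integer>("Julia", 27),
new Tuple2<String, Integer>("Anna", 18),
new Tuple2<String, Integer>("Romeo", 27),
new Tuple2<String, Integer>("Ben", 25));
// records with the same age are sent to the same parallel instance
employees.groupBy(1).print();
env.execute("Partitioning example");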
Data Shipping Strategies
 Optionally, you can specify how data is shipped
between two transformations
 Forward: stream.forward()
• Only local communication
 Rebalance: stream.rebalance()
• Round-robin partitioning
 Partition by hash: stream.partitionByHash(...)
 Custom partitioning: stream.partitionCustom(...)
 Broadcast: stream.broadcast()
• Broadcast to all nodes
22
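A short sketch of how these calls sit between two transformations, reusing the WordCount Splitter and environment from earlier (a hedged example; the exact shipping defaults depend on the Flink version):
DataStream<String> lines = env.socketTextStream("localhost", 9999);
// round-robin the raw lines across all parallel flatMap instances
DataStream<Tuple2<String, Integer>> pairs =
lines.rebalance().flatMap(new Splitter());
// ship each pair to the downstream task chosen by hashing the word field
pairs.partitionByHash(0).print();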
Data Sources
Collection
 fromCollection(collection)
 fromElements(1,2,3,4,5)
23
Data Sources (2)
Text socket
 socketTextStream("hostname",port)
Text file
 readFileStream("/path/to/file", 1000,
WatchType.PROCESS_ONLY_APPENDED)
Connectors
 E.g., Apache Kafka, RabbitMQ, …
24
Data Sources: Collections
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// read from elements
DataStream<String> names = env.fromElements("Some", "Example", "Strings");
// read from Java collection
List<String> list = new ArrayList<String>();
list.add("Some");
list.add("Example");
list.add("Strings");
DataStream<String> namesFromCollection = env.fromCollection(list);
25
Data Sources: Files, Sockets, Connectors
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// read text socket from port
DataStream<String> socketLines = env
.socketTextStream("localhost", 9999);
// read a text file, ingesting new elements every 1000 milliseconds
DataStream<String> localLines = env
.readFileStream("/path/to/file", 1000,
WatchType.PROCESS_ONLY_APPENDED);
26
Data Sinks
Text
 writeAsText("/path/to/file")
CSV
 writeAsCsv("/path/to/file")
Return data to the Client
 print()
27
Note: Identical to
DataSet API
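For the WordCount stream built earlier, these sinks could be attached as sketched below (the paths are placeholders):
// write the (word, count) stream as plain text lines
counts.writeAsText("/path/to/wordcount.txt");
// write it as comma-separated values
counts.writeAsCsv("/path/to/wordcount.csv");
// or return it to the client and print it on the command line
counts.print();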
Data Sinks (2)
Socket
 writeToSocket(hostname, port, SerializationSchema)
Connectors
 E.g., Apache Kafka, Elasticsearch,
Rolling HDFS Files
28
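For example, writing a stream of strings to a TCP socket; SimpleStringSchema is assumed here as the serialization schema, and the target port 9998 is a placeholder — any SerializationSchema for String would work:
DataStream<String> lines = env.socketTextStream("localhost", 9999);
// serialize each element with the (assumed) SimpleStringSchema and send it to the socket
lines.writeToSocket("localhost", 9998, new SimpleStringSchema());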
Data Sinks
 Lazily executed when env.execute() is called
DataStream<…> result;
// nothing happens
result.writeToSocket(...);
// nothing happens
result.writeAsText("/path/to/file", "\n", "|");
// Execution really starts here
env.execute();
29
Fault Tolerance
30
Fault Tolerance in Flink
 Flink provides recovery by taking a consistent checkpoint every N
milliseconds and rolling back to the checkpointed state
• https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
 Exactly once (default)
• // Take checkpoint every 5000 milliseconds
env.enableCheckpointing(5000)
 At least once (for lower latency)
• // Take checkpoint every 5000 milliseconds
env.enableCheckpointing(5000, CheckpointingMode.AT_LEAST_ONCE)
 Setting the interval to a few seconds should be good for most
applications
 If checkpointing is not enabled, no recovery guarantees are provided
31
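Put together, enabling exactly-once checkpointing for the earlier WordCount job is a one-line addition (a sketch using the 5000 ms interval from above):
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// take a consistent checkpoint of the job state every 5 seconds
env.enableCheckpointing(5000);
// ... define sources, transformations, and sinks as before ...
env.execute("Checkpointed WordCount");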
Best Practices
32
Some advice
 Use env.fromElements(..) or env.fromCollection(..) to
quickly get a DataStream to experiment
with
 Use print() to quickly print a DataStream
33
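For example, a self-contained experiment that needs no socket or file (a minimal sketch):
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// a small in-memory stream to try transformations on
DataStream<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);
// print the transformed elements to the client
numbers.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) {
return value * 10;
}
}).print();
env.execute("Quick experiment");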
Update Guide
34
From 0.9 to 0.10
 groupBy(…) -> keyBy(…)
 DataStream renames:
• KeyedDataStream -> KeyedStream
• WindowedDataStream -> WindowedStream
• ConnectedDataStream -> ConnectedStream
• JoinOperator -> JoinedStreams
35
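As a sketch, the grouping step of the WordCount example changes like this when moving from 0.9 to 0.10 (here, pairs stands for the (word, 1) stream produced by the Splitter):
// 0.9: group by the word field
DataStream<Tuple2<String, Integer>> counts = pairs.groupBy(0).sum(1);
// 0.10: the same operation is now called keyBy
DataStream<Tuple2<String, Integer>> counts = pairs.keyBy(0).sum(1);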