Apache Flink® Training
DataStream API Basic
August 26, 2015
DataStream API
 Stream Processing
 Java and Scala
 All examples here in Java
 Documentation available at
flink.apache.org
 Currently labeled as beta – some API
changes are pending
• Noted in the slides with a warning
2
DataStream API by Example
3
Window WordCount: main Method
4
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Stream Execution Environment
5
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Data Sources
6
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Data types
7
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Transformations
8
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
User functions
9
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
DataSinks
10
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Execute!
11
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> counts = env
// read stream of words from socket
.socketTextStream("localhost", 9999)
// split up the lines into tuples of (word, 1)
.flatMap(new Splitter())
// group by the tuple field "0"
.groupBy(0)
// keep the last 5 minutes of data
.window(Time.of(5, TimeUnit.MINUTES))
//sum up tuple field "1"
.sum(1);
// print result in command line
counts.print();
// execute program
env.execute("Socket Incremental WordCount Example");
}
Window WordCount: FlatMap
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
12
WordCount: FlatMap: Interface
13
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
WordCount: FlatMap: Types
14
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
WordCount: FlatMap: Collector
15
public static class Splitter
implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value,
Collector<Tuple2<String, Integer>> out)
throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(
new Tuple2<String, Integer>(token, 1));
}
}
}
}
DataStream API Concepts
16
(Selected) Data Types
 Basic Java Types
• String, Long, Integer, Boolean,…
• Arrays
 Composite Types
• Tuples
• Many more (covered in the advanced slides)
17
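For instance, both basic types and tuples can be used directly as stream elements. A minimal sketch (assuming the usual Flink imports and the execution environment shown earlier):
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// a stream of a basic Java type
DataStream<String> names = env.fromElements("Max", "Julia", "Anna");
// a stream of a composite tuple type: (name, age)
DataStream<Tuple2<String, Integer>> persons = env.fromElements(
new Tuple2<String, Integer>("Max", 42),
new Tuple2<String, Integer>("Julia", 27));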
Tuples
 The easiest and most lightweight way of
encapsulating data in Flink
 Tuple1 up to Tuple25
Tuple2<String, String> person = new Tuple2<>("Max", "Mustermann");
Tuple3<String, String, Integer> person = new Tuple3<>("Max", "Mustermann", 42);
Tuple4<String, String, Integer, Boolean> person =
new Tuple4<>("Max", "Mustermann", 42, true);
// zero based index!
String firstName = person.f0;
String secondName = person.f1;
Integer age = person.f2;
Boolean fired = person.f3;
18
Transformations: Map
DataStream<Integer> integers = env.fromElements(1, 2, 3, 4);
// Regular Map - Takes one element and produces one element
DataStream<Integer> doubleIntegers =
integers.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) {
return value * 2;
}
});
doubleIntegers.print();
> 2, 4, 6, 8
// Flat Map - Takes one element and produces zero, one, or more elements.
DataStream<Integer> doubleIntegers2 =
integers.flatMap(new FlatMapFunction<Integer, Integer>() {
@Override
public void flatMap(Integer value, Collector<Integer> out) {
out.collect(value * 2);
}
});
doubleIntegers2.print();
> 2, 4, 6, 8
19
Transformations: Filter
// The DataStream
DataStream<Integer> integers = env.fromElements(1, 2, 3, 4);
DataStream<Integer> filtered =
integers.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer value) {
return value != 3;
}
});
filtered.print();
> 1, 2, 4
20
Transformations: Partitioning
 DataStreams can be partitioned by a key
21
// (name, age) of employees
DataStream<Tuple2<String, Integer>> employees = …
// group by the second field (age)
DataStream<Tuple2<String, Integer>> grouped = employees.groupBy(1);
(Figure: the unpartitioned stream Stephan,18 | Fabian,23 | Julia,27 | Anna,18 | Romeo,27 | Ben,25 is repartitioned by age into the groups Anna,18 + Stephan,18 / Julia,27 + Romeo,27 / Fabian,23 / Ben,25)
Warning: Possible renaming in next releases
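To make this concrete, here is a runnable sketch of the same grouping, using the 0.9-era groupBy shown above (sample data taken from the figure):
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// (name, age) of employees
DataStream<Tuple2<String, Integer>> employees = env.fromElements(
new Tuple2<String, Integer>("Stephan", 18),
new Tuple2<String, Integer>("Fabian", 23),
new Tuple2<String, Integer>("Julia", 27),
new Tuple2<String, Integer>("Anna", 18),
new Tuple2<String, Integer>("Romeo", 27),
new Tuple2<String, Integer>("Ben", 25));
// records with the same age are sent to the same parallel instance
employees.groupBy(1).print();
env.execute("Partitioning example");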
Data Shipping Strategies
 Optionally, you can specify how data is shipped
between two transformations
 Forward: stream.forward()
• Only local communication
 Rebalance: stream.rebalance()
• Round-robin partitioning
 Partition by hash: stream.partitionByHash(...)
 Custom partitioning: stream.partitionCustom(...)
 Broadcast: stream.broadcast()
• Broadcast to all nodes
22
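A short sketch of how these calls sit between two transformations, reusing the WordCount Splitter and environment from earlier (a hedged example; the exact shipping defaults depend on the Flink version):
DataStream<String> lines = env.socketTextStream("localhost", 9999);
// round-robin the raw lines across all parallel flatMap instances
DataStream<Tuple2<String, Integer>> pairs =
lines.rebalance().flatMap(new Splitter());
// ship each pair to the downstream task chosen by hashing the word field
pairs.partitionByHash(0).print();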
Data Sources
Collection
 fromCollection(collection)
 fromElements(1,2,3,4,5)
23
Data Sources (2)
Text socket
 socketTextStream("hostname",port)
Text file
 readFileStream("/path/to/file", 1000,
WatchType.PROCESS_ONLY_APPENDED)
Connectors
 E.g., Apache Kafka, RabbitMQ, …
24
Data Sources: Collections
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// read from elements
DataStream<String> names = env.fromElements("Some", "Example", "Strings");
// read from Java collection
List<String> list = new ArrayList<String>();
list.add("Some");
list.add("Example");
list.add("Strings");
DataStream<String> namesFromCollection = env.fromCollection(list);
25
Data Sources: Files, Sockets, Connectors
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// read text socket from port
DataStream<String> socketLines = env
.socketTextStream("localhost", 9999);
// read a text file, ingesting new elements every 1000 milliseconds
DataStream<String> localLines = env
.readFileStream("/path/to/file", 1000,
WatchType.PROCESS_ONLY_APPENDED);
26
Data Sinks
Text
 writeAsText("/path/to/file")
CSV
 writeAsCsv("/path/to/file")
Return data to the Client
 print()
27
Note: Identical to
DataSet API
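For the WordCount stream built earlier, these sinks could be attached as sketched below (the paths are placeholders):
// write the (word, count) stream as plain text lines
counts.writeAsText("/path/to/wordcount.txt");
// write it as comma-separated values
counts.writeAsCsv("/path/to/wordcount.csv");
// or return it to the client and print it on the command line
counts.print();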
Data Sinks (2)
Socket
 writeToSocket(hostname, port, SerializationSchema)
Connectors
 E.g., Apache Kafka, Elasticsearch,
Rolling HDFS Files
28
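For example, writing a stream of strings to a TCP socket; SimpleStringSchema is assumed here as the serialization schema, and the target port 9998 is a placeholder — any SerializationSchema for String would work:
DataStream<String> lines = env.socketTextStream("localhost", 9999);
// serialize each element with the (assumed) SimpleStringSchema and send it to the socket
lines.writeToSocket("localhost", 9998, new SimpleStringSchema());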
Data Sinks
 Lazily executed when env.execute() is called
DataStream<…> result;
// nothing happens
result.writeToSocket(...);
// nothing happens
result.writeAsText("/path/to/file", "\n", "|");
// Execution really starts here
env.execute();
29
Fault Tolerance
30
Fault Tolerance in Flink
 Flink provides recovery by taking a consistent checkpoint every N
milliseconds and rolling back to the checkpointed state
• https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
 Exactly once (default)
• // Take checkpoint every 5000 milliseconds
env.enableCheckpointing(5000)
 At least once (for lower latency)
• // Take checkpoint every 5000 milliseconds
env.enableCheckpointing(5000, CheckpointingMode.AT_LEAST_ONCE)
 Setting the interval to a few seconds should be good for most
applications
 If checkpointing is not enabled, no recovery guarantees are provided
31
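Put together, enabling exactly-once checkpointing for the earlier WordCount job is a one-line addition (a sketch using the 5000 ms interval from above):
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// take a consistent checkpoint of the job state every 5 seconds
env.enableCheckpointing(5000);
// ... define sources, transformations, and sinks as before ...
env.execute("Checkpointed WordCount");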
Best Practices
32
Some advice
 Use env.fromElements(..) or env.fromCollection(..) to
quickly get a DataStream to experiment
with
 Use print() to quickly print a DataStream
33
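For example, a self-contained experiment that needs no socket or file (a minimal sketch):
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
// a small in-memory stream to try transformations on
DataStream<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);
// print the transformed elements to the client
numbers.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) {
return value * 10;
}
}).print();
env.execute("Quick experiment");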
Update Guide
34
From 0.9 to 0.10
 groupBy(…) -> keyBy(…)
 DataStream renames:
• KeyedDataStream -> KeyedStream
• WindowedDataStream -> WindowedStream
• ConnectedDataStream -> ConnectedStream
• JoinOperator -> JoinedStreams
35
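As a sketch, the grouping step of the WordCount example changes like this when moving from 0.9 to 0.10 (here, pairs stands for the (word, 1) stream produced by the Splitter):
// 0.9: group by the word field
DataStream<Tuple2<String, Integer>> counts = pairs.groupBy(0).sum(1);
// 0.10: the same operation is now called keyBy
DataStream<Tuple2<String, Integer>> counts = pairs.keyBy(0).sum(1);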