Getting started with Apache Flink (and Apache Kafka)
From zero to stream processing
Kenny Gorman
Founder and CEO
www.eventador.io
www.kennygorman.com
@kennygorman
I have done database foo for my whole career, going on 25 years: Sybase and Oracle DBA, PostgreSQL DBA, MySQL aficionado, MongoDB early adopter, and founder of two companies based on data technologies.

Streaming data is a game changer. I fell in love with Apache Kafka and Apache Flink. We went ‘all in’.

I am a data nerd.
What is stream processing?
Performing some operation on a boundless data stream.
Apache Kafka FTW.
But how to process the data?
Stream Processing Frameworks

Apache Spark: Traditionally more of a batch execution environment, born from the Apache Hadoop ecosystem. Good streaming API with a micro-batch streaming model. Mature.

Apache Storm: True boundless stream processing, based around the concepts of “topologies”, “spouts”, and “bolts”. Open sourced by Twitter (who are now working on Heron).

Apache Kafka: Traditionally a transport mechanism for data, now has APIs for streaming (KStreams, KSQL). Popular; management required. Just went 1.0.

Apache Flink: Pure streaming execution environment with exactly-once semantics, checkpointing, high availability, source/sink connectors, and powerful APIs with higher-order functionality for windowing, recoverability, and state.

... The landscape is evolving fast.
Apache Flink

Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala.

Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics.

Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.
Flink Development APIs Decomposed
- DataSet API (batch) vs DataStream API (streaming)
- DataStream: the most powerful of Flink's APIs (see the sketch below)
- Table API: the most convenient and simple of Flink's APIs
- Flink SQL (backed by Apache Calcite)
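To make the contrast concrete, here is a minimal DataStream API sketch (not from the original deck) using the same Flink 1.3-era Kafka connectors as the Table API job later on; the topic names and broker address are placeholders:

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class DataStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "sketch");

        // read raw strings from a Kafka topic
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer010<>("input-topic", new SimpleStringSchema(), props));

        // drop empty records -- the same kind of filtering the Table API
        // job below expresses declaratively in SQL
        stream.filter(s -> !s.isEmpty())
              .addSink(new FlinkKafkaProducer010<>("localhost:9092", "output-topic",
                      new SimpleStringSchema()));

        env.execute("DataStreamSketch");
    }
}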
Anatomy of a Flink job - Table API
- Declarative DSL centered around the concept of a dynamic table
- Follows an extended relational model
- What logical operation should be performed on the data
- Table API + SQL FTW
- Sources and Sinks ←!!!!!!!
- Kafka, CSV, roll your own..
Table API vs Table API + SQL
// Table API
// scan a previously registered table source
Table orders = tableEnv.scan("Orders");
Table result = orders.groupBy("a").select("a, b.sum as d");

// Table API + SQL: the equivalent query expressed as SQL
Table sqlResult = tableEnv.sql("SELECT a, SUM(b) AS d FROM Orders GROUP BY a");
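Both snippets assume a table named Orders has already been registered; a hedged sketch of that registration step, with a made-up CSV path and schema:

// hypothetical registration of the "Orders" table scanned above
CsvTableSource orderSource = new CsvTableSource(
    "/tmp/orders.csv",                                       // made-up path
    new String[] { "a", "b" },
    new TypeInformation<?>[] { Types.STRING(), Types.LONG() });
tableEnv.registerTableSource("Orders", orderSource);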
Our example stream processor
- Simple usage of the Table API
- Streaming data from aircraft via ADS-B (http://www.eventador.io/planestream.html)
- Produce raw data into Kafka topic A
- Consume filtered data from Kafka topic B
- Flink does the filtering in between A and B (diagram below)
[Diagram: Kafka source → Flink (your code, performing the filtering) → multiple destinations]
Line by line
public class FlinkReadWriteKafkaJSON {
    public static void main(String[] args) throws Exception {
        // read parameters from the command line
        final ParameterTool params = ParameterTool.fromArgs(args);
        if (params.getNumberOfParameters() < 4) {
            System.out.println("\nUsage: FlinkReadWriteKafkaJSON " +
                    "--read-topic <topic> " +
                    "--write-topic <topic> " +
                    "--bootstrap.servers <kafka brokers> " +
                    "--group.id <groupid>");
            return;
        }
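A hypothetical invocation of the compiled job (jar name, topics, and broker address are placeholders):

flink run flink-kafka-json.jar \
    --read-topic flights \
    --write-topic flights-filtered \
    --bootstrap.servers localhost:9092 \
    --group.id flink-example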
Line by line
// set up the Flink execution environment; works locally and when deployed to a cluster
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// a couple of example settings:
// restart up to 4 times, waiting 10 seconds between attempts
env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
env.enableCheckpointing(300000); // checkpoint every 300 seconds for recovery
env.getConfig().setGlobalJobParameters(params); // make params visible at runtime
Line by line
// create a table environment
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

// define a schema for the JSON records on the topic
TypeInformation<Row> typeInfo = Types.ROW(
    new String[] { "flight", "timestamp_verbose", "msg_type", "track",
                   "timestamp", "altitude", "counter", "lon", "icao",
                   "vr", "lat", "speed" },
    new TypeInformation<?>[] { Types.STRING(), Types.STRING(), Types.STRING(),
                               Types.STRING(), Types.SQL_TIMESTAMP(), Types.STRING(),
                               Types.STRING(), Types.STRING(), Types.STRING(),
                               Types.STRING(), Types.STRING(), Types.STRING() }
);
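For context, a record on the topic might look like the following (values are made up for illustration; the real field list comes from the PlaneStream ADS-B feed):

{"flight": "AAL1234", "timestamp_verbose": "2017-11-06 12:00:01.123",
 "msg_type": "3", "track": "180", "timestamp": "2017-11-06 12:00:01",
 "altitude": "38000", "counter": "1042", "lon": "-97.74",
 "icao": "A12345", "vr": "0", "lat": "30.27", "speed": "450"}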
Line by line
// create a table source that reads JSON from the Kafka topic
KafkaJsonTableSource kafkaTableSource = new Kafka010JsonTableSource(
params.getRequired("read-topic"),
params.getProperties(),
typeInfo
);
Line by line
// define a simple filtering SQL statement
String sql = "SELECT icao, lat, lon, altitude FROM flights WHERE altitude <> ''";

// or maybe something more complicated..
// String sql = "SELECT icao, MAX(altitude) " +
//              "FROM flights " +
//              "GROUP BY TUMBLE(`timestamp`, INTERVAL '5' SECOND), icao";

// register the source as a table and apply the statement to it
tableEnv.registerTableSource("flights", kafkaTableSource);
Table result = tableEnv.sql(sql);
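A caveat on the windowed variant: it groups records into five-second tumbling windows per aircraft and emits the maximum altitude seen in each window, but Flink has to treat the timestamp column as a time attribute for TUMBLE to work (and timestamp is a reserved word in Calcite SQL, hence the backticks), so extra event-time configuration on the source may be required. Treat it as a sketch, not a drop-in.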
Line by line
// the sink needs a partitioner to decide how records are distributed
// across the Kafka partitions of the output topic
FlinkFixedPartitioner partition = new FlinkFixedPartitioner();

// create a sink to write the results into
KafkaJsonTableSink kafkaTableSink = new Kafka09JsonTableSink(
    params.getRequired("write-topic"),
    params.getProperties(),
    partition
);

// write the filtered table to the sink
result.writeToSink(kafkaTableSink);
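Note the version pairing: the source is the Kafka 0.10 variant while the sink is the 0.9 variant. As of Flink 1.3 only the 0.8 and 0.9 JSON table sinks shipped, and the 0.9 producer is wire-compatible with 0.10 brokers, so the mismatch works in practice.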
Line by line
// run it!
env.execute("FlinkReadWriteKafkaJSON");
Demo
In Summary
- Table API plus SQL is super cool
- Calcite supports loads of SQL operations
- At some level of complexity, choose the lower-level DataStream API
- Growing amount of development on the Table API/SQL by the community
https://github.com/kgorman/TrafficAnalyzer
https://github.com/kgorman/TrafficAnalyzer/blob/master/src/main/java/io/eventador/FlinkReadWriteKafkaJSON.java
https://github.com/kgorman/TrafficAnalyzer/blob/master/src/main/java/io/eventador/FlinkReadWriteKafkaSinker.java
https://ci.apache.org/projects/flink/flink-docs-release-1.3/
https://calcite.apache.org/docs/reference.html
Contact
support@eventador.io
www.eventador.io
@eventadorlabs