Getting started with Apache Flink (and Apache Kafka)
From zero to stream processing
Kenny Gorman
Founder and CEO
www.eventador.io
www.kennygorman.com
@kennygorman
I have done database foo for my whole career, going on 25 years: Sybase and Oracle DBA, PostgreSQL DBA, MySQL aficionado, MongoDB early adopter, and founder of two companies based on data technologies.

Streaming data is a game changer. I fell in love with Apache Kafka and Apache Flink. We went ‘all in’.

I am a data nerd.
What is stream processing?
Performing some operation on a boundless data stream.
Apache Kafka FTW.
But how to process the data?
Stream Processing Frameworks

Apache Spark: Traditionally more of a batch execution environment, born from the Apache Hadoop ecosystem. Good streaming API with a micro-batch streaming model. Mature.

Apache Storm: True boundless stream processing, based around the concepts of “topologies”, “spouts”, and “bolts”. Open sourced by Twitter (who are now working on Heron).

Apache Kafka: Traditionally a transport mechanism for data, now has APIs for streaming (KStreams, KSQL). Popular; management required. Just went 1.0.

Apache Flink: Pure streaming execution environment with exactly-once semantics, checkpointing, high availability, source/sink connectors, and powerful APIs with higher-order functionality for windowing, recoverability, and state.

... The landscape is evolving fast.
Apache Flink

Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala.

Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics.

Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.
Flink Development APIs Decomposed
- DataSet API (batch) vs DataStream API (streaming)
- DataStream: the most powerful of Flink's APIs (see the sketch below)
- Table API: the most convenient and simple of Flink's APIs
- Flink SQL (backed by Apache Calcite)
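To make the contrast concrete, here is a minimal DataStream API sketch (not from the original deck) using the same Flink 1.3-era Kafka connectors as the Table API job later on; the topic names and broker address are placeholders:

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class DataStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "sketch");

        // read raw strings from a Kafka topic
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer010<>("input-topic", new SimpleStringSchema(), props));

        // drop empty records -- the same kind of filtering the Table API
        // job below expresses declaratively in SQL
        stream.filter(s -> !s.isEmpty())
              .addSink(new FlinkKafkaProducer010<>("localhost:9092", "output-topic",
                      new SimpleStringSchema()));

        env.execute("DataStreamSketch");
    }
}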
Anatomy of a Flink job - Table API
- Declarative DSL centered around the concept of a dynamic table
- Follows an extended relational model
- What logical operation should be performed on the data
- Table API + SQL FTW
- Sources and Sinks ←!!!!!!!
- Kafka, CSV, roll your own..
Table API vs Table API + SQL
// Table API
// scan a previously registered table source
Table orders = tableEnv.scan("Orders");
Table result = orders.groupBy("a").select("a, b.sum as d");

// Table API + SQL: the equivalent query expressed as SQL
Table sqlResult = tableEnv.sql("SELECT a, SUM(b) AS d FROM Orders GROUP BY a");
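Both snippets assume a table named Orders has already been registered; a hedged sketch of that registration step, with a made-up CSV path and schema:

// hypothetical registration of the "Orders" table scanned above
CsvTableSource orderSource = new CsvTableSource(
    "/tmp/orders.csv",                                       // made-up path
    new String[] { "a", "b" },
    new TypeInformation<?>[] { Types.STRING(), Types.LONG() });
tableEnv.registerTableSource("Orders", orderSource);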
Our example stream processor
- Simple usage of the Table API
- Streaming data from aircraft via ADS-B (http://www.eventador.io/planestream.html)
- Produce raw data into Kafka topic A
- Consume filtered data from Kafka topic B
- Flink does the filtering in between A and B (diagram below)
[Diagram: Kafka source → Flink (your code, performing the filtering) → multiple destinations]
Line by line
public class FlinkReadWriteKafkaJSON {
    public static void main(String[] args) throws Exception {
        // read parameters from the command line
        final ParameterTool params = ParameterTool.fromArgs(args);
        if (params.getNumberOfParameters() < 4) {
            System.out.println("\nUsage: FlinkReadWriteKafkaJSON " +
                    "--read-topic <topic> " +
                    "--write-topic <topic> " +
                    "--bootstrap.servers <kafka brokers> " +
                    "--group.id <groupid>");
            return;
        }
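A hypothetical invocation of the compiled job (jar name, topics, and broker address are placeholders):

flink run flink-kafka-json.jar \
    --read-topic flights \
    --write-topic flights-filtered \
    --bootstrap.servers localhost:9092 \
    --group.id flink-example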
Line by line
// set up the Flink execution environment; works locally and when deployed to a cluster
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// a couple of example settings:
// restart up to 4 times, waiting 10 seconds between attempts
env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
env.enableCheckpointing(300000); // checkpoint every 300 seconds for recovery
env.getConfig().setGlobalJobParameters(params); // make params visible at runtime
Line by line
// create a table environment
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

// define a schema for the JSON records on the topic
TypeInformation<Row> typeInfo = Types.ROW(
    new String[] { "flight", "timestamp_verbose", "msg_type", "track",
                   "timestamp", "altitude", "counter", "lon", "icao",
                   "vr", "lat", "speed" },
    new TypeInformation<?>[] { Types.STRING(), Types.STRING(), Types.STRING(),
                               Types.STRING(), Types.SQL_TIMESTAMP(), Types.STRING(),
                               Types.STRING(), Types.STRING(), Types.STRING(),
                               Types.STRING(), Types.STRING(), Types.STRING() }
);
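For context, a record on the topic might look like the following (values are made up for illustration; the real field list comes from the PlaneStream ADS-B feed):

{"flight": "AAL1234", "timestamp_verbose": "2017-11-06 12:00:01.123",
 "msg_type": "3", "track": "180", "timestamp": "2017-11-06 12:00:01",
 "altitude": "38000", "counter": "1042", "lon": "-97.74",
 "icao": "A12345", "vr": "0", "lat": "30.27", "speed": "450"}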
Line by line
// create a table source that reads JSON from the Kafka topic
KafkaJsonTableSource kafkaTableSource = new Kafka010JsonTableSource(
params.getRequired("read-topic"),
params.getProperties(),
typeInfo
);
Line by line
// define a simple filtering SQL statement
String sql = "SELECT icao, lat, lon, altitude FROM flights WHERE altitude <> ''";

// or maybe something more complicated..
// String sql = "SELECT icao, MAX(altitude) " +
//              "FROM flights " +
//              "GROUP BY TUMBLE(`timestamp`, INTERVAL '5' SECOND), icao";

// register the source as a table and apply the statement to it
tableEnv.registerTableSource("flights", kafkaTableSource);
Table result = tableEnv.sql(sql);
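A caveat on the windowed variant: it groups records into five-second tumbling windows per aircraft and emits the maximum altitude seen in each window, but Flink has to treat the timestamp column as a time attribute for TUMBLE to work (and timestamp is a reserved word in Calcite SQL, hence the backticks), so extra event-time configuration on the source may be required. Treat it as a sketch, not a drop-in.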
Line by line
// the sink needs a partitioner to decide how records are distributed
// across the Kafka partitions of the output topic
FlinkFixedPartitioner partition = new FlinkFixedPartitioner();

// create a sink to write the results into
KafkaJsonTableSink kafkaTableSink = new Kafka09JsonTableSink(
    params.getRequired("write-topic"),
    params.getProperties(),
    partition
);

// write the filtered table to the sink
result.writeToSink(kafkaTableSink);
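Note the version pairing: the source is the Kafka 0.10 variant while the sink is the 0.9 variant. As of Flink 1.3 only the 0.8 and 0.9 JSON table sinks shipped, and the 0.9 producer is wire-compatible with 0.10 brokers, so the mismatch works in practice.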
Line by line
// run it!
env.execute("FlinkReadWriteKafkaJSON");
Demo
In Summary
- Table API plus SQL is super cool
- Calcite supports loads of SQL operations
- At some level of complexity, choose the lower-level DataStream API
- Growing amount of development on the Table API/SQL by the community
https://github.com/kgorman/TrafficAnalyzer
https://github.com/kgorman/TrafficAnalyzer/blob/master/src/main/java/io/eventador/FlinkReadWriteKafkaJSON.java
https://github.com/kgorman/TrafficAnalyzer/blob/master/src/main/java/io/eventador/FlinkReadWriteKafkaSinker.java
https://ci.apache.org/projects/flink/flink-docs-release-1.3/
https://calcite.apache.org/docs/reference.html
Contact
support@eventador.io
www.eventador.io
@eventadorlabs