Talk Python to Me
Stream Processing in Your Favourite Language with Beam on Flink
Apache Beam / Apache Flink
Slides by Aljoscha Krettek, September 2017, Flink Forward 2017
Based on work and slides by Frances Perry, Tyler Akidau, Kenneth Knowles & Sourabh Bajaj
Agenda
1. What is Beam?
2. The Beam Portability APIs (Fn / Pipeline)
3. Executing Pythonic Beam Jobs on Flink
4. The Future
What is Beam?
The Evolution of Apache Beam
(diagram: the Google lineage behind Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and MillWheel, culminating in Google Cloud Dataflow and Apache Beam)
Beam Model: Generations Beyond MapReduce
● Improved abstractions let you focus on your application logic.
● Batch and stream processing are both first-class citizens; no need to choose.
● Clearly separates event time from processing time.
The Apache Beam Vision
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
(diagram: Beam Java, Beam Python, and other language SDKs construct pipelines against the Beam Model; Fn Runners such as Apache Flink, Apache Spark, and Cloud Dataflow execute them)
The Beam Model
(Flink draws it more like this)
(diagram: a Pipeline is a graph of PTransforms that consume and produce PCollections, which may be bounded or unbounded)
Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam Model: What is Being Computed?
Java:
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

Python:
scores = (input
          | Sum.integersPerKey())
The Beam Model: Where in Event Time?
Java:
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());

Python:
scores = (input
          | beam.WindowInto(FixedWindows(2 * 60))
          | Sum.integersPerKey())
The Beam Model: When in Processing Time?
Java:
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

Python:
scores = (input
          | beam.WindowInto(FixedWindows(2 * 60)
                .triggering(AtWatermark()))
          | Sum.integersPerKey())
The Beam Model: How Do Refinements Relate?
Java:
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

Python:
scores = (input
          | beam.WindowInto(FixedWindows(2 * 60)
                .triggering(AtWatermark()
                    .withEarlyFirings(AtPeriod(1 * 60))
                    .withLateFirings(AtCount(1)))
                .accumulatingFiredPanes())
          | Sum.integersPerKey())
Customizing What / Where / When / How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
For more information see https://cloud.google.com/dataflow/examples/gaming-example
A Complete Example of Pythonic Beam Code
import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows

# ParseHashTagDoFn and BigQueryOutputFormatDoFn are user-defined DoFns
with beam.Pipeline() as p:
    (p
     | beam.io.ReadStringsFromPubSub("twitter_topic")    # read raw tweets
     | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))   # 5-minute windows, every minute
     | beam.ParDo(ParseHashTagDoFn())                    # extract hashtags
     | beam.combiners.Count.PerElement()                 # count per hashtag
     | beam.ParDo(BigQueryOutputFormatDoFn())            # format as BigQuery rows
     | beam.io.WriteToBigQuery("trends_table"))          # write the trends
What is Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines
3. Runners for Existing Distributed Processing Backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local (in-process) runner for testing
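Switching between these backends is a configuration choice, not a code change. A minimal sketch of selecting the Flink runner from the Python SDK (the option class path follows current Beam releases and is an assumption relative to this talk):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code runs on any runner; only the options change.
options = PipelineOptions(["--runner=FlinkRunner"])
with beam.Pipeline(options=options) as p:
    ...  # same transforms as in the example above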
Beam Portability APIs (Pipeline / Job / Fn)
What are we trying to solve?
● Executing user code written in an arbitrary language (Python) on a Runner
written in a different language (Java)
● Mixing user functions written in different languages (Connectors, Sources,
Sinks, …)
Terminology

Beam Model: Describes the API concepts and the possible operations on PCollections.

Pipeline: User-defined graph of transformations on PCollections, constructed using a Beam SDK. The transformations can contain UDFs.

Runner: Executes a Pipeline. For example: FlinkRunner.

Beam SDK: Language-specific library/framework for creating programs that use the Beam Model. Allows defining Pipelines and UDFs and provides APIs for executing them.

User-defined function (UDF): Code in Java, Python, … that specifies how data is transformed. For example DoFn or CombineFn.
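To make the UDF notion concrete, here is a minimal Python DoFn; the class body is an illustrative sketch (matching the ParseHashTagDoFn used in the earlier example), but beam.DoFn and its process method are the SDK's actual extension points:

import apache_beam as beam

class ParseHashTagDoFn(beam.DoFn):
    """Illustrative UDF: emit every #hashtag found in a tweet string."""
    def process(self, element):
        for word in element.split():
            if word.startswith("#"):
                yield word.lstrip("#")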
Executing a Beam Pipeline - The Big Picture
(diagram: the user writes a Pipeline with an SDK; the SDK hands it to the Runner via the Pipeline API; the Runner pushes work to Workers via the Fn API; the Job API is used to interact with the running job)
APIs for Different Pipeline Lifecycle Stages
Pipeline API
● Used by the SDK to construct an SDK-agnostic Pipeline representation
● Used by the Runner to translate a Pipeline to runner-specific operations

Fn API
● Used by an SDK harness for communication with a Runner
● Used by the Runner to push work into an SDK harness

Job API
● API for interacting with a running Pipeline
Pipeline API (simplified)
● Definition of common primitive transformations
(Read, ParDo, Flatten, Window.into, GroupByKey)
● Definition of serialized Pipeline (protobuf)
https://s.apache.org/beam-runner-api
Pipeline = {PCollection*, PTransform*, WindowingStrategy*, Coder*}
PTransform = {Inputs*, Outputs*, FunctionSpec}
FunctionSpec = {URN, payload}
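The Python SDK can show this serialized form directly; a small sketch (Pipeline.to_runner_api exists in the Python SDK, and the printed components correspond to the proto structure above):

import apache_beam as beam

p = beam.Pipeline()
(p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1))

proto = p.to_runner_api()            # SDK-agnostic, serialized Pipeline (protobuf)
print(proto.components.transforms)   # PTransforms by id, each with a FunctionSpec
print(proto.components.pcollections) # PCollections referencing coders and windowing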
Job API
public interface JobApi {
State getState(); // RUNNING, DONE, CANCELED, FAILED ...
State cancel() throws IOException;
State waitUntilFinish(Duration duration);
State waitUntilFinish();
MetricResults metrics();
}
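The Python SDK exposes the same surface through PipelineResult; a minimal sketch (these methods exist in the Python SDK; the timeout value is an arbitrary example):

import apache_beam as beam

p = beam.Pipeline()
# ... construct the pipeline ...
result = p.run()                          # returns a PipelineResult
result.wait_until_finish(duration=60000)  # duration is in milliseconds
print(result.state)                       # RUNNING, DONE, CANCELLED, FAILED, ...
print(result.metrics().query())           # MetricResults, as in the Java interface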
Fn API
● gRPC interface definitions for communication between an SDK harness and a Runner
  https://s.apache.org/beam-fn-api
● Control: Used to tell the SDK which UDFs to execute and when to execute them.
● Data: Used to move data between the language-specific SDK harness and the runner.
● State: Used to support user state, side inputs, and group-by-key reiteration.
● Logging: Used to aggregate logging information from the language-specific SDK harness.
Fn API (continued)
https://s.apache.org/beam-fn-api
Fn API - Bundle Processing
https://s.apache.org/beam-fn-api-processing-a-bundle
Fn API - Processing DoFns
https://s.apache.org/beam-fn-api-send-and-receive-data
(figure sequence: say we need to execute a part of the pipeline consisting of two Python DoFns; the Runner inserts a gRPC Source before them and a gRPC Sink after them, so data flows between the Runner and the SDK harness over the Fn API)
Fn API - Executing the user Fn using an SDK Harness
● We can execute as a separate process
● We can execute in a Docker container
https://s.apache.org/beam-fn-api-container-contract
● Repository of containers for different SDKs
● We inject the user code into the container when starting
● Container is user-configurable
(diagram: the Worker communicates with the SDK harness over the Fn API)
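As a forward-looking sketch of how this could surface to users (the option names below come from later Beam portability work and are assumptions, not something this talk specifies):

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical portable-runner invocation with a Docker-based SDK harness.
options = PipelineOptions([
    "--runner=PortableRunner",        # submit via the Job API
    "--job_endpoint=localhost:8099",  # assumed Job API endpoint of the runner
    "--environment_type=DOCKER",      # run user code inside an SDK container
])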
Executing Pythonic* Beam Jobs on Flink
*or other languages
What is the (Flink) Runner/Flink doing in all this?
● Analyze/transform the Pipeline (Pipeline API)
● Create a Flink job (DataSet/DataStream API)
● Ship the user code / Docker container description
● In an operator: open gRPC services for the control/data/logging/state planes
● Execute arbitrary user code using the Fn API
Easy, because Flink state/timers map well to Beam concepts!
Advantages/Disadvantages
Advantages:
● Complete isolation of user code
● Complete configurability of the execution environment (with Docker)
● We can support code written in arbitrary languages
● We can mix user code written in different languages
Disadvantages:
● Slower (RPC overhead)
● Using Docker requires Docker
The Future
Future work
● Finish what I just talked about
● Finalize the different APIs (not Flink-specific)
● Mixing and matching connectors written in different languages
● Wait for new SDKs in other languages; they will just work
Learn More!
Apache Beam/Apache Flink
https://flink.apache.org / https://beam.apache.org
Beam Fn API design documents
https://s.apache.org/beam-runner-api
https://s.apache.org/beam-fn-api
https://s.apache.org/beam-fn-api-processing-a-bundle
https://s.apache.org/beam-fn-state-api-and-bundle-processing
https://s.apache.org/beam-fn-api-send-and-receive-data
https://s.apache.org/beam-fn-api-container-contract
Join the mailing lists!
user-subscribe@flink.apache.org / dev-subscribe@flink.apache.org
user-subscribe@beam.apache.org / dev-subscribe@beam.apache.org
Follow @ApacheFlink / @ApacheBeam on Twitter
Thank you!
Backup Slides
Processing Time vs. Event Time (diagram)
