Portable Streaming Pipelines with Apache Beam
Frances Perry
PMC for Apache Beam, Tech Lead at Google
Kafka Summit, May 2017
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming algorithms using one unified API
● Cleanly separates data processing logic from runtime requirements
● Supports execution on multiple distributed processing runtime environments
The evolution of Apache Beam
[Diagram: lineage from MapReduce through Google-internal systems -- Colossus, BigTable, Dremel, Flume, Megastore, Spanner, PubSub, Millwheel -- to Cloud Dataflow and Apache Beam.]
Agenda
1. Beam Model: Model Basics
2. Extensible IO Connectors
3. Portability: Write Once, Run Anywhere
4. Demo
5. Getting Started
Model Basics
A unified model for batch and streaming
Processing time vs. event time
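To make the two time domains concrete: event time travels with each element, while processing time is just the wall clock at whichever worker happens to handle it. A minimal sketch, assuming a hypothetical GameEvent type with an eventTimeMillis field (Beam's WithTimestamps transform is real; the event type is illustrative):
PCollection<GameEvent> stamped = events
    .apply(WithTimestamps.of(  // attach event time to each element
        (GameEvent e) -> new Instant(e.eventTimeMillis)));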
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam Model: What is being computed?
The Beam Model: What is being computed?
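For illustration, the input collection used above could be built from in-memory test data with explicit event timestamps. A sketch using the Java SDK's Create.timestamped; the team names, scores, and times are made up:
PCollection<KV<String, Integer>> input = pipeline
    .apply(Create.timestamped(  // each element carries an event timestamp
        TimestampedValue.of(KV.of("blue", 3), new Instant(1000L)),
        TimestampedValue.of(KV.of("blue", 4), new Instant(2000L)),
        TimestampedValue.of(KV.of("red", 7), new Instant(2000L))));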
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply(Sum.integersPerKey());
The Beam Model: Where in event time?
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AtWatermark()))
.apply(Sum.integersPerKey());
The Beam Model: When in processing time?
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AtWatermark()
.withEarlyFirings(
AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingFiredPanes())
.apply(Sum.integersPerKey());
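The trigger names above are slide shorthand. In the Java SDK the same pipeline would likely be spelled with AfterWatermark, AfterProcessingTime, and AfterPane; a sketch of that mapping (the allowed-lateness bound is illustrative, but the SDK requires one when late firings are used):
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1))  // illustrative bound
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());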
The Beam Model: How do refinements relate?
The Beam Model: How do refinements relate?
Customizing What / Where / When / How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
Extensible IO Connectors
Like Kafka!
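As a taste of the IO layer, a minimal sketch of reading the game events from Kafka with Beam's KafkaIO connector; the broker address and topic name are placeholders:
PCollection<KV<String, Integer>> input = pipeline
    .apply(KafkaIO.<String, Integer>read()
        .withBootstrapServers("broker-1:9092")          // placeholder broker
        .withTopic("game-scores")                       // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(IntegerDeserializer.class)
        .withoutMetadata());                            // yields KV<String, Integer>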
The Beam vision for portability
Write once, run anywhere
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: language-specific SDKs (Language A, B, C) construct pipelines against the Beam Model, which then run on interchangeable runners (Runner 1, 2, 3).]
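In practice, switching runners is a launch-time flag rather than a code change. A minimal sketch with the Java SDK's PipelineOptions; the flag values shown are the standard runner names:
// e.g. --runner=SparkRunner, --runner=FlinkRunner, or --runner=DataflowRunner
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);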
Beam Vision: as of March 2017
● Beam’s Java SDK runs on multiple runtime environments, including:
• Apache Apex
• Apache Spark
• Apache Flink
• Google Cloud Dataflow
• [in development] Apache Gearpump
● Cross-language infrastructure is in progress.
• Beam’s Python SDK currently runs on Google Cloud Dataflow
[Diagram: Java and Python pipeline construction feeding, through the Beam Model, into Fn Runners for Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, and Cloud Dataflow.]
Example Beam Runners
Apache Spark
● Open-source cluster-computing framework
● Large ecosystem of APIs and tools
● Runs on-premises or in the cloud
Apache Flink
● Open-source distributed data processing engine
● High-throughput and low-latency stream processing
● Runs on-premises or in the cloud
Google Cloud Dataflow
● Fully-managed service for batch and stream data processing
● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
How do you build an abstraction layer?
[Diagram: Apache Spark, Cloud Dataflow, Apache Flink, and question marks for future runners beneath a common abstraction layer.]
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Categorizing Runner Capabilities
http://beam.incubator.apache.org/documentation/runners/capability-matrix/
Parallel and portable pipelines in practice
Demo!
Getting Started with Apache Beam
Beaming into the Future
Getting Started with Apache Beam
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Extensive documentation
Learn more!
Apache Beam
https://beam.apache.org
Join the Beam mailing lists
user-subscribe@beam.apache.org
dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter
Demo screenshots
because if I make them, I won’t need to use them
Editor's Notes
  • #2: Good afternoon! My name is Frances Perry. I’m an engineer at Google and on the project management committee for Apache Beam. Today I’m going to give an introduction to Apache Beam… can you see me?
  • #3: -- which is a new open source project for expressing both batch and streaming data processing use cases. When you use Beam, you’re focusing on your logic and your data, without letting runtime details leak through into your code. That separation means that a Beam pipeline can be run on many existing runtimes that you know and love, including Apache Spark, Apache Flink, and Google Cloud Dataflow. To put Beam in context in the broader big data ecosystem, let’s talk briefly about its evolution.
  • #4: Google published the original paper on MapReduce in 2004 -- it fundamentally changed the way we do distributed processing. <animate> Inside Google, we kept innovating, but initially just published papers. In 2014, we released Google Cloud Dataflow, which included both a new programming model and a fully managed service. <animate> Externally, the open source community created Hadoop, and an entire ecosystem flourished around it. Beam brings these two streams of work together. It’s based on the Dataflow programming model, but generalized and integrated with the broader ecosystem.
  • #5: Today I’m going to go into more detail on two key pieces of Apache Beam: the programming model that intuitively expresses data-parallel operations, including both batch and streaming use cases, and the portability infrastructure, which lets you execute the same Beam pipeline across multiple runtimes. Next it’s time to get concrete -- I’ll show you these concepts in practice with a demo of the same pipeline, reading from Kafka and running on Apache Spark, Apache Flink, and Cloud Dataflow. And finally we’ll end with some pointers for getting started.
  • #6: We’ll start with a brief overview of the Beam model. If you’ve been to other talks over the last few days, you may have heard my favorite example already, so I’ll just set the context briefly. We’re going to be using a running example of analyzing mobile gaming logs. We’ve just launched an addictive new mobile game, where we’ve got users across the globe forming teams and scoring points on their mobile devices.
  • #7: Let’s take a look at some sample data -- the points scored for a specific team. On the x-axis we’ve got event time, and the y-axis is processing time. <animate> If everything were perfect, elements would arrive in our system immediately, and so we’d see things along this dashed line. But distributed systems often don’t cooperate. <animate> Sometimes it’s not so bad. Here, this event from just before 12:07 maybe just encountered a small network delay and arrives almost immediately after 12:07. <animate> But this one over here was more like 7 minutes delayed. Perhaps our user was playing in an elevator or in a subway -- so the score is delayed by a temporary lack of network connectivity. And this graph can’t even contain what we’d see if our game supports an offline mode. If a user is playing on a transatlantic flight in airplane mode, it might be hours until that flight lands and we get those scores for processing. These types of infinite, out-of-order data sources can be really tricky to reason about… unless you know what questions to ask.
  • #8: The Beam model is based on four key questions: What results are calculated? Are you computing sums, joins, histograms, machine learning models? Where in event time are results calculated? How does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions? When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights? How do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, or do they build upon one another? Let’s take a quick look at how we can use these questions to build a pipeline.
  • #9: Here’s a snippet from a pipeline that processes scoring results from that mobile gaming application. In yellow, you can see the computation that we’re performing -- the what -- in this case taking team-score pairs and summing them per team. So now let’s see what happens to our sample data if we execute this in traditional batch style.
  • #10: In this looping animation, the grey line represents processing time. As the pipeline executes and processes elements, they’re accumulated into the intermediate state, just under the processing time line. When processing completes, the system emits the result in yellow. This is pretty standard batch processing. But as we dive into the remaining three questions, that’s going to change.
  • #11: Let’s start by playing with event time. By specifying a windowing function, we can calculate independent results for different slices of event time. For example every minute, every hour, every day... In this case, our same integer summation will output one sum every two minutes.
  • #12: Now if we look at how things execute, you can see that we are calculating an independent answer for every two-minute period of event time. But we’re still waiting until the entire computation completes to emit any results. That might work fine for bounded data sets, when we’ll eventually finish processing. But it’s not going to work if we’re trying to process an infinite amount of data!
  • #13: In that case we want to reduce the latency of individual results. We do that by asking for results to be triggered based on the system’s best estimate of when it has all the input data. We call this estimate the watermark.
  • #14: The watermark is drawn in green. And now the result for each window is emitted as soon as we roughly think we’re done. But again, the watermark is often just a heuristic. It’s the system’s best guess about data completeness. Right now, the watermark is too fast -- and in some cases we’re moving on without all the data. So that user who scored 9 points in the elevator is just plain out of luck. But we don’t want to be too slow either -- it’s no good if we wait to emit anything until all the flights everywhere have landed just in case someone in 16B is playing our game.
  • #15: So let’s use a more sophisticated trigger to request both speculative, early firings as data is still trickling in -- and also update results if late elements arrive. Once we do this though, we might get multiple results for the same window of event time. So we have to answer the fourth question about how refined results relate. Here we choose to just continually accumulate the score.
  • #16: Now, there are multiple results for each window. Some windows, like the second, produce early, incomplete results as data arrives. There’s one on-time result when we think we’ve pretty much got all the data. And there are late results if additional data comes in behind the watermark, like in the first window. And because we chose to accumulate, each result includes all the elements in the window, even if they have already been part of an earlier result.
  • #17: So we took an algorithm -- in this case it happened to be a simple integer summation. I could have used something more complicated -- but the animations would have gotten out of control. And by tweaking just a line here or there, we went through a number of use cases -- from the simple traditional batch style through to advanced streaming situations. Just like the MapReduce model fundamentally changed the way we do distributed processing by providing the right set of abstractions, we hope that the Beam model will change the way we unify batch and streaming processing in the future.
  • #19: So that was the conceptual introduction to the types of use cases the Beam model can cover. Next, let’s talk about how the model enables portability.
  • #20: The heart of Beam is the model. We have multiple language-specific SDKs for constructing a Beam pipeline. Developers often have strong opinions about their language of choice, so we want to meet them where they are. Next, we have multiple runners for executing Beam pipelines on existing distributed processing engines. This lets the user choose the right environment for their use case. It might be on-premises, or in the cloud. It might be open source. It might be fully managed. And these needs may change over time. Now each of these runners needs to execute user processing, which means they need the ability to execute code in different languages. And to do that, while keeping Beam components modular, we are building APIs that cleanly specify how a runner calls language-specific processing.
  • #21: Now that was where the project is going. In reality, this is where we are. The Java SDK runs across multiple runtimes. However the Python SDK currently runs only on Cloud Dataflow in batch, as we’re still building out the streaming and cross-language infrastructure.
  • #22: Let’s go into a bit more detail about some of these runners. I’m going to focus today on Spark, Flink, and Dataflow. These were the original three runners in Beam and are also the three I’ll be demoing today. Many of you are probably familiar with Apache Spark. It’s a very popular choice right now in the Big Data world. It excels at in-memory and interactive computations. Apache Flink is more of a newcomer to the broader big data scene. It’s got really clean semantics for stream processing. And Cloud Dataflow is GCP’s fully managed service for data processing pipelines that evolved from all those years of internal work.
  • #23: And though each of these runners does parallel data processing, they have some significant differences in how they go about that. And that makes it tricky to build an abstraction layer around them.
  • #24: We can’t just take the intersection of the functionality of all the engines -- that’s too limited.
  • #25: And on the other hand, taking the union would be a kitchen sink of chaos…
  • #26: Really, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines. Keyed State is a great example of functionality that existed in various engines for a while and enabled interesting and common use cases, and was only recently added to Beam. And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam model.
  • #27: This also means there may be, at times, some divergence between the Beam model and the support a given runner has. So that's why Beam is tracking which portions of the model each runner currently supports -- and this gets updated as new functionality is built out.
  • #28: So with those concepts behind us, let’s get hands on.
  • #29: Finally, let’s look at how you can get started using Apache Beam
  • #31: And of course, please come learn more about Beam in general. The Beam website has all sorts of good information, including details on all the different runners.