Portable Streaming Pipelines with Apache Beam
Frances Perry
PMC for Apache Beam, Tech Lead at Google
Kafka Summit, May 2017
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming algorithms using one unified API
● Cleanly separates data processing logic from runtime requirements
● Supports execution on multiple distributed processing runtime environments
The evolution of Apache Beam
[Diagram: lineage from MapReduce through Google-internal systems -- Colossus, BigTable, Dremel, Flume, Megastore, Spanner, PubSub, Millwheel -- to Cloud Dataflow and Apache Beam.]
Agenda
1. Beam Model: Model Basics
2. Extensible IO Connectors
3. Portability: Write Once, Run Anywhere
4. Demo
5. Getting Started
Model Basics
A unified model for batch and streaming
Processing time vs. event time
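To make the two time domains concrete: event time travels with each element, while processing time is just the wall clock at whichever worker happens to handle it. A minimal sketch, assuming a hypothetical GameEvent type with an eventTimeMillis field (Beam's WithTimestamps transform is real; the event type is illustrative):
PCollection<GameEvent> stamped = events
    .apply(WithTimestamps.of(  // attach event time to each element
        (GameEvent e) -> new Instant(e.eventTimeMillis)));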
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam Model: What is being computed?
The Beam Model: What is being computed?
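For illustration, the input collection used above could be built from in-memory test data with explicit event timestamps. A sketch using the Java SDK's Create.timestamped; the team names, scores, and times are made up:
PCollection<KV<String, Integer>> input = pipeline
    .apply(Create.timestamped(  // each element carries an event timestamp
        TimestampedValue.of(KV.of("blue", 3), new Instant(1000L)),
        TimestampedValue.of(KV.of("blue", 4), new Instant(2000L)),
        TimestampedValue.of(KV.of("red", 7), new Instant(2000L))));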
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply(Sum.integersPerKey());
The Beam Model: Where in event time?
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AtWatermark()))
.apply(Sum.integersPerKey());
The Beam Model: When in processing time?
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AtWatermark()
.withEarlyFirings(
AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingFiredPanes())
.apply(Sum.integersPerKey());
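The trigger names above are slide shorthand. In the Java SDK the same pipeline would likely be spelled with AfterWatermark, AfterProcessingTime, and AfterPane; a sketch of that mapping (the allowed-lateness bound is illustrative, but the SDK requires one when late firings are used):
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1))  // illustrative bound
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());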
The Beam Model: How do refinements relate?
The Beam Model: How do refinements relate?
Customizing What / Where / When / How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
Extensible IO Connectors
Like Kafka!
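As a taste of the IO layer, a minimal sketch of reading the game events from Kafka with Beam's KafkaIO connector; the broker address and topic name are placeholders:
PCollection<KV<String, Integer>> input = pipeline
    .apply(KafkaIO.<String, Integer>read()
        .withBootstrapServers("broker-1:9092")          // placeholder broker
        .withTopic("game-scores")                       // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(IntegerDeserializer.class)
        .withoutMetadata());                            // yields KV<String, Integer>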
The Beam vision for portability
Write once, run anywhere
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: language-specific SDKs (Language A, B, C) construct pipelines against the Beam Model, which then run on interchangeable runners (Runner 1, 2, 3).]
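In practice, switching runners is a launch-time flag rather than a code change. A minimal sketch with the Java SDK's PipelineOptions; the flag values shown are the standard runner names:
// e.g. --runner=SparkRunner, --runner=FlinkRunner, or --runner=DataflowRunner
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);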
Beam Vision: as of March 2017
● Beam’s Java SDK runs on multiple runtime environments, including:
• Apache Apex
• Apache Spark
• Apache Flink
• Google Cloud Dataflow
• [in development] Apache Gearpump
● Cross-language infrastructure is in progress.
• Beam’s Python SDK currently runs on Google Cloud Dataflow
[Diagram: Java and Python pipeline construction feeding, through the Beam Model, into Fn Runners for Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, and Cloud Dataflow.]
Example Beam Runners
Apache Spark
● Open-source cluster-computing framework
● Large ecosystem of APIs and tools
● Runs on-premises or in the cloud
Apache Flink
● Open-source distributed data processing engine
● High-throughput and low-latency stream processing
● Runs on-premises or in the cloud
Google Cloud Dataflow
● Fully-managed service for batch and stream data processing
● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
How do you build an abstraction layer?
[Diagram: Apache Spark, Cloud Dataflow, Apache Flink, and question marks for future runners beneath a common abstraction layer.]
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Categorizing Runner Capabilities
http://beam.incubator.apache.org/documentation/runners/capability-matrix/
Parallel and portable pipelines in practice
Demo!
Getting Started with Apache Beam
Beaming into the Future
Getting Started with Apache Beam
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Extensive documentation
Learn more!
Apache Beam
https://beam.apache.org
Join the Beam mailing lists
user-subscribe@beam.apache.org
dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter
Demo screenshots
because if I make them, I won’t need to use them
Editor's Notes
  • #2: Good afternoon! My name is Frances Perry. I’m an engineer at Google and on the project management committee for Apache Beam. Today I’m going to give an introduction to Apache Beam… can you see me?
  • #3: -- which is a new open source project for expressing both batch and streaming data processing use cases. When you use Beam, you’re focusing on your logic and your data, without letting runtime details leak through into your code. That separation means that a Beam pipeline can be run on many existing runtimes that you know and love, including Apache Spark, Apache Flink, and Google Cloud Dataflow. To put Beam in context in the broader big data ecosystem, let’s talk briefly about its evolution.
  • #4: Google published the original paper on MapReduce in 2004 -- it fundamentally changed the way we do distributed processing. <animate> Inside Google, we kept innovating, but initially just published papers. In 2014, we released Google Cloud Dataflow, which included both a new programming model and a fully managed service. <animate> Externally, the open source community created Hadoop, and an entire ecosystem flourished around it. Beam brings these two streams of work together. It’s based on the Dataflow programming model, but generalized and integrated with the broader ecosystem.
  • #5: Today I’m going to go into more detail on two key pieces of Apache Beam: the programming model that intuitively expresses data-parallel operations, including both batch and streaming use cases, and the portability infrastructure, which lets you execute the same Beam pipeline across multiple runtimes. Next it’s time to get concrete -- I’ll show you these concepts in practice with a demo of the same pipeline, reading from Kafka and running on Apache Spark, Apache Flink, and Cloud Dataflow. And finally we’ll end with some pointers for getting started.
  • #6: We’ll start with a brief overview of the Beam model. If you’ve been to other talks over the last few days, you may have heard my favorite example already, so I’ll just set the context briefly. We’re going to be using a running example of analyzing mobile gaming logs. We’ve just launched an addictive new mobile game, where we’ve got users across the globe forming teams and scoring points on their mobile devices.
  • #7: Let’s take a look at some sample data -- the points scored for a specific team. On the x-axis we’ve got event time, and the y-axis is processing time. <animate> If everything were perfect, elements would arrive in our system immediately, and so we’d see things along this dashed line. But distributed systems often don’t cooperate. <animate> Sometimes it’s not so bad. Here, this event from just before 12:07 maybe just encountered a small network delay and arrives almost immediately after 12:07. <animate> But this one over here was more like 7 minutes delayed. Perhaps our user was playing in an elevator or in a subway -- so the score is delayed by a temporary lack of network connectivity. And this graph can’t even contain what we’d see if our game supports an offline mode. If a user is playing on a transatlantic flight in airplane mode, it might be hours until that flight lands and we get those scores for processing. These types of infinite, out-of-order data sources can be really tricky to reason about… unless you know what questions to ask.
  • #8: The Beam model is based on four key questions: What results are calculated? Are you computing sums, joins, histograms, machine learning models? Where in event time are results calculated? How does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions? When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights? How do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, or do they build upon one another? Let’s take a quick look at how we can use these questions to build a pipeline.
  • #9: Here’s a snippet from a pipeline that processes scoring results from that mobile gaming application. In yellow, you can see the computation that we’re performing -- the what -- in this case taking team-score pairs and summing them per team. So now let’s see what happens to our sample data if we execute this in traditional batch style.
  • #10: In this looping animation, the grey line represents processing time. As the pipeline executes and processes elements, they’re accumulated into the intermediate state, just under the processing time line. When processing completes, the system emits the result in yellow. This is pretty standard batch processing. But as we dive into the remaining three questions, that’s going to change.
  • #11: Let’s start by playing with event time. By specifying a windowing function, we can calculate independent results for different slices of event time. For example every minute, every hour, every day... In this case, our same integer summation will output one sum every two minutes.
  • #12: Now if we look at how things execute, you can see that we are calculating an independent answer for every two-minute period of event time. But we’re still waiting until the entire computation completes to emit any results. That might work fine for bounded data sets, when we’ll eventually finish processing. But it’s not going to work if we’re trying to process an infinite amount of data!
  • #13: In that case we want to reduce the latency of individual results. We do that by asking for results to be triggered based on the system’s best estimate of when it has all the input data. We call this estimate the watermark.
  • #14: The watermark is drawn in green. And now the result for each window is emitted as soon as we roughly think we’re done. But again, the watermark is often just a heuristic. It’s the system’s best guess about data completeness. Right now, the watermark is too fast -- and in some cases we’re moving on without all the data. So that user who scored 9 points in the elevator is just plain out of luck. But we don’t want to be too slow either -- it’s no good if we wait to emit anything until all the flights everywhere have landed just in case someone in 16B is playing our game.
  • #15: So let’s use a more sophisticated trigger to request both speculative, early firings as data is still trickling in -- and also update results if late elements arrive. Once we do this though, we might get multiple results for the same window of event time. So we have to answer the fourth question about how refined results relate. Here we choose to just continually accumulate the score.
  • #16: Now, there are multiple results for each window. Some windows, like the second, produce early, incomplete results as data arrives. There’s one on-time result when we think we’ve pretty much got all the data. And there are late results if additional data comes in behind the watermark, like in the first window. And because we chose to accumulate, each result includes all the elements in the window, even if they have already been part of an earlier result.
  • #17: So we took an algorithm -- in this case it happened to be a simple integer summation. I could have used something more complicated -- but the animations would have gotten out of control. And by tweaking just a line here or there, we went through a number of use cases -- from the simple traditional batch style through to advanced streaming situations. Just like the MapReduce model fundamentally changed the way we do distributed processing by providing the right set of abstractions, we hope that the Beam model will change the way we unify batch and streaming processing in the future.
  • #19: So that was the conceptual introduction to the types of use cases the Beam model can cover. Next, let’s talk about how the model enables portability.
  • #20: The heart of Beam is the model. We have multiple language-specific SDKs for constructing a Beam pipeline. Developers often have strong opinions about their language of choice, so we want to meet them where they are. Next, we have multiple runners for executing Beam pipelines on existing distributed processing engines. This lets the user choose the right environment for their use case. It might be on-premises, or in the cloud. It might be open source. It might be fully managed. And these needs may change over time. Now each of these runners needs to execute user processing, which means they need the ability to execute code in different languages. And to do that, while keeping Beam components modular, we are building APIs that cleanly specify how a runner calls language-specific processing.
  • #21: Now that was where the project is going. In reality, this is where we are. The Java SDK runs across multiple runtimes. However the Python SDK currently runs only on Cloud Dataflow in batch, as we’re still building out the streaming and cross-language infrastructure.
  • #22: Let’s go into a bit more detail about some of these runners. I’m going to focus today on Spark, Flink, and Dataflow. These were the original three runners in Beam and are also the three I’ll be demoing today. Many of you are probably familiar with Apache Spark. It’s a very popular choice right now in the Big Data world. It excels at in-memory and interactive computations. Apache Flink is more of a newcomer to the broader big data scene. It’s got really clean semantics for stream processing. And Cloud Dataflow is GCP’s fully managed service for data processing pipelines that evolved from all those years of internal work.
  • #23: And though each of these runners does parallel data processing, they have some significant differences in how they go about that. And that makes it tricky to build an abstraction layer around them.
  • #24: We can’t just take the intersection of the functionality of all the engines -- that’s too limited.
  • #25: And on the other hand, taking the union would be a kitchen sink of chaos…
  • #26: Really, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines. Keyed State is a great example of functionality that existed in various engines for a while and enabled interesting and common use cases, and was only recently added to Beam. And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam model.
  • #27: This also means there may be, at times, some divergence between the Beam model and the support a given runner has. So that's why Beam is tracking which portions of the model each runner currently supports -- and this gets updated as new functionality is built out.
  • #28: So with those concepts behind us, let’s get hands on.
  • #29: Finally, let’s look at how you can get started using Apache Beam
  • #31: And of course, please come learn more about Beam in general. The Beam website has all sorts of good information, including details on all the different runners.