Stream processing for the masses with Beam, Python and Flink
Sept 12th, 2019
Enrico Canzonieri
enrico@yelp.com @EnricoC89
Yelp’s Mission
Connecting people with great local businesses
Evolving data processing
Latency ~ hours/days
Latency ~ seconds/minutes
STREAMS POWER YELP
Powered by
streaming
Notifications
Real-time visit detection
User Search
Indexing pipeline
User personalization
Purchase flows
ML feature ETL
Ads
Product development
Transactions
Realtime campaign shut-off
Experimentation infrastructure and guardrail metrics
Tooling innovation leads to more data
Scribe
2015 2016 2017 2019
DATA PIPELINE
Strong data schematization and documentation
Standardized wire protocol (AVRO)
Contract between data producers and consumers
Centralized schema registry
Decouple data
ETL
PROCESSOR
Paastorm
Paastorm was Yelp’s answer to the lack of good open source Python stream processors
Paastorm provides a thin wrapper around the Kafka producer/consumer
Good fit to perform map/flatmap transformations
PROCESSOR
Paastorm
Adoption
The Paastorm API is a class called Spolt
Users extend Spolt and implement process_message(self, message)
Over 150 production Paastorm applications
    class GreatReviewsSpolt(Spolt):
        def process_message(self, message):
            payload = message.payload_data
            # Forward only reviews rated 4.0 or higher
            if payload['rating'] >= 4.0:
                yield message

    if __name__ == '__main__':
        Paastorm(GreatReviewsSpolt()).run()
PROCESSOR
Paastorm
Code
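The Spolt contract above amounts to a flat-map over messages: process_message may yield zero, one, or many outputs per input. A minimal, dependency-free sketch of that contract follows. The Message dataclass, the run_spolt helper and the in-memory input list are illustrative stand-ins, not Paastorm's actual classes, and no Kafka is involved.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Message:
    # Stand-in for a Paastorm message; only payload_data is modelled here.
    payload_data: Dict[str, Any]


class Spolt:
    def process_message(self, message: Message) -> Iterator[Message]:
        raise NotImplementedError()


class GreatReviewsSpolt(Spolt):
    def process_message(self, message: Message) -> Iterator[Message]:
        # Emit the message only when the rating is 4.0 or higher.
        if message.payload_data["rating"] >= 4.0:
            yield message


def run_spolt(spolt: Spolt, messages: Iterable[Message]) -> List[Message]:
    # Flat-map: collect every output yielded for every input message.
    out: List[Message] = []
    for message in messages:
        out.extend(spolt.process_message(message))
    return out


reviews = [Message({"rating": r}) for r in (2.0, 4.0, 5.0)]
great = run_spolt(GreatReviewsSpolt(), reviews)
print([m.payload_data["rating"] for m in great])  # [4.0, 5.0]
```

Because process_message is a generator, dropping a message (filter) and emitting several (flatmap) need no special handling in the driver loop.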
Need shiny new tools to leverage the value of real-time data
2017: Apache Flink
Unlocked real-time data processing at scale
Stateful processing
Event time processing
Powerful streaming-oriented (DataStream) API
Use Cases
Stream SQL: Flink SQL wrapper to run arbitrary queries
Joinery: unwindowed streaming join of table change streams
Aggregator: unwindowed aggregation of table change streams
Connectors: Cassandra, Elasticsearch, Redshift, etc.
Sessionizer: create sessions from event logs
Yelp’s data pipeline Stack
LIMITATIONS
Tightly coupled to Kafka
No high-level primitives: groupBy, windowing, filter, etc.
No stateful processing support
High cost to implement and maintain new features
CHALLENGES
Hundreds of Python libraries implementing business logic
Flink SQL is good mostly for simple SQL-like transformations
High barrier to entry for JVM languages
Backend
Finding the next new shiny tool
BEAM
And more ...
Programming model
Driver program: responsible for defining the pipeline in the Beam SDK
Pipeline: represents the logical data processing tasks
Execution: runs on any supported distributed processing framework
BEAM
Pipeline
[Diagram: IO / Create, PTransform, PCollection, PTransform, IO / Write]
PCollection: distributed bounded or unbounded data set
PTransform: processing step that transforms PCollections
PCollection elements: have an associated timestamp
BEAM
Python SDK
High-level API: ParDo, Map, FlatMap, Filter, GroupByKey, CoGroupByKey
Side Input and tagged output: ParDo with two or more inputs and two or more outputs
State and Timers: can be combined to build complex stateful applications
Support for Window and Triggers: Fixed, Sliding and Session windows and a variety of triggers
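The three window types named above differ in how an element's timestamp maps to a window. The sketch below illustrates that mapping in plain Python with integer timestamps in seconds. It is not the Beam SDK: the function names are made up for this example, windows are half-open [start, end) tuples, and per-key grouping (which session windows normally rely on) is left out.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def fixed_window(timestamp: int, size: int) -> Window:
    # Each timestamp belongs to exactly one window aligned to multiples of size.
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp: int, size: int, period: int) -> List[Window]:
    # A timestamp belongs to every window of length size that starts on a
    # multiple of period and still covers it.
    last_start = timestamp - timestamp % period
    return [(s, s + size) for s in range(last_start, timestamp - size, -period)]


def session_windows(timestamps: Iterable[int], gap: int) -> List[Window]:
    # Each element opens [ts, ts + gap); overlapping windows are merged.
    windows: List[Window] = []
    for ts in sorted(timestamps):
        if windows and ts < windows[-1][1]:
            windows[-1] = (windows[-1][0], ts + gap)
        else:
            windows.append((ts, ts + gap))
    return windows


print(fixed_window(25, size=10))                # (20, 30)
print(sliding_windows(25, size=20, period=10))  # [(20, 40), (10, 30)]
print(session_windows([1, 3, 10], gap=5))       # [(1, 8), (10, 15)]
```

Fixed and sliding windows can be assigned element by element, while session windows depend on neighbouring elements, which is why they require a merge step.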
EXECUTION
Beam Portability API
Portability API: defines the protocols used by the runner (e.g. Flink) to translate and run the pipeline
Python SDK: makes use of a containerized SDK harness to run language-specific UDFs
Fn API: relies on gRPC for runner - SDK worker communication
[Diagram: Beam Model. Pipeline Construction: Beam Java, Beam Python, Other Languages. Fn Runners: Apache Flink, Apache Spark, Cloud Dataflow. Execution on each runner.]
EXECUTION
The Flink
Runner
INTEGRATION
Data Pipeline Source and Sink
Yelp-specific implementation to discover Kafka clusters
Team expertise around the Flink Consumer/Producer
Customize the portable translation to “attach” existing Flink components to a Beam pipeline
[Diagram: Flink DPSource, WindowedValue<byte[]>, Coder deserializer, Beam, Coder serializer, WindowedValue<byte[]>, Flink DPSink]
INTEGRATION
Invoking Flink
code
PBegin for the source and PDone for the sink
    import json

    from apache_beam import pvalue
    from apache_beam.transforms.ptransform import PTransform

    class FlinkYelpDatapipelineSource(PTransform):
        def expand(self, pbegin):
            return pvalue.PCollection(pbegin.pipeline)

        def infer_output_type(self, unused_input_type):
            return Message

        def to_runner_api_parameter(self, context):
            # URN plus a JSON payload (contents elided on the slide)
            api_parameters = ('yelp:flinkYelpDatapipelineSource',
                              json.dumps({...}))
            return api_parameters

        @staticmethod
        @PTransform.register_urn('yelp:flinkYelpDatapipelineSource', None)
        def from_runner_api_parameter(spec_parameter, _unused_context):
            instance = FlinkYelpDatapipelineSource()
            params = json.loads(spec_parameter)
            ...  # parameter handling elided on the slide
            return instance
Can use JSON to pass parameters from Python Beam to Java Flink
The Beam URN identifies a PTransform during the translation
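The URN-plus-JSON handoff described above can be pictured as a lookup table keyed by URN, with the payload decoded on the receiving side. The sketch below keeps both sides in Python purely for illustration; in the talk the receiving side is Java code inside the Flink runner. The URN 'example:source', the "topic" parameter and all function names are hypothetical.

```python
import json
from typing import Callable, Dict, Tuple

# Receiving side: maps a URN to the function that builds the native operator.
TRANSLATORS: Dict[str, Callable[[dict], str]] = {}


def register_urn(urn: str):
    def decorator(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        TRANSLATORS[urn] = fn
        return fn
    return decorator


@register_urn("example:source")
def translate_source(params: dict) -> str:
    # A real translator would return a chunk of the runner's pipeline;
    # a descriptive string stands in for it here.
    return f"source(topic={params['topic']})"


def to_runner_api_parameter(topic: str) -> Tuple[str, str]:
    # Sending side: the transform is reduced to a URN and a JSON payload.
    return ("example:source", json.dumps({"topic": topic}))


def translate(urn: str, payload: str) -> str:
    # Look up the translator by URN and hand it the decoded parameters.
    return TRANSLATORS[urn](json.loads(payload))


urn, payload = to_runner_api_parameter("reviews")
print(translate(urn, payload))  # source(topic=reviews)
```

Keeping the payload as JSON means the two sides only need to agree on the URN and the parameter names, not on a shared class definition.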
INTEGRATION
Translate to a Flink operator
Fork FlinkStreamingPortablePipelineTranslator.java
Add your URN and translation function to the translatorMap
The result of the translation is a “chunk” of Flink pipeline
The output/input Flink DataStream is of type WindowedValue<byte[]>
INTEGRATION
Message Envelope
[Diagram: bytes from Kafka are decoded into a Message]
Envelope (bytes from Kafka): Message Type, Timestamp Micros, UUID, Metadata / Headers, Payload (Field 1 ... Field n)
Message: Message Type, Timestamp Micros, UUID, Metadata / Headers, Payload (Field 1 ... Field n), Kafka Position
Kafka Position: Cluster, Topic, Partition, Offset
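The Message structure listed above could be modelled roughly as follows. This is only a sketch: the field names are taken from the slide, but the Python types, the class layout and the sample values are assumptions, and Yelp's real Message class is not shown in the talk.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class KafkaPosition:
    # Where the message was read from; attached on the consuming side.
    cluster: str
    topic: str
    partition: int
    offset: int


@dataclass
class Message:
    message_type: str
    timestamp_micros: int
    uuid: str
    headers: Dict[str, str]   # "Metadata / Headers" on the slide
    payload: Dict[str, Any]   # "Field 1 ... Field n" on the slide
    kafka_position: KafkaPosition


# Sample values are invented for illustration only.
msg = Message(
    message_type="create",
    timestamp_micros=1_568_246_400_000_000,
    uuid="00000000-0000-0000-0000-000000000000",
    headers={},
    payload={"rating": 4.5},
    kafka_position=KafkaPosition("example-cluster", "reviews", 0, 42),
)
print(msg.kafka_position.offset)  # 42
```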
SERIALIZATION
Beam Coder
Data needs to be properly serialized between Flink and the Beam SDK worker
Extend the Beam Coder class to implement a custom coder for the Message class
Register the coder when the source/sink is being used
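    from apache_beam.coders import Coder, registry

    # Envelope, Message and create_from_kafka_message_value are Yelp-internal
    # and not shown on the slide.
    class DataPipelineCoder(Coder):
        def encode(self, value: Message) -> bytes:
            envelope = Envelope()
            return envelope.pack(value)

        def decode(self, value: bytes) -> Message:
            return create_from_kafka_message_value(value)

    registry.register_coder(Message, DataPipelineCoder)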
SERIALIZATION
Beam typehints
Critical to make sure that the proper Coder is being used
Every PTransform that returns a Message must use the typehint annotation
@typehints.with_output_types(Message)
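Why the typehint matters can be illustrated with a toy coder registry: the coder is looked up by the declared output type, and an undeclared type silently gets a generic fallback. The sketch below is not Beam's registry. Review, ReviewJsonCoder, PickleCoder and CoderRegistry are names invented for this example, and pickle merely stands in for whatever generic fallback a real runner would use.

```python
import json
import pickle
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class Review:
    rating: float
    text: str


class ReviewJsonCoder:
    # Custom wire format that a non-Python consumer could also read.
    def encode(self, value: Review) -> bytes:
        return json.dumps(asdict(value)).encode("utf-8")

    def decode(self, data: bytes) -> Review:
        return Review(**json.loads(data.decode("utf-8")))


class PickleCoder:
    # Generic fallback: works within Python, opaque to everything else.
    def encode(self, value: Any) -> bytes:
        return pickle.dumps(value)

    def decode(self, data: bytes) -> Any:
        return pickle.loads(data)


class CoderRegistry:
    def __init__(self) -> None:
        self._coders: Dict[type, Any] = {}
        self._fallback = PickleCoder()

    def register_coder(self, typ: type, coder: Any) -> None:
        self._coders[typ] = coder

    def get_coder(self, declared_type: Optional[type]) -> Any:
        # Without a declared type there is nothing to look up.
        return self._coders.get(declared_type, self._fallback)


registry = CoderRegistry()
registry.register_coder(Review, ReviewJsonCoder())

coder = registry.get_coder(Review)   # output type declared
review = coder.decode(coder.encode(Review(4.5, "great tacos")))
print(review)                        # Review(rating=4.5, text='great tacos')
print(type(registry.get_coder(None)).__name__)  # PickleCoder
```

In the setup described in the talk, bytes in the wrong format would reach the Flink sink, so a missing annotation is a correctness problem rather than a style issue.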
DEVELOPMENT
Beam application
Yelp-specific integration into the yelp-beam wrapper
Makefile to download and start Flink and the Job Server locally
Run the SDK worker on the host instead of Docker
DEVELOPMENT
Acceptance
testing
FLINK
Practical
differences
Processing time using GlobalWindow
No access to time characteristic
Powerful but possibly complex trigger composition
FLINK
Practical
differences
Flink
    dataStream
        .keyBy()
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .process(new SomeCount())
Beam
    (
        pcoll
        # get_key is a placeholder key extractor, not shown on the slide
        | beam.Map(lambda message: (get_key(message), message))
        | beam.WindowInto(
            window.GlobalWindows(),
            # the Python SDK takes the delay in seconds (the slide had 10000)
            trigger=Repeatedly(AfterProcessingTime(10)),
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | beam.GroupByKey()
        | beam.ParDo(SomeCount())
    )
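The behaviour of a repeated processing-time trigger in a global window with discarding accumulation can be simulated without any runner. The sketch below is a simplified model, not Beam's implementation: events carry an explicit processing time, each key's timer starts at the first element of a pane, and a fired pane's buffer is thrown away. The function name and event layout are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str, str]  # (processing_time_seconds, key, value)
Pane = Tuple[str, List[str]]


def repeated_processing_time_panes(events: Iterable[Event], delay: float) -> List[Pane]:
    buffers: Dict[str, List[str]] = defaultdict(list)
    deadlines: Dict[str, float] = {}
    panes: List[Pane] = []

    def fire_due(now: float) -> None:
        # Fire expired timers in deadline order, then discard the buffers.
        due = sorted((d, k) for k, d in deadlines.items() if d <= now)
        for _, key in due:
            panes.append((key, buffers.pop(key)))
            del deadlines[key]

    for now, key, value in sorted(events):
        fire_due(now)
        if key not in deadlines:
            # The timer starts at the first element of each pane.
            deadlines[key] = now + delay
        buffers[key].append(value)

    fire_due(float("inf"))  # advance time past every pending timer
    return panes


events = [(0, "a", "x"), (3, "a", "y"), (4, "b", "z"), (12, "a", "w")]
print(repeated_processing_time_panes(events, delay=10))
# [('a', ['x', 'y']), ('b', ['z']), ('a', ['w'])]
```

In this model the pane for key "b" closes at t=14, ten seconds after its first element, whereas an epoch-aligned tumbling window of ten seconds would close at t=10. That offset is one of the practical differences to keep in mind when porting a job.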
DEPLOYMENT
Running Beam
Run on Kubernetes
One Flink cluster per Beam service
One base Docker image extended with service-specific deps
Run the SDK Worker in the same Task Manager container
DEPLOYMENT
Yelp’s Flink
Operator
DEPLOYMENT
Flink
Supervisor
Long running Python process
Controls Beam/Flink job startup
Checkpoint and Savepoint management
Handles job failures and restarts
Monitoring and alerting
DEPLOYMENT
Job
Launching
DEPLOYMENT
Can do better?
BEAM-7966 Write portable Beam application jar
BEAM-7980 External environment with containerized worker pool (Beam 2.16)
Pluggable custom portable translations
ADOPTION
Paastorm on
Beam
    class GreatReviewsSpolt(Spolt):
        def process_message(self, message):
            payload = message.payload_data
            if payload['rating'] >= 4.0:
                yield message

    if __name__ == '__main__':
        Paastorm(GreatReviewsSpolt()).run()
ADOPTION
Paastorm on
Beam
    @typehints.with_output_types(Message)
    class Spolt(beam.DoFn):
        def process_message(self, message):
            raise NotImplementedError()

        def process(self, element):
            # Beam calls process(); delegate to the legacy Paastorm hook
            return self.process_message(element)

    class Paastorm:
        def __init__(self, paastorm_fn):
            self.paastorm_fn = paastorm_fn

        def run(self):
            # options: pipeline options defined elsewhere (not on the slide)
            p = beam.Pipeline(options=options)
            messages = (
                p
                | DataPipelineSource()
                # get_kafka_partition is a placeholder, not shown on the slide
                | beam.Map(lambda message: (get_kafka_partition(message), message))
                | beam.ParDo(self.paastorm_fn)
                | DataPipelineSink()
            )
            p.run()
TAKEAWAYS
The future is now
Run all of our stream processing on one engine: Flink
Legacy Paastorm easily migrated
Feature parity across languages
New applications use native Beam
Yelp’s Indexing pipeline as first use case
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
Questions/Suggestions?
enrico@yelp.com
