KAPPA+
FLINK FORWARD 2019 – SAN FRANCISCO
Moving from Lambda and Kappa Architectures to Kappa+ at UBER
ROSHAN NAIK
PROBLEM
Realtime jobs often need an offline counterpart:
• Backfill: Retroactively fix or recompute values once all data has arrived.
• Offline Experimentation & Testing: Before taking the job online.
• Online and offline feature generation: for ML.
• Bootstrapping State: for realtime jobs.
CURRENT SOLUTIONS
1. Lambda Architecture [2011]
• Nathan Marz (Creator of Apache Storm)
• “How to beat the CAP theorem”
• Evidence of prior art [1983]:
• Butler Lampson (Turing Award Laureate)
• “Hints for Computer System Design” – Xerox PARC
• Core Idea: Streaming job for realtime processing. Batch job for offline processing.
2. Kappa Architecture [2014]
• Jay Kreps (Creator of Kafka, Co-founder/CEO of Confluent)
• “Questioning the Lambda Architecture”
• Core Idea: Long data retention in Kafka. Replay using realtime code from an older point.
LIMITATIONS : LAMBDA ARCHITECTURE
• Maintain dual code for Batch and Streaming
• Batch APIs often lack required constructs (e.g. sliding windows)
• Variation: Unified API (SQL / Beam) – offline job run in batch mode
• Limitations of Batch mode (e.g. Spark):
• Divide large jobs into smaller ones to limit resource consumption
• Manual/automated sub-job coordination
• Windows that span batch boundaries are problematic
LIMITATIONS : KAPPA ARCHITECTURE
• Longer retention in Kafka: Expensive, Infeasible
• Kafka not really a data warehouse. More expensive than HDFS.
• Retention beyond a few days not feasible. Single node storage limits partition size.
• Workaround 1: Tiered Storage (Pulsar)
• Data duplication: Usually need a separate queryable copy in Hive/warehouse.
• Low utilization: old data accessed only by Backfill jobs.
• Workaround 2: Mini batches
• Load small batches into Kafka and process one batch at a time.
• Sort before loading in Kafka. Try to recreate original arrival order.
• Expensive: Copying to Kafka and sorting are both costly.
• Issues when using multiple sources
• Low volume topic drains faster → messes up windowing → dropped data or OOM
DESIRED CHARACTERISTICS
• Reuse code for Online and Offline Processing.
• Windowing should work well in offline mode as well.
• No splitting jobs. A single job should process any amount of data.
• Hardware requirements should not balloon with size of input data.
• Not have to rewrite jobs using new APIs.
• Efficient.
KAPPA+
Introducing the Architecture
KEY CHANGE IN PERSPECTIVE
• Decoupling Concepts:
• Bounded vs Unbounded (Nature of Data)
• Batch vs Streaming (Nature of Compute)
• Offline vs Realtime (Mode of Use)
• Instead of thinking: How to enable any job in Streaming and Batch mode.
• Lambda / SQL / Beam / Unified APIs
• Think: Limits to job types that can run in Realtime (and Offline) mode.
• Kappa+
Impact:
• No need to support every type of batch job (a departure from the Unified API approach).
• Identify the types of jobs to support: the Kappa+ job classification system.
ARCHITECTURE
Central idea (counterintuitive):
• Use Streaming compute to process data directly from the warehouse (i.e. not tied to Kafka).
Architectural components:
1. Job classification system
• 4 categories
2. Processing model
• Same basic processing model, with tweaks based on job category
Assumes: Data in the warehouse (Hive/HDFS/etc.) is partitioned by time (hourly/daily/etc.).
JOB CLASSIFICATION SYSTEM
• Category 1: Stateless jobs
• No windowing. Memory not a concern.
• Data order is usually not a concern.
• Category 2: Windowing with aggregation (Low – Medium Memory)
• E.g. aggregated windows: sum / avg / count / min / max / reduce
• Retains only the aggregate value in each window.
• Order of data is important, but solvable without strict ordering.
• Category 3: Windowing with retention (High Memory)
• Holds on to all records till window expiration.
• E.g. joins, pattern analysis within a window.
• Memory requirements much higher than Category 2.
• Category 4: Global windows with retention
• E.g. sorting the entire input, joins without windowing.
• Not found in realtime jobs.
PROCESSING MODEL
1. Partially ordered reads
• Strict Ordering across partitions: Only one partition at a time, older partitions first.
• Constrains memory/container requirements to what is needed to process 1 partition.
• Single job can process any number of partitions with finite resources.
• Helps windowing correctness.
• Unordered reads within a partition
• Read records/files within a partition in any order. Opens up concurrency and high throughput.
• Order could be exploited if necessary.
2. Emit a watermark when switching to the next partition (see the sketch after this list)
• Allows out-of-order reads within a partition while preserving windowing correctness.
3. Lockstep progression, in case of multiple sources
• All sources move to next partition at the same time.
• Prevents low volume sources from racing ahead.
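A minimal sketch of the partition-boundary watermark idea in Flink 1.x terms, as an illustration rather than Uber's actual implementation. It assumes a hypothetical Event type whose records carry the end timestamp of the warehouse partition they were read from:

import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Emits a watermark only when the source moves on to the next time
// partition; records within a partition may arrive in any order.
public class PartitionBoundaryWatermarks
        implements AssignerWithPunctuatedWatermarks<Event> {

    private long currentPartitionEnd = Long.MIN_VALUE;

    @Override
    public long extractTimestamp(Event e, long previousTimestamp) {
        return e.eventTimeMillis(); // hypothetical accessor
    }

    @Override
    public Watermark checkAndGetNextWatermark(Event e, long extractedTs) {
        long partitionEnd = e.partitionEndMillis(); // hypothetical accessor
        if (partitionEnd > currentPartitionEnd) {
            long previousEnd = currentPartitionEnd;
            currentPartitionEnd = partitionEnd;
            if (previousEnd == Long.MIN_VALUE) {
                return null; // first partition: nothing to flush yet
            }
            // Everything up to previousEnd has been read, so windows
            // ending at or before previousEnd can safely be flushed.
            return new Watermark(previousEnd);
        }
        return null; // still inside the current partition: no watermark
    }
}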
HANDLING EACH CATEGORY
• Cat 1: Stateless
• Nothing special. Set parallelisms based on desired throughput.
• Cat 2: Windowing with aggregation (Low – Med memory)
• Employ the memory state backend.
• Windowing parallelism based on the amount of data held in memory for one partition; other parallelisms based on throughput.
• Cat 3: Windowing with retention (High memory)
• Either: use the RocksDB state backend (see the sketch below).
• Or: reduce partition size and use the memory state backend.
• Or: look into exploiting order within the partition.
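In Flink terms, the category roughly determines the state backend, as in this hedged sketch (Flink 1.x APIs; the checkpoint URI and the highMemoryCategory flag are placeholder assumptions):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendSelection {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        boolean highMemoryCategory = true; // Cat 3: windows retain all records

        if (highMemoryCategory) {
            // Cat 3: state spills to disk, so one partition's records
            // need not all fit in memory.
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        } else {
            // Cat 1/2: little or aggregate-only state; keep it on the heap.
            env.setStateBackend(new MemoryStateBackend());
        }
        // ... build the rest of the job on env ...
    }
}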
BENEFITS OVER BATCH
BATCH (SQL / BEAM / UNIFIED API)
1. Resource requirements grow with total data volume.
• Tricky to estimate and allocate
2. Must split into smaller jobs and coordinate them.
• Windows that cross batch boundaries are problematic.
3. Results visible only after all data is processed.
KAPPA+
1. Resources bounded by the amount needed to process 1 partition.
• Easier to estimate and allocate
2. A single job can process any number of partitions.
3. Results visible after each partition.
Note: The Kappa+ processing model could be adopted in Unified APIs to address these limitations.
IMPLEMENTATION
The architecture is not tied to any streaming engine.
ADOPTING KAPPA+ ON STREAMING ENGINES
• No new APIs.
• HDFS/Hive/etc. sources need behavioral changes:
1. One partition at a time, older partitions first.
2. Concurrent reads within a partition.
3. Lockstep progression in case of multiple sources.
• The Kafka source only needs to support #3, since its data is already in order.
• Watermarking:
• Emit a watermark at the end of each partition to flush windows.
A JOB SUPPORTING REALTIME & OFFLINE
dataSource = offlineMode ? hiveSource : kafkaSource;
watermarker = offlineMode ? new OfflineWM() : new RealtimeWM();
dataSource.assignWatermarkGenerator(watermarker);
// Same logic. Adjust parallelisms for offline & online modes.
job = dataSource.transform(..)
.filter(..)
.keyBy(..)
.window(..)
...
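In Flink's 1.x DataStream API the watermark hook corresponds to assignTimestampsAndWatermarks; a rough equivalent of the pseudocode above, where hiveSource and RealtimeAssigner are hypothetical stand-ins (RealtimeAssigner assumed to implement the same assigner interface):

// Same job graph for both modes; only the source and watermarking differ.
DataStream<Event> source = offlineMode
        ? env.addSource(hiveSource)   // internal Hive/HDFS source with Kappa+ behavior
        : env.addSource(kafkaSource); // e.g. a FlinkKafkaConsumer

source.assignTimestampsAndWatermarks(
          offlineMode ? new PartitionBoundaryWatermarks() // sketch from earlier
                      : new RealtimeAssigner())
      .filter(..)
      .keyBy(..)
      .window(..)
      ...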
KAPPA+ ON FLINK
UBER internal Hive (/HDFS) source with Kappa+ support.
ONE PARTITION AT A TIME & CONCURRENT READS
Pipeline: File Selector (source, parallelism 1) → File Reader (operator, parallelism N) → rest of job, with a ZK Updater (operator, parallelism 1) coordinating through ZooKeeper (ZK). File names flow from the File Selector to the File Readers; records flow from the File Readers into the rest of the job.
• File Selector: lists files from the next partition and emits their names; creates a zNode for the partition and bulk-creates child zNodes for each file.
• File Reader: deserializes each file and emits its records; informs the ZK Updater on completing a file.
• ZK Updater: deletes the zNode for each completed file; deletes the partition zNode once it is empty; notifies the File Selector when the partition zNode is deleted, so it can move on to the next partition (a sketch follows).
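A sketch of this zNode protocol using Apache Curator; the client library and the zNode layout (/kappa-plus/<job>/<partition>/<file>) are illustrative assumptions, since the slides specify only the create/delete/notify steps:

import java.util.List;
import org.apache.curator.framework.CuratorFramework;

public class PartitionTracker {
    // File Selector side: expose the next partition's file list in ZK.
    static void publishPartition(CuratorFramework zk, String partition,
                                 List<String> files) throws Exception {
        String base = "/kappa-plus/myJob/" + partition; // hypothetical layout
        zk.create().creatingParentsIfNeeded().forPath(base); // zNode for the partition
        for (String file : files) {
            zk.create().forPath(base + "/" + file); // child zNode per file
        }
    }

    // ZK Updater side: called when a File Reader finishes a file.
    static void fileCompleted(CuratorFramework zk, String partition,
                              String file) throws Exception {
        String base = "/kappa-plus/myJob/" + partition;
        zk.delete().forPath(base + "/" + file); // drop the completed file
        if (zk.getChildren().forPath(base).isEmpty()) {
            // Partition fully read: delete its zNode. The File Selector
            // watches for this deletion and moves to the next partition.
            zk.delete().forPath(base);
        }
    }
}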
MULTI SOURCE LOCK STEP PROGRESSION
Two pipelines, File Selector A (source, parallelism 1) → File Reader (operator, parallelism M) and File Selector B (source, parallelism 1) → File Reader (operator, parallelism N), feed a Join/Union. Each pipeline has its own ZK Updater (operator, parallelism 1), and the File Selectors synchronize through a ZooKeeper barrier (a sketch follows):
• Enter the barrier and wait for the others
• Process the next partition
• Exit the barrier and repeat
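One way to realize that barrier is Curator's DistributedDoubleBarrier recipe; this is an illustrative choice, as the slides only say the File Selectors synchronize through ZooKeeper:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.barriers.DistributedDoubleBarrier;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LockstepFileSelector {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        int numSources = 2; // File Selector A and File Selector B
        DistributedDoubleBarrier barrier = new DistributedDoubleBarrier(
                zk, "/kappa-plus/myJob/partition-barrier", numSources);

        while (hasMorePartitions()) {
            barrier.enter();        // wait until every source is ready
            processNextPartition(); // emit this partition's files, oldest partition first
            barrier.leave();        // wait until every source has finished it
        }
        zk.close();
    }

    private static boolean hasMorePartitions() { return false; } // placeholder
    private static void processNextPartition() {}                // placeholder
}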
DETAILS
1. Time Skew: If arrival time is used for partitioning data in the warehouse instead of event creation time (used by the job), there can be two types of data skew:
• Forward skew: Some events move into a future partition, for example due to late arrival.
• Could lead to the appearance of missing data.
• Consider processing an additional partition after the last one, if this is an issue.
• Backward skew: Some events move into an older partition.
• Can lead to the appearance of data loss, as the events are not in the partition that was processed.
• Improper watermarking can close windows prematurely and cause data loss.
2. Differing partition strategies: A job has two sources; the first reads a Hive table with daily partitions, the second a table with hourly partitions.
• Solution: Watermark progression is dictated by the daily (i.e. larger) partition.
3. May need to throttle the throughput of an offline job if it writes to a production-critical destination.
DISTRIBUTED COMPUTING
The landscape, organized by mode of use (Realtime vs Offline) and nature of compute (Batch vs Streaming):
• Realtime + Batch Compute: Spark (micro-batching)
• Realtime + Streaming Compute: Flink, Storm/Trident
• Offline + Batch Compute: in-memory batch systems, full-fledged batch
• Offline + Streaming Compute: Kappa+ (filling the previously empty quadrant)
QUESTIONS
Email: roshan@uber.com Twitter: @naikrosh, @UberEng
UBER Engineering Blog: eng.uber.com
UBER is hiring!! Realtime Platform needs your expertise!