Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Speakers
Xiang Zhang, Stripe
Pratyush Sharma, Stripe
Xiaoman Dong, StarTree
Agenda
1. Problem: a near real-time, end-to-end exactly-once processing pipeline at scale
2. The architecture: Kafka, Flink, Pinot, and how to connect them all together
3. Operational challenges and learnings
The problem to solve: the Ledger dataset
Ledger is a dataset that Stripe maintains to record all money movements.
Requirements for the Ledger pipeline
1. Near real-time processing to meet SLO targets (p99 on the order of minutes; p90 < 1 minute)
2. Be able to process events at scale
3. No missing transactions: a single transaction can be worth millions of dollars
4. No duplicate transactions across the entire history
● Duplicates are inevitable on the source side (deployments, restarts, accidental duplicate job executions, etc.)
In short: near real-time, end-to-end exactly-once processing at scale!
Agenda
1. Problem: a near real-time, end-to-end exactly-once processing pipeline at scale
2. The architecture: Kafka, Flink, Pinot, and how to connect them all together
3. Operational challenges and learnings
High-Level Pipeline
The Deduplicator
In reality, we store transaction IDs in Flink state for deduplication.
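Concretely (the slide's diagram is omitted here), the dedup check is a keyed-state lookup: emit an event only the first time its transaction ID is seen. Below is a minimal sketch, not Stripe's production code; the Transaction event type and the upstream keying by transaction ID are assumptions.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Assumes the stream is keyed by transaction ID; Transaction is a hypothetical event type.
public class Deduplicator extends KeyedProcessFunction<String, Transaction, Transaction> {
    // One boolean per transaction ID, kept forever in RocksDB-backed state
    // (this is what grows to tens of terabytes, as discussed later).
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen-txn-ids", Boolean.class));
    }

    @Override
    public void processElement(Transaction txn, Context ctx, Collector<Transaction> out)
            throws Exception {
        if (seen.value() == null) {   // first occurrence of this transaction ID
            seen.update(true);
            out.collect(txn);
        }
        // else: duplicate; drop it silently
    }
}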
Flink End-to-End Exactly-Once Processing - Flink Deduplicator (1/3)
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
Flink End-to-End Exactly-Once Processing - Flink Deduplicator (2/3)
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
Flink End-to-End Exactly-Once Processing - Flink Deduplicator (3/3)
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
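The diagrams from the Flink blog post are omitted above; for reference, here is a minimal sketch of how a job like this wires up end-to-end exactly-once with the modern KafkaSink API. Kafka transactions plus Flink checkpoints drive the two-phase commit: writes are pre-committed at each checkpoint and committed once the checkpoint completes. The topic name, broker address, and checkpoint interval are illustrative, not Stripe's.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints trigger the pre-commit phase of the two-phase commit.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")                    // illustrative
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("deduped-ledger")                    // illustrative topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Wraps writes in Kafka transactions, committed on checkpoint completion.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("ledger-dedup")              // required for EXACTLY_ONCE
                .build();

        // Stand-in for the real pipeline (source -> keyBy -> Deduplicator -> sink).
        env.fromElements("txn-1", "txn-2").sinkTo(sink);
        env.execute("exactly-once sketch");
    }
}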
Pinot Exactly-Once Ingestion (1/5)
Pinot Exactly-Once Ingestion (2/5)
● Pinot table rows are stored in immutable chunks/batches called segments.
● Real-time segments being indexed are mutable. Once they are full, they are “sealed” and become immutable; new mutable segments are created to continue indexing.
Pinot Exactly-Once Ingestion (3/5)
We can consider Pinot’s latest segment as one database transaction:
● The transaction begins at segment creation.
● The transaction is committed when the segment is “sealed”.
● The Kafka offset is stored atomically along with the Pinot segment metadata.
● If any exception happens, the whole transaction (segment) is aborted and restarted.
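A conceptual sketch of that lifecycle, written as Java so the transaction analogy is concrete. Every type here is an illustrative stub, not Pinot's actual API:

// Conceptual sketch only: stub types illustrating the segment-as-transaction idea.
public class SegmentAsTransaction {
    interface Consumer { Object poll(); long position(); void seek(long offset); }
    interface Segment { long startOffset(); boolean isFull(); void index(Object row); }
    interface SegmentFactory { Segment create(long startOffset); }
    interface MetadataStore { void sealAtomically(Segment seg, long endOffset); }

    void ingest(Consumer consumer, SegmentFactory factory, MetadataStore zk) {
        Segment seg = factory.create(consumer.position());  // BEGIN: segment creation
        try {
            while (!seg.isFull()) {
                seg.index(consumer.poll());                 // consume and index rows
            }
            // COMMIT: status "DONE" and the end offset are written in one
            // atomic update to the segment's ZooKeeper metadata node.
            zk.sealAtomically(seg, consumer.position());
        } catch (Exception e) {
            // ABORT: discard the whole segment and replay from its start offset.
            consumer.seek(seg.startOffset());
        }
    }
}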
Pinot Exactly-Once Ingestion (4/5)
{
  "segment.crc": "3251475672",
  "segment.creation.time": "1648231912328",
  "segment.download.url": "s3://some/table/mytable__8__0__20220325T1811Z",
  "segment.end.time": "1388707200000",
  "segment.flush.threshold.size": "4166",
  "segment.index.version": "v3",
  "segment.realtime.endOffset": "10264",
  "segment.realtime.numReplicas": "2",
  "segment.realtime.startOffset": "10240",
  "segment.realtime.status": "DONE",
  "segment.start.time": "1388707200000",
  "segment.time.unit": "MILLISECONDS",
  "segment.total.docs": "24"
}
● Each segment has a single ZooKeeper node storing its metadata.
● Kafka offsets are stored inside the segment metadata.
● Atomicity:
  ○ The ZooKeeper node update is atomic.
  ○ The Kafka offset is updated at the same time the segment status updates (“DONE”).
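As an aside, the segment name in segment.download.url follows Pinot's real-time naming convention of tableName__partitionId__sequenceNumber__creationTime: mytable__8__0__20220325T1811Z is the first segment (sequence 0) of Kafka partition 8, created at 2022-03-25T18:11Z. Combined with segment.realtime.startOffset/endOffset, each segment therefore pins down an exact offset range of a single partition.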
Pinot Exactly-Once Ingestion (5/5)
If the Pinot server is restarted or crashes:
● The whole in-flight segment is discarded.
● The segment is recreated, starting from the offset right after the last committed segment (Segment_0 in the figure).
● The Kafka consumer’s seek() is called to rewind to that offset.
Agenda
1. Problem: a near real-time, end-to-end exactly-once processing pipeline at scale
2. The architecture: Kafka, Flink, Pinot, and how to connect them all together
3. Operational challenges and learnings
Caveats of exactly-once - nothing is free!
1. Exactly-once is not bulletproof. Data loss or duplicates can still happen.
2. It might give users a false sense of security.
3. It is hard to add additional layers to the architecture due to the transactional guarantees.
4. Latency and SLOs are impacted by checkpoint intervals.
Potential data loss in two-phase commit
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
The transaction can be expired in Kafka!
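In other words: if the job stays down longer than the producer's transaction timeout, Kafka aborts the pre-committed transaction before Flink can commit it on recovery, and that data is silently lost. The defense is configuration; a sketch below, where the one-hour value is illustrative:

import java.util.Properties;

public class KafkaTxnTimeoutSketch {
    // Producer-side transaction timeout: must exceed the worst-case gap between
    // pre-commit (checkpoint) and commit (recovery), and must not exceed the
    // broker's transaction.max.timeout.ms (Kafka's default broker cap is 15
    // minutes, so the broker setting usually has to be raised as well).
    public static Properties producerOverrides() {
        Properties props = new Properties();
        props.setProperty("transaction.timeout.ms", String.valueOf(60 * 60 * 1000)); // 1 hour
        return props;
    }
}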
Optimizing large state hydration at recovery time
The ledger deduplicator app maintains tens of terabytes of state to perform all-time deduplication.
Task-local recovery doesn’t work with multiple disks mounted (FLINK-10954):
● The entire state must be rehydrated every time the job is rescheduled (job failure, host failure/restart/recycle).
● This impacts end-to-end latency.
Even if we make local recovery work, Stripe recycles hosts periodically:
● The pipeline is as slow as the slowest host to recover its state.
Optimizing large state hydration at recovery time
● Task parallelism increase to the rescue: the more tasks, the faster the state download and the local state DB rebuild!
  ○ Increasing parallelism requires state redistribution.
  ○ Flink uses the concept of a key group as the atomic unit of state distribution.
Parallelism increase from 180 to 270 doesn’t work
Before the increase, each of the 180 tasks gets 2 key groups assigned. After increasing parallelism to 270:
1. 180 tasks have 1 key group
2. 90 tasks have 2 key groups
The tasks holding 2 key groups carry twice the state and become the stragglers.
What we want is even distribution
With parallelism 180, each task gets 2 key groups assigned; with parallelism 360, each task gets 1 key group assigned.
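This is easy to verify with Flink's own assignment logic. A small sketch, assuming a max parallelism of 360 (an assumption, but it is what the 180/270/90 split on the slides implies):

import org.apache.flink.runtime.state.KeyGroupRange;
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupDemo {
    public static void main(String[] args) {
        int maxParallelism = 360;  // assumption inferred from the slides
        for (int parallelism : new int[] {180, 270, 360}) {
            int one = 0, two = 0;
            // Count how many key groups each subtask would be assigned.
            for (int i = 0; i < parallelism; i++) {
                KeyGroupRange range = KeyGroupRangeAssignment
                        .computeKeyGroupRangeForOperatorIndex(maxParallelism, parallelism, i);
                if (range.getNumberOfKeyGroups() == 1) one++; else two++;
            }
            System.out.printf("parallelism=%d: %d tasks with 1 key group, %d with 2%n",
                    parallelism, one, two);
        }
        // parallelism=180: 0 with 1, 180 with 2   (even)
        // parallelism=270: 180 with 1, 90 with 2  (uneven: straggler tasks)
        // parallelism=360: 360 with 1, 0 with 2   (even)
    }
}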
Monitoring Large State Size
● Flink can report native RocksDB metrics.
● State backend latency tracking metrics can help with debugging.
● A large backlog of pending RocksDB compactions can affect performance.
Linux OOM Kills Causing Job Restarts
● Flink < 1.12 uses glibc to allocate memory, which leads to memory fragmentation.
● Combined with the large state required by the deduplicator app, this consistently caused OOM kills.
● With a large number of task managers and the time it takes to rehydrate state, this impacts the latency SLO.
jemalloc Everywhere
● Flink switched to jemalloc as the default memory allocator in its Docker images in Flink 1.12.
Pre-jemalloc vs. post-jemalloc memory usage (charts)
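If you run the official images, the allocator is also switchable: the Flink 1.12+ Docker entrypoint uses jemalloc by default and falls back to glibc when the DISABLE_JEMALLOC=true environment variable is set, which is handy for A/B-testing the allocators as in the charts above.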
Data Quality Monitoring
● Pinot is an analytics platform that runs SQL blazingly fast, so…
  ○ Duplicate detection:
    ■ SELECT primary_key, count(*) AS cnt FROM mytable GROUP BY primary_key HAVING cnt > 1
    ■ Run the query against the real-time table only (via the special table name mytable_REALTIME) to improve query performance.
  ○ Missing-entry detection:
    ■ Bucket rows by time and count per bucket.
    ■ Join/compare against the source of truth (an upstream metric in the data warehouse).
How to repair data in Pinot?
● If some range of data is corrupted (contains duplicates):
  ○ Find the duplicated data with a SQL query.
  ○ Delete and rebuild the Pinot segments containing the duplicates.
  ○ Pinot virtual columns like $segmentName help locate those segments.
● Best practices:
  ○ A reliable, exactly-once Kafka archive (backup) will come in handy in a fire.
  ○ Build a stable, reliable timestamp into the primary key, and use that timestamp as the Pinot time column.
Lessons Learned
● Flink
  ○ Set the Kafka transaction timeout large enough to account for any job downtime.
  ○ Set parallelism to a number that evenly divides the max parallelism.
  ○ Use jemalloc in Flink.
● Pinot
  ○ Higher Kafka transaction frequency and shorter Flink checkpoint intervals improve end-to-end data freshness in Pinot.
  ○ Beware of bogus message counts: many Kafka internal metrics include messages from failed transactions.
  ○ Duplicate monitoring is a must for critical apps.
