Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Speakers
Xiang Zhang, Stripe
Pratyush Sharma, Stripe
Xiaoman Dong, StarTree
Agenda
1. Problem: a near real-time, end-to-end exactly-once processing pipeline at scale
2. The architecture: Kafka, Flink, Pinot, and how to connect them all together
3. Operational challenges and learnings
The problem to solve: the Ledger dataset
Ledger is a dataset that Stripe maintains to record all money movements.
Requirements for the Ledger pipeline
1. Near real-time processing to meet SLO targets (p99 on the order of minutes; p90 < 1 minute)
2. Be able to process events at scale
3. No missing transactions: a single transaction can be worth millions of dollars
4. No duplicate transactions across the entire history
● Duplicates are inevitable on the source side (deployments, restarts, accidental duplicate job executions, etc.)
In short: near real-time, end-to-end exactly-once processing at scale!
Agenda
1. Problem: a near real-time, end-to-end exactly-once processing pipeline at scale
2. The architecture: Kafka, Flink, Pinot, and how to connect them all together
3. Operational challenges and learnings
High-Level Pipeline
The Deduplicator
In reality, we store transaction IDs in Flink state for deduplication.
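Concretely (the slide's diagram is omitted here), the dedup check is a keyed-state lookup: emit an event only the first time its transaction ID is seen. Below is a minimal sketch, not Stripe's production code; the Transaction event type and the upstream keying by transaction ID are assumptions.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Assumes the stream is keyed by transaction ID; Transaction is a hypothetical event type.
public class Deduplicator extends KeyedProcessFunction<String, Transaction, Transaction> {
    // One boolean per transaction ID, kept forever in RocksDB-backed state
    // (this is what grows to tens of terabytes, as discussed later).
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen-txn-ids", Boolean.class));
    }

    @Override
    public void processElement(Transaction txn, Context ctx, Collector<Transaction> out)
            throws Exception {
        if (seen.value() == null) {   // first occurrence of this transaction ID
            seen.update(true);
            out.collect(txn);
        }
        // else: duplicate; drop it silently
    }
}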
Flink End-to-End Exactly-Once Processing - Flink Deduplicator (1/3)
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
Flink End-to-End Exactly-Once Processing - Flink Deduplicator (2/3)
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
Flink End-to-End Exactly-Once Processing - Flink Deduplicator (3/3)
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
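The diagrams from the Flink blog post are omitted above; for reference, here is a minimal sketch of how a job like this wires up end-to-end exactly-once with the modern KafkaSink API. Kafka transactions plus Flink checkpoints drive the two-phase commit: writes are pre-committed at each checkpoint and committed once the checkpoint completes. The topic name, broker address, and checkpoint interval are illustrative, not Stripe's.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints trigger the pre-commit phase of the two-phase commit.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")                    // illustrative
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("deduped-ledger")                    // illustrative topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Wraps writes in Kafka transactions, committed on checkpoint completion.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("ledger-dedup")              // required for EXACTLY_ONCE
                .build();

        // Stand-in for the real pipeline (source -> keyBy -> Deduplicator -> sink).
        env.fromElements("txn-1", "txn-2").sinkTo(sink);
        env.execute("exactly-once sketch");
    }
}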
Pinot Exactly-Once Ingestion (1/5)
Pinot Exactly-Once Ingestion (2/5)
● Pinot table rows are stored in immutable chunks/batches called segments.
● Real-time segments being indexed are mutable. Once they are full, they are “sealed” and become immutable; new mutable segments are created to continue indexing.
Pinot Exactly-Once Ingestion (3/5)
We can consider Pinot’s latest segment as one database transaction:
● The transaction begins at segment creation.
● The transaction is committed when the segment is “sealed”.
● The Kafka offset is stored atomically along with the Pinot segment metadata.
● If any exception happens, the whole transaction (segment) is aborted and restarted.
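A conceptual sketch of that lifecycle, written as Java so the transaction analogy is concrete. Every type here is an illustrative stub, not Pinot's actual API:

// Conceptual sketch only: stub types illustrating the segment-as-transaction idea.
public class SegmentAsTransaction {
    interface Consumer { Object poll(); long position(); void seek(long offset); }
    interface Segment { long startOffset(); boolean isFull(); void index(Object row); }
    interface SegmentFactory { Segment create(long startOffset); }
    interface MetadataStore { void sealAtomically(Segment seg, long endOffset); }

    void ingest(Consumer consumer, SegmentFactory factory, MetadataStore zk) {
        Segment seg = factory.create(consumer.position());  // BEGIN: segment creation
        try {
            while (!seg.isFull()) {
                seg.index(consumer.poll());                 // consume and index rows
            }
            // COMMIT: status "DONE" and the end offset are written in one
            // atomic update to the segment's ZooKeeper metadata node.
            zk.sealAtomically(seg, consumer.position());
        } catch (Exception e) {
            // ABORT: discard the whole segment and replay from its start offset.
            consumer.seek(seg.startOffset());
        }
    }
}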
Pinot Exactly-Once Ingestion (4/5)
{
  "segment.crc": "3251475672",
  "segment.creation.time": "1648231912328",
  "segment.download.url": "s3://some/table/mytable__8__0__20220325T1811Z",
  "segment.end.time": "1388707200000",
  "segment.flush.threshold.size": "4166",
  "segment.index.version": "v3",
  "segment.realtime.endOffset": "10264",
  "segment.realtime.numReplicas": "2",
  "segment.realtime.startOffset": "10240",
  "segment.realtime.status": "DONE",
  "segment.start.time": "1388707200000",
  "segment.time.unit": "MILLISECONDS",
  "segment.total.docs": "24"
}
● Each segment has a single ZooKeeper node storing its metadata.
● Kafka offsets are stored inside the segment metadata.
● Atomicity:
  ○ The ZooKeeper node update is atomic.
  ○ The Kafka offset is updated at the same time the segment status updates (“DONE”).
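As an aside, the segment name in segment.download.url follows Pinot's real-time naming convention of tableName__partitionId__sequenceNumber__creationTime: mytable__8__0__20220325T1811Z is the first segment (sequence 0) of Kafka partition 8, created at 2022-03-25T18:11Z. Combined with segment.realtime.startOffset/endOffset, each segment therefore pins down an exact offset range of a single partition.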
Pinot Exactly-Once Ingestion (5/5)
If the Pinot server is restarted or crashes:
● The whole in-flight segment is discarded.
● The segment is recreated, starting from the offset right after the last committed segment (Segment_0 in the figure).
● The Kafka consumer’s seek() is called to rewind to that offset.
Agenda
1. Problem: a near real-time, end-to-end exactly-once processing pipeline at scale
2. The architecture: Kafka, Flink, Pinot, and how to connect them all together
3. Operational challenges and learnings
Caveats of exactly-once - nothing is free!
1. Exactly-once is not bulletproof. Data loss or duplicates can still happen.
2. It might give users a false sense of security.
3. It is hard to add additional layers to the architecture due to the transactional guarantees.
4. Latency and SLOs are impacted by checkpoint intervals.
Potential data loss in two-phase commit
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
The transaction can be expired in Kafka!
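In other words: if the job stays down longer than the producer's transaction timeout, Kafka aborts the pre-committed transaction before Flink can commit it on recovery, and that data is silently lost. The defense is configuration; a sketch below, where the one-hour value is illustrative:

import java.util.Properties;

public class KafkaTxnTimeoutSketch {
    // Producer-side transaction timeout: must exceed the worst-case gap between
    // pre-commit (checkpoint) and commit (recovery), and must not exceed the
    // broker's transaction.max.timeout.ms (Kafka's default broker cap is 15
    // minutes, so the broker setting usually has to be raised as well).
    public static Properties producerOverrides() {
        Properties props = new Properties();
        props.setProperty("transaction.timeout.ms", String.valueOf(60 * 60 * 1000)); // 1 hour
        return props;
    }
}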
Optimizing large state hydration at recovery time
The ledger deduplicator app maintains tens of terabytes of state to perform all-time deduplication.
Task-local recovery doesn’t work with multiple disks mounted (FLINK-10954):
● The entire state must be rehydrated every time the job is rescheduled (job failure, host failure/restart/recycle).
● This impacts end-to-end latency.
Even if we make local recovery work, Stripe recycles hosts periodically:
● The pipeline is as slow as the slowest host to recover its state.
Optimizing large state hydration at recovery time
● Task parallelism increase to the rescue: the more tasks, the faster the state download and the local state DB rebuild!
  ○ Increasing parallelism requires state redistribution.
  ○ Flink uses the concept of a key group as the atomic unit of state distribution.
Parallelism increase from 180 to 270 doesn’t work
Before the increase, each of the 180 tasks gets 2 key groups assigned. After increasing parallelism to 270:
1. 180 tasks have 1 key group
2. 90 tasks have 2 key groups
The tasks holding 2 key groups carry twice the state and become the stragglers.
What we want is even distribution
With parallelism 180, each task gets 2 key groups assigned; with parallelism 360, each task gets 1 key group assigned.
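This is easy to verify with Flink's own assignment logic. A small sketch, assuming a max parallelism of 360 (an assumption, but it is what the 180/270/90 split on the slides implies):

import org.apache.flink.runtime.state.KeyGroupRange;
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupDemo {
    public static void main(String[] args) {
        int maxParallelism = 360;  // assumption inferred from the slides
        for (int parallelism : new int[] {180, 270, 360}) {
            int one = 0, two = 0;
            // Count how many key groups each subtask would be assigned.
            for (int i = 0; i < parallelism; i++) {
                KeyGroupRange range = KeyGroupRangeAssignment
                        .computeKeyGroupRangeForOperatorIndex(maxParallelism, parallelism, i);
                if (range.getNumberOfKeyGroups() == 1) one++; else two++;
            }
            System.out.printf("parallelism=%d: %d tasks with 1 key group, %d with 2%n",
                    parallelism, one, two);
        }
        // parallelism=180: 0 with 1, 180 with 2   (even)
        // parallelism=270: 180 with 1, 90 with 2  (uneven: straggler tasks)
        // parallelism=360: 360 with 1, 0 with 2   (even)
    }
}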
Monitoring Large State Size
● Flink can report native RocksDB metrics.
● State backend latency tracking metrics can help with debugging.
● A large backlog of pending RocksDB compactions can affect performance.
Linux OOM Kills Causing Job Restarts
● Flink < 1.12 uses glibc to allocate memory, which leads to memory fragmentation.
● Combined with the large state required by the deduplicator app, this consistently caused OOM kills.
● With a large number of task managers and the time it takes to rehydrate state, this impacts the latency SLO.
jemalloc Everywhere
● Flink switched to jemalloc as the default memory allocator in its Docker images in Flink 1.12.
Pre-jemalloc vs. post-jemalloc memory usage (charts)
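If you run the official images, the allocator is also switchable: the Flink 1.12+ Docker entrypoint uses jemalloc by default and falls back to glibc when the DISABLE_JEMALLOC=true environment variable is set, which is handy for A/B-testing the allocators as in the charts above.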
Data Quality Monitoring
● Pinot is an analytics platform that runs SQL blazingly fast, so…
  ○ Duplicate detection:
    ■ SELECT primary_key, count(*) AS cnt FROM mytable GROUP BY primary_key HAVING cnt > 1
    ■ Run the query against the real-time table only (via the special table name mytable_REALTIME) to improve query performance.
  ○ Missing-entry detection:
    ■ Bucket rows by time and count per bucket.
    ■ Join/compare against the source of truth (an upstream metric in the data warehouse).
How to repair data in Pinot?
● If some range of data is corrupted (contains duplicates):
  ○ Find the duplicated data with a SQL query.
  ○ Delete and rebuild the Pinot segments containing the duplicates.
  ○ Pinot virtual columns like $segmentName help locate those segments.
● Best practices:
  ○ A reliable, exactly-once Kafka archive (backup) will come in handy in a fire.
  ○ Build a stable, reliable timestamp into the primary key, and use that timestamp as the Pinot time column.
Lessons Learned
● Flink
  ○ Set the Kafka transaction timeout large enough to account for any job downtime.
  ○ Set parallelism to a number that evenly divides the max parallelism.
  ○ Use jemalloc in Flink.
● Pinot
  ○ Higher Kafka transaction frequency and shorter Flink checkpoint intervals improve end-to-end data freshness in Pinot.
  ○ Beware of bogus message counts: many Kafka internal metrics include messages from failed transactions.
  ○ Duplicate monitoring is a must for critical apps.
