Tame the Small Files Problem and Optimize
Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Gang Ye, Haizhou Zhao | Apple
THIS IS NOT A CONTRIBUTION
Apache Iceberg is an open table format for huge analytic datasets
• Time travel
• Advanced filtering
• Serializable isolation
Where does Iceberg fit in the ecosystem?
• Compute Engine
• Table Format (Metadata): Iceberg
• Storage (Data): Cloud Blob Storage
Ingest data to the Iceberg data lake in streaming fashion
Kafka (Msg Queue) → Flink Streaming Ingestion → Iceberg Data Lake
Zoom into the Flink Iceberg sink
Records → writer-1 … writer-n → Data Files (DFS); the committer then commits the File Metadata to the Iceberg Data Lake.
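For orientation, here is a minimal sketch of wiring such a sink in a Flink job, assuming the iceberg-flink connector's FlinkSink builder; `records` and `tableLoader` are placeholders, and this is an illustration rather than the talk's code:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public final class IcebergSinkWiring {
  // Parallel writer subtasks produce data files; a single committer
  // task commits their file metadata to the Iceberg table.
  public static void appendTo(DataStream<RowData> records, TableLoader tableLoader) {
    FlinkSink.forRowData(records)
        .tableLoader(tableLoader)
        .append();
  }
}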
Case 1: event-time partitioned tables
hour=2022-08-03-00/
hour=2022-08-03-01/
…
Long tail problem with late-arriving data
https://en.wikipedia.org/wiki/Long_tail
(Chart: percentage of data by event hour, 0, 1, 2, …, N; most data lands in the current hour, with a long tail of late hours.)
A data file can’t contain rows across partitions
hour=2022-08-03-00/
|- file-000.parquet
|- file-001.parquet
|- …
hour=2022-08-03-01/
|- …
…
How many data files are generated every hour?
• Assuming the table is partitioned hourly and the event-time range is capped at 10 days, each of the 500 writers receives records for 24 x 10 = 240 partitions
• Each writer keeps 240 files open
• Every checkpoint commits 120K files (240 x 500)
• With a 10-minute checkpoint interval, that is 720K files every hour (120K x 6)
Long-tail hours lead to small files
Percentile File Size
P50 55 KB
P75 77 KB
P90 13 MB
P99 18 MB
What are the implications of too many small files?
• Poor read performance
• Request throttling
• Memory pressure
• Longer checkpoint duration and pipeline pauses
• Stress on the metadata system
Why not keyBy shuffle?
operator-1 … operator-n → keyBy(hour) → writer-1 … writer-n → committer → Iceberg
There are two problems
• Traffic is not evenly distributed across event hours
• keyBy on a low-cardinality column won't be balanced [1]
[1] https://github.com/apache/iceberg/pull/4228
Need smarter shuffling
Case 2: data clustering for non-partition columns
CREATE TABLE db.tbl (
ts timestamp,
data string,
event_type string)
USING iceberg
PARTITIONED BY (hours(ts))
Queries often filter on event_type
SELECT count(1) FROM db.tbl WHERE
ts >= '2022-01-01 08:00:00' AND
ts < '2022-01-01 09:00:00' AND
event_type = 'C'
Iceberg supports file pruning leveraging column-level min-max stats
|- file-000.parquet (event_type: A-B)
|- file-001.parquet (event_type: C-C)
|- file-002.parquet (event_type: D-F)
…
event_type = 'C'
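Conceptually, the file-level check is a simple interval test against each file's min-max stats; a minimal sketch (illustrative, not Iceberg's actual code):

final class MinMaxPruning {
  // A file may contain rows with event_type == value only if value lies within [min, max].
  static boolean mayContain(String min, String max, String value) {
    return min.compareTo(value) <= 0 && max.compareTo(value) >= 0;
  }
}

With the tight ranges above, event_type = 'C' prunes file-000 (A-B) and file-002 (D-F), leaving only file-001 (C-C) to scan.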
A wide value range makes pruning ineffective
|- file-000.parquet (event_type: A-Z)
|- file-001.parquet (event_type: A-Z)
|- file-002.parquet (event_type: A-Z)
…
event_type = 'C'
Making event_type a partition column can lead to an explosion in the number of partitions
• Before: 8.8K partitions (365 days x 24 hours) [1]
• After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2]
• Can stress the metadata system and lead to small files
[1] Assuming 12 months retention
[2] Assuming 500 event types
Batch engines solve the clustering problem via shuffle
1. Compute a data sketch of the traffic distribution (event type weights: A 2%, B 7%, C 22%, …, Z 0.5%)
2. Shuffle between stages to cluster the data
3. Sort data before writing to files, yielding tight min-max ranges per file (A-B, C-C, …, X-Z)
Shuffle for better data clustering
Why not compact small files or sort files via background batch maintenance jobs?
• Remediation is usually more expensive than prevention
• Doesn't solve the throttling problem in the streaming path
Agenda
• Motivation
• Design
• Evaluation
Introduce a smart shuffling operator in the Flink Iceberg sink
shuffle-1 … shuffle-n (smart shuffling) → writer-1 … writer-n → committer → Iceberg
Step 1: calculate traffic distribution
shuffle-1 … shuffle-10 → writer-1 … writer-n
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
Step 2a: shuffle data based on traffic distribution
Using the learned weights (Hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%), hot hours are split across several tasks while long-tail hours share one:
Hour Assigned tasks
0 1, 2, 3, 4
1 4, 5
2 6
… …
238 10
239 10
240 10
Step 2b: range shuffle data for non-partition columns
Using the learned weights (A: 2%, B: 7%, C: 28%, …, Z: 0.5%), sort values are grouped into contiguous ranges:
Event type range Assigned tasks
A-B 1
C-C 2, 3, 4
… …
P-Z 10
Range shuffling improves data clustering
• Without shuffling: unsorted data files with mixed values (e.g., Z X A, A C Y, C C B)
• With range shuffling: each writer receives a tight value range (e.g., A B A → range A-B; C C C; Z Y X → range X-Z), even though rows within a file remain unsorted
Sorting within a file brings the additional benefit of row-group and page-level skipping
For a Parquet file sorted on event_type (row group 1: X X X X, row group 2: X Y Y Z, row group 3: Z Z Z Z), a query such as
SELECT * FROM db.tbl WHERE
ts >= … AND ts < … AND
event_type = 'Y'
only needs to read row group 2.
What if sorting is needed?
• Sorting in streaming is possible but expensive
• Use batch sorting jobs
How to calculate traffic distribution
FLIP-27 source interface introduced the operator coordinator component
The Source Coordinator runs on the JobManager; Source Reader-1 … Source Reader-k run on TaskManager-1 … TaskManager-n.
Shuffle tasks calculate local stats and send them to the coordinator
Each shuffle task keeps a local per-hour count table (e.g., Hour 0: 33, 1: 14, 2: 5, …, 240: 0 or 1) and sends it to the shuffle coordinator running on the JobManager.
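A minimal sketch of the local bookkeeping (illustrative; `StatsGateway` is a hypothetical stand-in for Flink's operator-event channel to the coordinator):

import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the operator-event channel to the shuffle coordinator. */
interface StatsGateway {
  void sendToCoordinator(Map<Long, Long> localCounts);
}

final class LocalStatsTracker {
  private final Map<Long, Long> countsByHour = new HashMap<>();
  private final StatsGateway gateway;

  LocalStatsTracker(StatsGateway gateway) {
    this.gateway = gateway;
  }

  // Called per record: bump the count for the record's event hour.
  void onRecord(long hour) {
    countsByHour.merge(hour, 1L, Long::sum);
  }

  // Called periodically (e.g., at checkpoint): ship a snapshot and reset.
  void flushStats() {
    gateway.sendToCoordinator(new HashMap<>(countsByHour));
    countsByHour.clear();
  }
}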
Shuffle coordinator does global aggregation
The coordinator merges the tasks' local count tables into a single global weight table:
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
Global aggregation addresses the potential problem of different local views.
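Continuing the sketch under the same assumptions, the coordinator-side merge might look like this (illustrative): merge the per-task counts, then normalize them into weights:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GlobalStatsAggregator {
  // Merge per-task counts, then normalize into global weights (fractions summing to 1).
  static Map<Long, Double> aggregate(List<Map<Long, Long>> localCounts) {
    Map<Long, Long> merged = new HashMap<>();
    for (Map<Long, Long> local : localCounts) {
      local.forEach((hour, count) -> merged.merge(hour, count, Long::sum));
    }
    long total = merged.values().stream().mapToLong(Long::longValue).sum();
    Map<Long, Double> weights = new HashMap<>();
    if (total > 0) {
      merged.forEach((hour, count) -> weights.put(hour, (double) count / total));
    }
    return weights;
  }
}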
Shuffle coordinator broadcasts the globally aggregated stats to tasks
Every shuffle task receives the same weight table (Hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%), so all shuffle tasks make the same decision based on the same stats.
How to shuffle data
Add a custom partitioner after the shuffle operator
dataStream
  .transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory)
  .partitionCustom(binPackingPartitioner, keySelector);

public class BinPackingPartitioner<K> implements Partitioner<K> {
  // Illustrative completion (the slide shows only the signature):
  private final Map<K, Integer> taskByKey; // key → task, from the broadcast stats

  public BinPackingPartitioner(Map<K, Integer> taskByKey) { this.taskByKey = taskByKey; }

  @Override
  public int partition(K key, int numPartitions) {
    return taskByKey.getOrDefault(key, Math.floorMod(key.hashCode(), numPartitions));
  }
}
There are two shuffling strategies
• Bin packing
• Range distribution
Bin packing can combine multiple small keys into a single task or split a single large key across multiple tasks
Task Assigned keys
T0 K0, K2, K4, K6, K8
T1 K7
T2 K3
T3 K3
T4 K3
T5 K3
… …
T9 K1, K5
• Only focuses on balanced weight distribution
• Ignores ordering when assigning keys
• Works well when shuffling by partition columns
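A simplified sketch of that idea (illustrative, not the talk's production algorithm): process keys heaviest-first, always placing the next slice on the least-loaded task, and split any key whose weight exceeds the per-task budget across several tasks, as with K3 above:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class GreedyBinPacker {
  static Map<Long, List<Integer>> assign(Map<Long, Double> weights, int numTasks) {
    double budget = 1.0 / numTasks; // ideal share of traffic per task
    double[] load = new double[numTasks];
    PriorityQueue<Integer> byLoad =
        new PriorityQueue<>(Comparator.comparingDouble((Integer t) -> load[t]));
    for (int t = 0; t < numTasks; t++) {
      byLoad.add(t);
    }

    List<Long> keys = new ArrayList<>(weights.keySet());
    keys.sort(Comparator.comparingDouble((Long k) -> weights.get(k)).reversed());

    Map<Long, List<Integer>> assignment = new HashMap<>();
    for (long key : keys) {
      double remaining = weights.get(key);
      List<Integer> tasks = new ArrayList<>();
      while (remaining > 1e-9) {
        int task = byLoad.poll();                   // least-loaded task
        double take = Math.min(remaining, budget);  // split oversized keys
        load[task] += take;
        remaining -= take;
        tasks.add(task);
        byLoad.add(task);                           // re-insert with updated load
      }
      assignment.put(key, tasks);
    }
    return assignment;
  }
}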
Range shuffling splits sort values into ranges and assigns them to tasks
• Balances weight distribution with contiguous ranges
• Works well when shuffling by non-partition columns
(Table: sort values A, B, C, …, D assigned in order to tasks T1 … T4.)
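A simplified sketch (illustrative): walk the keys in sort order and start a new range whenever the accumulated weight reaches the per-task share. Unlike the talk's version, this sketch does not split one hot key, such as C-C above, across several tasks:

import java.util.Map;
import java.util.TreeMap;

final class RangeAssigner {
  // Cut contiguous ranges over the sorted keys so each range carries roughly
  // an equal share of the traffic; the map key is each range's upper bound.
  static TreeMap<String, Integer> buildRanges(TreeMap<String, Double> weights, int numTasks) {
    double budget = 1.0 / numTasks;
    TreeMap<String, Integer> taskByUpperBound = new TreeMap<>();
    double acc = 0;
    int task = 0;
    for (Map.Entry<String, Double> e : weights.entrySet()) {
      acc += e.getValue();
      taskByUpperBound.put(e.getKey(), task);
      if (acc >= budget && task < numTasks - 1) {
        task++;
        acc = 0;
      }
    }
    return taskByUpperBound;
  }

  // Route a record's sort value to the task owning the first range that covers it.
  static int taskFor(TreeMap<String, Integer> taskByUpperBound, String value) {
    Map.Entry<String, Integer> e = taskByUpperBound.ceilingEntry(value);
    return e != null ? e.getValue() : taskByUpperBound.lastEntry().getValue();
  }
}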
Optimizing for a balanced byte-rate distribution can lead to file count skew, where one task handles many long-tail hours
https://en.wikipedia.org/wiki/Long_tail
Many long-tail hours can be assigned to a single task, which can become a bottleneck.
There are two solutions
• Parallelize file flushing and upload
• Limit the file count skew via a close-file cost (like open-file cost)
Tune the close-file cost to balance between file count skew and byte rate skew
(Chart: skewness vs. close-file cost; as the cost increases, file count skew falls while byte rate skew rises.)
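One way such a cost could be folded into the weights (an illustrative sketch; `closeFileCost` is a made-up parameter name): each key assigned to a task implies one more file to flush at checkpoint, so adding a fixed per-key cost discourages stacking many tiny long-tail keys onto a single task:

import java.util.HashMap;
import java.util.Map;

final class CloseFileCost {
  // Higher cost trades byte-rate balance for file-count balance.
  static Map<Long, Double> apply(Map<Long, Double> byteWeights, double closeFileCost) {
    double total = byteWeights.values().stream().mapToDouble(Double::doubleValue).sum()
        + closeFileCost * byteWeights.size();
    Map<Long, Double> adjusted = new HashMap<>();
    byteWeights.forEach((key, w) -> adjusted.put(key, (w + closeFileCost) / total));
    return adjusted;
  }
}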
Agenda
• Motivation
• Design
• Evaluation
A: Simple Iceberg ingestion job without shuffling
source-1 … source-n chained with writer-1 … writer-n → committer
• Job parallelism is 60
• Checkpoint interval is 10 min
B: Iceberg ingestion with smart shuffling
source-1 … source-n → (shuffle) → shuffle-1 … shuffle-n chained with writer-1 … writer-n → committer
• Job parallelism is 60
• Checkpoint interval is 10 min
Test setup
• Sink Iceberg table is partitioned hourly by event time
• Benchmark traffic volume is 250 MB/sec
• Event time range is 192 hours
What are we comparing
• Number of files written in one cycle
• File size distribution
• Checkpoint duration
• CPU utilization
• Shuffling skew
Shuffling reduced the number of files by 20x
• Job parallelism is 60
• Event time range is 192 hours
• Without shuffling, one cycle flushed 10K files
• With shuffling, one cycle flushed 500 files (~2.5x the minimal number of files)
Shuffling greatly improved the file size distribution
Percentile | Without shuffling | With shuffling | Improvement
P50 | 55 KB | 913 KB | 17x
P75 | 77 KB | 7 MB | 90x
P95 | 13 MB | 301 MB | 23x
P99 | 18 MB | 306 MB | 17x
Shuffling tamed the small files problem
During checkpoint, writer tasks flush and upload data files
writer-1 … writer-n → Data Files (DFS) → committer
Reduced checkpoint duration by 8x
• Without shuffling, checkpoint takes 64s on average
• With shuffling, checkpoint takes 8s on average
Record handover between chained operators is a simple method call
1. Kafka Source (source-1 … source-n), chained with 2. Iceberg Sink (writer-1 … writer-n → committer)
Shuffling involves significant CPU overhead for serdes and network I/O
1. Kafka Source (source-1 … source-n) → 2. Shuffle (shuffle-1 … shuffle-n) → 3. Iceberg Sink (writer-1 … writer-n → committer)
Shuffling increased CPU usage by 62%
• Without shuffling, avg CPU util is 35%
• With shuffling, avg CPU util is 57%
It is all about tradeoffs!
Without shuffling, the checkpoint pause is longer and the catch-up spike is bigger
(Chart: throughput over time, with vs. without shuffling; the trough is caused by the checkpoint pause, followed by a catch-up spike.)
Bin packing shuffling won't be perfect in weight distribution
One shuffle task may process data for partitions a, b, c while another processes data only for partitions y, z.
Our greedy-algorithm implementation of bin packing introduces higher skew than we hoped for
| Min writer record rate | Max writer record rate | Skewness (max-min)/min
No shuffling | 4.36 K | 4.44 K | 1.8%
Bin packing (greedy algo) | 4.02 K | 6.39 K | 59%
Future work
• Implement other algorithms
  • Better bin packing with less skew
  • Range partitioner
• Support sketch statistics for high-cardinality keys
• Contribute it to OSS
References
• Design doc: https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
Q&A
Weight table should be relatively stable
What about new hours as time moves forward?
Absolute hour Weight
2022-08-03-00 0.4
… …
2022-08-03-12 22
2022-08-03-13 27
2022-08-03-14 38
2022-08-03-15 ??
Weight table based on relative hour would be stable
Relative hour Weight
0 38
1 27
2 22
… …
14 0.4
… …
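A minimal sketch of that normalization (illustrative): index each record's hour by its distance behind the current time, so the table keeps the same shape as the clock advances:

import java.time.Duration;
import java.time.Instant;

final class RelativeHour {
  // Index hours by "how far behind now" instead of by absolute time, so the
  // learned weight table stays stable as wall-clock time moves forward.
  static long of(Instant eventTime, Instant now) {
    long hoursBehind = Duration.between(eventTime, now).toHours();
    return Math.max(hoursBehind, 0); // clock skew / future timestamps land in bucket 0
  }
}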
What about the cold start problem?
• First-time run
• Restart with empty state
• New subtasks from scale-up
Coping with cold start problems
• No shuffle while learning
• Buffer records until the first stats are learned
• New subtasks (scale-up) request stats from the coordinator