Tame the Small Files Problem and Optimize
Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Gang Ye, Haizhou Zhao | Apple
THIS IS NOT A CONTRIBUTION
Apache Iceberg is an open table format for huge analytic datasets
• Time travel
• Advanced filtering
• Serializable isolation
Where does Iceberg fit in the ecosystem?
• Compute Engine
• Table Format (Metadata): Iceberg
• Storage (Data): Cloud Blob Storage
Ingest data to the Iceberg data lake in streaming fashion
Kafka (Msg Queue) → Flink Streaming Ingestion → Iceberg Data Lake
Zoom into the Flink Iceberg sink
Records → writer-1 … writer-n → Data Files (DFS); the committer then commits the File Metadata to the Iceberg Data Lake.
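For orientation, here is a minimal sketch of wiring such a sink in a Flink job, assuming the iceberg-flink connector's FlinkSink builder; `records` and `tableLoader` are placeholders, and this is an illustration rather than the talk's code:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public final class IcebergSinkWiring {
  // Parallel writer subtasks produce data files; a single committer
  // task commits their file metadata to the Iceberg table.
  public static void appendTo(DataStream<RowData> records, TableLoader tableLoader) {
    FlinkSink.forRowData(records)
        .tableLoader(tableLoader)
        .append();
  }
}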
Case 1: event-time partitioned tables
hour=2022-08-03-00/
hour=2022-08-03-01/
…
Long tail problem with late-arriving data
https://en.wikipedia.org/wiki/Long_tail
(Chart: percentage of data by event hour, 0, 1, 2, …, N; most data lands in the current hour, with a long tail of late hours.)
A data file can’t contain rows across partitions
hour=2022-08-03-00/
|- file-000.parquet
|- file-001.parquet
|- …
hour=2022-08-03-01/
|- …
…
How many data files are generated every hour?
• Assuming the table is partitioned hourly and the event-time range is capped at 10 days, each of the 500 writers receives records for 24 x 10 = 240 partitions
• Each writer keeps 240 files open
• Every checkpoint commits 120K files (240 x 500)
• With a 10-minute checkpoint interval, that is 720K files every hour (120K x 6)
Long-tail hours lead to small files
Percentile File Size
P50 55 KB
P75 77 KB
P90 13 MB
P99 18 MB
What are the implications of too many small files?
• Poor read performance
• Request throttling
• Memory pressure
• Longer checkpoint duration and pipeline pauses
• Stress on the metadata system
Why not keyBy shuffle?
operator-1 … operator-n → keyBy(hour) → writer-1 … writer-n → committer → Iceberg
There are two problems
• Traffic is not evenly distributed across event hours
• keyBy on a low-cardinality column won't be balanced [1]
[1] https://github.com/apache/iceberg/pull/4228
Need smarter shuffling
Case 2: data clustering for non-partition columns
CREATE TABLE db.tbl (
ts timestamp,
data string,
event_type string)
USING iceberg
PARTITIONED BY (hours(ts))
Queries often filter on event_type
SELECT count(1) FROM db.tbl WHERE
ts >= '2022-01-01 08:00:00' AND
ts < '2022-01-01 09:00:00' AND
event_type = 'C'
Iceberg supports file pruning leveraging column-level min-max stats
|- file-000.parquet (event_type: A-B)
|- file-001.parquet (event_type: C-C)
|- file-002.parquet (event_type: D-F)
…
event_type = 'C'
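Conceptually, the file-level check is a simple interval test against each file's min-max stats; a minimal sketch (illustrative, not Iceberg's actual code):

final class MinMaxPruning {
  // A file may contain rows with event_type == value only if value lies within [min, max].
  static boolean mayContain(String min, String max, String value) {
    return min.compareTo(value) <= 0 && max.compareTo(value) >= 0;
  }
}

With the tight ranges above, event_type = 'C' prunes file-000 (A-B) and file-002 (D-F), leaving only file-001 (C-C) to scan.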
A wide value range makes pruning ineffective
|- file-000.parquet (event_type: A-Z)
|- file-001.parquet (event_type: A-Z)
|- file-002.parquet (event_type: A-Z)
…
event_type = 'C'
Making event_type a partition column can lead to an explosion in the number of partitions
• Before: 8.8K partitions (365 days x 24 hours) [1]
• After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2]
• Can stress the metadata system and lead to small files
[1] Assuming 12 months retention
[2] Assuming 500 event types
Batch engines solve the clustering problem via shuffle
1. Compute a data sketch of the traffic distribution (event type weights: A 2%, B 7%, C 22%, …, Z 0.5%)
2. Shuffle between stages to cluster the data
3. Sort data before writing to files, yielding tight min-max ranges per file (A-B, C-C, …, X-Z)
Shuffle for better data clustering
Why not compact small files or sort files via background batch maintenance jobs?
• Remediation is usually more expensive than prevention
• Doesn't solve the throttling problem in the streaming path
Agenda
• Motivation
• Design
• Evaluation
Introduce a smart shuffling operator in the Flink Iceberg sink
shuffle-1 … shuffle-n (smart shuffling) → writer-1 … writer-n → committer → Iceberg
Step 1: calculate traffic distribution
shuffle-1 … shuffle-10 → writer-1 … writer-n
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
Step 2a: shuffle data based on traffic distribution
Using the learned weights (Hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%), hot hours are split across several tasks while long-tail hours share one:
Hour Assigned tasks
0 1, 2, 3, 4
1 4, 5
2 6
… …
238 10
239 10
240 10
Step 2b: range shuffle data for non-partition columns
Using the learned weights (A: 2%, B: 7%, C: 28%, …, Z: 0.5%), sort values are grouped into contiguous ranges:
Event type range Assigned tasks
A-B 1
C-C 2, 3, 4
… …
P-Z 10
Range shuffling improves data clustering
• Without shuffling: unsorted data files with mixed values (e.g., Z X A, A C Y, C C B)
• With range shuffling: each writer receives a tight value range (e.g., A B A → range A-B; C C C; Z Y X → range X-Z), even though rows within a file remain unsorted
Sorting within a file brings the additional benefit of row-group and page-level skipping
For a Parquet file sorted on event_type (row group 1: X X X X, row group 2: X Y Y Z, row group 3: Z Z Z Z), a query such as
SELECT * FROM db.tbl WHERE
ts >= … AND ts < … AND
event_type = 'Y'
only needs to read row group 2.
What if sorting is needed?
• Sorting in streaming is possible but expensive
• Use batch sorting jobs
How to calculate traffic distribution
FLIP-27 source interface introduced the operator coordinator component
The Source Coordinator runs on the JobManager; Source Reader-1 … Source Reader-k run on TaskManager-1 … TaskManager-n.
Shuffle tasks calculate local stats and send them to the coordinator
Each shuffle task keeps a local per-hour count table (e.g., Hour 0: 33, 1: 14, 2: 5, …, 240: 0 or 1) and sends it to the shuffle coordinator running on the JobManager.
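A minimal sketch of the local bookkeeping (illustrative; `StatsGateway` is a hypothetical stand-in for Flink's operator-event channel to the coordinator):

import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the operator-event channel to the shuffle coordinator. */
interface StatsGateway {
  void sendToCoordinator(Map<Long, Long> localCounts);
}

final class LocalStatsTracker {
  private final Map<Long, Long> countsByHour = new HashMap<>();
  private final StatsGateway gateway;

  LocalStatsTracker(StatsGateway gateway) {
    this.gateway = gateway;
  }

  // Called per record: bump the count for the record's event hour.
  void onRecord(long hour) {
    countsByHour.merge(hour, 1L, Long::sum);
  }

  // Called periodically (e.g., at checkpoint): ship a snapshot and reset.
  void flushStats() {
    gateway.sendToCoordinator(new HashMap<>(countsByHour));
    countsByHour.clear();
  }
}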
Shuffle coordinator does global aggregation
The coordinator merges the tasks' local count tables into a single global weight table:
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
Global aggregation addresses the potential problem of different local views.
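Continuing the sketch under the same assumptions, the coordinator-side merge might look like this (illustrative): merge the per-task counts, then normalize them into weights:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GlobalStatsAggregator {
  // Merge per-task counts, then normalize into global weights (fractions summing to 1).
  static Map<Long, Double> aggregate(List<Map<Long, Long>> localCounts) {
    Map<Long, Long> merged = new HashMap<>();
    for (Map<Long, Long> local : localCounts) {
      local.forEach((hour, count) -> merged.merge(hour, count, Long::sum));
    }
    long total = merged.values().stream().mapToLong(Long::longValue).sum();
    Map<Long, Double> weights = new HashMap<>();
    if (total > 0) {
      merged.forEach((hour, count) -> weights.put(hour, (double) count / total));
    }
    return weights;
  }
}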
Shuffle coordinator broadcasts the globally aggregated stats to tasks
Every shuffle task receives the same weight table (Hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%), so all shuffle tasks make the same decision based on the same stats.
How to shuffle data
Add a custom partitioner after the shuffle operator
dataStream
  .transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory)
  .partitionCustom(binPackingPartitioner, keySelector);

public class BinPackingPartitioner<K> implements Partitioner<K> {
  // Illustrative completion (the slide shows only the signature):
  private final Map<K, Integer> taskByKey; // key → task, from the broadcast stats

  public BinPackingPartitioner(Map<K, Integer> taskByKey) { this.taskByKey = taskByKey; }

  @Override
  public int partition(K key, int numPartitions) {
    return taskByKey.getOrDefault(key, Math.floorMod(key.hashCode(), numPartitions));
  }
}
There are two shuffling strategies
• Bin packing
• Range distribution
Bin packing can combine multiple small keys into a single task or split a single large key across multiple tasks
Task Assigned keys
T0 K0, K2, K4, K6, K8
T1 K7
T2 K3
T3 K3
T4 K3
T5 K3
… …
T9 K1, K5
• Only focuses on balanced weight distribution
• Ignores ordering when assigning keys
• Works well when shuffling by partition columns
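A simplified sketch of that idea (illustrative, not the talk's production algorithm): process keys heaviest-first, always placing the next slice on the least-loaded task, and split any key whose weight exceeds the per-task budget across several tasks, as with K3 above:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class GreedyBinPacker {
  static Map<Long, List<Integer>> assign(Map<Long, Double> weights, int numTasks) {
    double budget = 1.0 / numTasks; // ideal share of traffic per task
    double[] load = new double[numTasks];
    PriorityQueue<Integer> byLoad =
        new PriorityQueue<>(Comparator.comparingDouble((Integer t) -> load[t]));
    for (int t = 0; t < numTasks; t++) {
      byLoad.add(t);
    }

    List<Long> keys = new ArrayList<>(weights.keySet());
    keys.sort(Comparator.comparingDouble((Long k) -> weights.get(k)).reversed());

    Map<Long, List<Integer>> assignment = new HashMap<>();
    for (long key : keys) {
      double remaining = weights.get(key);
      List<Integer> tasks = new ArrayList<>();
      while (remaining > 1e-9) {
        int task = byLoad.poll();                   // least-loaded task
        double take = Math.min(remaining, budget);  // split oversized keys
        load[task] += take;
        remaining -= take;
        tasks.add(task);
        byLoad.add(task);                           // re-insert with updated load
      }
      assignment.put(key, tasks);
    }
    return assignment;
  }
}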
Range shuffling splits sort values into ranges and assigns them to tasks
• Balances weight distribution with contiguous ranges
• Works well when shuffling by non-partition columns
(Table: sort values A, B, C, …, D assigned in order to tasks T1 … T4.)
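A simplified sketch (illustrative): walk the keys in sort order and start a new range whenever the accumulated weight reaches the per-task share. Unlike the talk's version, this sketch does not split one hot key, such as C-C above, across several tasks:

import java.util.Map;
import java.util.TreeMap;

final class RangeAssigner {
  // Cut contiguous ranges over the sorted keys so each range carries roughly
  // an equal share of the traffic; the map key is each range's upper bound.
  static TreeMap<String, Integer> buildRanges(TreeMap<String, Double> weights, int numTasks) {
    double budget = 1.0 / numTasks;
    TreeMap<String, Integer> taskByUpperBound = new TreeMap<>();
    double acc = 0;
    int task = 0;
    for (Map.Entry<String, Double> e : weights.entrySet()) {
      acc += e.getValue();
      taskByUpperBound.put(e.getKey(), task);
      if (acc >= budget && task < numTasks - 1) {
        task++;
        acc = 0;
      }
    }
    return taskByUpperBound;
  }

  // Route a record's sort value to the task owning the first range that covers it.
  static int taskFor(TreeMap<String, Integer> taskByUpperBound, String value) {
    Map.Entry<String, Integer> e = taskByUpperBound.ceilingEntry(value);
    return e != null ? e.getValue() : taskByUpperBound.lastEntry().getValue();
  }
}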
Optimizing for a balanced byte-rate distribution can lead to file count skew, where one task handles many long-tail hours
https://en.wikipedia.org/wiki/Long_tail
Many long-tail hours can be assigned to a single task, which can become a bottleneck.
There are two solutions
• Parallelize file flushing and upload
• Limit the file count skew via a close-file cost (like open-file cost)
Tune the close-file cost to balance between file count skew and byte rate skew
(Chart: skewness vs. close-file cost; as the cost increases, file count skew falls while byte rate skew rises.)
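One way such a cost could be folded into the weights (an illustrative sketch; `closeFileCost` is a made-up parameter name): each key assigned to a task implies one more file to flush at checkpoint, so adding a fixed per-key cost discourages stacking many tiny long-tail keys onto a single task:

import java.util.HashMap;
import java.util.Map;

final class CloseFileCost {
  // Higher cost trades byte-rate balance for file-count balance.
  static Map<Long, Double> apply(Map<Long, Double> byteWeights, double closeFileCost) {
    double total = byteWeights.values().stream().mapToDouble(Double::doubleValue).sum()
        + closeFileCost * byteWeights.size();
    Map<Long, Double> adjusted = new HashMap<>();
    byteWeights.forEach((key, w) -> adjusted.put(key, (w + closeFileCost) / total));
    return adjusted;
  }
}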
Agenda
• Motivation
• Design
• Evaluation
A: Simple Iceberg ingestion job without shuffling
source-1 … source-n chained with writer-1 … writer-n → committer
• Job parallelism is 60
• Checkpoint interval is 10 min
B: Iceberg ingestion with smart shuffling
source-1 … source-n → (shuffle) → shuffle-1 … shuffle-n chained with writer-1 … writer-n → committer
• Job parallelism is 60
• Checkpoint interval is 10 min
Test setup
• Sink Iceberg table is partitioned hourly by event time
• Benchmark traffic volume is 250 MB/sec
• Event time range is 192 hours
What are we comparing
• Number of files written in one cycle
• File size distribution
• Checkpoint duration
• CPU utilization
• Shuffling skew
Shuffling reduced the number of files by 20x
• Job parallelism is 60
• Event time range is 192 hours
• Without shuffling, one cycle flushed 10K files
• With shuffling, one cycle flushed 500 files (~2.5x the minimal number of files)
Shuffling greatly improved the file size distribution
Percentile | Without shuffling | With shuffling | Improvement
P50 | 55 KB | 913 KB | 17x
P75 | 77 KB | 7 MB | 90x
P95 | 13 MB | 301 MB | 23x
P99 | 18 MB | 306 MB | 17x
Shuffling tamed the small files problem
During checkpoint, writer tasks flush and upload data files
writer-1 … writer-n → Data Files (DFS) → committer
Reduced checkpoint duration by 8x
• Without shuffling, checkpoint takes 64s on average
• With shuffling, checkpoint takes 8s on average
Record handover between chained operators is a simple method call
1. Kafka Source (source-1 … source-n), chained with 2. Iceberg Sink (writer-1 … writer-n → committer)
Shuffling involves significant CPU overhead for serdes and network I/O
1. Kafka Source (source-1 … source-n) → 2. Shuffle (shuffle-1 … shuffle-n) → 3. Iceberg Sink (writer-1 … writer-n → committer)
Shuffling increased CPU usage by 62%
• Without shuffling, avg CPU util is 35%
• With shuffling, avg CPU util is 57%
It is all about tradeoffs!
Without shuffling, the checkpoint pause is longer and the catch-up spike is bigger
(Chart: throughput over time, with vs. without shuffling; the trough is caused by the checkpoint pause, followed by a catch-up spike.)
Bin packing shuffling won't be perfect in weight distribution
One shuffle task may process data for partitions a, b, c while another processes data only for partitions y, z.
Our greedy-algorithm implementation of bin packing introduces higher skew than we hoped for
| Min writer record rate | Max writer record rate | Skewness (max-min)/min
No shuffling | 4.36 K | 4.44 K | 1.8%
Bin packing (greedy algo) | 4.02 K | 6.39 K | 59%
Future work
• Implement other algorithms
  • Better bin packing with less skew
  • Range partitioner
• Support sketch statistics for high-cardinality keys
• Contribute it to OSS
References
• Design doc: https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
Q&A
Weight table should be relatively stable
What about new hours as time moves forward?
Absolute hour Weight
2022-08-03-00 0.4
… …
2022-08-03-12 22
2022-08-03-13 27
2022-08-03-14 38
2022-08-03-15 ??
Weight table based on relative hour would be stable
Relative hour Weight
0 38
1 27
2 22
… …
14 0.4
… …
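A minimal sketch of that normalization (illustrative): index each record's hour by its distance behind the current time, so the table keeps the same shape as the clock advances:

import java.time.Duration;
import java.time.Instant;

final class RelativeHour {
  // Index hours by "how far behind now" instead of by absolute time, so the
  // learned weight table stays stable as wall-clock time moves forward.
  static long of(Instant eventTime, Instant now) {
    long hoursBehind = Duration.between(eventTime, now).toHours();
    return Math.max(hoursBehind, 0); // clock skew / future timestamps land in bucket 0
  }
}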
What about the cold start problem?
• First-time run
• Restart with empty state
• New subtasks from scale-up
Coping with cold start problems
• No shuffle while learning
• Buffer records until the first stats are learned
• New subtasks (scale-up) request stats from the coordinator