Apache Iceberg Table Optimization #3: Optimizing Compaction for Streaming Workloads in Apache Iceberg
In traditional batch pipelines, compaction jobs can run in large maintenance windows during idle periods. But in streaming workloads, data arrives continuously in small increments, so small files accumulate quickly while freshness requirements stay tight.
So how do we compact Iceberg tables without interfering with ingestion and latency-sensitive reads? This post explores how to design efficient, incremental compaction jobs that preserve performance without disrupting your streaming pipelines.
The Challenge with Streaming + Compaction
Streaming ingestion into Apache Iceberg often uses micro-batches or event-driven triggers that commit frequently, produce many small data files, and grow the snapshot and manifest history quickly.
A naive compaction job that rewrites entire partitions or the whole table risks conflicting with in-flight streaming commits, rewriting hot data over and over, and slowing down latency-sensitive reads while it runs.
The key is to optimize incrementally and intelligently.
Techniques for Streaming-Safe Compaction
1. Compact Only Cold Partitions
Don't rewrite partitions that are actively being written to. Instead, target partitions that have gone quiet for a cooldown window (say, an hour) and have accumulated more small files than a chosen threshold.
Example query using the partitions metadata table (its last_updated_at column is available in recent Iceberg releases):

SELECT partition, file_count
FROM my_table.partitions
WHERE last_updated_at < current_timestamp() - INTERVAL '1 hour'
  AND file_count > 10;
This can drive dynamic, safe compaction logic in orchestration tools.
2. Use Incremental Compaction Windows
Instead of full table rewrites, scope each run to a bounded slice of the data: a partition filter, a recent time window, or a capped number of file groups.
Spark's rewriteDataFiles action (exposed in Spark SQL as the rewrite_data_files procedure) and Dremio's OPTIMIZE command both support targeted rewrites of this kind, as in the sketch below.
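For example, here is a minimal sketch of a partition-scoped rewrite using the rewrite_data_files Spark procedure; the catalog name, db.my_table, and the event_date partition value are placeholders, and an existing SparkSession named spark is assumed:

# Rewrite only the data files in one cold partition (names are placeholders).
spark.sql("""
    CALL catalog.system.rewrite_data_files(
        table => 'db.my_table',
        where => "event_date = '2024-01-01'"
    )
""")

Because the where clause limits the rewrite to files in that partition, concurrent writes landing in fresher partitions are unlikely to conflict with the compaction commit.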
3. Trigger Based on Metadata Metrics
Rather than scheduling compaction at fixed intervals, use metadata-driven triggers such as the number of small files per partition, the average data file size, or the number of snapshots and manifests accumulated since the last rewrite.
You can track these via files and manifests metadata tables and use orchestration tools (e.g., Airflow, Dagster, dbt Cloud) to trigger compaction.
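As one illustration, here is a hedged sketch of a trigger check computed from the files metadata table; the 32 MB and 64 MB thresholds, the db.my_table name, and the SparkSession variable spark are assumptions for the example:

# Table-wide file-size health: average data file size and the count of
# files smaller than 32 MB, read from the files metadata table.
row = spark.sql("""
    SELECT
        AVG(file_size_in_bytes) AS avg_file_size,
        SUM(CASE WHEN file_size_in_bytes < 32 * 1024 * 1024 THEN 1 ELSE 0 END) AS small_files
    FROM db.my_table.files
""").first()

# An orchestrator task (Airflow, Dagster, etc.) can branch on this flag.
needs_compaction = row.small_files > 100 or row.avg_file_size < 64 * 1024 * 1024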
Example: Time-Based Compaction Script (Pseudo-code)
# For each partition older than 1 hour with many small files,
# run a scoped compaction. The helper functions are placeholders
# for your own metadata queries and rewrite calls.
threshold = 10  # small-file count that triggers a rewrite

for partition in get_partitions_older_than(hours=1):
    if count_small_files(partition) > threshold:
        run_compaction(partition)
This pattern allows incremental, scoped jobs that don’t touch fresh data.
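To make the pseudo-code concrete, here is a minimal PySpark sketch under the same assumptions as the earlier examples: a SparkSession named spark, a db.my_table table whose partitions metadata table exposes last_updated_at (available in recent Iceberg releases), and a single event_date partition column, all illustrative:

# Find partitions that have been quiet for an hour but hold too many files.
candidates = spark.sql("""
    SELECT partition.event_date AS event_date, file_count
    FROM db.my_table.partitions
    WHERE last_updated_at < current_timestamp() - INTERVAL '1 hour'
      AND file_count > 10
""").collect()

# Compact each candidate with a rewrite scoped to that partition only.
for row in candidates:
    spark.sql(f"""
        CALL catalog.system.rewrite_data_files(
            table => 'db.my_table',
            where => "event_date = '{row.event_date}'"
        )
    """)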
Tuning for Performance
Parallelism: Use high parallelism for wide tables to shorten job runtime; in Spark's rewrite action this maps to concurrent file-group rewrites (see the sketch below)
Target file size: Stick to the 128 MB–256 MB range unless your queries benefit from larger files
Retries and checkpointing: Make sure jobs are fault-tolerant in production; partial-progress commits let finished file groups land even if a later step fails
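As a rough illustration of how these knobs map onto Spark's rewrite_data_files options (the option names come from Iceberg's rewrite action; catalog and table names are placeholders):

# Tune target size, concurrency, and fault tolerance for the rewrite.
spark.sql("""
    CALL catalog.system.rewrite_data_files(
        table => 'db.my_table',
        options => map(
            'target-file-size-bytes', '268435456',
            'max-concurrent-file-group-rewrites', '10',
            'partial-progress.enabled', 'true'
        )
    )
""")

Here 268435456 bytes is a 256 MB target file size, max-concurrent-file-group-rewrites controls how many file groups are rewritten in parallel, and partial-progress.enabled lets completed groups commit even if a later group fails.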
Summary
To maintain performance in streaming Iceberg pipelines, compact only cold partitions, keep each rewrite incremental and scoped, trigger compaction from metadata metrics rather than fixed schedules, and tune parallelism and target file size to match your workload.
With the right setup, you can keep query performance and data freshness high—without sacrificing one for the other.