Apache Iceberg Table Optimization #3: Optimizing Compaction for Streaming Workloads in Apache Iceberg
In traditional batch pipelines, compaction jobs can run in large maintenance windows during idle periods. But in streaming workloads, data arrives continuously in small increments, so small files accumulate quickly while freshness requirements stay tight.
So how do we compact Iceberg tables without interfering with ingestion and latency-sensitive reads? This post explores how to design efficient, incremental compaction jobs that preserve performance without disrupting your streaming pipelines.
The Challenge with Streaming + Compaction
Streaming ingestion into Apache Iceberg often uses micro-batches or event-driven triggers that commit frequently, produce many small data files, and grow the snapshot and manifest history quickly.
A naive compaction job that rewrites entire partitions or the whole table risks conflicting with in-flight streaming commits, rewriting hot data over and over, and slowing down latency-sensitive reads while it runs.
The key is to optimize incrementally and intelligently.
Techniques for Streaming-Safe Compaction
1. Compact Only Cold Partitions
Don't rewrite partitions that are actively being written to. Instead, target partitions that have gone quiet for a cooldown window (say, an hour) and have accumulated more small files than a chosen threshold.
Example query using the partitions metadata table (its last_updated_at column is available in recent Iceberg releases):

SELECT partition, file_count
FROM my_table.partitions
WHERE last_updated_at < current_timestamp() - INTERVAL '1 hour'
  AND file_count > 10;
This can drive dynamic, safe compaction logic in orchestration tools.
2. Use Incremental Compaction Windows
Instead of full table rewrites, scope each run to a bounded slice of the data: a partition filter, a recent time window, or a capped number of file groups.
Spark's rewriteDataFiles action (exposed in Spark SQL as the rewrite_data_files procedure) and Dremio's OPTIMIZE command both support targeted rewrites of this kind, as in the sketch below.
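For example, here is a minimal sketch of a partition-scoped rewrite using the rewrite_data_files Spark procedure; the catalog name, db.my_table, and the event_date partition value are placeholders, and an existing SparkSession named spark is assumed:

# Rewrite only the data files in one cold partition (names are placeholders).
spark.sql("""
    CALL catalog.system.rewrite_data_files(
        table => 'db.my_table',
        where => "event_date = '2024-01-01'"
    )
""")

Because the where clause limits the rewrite to files in that partition, concurrent writes landing in fresher partitions are unlikely to conflict with the compaction commit.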
3. Trigger Based on Metadata Metrics
Rather than scheduling compaction at fixed intervals, use metadata-driven triggers such as the number of small files per partition, the average data file size, or the number of snapshots and manifests accumulated since the last rewrite.
You can track these via files and manifests metadata tables and use orchestration tools (e.g., Airflow, Dagster, dbt Cloud) to trigger compaction.
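As one illustration, here is a hedged sketch of a trigger check computed from the files metadata table; the 32 MB and 64 MB thresholds, the db.my_table name, and the SparkSession variable spark are assumptions for the example:

# Table-wide file-size health: average data file size and the count of
# files smaller than 32 MB, read from the files metadata table.
row = spark.sql("""
    SELECT
        AVG(file_size_in_bytes) AS avg_file_size,
        SUM(CASE WHEN file_size_in_bytes < 32 * 1024 * 1024 THEN 1 ELSE 0 END) AS small_files
    FROM db.my_table.files
""").first()

# An orchestrator task (Airflow, Dagster, etc.) can branch on this flag.
needs_compaction = row.small_files > 100 or row.avg_file_size < 64 * 1024 * 1024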
Example: Time-Based Compaction Script (Pseudo-code)
# For each partition older than 1 hour with many small files,
# run a scoped compaction. The helper functions are placeholders
# for your own metadata queries and rewrite calls.
threshold = 10  # small-file count that triggers a rewrite

for partition in get_partitions_older_than(hours=1):
    if count_small_files(partition) > threshold:
        run_compaction(partition)
This pattern allows incremental, scoped jobs that don’t touch fresh data.
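To make the pseudo-code concrete, here is a minimal PySpark sketch under the same assumptions as the earlier examples: a SparkSession named spark, a db.my_table table whose partitions metadata table exposes last_updated_at (available in recent Iceberg releases), and a single event_date partition column, all illustrative:

# Find partitions that have been quiet for an hour but hold too many files.
candidates = spark.sql("""
    SELECT partition.event_date AS event_date, file_count
    FROM db.my_table.partitions
    WHERE last_updated_at < current_timestamp() - INTERVAL '1 hour'
      AND file_count > 10
""").collect()

# Compact each candidate with a rewrite scoped to that partition only.
for row in candidates:
    spark.sql(f"""
        CALL catalog.system.rewrite_data_files(
            table => 'db.my_table',
            where => "event_date = '{row.event_date}'"
        )
    """)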
Tuning for Performance
Parallelism: Use high parallelism for wide tables to shorten job runtime; in Spark's rewrite action this maps to concurrent file-group rewrites (see the sketch below)
Target file size: Stick to the 128 MB–256 MB range unless your queries benefit from larger files
Retries and checkpointing: Make sure jobs are fault-tolerant in production; partial-progress commits let finished file groups land even if a later step fails
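As a rough illustration of how these knobs map onto Spark's rewrite_data_files options (the option names come from Iceberg's rewrite action; catalog and table names are placeholders):

# Tune target size, concurrency, and fault tolerance for the rewrite.
spark.sql("""
    CALL catalog.system.rewrite_data_files(
        table => 'db.my_table',
        options => map(
            'target-file-size-bytes', '268435456',
            'max-concurrent-file-group-rewrites', '10',
            'partial-progress.enabled', 'true'
        )
    )
""")

Here 268435456 bytes is a 256 MB target file size, max-concurrent-file-group-rewrites controls how many file groups are rewritten in parallel, and partial-progress.enabled lets completed groups commit even if a later group fails.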
Summary
To maintain performance in streaming Iceberg pipelines, compact only cold partitions, keep each rewrite incremental and scoped, trigger compaction from metadata metrics rather than fixed schedules, and tune parallelism and target file size to match your workload.
With the right setup, you can keep query performance and data freshness high—without sacrificing one for the other.