Parallel Iceberg Table Compaction with AWS Step Functions and Athena
Optimizing data lakes for performance and cost is a critical task for any data engineering team. Apache Iceberg, an open table format for large analytic datasets, makes it easier to manage data at scale on Amazon S3. However, as data grows and changes, small file problems can degrade query performance. Compaction — merging small files into larger ones — is key to maintaining fast, efficient queries.
In this post, I’ll show you how to build a serverless, scalable solution for compacting multiple Iceberg tables in parallel using Amazon Athena and AWS Step Functions. This approach lets you automate and orchestrate compaction jobs without managing any servers, and it easily scales to handle many tables at once.
Why Use Athena and Step Functions for Iceberg Compaction?
Serverless: Athena is a fully managed, serverless query engine — no infrastructure to manage, and you pay only for what you use.
Scalable Parallelism: Step Functions can orchestrate and run multiple Athena queries in parallel, so you can compact many tables or partitions at once — no bottlenecks.
Automation: With Step Functions, you can schedule or trigger compaction jobs, integrate with other AWS services, and handle errors or retries automatically.
Simplicity: No need to spin up Spark or EMR clusters just to run compaction — Athena’s OPTIMIZE statement does the heavy lifting.
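For a single table, compaction is a one-line statement. Here's what it looks like (the database and table names are placeholders):

```sql
-- Rewrite small data files into larger ones using the bin-pack strategy.
-- Optionally add a WHERE clause to limit the rewrite to specific partitions.
OPTIMIZE my_db.my_iceberg_table REWRITE DATA USING BIN_PACK;
```

The interesting part is running many of these statements at once across your data lake, which is where Step Functions comes in.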
Step 1: Create Sample Iceberg Tables
Let's create two sample Iceberg tables in Athena. These will be stored in S3 and use the Parquet format.
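Here's a sketch of the DDL. The database name, table names, and S3 bucket are placeholders, so adjust them for your environment and run each statement as a separate query in the Athena console:

```sql
-- Two sample Iceberg tables backed by Parquet files on S3.
-- 'table_type' = 'ICEBERG' tells Athena to create an Iceberg table rather than a Hive table.
CREATE TABLE my_db.customers_iceberg (
  customer_id bigint,
  name        string,
  updated_at  timestamp
)
LOCATION 's3://my-data-lake/iceberg/customers_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG', 'format' = 'parquet');

CREATE TABLE my_db.orders_iceberg (
  order_id    bigint,
  customer_id bigint,
  order_total double,
  order_date  date
)
LOCATION 's3://my-data-lake/iceberg/orders_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG', 'format' = 'parquet');
```

Once data lands in these tables through frequent small writes (streaming ingestion, trickle updates, MERGE-heavy pipelines), they will accumulate exactly the small files that compaction is meant to clean up.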
Step 2: Create a Generic Step Functions Workflow
Here’s a generic Step Functions state machine that can run multiple Athena queries in parallel. Each query can specify its own database, SQL, and S3 output location.
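Below is a minimal sketch of the state machine definition. The input shape ($.jobs with query, database, and output_location fields) is my own convention, and the retry settings are just an example, so tune both for your workload. The state machine's execution role needs permission to run Athena queries and to access the Glue Data Catalog and the S3 locations involved.

```json
{
  "Comment": "Run a batch of Athena queries in parallel with a Map state (sketch; names and retry values are placeholders)",
  "StartAt": "CompactTables",
  "States": {
    "CompactTables": {
      "Type": "Map",
      "ItemsPath": "$.jobs",
      "MaxConcurrency": 10,
      "Iterator": {
        "StartAt": "RunAthenaQuery",
        "States": {
          "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
              "QueryString.$": "$.query",
              "QueryExecutionContext": { "Database.$": "$.database" },
              "ResultConfiguration": { "OutputLocation.$": "$.output_location" }
            },
            "Retry": [
              {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 30,
                "MaxAttempts": 2,
                "BackoffRate": 2.0
              }
            ],
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```

The startQueryExecution.sync integration makes each Map iteration wait for its Athena query to finish, so a failed compaction shows up as a failed execution rather than going unnoticed.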
Map State: Orchestrates parallel execution of Athena queries.
Dynamic Parameters: Each job specifies its own SQL, database, and S3 output path.
Serverless: No EC2, no clusters — everything runs in managed AWS services.
Step 3: Fire Off Parallel Compaction Jobs
To compact both tables in parallel, use the following input payload when starting your Step Functions execution:
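Here's an example payload that matches the state machine sketch above; the database, table, and bucket names are the placeholders from Step 1. OPTIMIZE ... REWRITE DATA USING BIN_PACK is Athena's built-in compaction statement for Iceberg tables:

```json
{
  "jobs": [
    {
      "database": "my_db",
      "query": "OPTIMIZE my_db.customers_iceberg REWRITE DATA USING BIN_PACK",
      "output_location": "s3://my-athena-query-results/compaction/"
    },
    {
      "database": "my_db",
      "query": "OPTIMIZE my_db.orders_iceberg REWRITE DATA USING BIN_PACK",
      "output_location": "s3://my-athena-query-results/compaction/"
    }
  ]
}
```

You can start the execution from the console, on a schedule via EventBridge, or from the CLI, for example: aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input file://compaction-jobs.json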
Each Athena query runs independently and in parallel, compacting the small files in its Iceberg table using the bin-pack rewrite strategy.
Why This Pattern Scales
Parallel Execution: Step Functions' Map state can run dozens of jobs in parallel (an inline Map state supports up to 40 concurrent iterations, and Distributed Map scales to thousands), limited only by your concurrency settings and service quotas.
Serverless Compute: Athena automatically scales to handle query load — no need to manage resources or worry about overprovisioning.
Flexible Orchestration: You can easily add error handling, retries, notifications, or integrate with other AWS services.
No Infrastructure Management: All components (Athena, Step Functions, S3) are fully managed by AWS, letting you focus on data engineering, not DevOps.
Wrapping Up
With this approach, you can automate and scale Iceberg table compaction across your entire data lake. By leveraging the power of AWS Step Functions and Athena, you get a fully serverless, parallel, and maintainable solution for keeping your analytic tables fast and efficient.
Ready to take your data lake to the next level? Try this pattern in your own environment and see how easy it is to keep your Iceberg tables optimized!