Parallel Iceberg Table Compaction with AWS Step Functions and Athena
Optimizing data lakes for performance and cost is a critical task for any data engineering team. Apache Iceberg, an open table format for large analytic datasets, makes it easier to manage data at scale on Amazon S3. However, as data grows and changes, small file problems can degrade query performance. Compaction — merging small files into larger ones — is key to maintaining fast, efficient queries.
In this post, I’ll show you how to build a serverless, scalable solution for compacting multiple Iceberg tables in parallel using Amazon Athena and AWS Step Functions. This approach lets you automate and orchestrate compaction jobs without managing any servers, and it easily scales to handle many tables at once.
Why Use Athena and Step Functions for Iceberg Compaction?
Serverless: Athena is a fully managed, serverless query engine — no infrastructure to manage, and you pay only for what you use.
Scalable Parallelism: Step Functions can orchestrate and run multiple Athena queries in parallel, so you can compact many tables or partitions at once — no bottlenecks.
Automation: With Step Functions, you can schedule or trigger compaction jobs, integrate with other AWS services, and handle errors or retries automatically.
Simplicity: No need to spin up Spark or EMR clusters just to run compaction — Athena’s OPTIMIZE statement does the heavy lifting.
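For a single table, compaction is a one-line statement. Here's what it looks like (the database and table names are placeholders):

```sql
-- Rewrite small data files into larger ones using the bin-pack strategy.
-- Optionally add a WHERE clause to limit the rewrite to specific partitions.
OPTIMIZE my_db.my_iceberg_table REWRITE DATA USING BIN_PACK;
```

The interesting part is running many of these statements at once across your data lake, which is where Step Functions comes in.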
Step 1: Create Sample Iceberg Tables
Let's create two sample Iceberg tables in Athena. These will be stored in S3 and use the Parquet format.
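Here's a sketch of the DDL. The database name, table names, and S3 bucket are placeholders, so adjust them for your environment and run each statement as a separate query in the Athena console:

```sql
-- Two sample Iceberg tables backed by Parquet files on S3.
-- 'table_type' = 'ICEBERG' tells Athena to create an Iceberg table rather than a Hive table.
CREATE TABLE my_db.customers_iceberg (
  customer_id bigint,
  name        string,
  updated_at  timestamp
)
LOCATION 's3://my-data-lake/iceberg/customers_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG', 'format' = 'parquet');

CREATE TABLE my_db.orders_iceberg (
  order_id    bigint,
  customer_id bigint,
  order_total double,
  order_date  date
)
LOCATION 's3://my-data-lake/iceberg/orders_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG', 'format' = 'parquet');
```

Once data lands in these tables through frequent small writes (streaming ingestion, trickle updates, MERGE-heavy pipelines), they will accumulate exactly the small files that compaction is meant to clean up.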
Step 2: Create a Generic Step Functions Workflow
Here’s a generic Step Functions state machine that can run multiple Athena queries in parallel. Each query can specify its own database, SQL, and S3 output location.
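Below is a minimal sketch of the state machine definition. The input shape ($.jobs with query, database, and output_location fields) is my own convention, and the retry settings are just an example, so tune both for your workload. The state machine's execution role needs permission to run Athena queries and to access the Glue Data Catalog and the S3 locations involved.

```json
{
  "Comment": "Run a batch of Athena queries in parallel with a Map state (sketch; names and retry values are placeholders)",
  "StartAt": "CompactTables",
  "States": {
    "CompactTables": {
      "Type": "Map",
      "ItemsPath": "$.jobs",
      "MaxConcurrency": 10,
      "Iterator": {
        "StartAt": "RunAthenaQuery",
        "States": {
          "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
              "QueryString.$": "$.query",
              "QueryExecutionContext": { "Database.$": "$.database" },
              "ResultConfiguration": { "OutputLocation.$": "$.output_location" }
            },
            "Retry": [
              {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 30,
                "MaxAttempts": 2,
                "BackoffRate": 2.0
              }
            ],
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```

The startQueryExecution.sync integration makes each Map iteration wait for its Athena query to finish, so a failed compaction shows up as a failed execution rather than going unnoticed.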
Map State: Orchestrates parallel execution of Athena queries.
Dynamic Parameters: Each job specifies its own SQL, database, and S3 output path.
Serverless: No EC2, no clusters — everything runs in managed AWS services.
Step 3: Fire Off Parallel Compaction Jobs
To compact both tables in parallel, use the following input payload when starting your Step Functions execution:
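Here's an example payload that matches the state machine sketch above; the database, table, and bucket names are the placeholders from Step 1. OPTIMIZE ... REWRITE DATA USING BIN_PACK is Athena's built-in compaction statement for Iceberg tables:

```json
{
  "jobs": [
    {
      "database": "my_db",
      "query": "OPTIMIZE my_db.customers_iceberg REWRITE DATA USING BIN_PACK",
      "output_location": "s3://my-athena-query-results/compaction/"
    },
    {
      "database": "my_db",
      "query": "OPTIMIZE my_db.orders_iceberg REWRITE DATA USING BIN_PACK",
      "output_location": "s3://my-athena-query-results/compaction/"
    }
  ]
}
```

You can start the execution from the console, on a schedule via EventBridge, or from the CLI, for example: aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input file://compaction-jobs.json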
Each Athena query runs independently and in parallel, compacting the small files in its Iceberg table using the bin-pack rewrite strategy.
Why This Pattern Scales
Parallel Execution: Step Functions' Map state can run dozens of jobs in parallel (an inline Map state supports up to 40 concurrent iterations, and Distributed Map scales to thousands), limited only by your concurrency settings and service quotas.
Serverless Compute: Athena automatically scales to handle query load — no need to manage resources or worry about overprovisioning.
Flexible Orchestration: You can easily add error handling, retries, notifications, or integrate with other AWS services.
No Infrastructure Management: All components (Athena, Step Functions, S3) are fully managed by AWS, letting you focus on data engineering, not DevOps.
Wrapping Up
With this approach, you can automate and scale Iceberg table compaction across your entire data lake. By leveraging the power of AWS Step Functions and Athena, you get a fully serverless, parallel, and maintainable solution for keeping your analytic tables fast and efficient.
Ready to take your data lake to the next level? Try this pattern in your own environment and see how easy it is to keep your Iceberg tables optimized!