Experiment: S3 Tables with Incremental Loads up to 520GB At Zeta Global
The Experiment
This experiment explores how merge operations perform on large volumes of our test data using S3 Tables. Below are our findings. Please note this is for educational purposes only and is not intended as a recommendation or suggestion.
We were curious to see how S3 Tables would perform at scale, so we decided to run a test. We started with 850GB of raw data in S3, loaded it into S3 Tables, and then ran incremental merges of various volumes to see how the system would handle the load.
This article presents the results of a comprehensive stress test conducted on our EMR cluster. The objective was to evaluate Apache Iceberg’s merge performance on a baseline table of 850GB with incremental merge data sizes varying from 25GB to 528GB.
Key insights include:
Stable performance scaling with incremental data size
Detailed timings for read, merge, deduplication, and archival phases
Balanced data distribution across partitions
Test Environment
EMR Cluster Configuration
Spark Configuration
Deploy mode: client
Driver memory: 14GB
Executor memory: 18GB
Executor cores: 7
Number of executors: 20
Step concurrency: 50 concurrent steps
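Translated into a PySpark session, these settings might look like the sketch below. The application name, catalog name, and warehouse ARN are illustrative placeholders, and in client deploy mode the driver memory is normally supplied at launch via spark-submit rather than in code; verify the S3 Tables catalog wiring against the AWS documentation for your setup.

```python
from pyspark.sql import SparkSession

# Illustrative session configuration mirroring the cluster settings above.
# Catalog name and warehouse ARN are placeholders, not production values.
spark = (
    SparkSession.builder
    .appName("iceberg-incremental-merge")
    .config("spark.executor.memory", "18g")
    .config("spark.executor.cores", "7")
    .config("spark.executor.instances", "20")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/your-table-bucket")
    .getOrCreate()
)
```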
Iceberg Table Schema
Baseline Table
Size: 850GB Raw Data
Records: 550+ million
Bootstrap Time (one-time load):
S3 → Iceberg: 27m 08s
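The full schema is not reproduced here, but a representative DDL for a table bucketed on user_id (see the partition stats below) and configured for merge-on-read would look roughly like this; the namespace, table name, columns, and bucket count are illustrative assumptions, not the production schema.

```python
# Representative DDL only: namespace, table name, columns, and bucket count
# are illustrative placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS s3tables.test_ns.user_events (
        user_id   STRING,
        event_id  STRING,
        event_ts  TIMESTAMP,
        payload   STRING
    )
    USING iceberg
    PARTITIONED BY (bucket(16, user_id))
    TBLPROPERTIES (
        'write.merge.mode'  = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read'
    )
""")
```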
Processing Workflow
Lock Acquisition (Pessimistic Concurrency Control): Before any processing begins, the system attempts to acquire a pessimistic lock to ensure only one job operates on a given manifest path at a time. This prevents race conditions and avoids issues with stale or overlapping manifests. If the lock cannot be acquired (e.g., due to a stale lock, orphaned job, or in-progress run), the job skips execution to maintain integrity.
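The lock store itself is not detailed here; a minimal sketch of the pattern, assuming a hypothetical DynamoDB table named manifest_locks and a conditional write, could look like this:

```python
import time
import boto3
from botocore.exceptions import ClientError

# Hypothetical lock helper: the lock store is an assumption (a DynamoDB table
# keyed by manifest path), used only to illustrate pessimistic locking.
dynamodb = boto3.client("dynamodb")

def acquire_lock(manifest_path: str, owner: str, ttl_seconds: int = 3600) -> bool:
    try:
        dynamodb.put_item(
            TableName="manifest_locks",
            Item={
                "lock_key": {"S": manifest_path},
                "owner": {"S": owner},
                "expires_at": {"N": str(int(time.time()) + ttl_seconds)},
            },
            # Fails if another job already holds the lock for this path.
            ConditionExpression="attribute_not_exists(lock_key)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # Lock held elsewhere: skip this run to protect integrity.
        raise
```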
Manifest Creation: Once the lock is successfully acquired, we generate a pending manifest file that lists the new incoming S3 files for the incremental batch. Each manifest is created based on a volume threshold of N = 50,000 MB (i.e., ~50GB of data). This batching strategy enables predictable processing time and controlled resource usage. If any failure occurs during downstream processing, the corresponding manifest is automatically moved to an error folder, allowing for easy inspection and retry without impacting new data ingestion.
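A simplified sketch of the batching logic, assuming the incoming files live under a single S3 prefix (the bucket, prefix, and helper name are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
VOLUME_THRESHOLD_MB = 50_000  # N = 50,000 MB (~50GB) per manifest, as above

def build_pending_manifests(bucket: str, prefix: str) -> list[list[str]]:
    """Group new S3 files into manifests of roughly 50GB each."""
    manifests, current, current_mb = [], [], 0.0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            size_mb = obj["Size"] / (1024 * 1024)
            # Start a new manifest once the volume threshold would be exceeded.
            if current and current_mb + size_mb > VOLUME_THRESHOLD_MB:
                manifests.append(current)
                current, current_mb = [], 0.0
            current.append(f"s3://{bucket}/{obj['Key']}")
            current_mb += size_mb
    if current:
        manifests.append(current)
    return manifests
```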
Data Ingestion and Deduplication: The files listed in the manifest are read into Spark as a DataFrame. We apply business logic to deduplicate records using defined keys, ensuring only the most recent or relevant version of each record is retained.
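Continuing the Spark session from the configuration sketch above, a minimal deduplication pass might use a window over the business keys; the key and ordering columns here are assumptions, not the actual business logic.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def deduplicate(df):
    # Keep only the latest record per (user_id, event_id); key and ordering
    # columns are illustrative assumptions.
    w = Window.partitionBy("user_id", "event_id").orderBy(F.col("event_ts").desc())
    return (
        df.withColumn("_rn", F.row_number().over(w))
          .filter(F.col("_rn") == 1)
          .drop("_rn")
    )

# Hypothetical paths taken from the pending manifest built earlier.
manifest_files = ["s3://your-bucket/incoming/part-00000.parquet"]
incoming = spark.read.parquet(*manifest_files)
deduped = deduplicate(incoming)
```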
Merge into Iceberg Table: Using Apache Iceberg's MERGE INTO in merge-on-read mode, the deduplicated data is integrated into the existing Iceberg table. This mode allows us to update and insert data without rewriting the full dataset.
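With the deduplicated batch registered as a temporary view, the merge is a single MERGE INTO statement against the table sketched earlier; the join keys are illustrative.

```python
# Merge-on-read: matched rows are updated via delete files and unmatched rows
# are appended, so the full table is never rewritten.
deduped.createOrReplaceTempView("incoming_batch")

spark.sql("""
    MERGE INTO s3tables.test_ns.user_events AS target
    USING incoming_batch AS source
    ON  target.user_id  = source.user_id
    AND target.event_id = source.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```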
Archival of Processed Files:After a successful merge, the original input files are moved to an archival S3 path.
This step guarantees:
No duplicate processing in future runs
Full traceability of what was processed
Logical separation of pending vs. processed data
Performance Results
Total Time: Total duration spent on reading data, deduplicating records, merging data, and archiving processed files.
Read Time: Time taken to read the data into a Spark DataFrame and deduplicate the incoming batch.
Merge Time: Time spent merging incremental batches into the target table.
Archive Time: Duration taken to archive files into the archived directory so that new files can be processed.
Input Rows: Total number of records in the incoming incremental batch.
Throughput and Efficiency Metrics
Data Distribution and Partition Stats
Partition Details
Note: Data is evenly distributed across buckets based on the user_id column (post-compaction).
Processing Time vs. Data Size
Description: This graph illustrates how the total processing time scales as the size of incremental data increases. It highlights the relationship between data volume and merge performance, helping to identify any non-linear slowdowns or bottlenecks.
Throughput (GB/min)
Description: This graph shows the throughput measured in gigabytes per minute during the merge operations at various incremental data sizes. It helps visualize the system’s efficiency and capacity to handle increasing workloads.
Archival Optimization
During our initial testing, we observed that archival times were consuming a significant portion of the overall processing time, especially as data volumes increased. To address this bottleneck, we implemented an asynchronous archival optimization strategy.
The Challenge
In our original synchronous approach, the archival process involved copying files from the source location to the archive directory and then deleting the original files. This approach, while reliable, created a processing bottleneck that grew proportionally with the number of files being processed.
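A simplified version of that synchronous step, using boto3's copy-then-delete pattern (the bucket and archive prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def archive_file_sync(bucket: str, key: str, archive_prefix: str = "archive/") -> None:
    """Synchronous archival: copy the processed file to the archive prefix,
    then delete the original. Runs inline with the pipeline, so it blocks."""
    s3.copy_object(
        Bucket=bucket,
        Key=archive_prefix + key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    s3.delete_object(Bucket=bucket, Key=key)
```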
The Solution
We decoupled the archival process from the main data processing pipeline by implementing an asynchronous tagging-based approach:
Immediate Tagging: Instead of physically moving files during processing, we simply tag files as “processed” using S3’s put_object_tagging API
Asynchronous Cleanup: A separate cleanup process runs independently to move tagged files to the archive directory
Non-blocking Pipeline: The main processing pipeline can continue immediately after tagging, significantly reducing end-to-end processing time
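A condensed sketch of both halves, using S3's put_object_tagging as described above; the tag key/value and the cleanup pass are illustrative choices, not the exact production implementation.

```python
import boto3

s3 = boto3.client("s3")

def tag_as_processed(bucket: str, key: str) -> None:
    """Pipeline side: tag the file and return immediately (non-blocking)."""
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "status", "Value": "processed"}]},
    )

def cleanup_tagged_files(bucket: str, prefix: str, archive_prefix: str = "archive/") -> None:
    """Independent cleanup job: move any file tagged 'processed' to the archive."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            tags = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])["TagSet"]
            if any(t["Key"] == "status" and t["Value"] == "processed" for t in tags):
                s3.copy_object(
                    Bucket=bucket,
                    Key=archive_prefix + obj["Key"],
                    CopySource={"Bucket": bucket, "Key": obj["Key"]},
                )
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
```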
Key Benefits
Reduced Processing Time: The asynchronous approach reduced archival time by approximately 80%, leading to faster overall pipeline execution
Improved Scalability: The tagging approach scales better with increasing file counts
Pipeline Decoupling: Processing and archival concerns are separated, improving system reliability
Resource Optimization: The main processing cluster resources are freed up faster for subsequent jobs
This optimization demonstrates how small architectural changes can yield significant performance improvements in large-scale data processing workflows.
Conclusion
The stress test results demonstrate that Apache Iceberg's merge-on-read capabilities scale robustly on our EMR cluster, effectively handling incremental merges from 25GB up to over 500GB on top of a baseline table of roughly 450GB in Iceberg storage (850GB of raw input).
Key takeaways include:
Consistent scaling: Processing time and throughput grow predictably with increasing data size, with no significant performance degradation observed even at large incremental loads.
Efficient resource utilization: The cluster configuration and Spark tuning support stable merge performance under high concurrency, leveraging optimized memory and CPU allocation.
Balanced data distribution: Partitioning by user_id bucket ensures evenly distributed workloads across files and partitions, contributing to efficient parallel processing.
Potential optimization areas: Archive times increase proportionally with data size, suggesting future focus on storage or archival tuning may further improve end-to-end merge times.
Overall, this analysis validates our current Iceberg implementation as scalable and performant for large-scale incremental merges. These findings provide a strong foundation to continue expanding data volumes while maintaining timely and reliable data processing.
About Zeta Global
Zeta Global (NYSE: ZETA) is the AI Marketing Cloud that leverages advanced artificial intelligence (AI) and trillions of consumer signals to make it easier for marketers to acquire, grow, and retain customers more efficiently. Through the Zeta Marketing Platform (ZMP), our vision is to make sophisticated marketing simple by unifying identity, intelligence, and omnichannel activation into a single platform — powered by one of the industry’s largest proprietary databases and AI. Our enterprise customers across multiple verticals are empowered to personalize experiences with consumers at an individual level across every channel, delivering better results for marketing programs.
About Authors
Sekhar Sahu, Manoj Agarwal, and Vachaspathy Kuntamukkala (Amazon Web Services and Zeta Global)