Experiment: S3 Tables with Incremental Loads up to 520GB At Zeta Global
The Experiment
This experiment explores how merge operations perform on large volumes of our test data using S3 Tables. Below are our findings. Please note this is for educational purposes only and is not intended as a recommendation or suggestion.
We were curious to see how S3 Tables would perform at scale, so we decided to run a test. We started with 850GB of raw data in S3, loaded it into S3 Tables, and then ran incremental merges of various volumes to see how the system would handle the load.
This article presents the results of a comprehensive stress test conducted on our EMR cluster. The objective was to evaluate Apache Iceberg’s merge performance on a baseline table of 850GB with incremental merge data sizes varying from 25GB to 528GB.
Key insights include:
Stable performance scaling with incremental data size
Detailed timings for read, merge, deduplication, and archival phases
Balanced data distribution across partitions
Test Environment
EMR Cluster Configuration
Spark Configuration
Deploy mode: client
Driver memory: 14GB
Executor memory: 18GB
Executor cores: 7
Number of executors: 20
Step concurrency: 50 concurrent steps
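Translated into a PySpark session, these settings might look like the sketch below. The application name, catalog name, and warehouse ARN are illustrative placeholders, and in client deploy mode the driver memory is normally supplied at launch via spark-submit rather than in code; verify the S3 Tables catalog wiring against the AWS documentation for your setup.

```python
from pyspark.sql import SparkSession

# Illustrative session configuration mirroring the cluster settings above.
# Catalog name and warehouse ARN are placeholders, not production values.
spark = (
    SparkSession.builder
    .appName("iceberg-incremental-merge")
    .config("spark.executor.memory", "18g")
    .config("spark.executor.cores", "7")
    .config("spark.executor.instances", "20")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/your-table-bucket")
    .getOrCreate()
)
```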
Iceberg Table Schema
Baseline Table
Size: 850GB Raw Data
Records: 550+ million
Bootstrap Time (one-time load):
S3 → Iceberg: 27m 08s
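The full schema is not reproduced here, but a representative DDL for a table bucketed on user_id (see the partition stats below) and configured for merge-on-read would look roughly like this; the namespace, table name, columns, and bucket count are illustrative assumptions, not the production schema.

```python
# Representative DDL only: namespace, table name, columns, and bucket count
# are illustrative placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS s3tables.test_ns.user_events (
        user_id   STRING,
        event_id  STRING,
        event_ts  TIMESTAMP,
        payload   STRING
    )
    USING iceberg
    PARTITIONED BY (bucket(16, user_id))
    TBLPROPERTIES (
        'write.merge.mode'  = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read'
    )
""")
```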
Processing Workflow
Lock Acquisition (Pessimistic Concurrency Control): Before any processing begins, the system attempts to acquire a pessimistic lock to ensure only one job operates on a given manifest path at a time. This prevents race conditions and avoids issues with stale or overlapping manifests. If the lock cannot be acquired (e.g., due to a stale lock, orphaned job, or in-progress run), the job skips execution to maintain integrity.
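The lock store itself is not detailed here; a minimal sketch of the pattern, assuming a hypothetical DynamoDB table named manifest_locks and a conditional write, could look like this:

```python
import time
import boto3
from botocore.exceptions import ClientError

# Hypothetical lock helper: the lock store is an assumption (a DynamoDB table
# keyed by manifest path), used only to illustrate pessimistic locking.
dynamodb = boto3.client("dynamodb")

def acquire_lock(manifest_path: str, owner: str, ttl_seconds: int = 3600) -> bool:
    try:
        dynamodb.put_item(
            TableName="manifest_locks",
            Item={
                "lock_key": {"S": manifest_path},
                "owner": {"S": owner},
                "expires_at": {"N": str(int(time.time()) + ttl_seconds)},
            },
            # Fails if another job already holds the lock for this path.
            ConditionExpression="attribute_not_exists(lock_key)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # Lock held elsewhere: skip this run to protect integrity.
        raise
```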
Manifest Creation: Once the lock is successfully acquired, we generate a pending manifest file that lists the new incoming S3 files for the incremental batch. Each manifest is created based on a volume threshold of N = 50,000 MB (i.e., ~50GB of data). This batching strategy enables predictable processing time and controlled resource usage. If any failure occurs during downstream processing, the corresponding manifest is automatically moved to an error folder, allowing for easy inspection and retry without impacting new data ingestion.
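A simplified sketch of the batching logic, assuming the incoming files live under a single S3 prefix (the bucket, prefix, and helper name are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
VOLUME_THRESHOLD_MB = 50_000  # N = 50,000 MB (~50GB) per manifest, as above

def build_pending_manifests(bucket: str, prefix: str) -> list[list[str]]:
    """Group new S3 files into manifests of roughly 50GB each."""
    manifests, current, current_mb = [], [], 0.0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            size_mb = obj["Size"] / (1024 * 1024)
            # Start a new manifest once the volume threshold would be exceeded.
            if current and current_mb + size_mb > VOLUME_THRESHOLD_MB:
                manifests.append(current)
                current, current_mb = [], 0.0
            current.append(f"s3://{bucket}/{obj['Key']}")
            current_mb += size_mb
    if current:
        manifests.append(current)
    return manifests
```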
Data Ingestion and Deduplication: The files listed in the manifest are read into Spark as a DataFrame. We apply business logic to deduplicate records using defined keys, ensuring only the most recent or relevant version of each record is retained.
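Continuing the Spark session from the configuration sketch above, a minimal deduplication pass might use a window over the business keys; the key and ordering columns here are assumptions, not the actual business logic.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def deduplicate(df):
    # Keep only the latest record per (user_id, event_id); key and ordering
    # columns are illustrative assumptions.
    w = Window.partitionBy("user_id", "event_id").orderBy(F.col("event_ts").desc())
    return (
        df.withColumn("_rn", F.row_number().over(w))
          .filter(F.col("_rn") == 1)
          .drop("_rn")
    )

# Hypothetical paths taken from the pending manifest built earlier.
manifest_files = ["s3://your-bucket/incoming/part-00000.parquet"]
incoming = spark.read.parquet(*manifest_files)
deduped = deduplicate(incoming)
```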
Merge into Iceberg Table: Using Apache Iceberg's MERGE INTO in merge-on-read mode, the deduplicated data is integrated into the existing Iceberg table. This mode allows us to update and insert data without rewriting the full dataset.
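With the deduplicated batch registered as a temporary view, the merge is a single MERGE INTO statement against the table sketched earlier; the join keys are illustrative.

```python
# Merge-on-read: matched rows are updated via delete files and unmatched rows
# are appended, so the full table is never rewritten.
deduped.createOrReplaceTempView("incoming_batch")

spark.sql("""
    MERGE INTO s3tables.test_ns.user_events AS target
    USING incoming_batch AS source
    ON  target.user_id  = source.user_id
    AND target.event_id = source.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```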
Archival of Processed Files:After a successful merge, the original input files are moved to an archival S3 path.
This step guarantees:
No duplicate processing in future runs
Full traceability of what was processed
Logical separation of pending vs. processed data
Performance Results
Total Time: Total duration spent on reading data, deduplicating records, merging data, and archiving processed files.
Read Time: Time taken to read the data into a Spark DataFrame and deduplicate the incoming batch.
Merge Time: Time spent merging incremental batches into the target table.
Archive Time: Duration taken to archive files into the archived directory so that new files can be processed.
Input Rows: Total number of records in the incoming incremental batch.
Throughput and Efficiency Metrics
Data Distribution and Partition Stats
Partition Details
Note: Data is evenly distributed across buckets based on the user_id column (post-compaction).
Processing Time vs. Data Size
Description: This graph illustrates how the total processing time scales as the size of incremental data increases. It highlights the relationship between data volume and merge performance, helping to identify any non-linear slowdowns or bottlenecks.
Throughput (GB/min)
Description: This graph shows the throughput measured in gigabytes per minute during the merge operations at various incremental data sizes. It helps visualize the system’s efficiency and capacity to handle increasing workloads.
Archival Optimization
During our initial testing, we observed that archival times were consuming a significant portion of the overall processing time, especially as data volumes increased. To address this bottleneck, we implemented an asynchronous archival optimization strategy.
The Challenge
In our original synchronous approach, the archival process involved copying files from the source location to the archive directory and then deleting the original files. This approach, while reliable, created a processing bottleneck that grew proportionally with the number of files being processed.
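A simplified version of that synchronous step, using boto3's copy-then-delete pattern (the bucket and archive prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def archive_file_sync(bucket: str, key: str, archive_prefix: str = "archive/") -> None:
    """Synchronous archival: copy the processed file to the archive prefix,
    then delete the original. Runs inline with the pipeline, so it blocks."""
    s3.copy_object(
        Bucket=bucket,
        Key=archive_prefix + key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    s3.delete_object(Bucket=bucket, Key=key)
```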
The Solution
We decoupled the archival process from the main data processing pipeline by implementing an asynchronous tagging-based approach:
Immediate Tagging: Instead of physically moving files during processing, we simply tag files as “processed” using S3’s put_object_tagging API
Asynchronous Cleanup: A separate cleanup process runs independently to move tagged files to the archive directory
Non-blocking Pipeline: The main processing pipeline can continue immediately after tagging, significantly reducing end-to-end processing time
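A condensed sketch of both halves, using S3's put_object_tagging as described above; the tag key/value and the cleanup pass are illustrative choices, not the exact production implementation.

```python
import boto3

s3 = boto3.client("s3")

def tag_as_processed(bucket: str, key: str) -> None:
    """Pipeline side: tag the file and return immediately (non-blocking)."""
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "status", "Value": "processed"}]},
    )

def cleanup_tagged_files(bucket: str, prefix: str, archive_prefix: str = "archive/") -> None:
    """Independent cleanup job: move any file tagged 'processed' to the archive."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            tags = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])["TagSet"]
            if any(t["Key"] == "status" and t["Value"] == "processed" for t in tags):
                s3.copy_object(
                    Bucket=bucket,
                    Key=archive_prefix + obj["Key"],
                    CopySource={"Bucket": bucket, "Key": obj["Key"]},
                )
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
```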
Key Benefits
Reduced Processing Time: The asynchronous approach reduced archival time by approximately 80%, leading to faster overall pipeline execution
Improved Scalability: The tagging approach scales better with increasing file counts
Pipeline Decoupling: Processing and archival concerns are separated, improving system reliability
Resource Optimization: The main processing cluster resources are freed up faster for subsequent jobs
This optimization demonstrates how small architectural changes can yield significant performance improvements in large-scale data processing workflows.
Conclusion
The stress test results demonstrate that Apache Iceberg's merge-on-read capabilities scale robustly on our EMR cluster, effectively handling incremental merges from 25GB up to over 500GB on top of a baseline table of roughly 450GB in Iceberg storage (850GB of raw input).
Key takeaways include:
Consistent scaling: Processing time and throughput grow predictably with increasing data size, with no significant performance degradation observed even at large incremental loads.
Efficient resource utilization: The cluster configuration and Spark tuning support stable merge performance under high concurrency, leveraging optimized memory and CPU allocation.
Balanced data distribution: Partitioning by user_id bucket ensures evenly distributed workloads across files and partitions, contributing to efficient parallel processing.
Potential optimization areas: Archive times increase proportionally with data size, suggesting future focus on storage or archival tuning may further improve end-to-end merge times.
Overall, this analysis validates our current Iceberg implementation as scalable and performant for large-scale incremental merges. These findings provide a strong foundation to continue expanding data volumes while maintaining timely and reliable data processing.
About Zeta Global
Zeta Global (NYSE: ZETA) is the AI Marketing Cloud that leverages advanced artificial intelligence (AI) and trillions of consumer signals to make it easier for marketers to acquire, grow, and retain customers more efficiently. Through the Zeta Marketing Platform (ZMP), our vision is to make sophisticated marketing simple by unifying identity, intelligence, and omnichannel activation into a single platform — powered by one of the industry’s largest proprietary databases and AI. Our enterprise customers across multiple verticals are empowered to personalize experiences with consumers at an individual level across every channel, delivering better results for marketing programs.
About Authors
Sekhar Sahu, Manoj Agarwal, and Vachaspathy Kuntamukkala (Amazon Web Services and Zeta Global)