Experiment: S3 Tables with Incremental Loads up to 520GB at Zeta Global


The Experiment

This experiment explores how merge operations perform on large volumes of our test data using S3 Tables. Below are our findings. Please note this is for educational purposes only and is not intended as a recommendation or suggestion.

We were curious to see how S3 Tables would perform at scale, so we ran a test. We started with 850GB of raw data in S3, loaded it into S3 Tables, and then ran incremental merges of varying volumes to see how the system would handle the load.

This article presents the results of a comprehensive stress test conducted on our EMR cluster. The objective was to evaluate Apache Iceberg’s merge performance on a baseline table of 850GB with incremental merge data sizes varying from 25GB to 528GB.


Key insights include:

  • Stable performance scaling with incremental data size

  • Detailed timings for read, merge, deduplication, and archival phases

  • Balanced data distribution across partitions

Test Environment

EMR Cluster Configuration

Spark Configuration

  • Deploy mode: client

  • Driver memory: 14GB

  • Executor memory: 18GB

  • Executor cores: 7

  • Number of executors: 20

  • Step concurrency: 50 concurrent steps
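
As a rough illustration, the settings above might be expressed when building the Spark session. This is a sketch only: the application name and the Iceberg extension wiring are assumptions, not the actual job configuration.

```python
from pyspark.sql import SparkSession

# A sketch of the listed settings. Note: in client deploy mode,
# spark.driver.memory must be supplied to spark-submit (e.g.
# --driver-memory 14g), since the driver JVM is already running
# by the time this code executes.
spark = (
    SparkSession.builder
    .appName("iceberg-incremental-merge")  # hypothetical name
    .config("spark.executor.memory", "18g")
    .config("spark.executor.cores", "7")
    .config("spark.executor.instances", "20")
    # Iceberg's SQL extensions enable MERGE INTO (assumed setup).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)
```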

Iceberg Table Schema
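
The table is partitioned by buckets of the user_id column and merged in merge-on-read mode (both described below). A minimal sketch of a comparable table definition, reusing the spark session from the configuration sketch above, could look like the following; the column names, bucket count, and catalog and table names are illustrative assumptions.

```python
# Hypothetical schema: the article only states that data is bucketed on
# user_id and that merges run in merge-on-read mode.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.db.events (
        user_id   BIGINT,
        event_ts  TIMESTAMP,
        payload   STRING
    )
    USING iceberg
    PARTITIONED BY (bucket(32, user_id))
    TBLPROPERTIES (
        'write.merge.mode'  = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read'
    )
""")
```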

Baseline Table

  • Size: 850GB Raw Data

  • Records: 550+ million

  • Bootstrap Time (one-time load), S3 → Iceberg: 27m 08s


Processing Workflow

  1. Lock Acquisition (Pessimistic Concurrency Control): Before any processing begins, the system attempts to acquire a pessimistic lock to ensure only one job operates on a given manifest path at a time. This prevents race conditions and avoids issues with stale or overlapping manifests. If the lock cannot be acquired (e.g., due to a stale lock, an orphaned job, or an in-progress run), the job skips execution to maintain integrity (a sketch of one possible lock implementation follows this workflow).

  2. Manifest Creation: Once the lock is acquired, we generate a pending manifest file that lists the new incoming S3 files for the incremental batch. Each manifest is created based on a volume threshold of N = 50,000 MB (i.e., ~50GB of data). This batching strategy enables predictable processing times and controlled resource usage (see the batching sketch after this workflow). If any failure occurs during downstream processing, the corresponding manifest is automatically moved to an error folder, allowing for easy inspection and retry without impacting new data ingestion.

  3. Data Ingestion and Deduplication: The files listed in the manifest are read into Spark as a DataFrame. We apply business logic to deduplicate records using defined keys, ensuring only the most recent or relevant version of each record is retained.

  4. Merge into Iceberg Table: Using Apache Iceberg’s MERGE INTO in merge-on-read mode, the deduplicated data is integrated into the existing Iceberg table. This mode allows us to update and insert data without rewriting the full dataset (see the merge sketch after this workflow).

  5. Archival of Processed Files: After a successful merge, the original input files are moved to an archival S3 path.

This step guarantees:

  • No duplicate processing in future runs

  • Full traceability of what was processed

  • Logical separation of pending vs. processed data
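
For step 1, the article does not say what backs the pessimistic lock; one common pattern is a DynamoDB conditional write keyed on the manifest path. The table name, attribute names, and TTL policy below are assumptions, not the actual implementation.

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def try_acquire_lock(manifest_path: str, owner: str, ttl_seconds: int = 3600) -> bool:
    """Return True if this job now holds the lock, False if another job does."""
    try:
        dynamodb.put_item(
            TableName="merge_locks",  # hypothetical lock table
            Item={
                "lock_key": {"S": manifest_path},
                "owner": {"S": owner},
                "expires_at": {"N": str(int(time.time()) + ttl_seconds)},
            },
            # Succeed only if no live lock exists (stale locks age out).
            ConditionExpression="attribute_not_exists(lock_key) OR expires_at < :now",
            ExpressionAttributeValues={":now": {"N": str(int(time.time()))}},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another run holds the lock; skip execution
        raise
```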
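
For step 2, a sketch of the volume-threshold batching: pending S3 files are grouped into manifests of at most 50,000 MB each. The bucket, prefixes, and manifest layout are hypothetical.

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
THRESHOLD_BYTES = 50_000 * 1024 * 1024  # N = 50,000 MB per manifest

def build_manifests(bucket: str, prefix: str) -> list[str]:
    """Group pending files into ~50GB batches and write one manifest per batch."""
    batches, current, current_size = [], [], 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if current and current_size + obj["Size"] > THRESHOLD_BYTES:
                batches.append(current)
                current, current_size = [], 0
            current.append(f"s3://{bucket}/{obj['Key']}")
            current_size += obj["Size"]
    if current:
        batches.append(current)

    manifest_keys = []
    for files in batches:
        key = f"manifests/pending/{uuid.uuid4()}.json"  # hypothetical layout
        s3.put_object(Bucket=bucket, Key=key,
                      Body=json.dumps({"files": files}).encode())
        manifest_keys.append(key)
    return manifest_keys
```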
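
For steps 3 and 4, a sketch that keeps the latest record per key and merges the result into the target table. The dedup key (user_id), the ordering column, and the table names are assumptions standing in for the business logic the article leaves unspecified.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Files would come from the pending manifest; a single path stands in here.
files = ["s3://my-bucket/incoming/part-00000.parquet"]  # hypothetical
incoming = spark.read.parquet(*files)

# Keep only the most recent record per user_id (assumed dedup rule).
latest_first = Window.partitionBy("user_id").orderBy(F.col("event_ts").desc())
deduped = (incoming
           .withColumn("rn", F.row_number().over(latest_first))
           .filter("rn = 1")
           .drop("rn"))

deduped.createOrReplaceTempView("incoming_batch")

# Merge-on-read MERGE INTO: changes are written as delete files rather
# than rewriting entire data files.
spark.sql("""
    MERGE INTO glue_catalog.db.events AS t
    USING incoming_batch AS s
      ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```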


Performance Results

  • Total Time: Total duration spent on reading data, deduplicating records, merging data, and archiving processed files.

  • Read Time: Time taken to read the data into a Spark DataFrame and deduplicate the incoming batch.

  • Merge Time: Time spent merging incremental batches into the target table.

  • Archive Time: Time taken to move processed files into the archive directory so that new files can be processed.

  • Input Rows: Total number of records in the incoming incremental batch.


Throughput and Efficiency Metrics

Data Distribution and Partition Stats


Partition Details

Note: Data is evenly distributed across buckets based on the user_id column, post compaction.
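
One way to check this balance is through Iceberg's partitions metadata table, and compaction can be triggered with Iceberg's rewrite_data_files procedure. A sketch follows; the catalog and table names are assumptions carried over from the schema sketch above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-partition record and file counts from Iceberg's metadata table.
spark.sql("""
    SELECT partition, record_count, file_count
    FROM glue_catalog.db.events.partitions
    ORDER BY record_count DESC
""").show(truncate=False)

# Compaction ("post compaction" above), assuming Iceberg's Spark
# procedures are enabled via the SQL extensions.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'db.events')")
```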

Processing Time vs. Data Size

Description: This graph illustrates how the total processing time scales as the size of incremental data increases. It highlights the relationship between data volume and merge performance, helping to identify any non-linear slowdowns or bottlenecks.


Throughput (GB/min)

Description: This graph shows the throughput measured in gigabytes per minute during the merge operations at various incremental data sizes. It helps visualize the system’s efficiency and capacity to handle increasing workloads.

Archival Optimization

During our initial testing, we observed that archival times were consuming a significant portion of the overall processing time, especially as data volumes increased. To address this bottleneck, we implemented an asynchronous archival optimization strategy.

The Challenge

In our original synchronous approach, the archival process involved copying files from the source location to the archive directory and then deleting the original files. This approach, while reliable, created a processing bottleneck that grew proportionally with the number of files being processed.

The Solution

We decoupled the archival process from the main data processing pipeline by implementing an asynchronous tagging-based approach:

  1. Immediate Tagging: Instead of physically moving files during processing, we simply tag files as “processed” using S3’s put_object_tagging API (a minimal sketch follows this list)

  2. Asynchronous Cleanup: A separate cleanup process runs independently to move tagged files to the archive directory

  3. Non-blocking Pipeline: The main processing pipeline can continue immediately after tagging, significantly reducing end-to-end processing time
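
A minimal sketch of the tagging step, using the put_object_tagging API named above; the tag key and value, and the contract with the cleanup job, are assumptions.

```python
import boto3

s3 = boto3.client("s3")

def tag_processed(bucket: str, key: str) -> None:
    """Mark a file as processed so the async cleanup job can archive it later."""
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "status", "Value": "processed"}]},
    )
    # A separate cleanup process later lists objects tagged
    # status=processed, copies them to the archive prefix, and
    # deletes the originals, off the critical path of the pipeline.
```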

Key Benefits

  • Reduced Processing Time: The asynchronous approach reduced archival time by approximately 80%, leading to faster overall pipeline execution

  • Improved Scalability: The tagging approach scales better with increasing file counts

  • Pipeline Decoupling: Processing and archival concerns are separated, improving system reliability

  • Resource Optimization: The main processing cluster resources are freed up faster for subsequent jobs

This optimization demonstrates how small architectural changes can yield significant performance improvements in large-scale data processing workflows.

Conclusion

The stress test results demonstrate that Apache Iceberg’s merge-on-read capabilities scale robustly on our EMR cluster, effectively handling incremental merges from 25GB up to over 500GB on top of a 450GB baseline table (the size of the data as stored in Iceberg; the raw input was 850GB).

Key takeaways include:

  • Consistent scaling: Processing time and throughput grow predictably with increasing data size, with no significant performance degradation observed even at large incremental loads.

  • Efficient resource utilization: The cluster configuration and Spark tuning support stable merge performance under high concurrency, leveraging optimized memory and CPU allocation.

  • Balanced data distribution: Partitioning by user_id bucket ensures evenly distributed workloads across files and partitions, contributing to efficient parallel processing.

  • Potential optimization areas: Archive times increase proportionally with data size, suggesting future focus on storage or archival tuning may further improve end-to-end merge times.

Overall, this analysis validates our current Iceberg implementation as scalable and performant for large-scale incremental merges. These findings provide a strong foundation to continue expanding data volumes while maintaining timely and reliable data processing.

About Zeta Global

Zeta Global (NYSE: ZETA) is the AI Marketing Cloud that leverages advanced artificial intelligence (AI) and trillions of consumer signals to make it easier for marketers to acquire, grow, and retain customers more efficiently. Through the Zeta Marketing Platform (ZMP), our vision is to make sophisticated marketing simple by unifying identity, intelligence, and omnichannel activation into a single platform — powered by one of the industry’s largest proprietary databases and AI. Our enterprise customers across multiple verticals are empowered to personalize experiences with consumers at an individual level across every channel, delivering better results for marketing programs.

About Authors

Sekhar Sahu, Manoj Agarwal, and Vachaspathy Kuntamukkala of Amazon Web Services (AWS) and Zeta Global
