Building a Data Migration Bootstrapper: Migrating 5,000+ Tables (6 TB) from a Cloud Data Warehouse to S3 Tables (Iceberg)
How we engineered a scalable solution to migrate massive datasets with optimal performance and cost efficiency
The Challenge
When tasked with migrating over 5,000 tables from our Cloud Data Warehouse to S3 Tables, we knew this wasn’t going to be a simple copy-paste operation. We needed a solution that was:
This is the story of how we built our data migration bootstrapper — a comprehensive solution that transformed our data infrastructure.
Phase 1: Finding the Optimal File Format
Before building any infrastructure, we needed to answer a fundamental question: Which file format and compression method would give us the best balance of storage efficiency, query performance, and export speed?
The Experiment
We conducted extensive testing using the SAMPLE_DATA.TPCH_SF100.CUSTOMER table, which contains 15 million records (~1.08 GB), as our benchmark dataset. We tested various combinations of file formats and compression methods:
Key Insights
Parquet with Snappy compression emerged as our winner for several reasons:
While CSV with GZIP compression offered a better compression ratio, its longer export times and lack of columnar benefits made it less suitable for our analytical use cases.
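The benchmark harness itself isn't shown in this post; the sketch below illustrates one way such a comparison can be scripted, assuming a Snowflake-style warehouse (the TPCH_SF100 sample schema suggests one) reached via snowflake-connector-python, with the stage name and credentials as placeholders.

```python
# Benchmark sketch: unload the same table in several format/compression combos
# and record export time plus total exported size. Stage and credentials are
# placeholders, not our real setup.
import time
import snowflake.connector  # assumes snowflake-connector-python is installed

COMBOS = [
    ("PARQUET", "SNAPPY"),
    ("PARQUET", "NONE"),
    ("CSV", "GZIP"),
    ("CSV", "NONE"),
]
TABLE = "SAMPLE_DATA.TPCH_SF100.CUSTOMER"
STAGE = "@MIGRATION_BENCH_STAGE"  # hypothetical external stage pointing at S3

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***"  # placeholders
)
cur = conn.cursor()

for fmt, codec in COMBOS:
    path = f"{STAGE}/bench/{fmt.lower()}_{codec.lower()}/"
    start = time.time()
    cur.execute(
        f"COPY INTO {path} FROM {TABLE} "
        f"FILE_FORMAT = (TYPE = {fmt} COMPRESSION = {codec}) OVERWRITE = TRUE"
    )
    elapsed = time.time() - start
    cur.execute(f"LIST {path}")
    total_bytes = sum(row[1] for row in cur.fetchall())  # LIST rows: (name, size, ...)
    print(f"{fmt}/{codec}: {elapsed:.1f}s, {total_bytes / 1e9:.2f} GB")
```

Wall-clock time around the unload plus the total size of the resulting files is enough to compare the combinations side by side.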
Phase 2: Building the Migration Service
With our file format decided, we designed a robust migration service architecture that orchestrates the entire data movement process.
Our migration service follows a two-step orchestrated approach:
Step 1: Cloud Data Warehouse Export Service
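The export service's internals aren't listed here; as a rough sketch under the same Snowflake-style assumption, each table is unloaded to a staging prefix in S3 as Snappy-compressed Parquet (the stage path and file-size target are illustrative, not our exact settings).

```python
# Sketch of the per-table export step (illustrative names, not our exact service).
# Unloads one warehouse table to an S3 staging prefix as Snappy-compressed Parquet.
def export_table(cursor, database: str, schema: str, table: str, stage: str) -> None:
    target = f"{stage}/{database}/{schema}/{table}/"
    cursor.execute(
        f"COPY INTO {target} FROM {database}.{schema}.{table} "
        "FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY) "
        "HEADER = TRUE "              # keep real column names in the Parquet schema
        "MAX_FILE_SIZE = 268435456 "  # ~256 MB files, a Spark-friendly size
        "OVERWRITE = TRUE"
    )
```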
Step 2: Spark Processing Service
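Likewise, the Spark job isn't reproduced in full; a minimal PySpark sketch of the idea, assuming the Iceberg Spark runtime and the S3 Tables catalog for Iceberg are on the classpath, with catalog, bucket, and namespace names as placeholders:

```python
# PySpark sketch: read the staged Parquet export and write it to an Iceberg
# table in an S3 table bucket. Catalog/bucket/namespace names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bootstrap-table")
    # Iceberg catalog backed by S3 Tables (assumes the iceberg-spark-runtime and
    # s3-tables-catalog-for-iceberg jars are available on the cluster)
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket")
    .getOrCreate()
)

# Staged Parquet written by the export step
df = spark.read.parquet("s3://my-staging-bucket/SAMPLE_DATA/TPCH_SF100/CUSTOMER/")

# DataFrameWriterV2: create (or replace) the Iceberg table from the staged export
df.writeTo("s3tables.tpch_sf100.customer").using("iceberg").createOrReplace()
```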
Phase 3: Step Functions Orchestration
The orchestration layer coordinates both the Cloud Data Warehouse export and Spark processing steps through AWS Step Functions.
Orchestrated Workflow
We used AWS Step Functions to orchestrate the entire pipeline:
The Step Functions state machine provided:
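The full state machine definition is beyond the scope of this post; the sketch below shows its general shape as an Amazon States Language document built in Python and registered with boto3: a distributed Map over the table list with a concurrency cap, and a per-table chain of export followed by Spark processing. ARNs, names, and retry settings are illustrative.

```python
# Sketch of the orchestration layer: an ASL definition (as a Python dict) that maps
# over the table list with bounded concurrency, then registers it via boto3.
# ARNs and names are placeholders; the 200-table cap mirrors the post.
import json
import boto3

definition = {
    "StartAt": "MigrateTables",
    "States": {
        "MigrateTables": {
            "Type": "Map",
            "ItemsPath": "$.tables",
            "MaxConcurrency": 200,  # parallel tables, within warehouse limits
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                "StartAt": "ExportFromWarehouse",
                "States": {
                    "ExportFromWarehouse": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:export-table",
                        "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3,
                                   "IntervalSeconds": 30, "BackoffRate": 2.0}],
                        "Next": "ProcessWithSpark",
                    },
                    "ProcessWithSpark": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:submit-spark-job",
                        "End": True,
                    },
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="data-migration-bootstrapper",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-migration-role",
)
```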
Challenges and Solutions
Challenge 1: Scale and Performance
Problem: Processing 5,000+ tables sequentially would have taken weeks.
Solution: We implemented parallel processing with configurable concurrency limits, which let us process up to 200 tables simultaneously while respecting Cloud Data Warehouse limits.
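Whether the cap lives in the Step Functions Map state (MaxConcurrency above) or in a driver script, the idea is the same; here is a tiny, illustrative version using a configurable thread pool (the cap value and the helper function are placeholders):

```python
# Illustration of configurable concurrency: run migrations for many tables at once,
# but never more than MAX_PARALLEL_TABLES in flight at a time.
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL_TABLES = 200  # tuned to stay inside warehouse connection/queue limits

def migrate_one(table: str) -> str:
    # placeholder for "export, then process" for a single table
    ...
    return table

tables = [f"table_{i}" for i in range(5000)]
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_TABLES) as pool:
    futures = [pool.submit(migrate_one, t) for t in tables]
    for fut in as_completed(futures):
        print(f"done: {fut.result()}")
```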
Challenge 2: Cost Management
Problem: Large-scale data movement could result in significant costs.
Solution:
Challenge 4: Monitoring and Observability
Problem: Tracking progress across thousands of table migrations.
Solution: Integrated CloudWatch metrics, custom dashboards, and Slack notifications for real-time monitoring.
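The dashboards themselves can't be reproduced here; as one hedged example of the plumbing, a per-table completion event can be pushed to CloudWatch as custom metrics and mirrored to Slack through an incoming webhook (namespace, metric names, and the webhook URL are placeholders).

```python
# Sketch: emit per-table migration metrics to CloudWatch and post a Slack update.
# Namespace, metric names, and the webhook URL are placeholders, not our real setup.
import json
import urllib.request
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_table_done(table: str, duration_s: float, slack_webhook_url: str) -> None:
    # Custom metrics feed the dashboards and alarms
    cloudwatch.put_metric_data(
        Namespace="DataMigrationBootstrapper",
        MetricData=[
            {"MetricName": "TablesMigrated", "Value": 1, "Unit": "Count"},
            {"MetricName": "TableMigrationSeconds", "Value": duration_s, "Unit": "Seconds"},
        ],
    )
    # Slack incoming webhook for real-time progress updates
    payload = json.dumps({"text": f"Migrated {table} in {duration_s:.0f}s"}).encode()
    req = urllib.request.Request(
        slack_webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```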
Results and Impact
Lakehouse Migration Highlights
Key Takeaways
Conclusion
Building our data migration bootstrapper was about more than just moving data; it was about creating a scalable, reliable foundation for our modern data architecture. By starting with thorough testing, applying robust engineering practices, and leveraging cloud-native orchestration, we successfully migrated over 5,000 tables while improving performance and reducing costs.
The journey taught us that successful large-scale data migrations require careful planning, the right technology choices, and a commitment to automation and monitoring. Our bootstrapper now serves as a blueprint for future data infrastructure initiatives.
Meet The Team
Special thanks to AWS for their support and help.