Building a Data Migration Bootstrapper: Migrating 5,000+ Tables (6 TB) from a Cloud Data Warehouse to S3 Tables (Iceberg)
How we engineered a scalable solution to migrate massive datasets with optimal performance and cost efficiency
The Challenge
When tasked with migrating over 5,000 tables from our Cloud Data Warehouse to S3 Tables, we knew this wasn’t going to be a simple copy-paste operation. We needed a solution that was:
This is the story of how we built our data migration bootstrapper — a comprehensive solution that transformed our data infrastructure.
Phase 1: Finding the Optimal File Format
Before building any infrastructure, we needed to answer a fundamental question: Which file format and compression method would give us the best balance of storage efficiency, query performance, and export speed?
The Experiment
We conducted extensive testing using the SAMPLE_DATA.TPCH_SF100.CUSTOMER table, which contains 15 million records (~1.08 GB), as our benchmark dataset. We tested various combinations of file formats and compression methods:
Key Insights
Parquet with Snappy compression emerged as our winner for several reasons:
While CSV with GZIP compression offered a better compression ratio, its longer export times and lack of columnar benefits made it less suitable for our analytical use cases.
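The benchmark harness itself isn't shown in this post; the sketch below illustrates one way such a comparison can be scripted, assuming a Snowflake-style warehouse (the TPCH_SF100 sample schema suggests one) reached via snowflake-connector-python, with the stage name and credentials as placeholders.

```python
# Benchmark sketch: unload the same table in several format/compression combos
# and record export time plus total exported size. Stage and credentials are
# placeholders, not our real setup.
import time
import snowflake.connector  # assumes snowflake-connector-python is installed

COMBOS = [
    ("PARQUET", "SNAPPY"),
    ("PARQUET", "NONE"),
    ("CSV", "GZIP"),
    ("CSV", "NONE"),
]
TABLE = "SAMPLE_DATA.TPCH_SF100.CUSTOMER"
STAGE = "@MIGRATION_BENCH_STAGE"  # hypothetical external stage pointing at S3

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***"  # placeholders
)
cur = conn.cursor()

for fmt, codec in COMBOS:
    path = f"{STAGE}/bench/{fmt.lower()}_{codec.lower()}/"
    start = time.time()
    cur.execute(
        f"COPY INTO {path} FROM {TABLE} "
        f"FILE_FORMAT = (TYPE = {fmt} COMPRESSION = {codec}) OVERWRITE = TRUE"
    )
    elapsed = time.time() - start
    cur.execute(f"LIST {path}")
    total_bytes = sum(row[1] for row in cur.fetchall())  # LIST rows: (name, size, ...)
    print(f"{fmt}/{codec}: {elapsed:.1f}s, {total_bytes / 1e9:.2f} GB")
```

Wall-clock time around the unload plus the total size of the resulting files is enough to compare the combinations side by side.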
Phase 2: Building the Migration Service
With our file format decided, we designed a robust migration service architecture that orchestrates the entire data movement process.
Our migration service follows a two-step orchestrated approach:
Step 1: Cloud Data Warehouse Export Service
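The export service's internals aren't listed here; as a rough sketch under the same Snowflake-style assumption, each table is unloaded to a staging prefix in S3 as Snappy-compressed Parquet (the stage path and file-size target are illustrative, not our exact settings).

```python
# Sketch of the per-table export step (illustrative names, not our exact service).
# Unloads one warehouse table to an S3 staging prefix as Snappy-compressed Parquet.
def export_table(cursor, database: str, schema: str, table: str, stage: str) -> None:
    target = f"{stage}/{database}/{schema}/{table}/"
    cursor.execute(
        f"COPY INTO {target} FROM {database}.{schema}.{table} "
        "FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY) "
        "HEADER = TRUE "              # keep real column names in the Parquet schema
        "MAX_FILE_SIZE = 268435456 "  # ~256 MB files, a Spark-friendly size
        "OVERWRITE = TRUE"
    )
```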
Step 2: Spark Processing Service
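Likewise, the Spark job isn't reproduced in full; a minimal PySpark sketch of the idea, assuming the Iceberg Spark runtime and the S3 Tables catalog for Iceberg are on the classpath, with catalog, bucket, and namespace names as placeholders:

```python
# PySpark sketch: read the staged Parquet export and write it to an Iceberg
# table in an S3 table bucket. Catalog/bucket/namespace names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bootstrap-table")
    # Iceberg catalog backed by S3 Tables (assumes the iceberg-spark-runtime and
    # s3-tables-catalog-for-iceberg jars are available on the cluster)
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket")
    .getOrCreate()
)

# Staged Parquet written by the export step
df = spark.read.parquet("s3://my-staging-bucket/SAMPLE_DATA/TPCH_SF100/CUSTOMER/")

# DataFrameWriterV2: create (or replace) the Iceberg table from the staged export
df.writeTo("s3tables.tpch_sf100.customer").using("iceberg").createOrReplace()
```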
Phase 3: Step Functions Orchestration
The orchestration layer coordinates both the Cloud Data Warehouse export and Spark processing steps through AWS Step Functions.
Orchestrated Workflow
We used AWS Step Functions to orchestrate the entire pipeline:
The Step Functions state machine provided:
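The full state machine definition is beyond the scope of this post; the sketch below shows its general shape as an Amazon States Language document built in Python and registered with boto3: a distributed Map over the table list with a concurrency cap, and a per-table chain of export followed by Spark processing. ARNs, names, and retry settings are illustrative.

```python
# Sketch of the orchestration layer: an ASL definition (as a Python dict) that maps
# over the table list with bounded concurrency, then registers it via boto3.
# ARNs and names are placeholders; the 200-table cap mirrors the post.
import json
import boto3

definition = {
    "StartAt": "MigrateTables",
    "States": {
        "MigrateTables": {
            "Type": "Map",
            "ItemsPath": "$.tables",
            "MaxConcurrency": 200,  # parallel tables, within warehouse limits
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                "StartAt": "ExportFromWarehouse",
                "States": {
                    "ExportFromWarehouse": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:export-table",
                        "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3,
                                   "IntervalSeconds": 30, "BackoffRate": 2.0}],
                        "Next": "ProcessWithSpark",
                    },
                    "ProcessWithSpark": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:submit-spark-job",
                        "End": True,
                    },
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="data-migration-bootstrapper",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-migration-role",
)
```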
Challenges and Solutions
Challenge 1: Scale and Performance
Problem: Processing 5,000+ tables sequentially would have taken weeks.
Solution: We implemented parallel processing with configurable concurrency limits, which let us process up to 200 tables simultaneously while respecting Cloud Data Warehouse limits.
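Whether the cap lives in the Step Functions Map state (MaxConcurrency above) or in a driver script, the idea is the same; here is a tiny, illustrative version using a configurable thread pool (the cap value and the helper function are placeholders):

```python
# Illustration of configurable concurrency: run migrations for many tables at once,
# but never more than MAX_PARALLEL_TABLES in flight at a time.
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL_TABLES = 200  # tuned to stay inside warehouse connection/queue limits

def migrate_one(table: str) -> str:
    # placeholder for "export, then process" for a single table
    ...
    return table

tables = [f"table_{i}" for i in range(5000)]
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_TABLES) as pool:
    futures = [pool.submit(migrate_one, t) for t in tables]
    for fut in as_completed(futures):
        print(f"done: {fut.result()}")
```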
Challenge 2: Cost Management
Problem: Large-scale data movement could result in significant costs.
Solution:
Challenge 4: Monitoring and Observability
Problem: Tracking progress across thousands of table migrations.
Solution: Integrated CloudWatch metrics, custom dashboards, and Slack notifications for real-time monitoring.
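The dashboards themselves can't be reproduced here; as one hedged example of the plumbing, a per-table completion event can be pushed to CloudWatch as custom metrics and mirrored to Slack through an incoming webhook (namespace, metric names, and the webhook URL are placeholders).

```python
# Sketch: emit per-table migration metrics to CloudWatch and post a Slack update.
# Namespace, metric names, and the webhook URL are placeholders, not our real setup.
import json
import urllib.request
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_table_done(table: str, duration_s: float, slack_webhook_url: str) -> None:
    # Custom metrics feed the dashboards and alarms
    cloudwatch.put_metric_data(
        Namespace="DataMigrationBootstrapper",
        MetricData=[
            {"MetricName": "TablesMigrated", "Value": 1, "Unit": "Count"},
            {"MetricName": "TableMigrationSeconds", "Value": duration_s, "Unit": "Seconds"},
        ],
    )
    # Slack incoming webhook for real-time progress updates
    payload = json.dumps({"text": f"Migrated {table} in {duration_s:.0f}s"}).encode()
    req = urllib.request.Request(
        slack_webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```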
Results and Impact
Lakehouse Migration Highlights
Key Takeaways
Conclusion
Building our data migration bootstrapper was about more than just moving data; it was about creating a scalable, reliable foundation for our modern data architecture. By starting with thorough testing, applying robust engineering practices, and leveraging cloud-native orchestration, we successfully migrated over 5,000 tables while improving performance and reducing costs.
The journey taught us that successful large-scale data migrations require careful planning, the right technology choices, and a commitment to automation and monitoring. Our bootstrapper now serves as a blueprint for future data infrastructure initiatives.
Meet The Team
Special thanks to AWS for their support and help.