Building a Data Migration Bootstrapper: Migrating 5,000+ Tables (6TB) from Cloud Data Warehouse to S3 Tables (Iceberg)

How we engineered a scalable solution to migrate massive datasets with optimal performance and cost efficiency

The Challenge

When tasked with migrating over 5,000 tables from our Cloud Data Warehouse to S3 Tables, we knew this wasn't going to be a simple copy-paste operation. We needed a solution that was:

  • Scalable: Handle thousands of tables without manual intervention
  • Cost-effective: Optimize storage and compute costs
  • Reliable: Ensure data integrity throughout the migration
  • Fast: Minimize downtime and migration duration

This is the story of how we built our data migration bootstrapper — a comprehensive solution that transformed our data infrastructure.

Phase 1: Finding the Optimal File Format

Before building any infrastructure, we needed to answer a fundamental question: Which file format and compression method would give us the best balance of storage efficiency, query performance, and export speed?

The Experiment

We conducted extensive testing using the SAMPLE_DATA.TPCH_SF100.CUSTOMER table, containing 15 million records (~1.08 GB), as our benchmark dataset. We tested various combinations of file formats and compression methods:

[Benchmark results: export time, output size, and file count for each file format and compression combination]
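
The benchmark harness itself was simple. Below is a minimal sketch of the idea in Python, assuming a DB-API connection to the warehouse and a Snowflake-style COPY INTO dialect; the stage prefix and the format list are illustrative, not our exact configuration.

```python
import time

# Format/compression combinations under test (illustrative subset).
FORMATS = [
    ("PARQUET", "SNAPPY"),
    ("PARQUET", "GZIP"),
    ("CSV", "GZIP"),
    ("CSV", "NONE"),
]

def benchmark_exports(conn, stage_prefix):
    """Time a COPY INTO export of the benchmark table for each combination."""
    results = []
    for file_type, compression in FORMATS:
        target = f"{stage_prefix}/{file_type.lower()}_{compression.lower()}/"
        sql = f"""
            COPY INTO '{target}'
            FROM SAMPLE_DATA.TPCH_SF100.CUSTOMER
            FILE_FORMAT = (TYPE = {file_type} COMPRESSION = {compression})
            OVERWRITE = TRUE
        """
        start = time.perf_counter()
        conn.cursor().execute(sql)
        results.append((file_type, compression, round(time.perf_counter() - start, 1)))
    return results
```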

Key Insights

Parquet with Snappy compression emerged as our winner for several reasons:

  1. Balanced Performance: Fast export times (4.8s) with reasonable compression
  2. Columnar Benefits: Optimal for analytical workloads on S3 Tables
  3. Ecosystem Compatibility: Excellent support across Spark, Athena, and other tools
  4. File Count: Moderate number of files (91) — not too fragmented, not too large

While CSV with GZIP compression offered better compression ratios, its longer export times and lack of columnar benefits made it less suitable for our analytical use cases.

Phase 2: Building the Migration Service


With our file format decided, we designed a robust migration service architecture that orchestrates the entire data movement process.

Our migration service follows a two-step orchestrated approach:

Step 1: Cloud Data Warehouse Export Service

  • Executes COPY INTO commands on the Cloud Data Warehouse
  • Exports data in optimized Parquet format to the S3 staging area (a sketch of this step follows below)
  • Handles authentication, error management, and logging
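
A minimal sketch of this export step, assuming a DB-API connection to the warehouse and a Snowflake-style COPY INTO dialect; the staging layout and function name are illustrative:

```python
import logging

logger = logging.getLogger("bootstrapper.export")

def export_table_to_staging(conn, database, schema, table, staging_uri):
    """Unload one table to the S3 staging area as Snappy-compressed Parquet."""
    target = f"{staging_uri}/{database}/{schema}/{table}/"
    sql = f"""
        COPY INTO '{target}'
        FROM {database}.{schema}.{table}
        FILE_FORMAT = (TYPE = PARQUET COMPRESSION = SNAPPY)
        OVERWRITE = TRUE
    """
    try:
        conn.cursor().execute(sql)
        logger.info("Exported %s.%s.%s to %s", database, schema, table, target)
    except Exception:
        logger.exception("Export failed for %s.%s.%s", database, schema, table)
        raise
```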

Step 2: Spark Processing Service

  • Reads the exported Parquet files from S3 staging
  • Performs INSERT OVERWRITE operations into S3 Tables (sketched below)
  • Manages data transformation and schema alignment
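
A minimal sketch of the Spark step, assuming the Spark session has already been configured with an Iceberg catalog (named s3tables here) that points at the S3 Tables bucket; paths and table names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bootstrapper-spark-step").getOrCreate()

def load_into_s3_table(staging_path, target_table):
    """Read staged Parquet files and overwrite the target Iceberg table."""
    staged_df = spark.read.parquet(staging_path)

    # Expose the staged data to Spark SQL, then atomically replace the
    # target table's contents with INSERT OVERWRITE.
    staged_df.createOrReplaceTempView("staged_data")
    spark.sql(f"INSERT OVERWRITE {target_table} SELECT * FROM staged_data")

# Example invocation with placeholder names:
# load_into_s3_table("s3://staging-bucket/db/schema/customer/",
#                    "s3tables.analytics.customer")
```

Schema alignment can be handled by selecting columns explicitly (with casts where needed) instead of SELECT *.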

Phase 3: Step Functions Orchestration

The orchestration layer coordinates both the Cloud Data Warehouse export and Spark processing steps through AWS Step Functions.

Orchestrated Workflow

We used AWS Step Functions to orchestrate the entire pipeline:

  1. Configuration Loading: Read table metadata and migration parameters
  2. Cloud Data Warehouse Export: Execute the COPY INTO command
  3. Validation: Verify export completion and data integrity (a row-count check is sketched after this list)
  4. Spark Processing: Launch EMR job to process Parquet files
  5. S3 Tables Creation: Execute INSERT OVERWRITE operations
  6. Cleanup: Remove staging files and update metadata
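
The validation step (step 3 above) can be as simple as comparing the row count reported by the warehouse export against what actually landed in staging. A hypothetical sketch, assuming the expected row count is passed along in the workflow state:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bootstrapper-validation").getOrCreate()

def validate_export(staging_path, expected_rows):
    """Fail fast if the staged Parquet row count does not match the source."""
    actual_rows = spark.read.parquet(staging_path).count()
    if actual_rows != expected_rows:
        raise ValueError(
            f"Row count mismatch for {staging_path}: "
            f"expected {expected_rows}, found {actual_rows}"
        )
```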

The Step Functions state machine provided:

  • Parallel Processing: Multiple tables processed simultaneously
  • Error Recovery: Automatic retry with exponential backoff
  • Monitoring: Visual workflow tracking and alerting
  • Cost Optimization: Dynamic scaling based on workload
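
To illustrate how these properties fit together, here is a heavily simplified sketch of the state machine definition (Amazon States Language expressed as a Python dict). State names, Lambda ARNs, and the concurrency and retry values are illustrative; the real definition also covers validation, the Spark load, and cleanup.

```python
import json

MIGRATION_STATE_MACHINE = {
    "StartAt": "LoadTableConfigs",
    "States": {
        "LoadTableConfigs": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load-configs",
            "Next": "MigrateTables",
        },
        "MigrateTables": {
            "Type": "Map",          # fan out: one iteration per table
            "ItemsPath": "$.tables",
            "MaxConcurrency": 200,  # parallel table migrations
            "Iterator": {
                "StartAt": "WarehouseExport",
                "States": {
                    "WarehouseExport": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:warehouse-export",
                        "Retry": [{
                            "ErrorEquals": ["States.ALL"],
                            "IntervalSeconds": 30,
                            "MaxAttempts": 3,
                            "BackoffRate": 2.0,  # exponential backoff
                        }],
                        "End": True,
                    }
                },
            },
            "End": True,
        },
    },
}

print(json.dumps(MIGRATION_STATE_MACHINE, indent=2))
```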

Challenges and Solutions

Challenge 1: Scale and Performance

Problem: Processing 5,000+ tables sequentially would have taken weeks.

Solution: We implemented parallel processing with configurable concurrency limits, processing up to 200 tables simultaneously while respecting Cloud Data Warehouse limits.

Challenge 2: Cost Management

Problem: Large-scale data movement could result in significant costs.

Solution:

  • Optimized file formats and compression
  • Implemented intelligent staging cleanup (see the lifecycle-rule sketch below)
  • Used spot instances for Spark clusters
  • Scheduled migrations during off-peak hours
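
As one example, the staging cleanup can be backed by an S3 lifecycle rule that expires staged files shortly after they are loaded into S3 Tables. A minimal boto3 sketch, with the bucket name, prefix, and retention window as placeholders:

```python
import boto3

# Expire anything under the staging prefix a few days after it lands,
# so the staging area never accumulates storage cost.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-migration-staging-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-staged-exports",
                "Filter": {"Prefix": "staging/"},
                "Status": "Enabled",
                "Expiration": {"Days": 3},
            }
        ]
    },
)
```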

Challenge 3: Monitoring and Observability

Problem: Tracking progress across thousands of table migrations.

Solution: Integrated CloudWatch metrics, custom dashboards, and Slack notifications for real-time monitoring.
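
On the custom-metrics side, each table migration can publish its completion to CloudWatch so dashboards and alarms track overall progress. A small sketch, with the namespace, metric, and dimension names as placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_table_migrated(table_name, gigabytes):
    """Publish per-table progress metrics for dashboards and alarms."""
    cloudwatch.put_metric_data(
        Namespace="DataMigration/Bootstrapper",
        MetricData=[
            {
                "MetricName": "TablesMigrated",
                "Dimensions": [{"Name": "Table", "Value": table_name}],
                "Value": 1,
                "Unit": "Count",
            },
            {
                "MetricName": "GigabytesMigrated",
                "Value": gigabytes,
                "Unit": "Gigabytes",
            },
        ],
    )
```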

Results and Impact

Lakehouse Migration Highlights

  • Successfully migrated 6.2 TB of data from the Cloud Data Warehouse
  • Migrated over 3,000 tables off the Cloud Data Warehouse
  • Achieved throughput of 900–1,000 GB per hour via the Bootstrapper service

Key Takeaways

  1. Test Early, Test Often: Our upfront investment in file format testing saved significant time and costs later.
  2. Configuration Over Code: Making the system configuration-driven allowed us to handle diverse table structures without code changes.
  3. Orchestration Matters: Step Functions provided the reliability and visibility needed for large-scale operations.
  4. Monitor Everything: Comprehensive monitoring was crucial for managing a migration of this scale.
  5. Plan for Failures: Building retry logic and error handling from the start prevented many potential issues.

Conclusion

Building our data migration bootstrapper was more than just moving data — it was about creating a scalable, reliable foundation for our modern data architecture. By starting with thorough testing, implementing robust engineering practices, and leveraging cloud-native orchestration, we successfully migrated over 5,000 tables while improving performance and reducing costs.

The journey taught us that successful large-scale data migrations require careful planning, the right technology choices, and a commitment to automation and monitoring. Our bootstrapper now serves as a blueprint for future data infrastructure initiatives.

Meet The Team


Special Thanks to AWS for Their Support and Help


Varadaraj Ramachandraiah Manoj Agarwal Ananth P. Govardhan Yannam Nina Govan Sriram Kalagarla Sekhar Sahu Vachaspathy Kuntamukkala
