Amazon’s Exabyte-Scale Migration from Spark to Ray
Patrick Ames, Amazon
Overview
1. Introduction
2. The Compaction Efficiency Problem
3. The Spark to Ray Migration
4. Results
5. Future Work
6. Get Involved
Introduction
1. From Data Warehouse to Lakehouse
2. Business Intelligence?
3. A Brief History of Amazon BI
4. Ray?
From Data Warehouse to Lakehouse
1. Data Warehouse: Coupled Storage and Compute over Structured Data
2. Data Lake: Highly Scalable Raw Data Storage and Retrieval
3. Data Catalog: Data Lake + Metadata (table names, versions, schemas, descriptions, audit logs, etc.)
4. Data Lakehouse: Decoupled Data Catalog + Compute over Structured and Unstructured Data
Business Intelligence
(Figures: Decision Evolution; Business Intelligence Flywheel)
Amazon is gradually transitioning large-scale BI pipelines into phases 2 and 3. Amazon’s BI catalog provides a centralized hub to access exabytes of business data.
A Brief History of Amazon BI
• 2016-2018: PB-Scale Oracle Data Warehouse Deprecation
• Migrated 50PB from Oracle Data Warehouse to S3-Based Data Catalog
• Decoupled storage with Amazon Redshift & Apache Hive on Amazon EMR Compute
• 2018-2019: EB-Scale Data Catalog & Lakehouse Formation
• Bring your own compute (EMR Spark, AWS Glue, Amazon Redshift Spectrum, etc.)
• LSM-based CDC “Compaction” using Apache Spark on Amazon EMR

Append-only compaction: Append deltas arrive in a table’s CDC log stream, where each delta contains pointers to one or more S3 files holding records to insert into the table. During a compaction job, no records are updated or deleted, so the delta merge is a simple concatenation, but the compactor is still responsible for writing out files sized to optimize reads (i.e., merging tiny files into larger files and splitting massive files into smaller ones).

Upsert compaction: Append and Upsert deltas arrive in a table’s CDC log stream, where each Upsert delta contains records to update or insert according to one or more merge keys. In this case, column1 is used as the merge key, so only the latest column2 updates are kept per distinct column1 value.
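The merge-key semantics above can be sketched in plain Python (a hypothetical, stdlib-only illustration of the upsert rule; this is not the production Spark or Ray compactor):

```python
# Hypothetical sketch: merge upsert deltas into a compacted table, where the
# latest write per distinct merge-key value wins.

def apply_upsert_deltas(compacted, deltas, merge_key="column1"):
    """compacted: list of record dicts (the current compacted table).
    deltas: list of delta batches, oldest first; each batch is a list of records."""
    index = {row[merge_key]: row for row in compacted}
    for delta in deltas:                 # apply oldest delta first
        for row in delta:
            index[row[merge_key]] = row  # latest record per merge key wins
    return list(index.values())

table = [{"column1": 1, "column2": "a"}]
deltas = [
    [{"column1": 1, "column2": "b"}, {"column1": 2, "column2": "x"}],
    [{"column1": 1, "column2": "c"}],
]
result = apply_upsert_deltas(table, deltas)
# column1=1 keeps only its latest column2 value ("c"); column1=2 is inserted.
```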
• 2019-2023: Ray Integration
• EB-Scale Data Quality Analysis
• Spark-to-Ray Compaction Migration

Ray Shadow Compaction Workflow: New deltas arriving in a data catalog table’s CDC log stream are merged into two compacted tables maintained separately by Apache Spark and Ray. The Data Reconciliation Service verifies that different data processing frameworks produce equivalent results when querying the datasets produced by Apache Spark and Ray, while the Ray-based DQ Service compares key dataset statistics.
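In outline, the key-statistics comparison described above might look like this (hypothetical helper functions; the deck does not show the actual Data Reconciliation Service or DQ Service APIs):

```python
# Hypothetical sketch of comparing key dataset statistics between the
# Spark-compacted and Ray-compacted copies of the same table.

def key_stats(rows, key="column1"):
    """Summary statistics a DQ check might compare across frameworks."""
    values = [row[key] for row in rows]
    return {"row_count": len(values), "distinct_keys": len(set(values))}

def equivalent(spark_rows, ray_rows, key="column1"):
    """True when both compacted copies agree on the key statistics."""
    return key_stats(spark_rows, key) == key_stats(ray_rows, key)

spark_out = [{"column1": 1, "column2": "c"}, {"column1": 2, "column2": "x"}]
ray_out = [{"column1": 2, "column2": "x"}, {"column1": 1, "column2": "c"}]
same = equivalent(spark_out, ray_out)  # row order may differ; stats agree
```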
• 2024+: Ray Exclusivity
• Migrate all Table Queries to Ray Compactor Output
• Turn off Spark Compactor

Ray-Exclusive compaction: New deltas arriving in a data catalog table’s CDC log stream are merged into only one compacted table, maintained by Ray. Amazon BI tables are gradually being migrated from Spark compaction to Ray-Exclusive compaction, starting with our largest tables.
Ray?
• Pythonic
• Provides distributed Python APIs for ML, data science, and general workloads.
• Intuitive
• Relatively simple to convert single-process Python to distributed.
• Scalable
• Can integrate PB-scale datasets with data processing and ML pipelines.
• Performant
• Reduces end-to-end latency of data processing and ML workflows.
• Efficient
• Reduces end-to-end cost of data processing and ML.
• Unified
• Can run all steps of mixed data processing, data science, and ML pipelines.
The Compaction Efficiency Problem
1. Efficiency Defined
2. Understanding Amazon BI Pricing
3. The Money Pit
Efficiency Defined
• Output/Input
• Physics: (Useful Energy Output) / (Energy Input)
• Computing: (Useful Compute Output) / (Resource Input)
• In Other Words… The More Useful Work You Complete With the Same Resources, the
More Efficient You Are
• Simple Example
• System A: Reads 2GB/min of input data with 1 CPU and 8GB RAM.
• System B: Reads 4GB/min of input data with 1 CPU and 8GB RAM.
• Conclusion: System B is 2X more efficient than System A.
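The simple example above, worked as a few lines of Python (the helper name is hypothetical):

```python
def efficiency(useful_output, resource_input):
    """Efficiency = useful compute output per unit of resource input."""
    return useful_output / resource_input

# Both systems use 1 CPU and 8GB RAM, so the resource input is equal;
# only throughput differs.
system_a = efficiency(2, 1)  # reads 2 GB/min per CPU
system_b = efficiency(4, 1)  # reads 4 GB/min per CPU
ratio = system_b / system_a  # 2.0 -> System B is 2x as efficient
```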
Understanding Amazon BI Pricing
• Consumers Pay Per Byte Consumed from our Data Catalog
• Administrative costs are divided evenly across thousands of distinct data consumers.
• Everyone is charged the same normalized amount per byte read.
• The Primary Administrative Cost is Compaction
• What does it do?
• Input is a list of structured S3 files with records to insert, update, and delete.
• Output is a set of S3 files with all inserts, updates, and deletes applied.
• Why is it so expensive?
• Runs after any write to active tables in our exabyte-scale data catalog.
• Incurs large compute cost to merge input, and storage cost to write output.
• Why do we do it?
• To cache a Read-Optimized View of the dataset.
• To enforce Security & Compliance Policies like GDPR.
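A toy sketch of what one compaction round does with its input (stdlib-only and hypothetical; the production compactor is far more involved):

```python
# Hypothetical sketch: input is a delta of (operation, record) pairs; output
# is the table with all inserts, updates, and deletes applied.

def compact(table, delta, key="id"):
    index = {row[key]: row for row in table}
    for op, row in delta:
        if op in ("insert", "update"):
            index[row[key]] = row         # upserts overwrite by key
        elif op == "delete":
            index.pop(row[key], None)     # deletes drop the key if present
    return list(index.values())

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
delta = [("update", {"id": 1, "v": "a2"}),
         ("insert", {"id": 3, "v": "c"}),
         ("delete", {"id": 2, "v": None})]
compacted = compact(table, delta)
# compacted is the read-optimized view after all operations are applied
```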
The Money Pit
• Consumers Pay Per Byte Consumed
• The Primary Administrative Cost is Compaction
The Spark to Ray Migration
1. Key Milestones
2. Challenges
3. Concessions Made
4. Lessons Learned
Key Milestones
1. 2019: Ray Investigation, Benchmarking, and Analysis
2. 2020: Ray Compactor (The Flash Compactor) Algorithm Design
3. 2020: The Flash Compactor Proof-of-Concept Implementation
4. 2021: Ray Job Management System Design & Implementation
5. 2022: Exabyte-Scale Production Data Quality Analysis with Ray
6. 2022: The Flash Compactor Shadows Spark in Production
7. 2023: Production Consumer Migration from Spark to The Flash Compactor on Ray
8. 2024: Ray Compactor Exclusivity - Spark Compactor Shutdown Begins
Challenges
Budgeting for Deferred Results
• Costs increase before they decrease
• Operating Ray & Spark concurrently
• Largest 1% of tables are ~50% of cost
• Migrating largest 1% of tables first
• At scale, failure resilience takes time
Correctness
• Backwards compatibility with Spark
• Spark/Arrow type system differences
• Asserting correctness on a budget
• Automating data quality workflows
Regressive Datasets
• Merging millions of KB/MB-scale files
• Splitting TB-scale files
• Dealing with oversized merge keys
• Handling corrupt and missing data
• Poor table partitioning and schema
Operational Sustainability
• Cluster management (cost and scale)
• Reducing cluster start/stop times
• Eliminating out-of-memory errors
• When to use AWS Glue vs. EC2
• Proliferation of manual workloads
Challenges
Customer Experience & Paradoxical Goals
• Goal 1: Complete an EB-scale compute framework migration invisibly
• Goal 2: Customers just read a table and shouldn’t care if Spark or Ray produced it
• Goal 3: Customers need to know how complete the table they’re consuming is
So what do you tell them when the data produced by Spark and Ray are out of sync?
🤔
Challenges
Customer Experience & Paradoxical Goals
We also built dynamic routing per-table-consumer to get the right answer.
Q: So how complete IS this table?
A: Well, that kind of depends on who’s asking…
Concessions Made

Backwards Compatibility over Latest Tech
• Regressed from Parquet 2.X to Spark-flavored Parquet 1.0
• Subsequent migrations needed to move away from backwards compatibility

Multiple Implementations over Unification
• No single implementation is most optimal across all dataset and hardware variations
• Constantly balancing ROI of maintaining vs. discarding an implementation

Manual Memory Management over Automatic
• Difficult to predict object store memory requirements at cluster & task creation time
• High-efficiency self-managed GC preferred over fully managed GC
Lessons Learned
Budgeting for Deferred Results
• Don’t save the hardest problems for last
• Know time to recoup initial investment
• Show you can recoup initial investment
Correctness
• Your customers define correctness
• Type systems rarely agree on equality
• Don’t expect 100% result equality
• Build key dataset equivalence checks
Regressive Datasets
• Test against production data ASAP
• Code coverage != edge-case coverage
• Tests can’t replace production telemetry
• At scale, QA and telemetry converge
Operational Sustainability
• Murphy’s Law is always true at scale
• EC2 excels at larger, >10TB jobs
• Glue excels at smaller, <10TB jobs
• Automate manual ops early and often
• Building serverless Ray infra is hard
Results
1. Ray vs. Spark Production Metrics
2. Key Production Statistics
3. Fun Facts
(Ray vs. Spark production metrics charts)
Key Production Statistics
Ray DeltaCAT Compactor, October 2024
• Job Runs: >25K jobs/day
• EC2 vCPUs Provisioned: >1.5MM vCPUs/day
• Data Merged: >40 PB/day (pyarrow)
• Cost Efficiency: $2.74/TB (on-demand r6g avg); $0.59/TB (spot r6g avg)
• Projected Annual Savings: >220K EC2 vCPU Years (~$100MM/year, on-demand r6g)
Fun Facts
Did you know…
1. All of our internal production Data Catalog API calls go from Python to Java?
2. We have a critical prod dependency on Py4j, and it hasn’t caused a single issue yet?
But we spent A LOT of time stabilizing this Py4j dependency before going to prod!
3. Our production compactor code is checked into the open source DeltaCAT project?
And we’ve started testing it on Iceberg!
So if you find any issues or would like to help, please let us know!
Conclusion
1. Spark to Ray?
2. Key Takeaways
Spark to Ray?
Should I start migrating my Spark jobs to Ray? Probably not (not yet, at least).
• Spark still provides more general and feature-rich data processing abstractions.
• There are no paved roads to translate all Spark applications to Ray-native equivalents.
• Don’t expect comparable improvements by just running your Spark jobs on RayDP.
Key Takeaways
So What Do These Results Mean?
• The flexibility of Ray Core lets you craft optimal solutions to very specific problems.
• You may want to focus on rewriting your most expensive distributed compute jobs on Ray.
• Our compactor provides one specific example of how a targeted migration to Ray can pay off.
• The Exoshuffle paper shows a more general example of how Ray can improve data processing (and Ray on EC2 still holds the 100TB Cloud Sort Cost Record of $97!).
Ray has the potential to be a world-class big data processing framework, but realizing that potential takes A LOT of work today!
Future Work
Apache Iceberg Compaction on DeltaCAT
https://github.com/ray-project/deltacat/
https://github.com/apache/iceberg
Apache Iceberg Compaction on DeltaCAT
→ What Will it Do?
+ Improve Iceberg table copy-on-write / merge-on-read efficiency & scalability.
+ Improve reading Iceberg equality deletes written by Apache Flink.
→ When Can I Use It?
+ Targeting a stable open source release in early 2025.
+ Currently running internal tests to verify correctness, stability, efficiency, etc.
→ What Can I Do Today?
+ Run local and distributed Iceberg table reads and writes on Ray via Daft.
+ https://www.getdaft.io/
Thank You
Ray Community Slack: @Patrick Ames
DeltaCAT Project Homepage: https://github.com/ray-project/deltacat
Read More on the AWS Open Source Blog: https://aws.amazon.com/blogs/opensource/