Edward Zhang,
Software Engineer Manager, Data Service & Solution (eBay)
ADBMS to Apache Spark
Auto Migration Framework
#SAISDD7
Who We Are
• Data Service & Solution team at eBay
• Responsible for big data processing and data application development
• Focus on batch auto migration and Spark core optimization
Why Migrate to Spark
• More complex big data processing needs
• Streaming, graph computation, and machine learning use cases
• Extreme performance optimization needs
What We Do
• ~90% batch workload auto migration
• Tool sets to enable manual migration
Agenda
• Auto Migration Scope
• Auto Migration Strategy
• Auto Migration Components
• Key Components
• Tool Sets
• Major Challenges
• Be part of community
Auto Migration Scope
• ~5K target tables
• ~20K intermediate/working tables
• ~22 PB of data in target tables
• ~40 PB of relational data processed every day
• ~1 year timeline
Auto Migration Strategy
Auto Migration Framework
• Migration Planner
• Metadata
• Migration Engine
• Controller: Process Manager, Task Invoker, Task Monitor
• DDL Generator, SQL Converter, Job Optimizer, Pipeline Generator
• Release Assistant, Data Mover, Data Validator
Auto Migration Components
Migration Planner
• Analyze and identify auto migration candidates
• Determine the order of table migration
Metadata
• Define and collect metadata to enable the auto migration engine
• Includes table profile, data lineage, job lineage, SQL file profile, and pipeline profile
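One way to picture the planner's ordering step is a topological sort over the data lineage, so that every table is migrated after the tables it reads from. The sketch below is only an illustration under that assumption; the table names and the `lineage` mapping are hypothetical and not taken from the talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical lineage: each table maps to the upstream tables it reads from.
lineage = {
    "dw_orders": {"stg_orders", "dim_users"},
    "stg_orders": {"raw_orders"},
    "dim_users": {"raw_users"},
    "raw_orders": set(),
    "raw_users": set(),
}

def migration_order(deps):
    """Return tables ordered so that every table follows its upstream tables.

    Raises graphlib.CycleError if the lineage contains a cycle.
    """
    return list(TopologicalSorter(deps).static_order())

order = migration_order(lineage)
print(order)
```

A cycle in the lineage raises `graphlib.CycleError`, which could be a useful signal that a table is not a clean candidate for automatic migration.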
Auto Migration Components
Controller
• Manage the end-to-end migration process
• Includes sub-components such as process manager, task invoker, and task monitor
DDL Generator
• A data modeler that generates DDL on Spark for target tables, working tables, and views
• Also sets the table format, buckets, and partitions
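As a rough illustration of what a DDL generator might emit, the sketch below renders a Spark SQL `CREATE TABLE` statement with a format, partition columns, and buckets from a simple table profile. The function name, its parameters, and the sample table are assumptions made for this example; the talk does not show the actual generator.

```python
def render_ddl(table, columns, fmt="parquet", partition_by=(), bucket_by=(), num_buckets=0):
    """Render a Spark SQL CREATE TABLE statement from a simple table profile.

    columns is a list of (name, type) pairs; partition and bucket columns
    must also appear in that list.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE TABLE {table} (\n  {cols}\n)\nUSING {fmt}"
    if partition_by:
        ddl += f"\nPARTITIONED BY ({', '.join(partition_by)})"
    if bucket_by and num_buckets > 0:
        ddl += f"\nCLUSTERED BY ({', '.join(bucket_by)}) INTO {num_buckets} BUCKETS"
    return ddl

# Hypothetical table profile, for illustration only.
ddl = render_ddl(
    "dw.orders",
    [("order_id", "BIGINT"), ("status", "STRING"), ("dt", "STRING")],
    partition_by=["dt"],
    bucket_by=["order_id"],
    num_buckets=64,
)
print(ddl)
```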
Auto Migration Components
SQL Converter
• Split original SQL files into table transform + merge steps
• Parse the original ADBMS SQL into an abstract syntax tree and assemble it into Spark SQL
• Special rules to deal with SQL dialect and UDFs
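The dialect rules can be thought of as a table that maps source-dialect constructs to Spark SQL equivalents. The sketch below uses plain regular expressions rather than the AST-based approach the slides describe, and it only handles single, non-nested arguments. The slides do not name the source dialect; `ZEROIFNULL` and `NULLIFZERO` are picked here as examples of functions that exist in some analytical databases but not in Spark SQL.

```python
import re

# Hypothetical rule table: (pattern in the source dialect, Spark SQL template).
RULES = [
    (re.compile(r"\bZEROIFNULL\s*\(\s*([\w.]+)\s*\)", re.IGNORECASE), r"COALESCE(\1, 0)"),
    (re.compile(r"\bNULLIFZERO\s*\(\s*([\w.]+)\s*\)", re.IGNORECASE), r"NULLIF(\1, 0)"),
]

def apply_rules(sql):
    """Apply each rewrite rule in order and return the converted SQL text."""
    for pattern, template in RULES:
        sql = pattern.sub(template, sql)
    return sql

converted = apply_rules("SELECT ZEROIFNULL(t.amt), nullifzero(qty) FROM t")
print(converted)
```

Working on a parsed tree instead of raw text avoids the usual regex pitfalls, such as matches inside string literals or nested calls.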
Auto Migration Components
Job Optimizer
• Pre-generate Spark job execution configurations based on table size and Spark cluster scale (typically spark.sql.shuffle.partitions)
• Leverage Spark Adaptive Execution to optimize the execution plan online
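A minimal sketch of how such a configuration could be pre-generated from table size is shown below. The sizing rule (about 128 MiB of input per shuffle partition, clamped to a fixed range) and all thresholds are assumptions for illustration, not the values used in the talk. `spark.sql.shuffle.partitions` and `spark.sql.adaptive.enabled` are standard Spark SQL settings.

```python
import math

def shuffle_partitions(input_bytes, target_bytes=128 * 1024**2, min_parts=200, max_parts=4096):
    """Pick a shuffle partition count from input size, clamped to [min_parts, max_parts]."""
    wanted = math.ceil(input_bytes / target_bytes)
    return max(min_parts, min(max_parts, wanted))

def job_conf(input_bytes):
    """Build per-job Spark SQL settings as a plain dict of strings."""
    return {
        "spark.sql.shuffle.partitions": str(shuffle_partitions(input_bytes)),
        # Let adaptive execution adjust the plan at runtime.
        "spark.sql.adaptive.enabled": "true",
    }

conf = job_conf(64 * 1024**3)  # hypothetical job reading 64 GiB
print(conf)
```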
Auto Migration Components
Pipeline Generator
- Generate the workflow that sets the execution steps and schedule of the Spark SQL files
Release Assistant
- Push code to the production environment and the GitHub repo, handle table creation, ...
Data Mover
- Move data across platforms: snapshot data preparation on DEV and historical data initialization on PROD
Data Validator
- Cross-platform data checksums on both DEV and PROD
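A cross-platform checksum has to be independent of row order, because the two platforms will not return rows in the same sequence. The sketch below shows one possible scheme, assuming rows are normalized to text, hashed individually, and summed modulo 2^64 alongside a row count. The normalization rules and the sample rows are hypothetical; a real validator would also need to agree on number, date, and timestamp formatting across platforms.

```python
import hashlib

def row_digest(row):
    """Hash one row after normalizing every value to text."""
    # NULLs get a marker so they do not collide with empty strings.
    text = "\x1f".join("\\N" if v is None else str(v) for v in row)
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")

def table_checksum(rows):
    """Return (row_count, order-independent checksum) for an iterable of rows."""
    count, total = 0, 0
    for row in rows:
        count += 1
        total = (total + row_digest(row)) % 2**64
    return count, total

# Hypothetical extracts of the same table from the two platforms.
source_rows = [(1, "a", None), (2, "b", 3.5)]
spark_rows = [(2, "b", 3.5), (1, "a", None)]
print(table_checksum(source_rows) == table_checksum(spark_rows))
```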
Key Components
• Metadata
• SQL Converter
Metadata - Overview
Neo4j, MySQL
Table Profile, SQL File Profile, Pipeline Profile, Data Lineage, Job Lineage
Metadata – Data Lineage
SQL Converter - Overview
SQL Converter – Conversion Rules
• Split original SQL files into table transformation and final table merge
• Identify ACID steps (merge update/delete/insert into one insert-overwrite step)
• Multiple update/delete cases: store intermediate step results in a temp view and do a single final merge
• Special handling for cases like case sensitivity, date/timestamp calculations, column name aliases …
• Adapt to known Spark issues
• Internal function & UDF translation
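To make the ACID rule concrete, the sketch below renders an upsert (update matched rows, insert unmatched rows) as a single `INSERT OVERWRITE` built on a full outer join between the target table and a delta table. This is one common way to express the idea and is not necessarily the converter's actual output; the function, table names, and columns are hypothetical, and deletes are left out for brevity.

```python
def upsert_as_insert_overwrite(target, delta, key, columns):
    """Render UPDATE-matched + INSERT-unmatched as one INSERT OVERWRITE statement.

    A row present in the delta wins over the target row with the same key,
    even when the delta sets a column to NULL.
    """
    select_items = [f"COALESCE(d.{key}, t.{key}) AS {key}"]
    select_items += [
        f"CASE WHEN d.{key} IS NOT NULL THEN d.{c} ELSE t.{c} END AS {c}"
        for c in columns
    ]
    select_list = ",\n  ".join(select_items)
    return (
        f"INSERT OVERWRITE TABLE {target}\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {target} t\n"
        f"FULL OUTER JOIN {delta} d\n"
        f"  ON t.{key} = d.{key}"
    )

sql = upsert_as_insert_overwrite("dw.orders", "stg.orders_delta", "order_id", ["status", "amount"])
print(sql)
```

Whether Spark accepts overwriting a table that the same statement reads depends on the table type and Spark version, so staging the merged rows in a working table first may be necessary.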
SQL Converter – Sample
Tool Sets
• DDL Generator
• SQL Converter
• SQL Optimizer
• Pipeline Generator
• Release Assistant
• Data Mover
• Data Validator
• + Dev Suite
Major Challenges
• Metadata Definition & Collection
- You do not know what you do not know
• Data Validation
- Upstream data quality issues
- SQL behavior or data format difference on Spark
• Non-SQL Jobs
- Cannot cover logic in shell scripts or command lines in the pipeline
Be part of community
~50 issues reported to the community during the migration
Case-insensitive field resolution
• SPARK-25132 Case-insensitive field resolution when reading from Parquet
• SPARK-25175 Field resolution should fail if there's ambiguity for ORC native reader
• SPARK-25207 Case-insensitive field resolution for filter pushdown when reading Parquet
Parquet filter pushdown
• SPARK-23727 Support DATE predict push down in parquet
• SPARK-24716 Refactor ParquetFilters
• SPARK-24706 Support ByteType and ShortType pushdown to parquet
• SPARK-24549 Support DecimalType push down to the parquet data sources
• SPARK-24718 Timestamp support pushdown to parquet data source
• SPARK-24638 StringStartsWith support push down
• SPARK-17091 Convert IN predicate to equivalent Parquet filter
UDF Improvement
• SPARK-23900 format_number udf should take user specifed format as argument
• SPARK-23903 Add support for date extract
• SPARK-23905 Add UDF weekday
Bugs
• SPARK-24076 very bad performance when shuffle.partition = 8192
• SPARK-24556 ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
• SPARK-25084 "distribute by" on multiple columns may lead to codegen issue
• SPARK-25368 Incorrect constraint inference returns wrong result
Q & A
Thank You!

Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang and Lipeng Zhu
