Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang and Lipeng Zhu

Edward Zhang, Software Engineer Manager, Data Service & Solution (eBay)
ADBMS to Apache Spark Auto Migration Framework
#SAISDD7

Who We Are
• Data Service & Solution team at eBay
• Responsible for big data processing and data application development
• Focus on batch auto migration and Spark core optimization

Why Migrate to Spark
• More complex big data processing needs
• Streaming, graph computation, and machine learning use cases
• Extreme performance optimization needs

What We Do
• ~90% batch workload auto migration
• Tool sets to enable manual migration

Agenda
• Auto Migration Scope
• Auto Migration Strategy
• Auto Migration Components
• Key Components
• Tool Sets
• Major Challenges
• Be part of community

Auto Migration Scope
• ~5K target tables
• ~20K intermediate/working tables
• ~22PB of data in target tables
• ~40PB of relational data processed every day
• ~1-year timeline

Auto Migration Strategy

Auto Migration Framework
• Migration Planner
• Metadata
• Migration Engine
• Controller: Process Manager, Task Invoker, Task Monitor
• DDL Generator, SQL Converter, Job Optimizer, Pipeline Generator
• Release Assistant, Data Mover, Data Validator

Auto Migration Components
Migration Planner
• Analyze and identify auto migration candidates
• Determine the order of table migration
Metadata
• Define and collect the metadata that enables the auto migration engine
• Includes table profile, data lineage, job lineage, SQL file profile, and pipeline profile
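
One way to picture "determine the order of table migration" is a topological sort over the data lineage, so that a table is migrated only after the tables it reads from. The sketch below is an illustration under that assumption, not the planner described in the talk; the table names and the `lineage` mapping are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical data lineage: each table maps to the upstream tables it reads from.
lineage = {
    "stg_orders": set(),
    "stg_users": set(),
    "dw_orders": {"stg_orders", "stg_users"},
    "rpt_daily_sales": {"dw_orders"},
}

def migration_order(lineage):
    """Return tables ordered so that every table comes after all of its upstream tables."""
    # TopologicalSorter raises graphlib.CycleError if the lineage contains a cycle.
    return list(TopologicalSorter(lineage).static_order())

order = migration_order(lineage)
print(order)
```

A cycle in the lineage surfaces as an exception here, which would flag tables that cannot be ordered automatically.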

Controller
• Manage the end-to-end migration process
• Includes sub-components such as the process manager, task invoker, and task monitor
DDL Generator
• A data modeler that generates DDL on Spark for target tables, working tables, and views
• Also sets the table format, buckets, and partitions
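
As a rough illustration of what a DDL generator emits, the sketch below builds a Spark SQL `CREATE TABLE` statement with a table format, partition columns, and buckets. The function name `generate_ddl`, its parameters, and the sample table are hypothetical; the slides do not show the actual generator.

```python
def generate_ddl(table, columns, fmt="parquet", partition_by=(), bucket_by=(), num_buckets=0):
    """Build a Spark SQL CREATE TABLE statement (illustrative sketch only).

    columns: list of (name, spark_type) pairs; partition columns are listed
    here too and referenced by name in PARTITIONED BY.
    """
    col_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE TABLE {table} (\n  {col_defs}\n)\nUSING {fmt}"
    if partition_by:
        ddl += f"\nPARTITIONED BY ({', '.join(partition_by)})"
    if bucket_by and num_buckets > 0:
        ddl += f"\nCLUSTERED BY ({', '.join(bucket_by)}) INTO {num_buckets} BUCKETS"
    return ddl

# Hypothetical target table, partitioned by date and bucketed by key.
ddl = generate_ddl(
    "dw.orders",
    [("order_id", "BIGINT"), ("amount", "DECIMAL(18,2)"), ("dt", "STRING")],
    partition_by=["dt"],
    bucket_by=["order_id"],
    num_buckets=64,
)
print(ddl)
```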

SQL Converter
• Split original SQL files into table transform + merge steps
• Parse the original ADBMS SQL into an abstract syntax tree and assemble it into Spark SQL
• Special rules to deal with SQL dialect differences and UDFs
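
The dialect and UDF rules can be pictured as a lookup from source function names to Spark equivalents. The converter described above works on an abstract syntax tree; the sketch below swaps in a much simpler regex rename purely to show the idea. The source names `ADBMS_STRLEN` and `ADBMS_UPPER` are made up for illustration; `length` and `upper` are Spark SQL built-ins.

```python
import re

# Hypothetical source-dialect function names mapped to Spark SQL built-ins.
FUNCTION_MAP = {
    "ADBMS_STRLEN": "length",
    "ADBMS_UPPER": "upper",
}

# Match a mapped function name followed by "(", ignoring case and extra spaces.
_CALL = re.compile(
    r"\b(" + "|".join(map(re.escape, FUNCTION_MAP)) + r")\s*\(",
    re.IGNORECASE,
)

def translate_functions(sql):
    """Rename mapped function calls; arguments are left untouched."""
    return _CALL.sub(lambda m: FUNCTION_MAP[m.group(1).upper()] + "(", sql)

converted = translate_functions("SELECT adbms_strlen(name), ADBMS_UPPER (city) FROM t")
print(converted)
```

A text-level rename like this would also rewrite matches inside string literals or comments, which is one reason to work on a parsed tree instead.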

Job Optimizer
• Pre-generate Spark job execution configurations based on table size and Spark cluster scale (typically spark.sql.shuffle.partitions)
• Leverage Spark Adaptive Execution to optimize the execution plan online
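
A minimal sketch of pre-generating a job configuration from table size and cluster scale might look like the following. The sizing rule (about 128 MB per shuffle partition, capped at a few tasks per core) and the names `job_conf`, `target_partition_bytes`, and `tasks_per_core` are assumptions for illustration, not the optimizer's actual heuristics.

```python
import math

def job_conf(table_bytes, total_cores, target_partition_bytes=128 * 1024**2, tasks_per_core=3):
    """Derive Spark SQL settings from input size and cluster size (assumed heuristic)."""
    by_size = math.ceil(table_bytes / target_partition_bytes)  # partitions needed for the data
    upper = total_cores * tasks_per_core                       # cap relative to cluster scale
    partitions = max(1, min(by_size, upper))
    return {
        "spark.sql.shuffle.partitions": str(partitions),
        # Let adaptive execution adjust the plan at run time.
        "spark.sql.adaptive.enabled": "true",
    }

small = job_conf(10 * 1024**3, total_cores=1000)  # 10 GiB table on a large cluster
large = job_conf(1024**4, total_cores=100)        # 1 TiB table on a small cluster
print(small, large)
```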

Pipeline Generator
• Generate workflows that set the execution steps and schedule of the Spark SQL files
Release Assistant
• Push code to the production environment and GitHub repo, and table creation ..
Data Mover
• Move data across platforms, for snapshot data preparation on DEV and historical data initialization on PROD
Data Validator
• Cross-platform data checksum on both DEV and PROD
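
A cross-platform checksum has to give the same answer regardless of row order, since the two platforms will not return rows in the same sequence. The sketch below shows one assumed approach: hash each row in a canonical text form and sum the hashes together with a row count. The canonical form used here (`str` of each value, a fixed token for NULL) is a simplification; real platforms would need agreed formatting for dates, decimals, and floats.

```python
import hashlib

def table_checksum(rows):
    """Return (row_count, order-independent 64-bit checksum) for an iterable of rows."""
    count, total = 0, 0
    for row in rows:
        # Canonical text form of the row: unit-separator-joined values, NULL as a fixed token.
        canonical = "\x1f".join("<NULL>" if v is None else str(v) for v in row)
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        # Summing per-row hashes makes the result independent of row order.
        total = (total + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, total

source_rows = [(1, "x", None), (2, "y", 3.5)]
target_rows = [(2, "y", 3.5), (1, "x", None)]   # same data, different order
changed_rows = [(1, "x", None), (2, "y", 3.6)]  # one value differs

print(table_checksum(source_rows) == table_checksum(target_rows))
```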

Key Components
• Metadata
• SQL Converter

Metadata - Overview
Neo4j, MySQL
Table Profile, SQL File Profile, Pipeline Profile, Data Lineage, Job Lineage

Metadata – Data Lineage

SQL Converter - Overview

SQL Converter – Conversion Rules
• Split original SQL files into table transformation and final table merge
• Identify ACID steps (merge update/delete/insert into one insert-overwrite step)
• Multiple update/delete cases – store intermediate results in a temp view and do a single final merge
• Special handling for cases like case sensitivity, date/timestamp calculations, column name aliases …
• Workarounds for known Spark issues
• Internal function & UDF translation
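
The idea behind merging update/delete/insert into one insert-overwrite step can be modeled on plain Python dictionaries: instead of mutating the table three times, compute the full new content in a single pass and write it once. This is a toy model keyed by primary key, with assumed semantics (delete, then update existing rows, then insert rows whose key is absent); it is not the converter's generated SQL.

```python
def apply_sequential(target, deletes, updates, inserts):
    """Baseline: DELETE, then UPDATE, then INSERT, each mutating the table."""
    table = dict(target)
    for key in deletes:
        table.pop(key, None)
    for key, value in updates.items():
        if key in table:              # UPDATE only touches existing rows
            table[key] = value
    for key, value in inserts.items():
        if key not in table:          # INSERT only adds rows with a new key
            table[key] = value
    return table

def apply_single_overwrite(target, deletes, updates, inserts):
    """One pass that builds the full new content, as a single INSERT OVERWRITE would."""
    new_content = {
        key: updates.get(key, value)  # take the updated value when there is one
        for key, value in target.items()
        if key not in deletes         # deleted rows are simply not carried over
    }
    for key, value in inserts.items():
        new_content.setdefault(key, value)
    return new_content

target = {1: "a", 2: "b", 3: "c"}
deletes = {2}
updates = {3: "c2", 9: "zz"}          # key 9 does not exist, so it is ignored
inserts = {4: "d", 1: "dup"}          # key 1 already exists, so it is ignored

merged = apply_single_overwrite(target, deletes, updates, inserts)
print(merged)
```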

SQL Converter – Sample

Tool Sets
• DDL Generator
• SQL Converter
• SQL Optimizer
• Pipeline Generator
• Release Assistant
• Data Mover
• Data Validator
• + Dev Suite

Major Challenges
• Metadata Definition & Collection
- You do not know what you do not know
• Data Validation
- Upstream data quality issues
- SQL behavior or data format difference on Spark
• Non-SQL Jobs
- Cannot cover logic in shell scripts or command lines in pipeline

Be part of community
~50 issues reported to the community during migration
Case-insensitive field resolution
• SPARK-25132 Case-insensitive field resolution when reading from Parquet
• SPARK-25175 Field resolution should fail if there's ambiguity for ORC native reader
• SPARK-25207 Case-insensitive field resolution for filter pushdown when reading Parquet
Parquet filter pushdown
• SPARK-23727 Support DATE predict push down in parquet
• SPARK-24716 Refactor ParquetFilters
• SPARK-24706 Support ByteType and ShortType pushdown to parquet
• SPARK-24549 Support DecimalType push down to the parquet data sources
• SPARK-24718 Timestamp support pushdown to parquet data source
• SPARK-24638 StringStartsWith support push down
• SPARK-17091 Convert IN predicate to equivalent Parquet filter
UDF Improvement
• SPARK-23900 format_number udf should take user specifed format as argument
• SPARK-23903 Add support for date extract
• SPARK-23905 Add UDF weekday
Bugs
• SPARK-24076 very bad performance when shuffle.partition = 8192
• SPARK-24556 ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
• SPARK-25084 "distribute by" on multiple columns may lead to codegen issue
• SPARK-25368 Incorrect constraint inference returns wrong result
