Fast and Reliable Apache Spark SQL Releases
DataWorks Summit Barcelona
March 21st, 2019
About us
NICOLAS POGGI
Databricks, Performance Engineer
• Spark benchmarking
Barcelona Supercomputing - Microsoft Research Centre
• Lead researcher ALOJA project
• New architectures for Big Data
BarcelonaTech (UPC), PhD in Computer Architecture
• Autonomic resource manager for the cloud
• Web customer modeling
BOGDAN GHIT
Databricks, Software Engineer
• Spark performance
IBM T.J. Watson Research Center
• Research intern on big data
• Bid advisor for cloud spot markets
Delft University of Technology, PhD in Computer Science
• Resource management in datacenters
• Performance of Spark, Hadoop
Databricks ecosystem
[Diagram: Developers, Tools, DBR, Cluster Manager, Infrastructure, Customers]
Databricks runtime (DBR) releases
Our goal is to make releases automatic and frequent
[Timeline: Feb'18, Aug'18, Nov'18, Apr'19, Jul'19, Oct'19, Feb'20. Release states: Beta, Full Support, Marked for deprecation, Deprecated]
• DBR 4.3: Spark 2.3
• DBR 5.0: Spark 2.4
• DBR 5.3-LTS*: Spark 2.4
• DBR 6.0*: Spark 3.0
* dates and LTS-tag new releases are subject to change
Apache Spark contributions
Hundreds of commits monthly to the Apache Spark project
[Chart: number of commits]
At this pace of development, mistakes are bound to happen
Where do these contributions go?
Scope of the testing
Developers put significant engineering effort into testing
• Query
• Input data
• Configuration
Over 200 built-in functions
Yet another brick in the wall
Unit testing is not enough to guarantee correctness and performance
• Unit testing
• Integration
• E2E
• Micro benchmarks
• Plan stability
• Fuzz testing
• Macro benchmarks
• Stress testing
• Customer workloads
• Failure testing
Continuous integration pipeline
[Diagram: Dev -> Test -> Analyze cycle. Arrow labels: New artifacts, Metrics, Alerts]
• Dev: Merge, Build
• Test: Correctness, Performance
• Analyze: Rules, Policies
Classification and alerting
[Diagram: Events -> Re-test -> Alert -> Classify -> Root-cause]
• Event types: Failure (Correctness), Regression (Performance)
• Classify: Impact, Scope, Correlation, Confirm?
• Root-cause: Minimize, Drill-down, Profile, Compare, Validate
Correctness
How SQLite is tested
• Anomaly testing
  • Out-of-memory testing
  • IO-error testing
  • Crash testing
  • Compound failure tests
• Fuzz testing
  • SQL Fuzz
  • Malformed database files
  • Boundary value tests
How SQLite Is Tested: https://www.sqlite.org/testing.html
Spark SQL behind the scenes
[Diagram: SQL AST / DataFrame -> Unresolved Logical Plan -> Analysis (with Catalog) -> Logical Plan -> Logical Optimization -> Optimized Logical Plan -> Physical Planning -> Physical Plans -> Cost Model -> Selected Physical Plan -> Code Generation -> RDDs]
SQL operators can be represented as trees
Phases of transformation prepare the trees for execution
Rules can be applied once or to a fixed point
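The difference between applying a rule once and applying it to a fixed point can be sketched on a toy expression tree. This is a minimal sketch, not Spark's Catalyst code: the tuple encoding and the names `fold_pass`, `apply_once` and `apply_to_fixed_point` are assumptions made for illustration.

```python
def fold_pass(expr):
    """One top-down pass: fold ("add", int, int), otherwise visit the children."""
    if isinstance(expr, tuple) and expr[0] == "add":
        _, left, right = expr
        if isinstance(left, int) and isinstance(right, int):
            return left + right
        return ("add", fold_pass(left), fold_pass(right))
    return expr  # literals and column names are left alone


def apply_once(rule, plan):
    """Run the rule a single time."""
    return rule(plan)


def apply_to_fixed_point(rule, plan, max_iterations=100):
    """Re-run the rule until the plan stops changing."""
    for _ in range(max_iterations):
        new_plan = rule(plan)
        if new_plan == plan:
            return plan
        plan = new_plan
    raise RuntimeError("rule did not converge")


tree = ("add", ("add", 1, 2), 3)
print(apply_once(fold_pass, tree))
print(apply_to_fixed_point(fold_pass, tree))
```

On this tree a single pass folds only the inner addition. Repeating the pass until nothing changes folds the whole expression.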
Random query generation
[Diagram: Query profile -> Model -> translator -> Spark Query vs Postgres Query]
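The "vs" step, comparing what the two engines return for the same query, might look like the sketch below. It assumes both result sets have already been fetched as lists of rows in the same order. The helper names `values_equal` and `first_mismatch` and the float tolerance are illustrative choices, not part of the framework described in the talk.

```python
import math

NUMERIC = (int, float)


def values_equal(a, b, rel_tol=1e-9):
    """NULL (None) only equals NULL; numbers are compared with a tolerance."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, NUMERIC) and isinstance(b, NUMERIC):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def first_mismatch(spark_rows, postgres_rows):
    """Return the 1-based (row, column) of the first difference, or None."""
    if len(spark_rows) != len(postgres_rows):
        # Row counts differ: report the first missing row, with no column.
        return (min(len(spark_rows), len(postgres_rows)) + 1, None)
    for r, (s_row, p_row) in enumerate(zip(spark_rows, postgres_rows), start=1):
        for c, (s, p) in enumerate(zip(s_row, p_row), start=1):
            if not values_equal(s, p):
                return (r, c)
    return None


spark = [[1, "a", 2.5], [2, "b", 0.0]]
postgres = [[1, "a", 2.5], [2, "b", None]]
print(first_mismatch(spark, postgres))
```

Treating NULL as a distinct value matters here: a NULL on one side and a number on the other is a correctness failure, not a rounding difference.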
DDL and datagen
• Choose a data type: BigInt, Boolean, Timestamp, Decimal, Float, Integer, SmallInt, String
• Random number of rows
• Random number of columns
• Random number of tables
• Random partition columns
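A random schema generator along these lines could be sketched as follows. The limits (`max_tables`, `max_columns`, `max_rows`), the helper names and the Spark-style `CREATE TABLE ... USING parquet` text are assumptions for illustration. The column and table naming imitates the generated queries shown in the talk.

```python
import random

# Type list taken from the slide; the DDL spellings are an assumption.
DATA_TYPES = ["BIGINT", "BOOLEAN", "TIMESTAMP", "DECIMAL",
              "FLOAT", "INT", "SMALLINT", "STRING"]


def random_table(rng, index, max_columns=5, max_rows=5):
    """Pick a random number of typed columns, partition columns and rows."""
    columns = []
    for c in range(1, rng.randint(1, max_columns) + 1):
        dtype = rng.choice(DATA_TYPES)
        columns.append((f"{dtype.lower()}_col_{c}", dtype))
    names = [name for name, _ in columns]
    # Always leave at least one column that is not a partition column.
    partition_by = rng.sample(names, rng.randint(0, len(names) - 1))
    return {"name": f"table_{index}", "columns": columns,
            "partition_by": partition_by, "rows": rng.randint(1, max_rows)}


def random_schema(rng, max_tables=3):
    """Pick a random number of tables."""
    return [random_table(rng, i) for i in range(1, rng.randint(1, max_tables) + 1)]


def to_ddl(table):
    """Render one table as a CREATE TABLE statement."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in table["columns"])
    ddl = f"CREATE TABLE {table['name']} ({cols}) USING parquet"
    if table["partition_by"]:
        ddl += f" PARTITIONED BY ({', '.join(table['partition_by'])})"
    return ddl


for table in random_schema(random.Random(42)):
    print(to_ddl(table), "-- rows:", table["rows"])
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when a generated schema later has to be replayed to confirm a failure.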
Recursive query model
[Diagram: tree of a SQL Query. Legend: Query, Clause, Expression. Nodes: WITH, FROM, UNION, SELECT, JOIN, WHERE, GROUP BY, ORDER BY, Functions, Constant, Table, Column, Alias]
Probabilistic query profile
Independent weights
• Optional query clauses
Inter-dependent weights
• Join types
• Select functions
[Figure: example clause weights. Clauses: ORDER BY, UNION, GROUP BY, WHERE. Weights shown: 10%, 10%, 50%, 10%]
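The two kinds of weights can be sketched as below. Independent weights decide each optional clause on its own, while inter-dependent weights pick exactly one alternative from a group. The numbers, the fixed query skeleton and the function names are hypothetical and only illustrate the idea.

```python
import random

# Hypothetical weights, chosen only for illustration.
OPTIONAL_CLAUSES = {"WHERE": 0.5, "GROUP BY": 0.1, "ORDER BY": 0.1}
JOIN_TYPES = {"INNER JOIN": 0.6, "LEFT JOIN": 0.3, "FULL OUTER JOIN": 0.1}


def pick_optional_clauses(rng, weights):
    """Independent weights: each clause is kept or dropped on its own."""
    return [clause for clause, p in weights.items() if rng.random() < p]


def pick_one(rng, weights):
    """Inter-dependent weights: exactly one alternative is chosen."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def random_query(rng):
    """Assemble a query from a fixed skeleton plus the sampled choices."""
    clauses = pick_optional_clauses(rng, OPTIONAL_CLAUSES)
    join = pick_one(rng, JOIN_TYPES)
    sql = f"SELECT t1.a FROM t1 {join} t2 ON t1.k = t2.k"
    if "WHERE" in clauses:
        sql += " WHERE t1.a > 0"
    if "GROUP BY" in clauses:
        sql += " GROUP BY t1.a"
    if "ORDER BY" in clauses:
        sql += " ORDER BY t1.a"
    return sql


rng = random.Random(7)
for _ in range(3):
    print(random_query(rng))
```

Raising one weight steers the generator toward the part of the engine under test without changing the query model itself.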
Coalesce flattening (1/4)
SELECT COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3) AS int_col,
IF(NULL, VARIANCE(COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)),
COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_1,
STDDEV(t2.double_col_2) AS float_col,
COALESCE(MIN((t1.smallint_col_3) - (COALESCE(t2.smallint_col_3, t1.smallint_col_3,
t2.smallint_col_3))), COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3),
COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_2
FROM table_4 t1
INNER JOIN table_4 t2 ON (t2.timestamp_col_7) = (t1.timestamp_col_7)
WHERE (t1.smallint_col_3) IN (CAST('0.04' AS DECIMAL(10,10)), t1.smallint_col_3)
GROUP BY COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)
Small dataset with 2 tables of 5x5 size
Within 10 randomly generated queries
Error: Operation is in ERROR_STATE
Coalesce flattening (2/4)
[Plan diagram]
Aggregate: GROUP BY COALESCE(foo.id, foo.val)
  Project: COALESCE(COALESCE(foo.id, foo.val), 88)
    Join: foo.ts = bar.ts
      FILTER: foo.id IN (CAST('0.04' AS DECIMAL(10, 10)), foo.id)
        SCAN foo
      SCAN bar
Coalesce flattening (3/4)
[Plan diagram]
Aggregate: COALESCE(foo.id, foo.val)
  Project: COALESCE(COALESCE(foo.id, foo.val), 88)
    Join: foo.ts = bar.ts
      FILTER: foo.id IN (CAST('0.04' AS DECIMAL(10, 10)), foo.id)
        SCAN t1
      SCAN t2
Coalesce flattening (4/4)
[Plan diagram: Aggregate -> Project -> SCAN foo]
Minimized query:
SELECT COALESCE(COALESCE(foo.id, foo.val), 88)
FROM foo
GROUP BY COALESCE(foo.id, foo.val)
Analyzing the error
● The optimizer flattens the nested COALESCE calls
● After flattening, the SELECT clause no longer contains the GROUP BY expression
● Possibly a problem with any GROUP BY expression that can be optimized
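Why flattening breaks the minimized query can be reproduced on a toy expression tree. The sketch below is not the optimizer rule itself: the tuple encoding and the helpers `flatten_coalesce` and `contains` are assumptions used to show that the grouping expression stops being a sub-expression of the SELECT expression.

```python
def flatten_coalesce(expr):
    """Rewrite COALESCE(COALESCE(a, b), c) into COALESCE(a, b, c)."""
    if isinstance(expr, tuple) and expr[0] == "coalesce":
        args = []
        for arg in expr[1:]:
            arg = flatten_coalesce(arg)
            if isinstance(arg, tuple) and arg[0] == "coalesce":
                args.extend(arg[1:])  # splice the nested arguments in place
            else:
                args.append(arg)
        return ("coalesce", *args)
    return expr


def contains(expr, target):
    """True if target appears as a sub-expression of expr."""
    if expr == target:
        return True
    if isinstance(expr, tuple):
        return any(contains(child, target) for child in expr[1:])
    return False


group_by = ("coalesce", "foo.id", "foo.val")
select = ("coalesce", group_by, 88)

print(contains(select, group_by))                    # before flattening
print(contains(flatten_coalesce(select), group_by))  # after flattening
```

Before the rewrite the SELECT expression is built on top of the grouping expression. After it, the flattened `COALESCE(foo.id, foo.val, 88)` refers to the raw columns, which are not grouping expressions.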
Lead function (1/3)
SELECT (t1.decimal0803_col_3) / (t1.decimal0803_col_3) AS decimal_col,
CAST(696 AS STRING) AS char_col, t1.decimal0803_col_3,
(COALESCE(CAST('0.02' AS DECIMAL(10,10)),
CAST('0.47' AS DECIMAL(10,10)),
CAST('-0.53' AS DECIMAL(10,10)))) +
(LEAD(-65, 4) OVER (ORDER BY (t1.decimal0803_col_3) / (t1.decimal0803_col_3),
CAST(696 AS STRING))) AS decimal_col_1,
CAST(-349 AS STRING) AS char_col_1
FROM table_16 t1
WHERE (943) > (889)
Error: Column 4 in row 10 does not match:
[1.0, 696, -871.81, <<-64.98>>, -349] SPARK row
[1.0, 696, -871.81, <<None>>, -349] POSTGRESQL row
Lead function (2/3)
[Plan diagram]
Project: COALESCE(expr) + LEAD(-65, 4) OVER (ORDER BY expr)
  FILTER: WHERE expr
    SCAN foo
Lead function (3/3)
[Plan diagram]
Project: COALESCE(expr) + LEAD(-65, 4) OVER (ORDER BY expr)
  FILTER: WHERE expr
    SCAN foo
Analyzing the error
● Using constant input values breaks the behaviour of the LEAD function
● SPARK-16633: https://github.com/apache/spark/pull/14284
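The expected LEAD semantics can be written down as a small reference model. This is a sketch under two assumptions: the table has 10 rows, and SQL NULL is modelled as `None`. The helper names are illustrative.

```python
from decimal import Decimal


def lead(values, offset=1, default=None):
    """Value `offset` rows after the current row, or `default` past the end."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]


def add_nullable(a, b):
    """SQL-style addition: NULL plus anything is NULL."""
    return None if a is None or b is None else a + b


rows = 10                       # assumed table size
constant_input = [-65] * rows   # LEAD(-65, 4): the input is a constant
led = lead(constant_input, offset=4)

# COALESCE('0.02', '0.47', '-0.53') evaluates to its first non-NULL argument.
column_4 = [add_nullable(Decimal("0.02"), v) for v in led]
print(column_4[5])  # row 6: a value 4 rows ahead still exists
print(column_4[9])  # row 10: no row 4 positions ahead, so NULL
```

In this model row 10 is NULL, which agrees with the PostgreSQL row in the error message. The Spark value of -64.98 equals 0.02 + (-65), which is what results if the constant is returned for every row instead of NULL for the last four.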
Query operator coverage analysis
In 15 minutes (500 queries), we reach the maximum coverage of the framework
Performance
Benchmarking tools
• We use the public spark-sql-perf library for TPC workloads
  • Provides datagen and import scripts
    • local, cluster, S3
  • Dashboards for analyzing results
• The Spark micro benchmarks
• And the async-profiler
  • to produce flamegraphs
https://github.com/databricks/spark-sql-perf
[Figure: CPU Flame Graph. Source: http://www.brendangregg.com/flamegraphs.html]
Per query drill-down: q67
First, scope and validate
• Compare 2.4-master (dev) to 2.3 in DBR 4.3 (prod)
Query 67: 18% regression, from 320s to 390s
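The size of the q67 regression depends on which runtime is used as the denominator. The sketch below works through both conventions with the timings from the slide. The function names and the 5% alert threshold are assumptions for illustration.

```python
def slowdown_vs_baseline_pct(baseline_s, candidate_s):
    """Extra time as a percentage of the old (baseline) runtime."""
    return (candidate_s - baseline_s) / baseline_s * 100


def share_of_new_runtime_pct(baseline_s, candidate_s):
    """Extra time as a percentage of the new (candidate) runtime."""
    return (candidate_s - baseline_s) / candidate_s * 100


def is_regression(baseline_s, candidate_s, threshold_pct=5.0):
    """Flag the query when the slowdown exceeds a (hypothetical) threshold."""
    return slowdown_vs_baseline_pct(baseline_s, candidate_s) > threshold_pct


spark_23, spark_24 = 320.0, 390.0  # q67 runtimes in seconds, from the slide
print(f"{slowdown_vs_baseline_pct(spark_23, spark_24):.1f}% vs the 2.3 baseline")
print(f"{share_of_new_runtime_pct(spark_23, spark_24):.1f}% of the 2.4 runtime")
print(is_regression(spark_23, spark_24))
```

The 70 extra seconds are about 17.9% of the 2.4 runtime, which matches the 18% on the slide. Measured against the 2.3 baseline the same timings give about 21.9%, so an alerting rule should state which convention it uses.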
Q67 executor profile for Spark 2.4-master
Side-by-side 2.3 vs 2.4: find the differences
[Figure: flame graphs for Spark 2.3 and Spark 2.4]
Flame graph diff, zoomed (red: slower, white: new)
unsafe/Platform.copyMemory()
unsafe/BytesToBytesMap.safeLookup
New: hash/Murmur3_x86_32.hashUTF8String()
Murmur3_x86_32.hashUnsafeBytesBlock()
Look for hints:
- Mem mgmt
- Hashing
- unsafe
Root-causing
Results:
• Spark 2.3: hashUnsafeBytes() -> 40µs
• Spark 2.4: hashUnsafeBytesBlock() -> 140µs
• also slower UTF8String.getBytes()
Microbenchmark for UTF8String
GIT BISECT
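Combining a microbenchmark with a bisect amounts to a binary search for the first commit whose timing crosses a threshold. The sketch below models that search in plain Python and is not the `git bisect` tool itself. The commit list, the timings and the 100µs threshold are made up for illustration, apart from the 40µs and 140µs figures quoted above.

```python
def first_bad_commit(commits, is_bad):
    """Binary search; assumes commits[0] is good and commits[-1] is bad."""
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression was introduced at mid or earlier
        else:
            lo = mid  # the regression was introduced after mid
    return commits[hi]


# Hypothetical history: 100 commits, the hashing change lands at commit 37.
commits = list(range(100))
timings_us = {c: (140 if c >= 37 else 40) for c in commits}
runs = []


def microbenchmark_is_slow(commit, threshold_us=100):
    """Stand-in for building the commit and running the microbenchmark."""
    runs.append(commit)
    return timings_us[commit] > threshold_us


print(first_bad_commit(commits, microbenchmark_is_slow))
print(len(runs), "benchmark runs")
```

Each step halves the range, so about log2(N) benchmark runs are enough. That is why a fast microbenchmark is preferable to re-running the full query at every step.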
It is a journey to get a release out
DBR and Spark testing and performance are a continuous effort
• Over a month of effort to bring performance back to improving
[Chart: TPC-DS, 2.4-master vs. 2.3 at SF 1000. Values shown: 15%, 5%, < 0%]
… a journey that pays off quickly
Query times have improved over 2X in the Spark 2.x branch, measured on the Databricks platform
Note: neither 2.4.1 nor 3.0.0 is released yet
Conclusion
Spark in production is not just the framework
Unit and integration testing are not enough
We need Spark-specific tools to automate the process to ensure both correctness and performance
Thanks!
Feedback: {Nico.Poggi, Bogdan.Ghit}@databricks.com
