Correctness and Performance
of Apache Spark SQL
Spark + AI Summit, London
October 4, 2018
About us
NICOLAS POGGI
Databricks, Performance Engineer
• Spark benchmarking
Barcelona Supercomputing - Microsoft Research Centre
• Lead researcher, ALOJA project
• New architectures for Big Data
BarcelonaTech (UPC), PhD in Computer Architecture
• Autonomic resource manager for the cloud
• Web customer modeling
BOGDAN GHIT
Databricks, Software Engineer
• SQL performance optimizations
IBM T.J. Watson, Research Intern
• Bid advisor for cloud spot markets
Delft University of Technology, PhD in Computer Science
• Resource management in datacenters
• Performance of Spark, Hadoop
Databricks ecosystem
[Figure: the Databricks ecosystem, connecting Developers, Tools, DBR, Cluster Manager, Infrastructure, and Customers]
Databricks runtime (DBR) releases
Our goal is to make releases automatic and frequent
[Figure: release timeline with marks at Feb’18, Jun’18, Oct’18, Feb’19, Jun’19, Oct’19, Feb’20. Release phases: Beta, Full Support, Marked for deprecation, Deprecated. Releases shown: DBR 4.3 (Spark 2.3), DBR 4.3-LTS (Spark 2.3), DBR 5.0 (Spark 2.4)]
Apache Spark contributions
Hundreds of commits monthly to the Apache Spark project
[Figure: number of commits over time]
At this pace of development, mistakes are bound to happen
Where do these contributions go?
Scope of the testing
Developers put significant engineering effort into testing
Query
Input data
Configuration
Over 200 built-in functions
Yet another brick in the wall
Unit testing is not enough to guarantee correctness and performance
• Unit testing
• Integration
• E2E
• Micro benchmarks
• Plan stability
• Fuzz testing
• Macro benchmarks
• Stress testing
• Customer workloads
• Failure testing
Continuous Integration pipeline
[Figure: a cycle of three stages]
• Dev (Merge, Build) produces new artifacts
• Test (Correctness, Performance) produces metrics
• Analyze (Rules, Policies) produces alerts
Classification and alerting
[Figure: alerting flow: Events → Re-test → Alert → Classify → Root-cause]
• Event types: Failure (Correctness), Regression (Performance)
• Classification questions: Impact, Scope, Correlation, Confirm?
• Root-cause steps: Minimize, Drill-down, Profile, Compare, Validate
Correctness
Random query generation
[Figure: a query profile and a query model feed a translator, which emits a Spark query and a Postgres query; the two results are compared (Spark vs Postgres)]
DDL and datagen
• Choose a data type: BigInt, Boolean, Timestamp, Decimal, Float, Integer, SmallInt, String
• Random number of rows
• Random number of columns
• Random number of tables
• Random partition columns
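The random DDL step could be sketched as follows. This is a minimal illustration, not the generator used in the talk: the SQL type spellings, the `table_N` and `type_col_N` naming, the size limits, and the `USING parquet` clause are all assumptions.

```python
import random

# Data types listed on the slide; the exact SQL spellings are assumptions.
DATA_TYPES = ["BIGINT", "BOOLEAN", "TIMESTAMP", "DECIMAL(10,2)",
              "FLOAT", "INT", "SMALLINT", "STRING"]

def random_table(rng, index, max_cols=5, max_rows=5):
    """Return (ddl, row_count) for one randomly shaped table."""
    cols = []
    for i in range(rng.randint(2, max_cols)):
        dtype = rng.choice(DATA_TYPES)
        name = f"{dtype.split('(')[0].lower()}_col_{i}"
        cols.append((name, dtype))
    col_sql = ", ".join(f"{name} {dtype}" for name, dtype in cols)
    ddl = f"CREATE TABLE table_{index} ({col_sql}) USING parquet"
    # Randomly partition by one of the columns.
    if rng.random() < 0.5:
        ddl += f" PARTITIONED BY ({rng.choice(cols)[0]})"
    return ddl, rng.randint(1, max_rows)

def random_schema(seed, max_tables=3):
    """Generate a reproducible random set of tables from a seed."""
    rng = random.Random(seed)
    n_tables = rng.randint(1, max_tables)
    return [random_table(rng, i) for i in range(1, n_tables + 1)]

for ddl, n_rows in random_schema(seed=42):
    print(n_rows, ddl)
```

Seeding the generator keeps a failing schema reproducible, which matters once a query on it has to be minimized and re-run.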
Recursive query model
[Figure: recursive query model with three levels]
• Query: SQL Query
• Clause: WITH, FROM, UNION, SELECT, JOIN, WHERE, GROUP BY, ORDER BY
• Expression: Functions, Constant, Table, Column, Alias
Probabilistic query profile
Independent weights
• Optional query clauses
Inter-dependent weights
• Join types
• Select functions
[Figure: example clause weights. Clauses shown: ORDER BY, UNION, GROUP BY, WHERE. Values shown: 10%, 10%, 50%, 10%]
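A recursive generator driven by independent clause weights might look like the sketch below. The weight values, the `PROFILE` dictionary, and the restriction to `COALESCE` as the only function are assumptions made for illustration; inter-dependent weights and type checking are left out.

```python
import random

# Hypothetical profile: independent probability of emitting each optional clause.
PROFILE = {"WHERE": 0.5, "GROUP BY": 0.1, "ORDER BY": 0.1, "UNION": 0.1}

def gen_expr(rng, columns, depth=0):
    """Recursively build an expression from columns, constants and COALESCE calls."""
    if depth >= 2 or rng.random() < 0.5:
        # Leaf: a column, or (below the top level only) an integer constant.
        if depth > 0 and rng.random() < 0.3:
            return str(rng.randint(-100, 100))
        return rng.choice(columns)
    args = ", ".join(gen_expr(rng, columns, depth + 1) for _ in range(2))
    return f"COALESCE({args})"

def gen_query(rng, table, columns, depth=0):
    """Build one SELECT; each optional clause is added with its profile weight."""
    select = gen_expr(rng, columns)
    sql = f"SELECT {select} AS col_0 FROM {table}"
    if rng.random() < PROFILE["WHERE"]:
        sql += f" WHERE {gen_expr(rng, columns)} IS NOT NULL"
    if rng.random() < PROFILE["GROUP BY"]:
        sql += f" GROUP BY {select}"
    if depth == 0 and rng.random() < PROFILE["UNION"]:
        # Recursion: the right-hand side of the UNION is itself a generated query.
        sql += " UNION ALL " + gen_query(rng, table, columns, depth + 1)
    if depth == 0 and rng.random() < PROFILE["ORDER BY"]:
        sql += " ORDER BY col_0"
    return sql

rng = random.Random(7)
queries = [gen_query(rng, "table_4", ["smallint_col_3", "double_col_2"])
           for _ in range(10)]
for q in queries:
    print(q)
```

Raising one weight in the profile steers the generator toward the clause under test without changing the generator itself.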
Coalesce flattening (1/4)
SELECT COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3) AS int_col,
IF(NULL, VARIANCE(COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)),
COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_1,
STDDEV(t2.double_col_2) AS float_col,
COALESCE(MIN((t1.smallint_col_3) - (COALESCE(t2.smallint_col_3, t1.smallint_col_3,
t2.smallint_col_3))), COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3),
COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_2
FROM table_4 t1
INNER JOIN table_4 t2 ON (t2.timestamp_col_7) = (t1.timestamp_col_7)
WHERE (t1.smallint_col_3) IN (CAST('0.04' AS DECIMAL(10,10)), t1.smallint_col_3)
GROUP BY COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)
Small dataset with 2 tables of 5x5 size
Within 10 randomly generated queries
Error: Operation is in ERROR_STATE
Coalesce flattening (2/4)
Aggregate: COALESCE(COALESCE(foo.id, foo.val), 88) GROUP BY COALESCE(foo.id, foo.val)
  Project
    Join: foo.ts = bar.ts
      FILTER: foo.id IN (CAST('0.04' AS DECIMAL(10, 10)), foo.id)
        SCAN foo
      SCAN bar
Coalesce flattening (3/4)
Aggregate: COALESCE(COALESCE(foo.id, foo.val), 88) GROUP BY COALESCE(foo.id, foo.val)
  Project
    Join: foo.ts = bar.ts
      FILTER: foo.id IN (CAST('0.04' AS DECIMAL(10, 10)), foo.id)
        SCAN foo
      SCAN bar
Coalesce flattening (4/4)
Aggregate
Project
SCAN foo
Minimized query:
SELECT
COALESCE(COALESCE(foo.id, foo.val), 88)
FROM foo
GROUP BY
COALESCE(foo.id, foo.val)
Analyzing the error
● The optimizer flattens the nested COALESCE calls into COALESCE(foo.id, foo.val, 88)
● After flattening, the SELECT clause no longer contains the GROUP BY expression COALESCE(foo.id, foo.val)
● Possibly a problem with any GROUP BY expression that can be optimized
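The mechanism described in these bullets can be illustrated on a toy expression tree. This is a sketch of the idea only, not Spark's optimizer code; the tuple representation and the helper names `flatten_coalesce` and `contains` are hypothetical.

```python
def flatten_coalesce(expr):
    """Flatten nested calls: COALESCE(COALESCE(a, b), c) -> COALESCE(a, b, c)."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a column name or a literal
    op, *args = expr
    args = [flatten_coalesce(a) for a in args]
    if op == "COALESCE":
        flat = []
        for a in args:
            if isinstance(a, tuple) and a[0] == "COALESCE":
                flat.extend(a[1:])  # splice the inner arguments in place
            else:
                flat.append(a)
        args = flat
    return (op, *args)

def contains(expr, target):
    """True if target occurs as a subtree of expr."""
    if expr == target:
        return True
    return isinstance(expr, tuple) and any(contains(a, target) for a in expr[1:])

group_by = ("COALESCE", "foo.id", "foo.val")
select = ("COALESCE", group_by, 88)
optimized = flatten_coalesce(select)

print(contains(select, group_by))     # True: SELECT is built on the grouping expression
print(optimized)                      # ('COALESCE', 'foo.id', 'foo.val', 88)
print(contains(optimized, group_by))  # False: the grouping expression has disappeared
```

Once the grouping expression is no longer a subtree of the SELECT expression, a check that every non-aggregated output is derived from a grouping expression can no longer succeed, which is consistent with the error observed.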
Lead function (1/3)
SELECT (t1.decimal0803_col_3) / (t1.decimal0803_col_3) AS decimal_col,
CAST(696 AS STRING) AS char_col, t1.decimal0803_col_3,
(COALESCE(CAST('0.02' AS DECIMAL(10,10)),
CAST('0.47' AS DECIMAL(10,10)),
CAST('-0.53' AS DECIMAL(10,10)))) +
(LEAD(-65, 4) OVER (ORDER BY (t1.decimal0803_col_3) / (t1.decimal0803_col_3),
CAST(696 AS STRING))) AS decimal_col_1,
CAST(-349 AS STRING) AS char_col_1
FROM table_16 t1
WHERE (943) > (889)
Error: Column 4 in row 10 does not match:
[1.0, 696, -871.81, <<-64.98>>, -349] SPARK row
[1.0, 696, -871.81, <<None>>, -349] POSTGRESQL row
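A mismatch report like the one above could come from a cell-by-cell comparison of the two result sets. The sketch below is an assumption about how such a check might work, not the tool used in the talk; the function names and the numeric tolerance are hypothetical, and rows are assumed to arrive in the same order from both engines.

```python
from decimal import Decimal

def cells_match(a, b, tolerance):
    """NULL only matches NULL; numbers match within a tolerance; others must be equal."""
    if a is None or b is None:
        return a is None and b is None
    numeric = (int, float, Decimal)
    if isinstance(a, numeric) and isinstance(b, numeric):
        return abs(Decimal(str(a)) - Decimal(str(b))) <= tolerance
    return a == b

def first_mismatch(spark_rows, postgres_rows, tolerance=Decimal("0.01")):
    """Return the 1-based (row, column) of the first differing cell, or None."""
    if len(spark_rows) != len(postgres_rows):
        raise ValueError("result sets have different row counts")
    for r, (s_row, p_row) in enumerate(zip(spark_rows, postgres_rows), start=1):
        for c, (s, p) in enumerate(zip(s_row, p_row), start=1):
            if not cells_match(s, p, tolerance):
                return (r, c)
    return None

spark = [(1.0, "696", -871.81, -64.98, "-349")]
postgres = [(1.0, "696", -871.81, None, "-349")]
print(first_mismatch(spark, postgres))  # (1, 4)
```

Treating NULL as equal only to NULL is what surfaces this bug: a tolerance-based numeric comparison alone would have no way to compare -64.98 with a missing value.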
Lead function (2/3)
Project: COALESCE(expr) + LEAD(-65, 4) OVER (ORDER BY expr)
  FILTER: WHERE expr
    SCAN foo
Lead function (3/3)
Project: COALESCE(expr) + LEAD(-65, 4) OVER (ORDER BY expr)
  FILTER: WHERE expr
    SCAN foo
Analyzing the error
● Using constant input values breaks the behaviour of the LEAD function
● SC-16633: https://github.com/apache/spark/pull/14284
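For reference, the standard behaviour of LEAD can be sketched in a few lines: for the last `offset` rows of the ordered partition there is no row to read, so the result is the default (NULL when none is given), even if the input is a constant. The 12-row partition below is an assumption; the source does not state the table size.

```python
from decimal import Decimal

def lead(values, offset=1, default=None):
    """LEAD over an already ordered partition: the value offset rows ahead, else default."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]

def add_nullable(a, b):
    """SQL addition: NULL if either operand is NULL."""
    return None if a is None or b is None else a + b

rows = 12  # assumed partition size
lead_col = lead([-65] * rows, offset=4)
# COALESCE('0.02', '0.47', '-0.53') evaluates to its first non-NULL argument, 0.02.
decimal_col_1 = [add_nullable(Decimal("0.02"), v) for v in lead_col]

print(lead_col[-4:])       # [None, None, None, None]
print(decimal_col_1[0])    # -64.98
print(decimal_col_1[-1])   # None
```

Under these semantics the PostgreSQL row with `None` is the expected result, and the Spark value of -64.98 in the same position is the discrepancy the comparison flagged.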
Performance
Benchmarking tools
• We use the public spark-sql-perf library for TPC workloads
  • Provides datagen and import scripts
  • local, cluster, S3
  • Dashboards for analyzing results
  • https://github.com/databricks/spark-sql-perf
• The Spark micro benchmarks
• And the async-profiler
  • to produce flamegraphs
[Figure: CPU Flame Graph. Source: http://www.brendangregg.com/flamegraphs.html]
Per-query drill-down: Q67
First, scope and validate
• Spark 2.4-master (dev) compared to Spark 2.3 in DBR 4.3 (prod)
Query 67: 18% regression, from 320s to 390s
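The slide does not say how the 18% is computed. A quick check, as a sketch with a hypothetical helper, shows the figure is consistent with measuring the extra time relative to the new (2.4) runtime; relative to the 2.3 baseline the same slowdown is larger.

```python
def pct_change(old_s, new_s, relative_to="new"):
    """Percentage change in runtime, relative to either the new or the old value."""
    base = new_s if relative_to == "new" else old_s
    return 100.0 * (new_s - old_s) / base

vs_new = pct_change(320, 390, relative_to="new")
vs_old = pct_change(320, 390, relative_to="old")
print(round(vs_new, 1))  # 17.9
print(round(vs_old, 1))  # 21.9
```

Stating the base explicitly avoids ambiguity when the same regression is reported by different tools or dashboards.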
Q67 executor profile for Spark 2.4-master
Side-by-side 2.3 vs 2.4: find the differences
[Figure: flame graphs for Spark 2.3 and Spark 2.4 side by side]
Flamegraph diff, zoomed in (red: slower, white: new)
• unsafe/Platform.copyMemory()
• unsafe/BytesToBytesMap.safeLookup
• New: hash/Murmur3_x86_32.hashUTF8String()
• Murmur3_x86_32.hashUnsafeBytesBlock()
Look for hints:
- Mem mgmt
- Hashing
- unsafe
Root-causing
Results:
• Spark 2.3: hashUnsafeBytes() -> 40µs
• Spark 2.4: hashUnsafeBytesBlock() -> 140µs
• Also slower: UTF8String.getBytes()
Microbenchmark for UTF8String
Git bisect
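Git bisect is a binary search over the commit history, driven by a test that labels each commit good or bad. The sketch below shows the search itself with a microbenchmark result as the test; the commit names, the timing table, and the 1.5x threshold are hypothetical, while the 40µs and 140µs values are the ones reported above.

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first commit where is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]

# Hypothetical history: the hashing microbenchmark jumps from 40µs to 140µs at c6.
commits = [f"c{i}" for i in range(10)]
hash_time_us = {c: (40 if i < 6 else 140) for i, c in enumerate(commits)}
tested = []

def is_slow(commit, baseline_us=40):
    """Stand-in for building the commit and running the microbenchmark."""
    tested.append(commit)
    return hash_time_us[commit] > 1.5 * baseline_us

print(first_bad_commit(commits, is_slow))  # c6
print(len(tested))                         # number of builds that were benchmarked
```

The number of builds to benchmark grows with the logarithm of the history length, which is what makes bisecting practical when hundreds of commits land every month.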
It is a journey to get a release out
DBR and Spark testing and performance are a continuous effort
• Over a month of effort to take performance from regressing to improving
[Figure: TPC-DS regression of 2.4-master vs. 2.3 at SF 1000, going from 15% to 5% to below 0%]
Conclusion
Spark in production is not just the framework
Unit and integration testing are not enough
We need Spark-specific tools to automate the process
and ensure both correctness and performance
Thanks!
