Pavel Hardak (Product Manager, Workday)
Jianneng Li (Software Engineer, Workday)
Lessons Learned Using Apache Spark
for Self-Service Data Prep (and More)
in SaaS World
#UnifiedAnalytics #SparkAISummit
This presentation may contain forward-looking statements for which there are risks, uncertainties, and
assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions
could differ materially from results implied by the forward-looking statements. Forward-looking statements
include any statements regarding strategies or plans for future operations; any statements concerning new
features, enhancements or upgrades to our existing applications or plans for future applications; and any
statements of belief. Further information on risks that could affect Workday’s results is included in our filings
with the Securities and Exchange Commission which are available on the Workday investor relations
webpage: www.workday.com/company/investor_relations.php
Workday assumes no obligation for and does not intend to update any forward-looking statements. Any
unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap,
blog, our website, press release or public statement that are not currently available are subject to change at
Workday’s discretion and may not be delivered as planned or at all.
Customers who purchase Workday, Inc. services should make their purchase decisions upon services,
features, and functions that are currently available.
Safe Harbor Statement
#UnifiedAnalytics #SparkAISummit 2
Agenda
● Workday - Finance and HCM in the cloud
● Workday Platform - “Power of One”
● Prism Analytics - Powered by Apache Spark
● Production Stories & Lessons Learned
● Questions
3#UnifiedAnalytics #SparkAISummit
#UnifiedAnalytics #SparkAISummit 4
● “Pure” SaaS apps suite
○ Finance and HCM
● Customers: 2,500+
○ 200+ of Fortune 500
● Revenue: $2.82B
○ Growth: 32% YoY
Plan
Execute
Analyze
Planning
Financial Management
Human Capital
Management
Prism Analytics
and Reporting
Workday Confidential
#UnifiedAnalytics #SparkAISummit 5
6
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
#UnifiedAnalytics #SparkAISummit
#UnifiedAnalytics #SparkAISummit 7
Durable
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
Object Data Model
Metadata | Extensible
#UnifiedAnalytics #SparkAISummit 8
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
Security
Encryption Privacy and
Compliance
Trust
#UnifiedAnalytics #SparkAISummit 9
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
Reporting and Analytics
Dashboards | Distribution | Collaboration
#UnifiedAnalytics #SparkAISummit 10
Plan
Execute
Analyze
Planning
Financial Management
Human Capital
Management
Prism Analytics
and Reporting
Workday Planning
Workday
Financial Management
Workday
Human Capital
Management
Workday Prism
Analytics and
Reporting
Prism Analytics
Integrate 3rd
Party Data
Data Management
Data Preparation
Data Discovery
Report Publishing
11#UnifiedAnalytics #SparkAISummit
Plan
Execute
Analyze
Planning
Financial Management
Human Capital
Management
Prism
Analytics and
Reporting
Workday Prism Analytics
The full spectrum of Finance and HCM insights, all within Workday.
Workday Data + Non-Workday Data
#UnifiedAnalytics #SparkAISummit 12
Finance, HCM
Operational
Industry systems
Legacy systems More…
CRM Service ticketing
Surveys Point of Sale
Stock grants
Map
Ingest
Acquisition | Preparation | Analysis
Reporting
Worksheets
Data Discovery
Cleanse and Transform
Blend Datasets
Apply Security Permissions
Publish Data Source
Prism Analytics Workflow
13#UnifiedAnalytics #SparkAISummit
Prism
Prism
Prism
HDFS / S3
Query Engine
Spark
Driver
Spark
Executor
Interactive
Data Prep
Spark
Driver
Spark
Executor
Spark
Driver
Data Prep
Publishing
YARN
Spark
Executor
Spark
Executor
Spark in Prism Analytics
#UnifiedAnalytics #SparkAISummit 14
Interactive Data Prep in Prism
Transform Stages
Number of samples
Examples and statistics
15#UnifiedAnalytics #SparkAISummit
Interactive Data Prep in Prism
16#UnifiedAnalytics #SparkAISummit
Interactive Data Prep in Prism
Powered by Spark
Edit Transform
17#UnifiedAnalytics #SparkAISummit
Data Prep Publishing in Prism
Also powered by Spark
18#UnifiedAnalytics #SparkAISummit
19#UnifiedAnalytics #SparkAISummit
           Interactive           Publishing
Data size  100–100K rows         Billions of rows
Sampling   Yes                   No
Caching    Yes                   No
Latency    Seconds               Minutes to hours
Result     Returned in memory    Written to disk
SLA        Best effort           Consistent performance
Data Prep: Interactive vs. Publishing
20#UnifiedAnalytics #SparkAISummit
Data Prep: Interactive vs. Publishing
Same plan!
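To make the "same plan" idea concrete, here is a minimal sketch (names such as DataPrepModes and applyTransforms are hypothetical, not Prism's code) of one pipeline definition backing both modes: the interactive path runs it on a cached sample and returns rows in memory, while publishing runs the identical transforms over the full dataset and writes the result to storage.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

object DataPrepModes {
  // The pipeline's transformations ("the plan"), defined once.
  def applyTransforms(input: DataFrame): DataFrame =
    input
      .filter(col("income").isNotNull)
      .withColumn("federal_tax", col("income") * 0.28)

  // Interactive: small cached sample, results returned in memory within seconds.
  def interactivePreview(input: DataFrame, fraction: Double = 0.01): Array[Row] =
    applyTransforms(input.sample(withReplacement = false, fraction).cache())
      .limit(1000)
      .collect()

  // Publishing: the same transforms over the full dataset, written to disk.
  def publish(input: DataFrame, outputPath: String): Unit =
    applyTransforms(input).write.mode("overwrite").parquet(outputPath)
}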
Prism Logical Model
21#UnifiedAnalytics #SparkAISummit
Prism Logical Model
• Superset of SQL operators
• Compiles to Spark plans through Spark SQL
• Implements custom Catalyst rules and strategies
22#UnifiedAnalytics #SparkAISummit
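As a concrete illustration of the rules-and-strategies hook (a toy example, not one of Prism's actual rules): Spark exposes spark.experimental.extraOptimizations and spark.experimental.extraStrategies for plugging custom Catalyst rules and planning strategies into a SparkSession.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Upper
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A toy optimizer rule: collapse UPPER(UPPER(x)) into UPPER(x).
object CollapseDoubleUpper extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan.transformAllExpressions {
      case Upper(Upper(child)) => Upper(child)
    }
}

val spark = SparkSession.builder().appName("catalyst-extension").getOrCreate()
// Register the rule with Catalyst; custom planning strategies hook in the
// same way via spark.experimental.extraStrategies.
spark.experimental.extraOptimizations ++= Seq(CollapseDoubleUpper)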
Example: Interactive Data Prep Operators
23#UnifiedAnalytics #SparkAISummit
IngestSampler
LogicalIngestSampler
IngestSamplerExec
IngestSamplerRDD
Prism Logical Plan
RDD
Spark Physical Plan
Spark Logical Plan
Prism Data Types
24#UnifiedAnalytics #SparkAISummit
Implementing Additional Data Types
• Prism has a richer type system than Catalyst
• Uses StructType and StructField to implement
additional data types
25#UnifiedAnalytics #SparkAISummit
Example: Prism Currency Type
object CurrencyType extends StructType(
  Array(
    StructField("amount", DecimalType(26, 6)),
    StructField("code", StringType)))

>> { "amount": 1000.000000, "code": "USD" }
>> { "amount": -999.000000, "code": "YEN" }
26#UnifiedAnalytics #SparkAISummit
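A hedged usage sketch for the type above (not taken from Prism's code): because CurrencyType extends StructType, it is itself a Spark DataType and can be used as a field type in a schema. The example assumes a SparkSession named spark is in scope; the rows are made up.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._

// Schema with a currency-valued column (CurrencyType as defined above).
val schema = StructType(Seq(
  StructField("employee", StringType),
  StructField("salary", CurrencyType)))

// Illustrative rows; each nested Row matches CurrencyType's two fields.
val rows = Seq(
  Row("Alice", Row(BigDecimal("100000.000000"), "USD")),
  Row("Bob", Row(BigDecimal("9500000.000000"), "JPY")))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Nested fields are addressable with ordinary dot notation.
df.select($"employee", $"salary.amount", $"salary.code").show()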
Lessons Learned
27#UnifiedAnalytics #SparkAISummit
Lesson #1: Nested SQL
28#UnifiedAnalytics #SparkAISummit
Lesson #1: Nested SQL
29#UnifiedAnalytics #SparkAISummit
• SQL requires computed columns to be nested
– SELECT 1 as c1, c1 + 1 as c2; /* ✗ */
– SELECT c1 + 1 as c2 FROM (SELECT 1 as c1); /* ✓ */
• First version: one nesting per computed column
– Does not scale to 100s of columns
– Takes a long time to compile and optimize
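A hedged sketch of the fix (hypothetical helper, not the production compiler): compute each derived column's dependency level — the length of the longest chain back to a base column — and emit one nested SELECT per level instead of one per column. Applied to the dependency graph on the next slide, this yields the two-level query shown a few slides later.

// Assign each computed column its nesting level from the dependency graph.
def assignLevels(dependencies: Map[String, Set[String]],
                 baseColumns: Set[String]): Map[String, Int] = {
  val memo = scala.collection.mutable.Map[String, Int]()
  def level(col: String): Int = memo.getOrElseUpdate(col, {
    if (baseColumns.contains(col)) 0
    else 1 + dependencies(col).map(level).max
  })
  dependencies.keys.foreach(level)
  memo.toMap
}

val deps = Map(
  "full.name"   -> Set("first.name", "last.name"),
  "federal.tax" -> Set("income"),
  "state.tax"   -> Set("income"),
  "email"       -> Set("full.name"))

// Map(full.name -> 1, federal.tax -> 1, state.tax -> 1, email -> 2):
// two levels of nesting instead of four.
println(assignLevels(deps, Set("first.name", "last.name", "income")))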
Lesson #1: Example Dependency Graph
[first.name], [last.name], [income],
concat([first.name], ".", [last.name]) as [full.name],
[income] * 0.28 as [federal.tax],
[income] * 0.10 as [state.tax],
concat([full.name], "@workday.com") as [email]
first.name last.name income
full.name federal.tax
email
state.tax
2nd level
1st level
30#UnifiedAnalytics #SparkAISummit
select [income] * 0.10 as [state_tax], *
from (select [income] * 0.28 as [federal_tax], *
from (select concat([full.name], "@workday.com") as [email], *
from (select concat([first.name], ".", [last.name]) as [full.name], *
from (select [first.name], [last.name], [income] from base_table))))
Lesson #1: SQL Before Optimization
4 levels of nested SQL
31#UnifiedAnalytics #SparkAISummit
Lesson #1: SQL After Optimization
2 levels of nested SQL
32
select concat([full.name], "@workday.com") as [email], *
from (select concat([first.name], ".", [last.name]) as [full.name],
[income] * 0.28 as [federal_tax],
[income] * 0.10 as [state_tax], *
from (select [first.name], [last.name], [income] from base_table))
#UnifiedAnalytics #SparkAISummit
Lesson #2: Plan Blowup
33#UnifiedAnalytics #SparkAISummit
Lesson #2: Plan Blowup
34#UnifiedAnalytics #SparkAISummit
• Generated plans can have duplicate operators
• E.g. self joins and self unions
• Need to de-duplicate to improve performance
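An illustrative DataFrame-level sketch of the same idea (the column name "id" and the sampling fraction are made up): compute the shared subplan once, cache it, and let every branch reference the cached result — this is what the Cache(ID=…) nodes on the following slides represent in the Prism logical plan.

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

def buildBranches(datasetA: DataFrame, datasetB: DataFrame): Seq[DataFrame] = {
  // Shared subplan, computed once and cached.
  val sampledA = datasetA
    .sample(withReplacement = false, fraction = 0.01)
    .persist(StorageLevel.MEMORY_AND_DISK)

  // Both joins reference the same cached DataFrame; Spark scans the cache
  // instead of re-parsing and re-sampling the source for each branch.
  val join1 = sampledA.join(datasetB, Seq("id"))
  val join2 = sampledA.join(datasetB, Seq("id"))
  Seq(sampledA, join1, join2)
}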
Lesson #2: Deduping Prism Logical Plan
35#UnifiedAnalytics #SparkAISummit
Union(
Sample(k=100,
Parse(“Dataset A”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Union(
Cache(ID=1,
Sample(k=100,
Parse(“Dataset A”))),
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Lesson #2: Deduping Prism Logical Plan
36#UnifiedAnalytics #SparkAISummit
Union(
Sample(k=100,
Parse(“Dataset A”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Union(
Cache(ID=1,
Sample(k=100,
Parse(“Dataset A”))),
Cache(ID=2,
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”))),
Cache(ID=2, ∅)
)
Lesson #2: Deduping Prism Logical Plan
37#UnifiedAnalytics #SparkAISummit
Union(
Sample(k=100,
Parse(“Dataset A”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Lesson #2: Deduping Spark Tree String
38#UnifiedAnalytics #SparkAISummit
(1) Project #2
+- (2) Join #2
:- (3) Project #1
: +- (4) Join #1
: :- (5) Scan #1
: +- (6) Scan #1
+- (7) Project #1
+- (8) Join #1
:- (9) Scan #1
+- (10) Scan #1
Lesson #2: Deduping Spark Tree String
39#UnifiedAnalytics #SparkAISummit
(1) Project #2
+- (2) Join #2
:- (3) Project #1
: +- (4) Join #1
: :- (5) Scan #1
: +- (6) Scan #1
+- (7) Project #1
+- (8) Join #1
:- (9) Scan #1
+- (10) Scan #1
(1) Project #2
+- (2) Join #2
:- (3) Project #1
: +- (4) Join #1
: :- (5) Scan #1
: +- (6) Lines 5-5
+- (7) Lines 3-6
Lesson #3: Broadcast Join Tuning
40#UnifiedAnalytics #SparkAISummit
(Diagram: broadcast join — each node keeps its partition of the large table, while the small table is copied in full to every node before the join.)
#UnifiedAnalytics #SparkAISummit 41
Lesson #3: Broadcast Join Review
• Spark’s broadcasting mechanism is inefficient
– Broadcasted data goes through the driver
– No global limit on broadcasted data
– Complex jobs can make the driver run out of memory
Lesson #3: Spark Broadcast
42#UnifiedAnalytics #SparkAISummit
Driver
Executor 1
Executor 2
(1) Driver collects broadcasted data from executors
(2) Driver sends broadcasted data to executors
• Initially disabled broadcast joins for stability
• Expectation: a small number of joins, all of them large
Lesson #3: Disabling Broadcast Joins
43#UnifiedAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold = -1
Lesson #3: Re-Enabling Broadcast Joins
44
• Reality: a large number of joins, many of them small
• Re-enabled broadcast join with a low threshold
• 2-10x runtime improvement
#UnifiedAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold = 1000000
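For reference, the settings above can be applied at runtime roughly like this (values illustrative; largeDf, smallDf and the join key are hypothetical), and an individual join can still be broadcast explicitly with a hint regardless of the threshold:

// -1 disables automatic broadcast joins entirely; ~1 MB re-enables them
// only for very small build sides.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")      // disable
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "1000000") // ~1 MB

// Explicit broadcast hint for a single join:
import org.apache.spark.sql.functions.broadcast
val joined = largeDf.join(broadcast(smallDf), Seq("id"))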
Lesson #4: Case-Insensitive Grouping
45#UnifiedAnalytics #SparkAISummit
Prism
Prism
Prism
HDFS / S3
Query Engine
Spark
Driver
Spark
Executor
Interactive
Data Prep
Spark
Driver
Spark
Executor
Spark
Driver
Data Prep
Publishing
YARN
Spark
Executor
Spark
Executor
Lesson #4: Spark in Query Engine
#UnifiedAnalytics #SparkAISummit 46
47#UnifiedAnalytics #SparkAISummit
Lesson #4: Spark in Query Engine
Sum of Billing Amount per Billing Location
BillingLocation BillingAmount
CALIFORNIA 100000
california 50000
california 40000
Illinois 25000
Texas 15000
TeXas 60000
texas 5000
BillingLocation TotalBillingAmount
CALIFORNIA 100000
california 90000
TeXas 60000
Illinois 25000
Texas 15000
texas 5000
SELECT BillingLocation,
SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY BillingLocation
ORDER BY TotalBillingAmount
48#UnifiedAnalytics #SparkAISummit
Lesson #4: Grouping on String Columns
Sum of Billing Amount per Billing Location
SELECT MIN(BillingLocation) AS BillingLocation,
SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY UPPER(BillingLocation)
ORDER BY TotalBillingAmount
BillingLocation TotalBillingAmount
CALIFORNIA 190000
TeXas 80000
Illinois 25000
In Workday, grouping on string columns is case-insensitive
49
BillingLocation BillingAmount
CALIFORNIA 100000
california 50000
california 40000
Illinois 25000
Texas 15000
TeXas 60000
texas 5000
#UnifiedAnalytics #SparkAISummit
Lesson #4: Grouping on String Columns
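The same rewrite expressed with the DataFrame API, as a hedged sketch (insuranceClaims stands in for the InsuranceClaims table): group on the upper-cased key so that "Texas", "TeXas" and "texas" collapse into one group, and keep MIN of the original strings as the display value.

import org.apache.spark.sql.functions.{col, min, sum, upper}

val result = insuranceClaims
  .groupBy(upper(col("BillingLocation")).as("GroupKey"))
  .agg(
    min(col("BillingLocation")).as("BillingLocation"),
    sum(col("BillingAmount")).as("TotalBillingAmount"))
  .orderBy(col("TotalBillingAmount").desc)  // descending, as in the table above
  .drop("GroupKey")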
GROUP BY stringField
GROUP BY UPPER(stringField)
+
MIN(stringField)
~7x regression
50#UnifiedAnalytics #SparkAISummit
Lesson #4: Case-Insensitive Grouping is Costly
Aggregation on strings uses Spark's SortAggregate operator
➔ Modified Spark’s HashAggregate to support strings
Regression reduced to
~3x
(Chart: SortAggregate vs. HashAggregate runtimes)
51#UnifiedAnalytics #SparkAISummit
Lesson #4: Aggregation on String Columns
In Spark's HashAggregate operator, functions used in
the grouping expressions were evaluated twice
Regression reduced to
~2x
(Chart: UPPER evaluated twice vs. UPPER evaluated only once)
52#UnifiedAnalytics #SparkAISummit
Lesson #4: Reducing Function Evaluations
Precompute uppercase for all characters
➔ replace toUpperCase() on each character with a simple array lookup
Regression reduced to ~1.5x
(and we want to reduce it further...)
(Chart: UPPER vs. optimized UPPER runtimes)
53#UnifiedAnalytics #SparkAISummit
Lesson #4: Optimizing Spark’s UPPER Function
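A minimal sketch of the lookup-table idea (not Spark's or Workday's actual patch; it ignores locale-sensitive and one-to-many mappings such as ß → SS, which a real implementation must handle):

object FastUpper {
  // Precompute the uppercase form of every char once (~128 KB table).
  private val table: Array[Char] =
    Array.tabulate(65536)(i => Character.toUpperCase(i.toChar))

  // Uppercase a string with array lookups instead of per-char toUpperCase calls.
  def upper(s: String): String = {
    val out = new Array[Char](s.length)
    var i = 0
    while (i < s.length) {
      out(i) = table(s.charAt(i))
      i += 1
    }
    new String(out)
  }
}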
And one more thing...
54#UnifiedAnalytics #SparkAISummit
HDFS / S3
Prism 1
Tenant 1
Prism 2
Tenant 2
Prism 3
Tenant 3
Prism 4
Tenant 4
Spark Cluster Spark Cluster Spark Cluster Spark Cluster
Current – Single-Tenanted Spark Clusters
55#UnifiedAnalytics #SparkAISummit
HDFS / S3
Spark Cluster Spark Cluster
Tenant 1 Tenant 2 Tenant 3 Tenant 4 Tenant 5 Tenant 6 Tenant 7 Tenant 8
Prism 1 Prism 2 Prism 3
Spark Cluster
Future – Multi-Tenanted Spark Clusters
56#UnifiedAnalytics #SparkAISummit
Questions?
57
workday.com/careers
#UnifiedAnalytics #SparkAISummit