Pavel Hardak (Product Manager, Workday)
Jianneng Li (Software Engineer, Workday)
Lessons Learned Using Apache Spark
for Self-Service Data Prep (and More)
in SaaS World
#UnifiedAnalytics #SparkAISummit
This presentation may contain forward-looking statements for which there are risks, uncertainties, and
assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions
could differ materially from results implied by the forward-looking statements. Forward-looking statements
include any statements regarding strategies or plans for future operations; any statements concerning new
features, enhancements or upgrades to our existing applications or plans for future applications; and any
statements of belief. Further information on risks that could affect Workday’s results is included in our filings
with the Securities and Exchange Commission which are available on the Workday investor relations
webpage: www.workday.com/company/investor_relations.php
Workday assumes no obligation for and does not intend to update any forward-looking statements. Any
unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap,
blog, our website, press release or public statement that are not currently available are subject to change at
Workday’s discretion and may not be delivered as planned or at all.
Customers who purchase Workday, Inc. services should make their purchase decisions upon services,
features, and functions that are currently available.
Safe Harbor Statement
#UnifiedAnalytics #SparkAISummit 2
Agenda
● Workday - Finance and HCM in the cloud
● Workday Platform - “Power of One”
● Prism Analytics - Powered by Apache Spark
● Production Stories & Lessons Learned
● Questions
3#UnifiedAnalytics #SparkAISummit
#UnifiedAnalytics #SparkAISummit 4
● “Pure” SaaS apps suite
○ Finance and HCM
● Customers: 2,500+
○ 200+ of Fortune 500
● Revenue: $2.82B
○ Growth: 32% YoY
Plan
Execute
Analyze
Planning
Financial Management
Human Capital
Management
Prism Analytics
and Reporting
Workday Confidential
#UnifiedAnalytics #SparkAISummit 5
6
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
#UnifiedAnalytics #SparkAISummit
#UnifiedAnalytics #SparkAISummit 7
Durable
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
Object Data Model
Metadata | Extensible
#UnifiedAnalytics #SparkAISummit 8
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
Security
Encryption Privacy and
Compliance
Trust
#UnifiedAnalytics #SparkAISummit 9
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
Reporting and Analytics
Dashboards | Distribution | Collaboration
#UnifiedAnalytics #SparkAISummit 10
Plan
Execute
Analyze
Planning
Financial Management
Human Capital
Management
Prism Analytics
and Reporting
Workday Planning
Workday
Financial Management
Workday
Human Capital
Management
Workday Prism
Analytics and
Reporting
Prism Analytics
Integrate 3rd
Party Data
Data Management
Data Preparation
Data Discovery
Report Publishing
11#UnifiedAnalytics #SparkAISummit
Plan
Execute
Analyze
Planning
Financial Management
Human Capital
Management
Prism
Analytics and
Reporting
Workday Prism Analytics
The full spectrum of Finance and HCM insights, all within Workday.
Workday Data + Non-Workday Data
#UnifiedAnalytics #SparkAISummit 12
Finance, HCM
Operational
Industry systems
Legacy systems More…
CRM Service ticketing
Surveys Point of Sale
Stock grants
Map
Ingest
Acquisition | Preparation | Analysis
Reporting
Worksheets
Data Discovery
Cleanse and Transform
Blend Datasets
Apply Security Permissions
Publish Data Source
Prism Analytics Workflow
13#UnifiedAnalytics #SparkAISummit
Prism
Prism
Prism
HDFS / S3
Query Engine
Spark
Driver
Spark
Executor
Interactive
Data Prep
Spark
Driver
Spark
Executor
Spark
Driver
Data Prep
Publishing
YARN
Spark
Executor
Spark
Executor
Spark in Prism Analytics
#UnifiedAnalytics #SparkAISummit 14
Interactive Data Prep in Prism
Transform Stages
Number of samples
Examples and statistics
15#UnifiedAnalytics #SparkAISummit
Interactive Data Prep in Prism
16#UnifiedAnalytics #SparkAISummit
Interactive Data Prep in Prism
Powered by Spark
Edit Transform
17#UnifiedAnalytics #SparkAISummit
Data Prep Publishing in Prism
Also powered by Spark
18#UnifiedAnalytics #SparkAISummit
19#UnifiedAnalytics #SparkAISummit
           Interactive           Publishing
Data size  100–100K rows         Billions of rows
Sampling   Yes                   No
Caching    Yes                   No
Latency    Seconds               Minutes to hours
Result     Returned in memory    Written to disk
SLA        Best effort           Consistent performance
Data Prep: Interactive vs. Publishing
20#UnifiedAnalytics #SparkAISummit
Data Prep: Interactive vs. Publishing
Same plan!
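To make the "same plan" idea concrete, here is a minimal sketch (names such as DataPrepModes and applyTransforms are hypothetical, not Prism's code) of one pipeline definition backing both modes: the interactive path runs it on a cached sample and returns rows in memory, while publishing runs the identical transforms over the full dataset and writes the result to storage.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

object DataPrepModes {
  // The pipeline's transformations ("the plan"), defined once.
  def applyTransforms(input: DataFrame): DataFrame =
    input
      .filter(col("income").isNotNull)
      .withColumn("federal_tax", col("income") * 0.28)

  // Interactive: small cached sample, results returned in memory within seconds.
  def interactivePreview(input: DataFrame, fraction: Double = 0.01): Array[Row] =
    applyTransforms(input.sample(withReplacement = false, fraction).cache())
      .limit(1000)
      .collect()

  // Publishing: the same transforms over the full dataset, written to disk.
  def publish(input: DataFrame, outputPath: String): Unit =
    applyTransforms(input).write.mode("overwrite").parquet(outputPath)
}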
Prism Logical Model
21#UnifiedAnalytics #SparkAISummit
Prism Logical Model
• Superset of SQL operators
• Compiles to Spark plans through Spark SQL
• Implements custom Catalyst rules and strategies
22#UnifiedAnalytics #SparkAISummit
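As a concrete illustration of the rules-and-strategies hook (a toy example, not one of Prism's actual rules): Spark exposes spark.experimental.extraOptimizations and spark.experimental.extraStrategies for plugging custom Catalyst rules and planning strategies into a SparkSession.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Upper
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A toy optimizer rule: collapse UPPER(UPPER(x)) into UPPER(x).
object CollapseDoubleUpper extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan.transformAllExpressions {
      case Upper(Upper(child)) => Upper(child)
    }
}

val spark = SparkSession.builder().appName("catalyst-extension").getOrCreate()
// Register the rule with Catalyst; custom planning strategies hook in the
// same way via spark.experimental.extraStrategies.
spark.experimental.extraOptimizations ++= Seq(CollapseDoubleUpper)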
Example: Interactive Data Prep Operators
23#UnifiedAnalytics #SparkAISummit
IngestSampler
LogicalIngestSampler
IngestSamplerExec
IngestSamplerRDD
Prism Logical Plan
RDD
Spark Physical Plan
Spark Logical Plan
Prism Data Types
24#UnifiedAnalytics #SparkAISummit
Implementing Additional Data Types
• Prism has a richer type system than Catalyst
• Uses StructType and StructField to implement
additional data types
25#UnifiedAnalytics #SparkAISummit
Example: Prism Currency Type
object CurrencyType extends StructType(
  Array(
    StructField("amount", DecimalType(26, 6)),
    StructField("code", StringType)))

>> { "amount": 1000.000000, "code": "USD" }
>> { "amount": -999.000000, "code": "YEN" }
26#UnifiedAnalytics #SparkAISummit
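A hedged usage sketch for the type above (not taken from Prism's code): because CurrencyType extends StructType, it is itself a Spark DataType and can be used as a field type in a schema. The example assumes a SparkSession named spark is in scope; the rows are made up.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._

// Schema with a currency-valued column (CurrencyType as defined above).
val schema = StructType(Seq(
  StructField("employee", StringType),
  StructField("salary", CurrencyType)))

// Illustrative rows; each nested Row matches CurrencyType's two fields.
val rows = Seq(
  Row("Alice", Row(BigDecimal("100000.000000"), "USD")),
  Row("Bob", Row(BigDecimal("9500000.000000"), "JPY")))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Nested fields are addressable with ordinary dot notation.
df.select($"employee", $"salary.amount", $"salary.code").show()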
Lessons Learned
27#UnifiedAnalytics #SparkAISummit
Lesson #1: Nested SQL
28#UnifiedAnalytics #SparkAISummit
Lesson #1: Nested SQL
29#UnifiedAnalytics #SparkAISummit
• SQL requires computed columns to be nested
– SELECT 1 as c1, c1 + 1 as c2; /* ✗ */
– SELECT c1 + 1 as c2 FROM (SELECT 1 as c1); /* ✓ */
• First version: one nesting per computed column
– Does not scale to 100s of columns
– Takes a long time to compile and optimize
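A hedged sketch of the fix (hypothetical helper, not the production compiler): compute each derived column's dependency level — the length of the longest chain back to a base column — and emit one nested SELECT per level instead of one per column. Applied to the dependency graph on the next slide, this yields the two-level query shown a few slides later.

// Assign each computed column its nesting level from the dependency graph.
def assignLevels(dependencies: Map[String, Set[String]],
                 baseColumns: Set[String]): Map[String, Int] = {
  val memo = scala.collection.mutable.Map[String, Int]()
  def level(col: String): Int = memo.getOrElseUpdate(col, {
    if (baseColumns.contains(col)) 0
    else 1 + dependencies(col).map(level).max
  })
  dependencies.keys.foreach(level)
  memo.toMap
}

val deps = Map(
  "full.name"   -> Set("first.name", "last.name"),
  "federal.tax" -> Set("income"),
  "state.tax"   -> Set("income"),
  "email"       -> Set("full.name"))

// Map(full.name -> 1, federal.tax -> 1, state.tax -> 1, email -> 2):
// two levels of nesting instead of four.
println(assignLevels(deps, Set("first.name", "last.name", "income")))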
Lesson #1: Example Dependency Graph
[first.name], [last.name], [income],
concat([first.name], ".", [last.name]) as [full.name],
[income] * 0.28 as [federal.tax],
[income] * 0.10 as [state.tax],
concat([full.name], "@workday.com") as [email]
first.name last.name income
full.name federal.tax
email
state.tax
2nd level
1st level
30#UnifiedAnalytics #SparkAISummit
select [income] * 0.10 as [state_tax], *
from (select [income] * 0.28 as [federal_tax], *
from (select concat([full.name], "@workday.com") as [email], *
from (select concat([first.name], ".", [last.name]) as [full.name], *
from (select [first.name], [last.name], [income] from base_table))))
Lesson #1: SQL Before Optimization
4 levels of nested SQL
31#UnifiedAnalytics #SparkAISummit
Lesson #1: SQL After Optimization
2 levels of nested SQL
32
select concat([full.name], "@workday.com") as [email], *
from (select concat([first.name], ".", [last.name]) as [full.name],
[income] * 0.28 as [federal_tax],
[income] * 0.10 as [state_tax], *
from (select [first.name], [last.name], [income] from base_table))
#UnifiedAnalytics #SparkAISummit
Lesson #2: Plan Blowup
33#UnifiedAnalytics #SparkAISummit
Lesson #2: Plan Blowup
34#UnifiedAnalytics #SparkAISummit
• Generated plans can have duplicate operators
• E.g. self joins and self unions
• Need to de-duplicate to improve performance
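An illustrative DataFrame-level sketch of the same idea (the column name "id" and the sampling fraction are made up): compute the shared subplan once, cache it, and let every branch reference the cached result — this is what the Cache(ID=…) nodes on the following slides represent in the Prism logical plan.

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

def buildBranches(datasetA: DataFrame, datasetB: DataFrame): Seq[DataFrame] = {
  // Shared subplan, computed once and cached.
  val sampledA = datasetA
    .sample(withReplacement = false, fraction = 0.01)
    .persist(StorageLevel.MEMORY_AND_DISK)

  // Both joins reference the same cached DataFrame; Spark scans the cache
  // instead of re-parsing and re-sampling the source for each branch.
  val join1 = sampledA.join(datasetB, Seq("id"))
  val join2 = sampledA.join(datasetB, Seq("id"))
  Seq(sampledA, join1, join2)
}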
Lesson #2: Deduping Prism Logical Plan
35#UnifiedAnalytics #SparkAISummit
Union(
Sample(k=100,
Parse(“Dataset A”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Union(
Cache(ID=1,
Sample(k=100,
Parse(“Dataset A”))),
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Lesson #2: Deduping Prism Logical Plan
36#UnifiedAnalytics #SparkAISummit
Union(
Sample(k=100,
Parse(“Dataset A”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Union(
Cache(ID=1,
Sample(k=100,
Parse(“Dataset A”))),
Cache(ID=2,
Join(
Cache(ID=1, ∅),
Parse(“Dataset B”))),
Cache(ID=2, ∅)
)
Lesson #2: Deduping Prism Logical Plan
37#UnifiedAnalytics #SparkAISummit
Union(
Sample(k=100,
Parse(“Dataset A”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”)),
Join(
Sample(k=100,
Parse(“Dataset A”)),
Parse(“Dataset B”))
)
Lesson #2: Deduping Spark Tree String
38#UnifiedAnalytics #SparkAISummit
(1) Project #2
+- (2) Join #2
:- (3) Project #1
: +- (4) Join #1
: :- (5) Scan #1
: +- (6) Scan #1
+- (7) Project #1
+- (8) Join #1
:- (9) Scan #1
+- (10) Scan #1
Lesson #2: Deduping Spark Tree String
39#UnifiedAnalytics #SparkAISummit
(1) Project #2
+- (2) Join #2
:- (3) Project #1
: +- (4) Join #1
: :- (5) Scan #1
: +- (6) Scan #1
+- (7) Project #1
+- (8) Join #1
:- (9) Scan #1
+- (10) Scan #1
(1) Project #2
+- (2) Join #2
:- (3) Project #1
: +- (4) Join #1
: :- (5) Scan #1
: +- (6) Lines 5-5
+- (7) Lines 3-6
Lesson #3: Broadcast Join Tuning
40#UnifiedAnalytics #SparkAISummit
(Diagram: broadcast join — each node keeps its partition of the large table, while the small table is copied in full to every node before the join.)
#UnifiedAnalytics #SparkAISummit 41
Lesson #3: Broadcast Join Review
• Spark’s broadcasting mechanism is inefficient
– Broadcasted data goes through the driver
– No global limit on broadcasted data
– Complex jobs can make the driver run out of memory
Lesson #3: Spark Broadcast
42#UnifiedAnalytics #SparkAISummit
Driver
Executor 1
Executor 2
(1) Driver collects broadcasted data from executors
(2) Driver sends broadcasted data to executors
• Initially disabled broadcast joins for stability
• Expectation: a small number of joins, all of them large
Lesson #3: Disabling Broadcast Joins
43#UnifiedAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold = -1
Lesson #3: Re-Enabling Broadcast Joins
44
• Reality: a large number of joins, many of them small
• Re-enabled broadcast join with a low threshold
• 2-10x runtime improvement
#UnifiedAnalytics #SparkAISummit
spark.sql.autoBroadcastJoinThreshold = 1000000
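For reference, the settings above can be applied at runtime roughly like this (values illustrative; largeDf, smallDf and the join key are hypothetical), and an individual join can still be broadcast explicitly with a hint regardless of the threshold:

// -1 disables automatic broadcast joins entirely; ~1 MB re-enables them
// only for very small build sides.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")      // disable
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "1000000") // ~1 MB

// Explicit broadcast hint for a single join:
import org.apache.spark.sql.functions.broadcast
val joined = largeDf.join(broadcast(smallDf), Seq("id"))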
Lesson #4: Case-Insensitive Grouping
45#UnifiedAnalytics #SparkAISummit
Prism
Prism
Prism
HDFS / S3
Query Engine
Spark
Driver
Spark
Executor
Interactive
Data Prep
Spark
Driver
Spark
Executor
Spark
Driver
Data Prep
Publishing
YARN
Spark
Executor
Spark
Executor
Lesson #4: Spark in Query Engine
#UnifiedAnalytics #SparkAISummit 46
47#UnifiedAnalytics #SparkAISummit
Lesson #4: Spark in Query Engine
Sum of Billing Amount per Billing Location
BillingLocation BillingAmount
CALIFORNIA 100000
california 50000
california 40000
Illinois 25000
Texas 15000
TeXas 60000
texas 5000
BillingLocation TotalBillingAmount
CALIFORNIA 100000
california 90000
TeXas 60000
Illinois 25000
Texas 15000
texas 5000
SELECT BillingLocation,
SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY BillingLocation
ORDER BY TotalBillingAmount
48#UnifiedAnalytics #SparkAISummit
Lesson #4: Grouping on String Columns
Sum of Billing Amount per Billing Location
SELECT MIN(BillingLocation) AS BillingLocation,
SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY UPPER(BillingLocation)
ORDER BY TotalBillingAmount
BillingLocation TotalBillingAmount
CALIFORNIA 190000
TeXas 80000
Illinois 25000
In Workday, grouping on string columns is case-insensitive
49
BillingLocation BillingAmount
CALIFORNIA 100000
california 50000
california 40000
Illinois 25000
Texas 15000
TeXas 60000
texas 5000
#UnifiedAnalytics #SparkAISummit
Lesson #4: Grouping on String Columns
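The same rewrite expressed with the DataFrame API, as a hedged sketch (insuranceClaims stands in for the InsuranceClaims table): group on the upper-cased key so that "Texas", "TeXas" and "texas" collapse into one group, and keep MIN of the original strings as the display value.

import org.apache.spark.sql.functions.{col, min, sum, upper}

val result = insuranceClaims
  .groupBy(upper(col("BillingLocation")).as("GroupKey"))
  .agg(
    min(col("BillingLocation")).as("BillingLocation"),
    sum(col("BillingAmount")).as("TotalBillingAmount"))
  .orderBy(col("TotalBillingAmount").desc)  // descending, as in the table above
  .drop("GroupKey")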
GROUP BY stringField
GROUP BY UPPER(stringField)
+
MIN(stringField)
~7x regression
50#UnifiedAnalytics #SparkAISummit
Lesson #4: Case-Insensitive Grouping is Costly
Aggregation on strings uses Spark's SortAggregate operator
➔ Modified Spark’s HashAggregate to support strings
Regression reduced to
~3x
(Chart: SortAggregate vs. HashAggregate runtimes)
51#UnifiedAnalytics #SparkAISummit
Lesson #4: Aggregation on String Columns
In Spark's HashAggregate operator, functions used in
the grouping expressions were evaluated twice
Regression reduced to
~2x
(Chart: UPPER evaluated twice vs. UPPER evaluated only once)
52#UnifiedAnalytics #SparkAISummit
Lesson #4: Reducing Function Evaluations
Precompute uppercase for all characters
➔ replace toUpperCase() on each character with a simple array lookup
Regression reduced to ~1.5x
(and we want to reduce it further...)
(Chart: UPPER vs. optimized UPPER runtimes)
53#UnifiedAnalytics #SparkAISummit
Lesson #4: Optimizing Spark’s UPPER Function
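A minimal sketch of the lookup-table idea (not Spark's or Workday's actual patch; it ignores locale-sensitive and one-to-many mappings such as ß → SS, which a real implementation must handle):

object FastUpper {
  // Precompute the uppercase form of every char once (~128 KB table).
  private val table: Array[Char] =
    Array.tabulate(65536)(i => Character.toUpperCase(i.toChar))

  // Uppercase a string with array lookups instead of per-char toUpperCase calls.
  def upper(s: String): String = {
    val out = new Array[Char](s.length)
    var i = 0
    while (i < s.length) {
      out(i) = table(s.charAt(i))
      i += 1
    }
    new String(out)
  }
}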
And one more thing...
54#UnifiedAnalytics #SparkAISummit
HDFS / S3
Prism 1
Tenant 1
Prism 2
Tenant 2
Prism 3
Tenant 3
Prism 4
Tenant 4
Spark Cluster Spark Cluster Spark Cluster Spark Cluster
Current – Single-Tenanted Spark Clusters
55#UnifiedAnalytics #SparkAISummit
HDFS / S3
Spark Cluster Spark Cluster
Tenant 1 Tenant 2 Tenant 3 Tenant 4 Tenant 5 Tenant 6 Tenant 7 Tenant 8
Prism 1 Prism 2 Prism 3
Spark Cluster
Future – Multi-Tenanted Spark Clusters
56#UnifiedAnalytics #SparkAISummit
Questions?
57
workday.com/careers
#UnifiedAnalytics #SparkAISummit