Accelerating Data Processing in Spark SQL with Pandas UDFs
Michael Tong, Quantcast
Machine Learning Engineer
Agenda
Review of Pandas UDFs
Review what they are and go over some
development tips
Modeling at Quantcast
How we use Spark SQL in production
Example Problem
Introduce a real problem from our model training
pipeline that will be the main focus of our
optimization efforts for this talk.
Optimization tips and tricks
Iteratively and aggressively optimize this problem
with pandas UDFs
Optimization Tricks
Do more things in memory
In-memory loops beat Spark SQL intermediate rows. Look for
ways to do as much work in memory as possible
Aggregate Keys
Try to reduce the number of unique keys in your
data and/or process multiple keys in a single UDF
call.
Use inverted indices
Works especially well with sparse data.
Use python libraries
Pandas is easy to work with but slow; use other
Python libraries for better performance
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Review of Pandas UDFs
What are Pandas UDFs?
▪ UDF = User Defined Function
▪ Pandas UDFs are part of Spark SQL,
which is Apache Spark's module for
working with structured data.
▪ Pandas UDFs are a great way of
writing custom data-processing
logic in a developer-friendly
environment.
Summary
What are Pandas UDFs?
▪ Scalar UDFs. One-to-one mapping
function that supports simple return
types (no Map/Struct types)
▪ Grouped Map UDFs. Requires a
groupby operation but can return a
variable number of output rows with
complicated return types.
▪ Grouped Agg UDFs. I recommend you
use Grouped Map UDFs instead.
Types of Pandas UDFs
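The deck shows no code sample here, so as a minimal sketch: the two main UDF types in the Spark 2.4-era PandasUDFType API (which matches the GROUPED_MAP UDFs used later in this talk). The function names times_two and count_rows and the column names are invented for illustration:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, LongType, StringType,
                               StructField, StructType)

# Scalar UDF: one-to-one mapping over columns; simple return types only.
@F.pandas_udf(DoubleType(), F.PandasUDFType.SCALAR)
def times_two(v: pd.Series) -> pd.Series:
    return v * 2.0

# Grouped Map UDF: called once per group after a groupby; can return a
# variable number of rows with a complex (struct) schema.
out_schema = StructType([
    StructField("id", StringType()),
    StructField("n_rows", LongType()),
])

@F.pandas_udf(out_schema, F.PandasUDFType.GROUPED_MAP)
def count_rows(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"id": [pdf["id"].iloc[0]], "n_rows": [len(pdf)]})

# Usage:
#   df.withColumn("doubled", times_two("x"))
#   df.groupby("id").apply(count_rows)
```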
Development tips and tricks
▪ Use an interactive development framework.
▪ At Quantcast we use Jupyter notebooks.
▪ Develop with mock data.
▪ Pandas UDFs call Python functions. Develop against mock data in your local interactive environment to
iterate quickly on ideas while developing code.
▪ Use magic commands (if you are using Jupyter)
▪ Magic commands like %timeit, %time, and %prun allow for easy profiling and performance tuning to squeeze every bit of performance out
of your pandas UDFs.
▪ Use Python's debugging tool
▪ The module is pdb (the Python debugger).
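As a hypothetical illustration of this workflow (my_udf_body and mock_pdf are invented names, not from the deck):

```python
import pandas as pd

# Mock input that mirrors the UDF's input schema.
mock_pdf = pd.DataFrame({
    "id": ["A", "A", "B"],
    "feature_ids": [[0], [0, 1], [100]],
})

# In a Jupyter cell, profile the plain-Python body before wrapping it in a
# pandas UDF:
#   %time   my_udf_body(mock_pdf)    # one-shot wall-clock timing
#   %timeit my_udf_body(mock_pdf)    # averaged micro-benchmark
#   %prun   my_udf_body(mock_pdf)    # per-function profile
#
# After an exception, `%debug` drops into pdb at the failure point, or set
# a breakpoint explicitly with:
#   import pdb; pdb.set_trace()
```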
Modeling at Quantcast
Modeling at Quantcast
▪ Train tens of thousands of models
that refresh daily to weekly
▪ Models are trained on first-party
data from global internet traffic.
▪ We have a lot of data
▪ Around 400 TB of raw logs written per day
▪ Data is cleaned and compressed to
about 4-5 TB/day
▪ Typically train on several days to
months of data for each model.
Scale, scale, and even more scale
Example Problem
Example Problem
▪ We have about 10k models that we want to train.
▪ Each of them covers different geo regions
▪ Some of them are over large regions (e.g., everybody in the US)
▪ Some of them are over specific regions (e.g., everybody in San Francisco)
▪ Some of them are over large regions but exclude specific regions (e.g., everybody in the US except people in San Francisco)
▪ For each model, we want to know how many unique ids (users) were
found in each region over a certain period of time.
A high level overview
Example Problem
▪ Each model will have a set of inclusion regions (e.g., US, San Francisco)
where each id must be in one of these regions to be considered part
of the model.
▪ Each model will have a set of exclusion regions (e.g., San Francisco)
where each id must be in none of these regions to be considered part
of the model.
▪ Each id only needs to satisfy the geo constraints once to be part of
the model (e.g., an id that moves from the US to Canada during the
training timeframe is considered valid for a US model)
More details
Example Problem
With some example tables
Feature Map
  Feature         Feature Id
  US              0 or 100
  San Francisco   1 or 101

Model Data and Result
  Model                  Geo Incl.   Geo Excl.   # unique ids   Ids
  Model-0 (US)           [0, 100]    []          4              A, B, C, D
  Model-1 (US, not SF)   [0, 100]    [1, 101]    2              A, B

Feature Store
  Id   Timestamp   Feature ids       Model ids
  A    ts-1        [0]               [0, 1]
  B    ts-2        [0, 1]            [0, 1]
  C    ts-3        [100, 101]        [0]
  D    ts-4        [0, 1, 2, 3, 4]   [0]
  D    ts-5        [999]             []
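To make the example concrete, the same data could be mocked up as pandas dataframes for local development (a sketch; the variable and column names are mine):

```python
import pandas as pd

# Feature Store: one row per (id, timestamp) observation.
feature_store_pdf = pd.DataFrame({
    "id":          ["A", "B", "C", "D", "D"],
    "timestamp":   ["ts-1", "ts-2", "ts-3", "ts-4", "ts-5"],
    "feature_ids": [[0], [0, 1], [100, 101], [0, 1, 2, 3, 4], [999]],
})

# Model data: inclusion/exclusion feature ids per model.
model_data_pdf = pd.DataFrame({
    "model_id": [0, 1],
    "geo_incl": [[0, 100], [0, 100]],
    "geo_excl": [[], [1, 101]],
})

# Expected (from the tables above): Model-0 -> 4 unique ids (A, B, C, D);
# Model-1 -> 2 (A, B), since C and D hit exclusion features 101 and 1.
```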
Optimization:
Tips and Tricks
Naive approach: Use Spark SQL
▪ Spark has built-in functions to do everything we need
▪ Get all (row, model) pairs using a cross join
▪ Use functions.array_intersect for the inclusion/exclusion logic.
▪ Use groupby and aggregate to get the counts.
▪ Code is really simple (<10 lines of code)
▪ Will test this on a sample of 100k rows.
Naive approach: Use Spark SQL
Source code
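The source-code screenshot did not survive the export; a minimal sketch of the query described above, assuming dataframes named feature_store_df (id, timestamp, feature_ids) and model_data_df (model_id, geo_incl, geo_excl), might look like:

```python
from pyspark.sql import functions as F

# Every (row, model) pair -- this is where the intermediate-row blowup
# comes from.
pairs = feature_store_df.crossJoin(model_data_df)

matched = pairs.where(
    # keep a pair if at least one inclusion feature is present...
    (F.size(F.array_intersect("feature_ids", "geo_incl")) > 0)
    # ...and no exclusion feature is present
    & (F.size(F.array_intersect("feature_ids", "geo_excl")) == 0)
)

result = matched.groupBy("model_id").agg(
    F.countDistinct("id").alias("num_unique_ids"))
```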
Naive approach: Use Spark SQL
▪ Only processes about 25
rows/CPU/second
▪ To see why, look at the graph.
▪ We generate about 700x as many
intermediate rows as input rows to
process this.
▪ This is because every row on average
belongs to several models.
▪ There has to be a better way.
This solution is terrible
[Diagram: 100,000 input rows x 10,067 models expand to 69,697,819 intermediate rows]
Optimization: Use Pandas UDFs for Looping
▪ One reason why Spark is really slow
here is the large number of
intermediate rows.
▪ What if we wrote a simple UDF that
would iterate over all of the rows in
memory instead?
▪ For this example problem, it speeds
things up by ~1.8x
Optimization: Use Pandas UDFs for Looping
▪ Store the model data
(model_data_df) in a pandas
dataframe.
▪ Use a pandas GROUPED_MAP UDF to
process the data for each id.
▪ Figure out which models belong to an
id in a nested for loop
▪ This is faster because we do not
have to generate intermediate rows.
The code in a nutshell
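Again the screenshot is missing; a sketch of what such a GROUPED_MAP UDF could look like, reusing the assumed names from the naive version (model_data_df here is a driver-side pandas dataframe captured in the closure):

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

out_schema = StructType([
    StructField("id", StringType()),
    StructField("model_id", LongType()),
])

@F.pandas_udf(out_schema, F.PandasUDFType.GROUPED_MAP)
def models_for_id(pdf: pd.DataFrame) -> pd.DataFrame:
    # An id counts for a model if ANY one of its rows satisfies the
    # inclusion/exclusion filter -- the nested in-memory loop below
    # replaces the cross join, so no intermediate Spark rows are made.
    matched = set()
    for row_features in pdf["feature_ids"]:
        features = set(row_features)
        for m in model_data_df.itertuples():  # driver-side pandas table
            if (m.model_id not in matched
                    and features & set(m.geo_incl)
                    and not features & set(m.geo_excl)):
                matched.add(m.model_id)
    matched = sorted(matched)
    return pd.DataFrame({"id": [pdf["id"].iloc[0]] * len(matched),
                         "model_id": matched})

result = (feature_store_df.groupby("id").apply(models_for_id)
          .groupBy("model_id").count())
```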
Optimization: Aggregate keys
▪ In model training, there are some
commonly used filters.
▪ Instead of counting the number of
unique models, count the number of
unique filters.
▪ In this data set, there are 473 unique
model filters, far fewer than the
10k models.
▪ ~9.82x faster than the previous
solution.
Most common geo inclusions/exclusions
Idea

  Count   Inclusion   Exclusion
  2035    [US]        []
  409     [GB]        []
  389     [CA]        []
  358     [AU]        []
  274     [DE]        []
Optimization: Aggregate Keys
▪ Create a UDF that iterates over the
unique filters (by filterId) instead of
model ids.
▪ In order to get the model ids, create
a table that contains the mapping
from model ids to filter ids
(filter_model_id_pairs) and use a
broadcast hash join.
The code in a nutshell
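A sketch of that structure, with filters_for_id standing in for a UDF shaped like models_for_id above but looping over the unique filters, and filter_model_id_pairs assumed to be a small (filter_id, model_id) dataframe:

```python
from pyspark.sql import functions as F

# filters_for_id: same shape as models_for_id above, but its inner loop
# runs over the ~473 unique filters instead of ~10k models, returning
# (id, filter_id) pairs.
filter_counts = (feature_store_df.groupby("id").apply(filters_for_id)
                 .groupBy("filter_id").count())

# A broadcast hash join against the small (filter_id, model_id) mapping
# table fans the per-filter counts back out to the ~10k models.
model_counts = (filter_counts
                .join(F.broadcast(filter_model_id_pairs), on="filter_id")
                .select("model_id", "count"))
```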
Optimization: Aggregate Keys in Batches
▪ What if we grouped things by
something bigger than an id?
▪ Generate fewer intermediate rows.
▪ Take advantage of vectorization in
Python.
▪ We can rewrite the UDF to take in
batches of ~10k ids per call.
▪ ~2.9x faster than the previous
solution.
▪ ~51.3x faster than the naive one.
Idea
Optimization: Aggregate Keys in Batches
▪ Group things into batches based on
the hash of the id.
▪ Have the UDF group each batch by id
and count the number of ids that
satisfy each filter, returning a partial
count for each filter id.
▪ The group by operation becomes a
sum instead of a count because we
do partial counts in the batches.
The code in a nutshell
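Roughly, under the same assumptions (unique_filters is a driver-side list of (filter_id, inclusion_set, exclusion_set) tuples; NUM_BATCHES is a tuning knob):

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StructField, StructType

NUM_BATCHES = 100  # assumption: sized so each batch holds ~10k ids

out_schema = StructType([
    StructField("filter_id", LongType()),
    StructField("partial_count", LongType()),
])

@F.pandas_udf(out_schema, F.PandasUDFType.GROUPED_MAP)
def count_filters_in_batch(pdf: pd.DataFrame) -> pd.DataFrame:
    counts = {}
    # Group the batch by id; an id counts toward a filter if any one of
    # its rows satisfies that filter -- all done in memory.
    for _, id_rows in pdf.groupby("id"):
        satisfied = set()
        for row_features in id_rows["feature_ids"]:
            features = set(row_features)
            for filter_id, incl, excl in unique_filters:
                if (filter_id not in satisfied
                        and features & incl and not features & excl):
                    satisfied.add(filter_id)
        for filter_id in satisfied:
            counts[filter_id] = counts.get(filter_id, 0) + 1
    return pd.DataFrame({"filter_id": list(counts),
                         "partial_count": list(counts.values())})

batched = feature_store_df.withColumn(
    "batch", F.abs(F.hash("id")) % NUM_BATCHES)
filter_counts = (batched.groupby("batch").apply(count_filters_in_batch)
                 .groupBy("filter_id")
                 .agg(F.sum("partial_count").alias("count")))
```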
Optimization: Inverted Indexes
▪ Each feature store row has relatively
few unique features.
▪ Feature store rows have 10-20
features/row.
▪ There are ~500 unique filters.
▪ Use an inverted index to iterate over
the unique features instead of filters.
▪ Use set operations for
inclusion/exclusion logic
▪ ~6.44x faster than previous solution.
Idea
Optimization: Inverted Indexes
▪ Create maps from each feature id to
the filters that include/exclude it.
▪ Use those maps to get the set of
inclusion/exclusion filters each row
belongs to.
▪ Use set operations to perform the
inclusion/exclusion logic.
▪ Have each UDF call process batches
of ids.
The code in a nutshell
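A sketch of the inverted-index idea (again with assumed names; unique_filters is as in the previous sketch):

```python
from collections import defaultdict

# Inverted indexes, built once: feature id -> the set of filter ids whose
# inclusion (or exclusion) list contains that feature.
incl_index, excl_index = defaultdict(set), defaultdict(set)
for filter_id, incl_features, excl_features in unique_filters:
    for f in incl_features:
        incl_index[f].add(filter_id)
    for f in excl_features:
        excl_index[f].add(filter_id)

def filters_satisfied(features):
    # Iterate over a row's 10-20 features instead of ~500 filters.
    included, excluded = set(), set()
    for f in features:
        included |= incl_index.get(f, set())
        excluded |= excl_index.get(f, set())
    # Set difference implements the inclusion/exclusion logic in one step.
    return included - excluded
```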
Optimization: Use python libraries
▪ Pandas is optimized for ease of use,
not speed.
▪ Use Python libraries (itertools) to
make Python code run faster.
▪ functools.reduce and numpy are also
good candidates to consider for
other UDFs.
▪ ~2.6x faster than previous solution.
▪ ~860x faster than naive solution!
Idea
Optimization: Use python libraries
▪ Use .values to extract the columns
from a pandas dataframe.
▪ Use itertools to iterate through
loops faster than default for loops.
▪ itertools.groupby is used to group
the sorted data.
▪ itertools.chain.from_iterable is used
to iterate through a nested for loop.
The code in a nutshell
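A sketch of how those pieces could fit together inside the UDF body (filters_satisfied is from the previous sketch; pdf is the pandas batch handed to the UDF):

```python
from itertools import chain, groupby
from operator import itemgetter

# .values pulls raw numpy arrays out of the dataframe, avoiding slower
# pandas row access; groupby only groups *consecutive* equal keys, so
# sort by id first.
rows = sorted(zip(pdf["id"].values, pdf["feature_ids"].values),
              key=itemgetter(0))

counts = {}
for an_id, id_rows in groupby(rows, key=itemgetter(0)):
    # chain.from_iterable flattens the nested "for row, for filter" loop:
    # the union of filters satisfied by ANY of this id's rows (an id only
    # needs to satisfy the constraints once).
    satisfied = set(chain.from_iterable(
        filters_satisfied(feats) for _, feats in id_rows))
    for filter_id in satisfied:
        counts[filter_id] = counts.get(filter_id, 0) + 1
```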
Optimization: Summary
▪ Pandas UDFs are extremely flexible
and can be used to speed up Spark
SQL.
▪ We discussed a problem where we
could apply optimization tricks for
almost 1000x speedup.
▪ Apply these tricks to your own
problems and watch things
accelerate.
Key takeaways
Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Thank you
