Bootstrapping of (Py)Spark models for factorial A/B tests
Ondrej Havlicek
Data Scientist
Ondřej Havlíček
• Senior data scientist
• Background
• Computer science, psychology, neuroscience
• Focus
• Inferential statistics, machine learning, ETL
• Spark, Python, R
• A/B testing, recommendation, search, ...
• e-Commerce, social media, ...
Making data science and machine learning have a real impact on organizations.
We are
• DataSentics PX: Personalization for Banking and Insurance
• DS Innovate: AI/ML driven innovation & startups
• DS TechScale: Platforms for AI-intensive applications
• DS InRetail: Improving the customer experience in Retail/FMCG
Partners & Awards:
• Gold partner & Partner of the Year 2020
• Professional partner
• 4th fastest growing in CE
• Rising stars award
Selected Customers:
Data science
Machine learning
specialists
Data engineering
Cloud
specialists
10+
product
owners
50+ 30+
• Optimize and automate the thousands/millions of small decisions you make every day
• Analyse positioning, out-of-stock, pricing and more from a photo
• AI choice assistant for e-commerce
• AI extension for your adform
Agenda
1. Factorial A/B testing
2. Analysis of results
3. Bootstrapping
4. Performance tuning
A/B testing
• What
• A: Control version
• B: Experimental version
• Why
• The only way to improve KPIs consistently
• Evidence > HiPPO (highest paid person's opinion)
• Most tested ideas actually turn out to be incorrect
• How
• Usually as isolated tests, run in parallel or one after another
Wikipedia: a user experience research methodology ... consist of a randomized
experiment with two variants, A and B. It includes application of statistical hypothesis
testing ... and determining which of the two variants is more effective.
Why factorial A/B testing?
• Isolated tests are limiting
• Only a few concurrent experiments, or very long durations
• Solution: factorial design
• Cross multiple tests orthogonally
• Each visitor is assigned to a variant in every test
• Allows running dozens of simultaneous tests
• Each test runs on all traffic
• Faster results
https://hbr.org/2017/09/the-surprising-power-of-online-experiments
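The idea of assigning each visitor to a variant in every test can be sketched as follows. This is a minimal illustration, not the assignment scheme used in the talk: the function name `assign_variants`, the hash-based 50/50 split and the test names are all assumptions made for the example.

```python
import hashlib


def assign_variants(visitor_id, tests):
    """Assign a visitor to variant A or B in every test (hypothetical helper)."""
    assignment = {}
    for test in tests:
        # Salting the hash with the test name keeps the assignments in
        # different tests approximately independent of each other, so the
        # tests are crossed rather than nested.
        key = f"{test}:{visitor_id}".encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()
        assignment[test] = "B" if int(digest, 16) % 2 else "A"
    return assignment


tests = ["T1", "T2", "T3"]
print(assign_variants("visitor-42", tests))
```

Because the assignment is a deterministic function of the visitor ID and the test name, a returning visitor lands in the same cell of the design on every visit, and every test uses all of the traffic.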
Analysis of results
• What you often get
• Version B has a statistically significant effect on the conversion rate (CR), p = 0.04
• What we ideally want
• Version B increases CR with 92.5% probability
• most likely by 1.8%, 95% CI: [-0.3; 3.9]
Results of Test 1
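Statements of the "ideally want" kind can be read off a set of resampled effect estimates. The sketch below assumes such a list is already available (the deck obtains it by bootstrapping); `summarize_effect`, the use of the median as the "most likely" value and the simple percentile interval are assumptions for illustration, and the numbers are simulated rather than taken from the talk.

```python
import random
import statistics


def summarize_effect(estimates, ci_level=0.95):
    """Summarize resampled effect estimates (hypothetical helper)."""
    ordered = sorted(estimates)
    n = len(ordered)
    tail = (1.0 - ci_level) / 2.0
    # Simple percentile interval: cut off `tail` of the estimates on each side.
    lower = ordered[int(tail * (n - 1))]
    upper = ordered[int(round((1.0 - tail) * (n - 1)))]
    return {
        "p_positive": sum(e > 0 for e in ordered) / n,  # share of estimates > 0
        "median": statistics.median(ordered),
        "ci": (lower, upper),
    }


# Simulated estimates, used only so that the sketch runs end to end.
rng = random.Random(0)
estimates = [rng.gauss(1.8, 1.1) for _ in range(2000)]
print(summarize_effect(estimates))
```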
Analysis of results
• How: effect size
• Big data: Spark GLM, e.g.:
• is_conversion ~ T1 + T2 + T1 * T2
• family = "binomial"
• link = "logit"
• How: uncertainty
• Standard errors are generally not provided by Spark GLMs
• Bootstrapping
• A way to estimate the distribution of a statistic
• “Poor man’s Bayes” with a noninformative prior
Results of Test 1
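The model formula on this slide can be written out explicitly. The following is a sketch under the assumption of dummy coding, with $T_1$ and $T_2$ equal to 1 for variant B and 0 for variant A; the coefficient names are illustrative and do not come from the slides.

```latex
\operatorname{logit} \Pr(\text{is\_conversion} = 1)
  = \beta_0 + \beta_1 T_1 + \beta_2 T_2 + \beta_{12} T_1 T_2,
\qquad
\operatorname{logit}(p) = \ln \frac{p}{1 - p}
```

Under that coding, $\exp(\beta_1)$ is the odds ratio of variant B versus A in Test 1 for visitors who see variant A in Test 2, and $\beta_{12}$ measures how much that log-odds effect changes for visitors who see variant B in Test 2.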
Bootstrapping
• Iterate many times (hundreds or more):
• Randomly resample data with replacement
• Compute statistics of interest: GLM coefficients
df_resample = df.sample(withReplacement=True, fraction=1.0)
fitted_model = model.fit(df_resample)
stats = extract_stats(fitted_model)
• How in Spark?
• Bootstrapping is embarrassingly parallel
• But Spark parallelizes only the tasks of a single model fit, i.e. within one iteration
• How to scale?
• Need to run many model fits in parallel
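The resampling loop itself can be illustrated without Spark. This sketch swaps the GLM coefficients for a much simpler statistic (the difference in conversion rate between B and A) and uses toy data; the names `uplift` and `bootstrap` and all numbers are assumptions made for the example.

```python
import random


def uplift(rows):
    """Difference in conversion rate between variant B and variant A."""
    conversions = {"A": 0, "B": 0}
    visitors = {"A": 0, "B": 0}
    for variant, converted in rows:
        visitors[variant] += 1
        conversions[variant] += converted
    return conversions["B"] / visitors["B"] - conversions["A"] / visitors["A"]


def bootstrap(rows, statistic, n_iterations=500, seed=0):
    """Recompute `statistic` on n_iterations resamples drawn with replacement."""
    rng = random.Random(seed)
    # Each resample has the same size as the data, like fraction=1.0 in Spark.
    return [statistic(rng.choices(rows, k=len(rows))) for _ in range(n_iterations)]


# Hypothetical toy data: (variant, converted) with 1 = converted.
rows = (
    [("A", 1)] * 100 + [("A", 0)] * 900
    + [("B", 1)] * 120 + [("B", 0)] * 880
)
estimates = bootstrap(rows, uplift)
print(len(estimates), min(estimates), max(estimates))
```

The iterations do not depend on each other, which is what makes the procedure embarrassingly parallel.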
Bootstrapping of GLM in Spark in a parallel fashion
• Multithreading
• Group the bootstrap iterations into batches:
• Each batch runs its iterations sequentially
• Each iteration performs a Spark action
• Stages have fewer tasks than there are cores, so a single iteration cannot keep the cluster busy
Figure: two workers with four cores each (Worker 1 and Worker 2, Cores 1 to 4), running four batches of sequential iterations
Batch 1: Iterations 1, 5, 9, ...
Batch 2: Iterations 2, 6, 10, ...
Batch 3: Iterations 3, 7, 11, ...
Batch 4: Iterations 4, 8, 12, ...
• Submit the batches in parallel using multithreading
• Tasks get scheduled to the executors in FIFO / FAIR fashion
Figure: Iteration 1, Stage 1 is split into Tasks 1 to 4, which run on Core 1 and Core 2
Bootstrapping of GLM in Spark
• Multithreading
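import concurrent.futures
import math

ret_vals = []
# Note: flooring drops the remainder when n_iterations is not a multiple of n_threads
batch_size = math.floor(n_iterations / n_threads)
batches = [{'batchnum': i + 1, 'reps': batch_size} for i in range(n_threads)]
with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
    # Submit one batch of sequential bootstrap iterations per thread
    future_run = {
        executor.submit(run_batch, df, model, batch['reps']): batch for batch in batches
    }
    # Collect the results as the batches finish, in completion order
    for future in concurrent.futures.as_completed(future_run):
        try:
            batch_result = future.result()
            ret_vals.append(batch_result)
        ...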
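The batching pattern can be tried out without a cluster. In this sketch `run_batch` is a stand-in that resamples a plain Python list instead of fitting a Spark model, and `split_into_batches` and `run_bootstrap` are hypothetical helpers; unlike the floor-based batch size, the split here hands the remainder iterations to the first batches so that none are lost.

```python
import concurrent.futures
import random


def split_into_batches(n_iterations, n_threads):
    """Batch sizes that sum to n_iterations, spread over n_threads batches."""
    base, remainder = divmod(n_iterations, n_threads)
    return [base + (1 if i < remainder else 0) for i in range(n_threads)]


def run_batch(data, reps, seed):
    """Stand-in for a batch of sequential model fits: one statistic per rep."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        resample = rng.choices(data, k=len(data))
        results.append(sum(resample) / len(resample))
    return results


def run_bootstrap(data, n_iterations, n_threads):
    batch_sizes = split_into_batches(n_iterations, n_threads)
    stats = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [
            executor.submit(run_batch, data, reps, seed)
            for seed, reps in enumerate(batch_sizes)
        ]
        for future in concurrent.futures.as_completed(futures):
            # result() re-raises any exception raised inside the batch.
            stats.extend(future.result())
    return stats


data = [1] * 10 + [0] * 90
stats = run_bootstrap(data, n_iterations=103, n_threads=4)
print(len(stats))  # 103
```

In pure Python the threads bring no speed-up for this CPU-bound stand-in. The pattern pays off in the Spark setting because each thread only submits jobs from the driver and then waits, while the actual work runs on the executors.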
Performance: don’t waste resources
• How many parallel batches (threads)?
• n_threads = n_cores / n_tasks * n_tasks_per_core
• n_tasks: number of tasks (partitions) per stage; repartition the data to ~100–200 MB per partition
• n_tasks_per_core: an empirical question, roughly 2–4
• Check cluster utilization in the Ganglia UI
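The sizing rule can be worked through with concrete numbers. The helper names, the 150 MB partition target, the 600 MB data size and the rounding choices below are assumptions for illustration; only the formula itself comes from the slide.

```python
import math


def suggest_n_tasks(data_size_mb, target_partition_mb=150):
    """Number of partitions so that each holds roughly target_partition_mb."""
    return max(1, math.ceil(data_size_mb / target_partition_mb))


def suggest_n_threads(n_cores, n_tasks, n_tasks_per_core=2):
    """n_cores / n_tasks * n_tasks_per_core, rounded down, but at least 1."""
    return max(1, math.floor(n_cores / n_tasks * n_tasks_per_core))


n_tasks = suggest_n_tasks(data_size_mb=600)  # 4 partitions of ~150 MB each
n_threads = suggest_n_threads(n_cores=8, n_tasks=n_tasks, n_tasks_per_core=2)
print(n_tasks, n_threads)  # 4 4
```

With 8 cores, 4 tasks per stage and 2 tasks per core, four batches run side by side and each batch's stage occupies two cores. The value of n_tasks_per_core is a starting point to be checked against observed cluster utilization.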
Performance test
Lessons learned
• Spark is better suited to ML than to inferential statistics
• Bootstrapping helps
• You can do parallelization^2 in Spark
• Business users understand & like the outputs
• The core of factorial A/B testing is simple
• Many interesting challenges in reality :)
• Overlaps, interactions, funnels, outliers, zero-inflated metrics, variance reduction, ...
Thank you!
Want to know more? Drop me a line
ondrej.havlicek@datasentics.com

