November 19th, 2020 | Data+AI Summit | Michael Winer & Daniel Hen
Offer Wall Revenue Uplift
with Spark, XGBoost and Statistics
About us
Daniel Hen, Data Scientist
Michael Winer, Data Science & BI lead
Agenda
01 Fyber Overview
02 Business Use Case
03 Solution Exploration
04 Solution
05 A/B Testing
06 Main Insights
Fyber Overview
This is Fyber
Offices: San Francisco, New York, London, Berlin, Tel Aviv, Seoul, Beijing
● We’re builders: 40% of 300+ employees focused on technology and product
● We’re app people: building solutions that app developers love
● We’re publicly traded: FRA: FBEN
● We’re global: 7 offices
How big is our Big Data?
● 25B auctions per day
● 200M DAU
● 800B bid requests per day
● 15K+ apps
● 300TB generated monthly
● 300 user-level dimensions
● 80+ reported dimensions (on real-time reporting)
● 60+ reported metrics
Why are we here
● October 2019: Marshal 100%
● Marshal increased the Offer Wall revenues by 11%
Fyber Overview | Business Use Case | Solution Exploration | Solution | A/B Testing | Conclusions + Summary
Business Use Case
Offer Wall
● Integrated into a user’s application
● Contains offers that users can complete in order to progress within a game
● Gives the user an option to win an “in-app” claim / reward
Motivation
● Increase our user engagement
● Maximize revenues for our clients (publishers)
Challenges
● Our data is too big for ordinary frameworks (~hundreds of millions of events)
● Delayed Feedback Conversions
○ Conversions with a long delay are a challenge for models, yet they can carry a high monetary value
Nature Of Delayed Conversions
User A
Click: January 1st, 2020
Conversion: January 3rd, 2020
Conversion Value: $2
(had 2 days of delay)
User B
Click: January 1st, 2020
Conversion: January 10th, 2020
Conversion Value: $3.5
(had 9 days of delay)
User C
Click: January 1st, 2020
Conversion: February 1st, 2020
Conversion Value: $40
(had 31 days of delay)
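The three users above can be turned into a small worked example of why late conversions matter. This is a minimal sketch: the function names and the 7-day and 14-day observation windows are illustrative assumptions, not values from the talk.

```python
from datetime import date

def conversion_delay_days(click: date, conversion: date) -> int:
    """Number of whole days between a click and its conversion."""
    return (conversion - click).days

# (click date, conversion date, conversion value in USD) for the three example users
users = {
    "A": (date(2020, 1, 1), date(2020, 1, 3), 2.0),
    "B": (date(2020, 1, 1), date(2020, 1, 10), 3.5),
    "C": (date(2020, 1, 1), date(2020, 2, 1), 40.0),
}

def observed_value(window_days: int) -> float:
    """Revenue visible if we only wait `window_days` after the click (hypothetical window)."""
    return sum(
        value
        for click, conversion, value in users.values()
        if conversion_delay_days(click, conversion) <= window_days
    )

delays = {name: conversion_delay_days(c, v) for name, (c, v, _) in users.items()}
total = sum(value for _, _, value in users.values())

print(delays)
print(observed_value(7), "of", total, "USD visible after 7 days")
print(observed_value(14), "of", total, "USD visible after 14 days")
```

In this toy data most of the value sits in the slowest conversion, so a model that only sees the first week of feedback would miss nearly all of it.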
Multi-Armed Bandit - Vanilla Setting
Multi-Armed Bandit - Delayed Feedback Setting
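The difference between the two settings can be sketched with a toy simulation. The talk's own figures are not reproduced here; epsilon-greedy is used only because it is short, and the function name, the reward rates and the delay values are all assumptions made for illustration.

```python
import random
from collections import deque

def run_bandit(true_rates, delay, steps, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit whose rewards become visible `delay` steps after the pull."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms       # pulls whose reward has already been observed
    rewards = [0.0] * n_arms   # sum of observed rewards per arm
    pending = deque()          # (step at which the reward becomes visible, arm, reward)
    chosen = [0] * n_arms      # how often each arm was actually played

    for t in range(steps):
        # Reveal every reward whose delay has elapsed
        while pending and pending[0][0] <= t:
            _, arm, reward = pending.popleft()
            pulls[arm] += 1
            rewards[arm] += reward

        # Explore with probability epsilon (or while nothing has been observed yet),
        # otherwise exploit the arm with the best observed mean reward
        if rng.random() < epsilon or sum(pulls) == 0:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: rewards[a] / pulls[a] if pulls[a] else 0.0)

        chosen[arm] += 1
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pending.append((t + delay, arm, reward))

    return chosen

# Hypothetical conversion rates for three offers
vanilla = run_bandit([0.05, 0.10, 0.20], delay=0, steps=2000)
delayed = run_bandit([0.05, 0.10, 0.20], delay=500, steps=2000)
print("vanilla:", vanilla)
print("delayed:", delayed)
```

With `delay=0` feedback is available on the next step; with a large delay the bandit keeps choosing arms without knowing how earlier pulls turned out, which is the situation the delayed feedback setting describes.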
Solution Exploration
Existing Environment
● Delayed Feedback
● Big Data - mostly Tabular
● Time Series (event based)
Where we looked
● Literature Review (arXiv, Papers with Code)
● Kaggle
What we found
This paper formulates two main aspects of feedback in Display Advertising.
Instead of directly calculating:
● P(Conversion | Impression)
We can calculate:
● P(click) = P(Click | Impression)
● P(conversion) = P(Conversion | Click)
● P(Conversion | Impression) = P(click) * P(conversion)
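The factorization above amounts to multiplying the outputs of two separate models. A minimal sketch, with a hypothetical function name and made-up probabilities standing in for the models' predictions:

```python
def p_conversion_given_impression(p_click: float, p_conversion_given_click: float) -> float:
    """P(Conversion | Impression) = P(Click | Impression) * P(Conversion | Click)."""
    for p in (p_click, p_conversion_given_click):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_click * p_conversion_given_click

# Illustrative numbers only: a 4% click probability and a 25% conversion probability after a click
p = p_conversion_given_impression(p_click=0.04, p_conversion_given_click=0.25)
print(f"P(Conversion | Impression) = {p:.4f}")
```

Splitting the problem this way lets each factor be estimated with the tool that suits it, for example a classifier for clicks and a delay-aware estimate for conversions.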
Solution Principles
● Don’t try to predict all at once - Use different tools for different problems
● We need a framework that is able to deal with our needs:
○ Big Data Aggregation
○ ML Modeling
○ Testing and Visualization
○ Debugging and Troubleshooting
The Chosen Solution
Architecture: High-Level Overview
CTR Prediction Model
● Spark Support
● Handling Missing Data
● Great performance with (Big) Tabular Data
XGBoost with Spark in Databricks #1
● XGBoost4J is a project that is constantly being updated and stabilized
○ The latest stable release at the time of the talk: Sept. 2020
● We use it to perform distributed training on our big data
● Can be added directly from the Maven repository
● Integrates easily with the Spark ML framework (MLlib)
● Databricks makes it easy to use, which was one of the main reasons for choosing it
XGBoost with Spark in Databricks #2
Databricks XGBoost4J Documentation
● Relevant Imports
● Data preprocessing
● Vector Assembler
XGBoost with Spark in Databricks #3
● XGBoost4J “X, Y” definition
● XGBoost4J Model Instantiation (with Map)
● Model Train (Distributed Training)
● Model Transform
XGBoost - Handling Missing Data
● XGBoost knows how to handle missing values within a dataset
● In tree-based algorithms, branch directions for missing values are learned during training
● You can tell XGBoost to treat a specific value (e.g. -999) as a missing value by setting the Missing Value Flag
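The idea of a learned branch direction can be illustrated with a toy example. This is not XGBoost's implementation (which scores splits with gradient statistics); it is a simplified sketch that uses squared error on a single fixed split, with hypothetical names and data, to show how the direction for missing rows is picked by trying both sides.

```python
MISSING = -999.0  # sentinel treated as "missing", mirroring the -999 example above

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def learn_default_direction(xs, ys, threshold):
    """For a fixed split `x < threshold`, choose where rows with missing x should go."""
    left = [y for x, y in zip(xs, ys) if x != MISSING and x < threshold]
    right = [y for x, y in zip(xs, ys) if x != MISSING and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x == MISSING]

    # Try both directions and keep the one with the lower loss
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"

# Toy data: rows with a missing feature have labels that look like the right branch
xs = [1.0, 2.0, MISSING, 8.0, 9.0, MISSING]
ys = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(learn_default_direction(xs, ys, threshold=5.0))
```

Because the direction is chosen from the data, missing values need no imputation step before training; only the sentinel has to be declared.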
A word about MLeap
● One of our technical challenges was how to save the pipelines / models that were trained in Spark (Databricks)
● We looked for a solution that provides model export / import for both online & offline prediction modes
● MLeap provides all of the above
● Databricks has great documentation about it, which made this even easier
● We also wrote a short blog post on how to create synergy between Spark and MLeap
Conversion Prediction Model #1
● Some conversions will arrive with a delay (e.g. a 14-day delay)
● By predicting the number of conversions before they have all arrived, we make our model faster and better
● For this purpose we treat this flow as a Poisson process
● A Poisson process is mostly used when we count the occurrences of events that happen at a certain rate, but at random
[Figure: event counts 0, 1, 2, ..., K]
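For a Poisson process, the number of events counted in a fixed interval follows the Poisson distribution. A minimal sketch of that probability mass function; the rate of 3 conversions per day is a made-up value for illustration.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    """P(K = k) for a Poisson-distributed count with mean `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

# Hypothetical rate: on average 3 conversions per day for some offer
rate = 3.0
for k in range(6):
    print(k, round(poisson_pmf(k, rate), 4))
```

The same rate parameter also determines how long one typically waits between events, which is where the exponential distribution comes in.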
Conversion Prediction Model #2
● The time until a single event within a Poisson process can be modeled using the Exponential Distribution
● Probability estimation using the Exponential Distribution is straightforward to calculate:
1 / (1 + e^(-x*λ))
● λ = 1 / (Avg. time to convert from click)
● x = Elapsed time from click
● Using only these 2 parameters, we can calculate a probability for each user’s click to become a conversion
[Figure: Exponential Distribution probability density for several values of λ (legend values: 0.5, 0.1, 1.5)]
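A minimal sketch of how the two parameters could be used. It assumes the textbook cumulative form of the exponential distribution, 1 - e^(-λx), which is not necessarily the exact expression the authors deployed (the expression printed on the slide has a logistic shape). The function names, the scaling step and the numbers are illustrative assumptions.

```python
import math

def arrival_probability(elapsed_days: float, avg_days_to_convert: float) -> float:
    """Exponential CDF: probability that a conversion's delay is <= elapsed_days."""
    lam = 1.0 / avg_days_to_convert  # λ = 1 / (avg. time to convert from click)
    return 1.0 - math.exp(-lam * elapsed_days)

def estimate_final_conversions(observed: int, elapsed_days: float,
                               avg_days_to_convert: float) -> float:
    """Scale conversions seen so far by the share expected to have arrived (assumed approach)."""
    share = arrival_probability(elapsed_days, avg_days_to_convert)
    if share <= 0.0:
        raise ValueError("no time has elapsed, nothing can be extrapolated")
    return observed / share

# Hypothetical offer: conversions take 5 days on average, and 5 days have passed since the clicks
share = arrival_probability(elapsed_days=5.0, avg_days_to_convert=5.0)
print(f"share of conversions expected to have arrived: {share:.3f}")
print(f"estimated final conversions: {estimate_final_conversions(10, 5.0, 5.0):.1f}")
```

Under these assumptions roughly 63% of conversions have arrived after one average delay, so 10 observed conversions would point to about 16 in total.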
Airflow & Databricks Scheduling
● Airflow is a platform to programmatically author, schedule and monitor workflows
● Our data pipeline is complex, with several dependencies that affect one another
● Databricks Airflow Operator to the rescue!
● Databricks has great documentation about it
A/B Testing
A/B Testing #1
Our Best Practices
● Decide on one dominant KPI, and 2-3 supporting ones
● Build proper analysis tools for analyzing the tests
● Run an A/B test with a (small) portion of traffic
● Analyze results using Databricks Dashboards & Scheduling capabilities
A/B Testing #2
From Events to A/B testing with Databricks
● Read raw events using Spark
● Aggregate raw data to results, and save periodically using the Databricks jobs scheduler
● Use SQL, built-in widgets and visual libraries (e.g. Bokeh) to build a dashboard
● Again, use Databricks Jobs to run the report every couple of hours and share the link with colleagues
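The aggregation step above can be sketched in a few lines. In the talk this runs on Spark; plain Python is used here only to show the shape of the roll-up from raw events to per-variant results. The event layout, the variant names and the KPI shown are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events: (variant, event_type, revenue)
events = [
    ("A", "impression", 0.0), ("A", "click", 0.0), ("A", "conversion", 2.0),
    ("B", "impression", 0.0), ("B", "impression", 0.0), ("B", "click", 0.0),
    ("B", "conversion", 3.5),
]

def aggregate(events):
    """Roll raw events up to per-variant event counts and revenue."""
    totals = defaultdict(lambda: {"impression": 0, "click": 0, "conversion": 0, "revenue": 0.0})
    for variant, event_type, revenue in events:
        totals[variant][event_type] += 1
        totals[variant]["revenue"] += revenue
    return dict(totals)

def ctr(row):
    """Click-through rate; 0.0 when the variant has no impressions."""
    return row["click"] / row["impression"] if row["impression"] else 0.0

results = aggregate(events)
for variant in sorted(results):
    row = results[variant]
    print(variant, "CTR:", ctr(row), "revenue:", row["revenue"])
```

Saving a table like `results` on a schedule is what lets a dashboard compare variants without rescanning the raw events each time.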
A/B Testing #3
From Events to A/B testing with Databricks
● Notebook
● Scheduling
● Notifications
A/B Testing - Summary Statistics
Variant  Main_KPI  KPI_2   KPI_3   KPI_4  KPI_5
C        1.001     29.839  8.289   0.673  0.047
B        1.02      31.585  10.285  0.606  0.061
A        0.975     32.261  25.819  0      0.14
Model Analysis
[Figures: CTR predictions distribution for Model A, Model B and Model C]
Main Insights
● Exploratory Data Analysis is crucial
● There’s a high chance that the first experiment will go wrong. It’s OK, keep on
● Late conversions = Late results
● Work is not done once deployment is done
● Post-deployment tools are crucial, especially if other teams are supporting your models
Post-Deployment Tools Using Databricks
Summary
■ Fyber Overview
■ Offer Wall Overview
■ Our Use-Case Motivation
■ Our solution - how we explored it, what we wanted to achieve
■ A/B testing in a nutshell
■ Main Insights
Feel free to reach out!
Daniel Hen, Data Scientist
Michael Winer, Data Science & BI lead
Q&A
THANK YOU!
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed Feedback Environment
