Setting up an A/B-testing
framework
Agnes van Belle
04-05-2021
Contents
1. Intro
2. Deciding what to measure
3. Data transformation
4. Testing: the t-test versus alternatives
5. When to stop the experiment, or: minimum sample size calculation
6. User adoption
7. Summary
Intro
Intro to A/B testing
1. Assign users to variants
2. Collect user data
3. Analyse data, yield conclusions
Why A/B testing
Optimize task for "human satisfaction" (clicks, purchases)
● Recommender systems, personalized ranking
● Any UI change
● Online advertising
○ selection effect vs. advertising effect1
Related
● Labeled sample size estimation and testing
○ e.g. image classification, named entity recognition
○ user surveys
1. A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook / Gordon, Zettelmeyer, Bhargava, Chapsky / 2019
About me
● Studied Artificial Intelligence at University of Amsterdam
● Most experience in recommendation, search/ranking and NLP
● Currently work at HeyJobs (Berlin)
● Formerly at OLX Group (Berlin)
○ Developed their internal A/B testing framework
○ Was working in Search & Recommendation team
● Formerly Search R&D lead at Textkernel; Data Scientist at Xomnia (Amsterdam)
A/B testing @ OLX
OLX
● global online marketplace
● selling & buying
● thousands of users every hour
Example test topics
● new ranking model
● new recommendation model
● UI changes
A/B testing setup
Between-subjects design
● Variant A / baseline
○ Current platform
○ “Control group” users
● Variant B
○ Platform with specific feature
○ “Treatment group” users
● ...
A/B testing setup
Assigning users to variants - requirements
● Consistency
○ users that see a specific variant of the site
won't be presented with a different variant of
the site during the experiment
● Countability
○ need to keep track of the number of users in
each variant for statistical purposes
A/B testing: user assignment
Assigning users to variants
● Few users: store assignment per user
○ control over the assignment
○ easy to add cohorts
● Many users: store assignment logic per experiment per variant
● Example:
○ have finite list of codes per experiment per variant
○ example codes: all three-character combinations of hexadecimal chars (4096)
○ assign a user to a variant if the first three hexadecimal chars of their user ID match one of that variant's codes
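
A minimal sketch of this scheme in Python. Salting the hash with the experiment name (so assignments are decorrelated across experiments) is an addition here, as the deck matches on the raw user ID; all names are illustrative.

import hashlib

# Each variant owns a disjoint subset of the 4096 three-character
# hexadecimal codes; a user falls into a variant when the first three
# hex characters of their (hashed) user ID are among that variant's codes.
ALL_CODES = [format(i, "03x") for i in range(16 ** 3)]   # '000' ... 'fff'

def codes_for_variant(variant: int, num_variants: int) -> set:
    # Deal the codes out round-robin so every code belongs to exactly one variant
    return set(ALL_CODES[variant::num_variants])

def assign(user_id: str, experiment: str, num_variants: int) -> int:
    # Hashing with the experiment name as salt is an assumption here;
    # the deck matches on the raw user ID instead
    prefix = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()[:3]
    return next(v for v in range(num_variants)
                if prefix in codes_for_variant(v, num_variants))

print(assign("user-42", "new-ranking-model", 2))  # stable 0 or 1 per user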
What to measure
What to measure: conversion rate
All our metrics are variations of the so-called “conversion rate”.
● base value / visit: page visit to the site necessary for conversion (denominator)
● conversion: action, like purchase or download (numerator)
● conversion rate: conversion, divided by the base value.
Example: we want to measure the number of responses to seen ads
● base value: “number of visits”
● conversion: “number of purchases after visit”
● conversion rate: “number of purchases after visit” / “number of visits”
source: https://www.werbe-agentur-graz.at/
What to measure: conversion rate
Note: conversion rate is a final single number
● not calculated per user
● conversion rate = (∑_users numerator_user) / (∑_users denominator_user)
Business goals: total success, not success per user.
What to measure: conversion rate
Variant A          User 1   User 2   User 3   User 4   User 5
  Conversions         0        0        1        2        0
  Base values         0        3        1        2        0

Variant B          User 6   User 7   User 8   User 9   User 10
  Conversions         3        1        1        0        0
  Base values         3        1        1       10        0

Conversion rate = (∑_users numerator_user) / (∑_users denominator_user)
● Conversion rate: (1+2)/(3+1+2) = 0.5 vs. (3+1+1)/(3+1+1+10) = 1/3 (A wins)
● Not the conversion rate (a sum of per-user ratios): (0/3)+(1/1)+(2/2) = 2 vs. (3/3)+(1/1)+(1/1)+(0/10) = 3 (B wins)
How to apply tests to conversion rate?
● for t-test, variance is needed
● for MWU-test, a list of rankable samples
→ covered later
What to measure: display uplift
Final metric to display: uplift
Uplift = 100 * (conversion_rate_variant - conversion_rate_control) / conversion_rate_control

(Chart: uplift of Variant B and Variant C relative to control)
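
As a sketch, computing the conversion rates and uplift for the example table above; the pandas usage here is illustrative, not the framework's actual code.

import pandas as pd

# Conversion rate as a ratio of sums (not a mean of per-user ratios),
# using the example table, followed by the uplift formula above.
df = pd.DataFrame({
    "variant":     ["A"] * 5 + ["B"] * 5,
    "conversions": [0, 0, 1, 2, 0, 3, 1, 1, 0, 0],
    "base_values": [0, 3, 1, 2, 0, 3, 1, 1, 10, 0],
})
df = df[df["base_values"] > 0]                       # drop users with denominator 0

sums = df.groupby("variant")[["conversions", "base_values"]].sum()
rates = sums["conversions"] / sums["base_values"]    # A: 0.5, B: 0.333...
uplift = 100 * (rates["B"] - rates["A"]) / rates["A"]
print(f"uplift of B vs. control: {uplift:.1f}%")     # -33.3%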
What to measure: multiple metrics
Including just a single "primary" metric is a bad idea.
● “Conversion funnel” drop examination
○ e.g. landing page -> browse items -> add an item to cart -> checkout
○ at each successive step, fewer users complete it.
● Accidental deterioration on ignored metrics
○ different teams might care about different metrics
■ your change increases the number of ads that people browse ✅
■ your change makes it harder for people to create an ad ❌
What to measure: standard “health” metrics
Include standard “health metrics” that
● should never decrease
● cover the funnel
● are aligned with company goals
● are included in every experiment
This secures against deterioration w.r.t. global company goals.
Data transformation
Data transformation
Recall that conversion rate = (∑_users numerator_user) / (∑_users denominator_user)
For each user in a variant, we have a numerator and denominator value
● The numerator represents the conversion or action
● The denominator represents the base value or visit
Data transformation:
1. Remove data points where the denominator is 0
2. Remove outliers
→ Feed data to a test
How to apply a test to a single ratio value?
Need a mean and variance (t-test), or ranks (MWU-test)
Data transformation: linearization
Data transformation: linearization
We use the first-degree Taylor approximation of the function
● f(x, y) = x / y
The Taylor series of a function f(x, y) is
● a sum of terms, expressed in terms of the function's derivatives at a single point (a, b)
● an approximation: the Taylor series F(x, y, a, b) ≅ f(x, y)
● exact at the expansion point: F(x, y, a, b) equals f(x, y) at (a, b)
Basic example of Taylor series (varying
degrees) of f(x) = cos(x) through the point a = 0
Data transformation: linearization
We use the first-degree Taylor approximation of the function
● f(x, y) = x / y
Conversion rate: f(sum(X), sum(Y)), where X = numerators & Y = denominators
If we apply its Taylor expansion F(X, Y, sum(X), sum(Y)) → Z, then
● mean(Z) ≅ f(sum(X), sum(Y)), the conversion rate ¹
● variance(Z) ≅ the variance of the conversion-rate estimate ²
Now we can treat the data as a single list, Z (see the sketch after the references below)
1. http://www.stat.cmu.edu/~hseltman/files/ratio.pdf
2. "Linearization method for a nonlinear estimator". Chapter 5.3 from Lehtonen, R. & Pahkinen, E. (2004). Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition. John Wiley & Sons. https://wiki.helsinki.fi/download/attachments/50784462/Diat_2b.pdf
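
A minimal sketch of this linearization, under the assumption that the standard delta-method form z_i = R + (x_i - R*y_i) / mean(y) is the one intended by the references above:

import numpy as np

# First-order Taylor ("delta method") linearization of a ratio of sums.
# Returns one value per user: the mean of the result equals the
# conversion rate exactly, and its variance approximates the variance
# of the ratio estimate, so the list Z can be fed to a t-test or MWU-test.
def linearize_ratio(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    r = x.sum() / y.sum()              # the conversion rate f(sum(X), sum(Y))
    return r + (x - r * y) / y.mean()

x = np.array([0.0, 1.0, 2.0, 1.0])     # per-user numerators
y = np.array([3.0, 1.0, 2.0, 4.0])     # per-user denominators
z = linearize_ratio(x, y)
assert np.isclose(z.mean(), x.sum() / y.sum())
print(z.var(ddof=1) / len(z))          # approximate variance of the rate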
Data transformation: outlier removal
Pre-filtering
● Remove data from site crawlers etc.
Distribution check
● Don’t remove outliers in case of binary data
Outlier removal: combination of
● deviation from the mean, in standard deviations
● min/max percentage: set a clear maximum

removed = min(max(removing the top/bottom minimum_percent,
                  removing any data point more than x standard deviations from the mean),
              maximum_percent)
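
A sketch of this clamped rule in Python; the exact interplay of the three criteria is one interpretation of the pseudocode above, and all thresholds and names are illustrative.

import numpy as np

# The removed fraction is whatever the k-standard-deviations rule flags,
# clamped between min_pct and max_pct; the most deviant points go first.
def remove_outliers(z, k=3.0, min_pct=0.001, max_pct=0.01):
    z = np.asarray(z, dtype=float)
    dev = np.abs(z - z.mean())
    frac_by_sd = np.mean(dev > k * z.std())          # share flagged by the SD rule
    frac = min(max(frac_by_sd, min_pct), max_pct)    # the min(max(...), ...) above
    n_drop = int(round(frac * len(z)))
    keep = np.argsort(dev)[: len(z) - n_drop]        # drop the most deviant points
    return z[keep]

rng = np.random.default_rng(0)
data = rng.lognormal(size=10_000)
print(len(remove_outliers(data)))                    # at most 1% removed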
Testing
Tests: t-test (between-subjects)
Student's t-test
● Most common test
● Assumptions:
○ Means of data are normally distributed
○ Approximately equal variance in control and variant groups
Welch's t-test
● Student’s t-test without equal-variances assumption
Tests: t-test (between-subjects)
What if the assumptions of the t-test are not met?
● Most of our data is heavily right-skewed
● since most users take few or no actions

(Left: a fairly normal distribution. Right: the typical distribution in our experiments.)
Tests: t-test (between-subjects)
Central Limit Theorem
● the distribution of the sample mean will eventually (as the number of samples grows) be a normal distribution

(Left: the original log-normal distribution. Right: take many samples of size 30 and record each mean.)
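
A quick check of this statement (numbers illustrative): means of repeated samples from a skewed log-normal population are far less skewed than the population itself.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.lognormal(size=100_000)      # heavily right-skewed
means = np.array([rng.choice(population, 30).mean() for _ in range(10_000)])
print(stats.skew(population))                 # roughly 6
print(stats.skew(means))                      # much closer to 0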
Tests: t-test (between-subjects)
the distribution of the mean will eventually be a normal distribution
image source: https://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/
Tests: t-test (between-subjects)
What if the assumptions of the t-test are not met?
● For moderately large samples the t-test is robust to light violations of normality1
● The t-test may be under-powered (less likely to reject a false null-hypothesis) if the
data is asymmetric and the sample size is not large enough 2, 3, 4, 5, 6
The MWU-test is suggested in this latter case
1. "An Introduction to Medical Statistics." Martin Bland, 1995, Oxford University Press
2. https://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/
3. https://www.johndcook.com/blog/2018/05/11/two-sample-t-test/
4. "A More Realistic Look at the Robustness and Type II Error Properties of the t Test to Departures From Population Normality". Sawilowsky, Blair. 1992
5. "A Comparison of the Power of Wilcoxon's Rank-Sum Statistic to That of Student's t Statistic Under Various Nonnormal Distributions". Blar, Higgins. 1980.
6. "Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules". Fay, Proschan. 2010.
Tests: t-test vs. MWU-test
Simulation study: comparing the t-test with MWU-test
MWU-test:
● non-parametric, no normality assumption
● checks if two samples have come from different distributions
● checks how likely it is that a randomly picked value from variant A is greater or less than a randomly picked value from variant B
Examine the power of the two tests, for data with varying skewness.
● Power = probability that the test rejects the null hypothesis in the case that the
alternative hypothesis is true
Tests: t-test vs. MWU-test: simulation
Estimated power = (1/N) * ∑_{i=1..N} I(p(sample_i) < α), where:
● sample_i = a function generating the control and variant samples from the distribution
● p = a function getting the p-value for the test on that sample
● I = the indicator function
● N = the number of trials
For our experiments, we chose N = 100.
● Distributions: 36 Gamma distributions with varying skewness & standard deviations
● Sampling: use minimum sample size calculation for t-test (for power of 80%)
● Variant distribution was generated such that the uplift of the experiment would be 2%.
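
A sketch of one such trial loop; the Gamma parameters and sample size here are illustrative, not the 36 configurations of the original study.

import numpy as np
from scipy import stats

# Monte Carlo power estimate: the fraction of N trials in which each test
# rejects at alpha = 0.05, for a ~2% mean uplift between Gamma distributions.
rng = np.random.default_rng(42)
N, alpha, n = 100, 0.05, 5_000
shape, scale = 0.5, 1.0                      # a strongly right-skewed Gamma

rejected = {"t-test": 0, "MWU-test": 0}
for _ in range(N):
    control = rng.gamma(shape, scale, size=n)
    variant = rng.gamma(shape, scale * 1.02, size=n)   # mean uplift of 2%
    if stats.ttest_ind(control, variant, equal_var=False).pvalue < alpha:
        rejected["t-test"] += 1
    if stats.mannwhitneyu(control, variant).pvalue < alpha:
        rejected["MWU-test"] += 1

for name, count in rejected.items():
    print(f"{name}: estimated power = {count / N:.2f}")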
Tests: t-test vs. MWU-test: simulation
We confirmed that the MWU-test is a better choice than the t-test the more skewed the distributions of the control and
variant groups are: at the same sample size, the MWU-test then yields higher power.
Choosing a test
We also have binary data, in which case the Chi-square test is more applicable
Flow:
● if binary data:
○ if enough samples
■ Chi-square test
○ else
■ Fisher's exact test
● else (continuous data):
○ if data is normal enough:
■ if variances of control and treatment group equal:
● Student's t-test
■ else
● Welch's t-test
○ else:
■ MWU-test
Example Chi-square test contingency table:

            0    1
control    23   31
variant    28   22
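
A sketch of the flow above with scipy; the "enough samples" and "normal enough" checks are placeholders for whatever criteria the framework actually uses.

from scipy import stats

def run_test(control, variant, binary: bool) -> float:
    """Return a p-value following the decision flow above."""
    if binary:
        table = [[sum(x == 0 for x in control), sum(x == 1 for x in control)],
                 [sum(x == 0 for x in variant), sum(x == 1 for x in variant)]]
        if min(min(row) for row in table) >= 5:   # rule-of-thumb cell count
            return stats.chi2_contingency(table)[1]
        return stats.fisher_exact(table)[1]
    # continuous data: "normal enough" approximated by a normality test
    normal = (stats.normaltest(control).pvalue > 0.05
              and stats.normaltest(variant).pvalue > 0.05)
    if normal:
        equal_var = stats.levene(control, variant).pvalue > 0.05
        return stats.ttest_ind(control, variant, equal_var=equal_var).pvalue
    return stats.mannwhitneyu(control, variant).pvalue

# The contingency table above as input:
print(run_test([0] * 23 + [1] * 31, [0] * 28 + [1] * 22, binary=True))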
Test for continuous data with very small sample size
Sometimes minimum sample size cannot be reached (few samples per day)
We apply bootstrap test:
● Set the means of control & variant equal (forcing the null hypothesis to be true), then take a sample with replacement
● Run Welch's t-test, record result
● Repeat above 1000 times
● P-value = fraction of times result was more extreme than for observed data
● Can be better than MWU-test
● Depending on the underlying distribution, power improves by 0%, 2% or 4%
● Worked best when first transforming data with symmetrization method (Yeo-Johnson)
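
A sketch of this bootstrap, assuming centring both groups on the pooled mean is how the means are set equal; the Yeo-Johnson step (e.g. sklearn's PowerTransformer) would be applied to the data beforehand.

import numpy as np
from scipy import stats

def bootstrap_welch(control, variant, n_boot=1000, seed=0) -> float:
    rng = np.random.default_rng(seed)
    t_obs = stats.ttest_ind(control, variant, equal_var=False).statistic
    pooled = np.concatenate([control, variant]).mean()
    c0 = control - control.mean() + pooled   # force the null: equal means
    v0 = variant - variant.mean() + pooled
    extreme = 0
    for _ in range(n_boot):
        c = rng.choice(c0, size=len(c0), replace=True)
        v = rng.choice(v0, size=len(v0), replace=True)
        t = stats.ttest_ind(c, v, equal_var=False).statistic
        extreme += abs(t) >= abs(t_obs)      # "more extreme than observed"
    return extreme / n_boot                  # the bootstrap p-value

rng = np.random.default_rng(1)
print(bootstrap_welch(rng.lognormal(size=25), rng.lognormal(0.3, size=25)))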
Minimum sample size calculation
Minimum sample size calculation
Our setup:
● Get new data continuously
● Run “statistical engine” every night
How to know when to stop the experiment?
When does one have enough data to say the test is valid?
Minimum sample size calculation: t-test
Factor 1: minimum detectable effect (MDE) (settable)
● Recall uplift = 100 * (conversion_rate_variant - conversion_rate_control) / conversion_rate_control
● The MDE is the minimum uplift you would want to be able to detect
● The lower the MDE, the higher the sample size needs to be
● Usually set to 1%
Minimum sample size calculation: t-test
Factor 2: significance level (settable)
● Chance of rejecting the null-hypothesis when it’s actually true (lower is better)
● The lower the significance level, the higher the sample size needs to be
● Usually set to 5%, which corresponds to a 95% confidence interval
Factor 3: Power (settable)
● Chance of rejecting the null-hypothesis when it’s actually false (higher is better)
● The higher the desired power, the higher the sample size needs to be
● Usually set to 80%
Minimum sample size calculation: t-test
● Factor 4: population variance (not settable)
○ Can be estimated from the sample variance
○ The higher the variance, the higher the sample size needs to be
Sample size and power calculations. In Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social
Research, pp. 437-456). Gelman & Hill (2006).
Minimum sample size calculation: t-test
Final calculation¹:

n = 2σ²(z_α/2 + z_β)² / (μ_a - μ_x)²   (per group)

Where:
● n = the minimum sample size
● z_α/2 = 1.96, corresponds to a significance level of 5%
● z_β = 0.84, corresponds to a power of 80%
● σ = the (estimated) population standard deviation
● μ_a = the mean under the alternative hypothesis
● μ_x = the mean under the null hypothesis
● μ_a - μ_x = set to 1% of μ_x, the "minimum detectable effect"

1. Gelman & Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social Research). 2006
Minimum sample size calculation: t-test
z_α/2 = 1.96, corresponds to a significance level of 5%, or a confidence interval of 95%
z_β = 0.84, corresponds to a power of 80%

(Figure: standard normal distribution, x-axis in standard errors, showing the mean of the alternative hypothesis needed to cover the 95% confidence interval, and the mean needed to make 80% of the 95%-interval positive.)

source: Gelman & Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. 2006
Other minimum sample size calculations
Chi-Squared test (binary)
● For binary data, the chi-square test is equivalent to a binomial (proportions) test
● So it reduces to the minimum sample size calculation for a proportions test
● The normal distribution can be used to approximate the binomial distribution
● So this in turn reduces to the minimum sample size calculation for the t-test
Example Chi-square test contingency table:

            0    1
control    23   31
variant    28   22
MWU-test
● Source: Happ, Bathke and Brunner. Optimal Sample Size Planning for the Wilcoxon-Mann-Whitney Test, 2018 (https://github.com/happma/PyNonpar)
● Doesn't use an MDE; instead it needs advance (pilot) information about the two distributions
● Still uses a significance level and power, like the t-test calculation
User adoption
User adoption
User adoption can be a big hurdle
Teams already have "their own ways"
Possible ways to ease adoption:
● Standard way of expressing metrics + allow for custom metrics
○ numerator/denominator framework
● Give clear conclusions
○ positive/negative/waiting/inconclusive
● "Experimentation ambassadors" & internal courses
● Routine: embed examining experiments in demos/stand-ups
● Documentation and support channels
Summary
● Try to have a framework for expressing metrics
○ Allows for metric-"scalability"
● Always include primary metric + "health metrics"
● Standardize preprocessing and data transformation
○ Linearization
○ Outlier removal: set maximum
● Think about when to apply which test
● Formalize stopping criterion
● Give clear reportable results (uplift) and conclusion
End
Thanks for attending!