Setting up an A/B-testing
framework
Agnes van Belle
04-05-2021
Contents
1. Intro
2. Deciding what to measure
3. Data transformation
4. Testing: the t-test versus alternatives
5. When to stop the experiment, or: minimum sample size calculation
6. User adoption
7. Summary
Intro
Intro to A/B testing
1. Assign users to variants
2. Collect user data
3. Analyse data, yield conclusions
Why A/B testing
Optimize task for "human satisfaction" (clicks, purchases)
● Recommender systems, personalized ranking
● Any UI change
● Online advertising
○ selection effect vs. advertising effect1
Related
● Labeled sample size estimation and testing
○ e.g. image classification, named entity recognition
○ user surveys
1. A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook / Gordon, Zettelmeyer, Bhargava, Chapsky / 2019
About me
● Studied Artificial Intelligence at University of Amsterdam
● Most experience in recommendation, search/ranking and NLP
● Currently work at HeyJobs (Berlin)
● Formerly at OLX Group (Berlin)
○ Developed their internal A/B testing framework
○ Was working in Search & Recommendation team
● Formerly Search R&D lead at Textkernel; Data Scientist at Xomnia (Amsterdam)
A/B testing @ OLX
OLX
● global online marketplace
● selling & buying
● thousands of users every hour
Example test topics
● new ranking model
● new recommendation model
● UI changes
A/B testing setup
Between-subjects design
● Variant A / baseline
○ Current platform
○ “Control group” users
● Variant B
○ Platform with specific feature
○ “Treatment group” users
● ...
A/B testing setup
Assigning users to variants - requirements
● Consistency
○ users that see a specific variant of the site
won't be presented with a different variant of
the site during the experiment
● Countability
○ need to keep track of the number of users in
each variant for statistical purposes
A/B testing: user assignment
Assigning users to variants
● Few users: store assignment per user
○ control over the assignment
○ easy to add cohorts
● Many users: store assignment logic per experiment per variant
● Example:
○ have finite list of codes per experiment per variant
○ example codes: all three-character combinations of hexadecimal chars (4096)
○ assign a user to a variant if the first three hexadecimal chars of their user ID match one of that variant's codes
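
A minimal sketch of this scheme in Python. Salting the hash with the experiment name (so assignments are decorrelated across experiments) is an addition here, as the deck matches on the raw user ID; all names are illustrative.

import hashlib

# Each variant owns a disjoint subset of the 4096 three-character
# hexadecimal codes; a user falls into a variant when the first three
# hex characters of their (hashed) user ID are among that variant's codes.
ALL_CODES = [format(i, "03x") for i in range(16 ** 3)]   # '000' ... 'fff'

def codes_for_variant(variant: int, num_variants: int) -> set:
    # Deal the codes out round-robin so every code belongs to exactly one variant
    return set(ALL_CODES[variant::num_variants])

def assign(user_id: str, experiment: str, num_variants: int) -> int:
    # Hashing with the experiment name as salt is an assumption here;
    # the deck matches on the raw user ID instead
    prefix = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()[:3]
    return next(v for v in range(num_variants)
                if prefix in codes_for_variant(v, num_variants))

print(assign("user-42", "new-ranking-model", 2))  # stable 0 or 1 per user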
What to measure
What to measure: conversion rate
All our metrics are variations of the so-called “conversion rate”.
● base value / visit: page visit to the site necessary for conversion (denominator)
● conversion: action, like purchase or download (numerator)
● conversion rate: conversion, divided by the base value.
Example: we want to measure the number of responses to seen ads
● base value: “number of visits”
● conversion: “number of purchases after visit”
● conversion rate: “number of purchases after visit” / “number of visits”
source: https://www.werbe-agentur-graz.at/
What to measure: conversion rate
Note: conversion rate is a final single number
● not calculated per user
● conversion rate = (∑_users numerator_user) / (∑_users denominator_user)
Business goals: total success, not success per user.
What to measure: conversion rate
Variant A          User 1   User 2   User 3   User 4   User 5
  Conversions         0        0        1        2        0
  Base values         0        3        1        2        0

Variant B          User 6   User 7   User 8   User 9   User 10
  Conversions         3        1        1        0        0
  Base values         3        1        1       10        0

Conversion rate = (∑_users numerator_user) / (∑_users denominator_user)
● Conversion rate: (1+2)/(3+1+2) = 0.5 vs. (3+1+1)/(3+1+1+10) = 1/3 (A wins)
● Not the conversion rate (a sum of per-user ratios): (0/3)+(1/1)+(2/2) = 2 vs. (3/3)+(1/1)+(1/1)+(0/10) = 3 (B wins)
How to apply tests to conversion rate?
● for t-test, variance is needed
● for MWU-test, a list of rankable samples
→ covered later
What to measure: display uplift
Final metric to display: uplift
Uplift = 100 * (conversion_rate_variant - conversion_rate_control) / conversion_rate_control

(Chart: uplift of Variant B and Variant C relative to control)
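
As a sketch, computing the conversion rates and uplift for the example table above; the pandas usage here is illustrative, not the framework's actual code.

import pandas as pd

# Conversion rate as a ratio of sums (not a mean of per-user ratios),
# using the example table, followed by the uplift formula above.
df = pd.DataFrame({
    "variant":     ["A"] * 5 + ["B"] * 5,
    "conversions": [0, 0, 1, 2, 0, 3, 1, 1, 0, 0],
    "base_values": [0, 3, 1, 2, 0, 3, 1, 1, 10, 0],
})
df = df[df["base_values"] > 0]                       # drop users with denominator 0

sums = df.groupby("variant")[["conversions", "base_values"]].sum()
rates = sums["conversions"] / sums["base_values"]    # A: 0.5, B: 0.333...
uplift = 100 * (rates["B"] - rates["A"]) / rates["A"]
print(f"uplift of B vs. control: {uplift:.1f}%")     # -33.3%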
What to measure: multiple metrics
Including just a single "primary" metric is a bad idea.
● “Conversion funnel” drop examination
○ e.g. landing page -> browse items -> add an item to cart -> checkout
○ at each successive step, fewer users complete it.
● Accidental deterioration on ignored metrics
○ different teams might care about different metrics
■ your change increases the number of ads that people browse ✅
■ your change makes it harder for people to create an ad ❌
What to measure: standard “health” metrics
Include standard “health metrics” that
● should never decrease
● cover the funnel
● are aligned with company goals
● are included in every experiment
This secures against deterioration w.r.t. global company goals.
Data transformation
Data transformation
Recall that conversion rate = (∑_users numerator_user) / (∑_users denominator_user)
For each user in a variant, we have a numerator and denominator value
● The numerator represents the conversion or action
● The denominator represents the base value or visit
Data transformation:
1. Remove data points where the denominator is 0
2. Remove outliers
→ Feed data to a test
How to apply a test to a single ratio value?
Need a mean and variance (t-test), or ranks (MWU-test)
Data transformation: linearization
Data transformation: linearization
We use the first-degree Taylor approximation of the function
● f(x, y) = x / y
The Taylor series of a function f(x, y) is
● a sum of terms, expressed in terms of the function's derivatives at a single point (a, b)
● an approximation: the Taylor series F(x, y, a, b) ≅ f(x, y)
● exact at the expansion point: F(x, y, a, b) equals f(x, y) at (a, b)
Basic example of Taylor series (varying
degrees) of f(x) = cos(x) through the point a = 0
Data transformation: linearization
We use the first-degree Taylor approximation of the function
● f(x, y) = x / y
Conversion rate: f(sum(X), sum(Y)), where X = numerators & Y = denominators
If we apply its Taylor expansion F(X, Y, sum(X), sum(Y)) → Z, then
● mean(Z) ≅ f(sum(X), sum(Y)), the conversion rate ¹
● variance(Z) ≅ the variance of the conversion-rate estimate ²
Now we can treat the data as a single list, Z (see the sketch after the references below)
1. http://www.stat.cmu.edu/~hseltman/files/ratio.pdf
2. "Linearization method for a nonlinear estimator". Chapter 5.3 from Lehtonen, R. & Pahkinen, E. (2004). Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition. John Wiley & Sons. https://wiki.helsinki.fi/download/attachments/50784462/Diat_2b.pdf
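
A minimal sketch of this linearization, under the assumption that the standard delta-method form z_i = R + (x_i - R*y_i) / mean(y) is the one intended by the references above:

import numpy as np

# First-order Taylor ("delta method") linearization of a ratio of sums.
# Returns one value per user: the mean of the result equals the
# conversion rate exactly, and its variance approximates the variance
# of the ratio estimate, so the list Z can be fed to a t-test or MWU-test.
def linearize_ratio(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    r = x.sum() / y.sum()              # the conversion rate f(sum(X), sum(Y))
    return r + (x - r * y) / y.mean()

x = np.array([0.0, 1.0, 2.0, 1.0])     # per-user numerators
y = np.array([3.0, 1.0, 2.0, 4.0])     # per-user denominators
z = linearize_ratio(x, y)
assert np.isclose(z.mean(), x.sum() / y.sum())
print(z.var(ddof=1) / len(z))          # approximate variance of the rate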
Data transformation: outlier removal
Pre-filtering
● Remove data from site crawlers etc.
Distribution check
● Don’t remove outliers in case of binary data
Outlier removal: combination of
● deviation from the mean, in standard deviations
● min/max percentage: set a clear maximum

removed = min(max(removing the top/bottom minimum_percent,
                  removing any data point more than x standard deviations from the mean),
              maximum_percent)
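
A sketch of this clamped rule in Python; the exact interplay of the three criteria is one interpretation of the pseudocode above, and all thresholds and names are illustrative.

import numpy as np

# The removed fraction is whatever the k-standard-deviations rule flags,
# clamped between min_pct and max_pct; the most deviant points go first.
def remove_outliers(z, k=3.0, min_pct=0.001, max_pct=0.01):
    z = np.asarray(z, dtype=float)
    dev = np.abs(z - z.mean())
    frac_by_sd = np.mean(dev > k * z.std())          # share flagged by the SD rule
    frac = min(max(frac_by_sd, min_pct), max_pct)    # the min(max(...), ...) above
    n_drop = int(round(frac * len(z)))
    keep = np.argsort(dev)[: len(z) - n_drop]        # drop the most deviant points
    return z[keep]

rng = np.random.default_rng(0)
data = rng.lognormal(size=10_000)
print(len(remove_outliers(data)))                    # at most 1% removed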
Testing
Tests: t-test (between-subjects)
Student's t-test
● Most common test
● Assumptions:
○ Means of data are normally distributed
○ Approximately equal variance in control and variant groups
Welch's t-test
● Student’s t-test without equal-variances assumption
Tests: t-test (between-subjects)
What if the assumptions of the t-test are not met?
● Most of our data is heavily right-skewed
● since most users take few or no actions

(Left: a fairly normal distribution. Right: the typical distribution in our experiments.)
Tests: t-test (between-subjects)
Central Limit Theorem
● the distribution of the sample mean will eventually (as the number of samples grows) be a normal distribution

(Left: the original log-normal distribution. Right: take many samples of size 30 and record each mean.)
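
A quick check of this statement (numbers illustrative): means of repeated samples from a skewed log-normal population are far less skewed than the population itself.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.lognormal(size=100_000)      # heavily right-skewed
means = np.array([rng.choice(population, 30).mean() for _ in range(10_000)])
print(stats.skew(population))                 # roughly 6
print(stats.skew(means))                      # much closer to 0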
Tests: t-test (between-subjects)
the distribution of the mean will eventually be a normal distribution
image source: https://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/
Tests: t-test (between-subjects)
What if the assumptions of the t-test are not met?
● For moderately large samples the t-test is robust to light violations of normality1
● The t-test may be under-powered (less likely to reject a false null-hypothesis) if the
data is asymmetric and the sample size is not large enough 2, 3, 4, 5, 6
The MWU-test is suggested in this latter case
1. "An Introduction to Medical Statistics." Martin Bland, 1995, Oxford University Press
2. https://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/
3. https://www.johndcook.com/blog/2018/05/11/two-sample-t-test/
4. "A More Realistic Look at the Robustness and Type II Error Properties of the t Test to Departures From Population Normality". Sawilowsky, Blair. 1992
5. "A Comparison of the Power of Wilcoxon's Rank-Sum Statistic to That of Student's t Statistic Under Various Nonnormal Distributions". Blar, Higgins. 1980.
6. "Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules". Fay, Proschan. 2010.
Tests: t-test vs. MWU-test
Simulation study: comparing the t-test with MWU-test
MWU-test:
● non-parametric, no normality assumption
● checks if two samples have come from different distributions
● checks how likely it is that a randomly picked value from variant A is greater or less than a randomly picked value from variant B
Examine the power of the two tests, for data with varying skewness.
● Power = probability that the test rejects the null hypothesis in the case that the
alternative hypothesis is true
Tests: t-test vs. MWU-test: simulation
Estimated power = (1/N) * ∑_{i=1..N} I(p(sample_i) < α), where:
● sample_i = a function generating the control and variant samples from the distribution
● p = a function getting the p-value for the test on that sample
● I = the indicator function
● N = the number of trials
For our experiments, we chose N = 100.
● Distributions: 36 Gamma distributions with varying skewness & standard deviations
● Sampling: use minimum sample size calculation for t-test (for power of 80%)
● Variant distribution was generated such that the uplift of the experiment would be 2%.
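
A sketch of one such trial loop; the Gamma parameters and sample size here are illustrative, not the 36 configurations of the original study.

import numpy as np
from scipy import stats

# Monte Carlo power estimate: the fraction of N trials in which each test
# rejects at alpha = 0.05, for a ~2% mean uplift between Gamma distributions.
rng = np.random.default_rng(42)
N, alpha, n = 100, 0.05, 5_000
shape, scale = 0.5, 1.0                      # a strongly right-skewed Gamma

rejected = {"t-test": 0, "MWU-test": 0}
for _ in range(N):
    control = rng.gamma(shape, scale, size=n)
    variant = rng.gamma(shape, scale * 1.02, size=n)   # mean uplift of 2%
    if stats.ttest_ind(control, variant, equal_var=False).pvalue < alpha:
        rejected["t-test"] += 1
    if stats.mannwhitneyu(control, variant).pvalue < alpha:
        rejected["MWU-test"] += 1

for name, count in rejected.items():
    print(f"{name}: estimated power = {count / N:.2f}")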
Tests: t-test vs. MWU-test: simulation
We confirmed that the MWU-test is a better choice than the t-test the more skewed the distributions of the control and
variant groups are: at the same sample size, the MWU-test then yields higher power.
Choosing a test
We also have binary data, in which case the Chi-square test is more applicable
Flow:
● if binary data:
○ if enough samples
■ Chi-square test
○ else
■ Fisher's exact test
● else (continuous data):
○ if data is normal enough:
■ if variances of control and treatment group equal:
● Student's t-test
■ else
● Welch's t-test
○ else:
■ MWU-test
Example Chi-square test contingency table:

            0    1
control    23   31
variant    28   22
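
A sketch of the flow above with scipy; the "enough samples" and "normal enough" checks are placeholders for whatever criteria the framework actually uses.

from scipy import stats

def run_test(control, variant, binary: bool) -> float:
    """Return a p-value following the decision flow above."""
    if binary:
        table = [[sum(x == 0 for x in control), sum(x == 1 for x in control)],
                 [sum(x == 0 for x in variant), sum(x == 1 for x in variant)]]
        if min(min(row) for row in table) >= 5:   # rule-of-thumb cell count
            return stats.chi2_contingency(table)[1]
        return stats.fisher_exact(table)[1]
    # continuous data: "normal enough" approximated by a normality test
    normal = (stats.normaltest(control).pvalue > 0.05
              and stats.normaltest(variant).pvalue > 0.05)
    if normal:
        equal_var = stats.levene(control, variant).pvalue > 0.05
        return stats.ttest_ind(control, variant, equal_var=equal_var).pvalue
    return stats.mannwhitneyu(control, variant).pvalue

# The contingency table above as input:
print(run_test([0] * 23 + [1] * 31, [0] * 28 + [1] * 22, binary=True))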
Test for continuous data with very small sample size
Sometimes minimum sample size cannot be reached (few samples per day)
We apply bootstrap test:
● Set the means of control & variant equal (forcing the null hypothesis to be true), then take a sample with replacement
● Run Welch's t-test, record result
● Repeat above 1000 times
● P-value = fraction of times result was more extreme than for observed data
● Can be better than MWU-test
● Depending on the underlying distribution, power improves by 0%, 2% or 4%
● Worked best when first transforming data with symmetrization method (Yeo-Johnson)
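
A sketch of this bootstrap, assuming centring both groups on the pooled mean is how the means are set equal; the Yeo-Johnson step (e.g. sklearn's PowerTransformer) would be applied to the data beforehand.

import numpy as np
from scipy import stats

def bootstrap_welch(control, variant, n_boot=1000, seed=0) -> float:
    rng = np.random.default_rng(seed)
    t_obs = stats.ttest_ind(control, variant, equal_var=False).statistic
    pooled = np.concatenate([control, variant]).mean()
    c0 = control - control.mean() + pooled   # force the null: equal means
    v0 = variant - variant.mean() + pooled
    extreme = 0
    for _ in range(n_boot):
        c = rng.choice(c0, size=len(c0), replace=True)
        v = rng.choice(v0, size=len(v0), replace=True)
        t = stats.ttest_ind(c, v, equal_var=False).statistic
        extreme += abs(t) >= abs(t_obs)      # "more extreme than observed"
    return extreme / n_boot                  # the bootstrap p-value

rng = np.random.default_rng(1)
print(bootstrap_welch(rng.lognormal(size=25), rng.lognormal(0.3, size=25)))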
Minimum sample size calculation
Minimum sample size calculation
Our setup:
● Get new data continuously
● Run “statistical engine” every night
How to know when to stop the experiment?
When does one have enough data to say the test is valid?
Minimum sample size calculation: t-test
Factor 1: minimum detectable effect (MDE) (settable)
● Recall uplift = 100 * (conversion_rate_variant - conversion_rate_control) / conversion_rate_control
● The MDE is the minimum uplift you would want to be able to detect
● The lower the MDE, the higher the sample size needs to be
● Usually set to 1%
Minimum sample size calculation: t-test
Factor 2: significance level (settable)
● Chance of rejecting the null-hypothesis when it’s actually true (lower is better)
● The lower the significance level, the higher the sample size needs to be
● Usually set to 5%, which corresponds to a 95% confidence interval
Factor 3: Power (settable)
● Chance of rejecting the null-hypothesis when it’s actually false (higher is better)
● The higher the desired power, the higher the sample size needs to be
● Usually set to 80%
Minimum sample size calculation: t-test
● Factor 4: population variance (not settable)
○ Can be estimated from the sample variance
○ The higher the variance, the higher the sample size needs to be
Sample size and power calculations. In Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social
Research, pp. 437-456). Gelman & Hill (2006).
Minimum sample size calculation: t-test
Final calculation¹:

n = 2σ²(z_α/2 + z_β)² / (μ_a - μ_x)²   (per group)

Where:
● n = the minimum sample size
● z_α/2 = 1.96, corresponds to a significance level of 5%
● z_β = 0.84, corresponds to a power of 80%
● σ = the (estimated) population standard deviation
● μ_a = the mean under the alternative hypothesis
● μ_x = the mean under the null hypothesis
● μ_a - μ_x = set to 1% of μ_x, the "minimum detectable effect"

1. Gelman & Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social Research). 2006
Minimum sample size calculation: t-test
z_α/2 = 1.96, corresponds to a significance level of 5%, or a confidence interval of 95%
z_β = 0.84, corresponds to a power of 80%

(Figure: standard normal distribution, x-axis in standard errors, showing the mean of the alternative hypothesis needed to cover the 95% confidence interval, and the mean needed to make 80% of the 95%-interval positive.)

source: Gelman & Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. 2006
Other minimum sample size calculations
Chi-Squared test (binary)
● For binary data, the chi-square test is equivalent to a binomial (proportions) test
● So it reduces to the minimum sample size calculation for a proportions test
● The normal distribution can be used to approximate the binomial distribution
● So this in turn reduces to the minimum sample size calculation for the t-test
Example Chi-square test contingency table:

            0    1
control    23   31
variant    28   22
MWU-test
● Source: Happ, Bathke and Brunner. Optimal Sample Size Planning for the Wilcoxon-Mann-Whitney Test, 2018 (https://github.com/happma/PyNonpar)
● Doesn't use an MDE; instead it needs advance (pilot) information about the two distributions
● Still uses a significance level and power, like the t-test calculation
User adoption
User adoption
User adoption can be a big hurdle
Teams already have "their own ways"
Possible ways to ease adoption:
● Standard way of expressing metrics + allow for custom metrics
○ numerator/denominator framework
● Give clear conclusions
○ positive/negative/waiting/inconclusive
● "Experimentation ambassadors" & internal courses
● Routine: embed examining experiments in demos/stand-ups
● Documentation and support channels
Summary
● Try to have a framework for expressing metrics
○ Allows for metric-"scalability"
● Always include primary metric + "health metrics"
● Standardize preprocessing and data transformation
○ Linearization
○ Outlier removal: set maximum
● Think about when to apply which test
● Formalize stopping criterion
● Give clear reportable results (uplift) and conclusion
End
Thanks for attending!