Running experiments - from design to analysis
A/B tests under the hood
Maria Copot - CbusDAW
Who am I?
Researcher at Ohio State (Dept of Linguistics)
Main interest: characterising speakers’ knowledge of the words of their language
Broader interest in quantitative methodologies and statistical analysis
…Why am I here?
We both collect data and run experiments!
The pitch
Digital analysts run experiments, software drives design and implementation
Privacy laws, Cookie Ban™ - consequences for experimental data and design
A peek under the hood of common software, and proposals for branching out
The next half hour of your life
1. The gears of A/B testing software
2. The benefits of Bayesian methods
3. Paid participants - Prolific.co, Mechanical Turk
The goal vs the tools
All statistical analysis and experimentation is in the service of a goal.
Never lose track of the bigger picture!
Is A/B testing what you need?
When to use A/B testing
Comparing pairs of options differing along one dimension
Compare multiple dimensions between pairs of options? Multivariate tests!
Compare holistically different page designs? Redirect tests!
A/B testing
Goal → Hypothesis → Choose variations → Mysterious A/B testing black box (…goblins?) → Results!
The importance of having a hypothesis
Testing for random features might sometimes increase clicks or engagement
but…
The importance of having a hypothesis
Important to start from what you think is missing and why it will improve things.
“Users prefer X to Y because Z”
- Identifying areas for improvement helps come up with tests (X vs Y)
- Both positive and negative outcomes teach us about user motivations (is Z true?)
The importance of having a hypothesis
Results do not speak on their own
The narrative behind them is crucial for decisions
Statistical significance
A/B test - is conversion rate (CR) higher with a blue or a red button?
Blue (old): 1000 users, 15% CR
Red (new): 1000 users, 18% CR
p = 0.006
Statistical significance
Assume no difference between the blue and red button
How likely is the red button observation?
P-value = probability of the observation assuming no difference
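To make this concrete, here is a minimal Python sketch that derives a p-value for the blue vs red counts above. Using scipy's chi-squared test is an assumption for illustration; A/B testing tools may use a different test, so the result need not match the 0.006 on the slide.

```python
# Minimal sketch: a frequentist test for the blue vs red button example.
from scipy.stats import chi2_contingency

# Converted / not converted out of 1000 users per variant (counts from the slide)
blue = [150, 850]   # 15% conversion rate
red  = [180, 820]   # 18% conversion rate

chi2, p_value, dof, expected = chi2_contingency([blue, red])
# p_value = probability of data at least this extreme if there is truly no difference
print(f"p = {p_value:.3f}")
```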
Statistical significance
Threshold for significance usually set at p < 0.05 (arbitrary value)
Statistical significance
Threshold for significance usually set at p < 0.05 (arbitrary value)
Can lower the threshold for more precision, depending on application and variance in user behaviour.
✨Nothing magical about 0.05 ✨
Statistical significance
Always look at both effect size and p-value!
Larger expected effect sizes make it easier to get low p-values
Effect size: the difference between the means of the two hypotheses
The dangers of working with means
P-values and sample size
Low p-values are facilitated by large sample sizes
Blue (old): 10 users, 70% conversion rate
Red (new): 10 users, 80% conversion rate
p = 0.69

Blue (old): 100 users, 70% conversion rate
Red (new): 100 users, 80% conversion rate
p = 0.05
P-values and sample size
Blue (old): 100,000,000 users, 80.000% conversion rate
Red (new): 100,000,000 users, 80.001% conversion rate
p = 0.03
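A small sketch of the same effect, holding the 70% vs 80% conversion rates fixed and only growing the sample size (scipy and the chi-squared test are assumptions for illustration; the p-values on the slides are themselves illustrative):

```python
# Same conversion rates, growing sample size: the p-value shrinks as n grows.
from scipy.stats import chi2_contingency

for n in (10, 100, 1000):
    blue = [round(0.70 * n), n - round(0.70 * n)]  # 70% conversion rate
    red  = [round(0.80 * n), n - round(0.80 * n)]  # 80% conversion rate
    _, p_value, _, _ = chi2_contingency([blue, red])
    print(f"n = {n:>4} per variant -> p = {p_value:.3f}")
```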
Power analysis
How many data points are needed to have an 80% chance to discover an effect of the anticipated size?
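Such a power analysis takes only a few lines; below is a sketch using statsmodels (an assumption for illustration, not necessarily what any given A/B tool runs) for an anticipated lift from 15% to 18%:

```python
# Sketch: users needed per variant to detect a 15% -> 18% lift with 80% power at alpha = 0.05.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.18, 0.15)  # Cohen's h for the anticipated lift
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"~{n_per_variant:.0f} users needed per variant")
```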
P-values, sample sizes and time
Running assumption: users split 50/50 between variations.
But for certain features, this is not possible.
Example: purchase totals for people using Apple Pay vs manually entering card details.
P-value depends on:
Effect size (bigger difference = less overlap between hypotheses)
Sample size (higher = more certainty in the difference)
A/B testing - sneaking in frequentist statistics
Frequentists ask: how likely is the observation if the null hypothesis is true?
Probability of the data given the (null) hypothesis
Bayesian A/B testing
Bayesians ask: how likely is my hypothesis, given the data?
Bayesians vs frequentists
Frequentist estimation: a single unknown number with uncertainty around it, tested against the null hypothesis
Bayesian estimation: the entire distribution of the parameters of interest
Bayesian updating
p(hypothesis | evidence) = p(evidence | hypothesis) * p(hypothesis) / p(evidence)
posterior = likelihood * prior / marginal
Out of all the times people press the button, how many were blue vs red?
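As an illustration of this updating for conversion rates, here is a sketch using a Beta-Binomial model with uniform Beta(1, 1) priors; both the model and the priors are assumptions for the example, not the internals of any particular testing tool. The output answers the Bayesian question directly: the probability that red beats blue, given the data.

```python
# Sketch: Bayesian A/B test for conversion rates with Beta-Binomial updating.
# Beta(1, 1) priors (uniform) are an assumption; domain knowledge can go into the prior instead.
import numpy as np

rng = np.random.default_rng(42)

# Observed data from the button example: conversions out of 1000 users per variant
blue_conv, blue_n = 150, 1000
red_conv,  red_n  = 180, 1000

# Posterior for each conversion rate: Beta(prior_a + conversions, prior_b + non-conversions)
blue_posterior = rng.beta(1 + blue_conv, 1 + blue_n - blue_conv, size=100_000)
red_posterior  = rng.beta(1 + red_conv,  1 + red_n  - red_conv,  size=100_000)

# Probability that red truly beats blue, given the data and the priors
print(f"P(red > blue) ≈ {(red_posterior > blue_posterior).mean():.3f}")

# The full posterior is available too, e.g. a 95% credible interval for the lift
lift = red_posterior - blue_posterior
print("95% credible interval for the lift:", np.percentile(lift, [2.5, 97.5]))
```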
Advantages of Bayesian methods
• Intuitive results (p-values and confidence intervals are misunderstood)
• Reliable even for small sample sizes (no need to pre-define sample sizes)
• No need to estimate the effect size in advance
• Early stopping is allowed (continuous updating)
• Faster pipeline to decisions
• Can incorporate domain knowledge through priors
• Estimates the entire distribution
Testing features in the age of privacy laws
Desired tests are often more complex and nuanced than “red vs blue button”
• Longitudinal tracking
• Multiple outcomes
Need to know what behaviour comes from the same user
Paid participant pools
Platforms like Prolific.co, MTurk, Qualtrics Panel (and others!)
Consent forms allow you to track participant behaviour in depth
Participants can be recontacted for follow-up qualitative assessments
Large participant pool, filterable by detailed demographic information
Cons: participants must be paid and know they are taking part in an experiment
Thank you!
Any further questions:
maria.copot.s@gmail.com
