Running experiments - from design to analysis
A/B tests under the hood
Maria Copot - CbusDAW
Who am I?
Researcher at Ohio State (Dept of Linguistics)
Main interest: characterising speakers’ knowledge of the words of their language
Broader interest in quantitative methodologies and statistical analysis
…Why am I here?
We both collect data and run experiments!
The pitch
Digital analysts run experiments, software drives design and implementation
Privacy laws, Cookie Ban™ - consequences for experimental data and design
A peek under the hood of common software, and proposals for branching out
The next half hour of your life
1. The gears of A/B testing software
2. The benefits of Bayesian methods
3. Paid participants - Prolific.co, Mechanical Turk
The goal vs the tools
All statistical analysis and experimentation is in the service of a goal.
Never lose track of the bigger picture!
Is A/B testing what you need?
When to use A/B testing
Comparing pairs of options differing along one dimension
Compare multiple dimensions between pairs of options? Multivariate tests!
Compare holistically different page designs? Redirect tests!
A/B testing
Goal → Hypothesis → Choose variations → Mysterious A/B testing black box (…goblins?) → Results!
The importance of having a hypothesis
Testing for random features might sometimes increase clicks or engagement
but…
The importance of having a hypothesis
Important to start from what you think is missing and why it will improve things.
“Users prefer X to Y because Z”
- Identifying areas for improvement helps come up with tests (X vs Y)
- Both positive and negative outcomes teach us about user motivations (is Z true?)
The importance of having a hypothesis
Results do not speak on their own
The narrative behind them is crucial for decisions
Statistical significance
A/B test - is conversion rate (CR) higher with a blue or a red button?
Blue (old): 1000 users, 15% CR
Red (new): 1000 users, 18% CR
p = 0.006
Statistical significance
Assume no difference between the blue and red button
How likely is the red button observation?
P-value = probability of the observation assuming no difference
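To make this concrete, here is a minimal Python sketch that derives a p-value for the blue vs red counts above. Using scipy's chi-squared test is an assumption for illustration; A/B testing tools may use a different test, so the result need not match the 0.006 on the slide.

```python
# Minimal sketch: a frequentist test for the blue vs red button example.
from scipy.stats import chi2_contingency

# Converted / not converted out of 1000 users per variant (counts from the slide)
blue = [150, 850]   # 15% conversion rate
red  = [180, 820]   # 18% conversion rate

chi2, p_value, dof, expected = chi2_contingency([blue, red])
# p_value = probability of data at least this extreme if there is truly no difference
print(f"p = {p_value:.3f}")
```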
Statistical significance
Threshold for significance usually set at p < 0.05 (arbitrary value)
Statistical significance
Threshold for significance usually set at p < 0.05 (arbitrary value)
Can lower the threshold for more precision, depending on application and variance in user behaviour.
✨Nothing magical about 0.05 ✨
Statistical significance
Always look at both effect size and p-value!
Larger expected effect sizes make it easier to get low p-values
Effect size: the difference between the means of the two hypotheses
The dangers of working with means
P-values and sample size
Low p-values are facilitated by large sample sizes
Blue (old): 10 users, 70% conversion rate
Red (new): 10 users, 80% conversion rate
p = 0.69

Blue (old): 100 users, 70% conversion rate
Red (new): 100 users, 80% conversion rate
p = 0.05
P-values and sample size
Blue (old): 100,000,000 users, 80.000% conversion rate
Red (new): 100,000,000 users, 80.001% conversion rate
p = 0.03
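A small sketch of the same effect, holding the 70% vs 80% conversion rates fixed and only growing the sample size (scipy and the chi-squared test are assumptions for illustration; the p-values on the slides are themselves illustrative):

```python
# Same conversion rates, growing sample size: the p-value shrinks as n grows.
from scipy.stats import chi2_contingency

for n in (10, 100, 1000):
    blue = [round(0.70 * n), n - round(0.70 * n)]  # 70% conversion rate
    red  = [round(0.80 * n), n - round(0.80 * n)]  # 80% conversion rate
    _, p_value, _, _ = chi2_contingency([blue, red])
    print(f"n = {n:>4} per variant -> p = {p_value:.3f}")
```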
Power analysis
How many data points are needed to have an 80% chance to discover an effect of the anticipated size?
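Such a power analysis takes only a few lines; below is a sketch using statsmodels (an assumption for illustration, not necessarily what any given A/B tool runs) for an anticipated lift from 15% to 18%:

```python
# Sketch: users needed per variant to detect a 15% -> 18% lift with 80% power at alpha = 0.05.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.18, 0.15)  # Cohen's h for the anticipated lift
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"~{n_per_variant:.0f} users needed per variant")
```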
P-values, sample sizes and time
Running assumption: users split 50/50 between variations.
But for certain features, this is not possible.
Example: purchase totals for people using Apple Pay vs manually entering card details.
P-value depends on:
Effect size (bigger difference = less overlap between hypotheses)
Sample size (higher = more certainty in the difference)
A/B testing - sneaking in frequentist statistics
Frequentists ask: how likely is the observation if the null hypothesis is true?
Probability of the data given the (null) hypothesis
Bayesian A/B testing
Bayesians ask: how likely is my hypothesis, given the data?
Bayesians vs frequentists
Frequentist estimation: a single unknown number with uncertainty around it, tested against the null hypothesis
Bayesian estimation: the entire distribution of the parameters of interest
Bayesian updating
p(hypothesis | evidence) = p(evidence | hypothesis) * p(hypothesis) / p(evidence)
posterior = likelihood * prior / marginal
Out of all the times people press the button, how many were blue vs red?
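As an illustration of this updating for conversion rates, here is a sketch using a Beta-Binomial model with uniform Beta(1, 1) priors; both the model and the priors are assumptions for the example, not the internals of any particular testing tool. The output answers the Bayesian question directly: the probability that red beats blue, given the data.

```python
# Sketch: Bayesian A/B test for conversion rates with Beta-Binomial updating.
# Beta(1, 1) priors (uniform) are an assumption; domain knowledge can go into the prior instead.
import numpy as np

rng = np.random.default_rng(42)

# Observed data from the button example: conversions out of 1000 users per variant
blue_conv, blue_n = 150, 1000
red_conv,  red_n  = 180, 1000

# Posterior for each conversion rate: Beta(prior_a + conversions, prior_b + non-conversions)
blue_posterior = rng.beta(1 + blue_conv, 1 + blue_n - blue_conv, size=100_000)
red_posterior  = rng.beta(1 + red_conv,  1 + red_n  - red_conv,  size=100_000)

# Probability that red truly beats blue, given the data and the priors
print(f"P(red > blue) ≈ {(red_posterior > blue_posterior).mean():.3f}")

# The full posterior is available too, e.g. a 95% credible interval for the lift
lift = red_posterior - blue_posterior
print("95% credible interval for the lift:", np.percentile(lift, [2.5, 97.5]))
```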
Advantages of Bayesian methods
• Intuitive results (p-values and confidence intervals are misunderstood)
• Reliable even for small sample sizes (no need to pre-define sample sizes)
• No need to estimate the effect size in advance
• Early stopping is allowed (continuous updating)
• Faster pipeline to decisions
• Can incorporate domain knowledge through priors
• Estimates the entire distribution
Testing features in the age of privacy laws
Desired tests are often more complex and nuanced than “red vs blue button”
• Longitudinal tracking
• Multiple outcomes
Need to know what behaviour comes from the same user
Paid participant pools
Platforms like Prolific.co, MTurk, Qualtrics Panel (and others!)
Consent forms allow you to track participant behaviour in depth
Participants can be recontacted for follow-up qualitative assessments
Large participant pool, filterable by detailed demographic information
Cons: participants must be paid and know they are taking part in an experiment
Thank you!
Any further questions:
maria.copot.s@gmail.com
