A/B Testing at SweetIM – the Importance of Proper Statistical Analysis
Slava Borodovsky
SweetIM
slavabo@gmail.com
Saharon Rosset
Tel Aviv University
saharon@post.tau.ac.il
About SweetIM
• Provides interactive content and search services for IMs and social
networks.
• More than 1,000,000 monthly new users.
• More than 100,000,000 monthly search queries.
• Every new feature and product change passes A/B testing.
• The data-driven decision-making process is based on A/B testing results.
A/B testing – a buzz word?
• Standard A/B Flow
Previous works:
• KDD 2009, Microsoft experimentation platform (http://exp-platform.com/hippo.aspx)
• KDD 2010, Google team (“Overlapping Experiment Infrastructure: More, Better, Faster
Experimentation”)
LP Visit →
  Control (A) – 50% → No Change
  Feature (B) – 50% → New Product
A/B testing at SweetIM
New Users:
  LP Visit (Cookie) →
    Confirmation Page A (Control) → Cookie A (No Change)
    Confirmation Page B (New Feature) → Cookie B (New Feature)
Existing Users:
  Activity domain (Unique ID % N) →
    Unique ID % 2 = 0 → No Change
    Unique ID % 2 = 1 → New Feature
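The existing-user split above is a deterministic bucket on the installation ID. A minimal sketch of that assignment (assuming the unique ID is available as an integer; the function and group names are illustrative, not SweetIM's actual code):

```python
def assign_group(unique_id: int, n_groups: int = 2) -> str:
    """Bucket an existing user by Unique ID % N, as in the diagram above."""
    if unique_id % n_groups == 0:
        return "no_change"      # control arm: Unique ID % 2 = 0
    return "new_feature"        # treatment arm: Unique ID % 2 = 1
```

Because the mapping depends only on the ID, a user lands in the same group on every visit for the duration of the test.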
A/B Test on Search
(Screenshots: Group A vs. Group B search pages)
Does the change increase users' search usage (average searches per day)?
Results
Using the Poisson distribution, the difference is significant, but is this
assumption appropriate in our case?
Some well-known facts about internet traffic:
• Different users can have very different usage patterns.
• Non-human users such as automatic bots and crawlers exist.
                 A          B
# Users          17,287     17,295
# Activity       297,485    300,843
Average (μ)      17.21      17.39
StDev            45.83      44.03

% Difference     1.08%
Poisson p-value  0.00003
NB p-value       0.47300
Poisson assumption
• The inhomogeneous behavior of online users does not fit this distribution.
• The variance is much greater than the mean.
• Not well suited to the data.
Negative binomial assumption
• Can be used as a generalization of the Poisson in over-dispersed cases (Var >> Mean).
• Has been used in other domains to analyze count data (genetics, traffic modeling).
• Fits the real distribution well.
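The over-dispersion that motivates the switch to the negative binomial can be checked directly on per-user counts. A minimal sketch (names are illustrative): for a Poisson sample the variance-to-mean ratio is close to 1, while values far above 1 point toward a negative binomial model.

```python
def dispersion_index(counts):
    """Sample variance divided by sample mean of per-user activity counts.

    ~1 under a Poisson model; >> 1 indicates over-dispersion.
    """
    n = len(counts)
    mean = sum(counts) / n
    var = sum((x - mean) ** 2 for x in counts) / (n - 1)  # unbiased variance
    return var / mean
```

With the search-test summary statistics above (mean ≈ 17.2, StDev ≈ 45.8), the ratio is roughly 45.8² / 17.2 ≈ 122, far from the Poisson value of 1.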
Poisson
$X \sim \mathrm{Pois}(\lambda)$, if:

$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$

X has mean and variance both equal to the Poisson parameter λ.
Hypothesis:
H0: μ_a = μ_b
HA: μ_a < μ_b
Distribution of the difference between means:

$\bar X - \bar Y \sim N\left(\mu_a - \mu_b,\; \frac{\sigma_a^2}{n} + \frac{\sigma_b^2}{m}\right)$

Probability of obtaining a test statistic at least as extreme as the one that was
actually observed, assuming that the null hypothesis is true:

$P_{\mathrm{pois}} = \Phi\left(\frac{\bar X - \bar Y}{\sqrt{\mu t\,(n+m)/(nm)}}\right)$

$\bar X$, $\bar Y$ – means of the two groups, $\mu$ – Poisson parameter, $t$ – test duration, and $n$, $m$ – sizes of the test groups.
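The Poisson p-value above can be computed from summary statistics alone. A minimal stdlib-only sketch (Φ is implemented via the error function; the function name and the choice to estimate μt by the pooled per-user mean are illustrative assumptions):

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_p_value(x_bar: float, y_bar: float, n: int, m: int) -> float:
    """One-sided Poisson p-value: Phi((Xbar - Ybar) / sqrt(mu*t*(n+m)/(n*m))).

    mu*t (the per-user Poisson mean over the test duration) is estimated
    by the pooled mean of the two groups.
    """
    mu_t = (n * x_bar + m * y_bar) / (n + m)
    se = math.sqrt(mu_t * (n + m) / (n * m))
    return normal_cdf((x_bar - y_bar) / se)
```

Plugging in the search-test numbers (means 17.21 vs. 17.39 with n = 17,287 and m = 17,295) gives a tiny p-value, on the order of the 0.00003 reported on the results slide.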
Negative binomial
$X \sim \mathrm{NB}(a, p)$, if:

$P(X = k) = \binom{a + k - 1}{k}(1 - p)^a\, p^k$

In the over-dispersed Poisson case, $\lambda \sim \Gamma(a, \beta)$ and $X \mid \lambda \sim \mathrm{Pois}(\lambda)$, so
$X \sim \mathrm{NB}(a,\, 1/(1+\beta))$.
X has mean $\mu = a/\beta$ and variance $\sigma^2 = \mu(\mu + a)/a$.
Hypothesis:
H0: μ_a = μ_b
HA: μ_a < μ_b
Distribution of the difference between means (with the NB variance $\sigma^2 = \mu(\mu+a)/a$ in each group):

$\bar X - \bar Y \sim N\left(\mu_a - \mu_b,\; \frac{\sigma_a^2}{n} + \frac{\sigma_b^2}{m}\right)$

Probability of obtaining a test statistic at least as extreme as the one that was
actually observed, assuming that the null hypothesis is true:

$P_{\mathrm{NB},a} = \Phi\left(\frac{\bar X - \bar Y}{\left(\frac{\mu t(\mu t + a)}{a}\cdot\frac{n+m}{nm}\right)^{1/2}}\right)$

$\bar X$, $\bar Y$ – means of the two groups, $\mu$, $a$ – NB parameters, $t$ – test duration, and $n$, $m$ – sizes of the test groups.
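The NB p-value needs one extra quantity, the shape parameter a, which can be estimated by the method of moments from σ² = μ(μ + a)/a. A sketch with illustrative names (the slide's exact p-value may differ depending on how a was actually fitted):

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def nb_shape(mean: float, var: float) -> float:
    """Method-of-moments shape: solve var = mean*(mean + a)/a for a."""
    return mean ** 2 / (var - mean)

def nb_p_value(x_bar: float, y_bar: float, n: int, m: int, a: float) -> float:
    """One-sided NB p-value with per-user variance mu*t*(mu*t + a)/a."""
    mu_t = (n * x_bar + m * y_bar) / (n + m)   # pooled per-user mean
    var = mu_t * (mu_t + a) / a                # NB variance under H0
    se = math.sqrt(var * (n + m) / (n * m))
    return normal_cdf((x_bar - y_bar) / se)
```

With the search-test numbers and a fitted from group A (mean 17.21, StDev 45.83), the p-value lands well above 0.05, matching the slide's conclusion that the difference is not significant under the NB model.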
Results (Test 1):
The difference is not significant under the NB model.
Results of using the proper methodology:
• Saved production from an unnecessary change.
• Allowed the A group to keep benefiting from the additional features.
                 A          B
# Users          17,287     17,295
# Activity       297,485    300,843
Average (μ)      17.21      17.39
StDev            45.83      44.03

% Difference     1.08%
Poisson p-value  0.00003
NB p-value       0.47300
A/B Tests on Content
Example – a two-sided content test.
The new feature allows users who don't have SweetIM installed on their
computer to receive funny content from SweetIM users.
Results:
• The new feature was implemented.
• Increased application usage and improved user experience.
                 A          B
# Users          17,568     17,843
# Activity       770,436    930,066
Average (μ)      43.85      52.12
StDev            74.80      96.00

% Difference     8.68%
Poisson p-value  0.00000
NB p-value       0.00000
A/A Tests
Required for checking:
• The randomization algorithm and distribution system.
• The technical aspects of the A/B testing system.
• The soundness of the methodology, hypotheses and analysis.
• The existence of robots and crawlers.
LP Visit →
  Control (A) – 50% → No Change
  Control (A) – 50% → No Change
A/A Tests
• A/A Test on Search

                 A            B
# Users          61,719       61,608
# Activity       1,274,279    1,288,333
Average (μ)      20.65        20.91
StDev            52.77        49.69

% Difference     1.27%
Poisson p-value  0.00000
NB p-value       0.11000

• A/A Test on Content

                 A            B
# Users          61,719       61,608
# Activity       3,142,766    3,165,208
Average (μ)      50.92        51.38
StDev            97.50        97.00

% Difference     0.98%
Poisson p-value  0.00000
NB p-value       0.18500
“Fraud” Detection
• It is almost impossible to filter out all non-human activity on the web.
• Automatic bots and crawlers can bias the results and lead to wrong
conclusions.
• This needs to be checked in every test.
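Such a per-test check can start as simply as flagging accounts whose activity rate is implausible for a human. A hedged sketch (the threshold, field names, and data layout are illustrative assumptions, not SweetIM's actual fraud rules):

```python
def flag_suspected_bots(searches_per_day: dict, max_human_rate: float = 1000.0) -> set:
    """Return the user IDs whose average daily search count exceeds the cap.

    Flagged IDs would be excluded (or investigated) before computing
    the test statistics, since a single crawler can dominate the mean.
    """
    return {uid for uid, rate in searches_per_day.items() if rate > max_human_rate}
```

In practice this is only a first screen; rate caps catch crude crawlers but not bots that mimic human volumes.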
Conclusions
• Overview of the SweetIM A/B testing system.
• Some insights on statistical aspects of A/B testing methodology as related
to count-data analysis.
• Suggestion to use the negative binomial approach instead of the incorrect
Poisson for over-dispersed count data.
• Real-world examples of A/B and A/A tests from SweetIM.
• A word about fraud.
A/B Test on Search Logo
(Screenshots: Group A vs. Group B search logos)
Which one was better?
A/B Test on Search Logo
(Screenshots: Group A vs. Group B search logos)
                 A          B
# Users          19,249     19,725
# Activity       343,136    355,929
Average (μ)      17.80      18.50
StDev            45.35      47.72

% Difference     3.93%
Poisson p-value  0.00000
NB p-value       0.00780
Thank You!