Statistics in Data Analysis
Additional methods in data analysis, or something useful you never knew existed
Maxim Neronov – Nexters Data Analyst

About Speaker
Work Experience
- 2 years as Game Designer
- 4 years as Data Analyst
Education
- B.S. in Applied Mathematics
- M.A. in Computer Science

Main Themes
What are we actually talking about?
- Stochastic dominance: can there be an order?
- Empirical distribution functions and their estimation
- An example of consecutive tests with a function of random variables

Stochastic Dominance
Is there any way to order randomness?

What are we actually talking about?
- Random variables & stochastic processes
What is an order?
- A total order is an instrument for comparing things
- One random variable ξ is stochastically dominant over another random variable ν if P(ξ > x) ≥ P(ν > x) for all x

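The definition above can be checked on data by comparing empirical distribution functions, since P(ξ > x) ≥ P(ν > x) is the same as ECDF_ξ(x) ≤ ECDF_ν(x). The following is a minimal sketch, not the speaker's code; the helper names `ecdf_on_grid` and `empirically_dominates` are hypothetical, and the check is purely descriptive (it says nothing about statistical significance).

```python
import numpy as np

def ecdf_on_grid(sample, grid):
    """Fraction of sample values <= each grid point (the ECDF evaluated on grid)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def empirically_dominates(a, b):
    """True if the ECDF of a is never above the ECDF of b on the pooled values."""
    grid = np.union1d(a, b)  # sorted unique values from both samples
    return bool(np.all(ecdf_on_grid(a, grid) <= ecdf_on_grid(b, grid)))

a = [2, 3, 4, 5]  # toy data: every value of b shifted up by one
b = [1, 2, 3, 4]
print(empirically_dominates(a, b))  # True
print(empirically_dominates(b, a))  # False
```
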
Simple Example
Let's see how it works with two normal distributions with the same variance but different expected values.
Dominance here means that both the mean and the median of the first distribution are less than those of the second distribution.

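A small numerical check of this example, assuming (hypothetically) expected values 0 and 1 and a shared standard deviation of 1; the slides do not give the actual parameters. It uses the exact survival function P(X > x) from `scipy.stats.norm.sf` on a grid rather than samples.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-6.0, 6.0, 1201)
sf_first = norm.sf(x, loc=0.0, scale=1.0)   # P(first > x), assumed mean 0
sf_second = norm.sf(x, loc=1.0, scale=1.0)  # P(second > x), assumed mean 1

# The second distribution dominates if its survival function is never below the first.
second_dominates = bool(np.all(sf_second >= sf_first))
first_dominates = bool(np.all(sf_first >= sf_second))
print(second_dominates, first_dominates)
```

With equal variances the two CDFs never cross, which is why the ordering is clean in this case.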
Statistical Significance
1. Data is random → there may be variance in what we look at
2. We need a stable algorithm to prove that the dominance is significant
3. Somehow we are all familiar with this statistical test

«Let x and y be two random variables having continuous C.D.F. f and g respectively. The variable x will be called stochastically smaller than y if f(a) > g(a) for every a. We wish to test the hypothesis f = g against the alternative that x is stochastically smaller than y»
H. B. Mann and D. R. Whitney, 1947

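The test quoted above is available in SciPy as `scipy.stats.mannwhitneyu`. A minimal sketch on synthetic data follows; the sample sizes, the shift of 0.5 and the seed are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)  # hypothetical control metric
y = rng.normal(loc=0.5, scale=1.0, size=500)  # hypothetical shifted metric

# H0: the two distributions are equal.
# H1: the distribution of x is stochastically less than that of y.
result = mannwhitneyu(x, y, alternative="less")
print(f"U = {result.statistic:.1f}, p-value = {result.pvalue:.3g}")
```
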
Not so Simple Example

              Mean     Median
Exponential   1.9795   1.3863   (mean is larger)
Normal        1.7816   1.78     (median is larger)

Test             H1        p-value   Result
Mann-Whitney     Less      10^-6     Less
Student t-test   Less      0.999     Greater
KS-test          Less      10^-6     Both
KS-test          Greater   10^-6     Both

Comparing the Exponential sample to the Normal one, we got:
- The sample mean is greater or equal
- The Mann-Whitney test called the exponential sample stochastically less
- The Kolmogorov-Smirnov test says it is both greater and less

1. We can detect a difference in mean
2. More elements are «less»
3. Both data samples are dominant on different parts of the axis

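A sketch of how such a comparison could be reproduced with SciPy. The exponential scale of 2 matches the reported median (2·ln 2 ≈ 1.3863); the normal standard deviation of 1, the sample size and the seed are assumptions, so the p-values will not match the slide exactly.

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu, ttest_ind

rng = np.random.default_rng(7)
n = 5000                                           # hypothetical sample size
exponential = rng.exponential(scale=2.0, size=n)   # mean 2, median 2*ln(2)
normal = rng.normal(loc=1.78, scale=1.0, size=n)   # scale=1.0 is an assumption

print("means:  ", exponential.mean(), normal.mean())
print("medians:", np.median(exponential), np.median(normal))

# Each test compares the exponential sample (first) with the normal one (second).
p_values = {
    "Mann-Whitney, H1: less": mannwhitneyu(exponential, normal, alternative="less").pvalue,
    "Student t-test, H1: less": ttest_ind(exponential, normal, alternative="less").pvalue,
    "KS-test, H1: less": ks_2samp(exponential, normal, alternative="less").pvalue,
    "KS-test, H1: greater": ks_2samp(exponential, normal, alternative="greater").pvalue,
}
for name, p in p_values.items():
    print(f"{name:26s} p = {p:.3g}")
```

Under these assumed parameters the two ECDFs cross, which is the situation in which both one-sided KS alternatives can come out significant while the mean and the median point in opposite directions.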
Although you can describe the null hypothesis in terms of dominance or mean difference, it will most often be difficult to say how it affects your product. In general, Mann-Whitney detects dominance in terms of «most elements» being larger/smaller.
Do not compare things that differ in distribution shape.

How to use
- When the metrics are continuous and «stable»*
- If the experiment has little effect on variance
- Always look at the ECDF curve before conducting any kind of analysis
- The Mann-Whitney U-test tells you which sample has more elements on the right/left side
- The two-sample KS-test tells you if there is a difference at all
* Python has a continuity correction for discrete values

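Looking at the ECDF needs very little code. A minimal sketch with NumPy; the helper name `ecdf` is hypothetical, and the returned arrays can be passed to any step plot.

```python
import numpy as np

def ecdf(sample):
    """Return the sorted unique values and the ECDF height at each of them."""
    values, counts = np.unique(np.asarray(sample), return_counts=True)
    return values, np.cumsum(counts) / counts.sum()

values, heights = ecdf([3, 1, 2, 2])
print(values.tolist())   # [1, 2, 3]
print(heights.tolist())  # [0.25, 0.75, 1.0]
```
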
Estimating ECDF

CDF & ECDF Estimation
- It's nice to have an ECDF
- We know how to work with pointwise estimates (mean, variance)
- Can we do something more?

Kolmogorov Statistic
The maximum difference between the CDF and the ECDF is defined as the Kolmogorov-Smirnov statistic.

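To make the definition concrete, the sketch below computes the statistic by hand for a synthetic standard normal sample and compares it with `scipy.stats.kstest`. The sample and seed are assumptions. Because the ECDF is a step function, the largest gap has to be checked just after and just before each jump.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(0)
sample = np.sort(rng.normal(size=200))  # synthetic data for illustration
n = sample.size
cdf = norm.cdf(sample)                  # hypothesised CDF at the sample points

d_plus = np.max(np.arange(1, n + 1) / n - cdf)  # ECDF above CDF, right after a jump
d_minus = np.max(cdf - np.arange(0, n) / n)     # CDF above ECDF, right before a jump
d_manual = max(d_plus, d_minus)

d_scipy = kstest(sample, "norm").statistic
print(d_manual, d_scipy)
```
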
DKW Inequality
The KS test may also be used for testing hypotheses like F(x) ≤ G(x). More importantly, it opens up new ideas for estimating bounds around the ECDF.
The DKW inequality provides such bounds.
When to use
- Sometimes we change not only the mean/median/variance but the nature of some events

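For reference, the Dvoretzky-Kiefer-Wolfowitz inequality and the confidence band it gives can be written as follows. The notation is an assumption, not taken from the slides: F_n is the ECDF built from n independent observations, F is the true CDF, and α is the chosen error level.

```latex
P\Bigl(\sup_{x}\,\bigl|F_n(x) - F(x)\bigr| > \varepsilon\Bigr) \le 2e^{-2n\varepsilon^{2}}
\quad\Longrightarrow\quad
\varepsilon_\alpha = \sqrt{\frac{\ln(2/\alpha)}{2n}},
\qquad
F_n(x) - \varepsilon_\alpha \le F(x) \le F_n(x) + \varepsilon_\alpha
\ \text{ for all } x \text{ with probability at least } 1 - \alpha .
```

The band has the same half-width at every x and depends only on n and α, not on the shape of the distribution.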
Case with in-game mechanics
Let's see an example where we somehow changed the arena.
Any user can take up to 5 battles per day.
How do we check the difference and know where it happened?

           Mean    Median
Sample 1   2.262   2
Sample 2   2.331   2

Let's use the DKW inequality to build confidence bands around the ECDFs.
We can see a difference at the first step as well as at the second one, due to the growth of the ECDF.

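A sketch of this kind of comparison on synthetic data. The battle-count probabilities, the sample size of 20,000 and the helper name `dkw_band_on_grid` are all hypothetical; they are not the data behind the slide. The code builds a 95% DKW band for each sample and reports at which battle counts the bands do not overlap.

```python
import numpy as np

def dkw_band_on_grid(sample, grid, alpha=0.05):
    """Lower and upper DKW confidence band for the CDF, evaluated on grid."""
    sample = np.sort(np.asarray(sample))
    n = sample.size
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # DKW half-width
    ecdf = np.searchsorted(sample, grid, side="right") / n
    return np.clip(ecdf - eps, 0, 1), np.clip(ecdf + eps, 0, 1)

rng = np.random.default_rng(1)
grid = np.arange(0, 6)                               # 0..5 battles per day
p_old = [0.10, 0.25, 0.25, 0.20, 0.10, 0.10]         # hypothetical distribution
p_new = [0.05, 0.25, 0.28, 0.22, 0.10, 0.10]         # hypothetical distribution
old = rng.choice(grid, size=20_000, p=p_old)
new = rng.choice(grid, size=20_000, p=p_new)

lo_old, hi_old = dkw_band_on_grid(old, grid)
lo_new, hi_new = dkw_band_on_grid(new, grid)
separated = (hi_new < lo_old) | (hi_old < lo_new)    # bands do not overlap
for k, flag in zip(grid, separated):
    print(k, "bands separated" if flag else "bands overlap")
```

Non-overlapping bands are a conservative visual cue for where the two distributions differ, not a replacement for a formal test.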
General Advice
- Can be used for any kind of variable
- Doesn't give any specific answers about mean/median
- Easy to calculate
- Easy to visualise/explain
- Does not replace statistical testing
- Works well with mechanics and economy metrics

Consecutive Testing
What to do with functions of random variables?

Problem

Test 1   Test 2   Test 3   Test N
A / B    A / B    A / B    A / B

We want to conduct a series of consecutive tests.
In every test, group B is much worse from the start, so we don't need to test that. But we are making small changes and trying to make group B converge to group A.
The questions we want to answer:
1. Does the metric close the gap between iterations?
2. How much do we need to improve?
3. How many more iterations do we need to conduct?

         A, Tutorial   B, Tutorial   Ratio B/A
Test 1   55.8%         47.4%         0.849
Test 2   54.2%         48.4%         0.892

In both tests the difference in tutorial conversion is significant.
But is 0.892 significantly larger than 0.849, and are we going in the right direction?

Why is this even hard?
How do we find the unknown distribution F of a function of random variables?
What we have researched
1. Fieller's Theorem
2. Bootstrapping
3. Delta Method
4. Analytical research of ratio distributions

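As an illustration of the bootstrapping option from the list above, here is a minimal sketch of a percentile interval for the ratio B/A in Test 2. The group sizes of 10,000 are hypothetical (the slides give only percentages). For binary outcomes, resampling users with replacement is equivalent to drawing binomial counts with the observed rate, which is what the code does.

```python
import numpy as np

rng = np.random.default_rng(42)
n_a, n_b = 10_000, 10_000   # hypothetical group sizes
p_a, p_b = 0.542, 0.484     # Test 2 tutorial conversions from the slides

n_boot = 5000
conv_a = rng.binomial(n_a, p_a, size=n_boot) / n_a   # resampled conversion of A
conv_b = rng.binomial(n_b, p_b, size=n_boot) / n_b   # resampled conversion of B
ratios = conv_b / conv_a

low, high = np.percentile(ratios, [2.5, 97.5])
print(f"ratio B/A = {p_b / p_a:.3f}, 95% bootstrap interval = ({low:.3f}, {high:.3f})")
```

The interval width depends directly on the assumed group sizes, so the printed numbers should not be read as a result for the real experiment.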
Solution
Farrington-Manning Test
How to apply
1. Let the B conversion be p1
2. Let the A conversion be p2
3. Define r as the ratio from the previous test
4. Conduct a one-sided test with the «greater» alternative hypothesis
Pros
- No additional assumptions
- Directly solves the problem
- Described by an article
- Has an answer for sample size
Cons
- s & r are constants
- No existing implementation in Python
- The article itself was hard to find

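The steps above can be sketched as follows. This is one reading of the Farrington-Manning score test for a ratio of proportions, not the speaker's implementation: the function name `farrington_manning_ratio` is hypothetical, the group sizes of 10,000 are assumptions, and r is treated as a fixed constant, as the cons list notes. The variance is evaluated at the maximum likelihood estimates restricted to p1 = r·p2, which for the ratio case are the root of a quadratic.

```python
import math
from scipy.stats import norm

def farrington_manning_ratio(x1, n1, x2, n2, r):
    """One-sided score test of H0: p1 / p2 <= r against H1: p1 / p2 > r.

    x1, x2 are success counts and n1, n2 group sizes. Returns (z, p_value).
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Restricted MLE of p2 under p1 = r * p2: smaller root of a*p^2 + b*p + c = 0.
    a = (n1 + n2) * r
    b = -(n1 * r + x1 + n2 + x2 * r)
    c = x1 + x2
    p2_tilde = (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)
    p1_tilde = r * p2_tilde
    variance = p1_tilde * (1 - p1_tilde) / n1 + r**2 * p2_tilde * (1 - p2_tilde) / n2
    z = (p1_hat - r * p2_hat) / math.sqrt(variance)
    return z, float(norm.sf(z))  # upper-tail p-value for the «greater» alternative

# Test 2 conversions from the slides (B = 48.4%, A = 54.2%), hypothetical n = 10,000,
# r = 0.849 is the B/A ratio observed in Test 1.
z, p_value = farrington_manning_ratio(x1=4840, n1=10_000, x2=5420, n2=10_000, r=0.849)
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

A small p-value here would suggest that the ratio in the second test exceeds the previous ratio, under the assumed group sizes. Before relying on such code, it is worth checking it against the original article or an established implementation in another language.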
Back to the case

         A, Tutorial   B, Tutorial
Test 1   55.8%         47.4%
Test 2   54.2%         48.4%

General Advice
- Bootstrap is a good instrument, but sometimes you can solve the problem directly
- Look for scientific articles or popular library packages
- For a ratio of binomial proportions, use the Farrington-Manning test

Source & Stack
Source
- Wikipedia
- M.B. Lagutin: «Visual Mathematical Statistics» (М.Б. Лагутин: «Наглядная математическая статистика»), chapter 14
- H.B. Mann, D.R. Whitney: «On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other» (1947, DOI: 10.1214/aoms/1177730491)
- Fieller's Theorem (Wikipedia)
- Ratio Distribution (Wikipedia)
- Delta Method (Wikipedia)
- «Confidence Intervals for a Ratio of Binomial Proportions Based on Direct and Inverse Sampling Schemes» (2016, DOI: 10.1134/S1995080216040132)
- Farrington-Manning: «Test Statistics and Sample Size Formulae for Comparative Binomial Trials» (1990, DOI: 10.1002/sim.1242)

Thank you for listening!

Additional Descriptive Statistics methods for Data Analysis / Maxim Neronov (Nexters)
