Data Analysis
Descriptive Statistics and data exploration
Ivano Malavolta
Vrije Universiteit Amsterdam
Quick Recap
Idea → Experiment scoping → Experiment planning → Experiment operation → Analysis & interpretation → Presentation & package
Analysis and Interpretation
● Understanding the data
▪ descriptive statistics
▪ exploratory data analysis (EDA, e.g. boxplots, scatter plots)
● (Optional) data reduction
● Hypothesis testing
● Results interpretation
Descriptive Statistics
● Goal: get a ‘feeling’ about how data is distributed
● Properties:
▪ Central tendency (e.g. mean, median)
▪ Dispersion (e.g. frequency, standard deviation)
▪ Dependency (e.g., correlation)
Parameter vs. statistic
● Parameter: feature of the population
▪ μ: mean
▪ σ: standard deviation
● Statistic: feature of the sample
▪ x̄: mean
▪ s: standard deviation
● Statistics are estimates of parameters
Central Tendency
● Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
● Geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$
• It is like the arithmetic mean, but with multiplication
→ used when the collected data is not “additive” but “multiplicative”
• Less sensitive to outliers
• Try it when the range of the considered values is very large
Central Tendency: example
● Average of scores:
6 - 7 - 8 - 9 - 10
● Arithmetic mean: 8
● Geometric mean: ~7.87
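The two values above can be checked with a minimal sketch based on Python's standard `statistics` module. The variable names are illustrative, and `geometric_mean` assumes Python 3.8 or later.

```python
import statistics

scores = [6, 7, 8, 9, 10]

# Arithmetic mean: sum of the values divided by how many there are.
arithmetic = statistics.mean(scores)

# Geometric mean: n-th root of the product of the values.
geometric = statistics.geometric_mean(scores)

print(arithmetic)            # 8
print(round(geometric, 2))   # 7.87
```

For values this close together the two means barely differ; the gap widens as the spread of the values grows.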
Central tendency: example
● Average of returns of investments:
90% ; 10% ; 20% ; 30% ; -90%
● Arithmetic mean:
(90+10+20+30-90)/5= 12%
● Geometric mean:
[(1.9 x 1.1 x 1.2 x 1.3 x 0.1) ^ (1/5)] - 1 = -0.2008 = -20.08%
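The returns example can be reproduced with the sketch below. It assumes the returns are stored as fractions (0.90 for 90%) and that each one is turned into a growth factor 1 + r before multiplying; `math.prod` assumes Python 3.8 or later.

```python
import math

returns = [0.90, 0.10, 0.20, 0.30, -0.90]   # returns expressed as fractions

# Arithmetic mean of the returns.
arithmetic = sum(returns) / len(returns)

# Geometric mean: multiply the growth factors (1 + r), take the n-th root
# of the product, then convert the result back to a rate.
growth = math.prod(1 + r for r in returns)
geometric = growth ** (1 / len(returns)) - 1

print(f"{arithmetic:.2%}")   # 12.00%
print(f"{geometric:.2%}")    # -20.08%
```

The arithmetic mean suggests a gain, while the geometric mean shows that the investment actually shrank: the product of the growth factors is about 0.33, so roughly two thirds of the initial amount is lost.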
Central tendency
● Median (or 50th percentile): middle value separating the
greater and lesser halves of a data set
X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
Xsort = [13, 13, 13, 13, 14, 14, 16, 18, 21]
Median = 14 (the 5th of the 9 sorted values)
Central tendency
● Mode: most frequent value in data set
X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
Mo(X) = 13
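As a quick check of the median and mode of X, here is a short sketch using the standard `statistics` module (the variable name `x` is just an illustration):

```python
import statistics

x = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print(sorted(x))              # [13, 13, 13, 13, 14, 14, 16, 18, 21]
print(statistics.median(x))   # 14, the 5th of the 9 sorted values
print(statistics.mode(x))     # 13, which occurs four times
```

With an even number of values, `statistics.median` returns the average of the two middle values.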
Central tendency - Skewness
Dispersion
● Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
● Standard deviation: $s = \sqrt{s^2}$
● The standard deviation has the same unit of measure as the data
Informally: everything within 1 SD of the mean is “normal”
Informally: it gives an idea of how spread out the data is
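The sketch below computes the sample variance by hand and compares it with the standard `statistics` module. The n - 1 denominator is an assumption about the intended definition, chosen because it matches the example values in this deck; the dataset is the one from the dispersion examples.

```python
import math
import statistics

data = [90, 100, 110]
n = len(data)
mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)
sample_sd = math.sqrt(sample_var)

print(sample_var, sample_sd)         # 100.0 10.0
print(statistics.variance(data))     # same n - 1 estimator
print(statistics.pvariance(data))    # population version, divides by n
```

`statistics.variance` estimates the population parameter from a sample (the statistic s²), while `statistics.pvariance` treats the data as the whole population (the parameter σ²), which is why the latter is smaller here.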
Dispersion - three-sigma-rule
"Empirical Rule" by Dan Kernler - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG#/media/File:Empirical_Rule.PNG
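The three-sigma rule states that, for normally distributed data, roughly 68%, 95% and 99.7% of the values lie within 1, 2 and 3 standard deviations of the mean. The sketch below checks this on simulated data; the sample size, the seed and the standard normal distribution are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(0, 1) for _ in range(100_000)]

mean = statistics.fmean(sample)
sd = statistics.stdev(sample)

# Share of the sample that falls within k standard deviations of the mean.
share = {}
for k in (1, 2, 3):
    share[k] = sum(abs(x - mean) <= k * sd for x in sample) / len(sample)
    print(f"within {k} SD: {share[k]:.1%}")
```

The rule holds only approximately for real measurements, and it can be badly off when the data is skewed or heavy-tailed.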
Dispersion - range and coefficient of variation
● Range: $x_{max} - x_{min}$
● Coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100\%$
(standard deviation as a percentage of the mean)
● The coefficient of variation is meaningful only if all values are
positive (ratio scale, not interval scale, e.g. temperatures)
It is useful for comparing the dispersion
of variables with different units of measure
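To illustrate the comparison across units, the sketch below uses two hypothetical variables, execution time in milliseconds and energy in joules. The data, the variable names and the helper `coefficient_of_variation` are all made up for this example.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a fraction of the mean."""
    return statistics.stdev(values) / statistics.fmean(values)

time_ms = [90, 100, 110]      # hypothetical execution times
energy_j = [4.0, 5.0, 6.0]    # hypothetical energy measurements

sd_time = statistics.stdev(time_ms)
sd_energy = statistics.stdev(energy_j)
cv_time = coefficient_of_variation(time_ms)
cv_energy = coefficient_of_variation(energy_j)

print(f"SD: {sd_time:.1f} ms vs {sd_energy:.1f} J")
print(f"CV: {cv_time:.0%} vs {cv_energy:.0%}")
```

The standard deviations cannot be compared directly because their units differ. The coefficients of variation are unit-free and show that energy is relatively more dispersed (20% of its mean) than time (10% of its mean).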
Dispersion - example
● Dataset: [100, 100, 100]
Mean: 100
● Variance: 0
● Standard Deviation: 0
● Coeff. Variation: 0
● Range: 0
Dispersion - example
● Dataset: [90, 100, 110]
Mean: 100
● Sample Variance: 100
● Standard Deviation: 10
● Coeff. Variation: 10%
● Range: 20
Dispersion - example
● Dataset: [1, 5, 6, 8, 10, 40, 65, 88]
Mean: 27.875
● Sample Variance: 1082.70
● Standard Deviation: 32.9
● Coeff. Variation: 118%
● Range: 87
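The three dispersion examples can be recomputed with the sketch below. The helper `describe` is a hypothetical name, and it assumes the sample (n - 1) definitions of variance and standard deviation.

```python
import statistics

def describe(data):
    """Return the dispersion measures used in the examples."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)          # sample SD (n - 1 denominator)
    return {
        "mean": mean,
        "variance": statistics.variance(data),
        "sd": sd,
        "cv_percent": 100 * sd / mean,   # SD as a percentage of the mean
        "range": max(data) - min(data),
    }

datasets = ([100, 100, 100], [90, 100, 110], [1, 5, 6, 8, 10, 40, 65, 88])
for dataset in datasets:
    summary = describe(dataset)
    print(dataset, {name: round(value, 2) for name, value in summary.items()})
```

In the last dataset the standard deviation is larger than the mean, so the coefficient of variation exceeds 100%.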
Basic visualizations
Box Plot
Median
3rd quartile
1st quartile
Basic visualizations
Box Plot
Minimum/maximum values THAT ARE NOT OUTLIERS
Basic visualizations
Box Plot
By Gbdivers (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0
(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Figure annotations: outliers; positive skewness
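The elements of a box plot can be computed without drawing anything. The sketch below reuses the dataset X from the median example and assumes two conventions that the slides do not state: quartiles from `statistics.quantiles` with `method="inclusive"`, and the common rule that values more than 1.5 times the interquartile range away from the box count as outliers.

```python
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

# Box: 1st quartile, median, 3rd quartile.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
inside = [x for x in data if low_fence <= x <= high_fence]

print("box:", q1, median, q3)
print("whiskers:", min(inside), max(inside))   # whiskers: 13 18
print("outliers:", outliers)                   # outliers: [21]
```

The upper whisker stops at 18, the largest value that is not an outlier, and the long upper tail reflects the positive skewness of this dataset.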
Dependency: correlation
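● Sample correlation coefficient (Pearson): $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$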
● Meaningful when comparing paired values/datasets
Dependency: correlation
● Spearman’s rank correlation coefficient: Pearson’s coefficient computed on the ranks of the values
▪ also good for ordinal data
● Kendall’s rank correlation coefficient:
▪ typically yields smaller values than Spearman’s
▪ more accurate on small samples
● The Pearson correlation coefficient assumes normally distributed data
Dependency: example
Age vs. body fat %
● Pearson: r = 0.7921
● Spearman: 𝜌 = 0.7539
● Kendall: 𝜏 = 0.5762
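The age and body fat data is not reproduced in this deck, so the sketch below computes the three coefficients on hypothetical paired values instead. The function names are illustrative, and the Kendall variant shown is tau-a, which ignores ties; libraries often default to other variants.

```python
import math

def pearson(x, y):
    """Pearson's r: co-variation scaled by the spread of x and of y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# Hypothetical paired observations (not the age/body-fat data).
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 20]

print(round(pearson(x, y), 4))
print(round(spearman(x, y), 4))   # 0.8857
print(round(kendall(x, y), 4))    # 0.7333
```

The extreme last value of `y` pulls Pearson's r down because the relation is not linear, while the rank-based coefficients only see that `y` mostly increases with `x`. Kendall's value is smaller than Spearman's on the same data.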
Basic Visualizations
Scatter Plot
Positive VS negative correlation
https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
Scatter plots per different values of r
r = Pearson
rs = Spearman
https://www.researchgate.net/publication/224915794_Improving_standards_in_brain-behavior_correlation_analyses
Correlation does NOT imply causation!
● Spurious Correlations: http://tylervigen.com/
What does this lecture mean to you?
● Now you know how to explore trends within your data
▪ but you cannot reject null hypotheses yet
● You can get a “feeling” about
▪ how dispersed and correlated your data is
▪ what is “standard” in your data
● You can quickly visualize interesting trends
▪ box plots
▪ scatter plots
Ivano Malavolta / S2 group / Experiment design
Readings
Chapter 10
