Basic Statistical Concepts in Machine Learning.pptx

Introduction to
Basic Statistical Concepts
Statistics is a branch of mathematics that deals with the collection,
organization, analysis, interpretation, and presentation of data. It is used in
various fields such as business, economics, sociology, and more.
Understanding statistical concepts is essential for making informed decisions
and drawing meaningful conclusions.

Descriptive Statistics
Mean, Median, Mode
Descriptive statistics involves methods used
to summarize and describe data. It includes
measures of central tendency such as
mean, median, and mode.
Variability Measures
Descriptive statistics also includes measures
of variability, which provide insights into the
spread and dispersion of data.

Measures of Central Tendency
1 Mean
The mean is the average of a set of numbers and is calculated by summing all the
numbers and then dividing by the count of numbers.
2 Median
The median is the middle value when the numbers are arranged in ascending order. With an even
count of numbers, it is the average of the two middle values.
3 Mode
The mode is the value that appears most frequently in a set of data. It indicates the
most common observation.

Measures of Variability
1 Range
The range is the difference between
the highest and lowest values in a
dataset. It provides a simple measure
of variability.
2 Variance
Variance is the average of the
squared differences between each
data point and the mean. It
shows how much the data points are
spread out.
3 Standard Deviation
Standard deviation is a measure of the amount of variation or dispersion of a set of
values. It is the square root of the variance.

Inferential Statistics
Definition
Inferential statistics involves using data from a
sample to make predictions or inferences about
a population.
Applications
It is used to estimate the probability of
an outcome and to quantify how accurate a
prediction made from a sample is likely to be.

Hypothesis Testing
Formulate Hypothesis
The first step involves
stating a clear hypothesis
that you want to test based
on existing knowledge or
observations.
Collect Data
After formulating the
hypothesis, data is collected
to test and analyze the
validity of the hypothesis.
Analyze Results
The results are statistically
analyzed to determine
whether to reject or fail to
reject the null hypothesis.
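The three steps above can be illustrated with a small exact binomial test. This is a minimal sketch under stated assumptions, not the only way to test a hypothesis: the coin-flip scenario, the counts (60 heads in 100 flips), the 0.05 significance level, and the function names are all hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_two_sided_p(successes, n, p_null=0.5):
    """Exact two-sided p-value: the total probability, under the null
    hypothesis, of every outcome no more likely than the observed one."""
    probs = [binomial_pmf(k, n, p_null) for k in range(n + 1)]
    observed = probs[successes]
    # A small relative tolerance guards against floating-point ties.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))


# Step 1: formulate the hypothesis. Null: the coin is fair (p = 0.5).
# Step 2: collect data (hypothetical): 60 heads in 100 flips.
# Step 3: analyze the result at an assumed 0.05 significance level.
alpha = 0.05
p_value = binomial_two_sided_p(60, 100)
decision = "reject" if p_value < alpha else "fail to reject"
print(f"p-value = {p_value:.4f}: {decision} the null hypothesis")
```

With these illustrative numbers the p-value lands just above 0.05, so the data are not strong enough to reject the null hypothesis of a fair coin.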

Confidence Intervals
1 Upper Bound
The upper bound of a confidence
interval is the high end of the
interval: the largest parameter value
that is plausible at the chosen
confidence level.
2 Lower Bound
The lower bound of a confidence
interval is the low end of the
interval: the smallest parameter value
that is plausible at the chosen
confidence level.
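A minimal sketch of computing both bounds for a sample mean, assuming a normal approximation (for small samples a t-based interval would be somewhat wider). The sample values and the function name mean_confidence_interval are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds for the population mean using the
    normal approximation: mean ± z * s / sqrt(n)."""
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / n**0.5
    # z is the standard normal quantile for the chosen confidence level
    # (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return center - margin, center + margin


sample = [10, 15, 20, 25, 30]  # illustrative values only
lower, upper = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Raising the confidence level (for example to 0.99) widens the interval, and a larger sample narrows it.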

Types of Data
1 Nominal Data
Nominal data represents categories
without any order or sequence.
Examples include gender, colors, and
names.
2 Ordinal Data
Ordinal data represents categories
with a specific order or rank.
Examples include education levels
and survey ratings.

Sampling Methods
Simple Random Sampling
Every member of the population has an
equal chance of being selected.
Stratified Sampling
The population is divided into subgroups,
and samples are then randomly selected
from each subgroup.
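Both methods can be sketched with Python's standard random module. The population, the department labels, and the function names below are hypothetical, chosen only to illustrate the idea.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k members without replacement; each member is equally likely."""
    return random.Random(seed).sample(population, k)


def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Group members by stratum_of(member), then draw per_stratum
    members at random from each group."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    return {
        label: rng.sample(members, per_stratum)
        for label, members in strata.items()
    }


# Hypothetical population of (id, department) pairs.
population = [(i, "sales" if i % 3 else "engineering") for i in range(30)]
print(simple_random_sample(population, 5, seed=0))
print(stratified_sample(population, lambda m: m[1], 2, seed=0))
```

Stratified sampling guarantees that every subgroup appears in the sample, which simple random sampling does not.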

Common Statistical Distributions
Normal Distribution
The bell-shaped curve
represents a symmetrical
distribution with most values
clustered around the mean.
Binomial Distribution
It represents the number of
successes in a fixed number
of independent trials with the
same probability of success in
each trial.
Poisson Distribution
It models the number of
events occurring in a
fixed interval of time or space
when events happen independently
at a constant average rate.
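As a rough illustration, the three distributions can be evaluated with the standard library alone. The parameter values (10 coin flips, an average rate of 4 events) and the helper names are assumptions made for the example.

```python
from math import comb, exp, factorial
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in an interval with the given average rate)."""
    return rate**k * exp(-rate) / factorial(k)


# Normal: share of values within one standard deviation of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(f"Normal, within 1 SD of the mean: {within_one_sd:.4f}")

print(f"Binomial, P(3 heads in 10 fair flips): {binomial_pmf(3, 10, 0.5):.4f}")
print(f"Poisson, P(2 events | average rate 4): {poisson_pmf(2, 4):.4f}")
```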

Introduction to
Descriptive Statistics
Using Python
Descriptive statistics is a branch of statistics that involves the collection,
analysis, interpretation, and presentation of data. Its primary focus is on
summarizing and describing the main features of a dataset, providing a
comprehensive and meaningful overview.

Purpose and Goals
Summarization:
Condensing large amounts of data
into key insights.
Exploration:
Identifying patterns, trends, and
outliers within the data.
Communication:
Presenting findings in a clear and understandable manner to
facilitate decision-making.

Types of Descriptive Statistics
Measures of Central Tendency
Provide a central or typical value in a dataset.
Common measures include:
• Mean: Average of all values.
• Median: Middle value in a sorted dataset.
• Mode: Most frequently occurring value.
Measures of Dispersion
Indicate the spread or variability of the data.
• Range: Difference between the maximum and minimum values.
• Variance: Average of the squared differences from the mean.
• Standard Deviation: Square root of the variance.
Measures of Shape
Describe the distribution or shape of the data.
• Skewness: Indicates the asymmetry of the data distribution.
• Kurtosis: Measures the "tailedness" or sharpness of the data distribution.

Mean
The mean is the average of a set of numbers, calculated by adding all the numbers together and then dividing by the count of numbers.
Consider the following dataset:
[10, 15, 20, 25, 30]
• (10 + 15 + 20 + 25 + 30) / 5 = 20

Median
The median is the middle value of a data set when it is ordered from least to greatest. It represents the 50th
percentile of the data.
[10, 15, 20, 25, 30]
• The Middle value, which is also 20.

Mode
The mode is the value that appears most frequently in a given data set. It's the most common observation in the
data.
[10, 15, 20, 25, 30]
• No Mode in this case.
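The three worked examples above can be reproduced with Python's statistics module. This is a minimal sketch: the helper modes and its convention of reporting "no mode" as an empty list when all values tie are assumptions, not a standard API.

```python
from statistics import mean, median, multimode


def modes(data):
    """Return the most frequent values, or [] when several distinct values
    all occur equally often (treated here as 'no mode')."""
    most_common = multimode(data)
    distinct = set(data)
    if len(distinct) > 1 and len(most_common) == len(distinct):
        return []
    return most_common


data = [10, 15, 20, 25, 30]
print(mean(data))    # 20
print(median(data))  # 20
print(modes(data))   # []
```

For a dataset with a repeated value, such as [10, 15, 15, 20], modes returns [15].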

Measures of Dispersion
Range
The range is the difference
between the largest and the
smallest values within a
dataset. It provides a simple
measure of variability.
Variance
Variance is the average of the
squared differences between each
data point and the mean. It
indicates the spread of the
data.
Standard Deviation
The standard deviation is a
measure of the amount of
variation or dispersion of a set
of values. It is the square root
of the variance.
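A short sketch of the three measures, reusing the dataset [10, 15, 20, 25, 30] from the earlier slides as an assumed example. Note that the statistics module distinguishes population formulas (divide by n, matching the "average of squared differences" definition) from sample formulas (divide by n - 1).

```python
from statistics import pstdev, pvariance, stdev, variance

data = [10, 15, 20, 25, 30]

# Range: largest value minus smallest value.
value_range = max(data) - min(data)
print(f"range = {value_range}")                       # range = 20

# Population formulas divide by n.
print(f"population variance = {pvariance(data):.2f}")  # 50.00
print(f"population std dev  = {pstdev(data):.2f}")     # 7.07

# Sample formulas divide by n - 1 and are the usual choice when the
# data is a sample drawn from a larger population.
print(f"sample variance = {variance(data):.2f}")       # 62.50
print(f"sample std dev  = {stdev(data):.2f}")          # 7.91
```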

Measures of Dispersion
Example:
Indicate the spread or variability of the data.
Consider two datasets:
• Dataset A: [5,5,5,5,5]
• Dataset B: [0,10,5,10,0]
• Both datasets have the same mean (5), but Dataset B has higher dispersion: every value in Dataset A equals the mean, while Dataset B ranges from 0 to 10.

Measures of Shape
Example:
Describe the distribution or shape of the data.
Consider two datasets:
• Dataset C: [10,15,20,25,30]
• Dataset D: [10,10,20,30,30]
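• Both datasets are symmetric with the same mean and median (20), but their shapes differ: Dataset C is evenly
spread, while Dataset D's values cluster at the two ends, which gives it a lower kurtosis.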
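Shape measures for the two datasets can be checked numerically. The sketch below uses the moment-based (population) definitions of skewness and kurtosis; other conventions exist, such as sample-corrected versions or excess kurtosis, and the function names are hypothetical.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over the dataset."""
    m = sum(data) / len(data)
    return sum((x - m) ** order for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for symmetric data)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Moment-based kurtosis: m4 / m2 ** 2 (3 for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


dataset_c = [10, 15, 20, 25, 30]
dataset_d = [10, 10, 20, 30, 30]
for name, data in (("C", dataset_c), ("D", dataset_d)):
    print(f"{name}: skewness={skewness(data):.2f}, kurtosis={kurtosis(data):.2f}")
```

For these two datasets the sketch reports a skewness of 0.00 for both, since each is symmetric about 20, and a kurtosis of 1.70 for C versus 1.25 for D.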

Interquartile Range (IQR)
1 Definition
The interquartile range (IQR) is a measure of statistical dispersion, or how
spread out the values in a dataset are.
2 Calculation
It is calculated as the difference between the third quartile (Q3) and the first
quartile (Q1) in a dataset.

Interquartile Range (IQR)
IQR = Q3 – Q1
• Interquartile range is the amount of spread in the middle 50% of a dataset.
• In other words, it is the distance between the first quartile (Q1) and the third quartile (Q3).

How to Find IQR?
Here's how to find the IQR:
Step 1: Put the data in order from least to greatest.
Step 2: Find the median. If the number of data points is odd, the median is the middle data point. If the number of data points is
even, the median is the average of the middle two data points.
Step 3: Find the first quartile (Q1). The first quartile is the median of the data points to the left of the median in the ordered list.
Step 4: Find the third quartile (Q3). The third quartile is the median of the data points to the right of the median in the ordered
list.
Step 5: Calculate the IQR by subtracting Q1 from Q3 (IQR = Q3 – Q1).

Find the IQR of these scores:
1,3,3,3,4,4,4,6,6
Step 1: The data is already in order.
Step 2: Find the median. There are 9 scores, so the median is the middle score.
The median is 4.
Step 3: Find Q1, which is the median of the data to the left of the median.
There is an even number of data points to the left of the median, so we need the average of
the middle two data points.
1,3,3,3
Q1 = (3+3)/2 = 3
The first Quartile (Q1) is 3.

Step 4: Find Q3, which is the median of the data to the right of the median.
There is an even number of data points to the right of the median, so we need the average of
the middle two data points.
4,4,6,6
Q3 = (4+6)/2 = 5
The Third Quartile (Q3) is 5.
Step 5: Calculate the IQR.
IQR = Q3 - Q1
= 5 – 3
= 2
The IQR is 2 points.
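The five steps can be sketched in Python as follows. The functions quartiles and iqr are hypothetical helpers that implement the median-of-halves method from the steps above; other quartile conventions, such as interpolation-based ones, can give slightly different values for the same data.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the
    ordered data. With an odd count, the overall median belongs to
    neither half."""
    if len(data) < 2:
        raise ValueError("need at least two data points")
    ordered = sorted(data)          # Step 1: order the data
    half = len(ordered) // 2        # Step 2: locate the median position
    lower = ordered[:half]          # values to the left of the median
    upper = ordered[-half:]         # values to the right of the median
    return median(lower), median(upper)  # Steps 3 and 4


def iqr(data):
    q1, q3 = quartiles(data)
    return q3 - q1                  # Step 5


scores = [1, 3, 3, 3, 4, 4, 4, 6, 6]
print(quartiles(scores))  # (3.0, 5.0)
print(iqr(scores))        # 2.0
```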
