Introduction to statistics.pptx

Probability and Statistics
Unit -1

Introduction to Statistics
Statistics:
• The word statistics has two meanings:
• In the most common usage – statistics refers to numerical facts
• The number that represents –
a) annual income
b) age
c) the percentage of students who scored grade A
d) the starting salary of a typical college graduate
• What will be other examples of statistics? ……………..

The following examples present some statistics:
• Approximately 30% of Google’s employees were female in July 2014
(USA TODAY, July 24, 2014).
• In 2013, author James Patterson earned $90 million from the sale of
his books (Forbes, September 29, 2014).
• As per the CBS report, the hotel and restaurant, manufacturing and
transportation sectors of Nepal will witness negative growth of 16.3
percent, 1.1 percent and 2.3 percent, respectively, in the current
fiscal year (The Himalayan Times, April 30, 2020).

• The second meaning of statistics refers to the field or
discipline of study.
• Statistics is the science of collecting, analyzing, presenting,
and interpreting data, as well as of making decisions based
on such analyses.
• A comprehensive definition given by Croxton and Cowden
is:
“Statistics may be defined as the collection, presentation,
analysis and interpretation of numerical data”

• Statistical methods help us make scientific and intelligent
decisions.
• Decisions made by using statistical methods are called
educated guesses.
• Decisions made without using statistical (or scientific)
methods are called pure guesses and, hence, may prove to
be unreliable.
• For example: …….

Applications:
Accounting: The number of individual accounts receivable is generally
large, and checking the validity of each one is time-consuming. Based on
sample data, auditors draw conclusions as to whether the accounts
receivable amount shown on the client’s balance sheet is acceptable.

Finance: Financial analysts use a variety of statistical information
and methods to guide their investment recommendations.
Economics: Economists use a variety of statistical information and
methods in forecasting, planning, and formulating economic policies,
for example price index numbers, unemployment rates, manufacturing
capacity utilization, human development indicators, and quality
control charts.

Fig1. the relation between population and sample
Basic Terms
Population or target population: The collection of all
elements/members whose characteristics are being studied.
For example:………………..
Sample: A portion/fraction of the population of interest.
For example: ……………

Goal of Sample:
Usually populations are so large that a researcher cannot examine
the entire group. Therefore, a sample is selected to represent the
population in a research study. The goal is to use the results
obtained from the sample to help answer questions about the
population.

Basic terms continued…..
Survey:
A survey is a research method used for collecting data from a
predefined group of respondents to gain information and insights
into various topics of interest.
Census:
The procedure of systematically calculating, acquiring, and
recording information about all the members of a given population.
Sample Survey:
The procedure of systematically calculating, acquiring, and
recording information from only a portion of a population of
interest.

• Variable
- A variable is a characteristic under study that assumes
different values for different elements.
- A variable is often denoted by letters x, y, or z
- The value of a variable for an element is called an
observation or measurement.
• Data
- collection of information/observations
- The goal of statistics is to help researchers organize and
interpret the data.

Types of Variables
• Some variables (such as the height of person, price of
groceries) can be measured numerically, whereas others (such
as occupation, income sources) cannot.
• Variables are classified into two types:
a) Quantitative Variable
b) Qualitative Variable

i) Quantitative Variable
• A variable that can be measured numerically is called a quantitative
variable.
• The data collected on a quantitative variable are called quantitative
data.
• Example: Number of workers: 23, 24, 25, 15, 19, 18
• Other examples:
- Annual Gross sale
- No. of accidents
- Weight of a laptop
- Temperature
- No. of gadgets owned

• As the above examples show, the values that a quantitative variable
can assume may be countable or noncountable.
• Quantitative variables may be classified into two categories:
a) Discrete Variable
b) Continuous Variable

A) Discrete Variable
• Variable whose values are countable.
• In other words, a discrete variable can assume only certain values
with no intermediate values.
• For example:
- No. of accidents
- The no. of daily admissions in a general hospital
- The no. of people who visit a bank on any day
- The no. of books in a library

B) Continuous Variable
• A variable that can assume any numerical value over a certain
interval or intervals is called a continuous variable.
• Example:
- Price of book: USD105.6
- Annual salary
- Body temperature
- Expenditure on food on any day
- The time it takes to complete a certain task

ii) Qualitative or Categorical Variable
• A variable that cannot assume a numerical value but can be
classified into two or more nonnumeric categories is called a
qualitative or categorical variable.
• The data collected on such a variable are called qualitative data.
• Examples:
- Gender of a person
- A person’s blood type
- Occupation
- Modes of transportation

Measuring Variables
• To establish relationships between variables, researchers must
observe the variables and record their observations. This
requires that the variables be measured.
• The process of measuring a variable requires a set of categories
called a scale of measurement and a process that classifies each
individual into one category.

Four Types of Measurement Scales
(listed from the highest level to the lowest)
• Ratio Data: differences between measurements, true zero exists.
Highest level (strongest form of measurement).
• Interval Data: differences between measurements but no true zero.
Higher level.
• Ordinal Data: ordered categories (rankings, order, or scaling).
Higher level.
• Nominal Data: categories (no ordering or direction).
Lowest level (weakest form of measurement).

Nominal data:
Categorical data and numbers that are simply used as identifiers or
names represent a nominal scale of measurement.
Examples: Gender: a) male b) female
Ordinal data:
An ordinal scale of measurement represents an ordered series of
relationships or rank order.
Individuals competing in a contest may be fortunate to achieve first,
second, or third place.
First, second, and third place represent ordinal data.
Examples: organizational chart, post, educational qualification

Interval data:
• A scale which represents quantity and has equal units but for which
zero represents simply an additional point of measurement is an
interval scale
• Examples: temperature (in °C or °F), pH, SAT score, IQ test score
Ratio data:
• The ratio scale of measurement is similar to the interval scale in that it
also represents quantity and has equality of units.
• However, this scale also has an absolute zero (no numbers exist below
the zero).
• Very often, physical measures will represent ratio data (for example,
height and weight).

Branches of Statistics
• Descriptive statistics are methods for organizing and
summarizing data.
• For example, tables or graphs are used to organize data, and
descriptive values such as the average score are used to
summarize data.
• A descriptive value for a population is called a parameter and a
descriptive value for a sample is called a statistic.

• Inferential statistics are methods for using sample data to make
general conclusions (inferences) about populations.
• Because a sample is typically only a part of the whole population,
sample data provide only limited information about the
population. As a result, sample statistics are generally imperfect
representatives of the corresponding population parameters.
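
The gap between a statistic and a parameter can be sketched in Python. This is only an illustration: the population below is simulated, and its size, the seed, and the sample size of 50 are assumptions, not values from the slides.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 simulated heights in cm (illustrative only).
population = [random.gauss(165, 10) for _ in range(10_000)]

# A simple random sample of 50 elements, drawn without replacement.
sample = random.sample(population, 50)

parameter = statistics.mean(population)  # descriptive value for the population
statistic = statistics.mean(sample)      # descriptive value for the sample
difference = statistic - parameter       # the statistic rarely equals the parameter

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
print(f"difference:                  {difference:.2f}")
```

The sample mean lands near the population mean but generally not on it, which is why inferential methods are needed to say how far off it may be.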

• A descriptive study may be performed either on a sample or on a
population. Only when an inference is made about the population,
based on information obtained from the sample, does the study
become inferential.
• Descriptive statistics and inferential statistics are interrelated. You
must almost always use techniques of descriptive statistics to
organize and summarize the information obtained from a sample
before carrying out an inferential analysis.
• Furthermore, as you will see, the preliminary descriptive analysis of a
sample often reveals features that lead you to the choice of the
appropriate inferential method.
Things to remember….

Describing Data with Numerical Measures
a) Measure of central tendency and location
b) Measure of Variability

Topics:
• Compute and interpret the mean, median, and mode for a set of data
• Compute the range, variance, and standard deviation and know what
these values mean
• Construct and interpret a box and whiskers plot
• Compute and explain the coefficient of variation
• Use numerical measures along with graphs, charts, and tables to
interpret data

Summary Measures
Describing Data Numerically
• Center and Location: Mean, Median, Mode, Weighted Mean
• Other Measures of Location: Percentiles, Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation,
Coefficient of Variation

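Measures of Center and Location
Overview
• Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (sample), $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ (population)
• Median
• Mode
• Weighted Mean: $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i}$ (sample), $\mu_W = \frac{\sum w_i x_i}{\sum w_i}$ (population)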

Measures of Center for Ungrouped and Grouped Data
a) Mean

b) Median
• In an ordered array, the median is the “middle” number
• If n or N is odd, the median is the middle number
• If n or N is even, the median is the average of the two middle
numbers
• The advantage of using the median as a measure of central tendency is
that it is not influenced by outliers.
• When outliers exist, use median instead of mean as a measure of
central tendency.
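
A minimal Python sketch of the odd/even rule and of the effect of an outlier on the mean and the median. The data values are hypothetical, chosen only for illustration.

```python
import statistics

def median(values):
    """Median of ungrouped data: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data (not from the slides).
odd_data = [3, 1, 4, 1, 5]         # n = 5 -> middle value of 1 1 3 4 5
even_data = [3, 1, 4, 1, 5, 9]     # n = 6 -> average of 3 and 4 in 1 1 3 4 5 9
with_outlier = [3, 1, 4, 1, 500]   # odd_data with 5 replaced by an outlier

print(median(odd_data))    # 3
print(median(even_data))   # 3.5
print(statistics.mean(odd_data), median(odd_data))          # 2.8 3
print(statistics.mean(with_outlier), median(with_outlier))  # 101.8 3
```

The outlier moves the mean from 2.8 to 101.8 while the median stays at 3.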

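The median is the value of the middle term in a data set
that has been ranked in increasing order.
$\text{Median} = \text{value of the } \left(\frac{n+1}{2}\right)\text{th term in the ranked data set}$
Example: 173,175 49,723 20,352 10,824 40,911 18,038 61,848
Find the median for these data.
Ranked: 10,824 18,038 20,352 40,911 49,723 61,848 173,175
n = 7, so the median is the value of the (7 + 1)/2 = 4th term: 40,911
When n is even, the median is the average of the two middle values.
For example, with middle values 28.0 and 28.2:
$\text{Median} = \frac{28.0 + 28.2}{2} = \frac{56.2}{2} = 28.1 = \$28.1 \text{ million}$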

Calculating median for grouped data
$\text{Median} = l + \frac{\frac{n}{2} - cf}{f} \times h$
Where l = lower limit of the median class
n/2 = median position
cf = cumulative frequency of the class preceding the median class
f = frequency of the median class
h = class width of the median class
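
The grouped-data formula can be sketched in Python as follows. The frequency table is hypothetical, and the function assumes continuous class limits listed in increasing order.

```python
def grouped_median(classes):
    """Median for grouped data: l + ((n/2 - cf) / f) * h.

    `classes` is a list of (lower_limit, upper_limit, frequency) tuples,
    sorted by lower limit, with continuous class limits.
    """
    n = sum(freq for _, _, freq in classes)
    position = n / 2   # median position
    cf = 0             # cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        # The first class whose cumulative frequency reaches n/2 is the median class.
        if freq > 0 and cf + freq >= position:
            h = upper - lower
            return lower + (position - cf) / freq * h
        cf += freq
    raise ValueError("no observations")

# Hypothetical frequency table (not from the slides).
table = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]
print(grouped_median(table))   # 20 + (20 - 13) / 12 * 10, about 25.83
```

Here n = 40, so the median position is 20; the median class is 20 to 30 with cf = 13 and f = 12.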

c) Mode
Mode for ungrouped data
• The mode is the value that occurs with the highest frequency in a data
set.
• Example: …….
• Advantage:
- Can be used for both Qualitative and Quantitative data, whereas
the mean and median can be calculated for only quantitative data
- Not affected by outliers
• Disadvantage: (dependent on the nature of data set)
- There may be no mode
- There may be several modes
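
A short Python sketch of these cases, using `statistics.multimode` from the standard library. The data sets are hypothetical, and treating "every value occurs once" as "no mode" is a convention assumed here.

```python
from statistics import multimode

def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    if len(set(values)) == len(values):   # no value repeats -> no mode
        return []
    return multimode(values)              # all values tied for the highest frequency

# Hypothetical data sets (not from the slides).
shirt_sizes = ["M", "L", "M", "S", "M", "L"]   # qualitative data, one mode
scores = [70, 85, 85, 90, 70]                  # two modes
all_distinct = [1, 2, 3, 4]                    # no mode

print(modes(shirt_sizes))    # ['M']
print(modes(scores))         # [70, 85]
print(modes(all_distinct))   # []
```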

Calculating mode for grouped data

d) Weighted Mean
• The weighted mean is an average computed by giving different weights to some of
the individual values. If all the weights are equal, then the weighted mean is
the same as the arithmetic mean.
• Like the arithmetic (simple) mean, it represents the average of the data. It is
used when the values do not all contribute equally, for example when each value
comes with a frequency or an importance weight.
• For a set of non-negative data x1, x2, x3, …, xn with non-negative weights
w1, w2, w3, …, wn, the weighted mean is given by
$\bar{X}_w = \frac{w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n}{w_1 + w_2 + w_3 + \dots + w_n} = \frac{\sum wx}{\sum w}$
where w = given weight

Example: Sample of 26 Repair Projects
Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2
Weighted Mean Days to Complete:
$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days}$
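
The repair-project calculation can be checked with a few lines of Python. The function name `weighted_mean` is an illustrative choice; the data are the days and frequencies from the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

days = [5, 6, 7, 8]          # days to complete (x_i)
frequency = [4, 12, 8, 2]    # number of projects (w_i)

print(round(weighted_mean(days, frequency), 2))   # 6.31
print(weighted_mean(days, [1, 1, 1, 1]))          # 6.5, equal weights give the arithmetic mean
```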

















Which measure of location is the “best”?
• Mean is generally used, unless extreme values (outliers)
exist
• Then median is often used, since the median is not
sensitive to extreme values.

Relationships Among the Mean, Median, and Mode

Partition values
• The values of the variate that divide the total number of observations into a number of
equal parts are known as partition values.
• When the values of the variate are arranged in ascending or descending order of magnitude,
the median is the value of the variate that divides the total frequency into two equal
parts.
• Similarly, the given series can be divided into four, ten, or a hundred equal parts.
• Quartile:
The values of the variate that divide the total frequency into four equal parts are
called quartiles. There are three quartiles: the first quartile (Q1), the second quartile
(Q2), and the third quartile (Q3).
• Decile:
Deciles are the values that divide a set of observations into ten equal parts.
There are therefore nine deciles, denoted D1, D2, D3, D4, …, D9.
• Percentile:
Percentiles divide a set of observations into 100 equal parts. The percentiles
(or centiles) are denoted P1, P2, P3, P4, …, P99.

Percentiles
• The pth percentile in an ordered array of n values is
the value in the ith position, where
$i = \frac{p}{100}(n + 1)$
• Example: The 60th percentile in an ordered array of 19
values is the value in the 12th position:
$i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$
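
A Python sketch of the position rule. The slides only cover the case where i is a whole number; the linear interpolation used below when i has a fractional part is an added assumption, and the data (the integers 1 to 19) are hypothetical.

```python
def percentile_position(p, n):
    """Position i = (p / 100) * (n + 1) of the pth percentile among n ordered values."""
    return p * (n + 1) / 100

def percentile(values, p):
    """pth percentile by the (n + 1) position rule.

    Assumption: when i is not a whole number, interpolate linearly
    between the two neighbouring ordered values.
    """
    ordered = sorted(values)
    n = len(ordered)
    i = percentile_position(p, n)
    if i < 1 or i > n:
        raise ValueError("percentile position falls outside the data")
    lower = int(i)          # whole part of the 1-based position
    fraction = i - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = list(range(1, 20))            # 19 hypothetical ordered values: 1, 2, ..., 19
print(percentile_position(60, 19))   # 12.0
print(percentile(data, 60))          # 12, the value in the 12th position
```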





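Calculation of Partition value:
• Quartile: $Q_i = L + \frac{\frac{in}{4} - c.f.}{f} \times h$, where i = 1, 2, 3
• Decile: $D_i = L + \frac{\frac{in}{10} - c.f.}{f} \times h$, where i = 1, 2, 3, …, 9
• Percentile: $P_i = L + \frac{\frac{in}{100} - c.f.}{f} \times h$, where i = 1, 2, 3, 4, …, 99
Here L, c.f., f, and h are defined as for the grouped median, applied to the
class that contains the partition value.
Note: Median = $Q_2 = D_5 = P_{50}$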
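
The three formulas differ only in the divisor (4, 10, or 100), so one Python function can cover them. This is a sketch under stated assumptions: the frequency table is hypothetical, and class limits are continuous and sorted.

```python
def partition_value(classes, i, parts):
    """i-th partition value for grouped data: L + ((i*n/parts - cf) / f) * h.

    parts = 4 for quartiles, 10 for deciles, 100 for percentiles.
    `classes` is a list of (lower_limit, upper_limit, frequency) tuples,
    sorted by lower limit, with continuous class limits.
    """
    if not 0 < i < parts:
        raise ValueError("i must satisfy 0 < i < parts")
    n = sum(freq for _, _, freq in classes)
    position = i * n / parts
    cf = 0   # cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        if freq > 0 and cf + freq >= position:
            return lower + (position - cf) / freq * (upper - lower)
        cf += freq
    raise ValueError("no observations")

# Hypothetical frequency table (not from the slides).
table = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]
q1 = partition_value(table, 1, 4)       # 16.25
q2 = partition_value(table, 2, 4)       # about 25.83 (the median)
q3 = partition_value(table, 3, 4)       # 35.0
d5 = partition_value(table, 5, 10)      # same as q2
p50 = partition_value(table, 50, 100)   # same as q2
print(q1, q2, q3, d5, p50)
```

The last three values illustrate the note Median = Q2 = D5 = P50.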

Interquartile Range
• Can eliminate some outlier problems by using the
interquartile range
• Eliminate some high- and low-valued observations and
calculate the range from the remaining values.
• Interquartile range = 3rd quartile – 1st quartile

Interquartile Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the observations lie between each pair of adjacent values)
Interquartile range = 57 – 30 = 27

Box and Whisker Plot
• A Graphical display of data using 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
Example (figure): a box and whisker plot marking the Minimum, 1st Quartile,
Median, 3rd Quartile, and Maximum, with 25% of the data in each of the four
segments.

Features of Box and Whisker plot:
- Gives a graphic presentation of data using five measures: the median, the
first quartile, the third quartile, and the smallest and the largest values in
the data set between the lower and the upper inner fences.
- Can help visualize the center, the spread, and the skewness of a data set.
- It also helps detect outliers.
- The summary values are tied to actual data points, are quickly computable
(originally by hand), and have no tuning parameters. Boxplots are
particularly useful for comparing distributions across groups.
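
The five-number summary and the inner fences can be sketched in Python. Two assumptions are made here that the slides do not state: quartiles are computed with `statistics.quantiles` using its "exclusive" method, and the fences sit 1.5 × IQR beyond Q1 and Q3 (the usual convention). The data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

def inner_fences(q1, q3, k=1.5):
    """Lower and upper inner fences; k = 1.5 is an assumed convention."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical data (not from the slides), with one unusually large value.
data = [12, 15, 17, 20, 22, 25, 28, 30, 33, 36, 95]

minimum, q1, med, q3, maximum = five_number_summary(data)
low, high = inner_fences(q1, q3)
outliers = [x for x in data if x < low or x > high]

print(minimum, q1, med, q3, maximum)   # 12 17.0 25.0 33.0 95
print(low, high)                       # -7.0 57.0
print(outliers)                        # [95]
```

Values outside the fences are the ones a boxplot flags as outliers; the whiskers then extend only to the smallest and largest values inside the fences.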

Shape of Box and Whisker Plot
• Symmetric
• Right Skewed
• Left Skewed

Why Use a Boxplot?
• A boxplot provides an alternative to a histogram, a dot plot, and a stem-and-
leaf plot. Among the advantages of a boxplot over a histogram are ease of
construction and convenient handling of outliers. In addition, the
construction of a boxplot does not involve subjective judgements, as does a
histogram. That is, two individuals will construct the same boxplot for a
given set of data - which is not necessarily true of a histogram, because the
number of classes and the class endpoints must be chosen. On the other
hand, the boxplot lacks the details the histogram provides.
• Dot plots and stem plots retain the identity of the individual observations; a
boxplot does not. Many sets of data are more suitable for display as
boxplots than as a stem plot. A boxplot as well as a stem plot are useful for
making side-by-side comparisons.

Measures of Variation
• Range
• Interquartile Range
• Variance: Population Variance, Sample Variance
• Standard Deviation: Population Standard Deviation, Sample Standard Deviation
• Coefficient of Variation

Variation
• Measures of variation give information on the
spread or variability of the data values.
Same center,
different variation

Measures of Dispersion for Grouped and Ungrouped Data
Range
• Range = Largest value – smallest value
Example: Range = 267,277 – 49,651 = 217,626 square miles

Disadvantages of the Range
• Ignores the way in which data are distributed
Two data sets distributed differently over the values 7 8 9 10 11 12
both have Range = 12 - 7 = 5
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
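
The outlier example can be reproduced in Python. The two data sets are the ones listed above; the median is added only for contrast.

```python
import statistics

def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

common = [1] * 11 + [2] * 8 + [3] * 4 + [4]   # the 24 values shared by both data sets
without_outlier = common + [5]
with_outlier = common + [120]

print(value_range(without_outlier))   # 4
print(value_range(with_outlier))      # 119
print(statistics.median(without_outlier), statistics.median(with_outlier))   # 2 2
```

Changing a single value moves the range from 4 to 119, while the median stays at 2.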

Variance
• Average of squared deviations of values from the
mean (individual series)
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$





Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Sample standard deviation (ungrouped data): $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$
• Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
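
A Python sketch of the sample and population formulas for ungrouped data. The data values are hypothetical; the only difference between the two versions is the divisor (n - 1 versus N).

```python
import math

def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or N (population)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)

def std_dev(values, sample=True):
    """Standard deviation: square root of the variance, in the units of the data."""
    return math.sqrt(variance(values, sample))

data = [7, 8, 9, 10, 11, 12]   # hypothetical values, mean 9.5

print(variance(data))                      # 3.5   (17.5 / 5)
print(round(variance(data, False), 4))     # 2.9167 (17.5 / 6)
print(round(std_dev(data), 4))             # 1.8708
```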





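Standard deviation for grouped data
• Population standard deviation: $\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$
• Sample standard deviation: $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$
For grouped data, the standard deviation is computed by using the
following relationships, where x is the value (or class midpoint) and
f its frequency:
$\sigma = \sqrt{\frac{\sum fx^2}{N} - \frac{(\sum fx)^2}{N^2}}$
$s = \sqrt{\frac{\sum fx^2}{n - 1} - \frac{(\sum fx)^2}{n(n - 1)}}$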
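
The computational formulas can be sketched in Python. As an illustration, the sketch reuses the days-to-complete frequencies from the repair-project example (x = 5, 6, 7, 8 with f = 4, 12, 8, 2); applying them here is an assumption made for the example, not a calculation from the slides.

```python
import math

def grouped_std_dev(values, freqs, sample=True):
    """Standard deviation from a frequency table using the sums of fx and fx^2."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    if sample:
        var = sum_fx2 / (n - 1) - sum_fx ** 2 / (n * (n - 1))
    else:
        var = sum_fx2 / n - (sum_fx / n) ** 2
    return math.sqrt(max(var, 0.0))   # guard against tiny negative rounding errors

days = [5, 6, 7, 8]          # values (or class midpoints) x
frequency = [4, 12, 8, 2]    # frequencies f; n = 26, sum fx = 164, sum fx^2 = 1052

print(round(grouped_std_dev(days, frequency), 4))          # 0.8376 (sample)
print(round(grouped_std_dev(days, frequency, False), 4))   # 0.8213 (population)
```

Both results agree with the definition-based formulas applied to the 26 individual values.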

Comparing Standard Deviations
(figure: three data sets plotted on the same scale, 11 to 21)
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57

Coefficient of Variation (CV)
• The C.V. is the most widely used relative measure of dispersion for comparing
two or more distributions.
• When comparing two or more distributions, the lower the C.V., the more
homogeneous (or consistent, uniform, regular, stable) the distribution.
• The C.V. is used to compare two or more distributions with respect to their
variability, consistency, uniformity, homogeneity, equitability, stability, etc.

Coefficient of Variation (CV)
Note: A low CV indicates that there is a low variation in the data set
and hence, a higher consistency.
$CV = \frac{\sigma}{\mu} \times 100\%$ (population)
$CV = \frac{s}{\bar{x}} \times 100\%$ (sample)

• E.g. 1. Consider the distribution of the yields (per plot) of two paddy varieties;
the information is given below:
              Variety I    Variety II
Mean (kg)     60           50
S.D. (kg)     10           9
C.V. for Variety I = $\frac{10}{60} \times 100 = 16.7\%$ (less variability, more consistent)
C.V. for Variety II = $\frac{9}{50} \times 100 = 18.0\%$
* In terms of S.D. alone, the interpretation would be reversed: Variety II has the
smaller S.D.
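
The paddy comparison in Python, using the means and standard deviations from the table. The function name is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (standard deviation / mean) * 100, expressed in percent."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100

cv_variety_1 = coefficient_of_variation(10, 60)
cv_variety_2 = coefficient_of_variation(9, 50)

print(round(cv_variety_1, 1))   # 16.7
print(round(cv_variety_2, 1))   # 18.0

# The lower CV marks the more consistent variety.
more_consistent = "Variety I" if cv_variety_1 < cv_variety_2 else "Variety II"
print(more_consistent)          # Variety I
```

Variety I has the larger S.D. (10 versus 9) but the smaller CV, because its spread is smaller relative to its mean.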

The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
• $\mu \pm 1\sigma$ contains about 68% of the values in
the population or the sample
• $\mu \pm 2\sigma$ contains about 95% of the values in
the population or the sample
• $\mu \pm 3\sigma$ contains about 99.7% of the values
in the population or the sample
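
The three percentages can be checked in Python under the assumption that "bell-shaped" means exactly normal. For a normal distribution, the share of values within k standard deviations of the mean is erf(k / √2), available through `math.erf`.

```python
import math

def normal_share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

Real data that are only roughly bell-shaped will match these figures approximately, which is why the rule says "about".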

Tchebysheff’s Theorem
• Regardless of how the data are distributed, at
least $(1 - 1/k^2)$ of the values will fall within k
standard deviations of the mean
• Examples:
At least $(1 - 1/1^2) = 0\%$ of the values fall within μ ± 1σ (k = 1)
At least $(1 - 1/2^2) = 75\%$ of the values fall within μ ± 2σ (k = 2)
At least $(1 - 1/3^2) = 89\%$ of the values fall within μ ± 3σ (k = 3)
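
A Python sketch that compares the bound with an actual, strongly skewed data set. The data are the 25 values with the outlier 120 from the range example; using them here, and treating them as a population, are assumptions made for illustration.

```python
import statistics

def chebyshev_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

def share_within(values, k):
    """Observed share of values within k population standard deviations of the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)

skewed = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]   # 25 values, one extreme outlier

for k in (1, 2, 3):
    print(k, round(chebyshev_bound(k), 3), share_within(skewed, k))
# 1 0.0 0.96
# 2 0.75 0.96
# 3 0.889 0.96
```

Even for data this far from bell-shaped, the observed share (96%) never falls below the guaranteed minimum.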
