Probability and Statistics
Unit -1
Introduction to Statistics
Statistics:
• The word statistics has two meanings:
• In the most common usage – statistics refers to numerical facts
• The number that represents –
a) annual income
b) age
c) the percentage of students who scored grade A
d) the starting salary of a typical college graduate
• What will be other examples of statistics? ……………..
The following examples present some statistics:
• Approximately 30% of Google’s employees were female in July 2014
(USA TODAY, July 24, 2014).
• In 2013, author James Patterson earned $90 million from the sale of
his books (Forbes, September 29, 2014).
• As per the CBS report, the hotel and restaurant, manufacturing and
transportation sectors of Nepal will witness negative growth of 16.3
percent, 1.1 percent and 2.3 percent, respectively, in the current
fiscal year (The Himalayan Times, April 30, 2020).
• The second meaning of statistics refers to the field or
discipline of study.
• Statistics is the science of collecting, analyzing, presenting,
and interpreting data, as well as of making decisions based
on such analyses.
• A comprehensive definition given by Croxton and Cowden
is:
“Statistics may be defined as the collection, presentation,
analysis and interpretation of numerical data”
• Statistical methods help us make scientific and intelligent
decisions.
• Decisions made by using statistical methods are called
educated guesses.
• Decisions made without using statistical (or scientific)
methods are called pure guesses and, hence, may prove to
be unreliable.
• For example: …….
Applications:
Accounting: The number of individual accounts receivable is generally
large, and checking the validity of each one is time-consuming. Based on
sample data, auditors draw conclusions as to whether the accounts
receivable amount shown on the client’s balance sheet is acceptable.
Finance: Financial analysts use a variety of statistical information
and methods to guide their investment recommendations.
Economics: Economists use a variety of statistical information and
methods (price index numbers, unemployment rates, manufacturing
capacity utilization, human development indices, quality control
charts, etc.) in forecasting, planning, and formulating economic policies.
Fig. 1. The relation between population and sample
Basic Terms
Population or target population: The collection of all
elements/members whose characteristics are being studied.
For example:………………..
Sample: A portion/fraction of the population of interest.
For example: ……………
Goal of Sample:
Usually populations are so large that a researcher cannot examine
the entire group. Therefore, a sample is selected to represent the
population in a research study. The goal is to use the results
obtained from the sample to help answer questions about the
population.
Introduction to statistics.pptx
Basic terms continued…..
Survey:
A survey is a research method used for collecting data from a
predefined group of respondents to gain information and insights
into various topics of interest.
Census:
The procedure of systematically calculating, acquiring, and
recording information about all the members of a given population.
Sample Survey:
The procedure of systematically calculating, acquiring, and
recording information from only a portion of a population of
interest.
• Variable
- A variable is a characteristic under study that assumes
different values for different elements.
- A variable is often denoted by letters x, y, or z
- The value of a variable for an element is called an
observation or measurement.
• Data
- collection of information/observations
- The goal of statistics is to help researchers organize and
interpret the data.
Types of Variables
• Some variables (such as the height of person, price of
groceries) can be measured numerically, whereas others (such
as occupation, income sources) cannot.
• Variables are classified into two types:
a) Quantitative Variable
b) Qualitative Variable
i) Quantitative Variable
• A variable that can be measured numerically is called a quantitative
variable.
• The data collected on a quantitative variable are called quantitative
data.
• Example: Number of workers: 23, 24, 25, 15, 19, 18
• Other examples:
- Annual Gross sale
- No. of accidents
- Weight of a laptop
- Temperature
- No. of gadgets owned
• As the above examples show, the values that a quantitative variable
can assume may be countable or noncountable
• Quantitative variables may be classified into two categories
a) Discrete Variable
b) Continuous Variable
A) Discrete Variable
• Variable whose values are countable.
• In other words, a discrete variable can assume only certain values
with no intermediate values.
• For example:
- No. of accidents
- The no. of daily admissions in a general hospital
- The no. of people who visit a bank on any day
- The no. of books in a library
B) Continuous Variable
• A variable that can assume any numerical value over a certain
interval or intervals is called a continuous variable.
• Example:
- Price of a book: USD 105.6
- Annual salary
- Body temperature
- Expenditure on food on any day
- The time it takes to complete a certain task
ii) Qualitative or Categorical Variable
• A variable that cannot assume a numerical value but can be
classified into two or more nonnumeric categories is called a
qualitative or categorical variable.
• The data collected on such a variable are called qualitative data.
• Examples:
- Gender of a person
- A person’s blood type
- Occupation
- Modes of transportation
Measuring Variables
• To establish relationships between variables, researchers must
observe the variables and record their observations. This
requires that the variables be measured.
• The process of measuring a variable requires a set of categories
called a scale of measurement and a process that classifies each
individual into one category.
Four Types of Measurement Scales (from lowest to highest level)
• Nominal Data: Categories (no ordering or direction). Lowest level
(weakest form of measurement).
• Ordinal Data: Ordered categories (rankings, order, or scaling).
Higher level.
• Interval Data: Differences between measurements but no true zero.
Higher level.
• Ratio Data: Differences between measurements, true zero exists.
Highest level (strongest form of measurement).
Nominal data:
Categorical data and numbers that are simply used as identifiers or
names represent a nominal scale of measurement.
Examples: Gender: a) male b) female
Ordinal data:
An ordinal scale of measurement represents an ordered series of
relationships or rank order.
Individuals competing in a contest may be fortunate to achieve first,
second, or third place.
First, second, and third place represent ordinal data.
Examples: organizational chart, post, educational qualification
Interval data:
• A scale which represents quantity and has equal units but for which
zero represents simply an additional point of measurement is an
interval scale
• Example: Temperature (in °C or °F), pH, SAT score, IQ test score
Ratio data:
• The ratio scale of measurement is similar to the interval scale in that it
also represents quantity and has equality of units.
• However, this scale also has an absolute zero (no numbers exist below
the zero).
• Very often, physical measures will represent ratio data (for example,
height and weight).
Example: Scale of measurement
Branches of Statistics
• Descriptive statistics are methods for organizing and
summarizing data.
• For example, tables or graphs are used to organize data, and
descriptive values such as the average score are used to
summarize data.
• A descriptive value for a population is called a parameter and a
descriptive value for a sample is called a statistic.
• Inferential statistics are methods for using sample data to make
general conclusions (inferences) about populations.
• Because a sample is typically only a part of the whole population,
sample data provide only limited information about the
population. As a result, sample statistics are generally imperfect
representatives of the corresponding population parameters.
• A descriptive study may be performed either on a sample or on a
population. Only when an inference is made about the population,
based on information obtained from the sample, does the study
become inferential.
• Descriptive statistics and inferential statistics are interrelated. You
must almost always use techniques of descriptive statistics to
organize and summarize the information obtained from a sample
before carrying out an inferential analysis.
• Furthermore, as you will see, the preliminary descriptive analysis of a
sample often reveals features that lead you to the choice of the
appropriate inferential method.
Things to remember….
Describing Data with Numerical Measures
a) Measure of central tendency and location
b) Measure of Variability
Topics:
• Compute and interpret the mean, median, and mode for a set of data
• Compute the range, variance, and standard deviation and know what
these values mean
• Construct and interpret a box and whiskers plot
• Compute and explain the coefficient of variation
• Use numerical measures along with graphs, charts, and tables to
interpret data
Summary Measures (Describing Data Numerically)
• Center and Location: Mean, Median, Mode, Weighted Mean
• Other Measures of Location: Percentiles, Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation,
Coefficient of Variation
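Measures of Center and Location: Overview
• Mean (sample): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Mean (population): $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Median
• Mode
• Weighted Mean (sample): $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i}$
• Weighted Mean (population): $\mu_W = \frac{\sum w_i x_i}{\sum w_i}$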
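The sample mean formula can be sketched in a few lines of Python. This is only an illustration; it reuses the "Number of workers" values from the quantitative-data example earlier in these notes, and the variable names are assumptions.

```python
# Number of workers, taken from the quantitative-data example in these notes.
workers = [23, 24, 25, 15, 19, 18]

n = len(workers)              # sample size n
x_bar = sum(workers) / n      # sum of the x_i divided by n

print(n, sum(workers), round(x_bar, 2))   # 6 124 20.67
```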
Measures of Center for Ungrouped and Grouped Data
a) Mean
b) Median
• In an ordered array, the median is the “middle” number
• If n or N is odd, the median is the middle number
• If n or N is even, the median is the average of the two middle
numbers
• The advantage of using the median as a measure of central tendency is
that it is not influenced by outliers.
• When outliers exist, use median instead of mean as a measure of
central tendency.
The median is the value of the middle term in a data set
that has been ranked in increasing order.
173,175 49,723 20,352 10,824 40,911 18,038 61,848
Find the median for these data.
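As a quick check of this exercise, a minimal Python sketch that ranks the seven values and picks the middle term. The variable names are illustrative, and the standard-library `statistics.median` is used only to confirm the hand calculation.

```python
from statistics import median

# The seven values listed above (commas in the slide are thousands separators).
values = [173_175, 49_723, 20_352, 10_824, 40_911, 18_038, 61_848]

ranked = sorted(values)               # rank in increasing order
n = len(ranked)
position = (n + 1) // 2               # n is odd, so the middle term is the ((n + 1) / 2)th
middle_value = ranked[position - 1]   # convert the 1-based position to a 0-based index

print(position, middle_value)         # 4 40911
assert middle_value == median(values)
```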
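Median = value of the $\left(\frac{n+1}{2}\right)$th term in a ranked data set
Example (even number of values, with 28.0 and 28.2 as the two middle terms):
Median = $\frac{28.0 + 28.2}{2} = \frac{56.2}{2} = 28.1$
So the median is $28.1 million.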
Calculating median for grouped data
Median = $l + \frac{\frac{n}{2} - cf}{f} \times h$
Where l = lower limit of the median class
n/2 = median position
cf = cumulative frequency of the class preceding the median class
f = frequency of the median class
h = class width of the median class
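The grouped-data formula can be sketched as follows. The frequency table is hypothetical (it does not come from the slides), the function name `grouped_median` is an assumption, and equal class widths are assumed.

```python
def grouped_median(lower_limits, width, freqs):
    """Median = l + ((n/2 - cf) / f) * h, assuming equal-width classes."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("frequencies must sum to a positive number")
    target = n / 2                    # median position n/2
    cf = 0                            # cumulative frequency before the current class
    for l, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= target:
            # First class whose cumulative frequency reaches n/2: the median class.
            return l + (target - cf) / f * width
        cf += f

# Hypothetical frequency table: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

m = grouped_median(lower_limits, 10, freqs)
print(round(m, 2))   # 25.83
```

Here n = 40, so n/2 = 20 falls in the class 20-30 (l = 20, cf = 13, f = 12, h = 10).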
c) Mode
Mode for ungrouped data
• The mode is the value that occurs with the highest frequency in a data
set.
• Example: …….
• Advantage:
- Can be used for both Qualitative and Quantitative data, whereas
the mean and median can be calculated for only quantitative data
- Not affected by outliers
• Disadvantage: (dependent on the nature of data set)
- There may be no mode
- There may be several modes
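A small sketch of these points using Python's standard-library `statistics.multimode`, which returns every value tied for the highest frequency. The three data sets are hypothetical.

```python
from statistics import multimode

blood_types = ["A", "O", "B", "O", "AB", "A", "O"]   # qualitative data
scores = [70, 85, 85, 90, 70, 60]                    # two values tie for the highest frequency
distinct = [3, 5, 8]                                 # every value occurs exactly once

modes_blood = multimode(blood_types)
modes_scores = multimode(scores)
modes_distinct = multimode(distinct)

print(modes_blood)      # ['O']
print(modes_scores)     # [70, 85]
print(modes_distinct)   # [3, 5, 8]
```

When every value occurs once, `multimode` returns all of them; by the usual convention such a data set is said to have no mode.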
Calculating mode for grouped data
d) Weighted Mean
• The weighted mean is an average computed by giving different weights to
the individual values. If all the weights are equal, the weighted mean is
the same as the arithmetic mean.
• Like the arithmetic (simple) mean, it represents the average of the data,
but it is used when the values do not all carry the same importance, for
example when each value comes with a frequency.
• For a set of non-negative data x1, x2, x3,… xn with non-negative weights
w1, w2, w3,… wn, the weighted mean is given by
$\bar{X}_w = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n}{w_1 + w_2 + w_3 + \dots + w_n} = \frac{\sum wx}{\sum w}$
where w = given weight
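A minimal sketch of the weighted mean, using the repair-project figures from the example in this section (days to complete as the values, frequencies as the weights). Variable names are illustrative.

```python
# Repair-project example: days to complete (x) and frequency (w).
days = [5, 6, 7, 8]
freq = [4, 12, 8, 2]

weighted_sum = sum(w * x for w, x in zip(freq, days))   # sum of w_i * x_i
total_weight = sum(freq)                                # sum of w_i
weighted_mean = weighted_sum / total_weight

print(weighted_sum, total_weight, round(weighted_mean, 2))   # 164 26 6.31
```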
Example: Sample of 26 Repair Projects
Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2
Weighted Mean Days to Complete:
$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
Which measure of location is the “best”?
• Mean is generally used, unless extreme values (outliers)
exist
• Then median is often used, since the median is not
sensitive to extreme values.
Relationships Among the Mean, Median, and Mode
Partition values
• The values of the variate that divide the total number of observations into a number of equal
parts are known as partition values.
• If the values of the variate are arranged in ascending or descending order of magnitude,
the median is the value of the variate that divides the total frequency into two equal
parts.
• Similarly, the given series can be divided into four, ten, or one hundred equal parts.
• Quartile:
The values of the variate that divide the total frequency into four equal parts are
called quartiles. There are three quartiles: the first quartile (Q1), the second quartile
(Q2), and the third quartile (Q3).
• Decile:
Deciles are the values that divide a set of observations into ten equal parts.
There are therefore nine deciles, denoted D1, D2, D3, D4, ……… D9.
• Percentile:
Percentiles divide a set of observations into 100 equal parts. The
percentiles, or centiles, are denoted P1, P2, P3, P4, ……… P99.
Percentiles
• The pth percentile in an ordered array of n values is
the value in the ith position, where
$i = \frac{p}{100}(n + 1)$
• Example: The 60th percentile in an ordered array of 19
values is the value in the 12th position:
$i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$
Calculation of Partition Values:
• Quartile: $Q_i = L + \frac{\frac{in}{4} - c.f.}{f} \times h$, where i = 1, 2, 3
• Decile: $D_i = L + \frac{\frac{in}{10} - c.f.}{f} \times h$, where i = 1, 2, 3, …, 9
• Percentile: $P_i = L + \frac{\frac{in}{100} - c.f.}{f} \times h$, where i = 1, 2, 3, 4, ……, 99
Note: Median = $Q_2 = D_5 = P_{50}$
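The three formulas differ only in the divisor (4, 10, or 100), so they can be sketched with one function. The frequency table is hypothetical, the name `partition_value` is an assumption, and equal class widths are assumed.

```python
def partition_value(lower_limits, width, freqs, i, parts):
    """L + ((i*n/parts - c.f.) / f) * h, assuming equal-width classes.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    """
    if not 0 < i < parts:
        raise ValueError("i must lie strictly between 0 and parts")
    n = sum(freqs)
    if n <= 0:
        raise ValueError("frequencies must sum to a positive number")
    target = i * n / parts            # position i*n/parts
    cf = 0                            # cumulative frequency before the current class
    for L, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= target:
            return L + (target - cf) / f * width
        cf += f

# Hypothetical frequency table: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

q1 = partition_value(lower_limits, 10, freqs, 1, 4)
q2 = partition_value(lower_limits, 10, freqs, 2, 4)
q3 = partition_value(lower_limits, 10, freqs, 3, 4)
d5 = partition_value(lower_limits, 10, freqs, 5, 10)
p50 = partition_value(lower_limits, 10, freqs, 50, 100)

print(q1, round(q2, 2), q3)   # 16.25 25.83 35.0
print(q2 == d5 == p50)        # True, matching Median = Q2 = D5 = P50
```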
Interquartile Range
• Can eliminate some outlier problems by using the
interquartile range
• Eliminate some high-and low-valued observations and
calculate the range from the remaining values.
• Interquartile range = 3rd quartile – 1st quartile
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the observations lie between each pair of consecutive values)
Interquartile range
= 57 – 30 = 27
Box and Whisker Plot
• A Graphical display of data using 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
Example: [Box plot marking the Minimum, 1st Quartile, Median, 3rd Quartile,
and Maximum; each of the four segments holds 25% of the data]
Features of Box and Whisker plot:
- Gives a graphic presentation of data using five measures: the median, the
first quartile, the third quartile, and the smallest and the largest values in
the data set between the lower and the upper inner fences.
- Can help visualize the center, the spread, and the skewness of a data set.
- It also helps detect outliers.
- Its summary values are always located at actual data points, are quickly
computable (originally by hand), and have no tuning parameters. Box plots
are particularly useful for comparing distributions across groups.
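The five-number summary behind a box plot can be sketched with the standard library. The data set is hypothetical, chosen so that the summary matches the 12, 30, 45, 57, 70 example above. The 1.5 × IQR rule for the inner fences is a common convention and an assumption here, since the slides do not state how the fences are computed.

```python
from statistics import quantiles

# Hypothetical data, chosen so that the summary matches 12, 30, 45, 57, 70.
data = [12, 20, 30, 35, 40, 45, 50, 55, 57, 65, 70]

# The default "exclusive" method places quartiles at (n + 1)-based positions.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1                            # interquartile range = Q3 - Q1

lower_fence = q1 - 1.5 * iqr             # assumed 1.5 * IQR convention
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(min(data), q1, q2, q3, max(data))
print(iqr, lower_fence, upper_fence, outliers)
```

With these values the quartiles come out as 30, 45, and 57, the interquartile range as 27, and no observation falls outside the fences.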
Shape of Box and Whisker Plot
• Symmetric
• Right Skewed
• Left Skewed
Why Use a Boxplot?
• A boxplot provides an alternative to a histogram, a dot plot, and a stem-and-
leaf plot. Among the advantages of a boxplot over a histogram are ease of
construction and convenient handling of outliers. In addition, the
construction of a boxplot does not involve subjective judgements, as does a
histogram. That is, two individuals will construct the same boxplot for a
given set of data - which is not necessarily true of a histogram, because the
number of classes and the class endpoints must be chosen. On the other
hand, the boxplot lacks the details the histogram provides.
• Dot plots and stem plots retain the identity of the individual observations; a
boxplot does not. Many sets of data are more suitable for display as
boxplots than as a stem plot. A boxplot as well as a stem plot are useful for
making side-by-side comparisons.
Measures of Variation
• Range
• Interquartile Range
• Variance: Population Variance, Sample Variance
• Standard Deviation: Population Standard Deviation, Sample Standard Deviation
• Coefficient of Variation
Variation
• Measures of variation give information on the
spread or variability of the data values.
Same center,
different variation
Measures of Dispersion for Grouped and Ungrouped Data
Range
• Range = Largest value – smallest value
Example: Range = Largest value – smallest value
= 267,277 – 49,651
= 217,626 square miles
Disadvantages of the Range
• Ignores the way in which data are distributed
• Sensitive to outliers
Example (ignores the distribution): two data sets spread differently over
the values 7 to 12 have the same range.
Range = 12 - 7 = 5
Example (sensitive to outliers):
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
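The outlier example can be reproduced with a short sketch. The two lists are the data sets shown above; the function name `value_range` is illustrative.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# The two data sets above: identical except for the last observation.
without_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(without_outlier))   # 4
print(value_range(with_outlier))      # 119
```

Changing a single observation out of 25 moves the range from 4 to 119, which is why the range is described as sensitive to outliers.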
Variance
• Average of squared deviations of values from the
mean (individual series)
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Population standard deviation (ungrouped data): $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
• Sample standard deviation (ungrouped data): $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
For grouped data, the standard deviation is computed by using the
following relationships:
Population standard deviation:
$\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}} = \sqrt{\frac{\sum fx^2}{N} - \left(\frac{\sum fx}{N}\right)^2}$
Sample standard deviation:
$s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum fx^2}{n - 1} - \frac{(\sum fx)^2}{n(n - 1)}}$
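A sketch of the grouped-data formulas, checking that the definitional and computational forms agree. It reuses the repair-project frequency table from the weighted mean example. Treating that one table as both a population and a sample is an assumption made only for illustration.

```python
from math import sqrt, isclose

# Repair-project frequency table: value x and frequency f.
x = [5, 6, 7, 8]
f = [4, 12, 8, 2]

n = sum(f)                                           # total frequency
sum_fx = sum(fi * xi for fi, xi in zip(f, x))        # sum of f*x
sum_fx2 = sum(fi * xi ** 2 for fi, xi in zip(f, x))  # sum of f*x^2
mean = sum_fx / n
sum_sq_dev = sum(fi * (xi - mean) ** 2 for fi, xi in zip(f, x))

# Population standard deviation: definitional and computational forms.
pop_sd_def = sqrt(sum_sq_dev / n)
pop_sd = sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

# Sample standard deviation: definitional and computational forms.
sample_sd_def = sqrt(sum_sq_dev / (n - 1))
sample_sd = sqrt(sum_fx2 / (n - 1) - sum_fx ** 2 / (n * (n - 1)))

assert isclose(pop_sd, pop_sd_def) and isclose(sample_sd, sample_sd_def)
print(round(pop_sd, 2), round(sample_sd, 2))   # 0.82 0.84
```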
Comparing Standard Deviations
(Three dot plots on a common axis from 11 to 21)
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
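The comparison can be sketched by applying the sample standard deviation formula to three data sets with the same mean. The individual observations are not given in the slides, so the lists below are hypothetical values chosen to reproduce the summary figures for Data A, B, and C.

```python
from math import sqrt

def sample_sd(values):
    """s = sqrt(sum((x_i - x_bar)^2) / (n - 1))"""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

# Hypothetical observations (not from the slides), same mean, different spread.
data_a = [11, 12, 13, 16, 16, 17, 18, 21]
data_b = [14, 15, 15, 15, 16, 16, 16, 17]
data_c = [11, 11, 11, 12, 19, 20, 20, 20]

sd_a, sd_b, sd_c = sample_sd(data_a), sample_sd(data_b), sample_sd(data_c)

for name, data, sd in [("A", data_a, sd_a), ("B", data_b, sd_b), ("C", data_c, sd_c)]:
    print(name, sum(data) / len(data), round(sd, 2))
```

All three sets have mean 15.5, while the standard deviation grows as the values move away from the center.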
Coefficient of Variation (CV)
• The C.V. is the most widely used relative measure of dispersion for comparing
two or more distributions.
• The C.V. is used to compare two or more distributions with respect to their
variability, consistency, uniformity, homogeneity, equitability, stability, etc.
• When comparing two or more distributions, the lower the C.V., the more
homogeneous (consistent, uniform, regular, stable) the distribution.
$CV = \frac{\sigma}{\mu} \times 100\%$ (population)
$CV = \frac{s}{\bar{x}} \times 100\%$ (sample)
Note: A low CV indicates that there is low variation in the data set
and hence higher consistency.
• E.g. 1. Consider the distribution of the yields (per plot) of two paddy varieties; the
information is given below:
              Variety I    Variety II
Mean (kg)     60           50
S.D. (kg)     10           9
C.V. for Variety I = $\frac{10}{60} \times 100 = 16.7\%$ (less variability, more consistent)
C.V. for Variety II = $\frac{9}{50} \times 100 = 18.0\%$
* But in terms of S.D. alone, the interpretation would be reversed, since Variety II has the smaller S.D.
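The paddy-yield comparison can be checked with a short sketch. The figures come from the example above; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, expressed in percent."""
    return sd / mean * 100

cv_variety_1 = coefficient_of_variation(10, 60)   # S.D. 10 kg, mean 60 kg
cv_variety_2 = coefficient_of_variation(9, 50)    # S.D. 9 kg, mean 50 kg

print(round(cv_variety_1, 1), round(cv_variety_2, 1))   # 16.7 18.0
print(cv_variety_1 < cv_variety_2)                      # True: Variety I is more consistent
```

Variety II has the smaller S.D. (9 against 10) but the larger C.V., because its mean is also smaller.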
The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
• $\mu \pm 1\sigma$ contains about 68% of the values in
the population or the sample
• $\mu \pm 2\sigma$ contains about 95% of the values in
the population or the sample
• $\mu \pm 3\sigma$ contains about 99.7% of the values
in the population or the sample
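These percentages can be checked against an exact normal distribution with the standard-library `statistics.NormalDist`. This is a sketch that assumes the bell shape is exactly normal, which real data only approximate.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(100 * share, 1))   # 68.3, 95.4, 99.7
```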
Tchebysheff’s Theorem
• Regardless of how the data are distributed, at
least $(1 - 1/k^2)$ of the values will fall within k
standard deviations of the mean
• Examples:
At least $(1 - 1/1^2)$ = 0% of the values fall within k=1 (μ ± 1σ)
At least $(1 - 1/2^2)$ = 75% of the values fall within k=2 (μ ± 2σ)
At least $(1 - 1/3^2)$ = 89% of the values fall within k=3 (μ ± 3σ)
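A sketch of the bound, checked against the strongly skewed data set from the range example (the one ending in 120). The helper names are illustrative, and the population standard deviation is used for the check.

```python
from statistics import mean, pstdev

def chebyshev_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

def share_within(values, k):
    """Observed share of values within k population standard deviations of the mean."""
    mu, sigma = mean(values), pstdev(values)
    return sum(abs(x - mu) <= k * sigma for x in values) / len(values)

# Skewed data set from the range example.
data = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

bounds = {k: chebyshev_bound(k) for k in (1, 2, 3)}
for k, bound in bounds.items():
    print(k, round(100 * bound), share_within(data, k) >= bound)
```

The bounds come out as 0%, 75%, and 89%, and the observed share meets each bound even though the data are far from bell-shaped.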
More Related Content

PPTX
Inferential statistics
PPT
Normal distribution stat
PPT
multiple regression
PPTX
The Normal Distribution
PPTX
Confidence interval
PDF
Sampling and sampling distribution tttt
PPTX
Measures of central tendancy
Inferential statistics
Normal distribution stat
multiple regression
The Normal Distribution
Confidence interval
Sampling and sampling distribution tttt
Measures of central tendancy

What's hot (20)

PPTX
Research design
PPTX
Statistical Estimation
PPT
Confidence Intervals
PPT
Measures of dispersions
PPTX
Inferential statistics
PPTX
PPTX
probability and non-probability samplings
PPTX
Testing of hypothesis
PPTX
Sign test
PPTX
Lecture 6. univariate and bivariate analysis
PPTX
Experimental research method
PPTX
Statistical inference 2
PPTX
Multivariate analysis
PPTX
Non parametric test
PPTX
Regression
PPTX
Inferential Statistics
PPTX
Non-Parametric Tests
PPTX
Manova ppt
PPTX
Karl pearson's coefficient of correlation (1)
PPTX
QUANTITATIVE RESEARCH.pptx
Research design
Statistical Estimation
Confidence Intervals
Measures of dispersions
Inferential statistics
probability and non-probability samplings
Testing of hypothesis
Sign test
Lecture 6. univariate and bivariate analysis
Experimental research method
Statistical inference 2
Multivariate analysis
Non parametric test
Regression
Inferential Statistics
Non-Parametric Tests
Manova ppt
Karl pearson's coefficient of correlation (1)
QUANTITATIVE RESEARCH.pptx
Ad

Similar to Introduction to statistics.pptx (20)

PPTX
AGRICULTURAL-STATISTICS.pptx
PPTX
Statistic quantitative qualitative sample
PPT
Introduction-To-Statistics-18032022-010747pm (1).ppt
PPT
Introduction to statistics
PPT
Emba502 day 2
DOCX
Statistical lechure
PDF
Instructional-material-in-advanced-statistics-1.pdf
PPTX
Biostatistics ppt itroductionchapter 1.pptx
PPT
Introduction To Statistics.ppt
PPTX
Probability_and_Statistics_lecture_notes_1.pptx
PPTX
Introduction to Statistics statistics formuls
PPTX
Statistics.pptx
PPTX
introduction to statistics
PPT
grade7statistics-150427083137-conversion-gate01.ppt
PDF
PDF
Day1, session i- spss
PPTX
fundamentals of data science and analytics on descriptive analysis.pptx
PPTX
lesson-1_Introduction-to-Statistics.pptx
PDF
Lesson 1.pdf probability and statistics.
DOC
Statistics (2).doc
AGRICULTURAL-STATISTICS.pptx
Statistic quantitative qualitative sample
Introduction-To-Statistics-18032022-010747pm (1).ppt
Introduction to statistics
Emba502 day 2
Statistical lechure
Instructional-material-in-advanced-statistics-1.pdf
Biostatistics ppt itroductionchapter 1.pptx
Introduction To Statistics.ppt
Probability_and_Statistics_lecture_notes_1.pptx
Introduction to Statistics statistics formuls
Statistics.pptx
introduction to statistics
grade7statistics-150427083137-conversion-gate01.ppt
Day1, session i- spss
fundamentals of data science and analytics on descriptive analysis.pptx
lesson-1_Introduction-to-Statistics.pptx
Lesson 1.pdf probability and statistics.
Statistics (2).doc
Ad

Recently uploaded (20)

PDF
Structs to JSON How Go Powers REST APIs.pdf
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PPTX
Geodesy 1.pptx...............................................
PPTX
web development for engineering and engineering
PDF
PPT on Performance Review to get promotions
PPTX
Internet of Things (IOT) - A guide to understanding
PPT
Mechanical Engineering MATERIALS Selection
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
additive manufacturing of ss316l using mig welding
DOCX
573137875-Attendance-Management-System-original
PPTX
CH1 Production IntroductoryConcepts.pptx
Structs to JSON How Go Powers REST APIs.pdf
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Foundation to blockchain - A guide to Blockchain Tech
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Operating System & Kernel Study Guide-1 - converted.pdf
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
CYBER-CRIMES AND SECURITY A guide to understanding
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Arduino robotics embedded978-1-4302-3184-4.pdf
Geodesy 1.pptx...............................................
web development for engineering and engineering
PPT on Performance Review to get promotions
Internet of Things (IOT) - A guide to understanding
Mechanical Engineering MATERIALS Selection
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
additive manufacturing of ss316l using mig welding
573137875-Attendance-Management-System-original
CH1 Production IntroductoryConcepts.pptx

Introduction to statistics.pptx

  • 2. Introduction to Statistics Statistics: • The word statistics has two meanings: • In the most common usage – statistics refers to numerical facts • The number that represents – a) annul income b) age c) the percentage of students who scored grade A d) the starting salary of a typical college graduate • What will be other examples of statistics? ……………..
  • 3. The following examples present some statistics: • Approximately 30% of Google’s employees were female in July 2014 (USA TODAY, July 24, 2014). • In 2013, author James Patterson earned $90 million from the sale of his books (Forbes, September 29, 2014). • As per the CBS report, the hotel and restaurant, manufacturing and transportation sectors of Nepal will witness negative growth of 16.3 percent, 1.1 percent and 2.3 percent, respectively, in the current fiscal year (The Himalayan Times, April 30, 2020).
  • 4. • The second meaning of statistics refers to the field or discipline of study. • Statistics is the science of collecting, analyzing, presenting, and interpreting data, as well as of making decisions based on such analyses. • A comprehensive definition given by Croxton and Cowden is: “Statistics may be defined as the collection, presentation, analysis and interpretation of numerical data”
  • 5. • Statistical methods help us make scientific and intelligent decisions. • Decisions made by using statistical methods are called educated guesses. • Decisions made without using statistical (or scientific) methods are called pure guesses and, hence, may prove to be unreliable. • For example: …….
  • 6. Applications: Accounting: Generally the number of individual accounts receivable is large and time taking to check its validity. Based on sample data auditors make conclusions as to whether the accounts receivable amount shown on the client’s balance is acceptable or not.
  • 7. Finance: Financial analysis, uses variety of statistical information and methods to guide their investment recommendations. Economics: Economists use a variety of statistical information and methods in making forecasting, planning and formulations economic policies price index numbers, unemployment rates, manufacturing capacity utilization, human development indicator indices, and quality control charts etc.
  • 8. Fig1. the relation between population and sample Basic Terms Population or target population: The collection of all elements/members whose characteristics are being studied. For example:……………….. Sample: A portion/fraction of the population of interest. For example: ……………
  • 9. Goal of Sample: Usually populations are so large that a researcher cannot examine the entire group. Therefore, a sample is selected to represent the population in a research study. The goal is to use the results obtained from the sample to help answer questions about the population.
  • 11. Basic terms continued….. Survey: A survey is a research method used for collecting data from a predefined group of respondents to gain information and insights into various topics of interest. Census: procedure of systematically calculating, acquiring and recording information about the members of a given population. Sample Survey: procedure of systematically calculating, acquiring and recording information from only a portion of a population of interest.
  • 12. • Variable - A variable is a characteristic under study that assumes different values for different elements. - A variable is often denoted by letters x, y, or z - The value of a variable for an element is called an observation or measurement. • Data - collection of information/observations - The goal of statistics is to help researchers organize and interpret the data.
  • 13. Typesof Variables • Some variables (such as the height of person, price of groceries) can be measured numerically, whereas others (such as occupation, income sources) cannot. • Variables are classified into two types: a) Quantitative Variable b) Qualitative Variable
  • 14. i) Quantitative Variable • A variable that can be measured numerically is called a quantitative variable. • The data collected on a quantitative variable are called quantitative data. • Example: Number of workers: 23, 24, 25, 15, 19, 18 • Other examples: - Annual Gross sale - No. of accidents - Weight of a laptop - Temperature - No. of gadgets owned
  • 15. • As you can see from the above examples that certain quantitative variable can assume may be countable or noncountable • Quantitative variables may be classified into two categories a)Discrete Variable b)Continuous Variable
  • 16. A) Discrete Variable • Variable whose values are countable. • In other words, a discrete variable can assume only certain values with no intermediate values. • For example: - No. of accidents - The no. of daily admissions in a general hospitals - The no. of people visit bank in on any day - The no. of books in a library
  • 17. B) Continuous Variable • A variable that can assume any numerical value over a certain interval or intervals is called a continuous variable. • Example: - Price of book: USD105.6 - Annual salary - Body temperature - Expenditure on food on any day - The time it takes to complete a certain task
  • 18. ii) Qualitative or Categorical Variable • A variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories is called a qualitative or categorical variable. • The data collected on such a variable are called qualitative data. • Examples: - Gender of a person - A person’s blood type - Occupation - Modes of transportation
  • 20. Measuring Variables • To establish relationships between variables, researchers must observe the variables and record their observations. This requires that the variables be measured. • The process of measuring a variable requires a set of categories called a scale of measurement and a process that classifies each individual into one category.
  • 21. Four Types of Measurement Scales Interval Data Ordinal Data Nominal Data Highest Level (Strongest forms of measurement) Higher Levels Lowest Level (Weakest form of measurement) Categories (no ordering or direction) Ordered Categories (rankings, order, or scaling) Differences between measurements but no true zero Ratio Data Differences between measurements, true zero exists
  • 22. Nominal data: Categorical data and numbers that are simply used as identifiers or names represent a nominal scale of measurement. Examples: Gender: a) male b) female Ordinal data: An ordinal scale of measurement represents an ordered series of relationships or rank order. Individuals competing in a contest may be fortunate to achieve first, second, or third place. First, second, and third place represent ordinal data Examples: organizational chart, post, educational qualification,
  • 23. Interval data: • A scale which represents quantity and has equal units but for which zero represents simply an additional point of measurement is an interval scale • Example: Temperature, Ph, SAT Score, IQ Test Ratio data: • The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units. • However, this scale also has an absolute zero (no numbers exist below the zero). • Very often, physical measures will represent ratio data (for example, height and weight).
  • 24. Example: Scale of measurement
  • 25. Branches of Statistics • Descriptive statistics are methods for organizing and summarizing data. • For example, tables or graphs are used to organize data, and descriptive values such as the average score are used to summarize data. • A descriptive value for a population is called a parameter and a descriptive value for a sample is called a statistic.
  • 26. • Inferential statistics are methods for using sample data to make general conclusions (inferences) about populations. • Because a sample is typically only a part of the whole population, sample data provide only limited information about the population. As a result, sample statistics are generally imperfect representatives of the corresponding population parameters.
  • 27. Things to remember • A descriptive study may be performed either on a sample or on a population. Only when an inference is made about the population, based on information obtained from the sample, does the study become inferential. • Descriptive statistics and inferential statistics are interrelated. You must almost always use techniques of descriptive statistics to organize and summarize the information obtained from a sample before carrying out an inferential analysis. • Furthermore, as you will see, the preliminary descriptive analysis of a sample often reveals features that lead you to the choice of the appropriate inferential method.
  • 28. Describing Data with Numerical Measures a) Measure of central tendency and location b) Measure of Variability
  • 29. Topics: • Compute and interpret the mean, median, and mode for a set of data • Compute the range, variance, and standard deviation and know what these values mean • Construct and interpret a box and whiskers plot • Compute and explain the coefficient of variation • Use numerical measures along with graphs, charts, and tables to interpret data
  • 30. Summary Measures (Describing Data Numerically) • Center and Location: Mean, Median, Mode, Weighted Mean • Other Measures of Location: Percentiles, Quartiles • Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
  • 31. Measures of Center and Location: Overview • Mean: sample mean $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, population mean $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ • Median • Mode • Weighted Mean: sample $\bar{X}_W = \frac{\sum_i w_i x_i}{\sum_i w_i}$, population $\mu_W = \frac{\sum_i w_i x_i}{\sum_i w_i}$
  • 32. Measures of Center for Ungrouped and Grouped Data a) Mean
  • 34. b) Median • In an ordered array, the median is the “middle” number • If n or N is odd, the median is the middle number • If n or N is even, the median is the average of the two middle numbers • The advantage of using the median as a measure of central tendency is that it is not influenced by outliers. • When outliers exist, use median instead of mean as a measure of central tendency.
  • 35. The median is the value of the middle term in a data set that has been ranked in increasing order: Median = value of the $\left(\frac{n+1}{2}\right)$th term in the ranked data set. • Example: 173,175 49,723 20,352 10,824 40,911 18,038 61,848. Find the median for these data.
  • 36. Median = (28.0 + 28.2) / 2 = 56.2 / 2 = 28.1, that is, $28.1 million
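The position rule from slides 34 to 36 can be sketched in Python. This is a minimal illustration, not part of the original deck: the function name `median_by_position` is made up here, the seven values are the ones listed on slide 35, and the pair 28.0 and 28.2 are the two middle values shown on slide 36.

```python
from statistics import median

# Values from the slide 35 exercise (thousands separators removed).
values = [173175, 49723, 20352, 10824, 40911, 18038, 61848]

def median_by_position(data):
    """Median via the (n + 1) / 2 position rule on the ranked data."""
    ranked = sorted(data)
    n = len(ranked)
    position = (n + 1) / 2             # 1-based position of the median
    lower = ranked[int(position) - 1]
    if position.is_integer():          # n odd: a single middle term
        return lower
    upper = ranked[int(position)]      # n even: average the two middle terms
    return (lower + upper) / 2

print(median_by_position(values))         # middle term of the 7 ranked values
print(median_by_position([28.0, 28.2]))   # average of two middle values
print(median(values))                     # cross-check with the standard library
```

With an odd number of values the rule lands on a single ranked term; with an even number it falls halfway between two terms, which is why the two middle values are averaged.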
  • 37. Calculating median for grouped data • Median $= l + \frac{n/2 - cf}{f} \times h$ • where l = lower limit of the median class, n/2 = median position, cf = cumulative frequency of the class preceding the median class, f = frequency of the median class, h = class width of the median class
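A small sketch of the grouped-median formula follows. The frequency table is hypothetical (it does not come from the slides), equal class widths are assumed, and `grouped_median` is an illustrative name.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = l + (n/2 - cf) / f * h for equal-width classes."""
    n = sum(freqs)
    half = n / 2                       # median position n/2
    cf = 0                             # cumulative frequency before the current class
    for l, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= half:   # first class whose cumulative frequency reaches n/2
            return l + (half - cf) / f * width
        cf += f
    raise ValueError("frequency table is empty")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (not from the slides).
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

print(grouped_median(lower_limits, freqs, width=10))
```

Here n = 40, so the median position is 20; the class 20-30 is the median class because the cumulative frequency first reaches 20 there (13 before it, 25 after it).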
  • 38. c) Mode Mode for ungrouped data • The mode is the value that occurs with the highest frequency in a data set. • Example: ……. • Advantage: - Can be used for both Qualitative and Quantitative data, whereas the mean and median can be calculated for only quantitative data - Not affected by outliers • Disadvantage: (dependent on the nature of data set) - There may be no mode - There may be several modes
  • 39. Calculating mode for grouped data
  • 40. d) Weighted Mean • The weighted mean is an average computed by giving different weights to some of the individual values. If all the weights are equal, the weighted mean is the same as the arithmetic mean. • It is used when the values do not all contribute equally, for example when each value comes with a frequency or an importance weight. • For non-negative data x1, x2, x3, …, xn with non-negative weights w1, w2, w3, …, wn, the weighted mean is given by $\bar{X}_w = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n}{w_1 + w_2 + w_3 + \dots + w_n} = \frac{\sum w x}{\sum w}$, where w = given weight.
  • 41. Example: Sample of 26 Repair Projects • Days to Complete (Frequency): 5 (4), 6 (12), 7 (8), 8 (2) • Weighted Mean Days to Complete: $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
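The repair-project example can be reproduced with a few lines of Python. The data are the days and frequencies from slide 41; the function name `weighted_mean` is only an illustrative choice.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Slide 41: days to complete and how many of the 26 projects took that long.
days = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]

result = weighted_mean(days, frequency)
print(round(result, 2))                      # 164 / 26, rounded to two decimals

# With equal weights the weighted mean reduces to the arithmetic mean.
print(weighted_mean(days, [1, 1, 1, 1]))
```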
  • 42. Which measure of location is the “best”? • Mean is generally used, unless extreme values (outliers) exist • Then median is often used, since the median is not sensitive to extreme values.
  • 43. Relationships Among the Mean, Median, and Mode
  • 44. Partition values • The values of the variate that divide the total number of observations into a number of equal parts are known as partition values. • If the values of the variate are arranged in ascending or descending order of magnitude, the median is the value of the variate that divides the total frequency into two equal parts. • Similarly, the given series can be divided into four, ten, or a hundred equal parts. • Quartile: The values of the variate that divide the total frequency into four equal parts are called quartiles. There are three quartiles: first quartile (Q1), second quartile (Q2), and third quartile (Q3). • Decile: Deciles are the values that divide a given set of observations into ten equal parts. Therefore, there are nine deciles, denoted D1, D2, D3, D4, …, D9. • Percentile: Percentiles divide a given set of observations into 100 equal parts. The percentiles or centiles are denoted P1, P2, P3, P4, …, P99.
  • 45. Percentiles • The pth percentile in an ordered array of n values is the value in the ith position, where $i = \frac{p}{100}(n + 1)$ • Example: The 60th percentile in an ordered array of 19 values is the value in the 12th position: $i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$
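The position rule on slide 45 is sketched below. The 19 data values are hypothetical, and the linear interpolation used for fractional positions is an assumption of this sketch; the slide only covers the case where the position is a whole number.

```python
def percentile_position(p, n):
    """1-based position i = p / 100 * (n + 1) in the ordered array."""
    return p * (n + 1) / 100

def percentile_value(ranked, p):
    """Value at the percentile position (ranked must already be sorted).

    Assumption: fractional positions are linearly interpolated between neighbours.
    """
    i = percentile_position(p, len(ranked))
    whole = int(i)
    frac = i - whole
    if whole < 1:
        return ranked[0]
    if whole >= len(ranked):
        return ranked[-1]
    return ranked[whole - 1] + frac * (ranked[whole] - ranked[whole - 1])

# Hypothetical ordered array of 19 values: 10, 20, ..., 190.
ranked = list(range(10, 200, 10))

print(percentile_position(60, 19))   # position of the 60th percentile
print(percentile_value(ranked, 60))  # value sitting in that position
```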
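  • 46. Calculation of Partition value: • Quartile: $Q_i = L + \frac{\frac{in}{4} - c.f.}{f} \times h$, where i = 1, 2, 3 • Decile: $D_i = L + \frac{\frac{in}{10} - c.f.}{f} \times h$, where i = 1, 2, 3, …, 9 • Percentile: $P_i = L + \frac{\frac{in}{100} - c.f.}{f} \times h$, where i = 1, 2, 3, 4, …, 99 • Here L is the lower limit of the class containing the partition value, c.f. is the cumulative frequency of the preceding class, f is the frequency of that class, and h is its width. • Note: Median = $Q_2 = D_5 = P_{50}$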
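Since the three formulas differ only in the divisor (4, 10, or 100), they can be written as one function. The sketch below uses a hypothetical frequency table with equal class widths; `grouped_partition` and its parameter names are illustrative, not from the slides.

```python
def grouped_partition(lower_limits, freqs, width, i, parts):
    """L + (i*n/parts - c.f.) / f * h; parts is 4, 10, or 100."""
    n = sum(freqs)
    target = i * n / parts             # position i*n/4, i*n/10 or i*n/100
    cf = 0                             # cumulative frequency of preceding classes
    for L, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= target:
            return L + (target - cf) / f * width
        cf += f
    raise ValueError("frequency table is empty")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (not from the slides).
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

q1 = grouped_partition(lower_limits, freqs, 10, i=1, parts=4)
q2 = grouped_partition(lower_limits, freqs, 10, i=2, parts=4)
q3 = grouped_partition(lower_limits, freqs, 10, i=3, parts=4)
d5 = grouped_partition(lower_limits, freqs, 10, i=5, parts=10)
p50 = grouped_partition(lower_limits, freqs, 10, i=50, parts=100)

print(q1, q2, q3)
print(q2 == d5 == p50)   # the note on the slide: Median = Q2 = D5 = P50
```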
  • 47. Interquartile Range • Some outlier problems can be eliminated by using the interquartile range. • It eliminates some high- and low-valued observations and calculates the range from the remaining values. • Interquartile range = 3rd quartile – 1st quartile
  • 48. Interquartile Range • Example: X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70, with 25% of the values in each of the four segments • Interquartile range = 57 – 30 = 27
  • 49. Box and Whisker Plot • A graphical display of data using the 5-number summary: Minimum -- Q1 -- Median -- Q3 -- Maximum • Example: [figure: box and whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum, with 25% of the values in each of the four segments]
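The 5-number summary and the interquartile range can be computed with Python's standard library. The eleven data values below are hypothetical; they were chosen only so that the summary matches the numbers on slide 48 (12, 30, 45, 57, 70). The `"exclusive"` method of `statistics.quantiles` places the quartiles at positions i(n + 1)/4, which agrees with the (n + 1) position rule used for percentiles on slide 45.

```python
from statistics import quantiles

# Hypothetical data, chosen to reproduce the slide 48 summary values.
data = [12, 21, 30, 35, 40, 45, 50, 54, 57, 63, 70]

# Quartiles at positions (n + 1)/4, 2(n + 1)/4 and 3(n + 1)/4 of the sorted data.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1

print(five_number_summary)
print(iqr)
```

Other quartile conventions exist and can give slightly different values on the same data, so results from different software may not match exactly.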
  • 50. Features of Box and Whisker plot: - Gives a graphic presentation of data using five measures: the median, the first quartile, the third quartile, and the smallest and the largest values in the data set between the lower and the upper inner fences. - Can help visualize the center, the spread, and the skewness of a data set. - It also helps detect outliers. - Uses robust summary statistics that are always located at actual data points, are quickly computable (originally by hand), and have no tuning parameters. Box and whisker plots are particularly useful for comparing distributions across groups.
  • 51. Shape of Box and Whisker Plot • Symmetric • Right Skewed • Left Skewed
  • 52. Why Use a Boxplot? • A boxplot provides an alternative to a histogram, a dot plot, and a stem-and-leaf plot. Among the advantages of a boxplot over a histogram are ease of construction and convenient handling of outliers. In addition, the construction of a boxplot does not involve subjective judgements, as does a histogram. That is, two individuals will construct the same boxplot for a given set of data, which is not necessarily true of a histogram, because the number of classes and the class endpoints must be chosen. On the other hand, the boxplot lacks the details the histogram provides. • Dot plots and stem plots retain the identity of the individual observations; a boxplot does not. Many sets of data are more suitable for display as boxplots than as stem plots. Both boxplots and stem plots are useful for making side-by-side comparisons.
  • 53. Measures of Variation • Range • Interquartile Range • Variance: Population Variance, Sample Variance • Standard Deviation: Population Standard Deviation, Sample Standard Deviation • Coefficient of Variation
  • 54. Variation • Measures of variation give information on the spread or variability of the data values. Same center, different variation
  • 55. Measures of Dispersion for Grouped and Ungrouped Data Range • Range = Largest value – smallest value • Example: Range = 267,277 – 49,651 = 217,626 square miles
  • 56. Disadvantages of the Range • Ignores the way in which data are distributed: two data sets spread differently over the values 7 8 9 10 11 12 both have Range = 12 - 7 = 5 • Sensitive to outliers: 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 has Range = 5 - 1 = 4, whereas 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 has Range = 120 - 1 = 119
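The outlier example on slide 56 can be checked directly. The two lists below are the two data sets from the slide; the comparison with the median is an addition of this sketch, included to show a measure that the single extreme value does not move.

```python
from statistics import median

def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# The two data sets from slide 56; they differ only in the last value.
without_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(without_outlier), data_range(with_outlier))
print(median(without_outlier), median(with_outlier))
```

Changing one observation from 5 to 120 multiplies the range by almost thirty, while the median of the 25 values stays the same.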
  • 57. Variance • Average of squared deviations of values from the mean (individual series) • Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ • Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
  • 58. Standard Deviation • Most commonly used measure of variation • Shows variation about the mean • Has the same units as the original data • Sample standard deviation (ungrouped data): $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ • Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
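The sample and population formulas differ only in the divisor (n - 1 versus N). A minimal sketch, using eight illustrative values that are not taken from the slides, computes both by hand and cross-checks them against Python's `statistics` module.

```python
from math import isclose, sqrt
from statistics import pvariance, variance

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def population_variance(xs):
    """Sum of squared deviations from the population mean, divided by N."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

# Illustrative values (an assumption of this sketch, not slide data).
xs = [11, 12, 13, 16, 16, 17, 18, 21]

s2 = sample_variance(xs)
sigma2 = population_variance(xs)
s, sigma = sqrt(s2), sqrt(sigma2)      # standard deviations

print(round(s2, 3), round(s, 3))
print(round(sigma2, 3), round(sigma, 3))
print(isclose(s2, variance(xs)), isclose(sigma2, pvariance(xs)))
```

The sample value is always the larger of the two on the same data, because the same sum of squares is divided by n - 1 instead of N.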
  • 59. For grouped data, the standard deviation is computed by using the following relationships: • Population standard deviation: $\sigma = \sqrt{\frac{\sum f (x - \mu)^2}{N}} = \sqrt{\frac{\sum f x^2}{N} - \frac{(\sum f x)^2}{N^2}}$ • Sample standard deviation: $s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum f x^2}{n - 1} - \frac{(\sum f x)^2}{n(n - 1)}}$
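The definitional and shortcut forms for grouped data should give the same number. The sketch below checks this on the repair-project frequencies from slide 41, treating them once as a population and once as a sample purely for illustration; variable names are made up for this example.

```python
from math import isclose, sqrt

# Days to complete (x) and frequencies (f) from slide 41.
x = [5, 6, 7, 8]
f = [4, 12, 8, 2]

n = sum(f)                                         # total frequency
sum_fx = sum(fi * xi for xi, fi in zip(x, f))      # sum of f*x
sum_fx2 = sum(fi * xi ** 2 for xi, fi in zip(x, f))  # sum of f*x^2
mean = sum_fx / n
squared_dev = sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f))

# Population form: definition versus shortcut.
pop_definition = sqrt(squared_dev / n)
pop_shortcut = sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

# Sample form: definition versus shortcut.
sample_definition = sqrt(squared_dev / (n - 1))
sample_shortcut = sqrt(sum_fx2 / (n - 1) - sum_fx ** 2 / (n * (n - 1)))

print(isclose(pop_definition, pop_shortcut, rel_tol=1e-9))
print(isclose(sample_definition, sample_shortcut, rel_tol=1e-9))
```

The shortcut forms need only the running totals of f, fx, and fx², which is why they are convenient for hand calculation from a frequency table.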
  • 60. Comparing Standard Deviations • [figure: dot plots of three data sets on a common axis from 11 to 21] • Data A: Mean = 15.5, s = 3.338 • Data B: Mean = 15.5, s = 0.9258 • Data C: Mean = 15.5, s = 4.57
  • 61. Coefficient of Variation (CV) • The C.V. is the most widely used relative measure of dispersion for comparing two or more distributions. • When comparing distributions, the lower the C.V., the more homogeneous (more consistent, uniform, regular, or stable) the distribution. • The C.V. is used to compare two or more distributions with respect to their variability, consistency, uniformity, homogeneity, equitability, stability, etc.
  • 62. Coefficient of Variation (CV) • Population: $CV = \frac{\sigma}{\mu} \times 100\%$ • Sample: $CV = \frac{s}{\bar{x}} \times 100\%$ • Note: A low CV indicates low variation in the data set and hence higher consistency.
  • 63. • E.g. 1. Consider the distribution of the yields (per plot) of two paddy varieties, with the information given below: • Variety I: Mean = 60 kg, S.D. = 10 kg • Variety II: Mean = 50 kg, S.D. = 9 kg • C.V. for Variety I = (10 / 60) × 100 = 16.7% (less variability, more consistent) • C.V. for Variety II = (9 / 50) × 100 = 18.0% • Note: in terms of S.D. alone the interpretation would be reversed, because Variety II has the smaller S.D.
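The paddy example is easy to verify in code. The means and standard deviations are the ones given on slide 63; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, expressed in percent."""
    return sd / mean * 100

# Slide 63: mean and S.D. of yield (kg per plot) for the two varieties.
cv_variety_1 = coefficient_of_variation(sd=10, mean=60)
cv_variety_2 = coefficient_of_variation(sd=9, mean=50)

print(round(cv_variety_1, 1), round(cv_variety_2, 1))

# Lower CV means more consistent relative to its own mean.
more_consistent = "Variety I" if cv_variety_1 < cv_variety_2 else "Variety II"
print(more_consistent)
```

Variety I has the larger standard deviation but the smaller CV, which is why a relative measure is preferred when the means differ.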
  • 64. The Empirical Rule • If the data distribution is bell-shaped, then the interval $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample.
  • 65. The Empirical Rule • If the data distribution is bell-shaped, then the interval $\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample • and the interval $\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample.
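The 68%, 95%, and 99.7% figures can be recovered from the normal distribution, which is one specific bell-shaped curve. The sketch below assumes the data are exactly normal; for that case the share of values within k standard deviations of the mean is erf(k / √2). The function name is illustrative.

```python
from math import erf, sqrt

def normal_coverage(k):
    """Share of a normal distribution lying within mu +/- k*sigma."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_coverage(k) * 100, 1))
```

Real data that are only roughly bell-shaped will match these percentages approximately, which is why the rule is stated with "about".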
  • 66. Tchebysheff’s Theorem • Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within k standard deviations of the mean • Examples: • k = 1: at least (1 - 1/1²) = 0% within (μ ± 1σ) • k = 2: at least (1 - 1/2²) = 75% within (μ ± 2σ) • k = 3: at least (1 - 1/3²) = 89% within (μ ± 3σ)
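Because the theorem holds for any distribution, it can be checked on a strongly skewed data set. The sketch below reuses the outlier data from slide 56 (the 25 values ending in 120) and treats them as a population; function names are illustrative.

```python
from statistics import fmean, pstdev

def chebyshev_bound(k):
    """Minimum share of values within k standard deviations: 1 - 1/k^2."""
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed share of values inside mean +/- k * (population) std dev."""
    mu, sigma = fmean(data), pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)

# Skewed data set from slide 56 (contains the outlier 120).
data = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

for k in (1, 2, 3):
    print(k, round(chebyshev_bound(k), 2), fraction_within(data, k))
```

The bound is a guaranteed minimum, not an estimate: the observed share is usually well above it, and for k = 1 the theorem guarantees nothing at all.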