Descriptive statistics

by Indramani Tripathi
Measures of Central Tendency and Dispersion

Measures of Central Tendency
 1. Mode = can be used for any kind of data, but it is the only measure of central tendency for nominal or qualitative data.
 Formula: the value that occurs most often, or the category or interval with the highest frequency.
 Note: Omit Formula 3.1 (Variation Ratio) in Healey and Prus 2nd Cdn. Edition.

Example for Nominal Variables:

 Religion     Frequency   cf   Proportion    %    Cum %
 Catholic        17       17      .41        41     41
 Protestant       4       21      .10        10     51
 Jewish           2       23      .05         5     56
 Muslim           1       24      .02         2     58
 Other            9       33      .22        22     80
 None             8       41      .20        20    100

 Total           41              1.00       100%
 Central Tendency: MODE = largest category = Catholic
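
The mode can also be read off the frequency table programmatically. The following is a minimal Python sketch; the variable names (`religion_counts`, `mode_category`) are illustrative assumptions and not part of the course material.

```python
# Frequencies copied from the religion table above.
religion_counts = {
    "Catholic": 17,
    "Protestant": 4,
    "Jewish": 2,
    "Muslim": 1,
    "Other": 9,
    "None": 8,
}

# The mode is the category with the highest frequency.
mode_category = max(religion_counts, key=religion_counts.get)
total_cases = sum(religion_counts.values())

print(mode_category)  # Catholic
print(total_cases)    # 41
```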

Central Tendency (cont.)
 2. Median = exact centre or middle of ordered data. The 50th percentile.
 Formula:
 Array the data (put it in order).
 When sample size is even, the median falls halfway between the two middle numbers.
 To calculate: find the values in positions (n/2) and (n/2)+1, add them together, and divide the total by 2 to find the exact median.
 When sample size is odd, the median is the exact middle value, in position (n+1)/2.

Example for Raw Data:
 Suppose you have the following set of test
scores:
 66, 89, 41, 98, 76, 77, 68, 60, 60, 67, 69, 66,
98, 52, 74, 66, 89, 95, 66, 69
 1. Array (put in order) your data:
 98 98 95 89 89 77 76 74 69 69
68 67 66 66 66 66 60 60 52 41
N = 20 (N is even)

To calculate:
- find the two middle numbers, in positions (n/2) and (n/2)+1
- add together the two middle numbers
- divide the total by 2
 First middle number: (20/2) = the 10th number
 2nd middle number: (20/2)+1 = the 11th number
 Look at data: the middle numbers are 69 and 68
 The median would be (69+68)/2 = 68.5
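
The steps above can be sketched in Python as follows. This is an illustrative sketch only; the variable names are assumptions, and the final check against the standard library's `statistics.median` is included just to confirm the hand method.

```python
import statistics

scores = [66, 89, 41, 98, 76, 77, 68, 60, 60, 67,
          69, 66, 98, 52, 74, 66, 89, 95, 66, 69]

# Step 1: array the data (highest first, as in the example above).
ordered = sorted(scores, reverse=True)
n = len(ordered)

if n % 2 == 0:
    # Even n: average the values in positions n/2 and (n/2)+1.
    # Python lists are 0-indexed, so subtract 1 from each position.
    first_middle = ordered[n // 2 - 1]
    second_middle = ordered[n // 2]
    median = (first_middle + second_middle) / 2
else:
    # Odd n: take the value in position (n+1)/2.
    median = ordered[(n + 1) // 2 - 1]

print(median)  # 68.5
print(median == statistics.median(scores))  # True
```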

Median for Aggregate (grouped) Data
 This formula is shown in Healey 1st Cdn. Edition but NOT in 2/3 Cdn. Edition
 We will NOT COVER this one!

Properties of median:
 - for numerical data at interval or ordinal level
 - "balance point"
 - not affected by outliers
 - median is appropriate when distribution is highly skewed.

3. Mean for Raw Data
 The mean is the sum of measurements /
number of subjects
 Formula: (X-bar) = ΣXi / N
 Data (from above):
66, 89, 41, 98, 76, 77, 68, 60, 60, 67, 69, 66,
98, 52, 74, 66, 89, 95, 66, 69

Example for Mean
 Formula: (X-bar) = ΣXi / N
 = 1446 / 20
 = 72.3
The mean for these test scores is 72.30
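
As a quick check of the arithmetic, here is a minimal Python sketch of the same calculation (variable names are illustrative assumptions):

```python
scores = [66, 89, 41, 98, 76, 77, 68, 60, 60, 67,
          69, 66, 98, 52, 74, 66, 89, 95, 66, 69]

total = sum(scores)   # ΣXi
n = len(scores)       # N
mean = total / n      # X-bar = ΣXi / N

print(total, n, mean)  # 1446 20 72.3
```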

Mean for Aggregate (Grouped) Data
(Note: not in text but covered in class)
 To calculate the mean for grouped data, you need a frequency table that includes a column for the midpoints (m) and a column for the product of the frequencies times the midpoints (fm).
Formula: (X-bar) = Σ (fm) / N

Frequency table:
Score     f    m*     (fm)
41-50     1   45.5    45.5
51-60     3   55.5   166.5
61-70     8   65.5   524
71-80     3   75.5   226.5
81-90     2   85.5   171
91-100    3   95.5   286.5
N = 20        Σ (fm) = 1420
* m = midpoint of each interval. Find midpoints first.

Calculating Mean for Grouped Data:
Formula: (X-bar) = Σ (fm) / N
= 1420 / 20
= 71
The mean for the grouped data is 71.
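
The grouped-data calculation can be sketched in Python as below. The tuple layout `(lower limit, upper limit, frequency)` and the variable names are assumptions made for illustration. The grouped mean (71) differs slightly from the raw-data mean (72.3) because every score in an interval is treated as if it sat at the midpoint.

```python
# Each class interval: (lower limit, upper limit, frequency f)
classes = [
    (41, 50, 1),
    (51, 60, 3),
    (61, 70, 8),
    (71, 80, 3),
    (81, 90, 2),
    (91, 100, 3),
]

n = sum(f for _, _, f in classes)  # N = total frequency

# Midpoint m = (lower + upper) / 2; accumulate f * m over all intervals.
sum_fm = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

grouped_mean = sum_fm / n  # X-bar = Σ(fm) / N

print(n, sum_fm, grouped_mean)  # 20 1420.0 71.0
```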

Properties of the Mean:
- only for numerical data at interval level
- "balance point"
- can be affected by outliers, which skew the distribution
- the tail becomes elongated and the mean is pulled in the direction of the outlier.
Example:
no outlier:
$30000, 30000, 35000, 25000, 30000 then mean = $30000
but if an outlier is present:
$130000, 30000, 35000, 25000, 30000 then mean = $50000
(the mean is pulled up or down in the direction of the outlier)
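
To see the effect side by side, the sketch below computes both the mean and the median for the two income lists using Python's standard `statistics` module. The variable names are illustrative assumptions. The outlier moves the mean from 30000 to 50000, while the median stays at 30000.

```python
import statistics

no_outlier = [30000, 30000, 35000, 25000, 30000]
with_outlier = [130000, 30000, 35000, 25000, 30000]

mean_without = statistics.mean(no_outlier)
mean_with = statistics.mean(with_outlier)
median_without = statistics.median(no_outlier)
median_with = statistics.median(with_outlier)

print(mean_without, mean_with)      # 30000 50000
print(median_without, median_with)  # 30000 30000
```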

NOTE:
 When distribution is symmetric (and has a single peak), mean = median = mode
 For skewed distributions, the mean will lie in the direction of the skew.
 i.e. skewed to right (tail pulled to right): mean > median (positive skew)
 skewed to left (tail pulled to left): median > mean (negative skew)
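
Applied to the 20 test scores used earlier, this comparison can be sketched as follows. It is only a rough check based on the rule above, and the variable names are illustrative assumptions.

```python
import statistics

scores = [66, 89, 41, 98, 76, 77, 68, 60, 60, 67,
          69, 66, 98, 52, 74, 66, 89, 95, 66, 69]

mean = statistics.mean(scores)      # 72.3
median = statistics.median(scores)  # 68.5
mode = statistics.mode(scores)      # 66 (occurs four times)

# Compare mean and median to judge the direction of skew.
if mean > median:
    skew = "positive (skewed to the right)"
elif mean < median:
    skew = "negative (skewed to the left)"
else:
    skew = "none detected by this comparison"

print(skew)  # positive (skewed to the right)
```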

Measures of Dispersion
 Describe how variable the data are, i.e. how spread out they are around the mean
 Also called measures of variation or variability

Variability for Non-numerical Data
(Nominal or Ordinal Level Data)
 Measures of variability for non-numerical (nominal or ordinal) data are rarely used
 We will not be covering these in class
 Omit Formula 4.1 IQV in Healey and Prus 1st Canadian Edition
 Omit Formula 3.1 Variation Ratio in Healey and Prus 2/3 Canadian Edition

2. Range (for numerical data)
Range = difference between the largest and smallest observations
e.g. if data are $130000, 35000, 30000, 30000, 30000, 30000, 25000, 25000
then range = $130000 - $25000 = $105000

Interquartile Range (Q):
- This is the difference between the 75th and the 25th percentiles (it spans the middle 50% of cases)
- Gives a better idea than the range of what the middle of the distribution looks like.
Formula: Q = Q3 - Q1 (where Q3 is the case in position N x .75 and Q1 is the case in position N x .25, counting up from the smallest value)
Using above data (N = 8): Q = Q3 - Q1 = (6th case - 2nd case)
= $30000 - $25000 = $5000
The interquartile range (Q) is $5000.
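
A minimal Python sketch of the range and of this positional method for Q is given below. It assumes, as in the example, that N x .25 and N x .75 are whole numbers and that positions are counted from the smallest value. Other quartile conventions (for example, ones that interpolate between cases) can give different answers. Variable names are illustrative.

```python
incomes = [130000, 35000, 30000, 30000, 30000, 30000, 25000, 25000]

ordered = sorted(incomes)  # smallest to largest
n = len(ordered)

# Range: largest minus smallest observation.
data_range = ordered[-1] - ordered[0]

# Positional quartiles (assumes n * .25 and n * .75 are whole numbers).
q1_position = int(n * 0.25)  # 2nd case
q3_position = int(n * 0.75)  # 6th case
q1 = ordered[q1_position - 1]  # lists are 0-indexed
q3 = ordered[q3_position - 1]

iqr = q3 - q1

print(data_range)   # 105000
print(q1, q3, iqr)  # 25000 30000 5000
```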

3. Variance and Standard Deviation:
 For raw data at the interval/ratio level.
 Most common measure of variation.
 The numerator in the formula is known as the sum of squares, and the denominator is either the population size N or, for sample data, n-1
 The variance is denoted by S² and the standard deviation, which is the square root of the variance, by S

Definitional Formula for Variance and
Standard Deviation:
 Variance:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{N} \]
 Standard Deviation:
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}} \]
 (the standard deviation is the square root of the variance; the variance is simply the standard deviation squared)

Example for S and S²:
 Data: 66, 89, 41, 98, 76, 77, 68, 60, 60, 67, 69, 66, 98, 52, 74, 66, 89, 95, 66, 69
1. Find ∑Xi²: Square each Xi and find total.
2. Find (∑Xi)²: Find total of all Xi and square.
3. Substitute above and N into formula for S.
4. For S², simply square S (use the unrounded value of S).
S = 14.75   S² = 217.71
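
Carrying out these steps on the 20 test scores gives the worked arithmetic below. It is a sketch that assumes the working formula \(S = \sqrt{\sum X_i^2 / N - \bar{X}^2}\) with \(\bar{X} = \sum X_i / N\), and N (not n-1) in the denominator.

```latex
\begin{align*}
\sum X_i^2 &= 108900 \\
\left(\sum X_i\right)^2 &= 1446^2 = 2090916 \\
S^2 &= \frac{\sum X_i^2}{N} - \frac{\left(\sum X_i\right)^2}{N^2}
     = \frac{108900}{20} - \frac{2090916}{400}
     = 5445 - 5227.29 = 217.71 \\
S &= \sqrt{217.71} \approx 14.75
\end{align*}
```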

A working formula for the standard
deviation:
Note: the definitional formula for standard deviation is
not practical for use with data when N>10.
The working formula, which is much easier to do on
your calculator, should be used instead.
Both formulae give exactly the same result. Try it!
\[ S = \sqrt{\frac{\sum X_i^2}{N} - \bar{X}^2} \]
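
One way to "try it" is the short Python sketch below, which computes S for the 20 test scores with both the definitional and the working formula (N in the denominator). Variable names are illustrative assumptions. The two results agree up to floating-point rounding.

```python
import math

scores = [66, 89, 41, 98, 76, 77, 68, 60, 60, 67,
          69, 66, 98, 52, 74, 66, 89, 95, 66, 69]

n = len(scores)
mean = sum(scores) / n

# Definitional formula: square root of the mean squared deviation.
s_definitional = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)

# Working formula: sqrt(ΣXi² / N - X-bar²).
s_working = math.sqrt(sum(x ** 2 for x in scores) / n - mean ** 2)

print(round(s_definitional, 2), round(s_working, 2))  # 14.75 14.75
```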

Properties of S:
 always greater than or equal to 0
 the greater the variation about the mean, the greater S is
 n-1 corrects for bias when using sample data. Computed with n in the denominator, S tends to underestimate the real population standard deviation when based on sample data, so to correct for this, we use n-1. The larger the sample size, the smaller the difference this correction makes. When calculating the standard deviation for the whole population, use N in the denominator.
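
The size of the n-1 correction can be illustrated with Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n-1. Treating the 20 test scores as a sample here is an assumption made purely for illustration.

```python
import statistics

scores = [66, 89, 41, 98, 76, 77, 68, 60, 60, 67,
          69, 66, 98, 52, 74, 66, 89, 95, 66, 69]

# Population formula: sum of squares divided by N.
population_sd = statistics.pstdev(scores)

# Sample formula: sum of squares divided by n - 1.
sample_sd = statistics.stdev(scores)

print(round(population_sd, 2))  # 14.75
print(round(sample_sd, 2))      # 15.14
```

With only 20 cases the two values differ by about 0.4; with larger samples the gap shrinks.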

NOTE:
 σ, N and mu (µ) denote population parameters
 s, n and x-bar (x̄) denote sample statistics

Remember the Rounding Rules!
 Always use as many decimal places as your calculator can handle in intermediate calculations.
 Round your final answer to 2 decimal places, rounding to the nearest number.
 Engineer's Rule: When the last digit is exactly 5 (followed only by 0's), round the digit before it to the nearest EVEN number.
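
The round-half-to-even rule can be sketched with Python's standard `decimal` module. The helper name `round_half_even` is a hypothetical name chosen for this illustration. Values are passed as strings so that a trailing 5 is represented exactly rather than as a binary floating-point approximation.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_half_even(value: str, places: str = "0.01") -> Decimal:
    """Round a decimal string to 2 places, sending exact halves to the even digit."""
    return Decimal(value).quantize(Decimal(places), rounding=ROUND_HALF_EVEN)

print(round_half_even("2.345"))   # 2.34 (exact 5: the 4 is already even)
print(round_half_even("2.355"))   # 2.36 (exact 5: 5 goes up to the even 6)
print(round_half_even("2.3451"))  # 2.35 (not an exact 5, so round to nearest)
```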

Homework Questions
 Healey and Prus 1e:
 #3.1, #3.5, #3.11 and #4.9, #4.15
 Healey and Prus 2/3e:
 #3.1, #3.5, #3.11 (compute s for 8 nations also), #3.15
 SPSS:
 Read the SPSS sections for Ch. 3 and 4 in 1st Cdn. Edition and for Ch. 4 in 2/3 Cdn. Edition
 Try some of the SPSS exercises for practice
