TOPICS FOR TODAY
Analysis of Variance
ANOVA
The concept of Analysis of Variance
is explained below:
Earlier, we compared two population means by using a two-sample t-test.
However, we are often required to
compare more than two population means
simultaneously.
We might be tempted to apply the two-sample t-test to all possible pairwise comparisons of means.
For example, if we wish to compare 4 population means, there will be

$$\binom{4}{2} = 6$$

separate pairs, and to test the null hypothesis that all four population means are equal, we would require six two-sample t-tests.
Similarly, to test the null hypothesis that 10 population means are equal, we would need

$$\binom{10}{2} = 45$$

separate two-sample t-tests.
This procedure of running multiple two-sample t-tests for comparing means would obviously be tedious and time-consuming.
Thus a series of two-sample t-tests is not
an appropriate procedure to test the equality
of several means simultaneously.
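As a quick side check (not part of the original slides), these pairwise-comparison counts are just binomial coefficients and can be verified in a couple of lines of Python:

from math import comb

# Number of distinct pairs among k means, i.e. the number of two-sample t-tests required
print(comb(4, 2))    # 6 tests for k = 4
print(comb(10, 2))   # 45 tests for k = 10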
Evidently, we require a simpler
procedure for carrying out this kind of a test.
One such procedure is the Analysis of
Variance, introduced by Sir R.A. Fisher
(1890-1962) in 1923:
Analysis of Variance (ANOVA) is a
procedure which enables us to test the
hypothesis of equality of several population
means
(i.e.
H0 : μ1 = μ2 = μ3 = … = μk
against
HA: not all the means are equal)
The concept of Analysis of Variance is
closely related with the concept of
Experimental Design:
EXPERIMENTAL DESIGN
By an experimental design, we mean a
plan used to collect the data relevant to the
problem under study in such a way as to
provide a basis for valid and objective
inference about the stated problem.
The plan usually includes:
• The selection of treatments whose effects are to be studied,
• The specification of the experimental layout, and
• The assignment of treatments to the experimental units.
All these steps are accomplished before
any experiment is performed.
Experimental Design is a very vast
area. In this course, we will be presenting only a very basic introduction to this area.
There are two types of designs:
systematic and randomized designs.
Today, we will be discussing only the
randomized designs, and, in this regard, it
should be noted that for the randomized
designs, the analysis of the collected data is
carried out through the technique known as
Analysis of Variance.
Two of the very basic randomized
designs are:
i) The Completely Randomized (CR)
Design,
and
ii) The Randomized Complete
Block (RCB) Design
EXAMPLE:
An experiment was conducted to
compare the yields of three varieties of
potatoes.
Each variety was assigned at random to equal-size plots, four times.
The yields were as follows:
Variety
A B C
23 18 16
26 28 25
20 17 12
17 21 14
Test the hypothesis that the three varieties of potatoes do not differ in their yielding capabilities.
SOLUTION:
The first thing to note is that this is an
example of the Completely Randomized (CR)
Design.
We are assuming that all twelve of the
plots (i.e. farms) available to us for this
experiment are homogeneous (i.e. similar)
with regard to the fertility of the soil, the
weather conditions, etc., and hence, we are
assigning the three varieties to the twelve plots
totally at random.
Now, in order to test the hypothesis
that the mean yields of the three varieties
of potato are equal, we carry out the six-
step hypothesis-testing procedure, as given
below:
Hypothesis-Testing Procedure:
i) H0 : μA = μB = μC
HA : Not all the three means
are equal
ii) Level of Significance:
α = 0.05
iii) Test Statistic:

$$F = \frac{\text{MS Treatments}}{\text{MS Error}}$$

which, if H0 is true, has an F distribution with ν1 = k - 1 = 3 - 1 = 2 and ν2 = n - k = 12 - 3 = 9 degrees of freedom.
Step-4: Computations:
The computation of the test statistic
presented above involves quite a few steps,
including the formation of what is known as
the ANOVA Table.
First of all, let us consider what is
meant by the ANOVA Table (i.e. the
Analysis of Variance Table).
ANOVA Table

Source of Variation          df      Sum of Squares   Mean Squares   F
Between Treatments           k - 1   SST              MST            MST/MSE
Within Treatments (Error)    n - k   SSE              MSE
Total                        n - 1   TSS
Let us try to understand this table step
by step:
The very first column is headed
‘Source of Variation’, and under this
heading, we have three distinct sources of
variation:
‘Total’ stands for the overall variation
in the twelve values that we have in our
data-set.
Variety
A B C
23 18 16
26 28 25
20 17 12
17 21 14
As you can see, the values in our data-
set are 23, 26, 20, 17, 18, 28, and so on.
Evidently, there is a variation in these
values, and the term ‘Total’ in the lowest
row of the ANOVA Table stands for this
overall variation.
The term ‘Variation Between
Treatments’ stands for the variability that
exists between the three varieties of potato
that we have sown in the plots.
(In this example, the term ‘treatments’
stands for the three varieties of potato that
we are trying to compare.)
(The term ‘variation between treatments’
points to the fact that:
It is possible that the three varieties, or,
at least two of the varieties are significantly
different from each other with regard to their
yielding capabilities. This variability between
the varieties can be measured by measuring
the differences between the mean yields of the
three varieties.)
The third source of variation is
‘variation within treatments’. This points to
the fact that even if only one particular
variety of potato is sown more than once,
we do not get the same yield every time.
Variety
A B C
23 18 16
26 28 25
20 17 12
17 21 14
In this example, variety A was sown four
times, and the yields were 23, 26, 20, and 17 -
-- all different from one another!
Similar is the case for variety B as well
as variety C.
The variability in the yields of variety
A can be called ‘variation within variety A’.
Similarly, the variability in the yields
of variety B can be called ‘variation within
variety B’.
Also, the variability in the yields of
variety C can be called ‘variation within
variety C’.
We can say that the term ‘variability
within treatments’ stands for the combined
effect of the above-mentioned three
variations.
The ‘variation within treatments’ is
also known as the ‘error variation’.
This is so because we can argue that if
we are sowing the same variety in four plots
which are very similar to each other, then we
should have obtained the same yield from
each plot!
If it is not coming out to be the same
every time, we can regard this as some kind of
an ‘error’.
The second, third and fourth columns
of the ANOVA Table are entitled ‘degrees of
freedom’, ‘Sum of Squares’ and ‘Mean
Square’.
ANOVA Table

Source of Variation          df      Sum of Squares   Mean Squares   F
Between Treatments           k - 1   SST              MST            MST/MSE
Within Treatments (Error)    n - k   SSE              MSE            --
Total                        n - 1   TSS              --             --
The point to understand is that the
sources of variation corresponding to
treatments and error will be measured by
computing quantities that are called Mean
Squares, and ‘Mean Square’ can be defined
as:
$$\text{Mean Square} = \frac{\text{Sum of Squares}}{\text{Degrees of Freedom}}$$
Corresponding to these two sources of
variation, we have the following two
equations:
$$\text{(1)}\qquad \text{MS Treatment} = \frac{\text{SS Treatment}}{\text{d.f.}}$$

and

$$\text{(2)}\qquad \text{MS Error} = \frac{\text{SS Error}}{\text{d.f.}}$$
It has been mathematically proved that,
with reference to Analysis of Variance, the
degrees of freedom corresponding to the
Treatment Sum of Squares are k-1, and the
degrees of freedom corresponding to the
Error Sum of Squares are n-k.
Hence, the above two equations can be
written as:
$$\text{(1)}\qquad \text{MS Treatment} = \frac{\text{SS Treatment}}{k - 1}$$

and

$$\text{(2)}\qquad \text{MS Error} = \frac{\text{SS Error}}{n - k}$$
How do we compute the various sums
of squares?
The three sums of squares occurring in
the third column of the above ANOVA Table
are given by:
$$\text{(1)}\qquad \text{Total SS} = TSS = \sum_i \sum_j X_{ij}^2 - CF$$

$$\text{(2)}\qquad \text{SS Treatment} = SST = \frac{\sum_j T_{.j}^2}{r} - CF$$

where C.F. stands for 'Correction Factor', and is given by

$$CF = \frac{T_{..}^2}{n}$$

and r denotes the number of data-values per column (i.e. the number of rows).
With reference to the CR Design, it
should be noted that, in some situations, the
various treatments are not repeated an equal
number of times.
For example, with reference to the
twelve plots (farms) that we have been
considering above, we could have sown
variety A in five of the plots, variety B in
three plots, and variety C in four plots.
Going back to the formulae of various
sums of squares, the sum of squares for
error is given by
$$\text{(3)}\qquad \text{SS Error} = \text{Total SS} - \text{SS Treatment}, \quad \text{i.e.} \quad SSE = TSS - SST$$
It is interesting to note that,
Total SS = SS Treatment + SS Error
In a similar way, we have the equation:
Total d.f. = d.f. for Treatment + d.f. for
Error
It can be shown that the degrees of
freedom pertaining to ‘Total’ are n - 1.
Now,
n-1 = (k-1) + (n-k)
i.e.
Total d.f. = d.f. for Treatment + d.f. for Error
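As an illustration only, the formulas (1), (2) and (3) above can be collected into a short Python sketch for a balanced CR design; the function name one_way_anova and the column-wise input format are assumptions made here, not part of the original lecture:

def one_way_anova(columns):
    # columns: one list of yields per treatment, all of the same length r (balanced design)
    k = len(columns)                                          # number of treatments
    r = len(columns[0])                                       # replications per treatment
    n = k * r                                                 # total number of observations
    grand_total = sum(sum(col) for col in columns)            # T..
    cf = grand_total ** 2 / n                                 # correction factor, CF = T..^2 / n
    tss = sum(x ** 2 for col in columns for x in col) - cf    # (1) Total SS
    sst = sum(sum(col) ** 2 for col in columns) / r - cf      # (2) SS Treatment
    sse = tss - sst                                           # (3) SS Error
    mst, mse = sst / (k - 1), sse / (n - k)                   # mean squares
    return tss, sst, sse, mst, mse, mst / mse                 # the last value is F

# For the potato data above: one_way_anova([[23, 26, 20, 17], [18, 28, 17, 21], [16, 25, 12, 14]])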
The notations and terminology given in
the above equations relate to the following
table:
Variety          A           B           C        Total
              23 (529)    18 (324)    16 (256)    1109
              26 (676)    28 (784)    25 (625)    2085
              20 (400)    17 (289)    12 (144)     833
              17 (289)    21 (441)    14 (196)     926
T.j               86          84          67       237
T.j²            7396        7056        4489     18941
Σi Xij²         1894        1838        1221      4953  (Check)

(The figures in brackets are the squares of the yields; the Total column sums the squared values across each row, and the 'Check' total 4953 agrees with the column-wise sums Σi Xij².)
The entries in the body of the table i.e.
23, 26, 20, 17, and so on are the yields of
the three varieties of potato that we had
sown in the twelve farms.
The entries written in brackets next to
the above-mentioned data-values are the
squares of those values.
For example:
529 is the square of 23,
676 is the square of 26,
400 is the square of 20,
and so on.
Adding all these squares, we obtain:

$$\sum_i \sum_j X_{ij}^2 = 4953$$
The notation T.j stands for the total of the
jth column.
(The students must already be aware that, in
general, the rows of a bivariate table are
denoted by the letter ‘i’, whereas the columns
of a bivariate table are denoted by the letter ‘j’.
In other words, we talk about the ‘ith
row’, and the ‘jth column’ of a bivariate
table.)
The ‘dot’ in the notation T.j indicates
the fact that summation has been carried out
over i (i.e. over the rows).
In this example, the total of the values
in the first column is 86, the total of the
values in the second column is 84, and the
total of the values in the third column is 67.
Hence, T.j is equal to 237.
T.j is also denoted by T..
i.e.
T.. = T.j
The ‘double dot’ in the notation T..
indicates that summation has been carried
out over i as well as over j.
The row below T.j is that of T.j
2, and
squaring the three values of T.j, we obtain
the quantities 7396, 7056 and 4489.
Adding these, we obtain T.j
2 = 18941.
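The totals appearing in the table (the column totals T.j, their squares, and the two 'check' sums) can be reproduced with a small sketch like the following; storing the data column-wise in a dictionary is simply an assumption made for this illustration:

data = {"A": [23, 26, 20, 17], "B": [18, 28, 17, 21], "C": [16, 25, 12, 14]}

col_totals = {v: sum(col) for v, col in data.items()}                # T.j : A=86, B=84, C=67
grand_total = sum(col_totals.values())                               # T.. = 237
sum_sq_col_totals = sum(t ** 2 for t in col_totals.values())         # sum of T.j^2 = 18941
sum_sq_values = sum(x ** 2 for col in data.values() for x in col)    # sum of X_ij^2 = 4953 (Check)
print(col_totals, grand_total, sum_sq_col_totals, sum_sq_values)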
Now that we have obtained all the
required quantities, we are ready to
compute SS Total, SS Treatment, and SS
Error:
We have

$$CF = \frac{T_{..}^2}{n} = \frac{(237)^2}{12} = 4680.75$$

Hence, the total sum of squares is given by

$$TSS = \sum_i \sum_j X_{ij}^2 - CF = 4953 - 4680.75 = 272.25$$
Also, we have
$$SST = \frac{\sum_j T_{.j}^2}{r} - CF = \frac{18941}{4} - 4680.75 = 54.50$$
And, hence:
SS Error = SSE = TSS - SST
= 272.25 - 54.50 = 217.75
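These three results can be re-checked with a few lines of plain arithmetic (a sketch only; the totals 237, 4953 and 18941 are the ones obtained from the table above):

n, k, r = 12, 3, 4
cf = 237 ** 2 / n         # correction factor = 4680.75
tss = 4953 - cf           # Total SS          = 272.25
sst = 18941 / r - cf      # SS Treatment      = 54.50
sse = tss - sst           # SS Error          = 217.75
print(cf, tss, sst, sse)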
Also, in this example, we have n = 12 and k = 3; hence:
n - 1 = 11,
k - 1 = 2,
and
n - k = 9.
Substituting the above sums of squares and degrees of freedom in the ANOVA table, we obtain:
ANOVA Table

Source of Variation                           d.f.   Sum of Squares   Mean Square   Computed F
Between treatments (i.e. between varieties)     2        54.50
Error                                           9       217.75
Total                                          11       272.25
Now, the mean squares for treatments
and for error are very easily found by
dividing the sums of squares by the
corresponding degrees of freedom.
Hence, we have
ANOVA Table

Source of Variation                            df    Sum of Squares   Mean Squares   F
Between Treatments (i.e. Between Varieties)     2        54.50           27.25
Error                                           9       217.75           24.19       --
Total                                          11       272.25            --         --
As indicated earlier, the test-statistic
appropriate for testing the null hypothesis
H0 : μA = μB = μC
versus
HA : Not all the three means
are equal
is:

$$F = \frac{\text{MS Treatment}}{\text{MS Error}}$$

which, if H0 is true, has an F distribution with ν1 = k - 1 = 3 - 1 = 2 and ν2 = n - k = 12 - 3 = 9 degrees of freedom.
Hence, it is obvious that F will be
found by dividing the first entry of the
fourth column of our ANOVA Table by the
second entry of the same column i.e.
$$F = \frac{\text{MS Treatment}}{\text{MS Error}} = \frac{27.25}{24.19} = 1.13$$
We insert this computed value of F in
the last column of our ANOVA table, and
thus obtain:
ANOVA Table

Source of Variation                            df    Sum of Squares   Mean Squares   F
Between Treatments (i.e. Between Varieties)     2        54.50           27.25       1.13
Error                                           9       217.75           24.19       --
Total                                          11       272.25            --         --
The fifth step of the hypothesis-testing procedure is to determine the critical region.
With reference to the Analysis of
Variance procedure, it can be shown that it is
appropriate to establish the critical region in
such a way that our test is a right-tailed test.
In other words, the critical region is
given by:
Critical Region:
F > Fα(k - 1, n - k)
In this example:
The critical region is
F > F0.05 (2,9) = 4.26
vi) Conclusion:
Since the computed value of F = 1.13 does not fall in the critical region, we accept our null hypothesis and may conclude that, on average, there is no difference among the yielding capabilities of the three varieties of potatoes.
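For readers who want to cross-check the whole test in software, one possible sketch uses SciPy: f_oneway carries out this one-way ANOVA and also reports a p-value, while f.ppf returns the tabulated critical value (the p-value quoted in the comment is approximate):

from scipy.stats import f, f_oneway

a = [23, 26, 20, 17]
b = [18, 28, 17, 21]
c = [16, 25, 12, 14]

stat, p = f_oneway(a, b, c)
print(round(stat, 2), round(p, 2))    # F = 1.13, p ≈ 0.37, well above α = 0.05
print(round(f.ppf(0.95, 2, 9), 2))    # critical value F_0.05(2, 9) ≈ 4.26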
One important point that the students
should note is that the ANOVA technique
being presented here is valid under the
following assumptions:
1. The k populations (whose means are to
be compared) are normally distributed;
2. All k populations have equal variances, i.e. σ1² = σ2² = … = σk². (This property is called homoscedasticity.)
3. The k samples have been drawn
randomly and independently from the
respective populations.
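As a side note (not part of the original slides), these assumptions are often examined in practice with a normality test on each sample and a test for equality of variances; a possible sketch using SciPy is given below:

from scipy.stats import shapiro, levene

a = [23, 26, 20, 17]
b = [18, 28, 17, 21]
c = [16, 25, 12, 14]

# Normality within each sample (with only four values per group these tests have little power)
for name, sample in (("A", a), ("B", b), ("C", c)):
    print(name, shapiro(sample).pvalue)

# Equality of variances (homoscedasticity) across the three groups
print(levene(a, b, c).pvalue)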