1. R Statistical Software for Research Training Manual
Shambu Campus
Session Two
By Dechassa O. (PhD);
Wollega University
Department of Statistics.
January 1, 1980
2. 1 Statistical Analysis Using R
Visualizing Data Graphically in R
Symbols, Colors, and Sizes of Graphs
Basics of Graphics
Functions Relevant for Graphics:
The Histograms
Exporting R Graphics
Visualize the Relationship Between Two Variables
2 Fitting Simple Linear Regression in R
R-code for Simple Linear Regression
Fitting Multiple Linear Regression using R
Evaluating the Quality of the Model (Model Diagnosis)
Goodness of Fit Test
Multicollinearity
3 Logistic Regression
Multivariate Binary Logistic Regression
4 Experimental Design
Hypotheses Tests for a Difference in Means
Paired t-test in R
Design of Single-Factor ANOVA
Model for a Single-Factor Experiment
Analysis of Variance with Two Factors
Modeling Two Factors Experiment
3. Visualizing Data Graphically in R
Plotting is an essential need when analyzing data. One of the major
reasons for developing R was to enable users to create graphics and
charts easily and interactively.
Nothing really tells a story about your data as powerfully as good plots.
Graphics capture your data much better than summary statistics and
often show you features that you would not be able to glean from
summaries alone.
R has very powerful tools for graphical visualization of data, which can be an
excellent way of communicating your results in scientific publications.
High-level plotting functions create a new plot on the graphics device. R's
plot() function is illustrated in the example below.
4. Examples:
> # rapeseed+soybean = X, sunflower+soybean = Y
> X=c(32,29,38,36,30,25,29,32,25,26,25,31,28,23,26,26) # variable X
> Y=c(30,29,26,34,34,30,32,33,29,28,34,36,32,30,27,29) # variable Y
> plot(X,Y) # plot variable X versus Y
This produces the simple graph in the Figure below.
Figure: Scatter Plot
5. Symbols, Colors, and Sizes of Graphs
During our courses, the most frequently asked questions concerning
graphs are whether (1) the plotting symbols can be changed, (2)
different colors can be used, and (3) the size of the plotting symbols can
be varied conditional on the values of another variable.
Changing Plotting Characters: By default, the plot function uses open
circles (open dots) as plotting characters, but characters can be selected
from about 20 additional symbols.
The plotting character is specified with the pch option in the plot
function; its default value is 1 (which is the open dot or circle). Table
shows the symbols that can be obtained with the different values of pch.
pch Symbol               pch Symbol
1   ◦ (open circle)      8   ∗ (asterisk)
2   △ (open triangle)    9   ◇ with + (diamond-plus)
3   + (plus)             10  ⊕ (circle-plus)
4   × (cross)            13  ⊗ (circle-cross)
5   ◇ (open diamond)     16  • (filled circle)
6   ▽ (down triangle)
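As a short sketch of these options (the data values here are illustrative, not from the manual), pch, col, and cex can be combined in a single plot call:

```r
# Illustrative data; pch selects the symbol, col the color, cex the symbol size
X <- c(32, 29, 38, 36, 30, 25, 29, 32)
Y <- c(30, 29, 26, 34, 34, 30, 32, 33)
plot(X, Y, pch = 16, col = "blue", cex = 1.5)  # filled blue dots, 1.5x default size
points(X, Y + 1, pch = 2, col = "red")         # add red open triangles, shifted up by 1
```

The same symbols can also be chosen per point by passing a vector to pch or col.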
6. Symbols, Colors, and Sizes of Graphs
In this section we present several of the types of graphs and plots we will
be using throughout. The basic plotting system is implemented in the
graphics package.
library(graphics). It is already loaded when you start up R, but you
can use the help function to get a list of its functions:
library(help = graphics).
The main plotting function, plot(), is generic and many packages write
extensions to it to specialize plots. For one variable at a time, you can
make Box-plots, Bar-graphs, Histograms, Density plots and more.
A standard x-y plot has x and y title labels generated from the
expressions being plotted. You may, however, override these labels with
xlab= for the horizontal label and ylab= for the vertical label, and
also add main= for a title above the plot and sub= for a subtitle at the
very bottom of the plot.
Inside the plotting region, you can place points and lines that are either
specified in the plot call or added later with the functions points() and
lines().
7. Functions Relevant for Graphics:
Description                                                          R Command
Plot the values of a column (vector) X versus the index of X         plot(X)
Plot the values of a column (vector) X against those in Y            plot(X, Y)
Add points to a plot of X against Y                                  points(X, Y)
Add lines to a plot of X against Y                                   lines(X, Y)
Place the given text at the location specified by X and Y            text(X, Y, "text")
Place the given legend at the location specified by X and Y          legend(X, Y, legend=leg)
Description                                                          R Command
Add a title to the plot                                              main="title"
Character string for the x- (or y-) axis label                       xlab="X label", ylab="Y label"
Vector with minimum and maximum for the x- (or y-) axis              xlim=c(min(X), max(X)), ylim=c(min(Y), max(Y))
Color of plot, by name or number                                     e.g., col="black", col="red", col="green", col="blue", or by number: col=1, col=2, col=3, col=4
Description                                                          R Command
Plot a histogram of the frequencies of X                             hist(X)
Plot the density of a column (vector) X                              plot(density(X))
Boxplot of a column (vector) X                                       boxplot(X)
QQ-plot of a column (vector) X                                       qqnorm(X)
Scatter-plot matrix of X and Y                                       pairs(cbind(X, Y))
8. The Histograms
One of the most frequently used diagrams to depict a data distribution is
the histogram. You can get a reasonable impression of the shape of a
distribution by drawing a histogram; that is, a count of how many
observations fall within specified divisions (“range of numerical
values”) of the x-axis.
It is constructed in the form of side-by-side bars. Within a bar each data
value is represented by an equal amount of area.
The histogram shows at a glance whether a distribution is symmetric
(i.e. the same shape on either side of a line drawn through the center of
the histogram) or skewed (stretched out on one side – right- or
left-skewed).
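As a minimal sketch (the data are simulated here, since this slide introduces no dataset), a histogram of 100 normal observations:

```r
set.seed(1)                            # reproducible simulated data
x <- rnorm(100, mean = 50, sd = 10)    # 100 observations from N(50, 10^2)
hist(x,
     breaks = 10,                      # approximate number of divisions of the x-axis
     main = "Histogram of Simulated Data",
     xlab = "x", col = "lightblue")
```

For roughly symmetric data such as these, the bars should fall away similarly on both sides of the center.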
9. Exporting R Graphics
The plot panel allows you to export the current plot to different formats,
which can be very helpful. Graphic outputs can be saved in various
formats.
The export to image allows exporting to the PNG, JPG, SVG, ... formats.
To save a graphic:
1 Click the Plots tab window,
2 Click the Export button,
3 Modify the export settings as you desire, and
4 Click Save.
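Besides the Export button, a plot can also be saved directly from code by opening a file-based graphics device; a sketch (the file name here is an arbitrary example):

```r
# Open a PNG device, draw the plot into the file, then close the device
png("myplot.png", width = 600, height = 400)  # "myplot.png" is an example name
plot(c(1, 2, 3), c(2, 4, 6), main = "Saved Plot")
dev.off()                                     # the file is written when the device closes
```

The same pattern works with pdf(), jpeg(), and svg() for the other formats.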
10. Example: Bar Graph RNA Sequence Data
consider the RNA sequence below:
RNAsequence=c("A","U","G","C","U","U","C","G","A","A","U","G","C","U","G","U","A","U","G","A","U","G","U","C")
f=c(5,4,6,9) # frequencies of A, C, G, U in the sequence
N=c("A","C","G","U")
barplot(f,names=N,ylab="Frequency",xlab="RNA Residue Sequence",col=c(2,3,4,5),main="RNA Residue Analysis")
The Export menu has three options—Save Plot as Image..., Save Plot
as PDF..., or Copy Plot to Clipboard.... Choosing Save Plot as Image
yields the following popup:
11. Example: Bar Graph RNA
# To construct a bar-plot for the RNA sequence
barplot(f,names=N,ylab="Frequency",xlab="RNA Residue Sequence",col=c(2,3,4,5),main="RNA Residue Analysis")
12. Example: Pie-chart for RNA
# To construct a pie-chart for the RNA sequence
# (the percentages are f/sum(f): 5/24 = 20.8%, 4/24 = 16.7%, 6/24 = 25%, 9/24 = 37.5%)
pie(f,labels=c("A (20.8%)","C (16.7%)","G (25%)","U (37.5%)"),col=c(2,3,4,5),main="Pie-chart for RNA Sequence")
13. Visualize the Relationship Between Two Variables
To examine the relationship between two variables we can use the plot
command which, when applied to numeric objects, draws a scatter plot.
As an illustration, we first generate a set of n = 50 data points from the
linear model:
y = 0.5x + e, where e ∼ N(0, 0.1²) and x ∼ U(0, 1),
that is coded in R as:
n=50
x=runif(n)
y=0.5*x+rnorm(n,sd=0.1)
Next, using the command plot(x,y) we create a simple scatter plot of
x versus y. One could also use a formula, as in plot(y ∼ x).
Generally one will want to label the axes and add a title, as the following
code illustrates; the resulting scatter plot is presented in the Figure below.
plot(x,y, xlab="Explanatory Variable", ylab="Response Variable",
main="An Example of a Scatter Plot")
15. Example 2: Maize Yield Data
A study was made to analyze the effect of rainfall and temperature on
maize yield in Kogi state, Nigeria. The annual mean rainfall, mean
temperature and maize yield in the study area were recorded for the
period 2001 to 2010, as presented in the Table below, to assess whether
any variation in maize yield may directly result from rainfall and/or
temperature variation.
Year Mean Monthly Rainfall (cm) Mean Monthly Temperature (°C) Maize Yield (mt)
2001 83.58 34.0 234.0
2002 106.30 33.0 241.0
2003 82.18 32.5 250.0
2004 111.20 34.8 255.0
2005 78.28 33.4 214.0
2006 140.30 34.2 262.0
2007 125.10 34.6 289.3
2008 105.00 33.3 310.0
2009 138.10 34.0 333.2
2010 89.50 31.0 371.3
16. Fitting Simple Linear Regression in R
Regression analysis is used for explaining or modeling the relationship
between a single variable Y, called the response, output or
dependent variable, and one or more predictor, input, independent
or explanatory variables, X1, X2, · · · , Xp. When p = 1, it is called a
simple regression model, but when p > 1 it is called a multiple
regression (or sometimes multivariate regression) model.
In a regression model, the response must be a continuous variable, but
the explanatory variables can be continuous, discrete, or categorical.
Objectives of regression analyses:
I Prediction of future observations.
I Assessment of the effect of, or relationship between, explanatory
variables on the response.
I A general description of data structure.
Fitting a linear model in R is done using the lm() command. Notice the
syntax for specifying the predictors in the model.
17. R-code for Simple Linear Regression
Consider an n pairs of observations, for two variables X and Y, such
that (X, Y) = {(x1, y1) , (x2, y2) , · · · , (xn, yn)}.
Yi = α + β1Xi + εi . (1)
where Yi is a continuous response variable for the ith subject or
experimental unit, Xi is the corresponding value of an explanatory
variable, εi is the error term, α is an intercept parameter, and β1 is the
slope parameter.
fitted.model = lm(y ∼ x1) ## fitting simple linear regression
OR lm(dependent.variable ∼ independent.variable)
To get more information about the fitted model, summary(the model’s
name) can be used to get details about the Residuals, Coefficients,
Residual standard error, R², and Adjusted R². Moreover,
summary(aov(the model’s name)) is used to get the ANOVA table.
18. Example: Iris Data from the R Built-in Datasets
data(iris)
names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
attach(iris) # make the iris columns accessible by name
boxplot(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) # box plot of each variable's variability
19. Scatter Plot of Sepal Length versus its Width
plot(Sepal.Length, Sepal.Width) # scatter plot of Sepal.Length versus Sepal.Width
Figure: Scatter Plot
20. R-code for Fitting a Linear Regression Model to Sepal
Length versus its Width
model1 = lm(Sepal.Width ∼ Sepal.Length, data = iris)
summary(model1)
Call:
lm(formula = Sepal.Width ∼ Sepal.Length, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.1095 -0.2454 -0.0167 0.2763 1.3338
Coefficients:
. Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.41895 0.25356 13.48 <2e-16 ***
Sepal.Length -0.06188 0.04297 -1.44 0.152
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4343 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
21. R-code for Diagnosis of the Fitted Linear Regression
Model for Sepal Length versus its Width
plot(model1) # model diagnostics
22. Fitting Multiple Linear Regression using R
Multiple linear regression (MLR) is a method to test and establish linear
relationships between one dependent variable and two or more
independent variables.
Multiple Regression analysis is a conceptually simple method for
investigating functional relationships among the dependent (response)
variable, Yi , and one or more independent (predictor or explanatory)
variables, denoted by X1, X2, · · · , Xp.
Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε
where the parameter β0 is the y-intercept, which represents the
expected value of Y when each X is zero. The remaining parameters
β1, β2, · · · , βp are the regression coefficients of the predictors.
R-code for Multiple Regression
fitted.model = lm(y ∼ x1 + x2 + · · · + xp) # fitting multiple linear
regression.
23. Example:
The following data were obtained from R. H. Woodward (1984) and
redone in R by Axel Drefahl (2019), examining the packing conditions
of ammonium sulfate. The flow rate of ammonium sulfate (an inorganic
salt) was determined by letting defined quantities of salt flow through a
small funnel. The flow rate is the dependent variable Y, depending on
salt characteristics such as moisture content, crystal shape and
impurities. In the example, the independent variables are: X1= initial
moisture content in units of 0.01 %, X2= length/breadth ratio for crystals
and X3= percent impurity in units of 0.01 %. Their values are listed in
Table below.
25. Fitting Multiple Linear Regression for the Ammonium Data using R
pairs(data1[,-1]) # plot the scatter-plot matrix (excluding the first column)
linearmodel = lm(Yi ∼ X1i + X2i + X3i, data = data1) # fit the multiple linear
model
summary(linearmodel)
Call:
lm(formula = Yi ∼ X1i + X2i + X3i , data = data1)
Residuals:
. Min 1Q Median 3Q Max
-1.78963 -0.62872 0.08172 0.48581 1.42336
Coefficients:
. Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.72824 0.66719 10.085 5.15e-13 ***
X1i -0.04854 0.02799 -1.734 0.08990 .
X2i -0.56733 0.25315 -2.241 0.03011 *
X3i -0.16577 0.05088 -3.258 0.00217 **
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8466 on 44 degrees of freedom
Multiple R-squared: 0.5743, Adjusted R-squared: 0.5452
F-statistic: 19.78 on 3 and 44 DF, p-value: 2.873e-08
26. Constructing Confidence Interval Estimate of
Coefficients for Multiple Linear Regression for the
Ammonium Data using R
confint(linearmodel)# compute the confidence interval
. lower(2.5%) upper(97.5%)
(Intercept) 5.3836188 8.072870440
X1i -0.1049522 0.007873304
X2i -1.0775074 -0.057145301
X3i -0.2683166 -0.063223995
27. Constructing ANOVA Table for fitted Multiple Linear
Regression for the Ammonium Data using R
Under the F distribution and the assumption that the Y and X variables
have a multivariate normal distribution, we have:

F = [SSRegression/(p − 1)] / [SSError/(n − p)] ∼ F(p−1, n−p) (2)

Reject H0 if the p-value (Sig.) is smaller than α = 0.05.
Table: ANOVA table for testing significance in multiple linear regression
with p parameters including β0 in vector β using n observations.
Source     Sum of Squares (SS) d.f. MS = SS/d.f.        F Value
Regression SSRegression        p−1  MSR = SSR/(p − 1)   F = MSR/MSE
Residual   SSError             n−p  MSE = SSE/(n − p)
Total      SSTotal             n−1
28. R-code for Analysis of Variance (ANOVA) for the Fitted Multiple Linear
Regression Model
aov(linearmodel)# construct ANOVA table
Call:
aov(formula = linearmodel)
Terms:
. X1i X2i X3i Residuals
Sum of Squares 32.21928 2.71539 7.60817 31.53921
Deg. of Freedom 1 1 1 44
Residual standard error: 0.8466405
Estimated effects may be unbalanced
summary(aov(linearmodel))
. Df Sum Sq Mean Sq F value Pr(>F)
X1i 1 32.22 32.22 44.949 3.1e-08 ***
X2i 1 2.72 2.72 3.788 0.05802 .
X3i 1 7.61 7.61 10.614 0.00217 **
Residuals 44 31.54 0.72
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
29. Evaluating the Quality of the Model (Model Diagnosis)
There are many different ways to evaluate a regression model’s quality.
Many of the techniques can be rather technical, and the details of them
are beyond the scope of this tutorial. However, the function summary()
extracts some additional information that we can use to determine how
well the data fit the resulting model.
Quantile-Quantile Normality Plot of Fitted
Residuals
The residuals are the differences between the actual measured values
and the corresponding values on the fitted regression line.
Each data point’s residual is the distance that the individual data point is
above (positive residual) or below (negative residual) the regression line.
If the line is a good fit to the data, we would expect
summary(residuals(model)) to show:
I Residual values that are normally distributed around a mean of
zero.
I A median value of all of the residuals near zero.
I First-quartile (Q1) and third-quartile (Q3) values of the sorted
residuals of roughly the same magnitude.
30. Quantile-Quantile Normality Plot of Fitted Residuals
for Multiple Linear Regression for the Ammonium Data
using R
plot(linearmodel)# Diagnosis test for model fitted
Figure: Fitted vs Residual (left) and QQ-Normality Plot of Residuals (right)
31. Goodness of Fit Test
How well does the model fit the data? One measure is R², the
so-called coefficient of determination (also called the squared coefficient
of multiple correlation, or the percentage of variance explained).
Coefficient of determination: From the sums of squares defined in the
ANOVA Table above, one can define a measure of model adequacy by
the statistic:

R² = SSReg. / SSTotal , (3)
Coefficient of Determination for the Fitted Multiple Linear
Regression for the Ammonium Data using R
From the fitted regression output above we have Multiple R-squared
(R²) = 0.5743, or, using the ANOVA table above, it can be computed
as follows:
Regression Sum of Squares = SS(X1i) + SS(X2i) + SS(X3i)
= 32.21928 + 2.71539 + 7.60817 = 42.54284
Total Sum of Squares = Regression SS + Residual SS
= 42.54284 + 31.53921 = 74.08205
so that R² = 42.54284/74.08205 ≈ 0.5743.
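The arithmetic above can be checked directly in R from the ANOVA sums of squares:

```r
SS.reg    <- 32.21928 + 2.71539 + 7.60817  # regression sum of squares from the ANOVA table
SS.resid  <- 31.53921                      # residual sum of squares
SS.total  <- SS.reg + SS.resid             # total sum of squares
R.squared <- SS.reg / SS.total             # coefficient of determination, Equation (3)
R.squared                                  # approximately 0.5743, matching summary(linearmodel)
```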
32. Multicollinearity
In multiple regression problems, we expect to find dependencies
between the response variable Y and the regressors Xj .
In most regression problems, however, we find that there are also
dependencies among the regressor variables Xj. In situations where
these dependencies are strong, we say that multicollinearity exists.
Multicollinearity can have serious effects on the estimates of the
regression coefficients and on the general applicability of the estimated
model.
We define the variance inflation factor (VIF) for β̂j as:

VIF(β̂j) = 1 / (1 − R²j), j = 1, 2, · · · , k, (4)

where R²j is the coefficient of determination obtained by regressing Xj
on the other regressor variables.
These factors are an important measure of the extent to which
multicollinearity is present.
The larger the variance inflation factor, the more severe the
multicollinearity. If the F-test for significance of regression is
significant, but tests on the individual regression coefficients are
not significant, multicollinearity may be present.
33. R code for Variance Inflation Factor of the Fitted
Multiple Linear Regression
vif(linearmodel) # compute the VIF for the fitted model (vif() is provided by the car package)
X1i X2i X3i
2.244051 1.113935 2.084959
setwd("C:/Users/hp/Desktop/RTraining") # change working directory
Maize=read.csv("Temprature.Rainfall.Maize.Yield.csv",header=TRUE) # import
data named Temprature.Rainfall.Maize.Yield.csv into R
Maize=data.frame(Maize[1:10,]) # keep the first 10 rows of the dataset
Mean.Rainfall=Maize[,1]
Mean.Temprature=Maize[,2]
Maize.Yield=Maize[,3]
Maize.data=data.frame(Maize.Yield,Mean.Rainfall,Mean.Temprature)
34. Two-Dimensional Scatter Plot of the Maize Yield Data
pairs(Maize.data) # scatter-plot matrix
Figure: Scatter Plot Matrix of Maize Yield Data
35. Three-Dimensional Scatter Plot of the Maize Yield Data
library(scatterplot3d)
scatterplot3d(Mean.Rainfall, Mean.Temprature, Maize.Yield,
type="h", xlab="Mean.Rainfall",
ylab="Mean.Temprature", zlab="Maize.Yield")
Figure: Three-Dimensional Scatter Plot of Maize Yield Data
36. R-code to Fit Multiple Linear Regression for Maize Yield Data
multilinermodel=lm(Maize.Yield ∼ Mean.Rainfall+Mean.Temprature)
summary(multilinermodel)
Call:
lm(formula = Maize.Yield ∼ Mean.Rainfall + Mean.Temprature)
Residuals:
Min 1Q Median 3Q Max
-61.934 -9.942 6.206 15.865 47.107
Coefficients:
. Estimate Std. Error t value Pr(>|t|)
(Intercept) 763.9463 200.4665 3.811 0.00662 **
Mean.Rainfall 0.6596 0.5205 1.267 0.24559
Mean.Temprature -16.0899 5.5368 -2.906 0.02279 *
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.72 on 7 degrees of freedom
Multiple R-squared: 0.5909, Adjusted R-squared: 0.474
F-statistic: 5.055 on 2 and 7 DF, p-value: 0.0438
37. Logistic Regression
Linear regression often works very well when the response variable is
quantitative. Logistic regression is a method for fitting a regression
curve, y = f(x), when y is a categorical variable.
The typical use of this model is predicting y given a set of predictors x.
The predictors can be continuous, categorical or a mix of both.
The categorical variable y, in general, can assume different values. In
the simplest case scenario y is binary meaning that it can assume
either the value 1 or 0.
These could be arbitrary assignments resulting from observing a
qualitative response. For example, in a study of a suspected carcinogen,
aflatoxin B1, a number of levels of the compound were fed to test
animals. After a period of time, the animals were sacrificed and the
number of animals having liver tumors was recorded. The response
variable is Y = 1 if the animal has a tumor and Y = 0 if the animal
fails to have a tumor.
38. Simple Binary Logistic Regression Model
Starting from the binary logistic regression model given in Equation (5),
we have the following:

log[P(Y)/(1 − P(Y))] = β0 + β1X
P(Y)/(1 − P(Y)) = e^(β0+β1X)
P(Y) = e^(β0+β1X) − P(Y)·e^(β0+β1X)
P(Y)(1 + e^(β0+β1X)) = e^(β0+β1X)
⇒ P(Y) = e^(β0+β1X) / (1 + e^(β0+β1X)) (5)

where,
e: base of the natural logarithm (≈ 2.71828).
The expression given in (5) is called a simple logistic regression
model, whose response variable is a dichotomous or binary outcome.
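Equation (5) can be evaluated directly in R; a sketch with illustrative coefficient values (β0 and β1 here are arbitrary, not from any fitted model):

```r
b0 <- 6.74; b1 <- -0.27                          # illustrative coefficients, not fitted values
x  <- 25                                          # an illustrative predictor value
p  <- exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))  # Equation (5) written out
# plogis() computes the same inverse-logit transform in one call:
all.equal(p, plogis(b0 + b1 * x))
```

Note that p always lies between 0 and 1, whatever the value of β0 + β1X.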
39. Example: Regressing mastitis on age at first calving in
cows
Is there an effect of age at first calving on incidence of mastitis in cows?
On a sample of 21 cows, the presence of mastitis (inflammation of the
mammary gland) and age at first calving (in months) were recorded as
in the Table below:
40. R-code for Logistic Regression
logisticmodel=glm(Mastities ∼ Age, data=data2, family=binomial)
summary(logisticmodel)
Call:
glm(formula = Mastities ∼ Age, family = binomial, data = data2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7745 -0.9551 0.6030 0.8607 1.6605
Coefficients:
. Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.7439 3.2640 2.066 0.0388 *
Age -0.2701 0.1315 -2.054 0.0399 *
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 29.065 on 20 degrees of freedom
Residual deviance: 23.842 on 19 degrees of freedom
AIC: 27.842
41. R-code to Construct Confidence Interval Estimate of
Coefficients for Simple Binary Logistic Regression
confint(logisticmodel)
. 2.5 % 97.5 %
(Intercept) 0.9129170 14.08481226
Age -0.5665479 -0.03596198
42. Multivariate Binary Logistic Regression
Many categorical response variables have only two categories. Denote a
binary response (dependent) variable by Y and its two possible
outcomes by 1 (“success”) and 0 (“failure”), and let X = (X1, X2, · · · , Xk)^T
be k independent variables that can be quantitative or dummy variables.
The multivariate binary logistic regression model is given by:

log[P(Y)/(1 − P(Y))] = β0 + β1X1 + β2X2 + · · · + βkXk
⇒ P(Y) = e^(β0+β1X1+β2X2+···+βkXk) / (1 + e^(β0+β1X1+β2X2+···+βkXk)), (6)

where,
P(Y): probability of Y occurring
e: base of the natural logarithm (≈ 2.71828)
β0: intercept on the y-axis
β1: coefficient of X1 in predicting the probability of Y
⋮
βk: coefficient of Xk in predicting the probability of Y.
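A sketch of fitting Equation (6) in R with simulated data (the data and coefficient values here are illustrative, not from the manual):

```r
set.seed(123)                                # reproducible simulated data
n  <- 100
x1 <- rnorm(n)                               # a quantitative predictor
x2 <- rbinom(n, 1, 0.5)                      # a dummy (0/1) predictor
p  <- plogis(-0.5 + 1.2 * x1 + 0.8 * x2)     # true success probabilities via Equation (6)
y  <- rbinom(n, 1, p)                        # binary response
fit <- glm(y ~ x1 + x2, family = binomial)   # multivariate binary logistic regression
summary(fit)                                 # coefficients are on the log-odds scale
```

As with the simple model, exp(coef(fit)) converts the fitted coefficients to odds ratios.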
43. Experimental Design: Example of Agricultural Experimentation
An experimenter wants to compare the crop yield and environmental
effects for two different fertilizers. The experimental units are separate
plots of land.
Some of these plots will be treated with Fertilizer A, and some with
Fertilizer B. For example, Fertilizer A may be the currently used fertilizer;
Fertilizer B is a newly developed alternative, perhaps one designed to
have the same or better crop growth yields but with reduced
environmental side effects.
In this conceptual experiment, the selection of the plots and the
experimental protocol will assure that the fertilizer used on one plot does
not bleed onto another. Schedules for the amount and timing of fertilizer
application will be set up.
Crops will be raised and harvested on each plot, and the crop production
and residual soil chemicals will be measured and compared to see if the
new fertilizer is performing as designed and is an improvement over the
current fertilizer.
Before experimentation, we cannot know whether Fertilizer B would give
comparable or better yields than A in a given year due, say, to
especially favorable growing conditions or experimental care compared
to previous years. To find out, we must run experiments.
45. Hypotheses Tests for a Difference in Means
Let X11, X12, · · · , X1n1 be a random sample of n1 observations
from the first population and X21, X22, · · · , X2n2 be a random
sample of n2 observations from the second population.
Let X̄1, X̄2, S²1 and S²2 be the sample means and sample
variances, respectively.
The expected value of the difference in sample means is
E(X̄1 − X̄2) = µ1 − µ2, so X̄1 − X̄2 is an unbiased estimator of the
difference in means (verify!).
We test hypotheses on the difference in means µ1 − µ2 of two
normal distributions whose variances are unknown.
A t-statistic will be used to test these hypotheses.
46. Hypotheses Tests for a Difference in Means cont...
The variance of X̄1 − X̄2 is

Var(X̄1 − X̄2) = σ²1/n1 + σ²2/n2 = σ²(1/n1 + 1/n2)

when the two population variances are equal to a common value σ².
The pooled estimator of σ², denoted by S²p, is defined by

S²p = [(n1 − 1)S²1 + (n2 − 1)S²2] / (n1 + n2 − 2)

Then

Tc = [(X̄1 − X̄2) − (µ1 − µ2)] / [Sp √(1/n1 + 1/n2)]

has a t-distribution with n1 + n2 − 2 degrees of freedom.
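A sketch of this pooled two-sample t-test in R (the samples here are illustrative); t.test() with var.equal=TRUE uses S²p and n1 + n2 − 2 degrees of freedom:

```r
# Illustrative samples from two groups (n1 = n2 = 10)
group1 <- c(32, 29, 38, 36, 30, 25, 29, 32, 25, 26)
group2 <- c(30, 29, 26, 34, 34, 30, 32, 33, 29, 28)
tt <- t.test(group1, group2, var.equal = TRUE)  # pooled-variance two-sample t-test
tt$parameter                                    # degrees of freedom: n1 + n2 - 2 = 18
```

Omitting var.equal=TRUE gives the Welch test instead, which does not assume equal variances.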
47. Paired t-test in R
Applications: comparing means of data from two related samples; say,
observations before and after an intervention on the same participant, or
measurements taken from the same participant under two different
conditions.
Example: Cholesterol Level In these data the mean cholesterol
level 4 weeks after the special diet (6.40) is higher than before the diet
(5.84). The similar standard deviations suggest that the spread of
the values at the two time points is similar. The paired t-test calculates a
paired difference for each subject and computes a test statistic from
these differences. If there were no change in cholesterol between the two
time points, the mean of the differences would be close to 0.
49. R-code for the Paired t-test
t.test(variable1, variable2,paired=T).
# Cholesterol Level
After4weeks=c(6.42,6.76,6.56,4.8,8.43,7.49,8.05,5.05,5.77,3.91,6.77,
6.44,6.17,7.67,7.34,6.85,5.13,5.73)
Before=c(5.83,6.2,5.83,4.27,7.71,7.12,7.25,4.63,5.31,3.7,6.15,5.59,5.56,
7.11,6.84,6.4,4.52,5.13)
t.test(After4weeks, Before,paired=T)
Paired t-test
data: After4weeks and Before
t = 15.439, df = 17,
p-value = 1.958e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.4887486 0.6434736
sample estimates:
mean of the differences
0.56611
Conclusion: The test statistic is t = 15.439 and the p-value is very small (p
< 0.001), so the null hypothesis is rejected, since p < 0.05, and a
statistically significant difference is concluded.
50. Design of Single-Factor ANOVA
The simplest ANOVA problem is referred to variously as a single-factor,
single-classification, or one-way ANOVA.
It involves the analysis of data from experiments in which more than
two treatments have been used.
The characteristic that differentiates the treatments or populations from
one another is called the factor (or treatment) under study, and the
different treatments or populations are referred to as the levels of the
factor.
The response for each of the m treatments is a random variable.
Treatment Observations          Totals Averages
1         y11 y12 · · · y1n    y1.    ȳ1.
2         y21 y22 · · · y2n    y2.    ȳ2.
⋮         ⋮                    ⋮      ⋮
m         ym1 ym2 · · · ymn    ym.    ȳm.
                                y..    ȳ..
51. Model for a Single-Factor Experiment
Yij = µ + τi + εij , i = 1, 2, · · · , m; j = 1, 2, · · · , n (7)
where Yij is a random variable denoting the (ij)th observation, µ is a
parameter common to all treatments called the overall mean, τi is a
parameter associated with the ith treatment called the ith treatment
effect, and εij is a random error component.
ANOVA Table for Single Factor Experiment
Source of Variation Sum of Squares Degrees of Freedom Mean Square F
Between             SSbetween      m−1                MSbetween   MSbetween/MSwithin
Within              SSwithin       m(n−1)             MSwithin
Total               SST            mn−1
52. Example: Bull Effect Data on Weight Gain of Calves
A researcher is interested in determining if the average daily gain in weight of
calves depends on the bull which sired the calf. The researcher has only five
bulls. The five bulls are mated with randomly selected cows and the average
daily gain in weight by the calves produced by the matings are recorded. The
data are given below.
Use these data to run an analysis of variance and test for a significant bull
effect. Use α = 0.05.
53. R-code for Single Factorial Design for Bull Effect Example
Weight.gain=c(1.2,1.39,1.36,1.39,1.22,1.31,1.16,1.08,
1.22,0.87,1.17,1.12,0.75,1.12,1.02,1.08,0.83,0.98,0.96,1.16,
1.05,1.00,1.12,1.15,0.99,0.85,1.10,1.03,0.94,0.89)
bull=gl(5,6,labels=c(1,2,3,4,5)) # generate the bull factor: 5 levels, 6 observations each
tapply(Weight.gain,bull,sum) #treatment total
. 1 2 3 4 5
7.87 6.62 5.78 6.44 5.80
tapply(Weight.gain,bull,mean) #treatment mean
. 1 2 3 4 5
1.3116667 1.1033333 0.9633333 1.0733333 0.9666667
55. R-code for ANOVA of Bull Example
summary(aov(Weight.gain ∼ bull)) # Analysis of variance
. Df Sum Sq Mean Sq F value Pr(>F)
bull 4 0.4839 0.12097 10.29
4.56e-05 ***
Residuals 25 0.2938 0.01175
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table in the R output, the column Pr(>F) = 4.56e−05
is the P-value, which is less than our level of significance α = 0.05.
Thus, we reject the null hypothesis, which states that there is no
significant difference in the daily weight gain of calves due to the
difference of bull which sired the calf.
Conclusion: There is a significant effect of the sire bull on the daily
weight gain of its calves at the α = 5% significance level.
56. R-code for Bartlett Test of Homogeneity of Variances for ANOVA of Bull
Example
bartlett.test(Weight.gain ∼ bull) #Bartlett Tests for Equality
of Variance
Bartlett test of homogeneity of variances
data: Weight.gain by bull
Bartlett’s K-squared = 2.4124, df = 4, p-value = 0.6604
Since the p-value (0.6604) exceeds 0.05, we conclude that all 5
variances are the same. The fitted-versus-residual and QQ-normality
plots of the residuals follow.
57. plot(fitted(aov(Weight.gain ∼ bull)),
residuals(aov(Weight.gain ∼ bull)), type="p", col=4, xlab="Fitted",
ylab="Residuals")
abline(h = 0, lty = 2, col=2)
Figure: Fitted Versus Residual Plot of Fitted ANOVA Model for Daily Weight Gain of Calves
58. qqnorm(residuals(aov(Weight.gain ∼ bull)), main = "Normal Q-Q Plot
of Residuals", col=4)
qqline(residuals(aov(Weight.gain ∼ bull)), col=2)
Figure: QQ-Normality Plot of Residuals of Fitted ANOVA Model for Daily
Weight Gain of Calves
59. Analysis of Variance with Two Factors
Here there are two factors, A and B, with r levels of factor A and m
levels of factor B; each replicate contains all rm treatment
combinations. When you are interested in the effects of two or
more factors on the response variable, the Two-Way ANOVA is
an analysis method for a quantitative outcome and two
categorical explanatory variables that are defined in such a way
that each experimental unit (subject) can be exposed to any
combination of one level of one explanatory variable and one level
of the other explanatory variable.
60.
                            Factor B
               1            2         ···       m          Row     Row
                                                          Totals  Average
           y111, y112,  y121, y122,   ···   y1m1, y1m2,
       1   ..., y11n    ..., y12n           ..., y1mn      y1..    ȳ1..
Factor     y211, y212,  y221, y222,   ···   y2m1, y2m2,
  A    2   ..., y21n    ..., y22n           ..., y2mn      y2..    ȳ2..
       ⋮        ⋮            ⋮                   ⋮          ⋮       ⋮
           yr11, yr12,  yr21, yr22,   ···   yrm1, yrm2,
       r   ..., yr1n    ..., yr2n           ..., yrmn      yr..    ȳr..
Column
Totals         y.1.         y.2.      ···       y.m.       y...
Column
Average        ȳ.1.         ȳ.2.      ···       ȳ.m.       ȳ...
The observation in the ijth cell for the kth replicate is denoted by Yijk . In
performing the experiment, the rmn observations would be run in
random order.
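The dot-subscript totals and averages in the layout above can be computed in R with tapply once the responses and the two factor indices are in vectors. A hypothetical toy layout (not data from the manual) with r = 2 levels of A, m = 3 levels of B, and n = 2 replicates per cell:

```r
# Hypothetical toy data: 2 levels of A, 3 levels of B, 2 replicates per cell
y <- c(5, 6, 7, 8, 9, 10, 4, 5, 6, 7, 8, 9)
A <- factor(rep(1:2, each = 6))          # factor A index i
B <- factor(rep(rep(1:3, each = 2), 2))  # factor B index j
tapply(y, A, mean)           # row averages    ybar_i..
tapply(y, B, mean)           # column averages ybar_.j.
tapply(y, list(A, B), mean)  # cell averages   ybar_ij.
```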
61. Additive Model of Two-Factorial Experiment

Yijk = µ + τi + βj + εijk,  i = 1, 2, ..., r;  j = 1, 2, ..., m;  k = 1, 2, ..., n   (8)

Model of Two-Factorial Experiment With Interaction

Yijk = µ + τi + βj + (τβ)ij + εijk,  i = 1, 2, ..., r;  j = 1, 2, ..., m;  k = 1, 2, ..., n   (9)
where µ is the overall mean effect, τi is the effect of the ith level of
factor A, βj is the effect of the jth level of factor B, (τβ)ij is the effect of
the interaction between A and B, and εijk is assumed to be a random
error component having a normal distribution with mean zero and
variance σ². Since there are two factors in the experiment, the test
procedure is sometimes called the two-way analysis of variance.
January 1, 1980 61 / 70
62. ANOVA Table for Two-Factor A and B Experiment

Source of    Sum of     Degrees of    Mean
Variation    Squares    Freedom       Square     F
A            SSA        r-1           MSA        MSA/MSError
B            SSB        m-1           MSB        MSB/MSError
AB           SSAB       (r-1)(m-1)    MSAB       MSAB/MSError
Error        SSError    rm(n-1)       MSError
Total        SST        rmn-1
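The degrees-of-freedom column partitions the total degrees of freedom, which is easy to verify for any r, m, n. A minimal arithmetic check in R, here with r = 3, m = 3, n = 2:

```r
r <- 3; m <- 3; n <- 2                  # levels of A, levels of B, replicates
df_A     <- r - 1                       # factor A
df_B     <- m - 1                       # factor B
df_AB    <- (r - 1) * (m - 1)           # interaction AB
df_error <- r * m * (n - 1)             # error
df_total <- r * m * n - 1               # total
stopifnot(df_A + df_B + df_AB + df_error == df_total)  # partition holds
```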
63. Example: Two Factors Experiment: Assay-Lab Calcium Content
A consumer product agency wants to evaluate the accuracy of
determining the level of calcium in a food supplement. There are a
large number of possible testing laboratories and a large number of
chemical assays for calcium. The agency randomly selects three
laboratories and three assays for use in the study. Each laboratory will
use all three assays in the study. Eighteen samples containing 10 mg
of calcium are prepared and each assay–laboratory combination is
randomly assigned to two samples. The determinations of calcium
content are given in Table below (numbers in parentheses are
averages for the assay–laboratory combinations).
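The slides with the raw data table and the model-fitting call are missing from this extraction. Assuming the 18 determinations are stored in a vector Calcium with factors Assay and Lab, the object level.Calcium analysed below would be fit along these lines (the numeric values here are hypothetical placeholders, not the agency's measurements):

```r
# Hypothetical placeholder data: 3 assays x 3 labs x 2 samples per combination
Assay   <- factor(rep(1:3, each = 6))
Lab     <- factor(rep(rep(1:3, each = 2), 3))
Calcium <- c(10.1, 10.3,  9.8, 10.0, 10.4, 10.2,
              9.9, 10.1, 10.5, 10.6,  9.7,  9.9,
             10.2, 10.0, 10.3, 10.1,  9.8, 10.0)
level.Calcium <- lm(Calcium ~ Assay * Lab)  # main effects + interaction
anova(level.Calcium)                        # two-way ANOVA table
```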
66. R-code for ANOVA of Two-Factorial Experiment for Calcium Concentration
Example
anova(level.Calcium)
Analysis of Variance Table

Response: Calcium
          Df Sum Sq Mean Sq F value   Pr(>F)
Assay      2   1.56  0.7800  5.6613 0.025597 *
Lab        2   7.56  3.7800 27.4355 0.000148 ***
Assay:Lab  4   1.64  0.4100  2.9758 0.080332 .
Residuals  9   1.24  0.1378
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
67. Interaction Plot for Two-Factorial Experiment for
Calcium Concentration Example
interaction.plot(Lab, Assay, Calcium, col=c(2,4,6), main="(a)", pch=20)
From the figure, we observe that the mean calcium content in the
chemical assays varies across laboratories, and the profiles are not
parallel, which is consistent with the mild (non-significant at 5%)
Assay:Lab interaction in the ANOVA table.
68. Diagnosis Test for Assay-Lab Model
e <- residuals(level.Calcium)
f <- fitted(level.Calcium)
plot(Assay, e, xlab="Assay", ylab="Residual")
plot(Lab, e, xlab="Lab", ylab="Residual")
The coefficient of determination is one method of diagnosing our
model: it estimates the percentage of the variability in calcium content
that is explained by the different assay chemicals and the different
laboratories.
R² = SSModel / SSTotal = 10.76 / 12 = 0.897

where
SSModel = SSAssay + SSLab + SSAssay∗Lab = 1.56 + 7.56 + 1.64 = 10.76.
That is, about 90% of the variability in the calcium content is explained
by the different assay chemicals, the different laboratories, and
their interaction.
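The same R² can be recovered from the ANOVA table printed earlier (a quick arithmetic check in R):

```r
# Sums of squares copied from the ANOVA table above
ss_assay <- 1.56; ss_lab <- 7.56; ss_int <- 1.64; ss_res <- 1.24
ss_model  <- ss_assay + ss_lab + ss_int  # 10.76
ss_total  <- ss_model + ss_res           # 12.00
r_squared <- ss_model / ss_total
round(r_squared, 3)                      # → 0.897
```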
69. Figure: Box Plot for Residual Versus Assay (left) and Residual Versus Lab (right)
Figures (a) and (b) above plot the residuals against the three chemical
assays used and the three laboratories where the experiment was run,
respectively. Both plots indicate mild inequality of variance, with the
treatment combination of Assay 3 and Lab 3 possibly having larger
variance than the others.
70. plot(f, e, col=4, lty=0, lwd=6, pch=20, xlab="Fitted", ylab="Residual",
main="(a) Fitted Versus Residual Plot")
abline(h = 0, lty = 2, lwd = 3, col = 2)
qqnorm(e, col=4, lty=0, lwd=6, main="(b) QQ-Normality Plot", pch=20)
qqline(e, col = 2)
Figure: Fitted Versus Residual Plot and QQ-Normality Plot