R Statistical Software for Research Training Manual
Shambu Campus
Session Two
By Dechassa O. (PhD);
Wollega University
Department of Statistics.
January 1, 1980
1 Statistical Analysis Using R
Visualizing Data Graphically in R
Symbols, Colors, and Sizes of Graphs
Basics of Graphics
Functions Relevant for Graphics:
The Histograms
Exporting R Graphics
Visualize the Relationship Between Two Variables
2 Fitting Simple Linear Regression in R
R-code for Simple Linear Regression
Fitting Multiple Linear Regression using R
Evaluating the Quality of the Model (Model Diagnosis)
Goodness of Fit Test
Multicollinearity
3 Logistic Regression
Multivariate Binary Logistic Regression
4 Experimental Design
Hypotheses Tests for a Difference in Means
Paired t-test in R
Design of Single-Factor ANOVA
Model for a Single-Factor Experiment
Analysis of Variance with Two Factors
Modeling Two Factors Experiment
Visualizing Data Graphically in R
Plotting is an essential need when analyzing data. One of the major
reasons for developing R was to enable users to create graphics and
charts easily and interactively.
Nothing really tells a story about your data as powerfully as good plots.
Graphics capture your data much better than summary statistics and
often show you features that you would not be able to glean from
summaries alone.
R has very powerful tools for graphical visualization of data, which can be
an excellent way of communicating your results in scientific publications.
High-level plotting functions create a new plot on the graphics device; the
most basic of these is the plot() function, as illustrated in the example below.
Examples:
> # rapeseed+soybean = X, sunflower+soybean = Y
> X = c(32,29,38,36,30,25,29,32,25,26,25,31,28,23,26,26) # variable X
> Y = c(30,29,26,34,34,30,32,33,29,28,34,36,32,30,27,29) # variable Y
> plot(X, Y) # plot variable X versus Y
This produces the simple graph in Figure below.
Figure: Scatter Plot
Symbols, Colors, and Sizes of Graphs
During our courses, the most frequently asked questions concerning
graphs are whether (1) the plotting symbols can be changed, (2)
different colors can be used, and (3) the size of the plotting symbols can
be varied conditional on the values of another variable.
Changing Plotting Characters: By default, the plot function uses open
circles (open dots) as plotting characters, but characters can be selected
from about 20 additional symbols.
The plotting character is specified with the pch option in the plot
function; its default value is 1 (the open dot or circle). The table below
shows the symbols that can be obtained with the different values of pch.
pch  Symbol                 pch  Symbol
1    open circle (default)  9    diamond plus
2    open triangle          10   circled plus
3    plus sign              13   circled cross
4    cross                  16   filled circle
5    open diamond           8    asterisk
6    downward triangle
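As a quick illustration of these options (the data here are made up purely for demonstration), the following sketch combines pch with the col and cex options to change symbol, color, and size:

```r
# Illustrative sketch: vary plotting symbol (pch), color (col) and size (cex)
x <- 1:10
y <- x
plot(x, y, pch = 16, col = "blue", cex = 2)  # filled blue circles, twice default size
points(x, rev(y), pch = 2, col = "red")      # add red open triangles to the same plot
```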
Symbols, Colors, and Sizes of Graphs
In this section we present several of the types of graphs and plots we will
be using throughout. The basic plotting system is implemented in the
graphics package.
library(graphics). It is already loaded when you start up R, but you
can use the help system to get a list of its functions:
library(help = graphics).
The main plotting function, plot(), is generic and many packages write
extensions to it to specialize plots. For one variable at a time, you can
make Box-plots, Bar-graphs, Histograms, Density plots and more.
A standard x-y plot has x and y title labels generated from the
expressions being plotted. You may, however, override these labels and
also add further titles: xlab="..." for the horizontal label, ylab="..." for
the vertical label, main="..." for a title above the plot, and sub="..." for
a subtitle at the very bottom of the plot.
Inside the plotting region, you can place points and lines that are either
specified in the plot call or added later with sub-functions points and
lines.
Functions Relevant for Graphics:
Description                                                      R Command
Plot the values of a column (vector) X versus the index of X     plot(X)
Plot the values of a column (vector) X against those in Y        plot(X, Y)
Add points to a plot of column (vector) X against those in Y     points(X, Y)
Add lines to a plot of column (vector) X against those in Y      lines(X, Y)
Place the text given by text at the location given by X and Y    text(X, Y, text)
Place the legend given by leg at the location given by X and Y   legend(X, Y, leg)

Description                                                      R Command
Add title to the plot                                            main="Title"
Character for x- (or y-) axis label                              xlab="X label", ylab="Y label"
Vector with minimum and maximum for x- (or y-) axis              xlim=c(min(X), max(X)), ylim=c(min(Y), max(Y))
Color of plot, by name or number                                 col="black", col="red", col="green", col="blue", or col=1, col=2, col=3, col=4

Description                                                      R Command
Plot a histogram of the frequencies of X                         hist(X)
Plot the density of a column (vector) X                          plot(density(X))
Boxplot of a column (vector) X                                   boxplot(X)
QQ-plot of a column (vector) X                                   qqnorm(X)
Scatter-plot matrix of the columns of a data frame X             pairs(X)
The Histograms
One of the most frequently used diagrams to depict a data distribution is
the histogram. You can get a reasonable impression of the shape of a
distribution by drawing a histogram; that is, a count of how many
observations fall within specified divisions (“range of numerical
values”) of the x-axis.
It is constructed in the form of side-by-side bars. Within a bar each data
value is represented by an equal amount of area.
The histogram permits the detection at one glance as to whether a
distribution is symmetric (i.e. the same shape on either side of a line
drawn through the center of the histogram) or whether it is skewed
(stretched out on one side – right or left skewed).
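For instance, reusing the X values from the scatter-plot example earlier in this section, a histogram is produced with hist() (the number of breaks here is an illustrative choice):

```r
# Histogram of the X values from the earlier plot example
X <- c(32,29,38,36,30,25,29,32,25,26,25,31,28,23,26,26)
hist(X, breaks = 5, col = "grey",
     main = "Histogram of X", xlab = "X")  # counts of observations per bin
```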
Exporting R Graphics
The plot panel allows you to export the current plot to different formats,
which can be very helpful. Graphic outputs can be saved in various
formats.
The export to image allows exporting to the PNG, JPG, SVG, ... formats.
To save a graphic:
1 Click the Plots tab window,
2 Click the Export button,
3 Modify the export settings as you desire, and
4 Click Save.
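Graphics can also be exported programmatically, without the Export button, by opening a graphics device, drawing the plot, and closing the device. A minimal sketch (the filename is illustrative):

```r
# Save a plot directly to a PNG file using a graphics device
png("scatter_plot.png", width = 600, height = 400)  # open the PNG device
plot(1:10, (1:10)^2, main = "Saved to file")        # draw on that device
dev.off()                                           # close it; the file is written
```

The same pattern works with pdf(), jpeg() and svg() devices.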
Example: Bar Graph RNA Sequence Data
Consider the RNA sequence below:
RNAsequence = c("A","U","G","C","U","U","C","G","A","A","U","G",
"C","U","G","U","A","U","G","A","U","G","U","C")
f = c(5,4,6,9) # frequencies of the residues A, C, G, U
N = c("A","C","G","U")
barplot(f, names=N, ylab="Frequency", xlab="RNA Residue
Sequence", col=c(2,3,4,5), main="RNA Residue Analysis")
The Export menu has three options—Save Plot as Image..., Save Plot
as PDF..., or Copy Plot to Clipboard.... Choosing Save Plot as Image
yields the following popup:
Example: Bar Graph RNA
# To construct a bar-plot for the RNA sequence
barplot(f, names=N, ylab="Frequency", xlab="RNA Residue
Sequence", col=c(2,3,4,5), main="RNA Residue Analysis")
Example: Pie-chart for RNA
# To construct a pie-chart for the RNA sequence
pie(f, labels=c("A: 20.8%", "C: 16.7%", "G: 25%", "U: 37.5%"),
col=c(2,3,4,5), main="Pie-chart for RNA Sequence")
Visualize the Relationship Between Two Variables
To examine the relationship between two variables we can use the plot
command which, when applied to numeric objects, draws a scatter plot.
As an illustration, we first generate a set of n = 50 data points from the
linear model:
y = 0.5x + e, where e ~ N(0, 0.1^2) and x ~ U(0, 1),
that is coded in R as:
> n = 50
> x = runif(n)
> y = 0.5*x + rnorm(n, sd=0.1)
Next, using the command plot(x,y) we create a simple scatter plot of
x versus y. One could also use a formula, as in plot(y ~ x).
Generally one will want to label the axes and add a title, as the following
code illustrates; the resulting scatter plot is presented in the Figure below.
> plot(x, y, xlab="Explanatory Variable", ylab="Response Variable",
main="An Example of a Scatter Plot")
Figure: Scatter Plot
Example 2: Maize Yield Data
A study was made to analyze the effect of rainfall and temperature on
maize yield in Kogi state, Nigeria. The annual mean rainfall, mean
temperature and maize yield in the study area were recorded for the
period 2001 to 2010, as presented in the Table below, to examine
whether variation in maize yield can be attributed to rainfall and/or
temperature variation.
Year  Mean Monthly Rainfall (cm)  Mean Monthly Temperature (°C)  Maize Yield (mt)
2001 83.58 34.0 234.0
2002 106.30 33.0 241.0
2003 82.18 32.5 250.0
2004 111.20 34.8 255.0
2005 78.28 33.4 214.0
2006 140.30 34.2 262.0
2007 125.10 34.6 289.3
2008 105.00 33.3 310.0
2009 138.10 34.0 333.2
2010 89.50 31.0 371.3
Fitting Simple Linear Regression in R
Regression analysis is used for explaining or modeling the relationship
between a single variable Y, called the response, output or
dependent variable, and one or more predictor, input, independent
or explanatory variables, X1, X2, · · · , Xp. When p = 1, it is called a
simple regression model, but when p > 1 it is called a multiple
regression or sometimes a multivariate regression model.
In a regression model, the response must be a continuous variable, but
the explanatory variables can be continuous, discrete or
categorical.
Objectives of regression analyses:
- Prediction of future observations.
- Assessment of the effect of, or relationship between, explanatory
variables on the response.
- A general description of data structure.
Fitting a linear model in R is done using the lm() command. Notice the
syntax for specifying the predictors in the model.
R-code for Simple Linear Regression
Consider n pairs of observations on two variables X and Y, such
that (X, Y) = {(x1, y1), (x2, y2), · · · , (xn, yn)}.
Yi = α + β1Xi + εi . (1)
where Yi is a continuous response variable for the ith subject or
experimental unit, Xi is the corresponding value of an explanatory
variable, εi is the error term, α is an intercept parameter, and β1 is the
slope parameter.
fitted.model = lm(y ~ x1) ## fitting simple linear regression
OR lm(dependent variable ~ independent variable)
To get more information about the fitted model, summary(the model's
name) can be used to get details about Residuals, Coefficients,
Residual standard error, R², and Adjusted R². Moreover,
summary(aov(the model's name)) is used to get the ANOVA table.
Example: Iris Data from R's Built-in data(iris)
> data(iris)
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> attach(iris)
> boxplot(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) # box plot
for each variable
Scattered Plot of Sepal Length versus its Width
> plot(Sepal.Length, Sepal.Width) # scatter plot of Sepal.Length versus
Sepal.Width
Figure: Scatter Plot
R-code for Fitting Linear Regression Model to Sepal
Length versus its Width
model1 = lm(Sepal.Width ~ Sepal.Length, data = iris)
summary(model1)
Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
Residuals:
    Min      1Q  Median      3Q     Max
-1.1095 -0.2454 -0.0167  0.2763  1.3338
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.41895    0.25356   13.48   <2e-16 ***
Sepal.Length -0.06188    0.04297   -1.44    0.152
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4343 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
R-code for Diagnosis of the Fitted Linear Regression
Model to Sepal Length versus its Width
plot(model1) # model diagnostics
Fitting Multiple Linear Regression using R
Multiple linear regression (MLR) is a method to test and establish linear
relationships between one dependent variable and two or more
independent variables.
Multiple Regression analysis is a conceptually simple method for
investigating functional relationships among the dependent (response)
variable, Yi , and one or more independent (predictor or explanatory)
variables, denoted by X1, X2, · · · , Xp.
Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε
where the parameter β0 is the y-intercept, which represents the
expected value of Y when every X is zero. The others, β1, β2, · · · , βp,
are the slope parameters (partial regression coefficients) of the
multiple regression equation.
R-code for Multiple Regression
fitted.model = lm(y ~ x1 + x2 + · · · + xp) # fitting multiple linear
regression
Example:
The following data were obtained from R. H. Woodward (1984) and
redone in R by Axel Drefahl (2019), examining the packing conditions
of ammonium sulfate. The flow rate of ammonium sulfate, an inorganic
salt, was determined by letting defined quantities of salt flow through a
small funnel. The flow rate is the dependent variable Y, which depends
on salt characteristics such as moisture content, crystal shape and
impurities. In this example, the independent variables are: X1 = initial
moisture content in units of 0.01%, X2 = length/breadth ratio of the
crystals, and X3 = percent impurity in units of 0.01%. Their values are
listed in the Table below.
i X1i X2i X3i Yi
1 21 2.4 0 5.00
2 20 2.4 0 4.81
3 16 2.4 0 4.46
4 18 2.5 0 4.81
5 16 3.2 0 4.46
6 18 3.1 1 3.85
7 12 3.2 1 3.21
8 12 2.7 0 3.25
9 13 2.7 0 4.55
10 13 2.7 0 4.85
11 17 2.7 0 4.00
12 24 2.8 0 3.62
13 11 2.5 0 5.15
14 10 2.6 0 3.76
15 17 2.0 0 4.90
16 14 2.0 0 4.13
17 14 2.0 1 5.10
18 14 1.9 0 5.05
19 20 2.1 2 4.27
20 12 1.9 1 4.90
21 11 2.0 2 4.5
22 10 2.0 7 5.32
23 10 2.0 2 4.39
24 16 2.0 2 4.85
i X1i X2i X3i Yi
25 17 2.2 3 4.59
26 17 2.4 4 5.00
27 17 2.4 0 3.82
28 15 2.4 2 3.68
29 17 2.2 3 5.15
30 21 2.2 4 2.94
31 23 2.2 10 3.18
32 22 2.0 7 2.28
33 21 1.9 4 5.00
34 24 2.1 8 2.43
35 37 2.3 14 0
36 21 2.4 2 4.10
37 28 2.4 5 3.70
38 29 2.4 7 3.36
39 23 3.6 7 3.79
40 32 3.3 8 3.40
41 26 3.5 4 1.51
42 28 3.5 12 0
43 21 3.0 3 1.72
44 22 3.0 6 2.33
45 34 3.0 8 2.38
46 29 3.5 5 3.68
47 17 3.5 3 4.20
48 11 3.2 2 5.00
Table: Ammonium Sulfate Data
Fitting Multiple Linear Regression for the Ammonium Data using R
> pairs(data1[,-1]) # plot the scatter plot matrix
> linearmodel = lm(Yi ~ X1i + X2i + X3i) # fit the multiple linear
model
summary(linearmodel)
Call:
lm(formula = Yi ∼ X1i + X2i + X3i , data = data1)
Residuals:
. Min 1Q Median 3Q Max
-1.78963 -0.62872 0.08172 0.48581 1.42336
Coefficients:
. Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.72824 0.66719 10.085 5.15e-13 ***
X1i -0.04854 0.02799 -1.734 0.08990 .
X2i -0.56733 0.25315 -2.241 0.03011 *
X3i -0.16577 0.05088 -3.258 0.00217 **
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8466 on 44 degrees of freedom
Multiple R-squared: 0.5743, Adjusted R-squared: 0.5452
F-statistic: 19.78 on 3 and 44 DF, p-value: 2.873e-08
Constructing Confidence Interval Estimate of
Coefficients for Multiple Linear Regression for the
Ammonium Data using R
> confint(linearmodel) # compute the confidence interval
. lower(2.5%) upper(97.5%)
(Intercept) 5.3836188 8.072870440
X1i -0.1049522 0.007873304
X2i -1.0775074 -0.057145301
X3i -0.2683166 -0.063223995
Constructing ANOVA Table for fitted Multiple Linear
Regression for the Ammonium Data using R
Under the F distribution and the assumption that the Y and X variables
have a multivariate normal distribution, we have:
[SSRegression/(p − 1)] / [SSError/(n − p)] ∼ F(p−1, n−p). (2)
Reject H0 if the p-value (Sig.) is smaller than α = 0.05.
Table: ANOVA table for testing significance in multiple linear regression
with p parameters including β0 in vector β using n observations.
Source      Sum of Squares (SS)  d.f.  MS = SS/d.f.        F Value
Regression  SSRegression         p−1   MSR = SSR/(p − 1)   F = MSR/MSE
Residual    SSError              n−p   MSE = SSE/(n − p)
Total       SSTotal              n−1
R-code for Analysis of Variance (ANOVA) for the Fitted Multiple Liner
Regression Model
> aov(linearmodel) # construct ANOVA table
Call:
aov(formula = linearmodel)
Terms:
. X1i X2i X3i Residuals
Sum of Squares 32.21928 2.71539 7.60817 31.53921
Deg. of Freedom 1 1 1 44
Residual standard error: 0.8466405
Estimated effects may be unbalanced
> summary(aov(linearmodel))
. Df Sum Sq Mean Sq F value Pr(>F)
X1i 1 32.22 32.22 44.949 3.1e-08 ***
X2i 1 2.72 2.72 3.788 0.05802 .
X3i 1 7.61 7.61 10.614 0.00217 **
Residuals 44 31.54 0.72
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Evaluating the Quality of the Model (Model Diagnosis)
There are many different ways to evaluate a regression model’s quality.
Many of the techniques can be rather technical, and the details of them
are beyond the scope of this tutorial. However, the function summary()
extracts some additional information that we can use to determine how
well the data fit the resulting model.
Quantile-Quantile Normality Plot of Fitted
Residuals
The residuals are the differences between the actual measured values
and the corresponding values on the fitted regression line.
Each data point’s residual is the distance that the individual data point is
above (positive residual) or below (negative residual) the regression line.
If the line is a good fit to the data, we would expect
summary(residuals) to show:
- Residual values that are normally distributed around a mean of zero.
- A median value of the residuals near zero.
- First quartile (Q1) and third quartile (Q3) values of the sorted
residuals of roughly the same magnitude.
Quantile-Quantile Normality Plot of Fitted Residuals
for Multiple Linear Regression for the Ammonium Data
using R
plot(linearmodel) # diagnostic plots for the fitted model
Figure: Fitted vs Residual (left) and QQ-Normality Plot of Residuals (right)
Goodness of Fit Test
How well does the model fit the data? One measure is R², the
so-called coefficient of determination, also called the coefficient of
multiple correlation, or the percentage of variance explained.
Coefficient of determination: From the sums of squares defined in the
ANOVA Table above, one can define a measure of model adequacy by
the statistic:
R² = SSReg. / SSTotal , (3)
Coefficient of Determination for the Fitted Multiple Linear
Regression for the Ammonium Data using R
From the fitted regression output table above we have Multiple
R-squared (R²) = 0.5742665, or using the ANOVA table above it can
be computed as follows:
Regression Sum of Squares = SS(X1i) + SS(X2i) + SS(X3i)
= 32.21928 + 2.71539 + 7.60817
= 42.54284
Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares
= 42.54284 + 31.53921
= 74.08205
so R² = 42.54284/74.08205 = 0.5743.
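The arithmetic can be checked directly in R; the sums of squares below are copied from the ANOVA output for the ammonium data:

```r
# Recompute R^2 from the ANOVA sums of squares reported above
ss_x1 <- 32.21928; ss_x2 <- 2.71539; ss_x3 <- 7.60817
ss_resid <- 31.53921
ss_reg   <- ss_x1 + ss_x2 + ss_x3   # regression sum of squares: 42.54284
ss_total <- ss_reg + ss_resid       # total sum of squares: 74.08205
r_squared <- ss_reg / ss_total      # matches Multiple R-squared = 0.5743
round(r_squared, 4)
```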
Multicollinearity
In multiple regression problems, we expect to find dependencies
between the response variable Y and the regressors Xj .
In most regression problems, however, we find that there are also
dependencies among the regressor variables Xj themselves. In
situations where these dependencies are strong, we say that
multicollinearity exists.
Multicollinearity can have serious effects on the estimates of the
regression coefficients and on the general applicability of the estimated
model.
We define the variance inflation factor (VIF) for β̂j as:
VIF(β̂j) = 1 / (1 − R²j), j = 1, 2, · · · , k, (4)
where R²j is the coefficient of determination obtained when Xj is
regressed on the remaining regressors.
These factors are an important measure of the extent to which
multicollinearity is present.
The larger the variance inflation factor, the more severe the
multicollinearity. If the F-test for significance of regression is
significant, but tests on the individual regression coefficients are
not significant, multicollinearity may be present.
R code for Variance Inflation Factor of the Fitted
Multiple Linear Regression
> library(car) # vif() is provided by the car package
> vif(linearmodel) # compute the VIF for the fitted model
X1i      X2i      X3i
2.244051 1.113935 2.084959
> setwd("C:/Users/hp/Desktop/RTraining") # change working directory
> Maize = read.csv("Temprature.Rainfall.Maize.Yield.csv", header=TRUE) # import
data named Temprature.Rainfall.Maize.Yield.csv into R
> Maize = data.frame(Maize[1:10,]) # keep the first 10 rows of the dataset
> Mean.Rainfall = Maize[,1]
> Mean.Temprature = Maize[,2]
> Maize.Yield = Maize[,3]
> Maize.data = data.frame(Maize.Yield, Mean.Rainfall, Mean.Temprature)
Two-Dimensional Scatter Plot of Maize Yield Data
> pairs(Maize.data) # scatter plot matrix
Figure: Scatter Plot Matrix of Maize Yield Data
Three-Dimensional Scatter Plot for Maize Yield Data
> library(scatterplot3d)
> scatterplot3d(Mean.Rainfall, Mean.Temprature, Maize.Yield,
type="h", xlab="Mean.Rainfall",
ylab="Mean.Temprature", zlab="Maize.Yield")
Figure: Three Dimensional Scatter Plot Matrix of Maize Yield data
R-code to Fit multiple Linear Regression for Maize Yield Data
> multilinermodel = lm(Maize.Yield ~ Mean.Rainfall + Mean.Temprature)
> summary(multilinermodel)
Call:
lm(formula = Maize.Yield ∼ Mean.Rainfall + Mean.Temprature)
Residuals:
Min 1Q Median 3Q Max
-61.934 -9.942 6.206 15.865 47.107
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      763.9463   200.4665   3.811  0.00662 **
Mean.Rainfall      0.6596     0.5205   1.267  0.24559
Mean.Temprature  -16.0899     5.5368  -2.906  0.02279 *
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.72 on 7 degrees of freedom
Multiple R-squared: 0.5909, Adjusted R-squared: 0.474
F-statistic: 5.055 on 2 and 7 DF, p-value: 0.0438
Logistic Regression
Linear regression often works very well when the response variable is
quantitative. Logistic regression is a method for fitting a regression
curve, y = f(x), when y is a categorical variable.
The typical use of this model is predicting y given a set of predictors x.
The predictors can be continuous, categorical or a mix of both.
The categorical variable y, in general, can assume different values. In
the simplest case scenario y is binary meaning that it can assume
either the value 1 or 0.
These could be arbitrary assignments resulting from observing a
qualitative response. For example, in a study of a suspected carcinogen,
aflatoxin B1, a number of levels of the compound were fed to test
animals. After a period of time, the animals were sacrificed and the
number of animals having liver tumors was recorded. The response
variable is Y = 1 if the animal has a tumor and Y = 0 if the animal
fails to have a tumor.
Simple Binary Logistic Regression Model
A binary logistic regression model is given in Equation (5); we have the
following derivation:
log[P(Y)/(1 − P(Y))] = β0 + β1X
P(Y)/(1 − P(Y)) = e^(β0+β1X)
P(Y) = e^(β0+β1X) − P(Y)·e^(β0+β1X)
P(Y)[1 + e^(β0+β1X)] = e^(β0+β1X)
⇒ P(Y) = e^(β0+β1X) / (1 + e^(β0+β1X)) (5)
where,
e: natural logarithm base (≈ 2.71828)
The expression given in (5) is called a simple logistic regression
model, whose response variable is a dichotomous or binary
outcome.
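The final line of (5) is the inverse-logit transform, which R provides as plogis(). A small numerical sketch (the coefficient values here happen to be those from the mastitis example that follows, used only as illustrative numbers):

```r
# Check Eq. (5) numerically: P(Y) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))
b0 <- 6.7439; b1 <- -0.2701     # illustrative coefficient values
x  <- 25
p_manual  <- exp(b0 + b1*x) / (1 + exp(b0 + b1*x))
p_builtin <- plogis(b0 + b1*x)  # R's built-in inverse logit
all.equal(p_manual, p_builtin)  # TRUE: the two expressions agree
```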
Example: Regressing mastitis on age at first calving in
cows
Is there an effect of age at first calving on incidence of mastitis in cows?
On a sample of 21 cows, the presence of mastitis (inflammation of the
mammary gland) and the age at first calving (in months) were
recorded, as in the Table below:
R-code for Logistic Regression
> logisticmodel = glm(Mastities ~ Age, data=data2, family=binomial)
> summary(logisticmodel)
Call:
glm(formula = Mastities ~ Age, family = binomial, data = data2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7745 -0.9551 0.6030 0.8607 1.6605
Coefficients:
.           Estimate Std. Error z value Pr(>|z|)
(Intercept)   6.7439     3.2640   2.066   0.0388 *
Age          -0.2701     0.1315  -2.054   0.0399 *
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 29.065 on 20 degrees of freedom
Residual deviance: 23.842 on 19 degrees of freedom
AIC: 27.842
R-code to Construct Confidence Interval Estimate of
Coefficients for Simple Binary Logistic Regression
> confint(logisticmodel)
. 2.5 % 97.5 %
(Intercept) 0.9129170 14.08481226
Age -0.5665479 -0.03596198
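Because the coefficients in a logistic model are on the log-odds scale, exponentiating them (and their confidence limits) gives odds ratios. A sketch using the Age estimates printed above:

```r
# Odds ratio for age at first calving, from the estimates shown above
b_age  <- -0.2701                    # coefficient of Age (log-odds scale)
ci_age <- c(-0.5665479, -0.03596198) # its 95% confidence interval
exp(b_age)   # odds ratio: about 0.76 per additional month of age
exp(ci_age)  # 95% CI for the odds ratio, entirely below 1
```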
Multivariate Binary Logistic Regression
Many categorical response variables have only two categories. Denote a
binary response (dependent) variable by Y and its two possible
outcomes by 1 (“success”) and 0 (“failure”), and let X = (X1, X2, · · · , Xk)ᵀ
be k independent variables that can be quantitative or dummy variables.
The multivariate binary logistic regression model is given by:
log[P(Y)/(1 − P(Y))] = β0 + β1X1 + β2X2 + · · · + βkXk
⇒ P(Y) = e^(β0+β1X1+β2X2+···+βkXk) / (1 + e^(β0+β1X1+β2X2+···+βkXk)), (6)
where,
P(Y): probability of Y occurring
e: natural logarithm base (≈ 2.71828)
β0: intercept on the y-axis
β1: gradient or coefficient of X1 in predicting the probability of Y
⋮
βk: gradient or coefficient of Xk in predicting the probability of Y.
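A minimal sketch of fitting model (6) with glm(); the data below are simulated purely for illustration and are not from the manual:

```r
# Fit a multivariate binary logistic regression on simulated data
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- rbinom(100, 1, plogis(-0.5 + 1.2*x1 - 0.8*x2))  # binary response
fit <- glm(y ~ x1 + x2, family = binomial)            # model (6) with k = 2
coef(fit)  # estimates of beta0, beta1, beta2
```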
Experimental Design: Example of Agricultural Experimentation
An experimenter wants to compare the crop yield and environmental
effects for two different fertilizers. The experimental units are separate
plots of land.
Some of these plots will be treated with Fertilizer A, and some with
Fertilizer B. For example, Fertilizer A may be the currently used fertilizer;
Fertilizer B is a newly developed alternative, perhaps one designed to
have the same or better crop growth yields but with reduced
environmental side effects.
In this conceptual experiment, the selection of the plots and the
experimental protocol will assure that the fertilizer used on one plot does
not bleed onto another. Schedules for the amount and timing of fertilizer
application will be set up.
Crops will be raised and harvested on each plot, and the crop production
and residual soil chemicals will be measured and compared to see if the
new fertilizer is performing as designed and is an improvement over the
current fertilizer.
Before experimentation, we would not know whether Fertilizer B's yields
this year were comparable to or better than A's simply because of, say,
especially favorable growing conditions or experimental care compared
to previous years. To find out, we must run experiments.
Hypotheses Tests for a Difference in Means
Let X11, X12, · · · , X1n1 be a random sample of n1 observations
from the first population and X21, X22, · · · , X2n2 be a random
sample of n2 observations from the second population.
Let X̄1, X̄2, S²1 and S²2 be the sample means and sample
variances, respectively.
The expected value of the difference in sample means is
E(X̄1 − X̄2) = µ1 − µ2, so X̄1 − X̄2 is an unbiased estimator of the
difference in means (verify!).
We test hypotheses on the difference in means µ1 − µ2 of two
normal distributions whose variances are unknown.
A t-statistic will be used to test these hypotheses.
Hypotheses Tests for a Difference in Means cont...
The variance of X̄1 − X̄2 is
Var(X̄1 − X̄2) = σ²1/n1 + σ²2/n2 = σ²(1/n1 + 1/n2),
assuming σ²1 = σ²2 = σ².
The pooled estimator of σ², denoted by S²p, is defined by
S²p = [(n1 − 1)S²1 + (n2 − 1)S²2] / (n1 + n2 − 2).
Then
Tc = [(X̄1 − X̄2) − (µ1 − µ2)] / [Sp √(1/n1 + 1/n2)]
has a t-distribution with n1 + n2 − 2 degrees of freedom.
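The pooled statistic Tc can be computed by hand and checked against t.test(..., var.equal = TRUE); the two samples below are illustrative numbers only:

```r
# Pooled two-sample t statistic, computed by hand and verified with t.test()
x1s <- c(5.1, 4.8, 5.6, 5.0, 4.9)   # sample 1 (illustrative)
x2s <- c(4.2, 4.6, 4.1, 4.4)        # sample 2 (illustrative)
n1 <- length(x1s); n2 <- length(x2s)
sp2 <- ((n1 - 1)*var(x1s) + (n2 - 1)*var(x2s)) / (n1 + n2 - 2)  # pooled variance
tc  <- (mean(x1s) - mean(x2s)) / sqrt(sp2 * (1/n1 + 1/n2))
all.equal(tc, unname(t.test(x1s, x2s, var.equal = TRUE)$statistic))  # TRUE
```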
Paired t-test in R
Applications: comparing the means of two related samples; say,
observations before and after an intervention on the same participant, or
measurements from the same participant under two different
conditions.
Example: Cholesterol Level
The cholesterol level 4 weeks after the special diet is lower than before
the diet, with means of 5.84 and 6.40 respectively. The similar standard
deviations suggest that the spread of the values at the two time points is
similar. The paired t-test calculates a paired difference for each subject
and computes a test statistic from these differences. If there were no
change in cholesterol between the two time points, the mean difference
would be close to 0.
Subject Before After 4 weeks
1 6.42 5.83
2 6.76 6.2
3 6.56 5.83
4 4.8 4.27
5 8.43 7.71
6 7.49 7.12
7 8.05 7.25
8 5.05 4.63
9 5.77 5.31
10 3.91 3.7
11 6.77 6.15
12 6.44 5.59
13 6.17 5.56
14 7.67 7.11
15 7.34 6.84
16 6.85 6.4
17 5.13 4.52
18 5.73 5.13
R-code for Paired T-test
t.test(variable1, variable2, paired=T)
# Cholesterol Level (vectors follow the table: Before, then After 4 weeks)
Before=c(6.42,6.76,6.56,4.8,8.43,7.49,8.05,5.05,5.77,3.91,6.77,
6.44,6.17,7.67,7.34,6.85,5.13,5.73)
After4weeks=c(5.83,6.2,5.83,4.27,7.71,7.12,7.25,4.63,5.31,3.7,6.15,5.59,5.56,
7.11,6.84,6.4,4.52,5.13)
> t.test(After4weeks, Before, paired=T)
Paired t-test
data: After4weeks and Before
t = -15.439, df = 17,
p-value = 1.958e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6434736 -0.4887486
sample estimates:
mean of the differences
-0.56611
Conclusion: The test statistic is t = -15.439 and the p-value is very small
(p < 0.001), so the null hypothesis is rejected (p < 0.05) and a
statistically significant difference is concluded.
Design of Single-Factor ANOVA
The simplest ANOVA problem is referred to variously as a single-factor,
single-classification, or one-way ANOVA.
It involves the analysis of data from experiments in which more than
two treatments have been used.
The characteristic that differentiates the treatments or populations from
one another is called the factor (or treatment) under study, and the
different treatments or populations are referred to as the levels of the
factor.
The response for each of the m treatments is a random variable.
Treatment  Observations           Totals  Averages
1          y11 y12 · · · y1n      y1.     ȳ1.
2          y21 y22 · · · y2n      y2.     ȳ2.
⋮          ⋮                      ⋮       ⋮
m          ym1 ym2 · · · ymn      ym.     ȳm.
                                  y..     ȳ..
Model for a Single-Factor Experiment
Yij = µ + τi + εij ,  i = 1, 2, · · · , m;  j = 1, 2, · · · , n  (7)
where Yij is a random variable denoting the (ij)th
observation, µ is a
parameter common to all treatments called the overall mean, τi is a
parameter associated with the ith
treatment called the ith
treatment
effect, and εij is a random error component.
ANOVA Table for Single Factor Experiment
Source of   Sum of     Degrees of  Mean
Variation   Squares    Freedom     Square      F
Between     SSbetween  m − 1       MSbetween   MSbetween/MSwithin
Within      SSwithin   m(n − 1)    MSwithin
Total       SST        mn − 1
Example: Bull Effect Data on Weight Gain of Calves
A researcher is interested in determining if the average daily gain in weight of
calves depends on the bull which sired the calf. The researcher has only five
bulls. The five bulls are mated with randomly selected cows and the average
daily gain in weight by the calves produced by the matings are recorded. The
data are given below.
Use these data to run an analysis of variance and test for a significant bull
effect. Use α = 0.05.
R-code for Single Factorial Design for Bull Effect Example
> Weight.gain=c(1.2,1.39,1.36,1.39,1.22,1.31,1.16,1.08,
1.22,0.87,1.17,1.12,0.75,1.12,1.02,1.08,0.83,0.98,0.96,1.16,
1.05,1.00,1.12,1.15,0.99,0.85,1.10,1.03,0.94,0.89)
> bull=gl(5, 6, labels=c(1,2,3,4,5))
> tapply(Weight.gain, bull, sum) # treatment totals
   1    2    3    4    5
7.87 6.62 5.78 6.44 5.80
> tapply(Weight.gain, bull, mean) # treatment means
        1         2         3         4         5
1.3116667 1.1033333 0.9633333 1.0733333 0.9666667
boxplot(Weight.gain ~ bull, ylab="Daily Gain Weight", xlab="Bull
Sired")
Figure: Box Plot for Daily Weight Gain of Calves
R-code for ANOVA of Bull Example
> summary(aov(Weight.gain ~ bull)) # analysis of variance
          Df Sum Sq Mean Sq F value   Pr(>F)
bull       4 0.4839 0.12097   10.29 4.56e-05 ***
Residuals 25 0.2938 0.01175
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table in the R output, the column Pr(>F) = 4.56e-05
is the p-value, which is less than our level of significance α = 0.05.
Thus, we reject the null hypothesis that there is no significant difference
in the daily weight gain of calves due to the difference of bull
which sired the calf.
Conclusion: There is a significant effect of the bull which sired the calf on its
daily weight gain at the α = 5% significance level.
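Having rejected the overall null hypothesis, a natural follow-up (not covered in the slides) is to ask which bulls differ. Tukey's honest significant difference gives all pairwise comparisons at a family-wise 95% confidence level:

```r
# Bull data from the example above
Weight.gain <- c(1.2,1.39,1.36,1.39,1.22,1.31,1.16,1.08,
                 1.22,0.87,1.17,1.12,0.75,1.12,1.02,1.08,0.83,0.98,
                 0.96,1.16,1.05,1.00,1.12,1.15,0.99,0.85,1.10,1.03,0.94,0.89)
bull <- gl(5, 6, labels = c(1, 2, 3, 4, 5))
fit  <- aov(Weight.gain ~ bull)
TukeyHSD(fit)   # pairwise differences of bull means with adjusted p-values
```

Pairs whose adjusted p-value is below 0.05 (equivalently, whose confidence interval excludes zero) have significantly different mean daily gains.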
R-code for Bartlett Test of Homogeneity of Variances for ANOVA of Bull
Example
> bartlett.test(Weight.gain ~ bull) #Bartlett test for equality of variances
Bartlett test of homogeneity of variances
data: Weight.gain by bull
Bartlett's K-squared = 2.4124, df = 4, p-value = 0.6604
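As a cross-check (not shown in the manual), the reported p-value is just the upper tail of a chi-squared distribution with 4 degrees of freedom:

```r
# Reproduce the Bartlett p-value from its test statistic
p <- pchisq(2.4124, df = 4, lower.tail = FALSE)
round(p, 4)   # 0.6604, matching the bartlett.test() output
```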
Since Bartlett's K-squared = 2.4124 (df = 4) gives p-value = 0.6604 > α = 0.05,
we conclude that all 5 treatment variances are the same.
Fitted Versus Residual and QQ-Normality Test Plots of Residuals
> plot(fitted(aov(Weight.gain ~ bull)),
residuals(aov(Weight.gain ~ bull)), type="p", col=4, xlab="Fitted",
ylab="Residuals")
> abline(h = 0, lty = 2, col=2)
Figure: Fitted Versus Residual Plot of Fitted ANOVA Model for Daily Weight Gain of Calves
> qqnorm(residuals(aov(Weight.gain ~ bull)), main="Normal Q-Q Plot
of Residuals", col=4)
> qqline(residuals(aov(Weight.gain ~ bull)), col=2)
Figure: QQ-Normality Plot of Residuals of Fitted ANOVA Model for Daily
Weight Gain of Calves
Analysis of Variance with Two Factors
Here there are two factors, A and B, with r levels of factor A and m
levels of factor B; each replicate contains all rm treatment
combinations. When you are interested in the effects of two or
more factors on the response variable, the Two-Way ANOVA is
an analysis method for a quantitative outcome and two
categorical explanatory variables, defined in such a way
that each experimental unit (subject) can be exposed to any
combination of one level of one explanatory variable and one level
of the other explanatory variable.
                            Factor B
                 1               2               · · ·   m               Row Totals   Row Averages
Factor A   1     y111, y112,     y121, y122,     · · ·   y1m1, y1m2,     y1..         ȳ1..
                 · · · , y11n    · · · , y12n            · · · , y1mn
           2     y211, y212,     y221, y222,     · · ·   y2m1, y2m2,     y2..         ȳ2..
                 · · · , y21n    · · · , y22n            · · · , y2mn
           ...   ...             ...             · · ·   ...             ...          ...
           r     yr11, yr12,     yr21, yr22,     · · ·   yrm1, yrm2,     yr..         ȳr..
                 · · · , yr1n    · · · , yr2n            · · · , yrmn
Column Totals    y.1.            y.2.            · · ·   y.m.            y...
Column Averages  ȳ.1.            ȳ.2.            · · ·   ȳ.m.            ȳ...
The observation in the ijth cell for the kth replicate is denoted by Yijk . In
performing the experiment, the rmn observations would be run in
random order.
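One way to obtain such a random run order in R (a sketch, not part of the manual) is to enumerate all rmn treatment-replicate combinations and shuffle them with sample():

```r
# Sizes chosen to match the calcium example later in this section
r <- 3; m <- 3; n <- 2
runs <- expand.grid(rep = 1:n, B = 1:m, A = 1:r)  # all r*m*n observations
set.seed(1)                                        # reproducible shuffle
run.order <- runs[sample(nrow(runs)), ]            # randomized run order
head(run.order)
```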
Additive Model of Two Factorial Experiment
Yijk = µ + τi + βj + εijk ,  i = 1, 2, · · · , r;  j = 1, 2, · · · , m;  k = 1, 2, · · · , n.  (8)
Model of Two Factorial Experiment With Interaction
Yijk = µ + τi + βj + (τβ)ij + εijk ,  i = 1, 2, · · · , r;  j = 1, 2, · · · , m;  k = 1, 2, · · · , n.  (9)
where µ is the overall mean effect, τi is the effect of the ith level of
factor A, βj is the effect of the jth level of factor B, (τβ)ij is the effect of
the interaction between A and B, and εijk is assumed to be a random
error component having a normal distribution with mean zero and
variance σ2. Since there are two factors in the experiment, the test
procedure is sometimes called the two-way analysis of variance.
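To make model (9) concrete, here is a small simulation sketch (the effect sizes are artificial, my own choice) showing how R's lm() fits the two-way layout with interaction; the A, B, and A:B rows of the ANOVA table correspond to τi, βj, and (τβ)ij:

```r
# Simulate a balanced two-factor layout: r = 2, m = 3, n = 4
set.seed(42)
r <- 2; m <- 3; n <- 4
A <- gl(r, m * n)            # factor A: r levels
B <- gl(m, n, r * m * n)     # factor B: m levels, crossed with A
y <- 10 + as.numeric(A) + 0.5 * as.numeric(B) + rnorm(r * m * n, sd = 0.5)
anova(lm(y ~ A * B))         # rows for A, B, A:B and Residuals
```

The degrees of freedom in the output match the table that follows: r-1 for A, m-1 for B, (r-1)(m-1) for A:B, and rm(n-1) for the residuals.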
ANOVA Table for Two Factor A and B Experiment
Source of    Sum of     Degrees of    Mean
Variation    Squares    Freedom       Square     F
A            SSA        r-1           MSA        MSA/MSError
B            SSB        m-1           MSB        MSB/MSError
AB           SSAB       (r-1)(m-1)    MSAB       MSAB/MSError
Error        SSError    rm(n-1)       MSError
Total        SST        rmn-1
Example: Two Factors Experiment: Assay-Lab Calcium Content
A consumer product agency wants to evaluate the accuracy of
determining the level of calcium in a food supplement. There are a
large number of possible testing laboratories and a large number of
chemical assays for calcium. The agency randomly selects three
laboratories and three assays for use in the study. Each laboratory will
use all three assays in the study. Eighteen samples containing 10 mg
of calcium are prepared and each assay–laboratory combination is
randomly assigned to two samples. The determinations of calcium
content are given in Table below (numbers in parentheses are
averages for the assay–laboratory combinations).
R-Code for Two Factors Experiment: Assay-Lab Calcium Content
> Calcium=c(10.9,10.9,10.5,9.8,9.7,10.0,11.3,11.7,9.4,10.2,8.8,9.2,11.8,
11.2,10.0,10.7,10.4,10.7)
> Assay=gl(3, 6)
> Lab=gl(3, 2, 18, labels = c(1, 2, 3))
> tapply(Calcium, list(Assay, Lab), sum)
.     1    2    3
1  21.8 20.3 19.7
2  23.0 19.6 18.0
3  23.0 20.7 21.1
> tapply(Calcium, list(Assay, Lab), mean)
.     1     2     3
1  10.9 10.15  9.85
2  11.5  9.80  9.00
3  11.5 10.35 10.55
> level.Calcium=lm(Calcium ~ Assay*Lab)
R-code for ANOVA of Two Factorial Experiment for Calcium Concentration
Example
> anova(level.Calcium)
Analysis of Variance Table
Response: Calcium
.          Df Sum Sq Mean Sq F value    Pr(>F)
Assay       2   1.56  0.7800  5.6613  0.025597 *
Lab         2   7.56  3.7800 27.4355  0.000148 ***
Assay:Lab   4   1.64  0.4100  2.9758  0.080332 .
Residuals   9   1.24  0.1378
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
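As a cross-check (not in the slides), the tabulated p-values can be reproduced from the F statistics and their numerator/denominator degrees of freedom with pf():

```r
# Upper-tail F probabilities for the three effects (df2 = 9 residual df)
pf(5.6613,  2, 9, lower.tail = FALSE)   # Assay:     ~0.0256
pf(27.4355, 2, 9, lower.tail = FALSE)   # Lab:       ~0.000148
pf(2.9758,  4, 9, lower.tail = FALSE)   # Assay:Lab: ~0.0803
```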
Interaction Plot for Two Factorial Experiment for Calcium Concentration Example
> interaction.plot(Lab, Assay, Calcium, col=c(2,4,6), main="(a)", pch=20)
From the figure, we observe that the mean calcium content differs across
the three laboratories for each chemical assay, and the profiles are not
perfectly parallel, suggesting a possible assay-lab interaction.
Diagnosis Test for Assay-Lab Model
> e=residuals(level.Calcium)
> f=fitted(level.Calcium)
> plot(Assay, e, xlab="Assay", ylab="Residual")
> plot(Lab, e, xlab="Lab", ylab="Residual")
The coefficient of determination is one way to diagnose our
model: it estimates the percentage of the variability in calcium content
explained by the different assay chemicals and the different
laboratories.
R2 = SSModel/SSTotal = 10.76/12.00 = 0.897
where
SSModel = SSAssay + SSLab + SSAssay∗Lab = 1.56 + 7.56 + 1.64 = 10.76.
That is, about 90% of the variability in the calcium content is explained
by the different assay chemicals, the different laboratories, and their
interaction.
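The same calculation can be carried out directly from the sums of squares in the ANOVA table above:

```r
# Coefficient of determination from the ANOVA sums of squares
SS.model <- 1.56 + 7.56 + 1.64    # SSAssay + SSLab + SSAssay:Lab
SS.total <- SS.model + 1.24       # plus the residual sum of squares
R2 <- SS.model / SS.total
round(R2, 3)                      # 0.897
```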
Figure: Box Plot for Residual Versus Assay (left) and Residual Versus Lab (right)
Figures (a) and (b) above plot the residuals against the three chemical
assays used and the three laboratories where the experiment was
applied, respectively. Both plots indicate mild
inequality of variance, with the treatment combination of Assay 3 and
Lab 3 possibly having larger variance than the others.
> plot(f, e, col=4, lwd=6, pch=20, xlab="Fitted", ylab="Residual",
main="(a) Fitted Versus Residual Plot")
> abline(h = 0, lty = 2, lwd=3, col=2)
> qqnorm(e, col=4, lwd=6, main="(b) QQ-Normality Plot", pch=20)
> qqline(e, col=2)
Figure: Fitted Versus Residual Plot and QQ-Normality Plot
January 1, 1980 70 / 70

More Related Content

PPT
overviewoflivestockfeedsupplyinethiopia-120221074154-phpapp01 - Copy.ppt
PPTX
Mannualllllllllllllllllllllllllllll_1.pptx
PPTX
climate_change_impacts_on_agriculture.pptx
PPT
1926708387878y67555555-Grass-Legumes.ppt
PPT
Advanced biostatistics QTL mapping11.ppt
PPT
Lecture-4 Advanced biostatistics BLUP.ppt
PPT
Lecture-2Quantitative Genetics for animals.ppt
PPT
Factors affecting gene freq. p.p.presentation-1.ppt
overviewoflivestockfeedsupplyinethiopia-120221074154-phpapp01 - Copy.ppt
Mannualllllllllllllllllllllllllllll_1.pptx
climate_change_impacts_on_agriculture.pptx
1926708387878y67555555-Grass-Legumes.ppt
Advanced biostatistics QTL mapping11.ppt
Lecture-4 Advanced biostatistics BLUP.ppt
Lecture-2Quantitative Genetics for animals.ppt
Factors affecting gene freq. p.p.presentation-1.ppt

More from Fantahun Dugassa (20)

PPT
Lecture-8 Genetic analysis of Threshold characters PPP.ppt
PPT
Computation of Inbreeding coefficient and coefficient of relation ship-1.ppt
PPT
Research method for gggggggggggggggg.ppt
PPTX
ilri forage grass and legumes gene bank.pptx
PPTX
Improved forage selection agro-ecologies.pptx
PDF
lecturenote_535117675Microsoft PowerPoint - Fisheries and Aquacultur (Biol.30...
PDF
dssatslidesphil-110930024540-phpapp01.pdf
PPTX
ASACSSA Annual_Meeting_DSSAT_Update.pptx
PPTX
345pm_Boote_2_CROPGRO_PFM modelling.pptx
PPTX
Conservation Based Forage Development.pptx
PPT
3. conservation agriculture lecture note.ppt
PPTX
Forage yield and quality for animals ppt.pptx
PPTX
Conservation Based Forage Development.pptx
PPTX
345pm_Boote_2_CROPGRO-PFM modelling.pptx
PDF
Correlation and Regression in analysis.pdf
PPTX
Biostatistics Lecture on Correlation.pptx
PPTX
Bee-breeding and queen rearing-Haim-Efrat.pptx
PPTX
The common Ethiopian bee flora visited by worker bees
PPTX
Concepts of Queen Rearing for the beekeepers
PPT
internationaltradeppt-110207041415-phpapp02.ppt
Lecture-8 Genetic analysis of Threshold characters PPP.ppt
Computation of Inbreeding coefficient and coefficient of relation ship-1.ppt
Research method for gggggggggggggggg.ppt
ilri forage grass and legumes gene bank.pptx
Improved forage selection agro-ecologies.pptx
lecturenote_535117675Microsoft PowerPoint - Fisheries and Aquacultur (Biol.30...
dssatslidesphil-110930024540-phpapp01.pdf
ASACSSA Annual_Meeting_DSSAT_Update.pptx
345pm_Boote_2_CROPGRO_PFM modelling.pptx
Conservation Based Forage Development.pptx
3. conservation agriculture lecture note.ppt
Forage yield and quality for animals ppt.pptx
Conservation Based Forage Development.pptx
345pm_Boote_2_CROPGRO-PFM modelling.pptx
Correlation and Regression in analysis.pdf
Biostatistics Lecture on Correlation.pptx
Bee-breeding and queen rearing-Haim-Efrat.pptx
The common Ethiopian bee flora visited by worker bees
Concepts of Queen Rearing for the beekeepers
internationaltradeppt-110207041415-phpapp02.ppt
Ad

Recently uploaded (20)

PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
Lesson notes of climatology university.
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
GDM (1) (1).pptx small presentation for students
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PPTX
Pharma ospi slides which help in ospi learning
Final Presentation General Medicine 03-08-2024.pptx
01-Introduction-to-Information-Management.pdf
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Pharmacology of Heart Failure /Pharmacotherapy of CHF
human mycosis Human fungal infections are called human mycosis..pptx
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
FourierSeries-QuestionsWithAnswers(Part-A).pdf
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
Module 4: Burden of Disease Tutorial Slides S2 2025
Abdominal Access Techniques with Prof. Dr. R K Mishra
Lesson notes of climatology university.
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
Microbial disease of the cardiovascular and lymphatic systems
Anesthesia in Laparoscopic Surgery in India
GDM (1) (1).pptx small presentation for students
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
Pharma ospi slides which help in ospi learning
Ad

R_Satistical_Software_Trainning _Man.pdf

  • 1. R Statistical Software for Research Training Manual Shambu Campus Session Two By Dechassa O. (PhD); Wollega University Department of Statistics. January 1, 1980 By Dechassa O. (PhD); Wollega University Department of Statistics. R Statistical Software for Research Training Manual Shambu Campus Session Two January 1, 1980 1 / 70
  • 2. 1 Statistical Analysis Using R Visualizing Data Graphically in R Symbols, Colors, and Sizes of Graphs Basics of Graphics Functions Relevant for Graphics: The Histograms Exporting R Graphics Visualize the Relationship Between Two Variables 2 Fitting Simple Linear Regression in R R-code for Simple Linear Regression Fitting Multiple Linear Regression using R Evaluating the Quality of the Model (Model Diagnosis) Goodness of Fit Test Multicollinearity 3 Logistic Regression Multivariate Binary Logistic Regression 4 Experimental Design Hypotheses Tests for a Difference in Means Paired t-test in R Design of Single-Factor ANOVA Model for a Single-Factor Experiment Analysis of Variance with Two Factors Modeling Two Factors Experiment January 1, 1980 2 / 70
  • 3. Visualizing Data Graphically in R Plotting is an essential need when analyzing data. One of the major reasons for developing R was to enable users to create graphics and charts easily and interactively. Nothing really tells a story about your data as powerfully as good plots. Graphics capture your data much better than summary statistics and often show you features that you would not be able to glean from summaries alone. R has very powerful tools for graphical visualization data that can be an excellent way of communicating your result in scientific publications. High-level plotting functions create a new plot on the graphics device. It has a specific Plot() function as illustrated example below. January 1, 1980 3 / 70
  • 4. Examples: >#rape seed+soybean=x ,sunflower+soybean=y > X=c(32,29,38,36,30,25,29,32,25,26,25,31,28,23,26,26)#variable X >Y=c(30,29,26,34,34,30,32,33,29,28,34,36,32,30,27,29)#variable Y >plot(X,Y) #Plot variable X versus Y This produces the simple graph in Figure below. Figure: Scattered Plot January 1, 1980 4 / 70
  • 5. Symbols, Colors, and Sizes of Graphs During our courses, the most frequently asked questions concerning graphs are whether (1) the plotting symbols can be changed, (2) different colors can be used, and (3) the size of the plotting symbols can be varied conditional on the values of another variable. Changing Plotting Characters: By default, the plot function uses open circles (open dots) as plotting characters, but characters can be selected from about 20 additional symbols. The plotting character is specified with the pch option in the plot function; its default value is 1 (which is the open dot or circle). Table shows the symbols that can be obtained with the different values of pch. Number pch Symbol Number pch Symbol 1 ◦ 9 N 2 4 10 ⊕ 3 + 16 • 4 × 13 ⊗ 5 6 5 8 ∗ January 1, 1980 5 / 70
  • 6. Symbols, Colors, and Sizes of Graphs In this section we present several of the types of graphs and plots we will be using throughout. The basic plotting system is implemented in the graphics package. library(graphics) . It is already loaded when you start up R. But you can use the help function to get a list of the functions: library(help = graphics). The main plotting function, plot(), is generic and many packages write extensions to it to specialize plots. For one variable at a time, you can make Box-plots, Bar-graphs, Histograms, Density plots and more. A standard x-y plot has an x and a y title label generated from the expressions being plotted. You may, however, override these labels and also add two further titles xlab= for horizontal label, ylab= for vertical label, and main= for title above the plot and a subtitle at the very bottom, in the plots. Inside the plotting region, you can place points and lines that are either specified in the plot call or added later with sub-functions points and lines. January 1, 1980 6 / 70
  • 7. Functions Relevant for Graphics: Description R Command Plot the values of a column (vector) X versus the index of X plot(X) Plot the values of a column (vector) X against those in Y plot(X, Y) Add points to a plot of column (vector) X against those in Y points(X, Y) Add lines to a plot of column (vector) X against those in Y lines(X, Y) Place the text givenbytext at the location specified by X and Y text(X,Y,text) Place the legend givenbyleg at the location specified by X and Y legend(X,Y,text) Description R Command Add Title title to the plot main=title Character for x- (or y-) axis label xlab= X label, ylab= Y label Vector with minimum and maximum for x- (or y-) axis xlim= (min(X), max(X)), ylim= (min(y), max(y)), Color of Plot name or number name e.g.,col=black, col=red, col=green, col=blue or number: e.g., col=1, col=2, col=3, col=4 Description Command Plot a histogram of the frequencies of X hist(X) Plot the density of a column (vector) X plot(density(X)) Boxplot of a column (vector) X boxplot(X) QQ-plot of a column (vector) X qqnorm(X) Scatter-plot of a column (vector) X against those in Y pairs(X, Y) January 1, 1980 7 / 70
  • 8. The Histograms One of the most frequently used diagrams to depict a data distribution is the histogram. You can get a reasonable impression of the shape of a distribution by drawing a histogram; that is, a count of how many observations fall within specified divisions (“range of numerical values”) of the x-axis. It is constructed in the form of side-by-side bars. Within a bar each data value is represented by an equal amount of area. The histogram permits the detection at one glance as to whether a distribution is symmetric (i.e. the same shape on either side of a line drawn through the center of the histogram) or whether it is skewed (stretched out on one side – right or left skewed). January 1, 1980 8 / 70
  • 9. Exporting R Graphics The plot panel allows you to export the current plot to different formats, which can be very helpful. Graphic outputs can be saved in various formats. The export to image allows exporting to the PNG, JPG, SVG, ... formats. To save a graphic: 1 Click the Plots Tab window, 2 lick the Export button, 3 Modify the export settings as you desire, and 4 Click Save. January 1, 1980 9 / 70
  • 10. Example: Bar Graph RNA Sequence Data consider the RNA sequence below: RNAsequence =c(A,U,G,C,U,U,C,G,A,A,U,G,C,U,G,U,A,U,G, A,U,G,U,C) f-c(5,4,6,9) N-c (A, C, G, U) barplot(f,names=N,ylab=Frequency,xlab=RNA Residue Sequence,col=c(2,3,4,5), main=RNA Residue Analysis) The Export menu has three options—Save Plot as Image..., Save Plot as PDF..., or Copy Plot to Clipboard.... Choosing Save Plot as Image yields the following popup: January 1, 1980 10 / 70
  • 11. Example: Bar Graph RNA #To construct Bar-plot for the RNA sequence barplot(f,names=N,ylab=Frequency,xlab=RNA Residue Sequence,col=c(2,3,4,5), main=RNA Residue Analysis) January 1, 1980 11 / 70
  • 12. Example: Pie-chart for RNA #To construct Pie-chart for the RNA sequence pie(f,names=N,col=c(2,3,4,5), main= Pie-chart RNA for RNA Sequence, labels=c(16.7%, 20.8%, 25%, 37.5%)) January 1, 1980 12 / 70
  • 13. Visualize the Relationship Between Two Variables To examine the relationship between two variables we can use the plot command which, when applied to numeric objects, draws a scatter plot. As an illustration, we first generate a set of n = 50 data points from the linear model: y = 0.5x + ewheree N(0, 0.12 )andx U(0, 1). that is coded in R as: n=50 x=runif(n) y=0.5*x+rnorm(n,sd=0.1) Next, using the the command plot(x,y) we create a simple scatter plot of x versus y. One could also use a formula as in plot(y x). Generally one will want to label the axes and add a title as the following code illustrates; the resulting scatter plot is presented in Figure below. plot(x,y, xlab=Explanatory Variable,ylab=Response Variable, main=An Example of a Scatter Plot) January 1, 1980 13 / 70
  • 15. Example 2: Maize Yield Data A study was made to analyze the effect of rainfall and temperature on maize yield in Kogi state, Nigeria. The annual mean rainfall, mean temperature and maize yield in study area was considered between the period of 2001 and 2010 as presented in Table below to analyze any variation in maize yield may directly be the result of rainfall and/or temperature variation. Year Mean Monthly Mean Monthly Maize Yield Rainfall (cms) Temperature (0 C) (mt) 2001 83.58 34.0 234.0 2002 106.30 33.0 241.0 2003 82.18 32.5 250.0 2004 111.20 34.8 255.0 2005 78.28 33.4 214.0 2006 140.30 34.2 262.0 2007 125.10 34.6 289.3 2008 105.00 33.3 310.0 2009 138.10 34.0 333.2 2010 89.50 31.0 371.3 January 1, 1980 15 / 70
  • 16. Fitting Simple Linear Regression in R Regression analysis is used for explaining or modeling the relationship between a single variable Y, called the response, output or dependent variable, and one or more predictor, input, independent or explanatory variables, X1, X2, · · · , Xp. When p=1, it is called simple regression model but when p1 it is called multiple regression or sometimes multivariate regression model. In regression model, the response must be a continuous variable but the explanatory variables can be continuous, discrete or categorical variable. Objectives of regression analyses: I Prediction of future observations. I Assessment of the effect of, or relationship between, explanatory variables on the response. I A general description of data structure. Fitting a linear model in R is done using the lm() command. Notice the syntax for specifying the predictors in the model. January 1, 1980 16 / 70
  • 17. R-code for Simple Linear Regression Consider an n pairs of observations, for two variables X and Y, such that (X, Y) = {(x1, y1) , (x2, y2) , · · · , (xn, yn)}. Yi = α + β1Xi + εi . (1) where Yi is a continuous response variable for the ith subject or experimental unit, xi is the corresponding value of an explanatory variable, ei is the error term, α is an intercept parameter, and β is the slope parameter. fitted.modelB= lm(y ∼ x1)## fitting simple linear regression. OR lm(dependent variable Independent variable) To get more information about the fitted model, summary(the model’s name) can be used to get details about Residuals, Coefficients, Residual standard error, R2 , Adjusted R2 . Moreover, summary(aov(the model’s name)) used to get ANOVA table. January 1, 1980 17 / 70
  • 18. Example: Iris Data from R-bulletin data(iris) data(iris) names(iris) [1] Sepal.Length Sepal.Width Petal.Length Petal.Width Species data.frame(Sepal.length, Sepal.Width, Petal.length, Petal.Width) boxplot(Sepal.length, Sepal.Width, Petal.length, Petal.Width)#Box plot for each variability. January 1, 1980 18 / 70
  • 19. Scattered Plot of Sepal Length versus its Width plot( Sepal.length,Sepal.Widt)# scatter plot Sepal.length versus Sepal.Width Figure: Scattered Plot January 1, 1980 19 / 70
  • 20. R-code for Fitting Liner Regression Model to Sepal Length versus its Width model1 = lm(Sepal.Width Sepal.Length, data = iris) summary(model1) Call: lm(formula = Sepal.Width Sepal.length, data = iris) Residuals: Min 1Q Median 3Q Max -1.1095 -0.2454 -0.0167 0.2763 1.3338 Coefficients: . Estimate Std. Error t value Pr(|t|) (Intercept) 3.41895 0.25356 13.48 2e-16 *** Sepal.length -0.06188 0.04297 -1.44 0.152 — Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4343 on 148 degrees of freedom Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159 F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519 January 1, 1980 20 / 70
  • 21. R-code for Diagnosis of the Fitted Liner Regression Model to Sepal Length versus its Width plot(model1) #Model Diagnosing January 1, 1980 21 / 70
  • 22. Fitting Multiple Linear Regression using R Multiple linear regression (MLR) is a method to test and establish linear relationships between one dependent variable and two or more independent variables. Multiple Regression analysis is a conceptually simple method for investigating functional relationships among the dependent (response) variable, Yi , and one or more independent (predictor or explanatory) variables, denoted by X1, X2, · · · , Xp. Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε where, the parameter β0 is the y-intercept, which represents the expected value of Y when each X is zero (0). The other, β1, β2, · · · , βp in the multiple regression equation are parameters. R-code for Multiple Regression fitted.model-B= lm(y ∼ x1 + x2 + · · · + xp) # fitting multiple linear regression. January 1, 1980 22 / 70
  • 23. Example: The following data were obtained from R. H. Woodward (1984) and re done on R by Axel Drefahl (2019) that examined the packing conditions of ammonium sulfate. The flow rate of ammonium sulfate-an inorganic salt-was determined by letting defined quantities of salt flow through a small funnel. The flow rate is the dependent variable Y, depending on salt characteristics such as moisture content, crystal shape and impurities. In the example, the independent variables are: X1= initial moisture content in units of 0.01 %, X2= length/breadth ratio for crystals and X3= percent impurity in units of 0.01 %. Their values are listed in Table below. January 1, 1980 23 / 70
  • 24. i X1i X2i X3i Yi 1 21 2.4 0 5.00 2 20 2.4 0 4.81 3 16 2.4 0 4.46 4 18 2.5 0 4.81 5 16 3.2 0 4.46 6 18 3.1 1 3.85 7 12 3.2 1 3.21 8 12 2.7 0 3.25 9 13 2.7 0 4.55 10 13 2.7 0 4.85 11 17 2.7 0 4.00 12 24 2.8 0 3.62 13 11 2.5 0 5.15 14 10 2.6 0 3.76 15 17 2.0 0 4.90 16 14 2.0 0 4.13 17 14 2.0 1 5.10 18 14 1.9 0 5.05 19 20 2.1 2 4.27 20 12 1.9 1 4.90 21 11 2.0 2 4.5 22 10 2.0 7 5.32 23 10 2.0 2 4.39 24 16 2.0 2 4.85 i X1i X2i X3i Yi 25 17 2.2 3 4.59 26 17 2.4 4 5.00 27 17 2.4 0 3.82 28 15 2.4 2 3.68 29 17 2.2 3 5.15 30 21 2.2 4 2.94 31 23 2.2 10 3.18 32 22 2.0 7 2.28 33 21 1.9 4 5.00 34 24 2.1 8 2.43 35 37 2.3 14 0 36 21 2.4 2 4.10 37 28 2.4 5 3.70 38 29 2.4 7 3.36 39 23 3.6 7 3.79 40 32 3.3 8 3.40 41 26 3.5 4 1.51 42 28 3.5 12 0 43 21 3.0 3 1.72 44 22 3.0 6 2.33 45 34 3.0 8 2.38 46 29 3.5 5 3.68 47 17 3.5 3 4.20 48 11 3.2 2 5.00 Table: Ammonium Sulfate Data January 1, 1980 24 / 70
  • 25. Fitting Multiple Linear Regression for the Ammonium Data using R pairs(data1[,-1])# Plot the Scattered Plot Matrix linearmodel= lm(Yi ∼ X1i + X2i + X3i) # fitting the multiple linear model summary(linearmodel) Call: lm(formula = Yi ∼ X1i + X2i + X3i , data = data1) Residuals: . Min 1Q Median 3Q Max -1.78963 -0.62872 0.08172 0.48581 1.42336 Coefficients: . Estimate Std. Error t value Pr(|t|) (Intercept) 6.72824 0.66719 10.085 5.15e-13 *** X1i -0.04854 0.02799 -1.734 0.08990 . X2i -0.56733 0.25315 -2.241 0.03011 * X3i -0.16577 0.05088 -3.258 0.00217 ** — Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.8466 on 44 degrees of freedom Multiple R-squared: 0.5743, Adjusted R-squared: 0.5452 F-statistic: 19.78 on 3 and 44 DF, p-value: 2.873e-08 January 1, 1980 25 / 70
  • 26. Constructing Confidence Interval Estimate of Coefficients for Multiple Linear Regression for the Ammonium Data using R confint(linearmodel)# compute the confidence interval . lower(2.5%) upper(97.5%) (Intercept) 5.3836188 8.072870440 X1i -0.1049522 0.007873304 X2i -1.0775074 -0.057145301 X3i -0.2683166 -0.063223995 January 1, 1980 26 / 70
  • 27. Constructing ANOVA Table for fitted Multiple Linear Regression for the Ammonium Data using R The F distribution and the assumption that the Y and X variables have a multivariate normal distribution, we have: SSRegression/(p − 1) SSError /(n − p) ∼ Fp−1,n−p (2) Reject H0 if p-value (Sign.) is smaller than α = 0.05. Table: ANOVA table for testing significance in multiple linear regression with p parameters including β0 in vector β using n observations. Source Sum of Squares (SS) d.f. MS = SS/d.f. F Value Regression. SSRegression p-1 MSR = SSR/(p − 1) F = MSR MSE Residual SSError n-p MSE = SSE /(n − p) Total SSTotal n-1 January 1, 1980 27 / 70
  • 28. R-code for Analysis of Variance (ANOVA) for the Fitted Multiple Liner Regression Model aov(linearmodel)# construct ANOVA table Call: aov(formula = linearmodel) Terms: . X1i X2i X3i Residuals Sum of Squares 32.21928 2.71539 7.60817 31.53921 Deg. of Freedom 1 1 1 44 Residual standard error: 0.8466405 Estimated effects may be unbalanced summary(aov(linearmodel)) . Df Sum Sq Mean Sq F value Pr(F) X1i 1 32.22 32.22 44.949 3.1e-08 *** X2i 1 2.72 2.72 3.788 0.05802 . X3i 1 7.61 7.61 10.614 0.00217 ** Residuals 44 31.54 0.72 — Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 January 1, 1980 28 / 70
  • 29. Evaluating the Quality of the Model (Model Diagnosis) There are many different ways to evaluate a regression model’s quality. Many of the techniques can be rather technical, and the details of them are beyond the scope of this tutorial. However, the function summary() extracts some additional information that we can use to determine how well the data fit the resulting model. Quantile-Quantile Normality Plot of Fitted Residuals The residuals are the differences between the actual measured values and the corresponding values on the fitted regression line. Each data point’s residual is the distance that the individual data point is above (positive residual) or below (negative residual) the regression line. If the line is a good fit with the data, we would expect the summary(residuals) to have: I Residual values that are normally distributed around a mean of zero. I The median value of all of the residuals is near zero. I The 1st quartile Q1 and 3rd quartile Q3 values of all the sorted residual values are of roughly the same magnitude. January 1, 1980 29 / 70
  • 30. Quantile-Quantile Normality Plot of Fitted Residuals for Multiple Linear Regression for the Ammonium Data using R plot(linearmodel)# Diagnosis test for model fitted Figure: Fitted vs Residual (left) and QQ-Normality Plot of Residuals (right) January 1, 1980 30 / 70
  • 31. Goodness of Fit Test How well does the model fit the data? One measure is R2 , the so-called coefficient of determination or called the coefficient of multiple correlation, or percentage of variance explained. Coefficient of determination: From the sums of squares defined in ANOVA Table above, one can define a measure of model adequacy by the statistic: R2 = SSReg. SSTotal , (3) Coefficient of Determination for the Fitted Multiple Linear Regression for the Ammonium Data using R From the fitted regression output table above we have Multiple R-squared or (R2 ) = 0.5742665 or using the ANOVA table above it can be computed as follows: Regression Sum of Square =SumofSquare(X1i ) + Square(X2i ) + Square(X =32.21928 + 2.71539 + 7.60817 =42.54284 Total Sum of Square =Regression Sum of Square + Residual Sum =32.21928 + 2.71539 + 7.60817 + 31.53921 =74.08205 January 1, 1980 31 / 70
  • 32. Multicollinearity In multiple regression problems, we expect to find dependencies between the response variable Y and the regressors Xj . In most regression problems, however, we find that there are also dependencies among the regressor variables X0 j s. In situations where these dependencies are strong, we say that multicollinearity exists. Multicollinearity can have serious effects on the estimates of the regression coefficients and on the general applicability of the estimated model. We define the variance inflation factor (VIF) for β̂j as: VIF β̂j = 1 1 − R2 j , j = 1, 2, · · · , k. (4) These factors are an important measure of the extent to which multicollinearity is present. The larger the variance inflation factor, the more severe the multicollinearity. If the F-test for significance of regression is significant, but tests on the individual regression coefficients are not significant, multicollinearity may be present. January 1, 1980 32 / 70
  • 33. R code for Variance Inflation Factor of the Fitted Multiple Linear Regression vif(linearmodel) # compute the VIF for fitted model X1i X2i X3i 2.244051 1.113935 2.084959 setwd(C:/Users/hp/Desktop/RT raining ) #change working directory Maize= read.csv(Temprature.Rainfall.Maize.Yield.csv,header=TRUE) #import data named Temprature.Rainfall.Maize.Yield.csv to R Maize=data.frame(Maize.Yield[1:10,])#column bind dataset Mean.Rainfall=Maize[,1] Mean.Temprature=Maize[,2] Maize.Yield=Maize[,3] Maize.data=data.frame(Maize.Yield,Mean.Rainfall,Mean.Temprature) January 1, 1980 33 / 70
  • 34. Scatter Plot Matrix for the Maize Yield Data pairs(Maize.data) # scatter plot matrix Figure: Scatter Plot Matrix of Maize Yield Data January 1, 1980 34 / 70
  • 35. Three-Dimensional Scatter Plot for the Maize Yield Data
library(scatterplot3d)
scatterplot3d(Mean.Rainfall, Mean.Temprature, Maize.Yield, type="h", xlab="Mean.Rainfall", ylab="Mean.Temprature", zlab="Maize.Yield")
Figure: Three-Dimensional Scatter Plot of the Maize Yield Data January 1, 1980 35 / 70
  • 36. R-code to Fit Multiple Linear Regression for the Maize Yield Data
multilinermodel=lm(Maize.Yield ~ Mean.Rainfall+Mean.Temprature)
summary(multilinermodel)
Call: lm(formula = Maize.Yield ~ Mean.Rainfall + Mean.Temprature)
Residuals: Min 1Q Median 3Q Max -61.934 -9.942 6.206 15.865 47.107
Coefficients:
. Estimate Std. Error t value Pr(>|t|)
(Intercept) 763.9463 200.4665 3.811 0.00662 **
Mean.Rainfall 0.6596 0.5205 1.267 0.24559
Mean.Temprature -16.0899 5.5368 -2.906 0.02279 *
— Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.72 on 7 degrees of freedom
Multiple R-squared: 0.5909, Adjusted R-squared: 0.474
F-statistic: 5.055 on 2 and 7 DF, p-value: 0.0438
January 1, 1980 36 / 70
  • 37. Logistic Regression Linear regression often works very well when the response variable is quantitative. Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y, in general, can assume different values. In the simplest case scenario y is binary meaning that it can assume either the value 1 or 0. These could be arbitrary assignments resulting from observing a qualitative response. For example, in a study of a suspected carcinogen, aflatoxin B1, a number of levels of the compound were fed to test animals. After a period of time, the animals were sacrificed and the number of animals having liver tumors was recorded. The response variable is Y = 1 if the animal has a tumor and Y = 0 if the animal fails to have a tumor. January 1, 1980 37 / 70
  • 38. Simple Binary Logistic Regression Model Starting from the binary logistic regression model on the log-odds scale, we have the following:
log[P(Y)/(1 − P(Y))] = β0 + β1X
P(Y)/(1 − P(Y)) = e^(β0+β1X)
P(Y) = e^(β0+β1X) − P(Y) · e^(β0+β1X)
P(Y)(1 + e^(β0+β1X)) = e^(β0+β1X)
⇒ P(Y) = e^(β0+β1X) / (1 + e^(β0+β1X)) (5)
where e is the base of the natural logarithm (≈ 2.71828). The expression given in (5) is called the simple logistic regression model, whose response variable is a dichotomous (binary) outcome. January 1, 1980 38 / 70
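The closed form in (5) is exactly R's built-in inverse-logit, so the derivation can be verified numerically; the values of β0, β1, and the x grid below are arbitrary choices for illustration.

```r
b0 <- -1.5; b1 <- 0.3                   # arbitrary coefficients
x   <- seq(-5, 5, by = 0.5)
eta <- b0 + b1 * x                      # linear predictor (log-odds)

p.manual  <- exp(eta) / (1 + exp(eta))  # Equation (5)
p.builtin <- plogis(eta)                # R's inverse-logit function
all.equal(p.manual, p.builtin)          # TRUE
```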
  • 39. Example: Regressing mastitis on age at first calving in cows Is there an effect of age at first calving on the incidence of mastitis in cows? For a sample of 21 cows, the presence of mastitis (inflammation of the mammary gland) and age at first calving (in months) were recorded as in the Table below: January 1, 1980 39 / 70
  • 40. R-code for Logistic Regression
logisticmodel=glm(Mastities ~ Age, data=data2, family=binomial)
summary(logisticmodel)
Call: glm(formula = Mastities ~ Age, family = binomial, data = data2)
Deviance Residuals: Min 1Q Median 3Q Max -1.7745 -0.9551 0.6030 0.8607 1.6605
Coefficients:
. Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.7439 3.2640 2.066 0.0388 *
Age -0.2701 0.1315 -2.054 0.0399 *
— Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 29.065 on 20 degrees of freedom
Residual deviance: 23.842 on 19 degrees of freedom
AIC: 27.842 January 1, 1980 40 / 70
  • 41. R-code to Construct Confidence Interval Estimate of Coefficients for Simple Binary Logistic Regression confint(logisticmodel) . 2.5 % 97.5 % (Intercept) 0.9129170 14.08481226 Age -0.5665479 -0.03596198 January 1, 1980 41 / 70
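Since the estimates and confidence limits above are on the log-odds scale, exponentiating them gives odds ratios, which are usually easier to report. The sketch below refits a model of the same form on simulated stand-in data, because the original mastitis records are not reproduced here; the age range and coefficients used to generate the data are assumptions.

```r
set.seed(2)
Age       <- round(runif(21, 18, 36))                   # hypothetical ages (months)
Mastities <- rbinom(21, 1, plogis(6.7 - 0.27 * Age))    # hypothetical outcomes
logisticmodel <- glm(Mastities ~ Age, family = binomial)

exp(coef(logisticmodel))     # odds ratios
exp(confint(logisticmodel))  # 95% CIs on the odds-ratio scale
```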
  • 42. Multivariate Binary Logistic Regression Many categorical response variables have only two categories. Denote a binary response (dependent) variable by Y and its two possible outcomes by 1 (“success”) and 0 (“failure”), and let X = (X1, X2, · · · , Xk)^T be k independent variables that can be quantitative or dummy variables. The multivariate binary logistic regression model is given by:
log[P(Y)/(1 − P(Y))] = β0 + β1X1 + β2X2 + · · · + βkXk
⇒ P(Y) = e^(β0+β1X1+β2X2+···+βkXk) / (1 + e^(β0+β1X1+β2X2+···+βkXk)), (6)
where,
P(Y): probability of Y occurring
e: base of the natural logarithm (≈ 2.71828)
β0: intercept
β1: coefficient of X1 in predicting the probability of Y
. . .
βk: coefficient of Xk in predicting the probability of Y. January 1, 1980 42 / 70
  • 43. Experimental Design: Examples Agricultural Experimentation: An experimenter wants to compare the crop yield and environmental effects for two different fertilizers. The experimental units are separate plots of land. Some of these plots will be treated with Fertilizer A, and some with Fertilizer B. For example, Fertilizer A may be the currently used fertilizer; Fertilizer B is a newly developed alternative, perhaps one designed to have the same or better crop growth yields but with reduced environmental side effects. In this conceptual experiment, the selection of the plots and the experimental protocol will assure that the fertilizer used on one plot does not bleed onto another. Schedules for the amount and timing of fertilizer application will be set up. Crops will be raised and harvested on each plot, and the crop production and residual soil chemicals will be measured and compared to see if the new fertilizer is performing as designed and is an improvement over the current fertilizer. Before experimentation, we would not know whether Fertilizer B would achieve yields comparable to or better than Fertilizer A's this year, or whether an apparent difference is due, say, to especially favorable growing conditions or experimental care compared to previous years. To find out, we must run experiments. January 1, 1980 43 / 70
  • 45. Hypotheses Tests for a Difference in Means Let X11, X12, · · · , X1n1 be a random sample of n1 observations from the first population and X21, X22, · · · , X2n2 be a random sample of n2 observations from the second population. Let X̄1, X̄2, S2 1 and S2 2 be the sample means and sample variances, respectively. The expected value of the difference in sample means is E(X̄1 − X̄2) = µ1 − µ2, so X̄1 − X̄2 is an unbiased estimator of the difference in means (verify!). We test hypotheses on the difference in means µ1 − µ2 of two normal distributions whose variances are unknown but assumed equal. A t-statistic will be used to test these hypotheses. January 1, 1980 45 / 70
  • 46. Hypotheses Tests for a Difference in Means cont... The variance of X̄1 − X̄2 is
Var(X̄1 − X̄2) = σ2_1/n1 + σ2_2/n2 = σ2 (1/n1 + 1/n2), assuming σ2_1 = σ2_2 = σ2.
The pooled estimator of σ2, denoted by S2_p, is defined by
S2_p = [(n1 − 1)S2_1 + (n2 − 1)S2_2] / (n1 + n2 − 2).
Then
Tc = [X̄1 − X̄2 − (µ1 − µ2)] / [Sp √(1/n1 + 1/n2)]
has a t-distribution with n1 + n2 − 2 degrees of freedom. January 1, 1980 46 / 70
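The pooled statistic Tc can be computed by hand and checked against t.test() with var.equal=TRUE; the two small samples below are made up for illustration.

```r
# Pooled two-sample t computed from the formulas above
x1 <- c(32, 29, 38, 36, 30, 25, 29, 32)
x2 <- c(30, 29, 26, 34, 34, 30, 32, 33)
n1 <- length(x1); n2 <- length(x2)

sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)  # pooled variance S^2_p
tc  <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))      # Tc under H0: mu1 = mu2

tt <- t.test(x1, x2, var.equal = TRUE)   # R's pooled two-sample t-test
all.equal(unname(tt$statistic), tc)      # TRUE: same statistic
```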
  • 47. Paired t-test in R Applications: comparing means of data from two related samples; say, observations before and after an intervention on the same participant, or measurements taken from the same participant under two different conditions. Example: Cholesterol Level The mean cholesterol level 4 weeks after the special diet is lower than before the diet (5.84 versus 6.40). The similar standard deviations suggest that the spread of the values at the two time points is similar. The paired t-test calculates the paired difference for each subject and computes a test statistic from these differences. If there were no change in cholesterol between the two time points, the mean of the differences would be close to 0. January 1, 1980 47 / 70
  • 48. Subject | Before | After 4 weeks
1 | 6.42 | 5.83
2 | 6.76 | 6.2
3 | 6.56 | 5.83
4 | 4.8 | 4.27
5 | 8.43 | 7.71
6 | 7.49 | 7.12
7 | 8.05 | 7.25
8 | 5.05 | 4.63
9 | 5.77 | 5.31
10 | 3.91 | 3.7
11 | 6.77 | 6.15
12 | 6.44 | 5.59
13 | 6.17 | 5.56
14 | 7.67 | 7.11
15 | 7.34 | 6.84
16 | 6.85 | 6.4
17 | 5.13 | 4.52
18 | 5.73 | 5.13
January 1, 1980 48 / 70
  • 49. R-code for Paired T-test
t.test(variable1, variable2, paired=T)
# Cholesterol Level (Before and After4weeks as in the table above)
Before=c(6.42,6.76,6.56,4.8,8.43,7.49,8.05,5.05,5.77,3.91,6.77,6.44,6.17,7.67,7.34,6.85,5.13,5.73)
After4weeks=c(5.83,6.2,5.83,4.27,7.71,7.12,7.25,4.63,5.31,3.7,6.15,5.59,5.56,7.11,6.84,6.4,4.52,5.13)
t.test(After4weeks, Before, paired=T)
Paired t-test
data: After4weeks and Before
t = -15.439, df = 17, p-value = 1.958e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.6434736 -0.4887486
sample estimates: mean of the differences -0.56611
Conclusion: The test statistic is t = -15.439 and the p-value is very small (p < 0.001), so the null hypothesis is rejected (p < 0.05); we conclude there is a statistically significant difference. January 1, 1980 49 / 70
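A useful sanity check: a paired t-test is identical to a one-sample t-test on the within-subject differences. Using the cholesterol values from the table above:

```r
Before <- c(6.42, 6.76, 6.56, 4.8, 8.43, 7.49, 8.05, 5.05, 5.77, 3.91,
            6.77, 6.44, 6.17, 7.67, 7.34, 6.85, 5.13, 5.73)
After  <- c(5.83, 6.2, 5.83, 4.27, 7.71, 7.12, 7.25, 4.63, 5.31, 3.7,
            6.15, 5.59, 5.56, 7.11, 6.84, 6.4, 4.52, 5.13)

paired    <- t.test(After, Before, paired = TRUE)  # paired t-test
onesample <- t.test(After - Before)                # one-sample t on the differences
all.equal(unname(paired$statistic), unname(onesample$statistic))  # TRUE
```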
  • 50. Design of Single-Factor ANOVA The simplest ANOVA problem is referred to variously as a single-factor, single-classification, or one-way ANOVA. It involves the analysis of data from experiments in which more than two treatments have been used. The characteristic that differentiates the treatments or populations from one another is called the factor under study, and the different treatments or populations are referred to as the levels of the factor. The response for each of the m treatments is a random variable. Treatment Observations Totals Average 1 y11 y12 · · · y1n y1. ȳ1. 2 y21 y22 · · · y2n y2. ȳ2. . . . . . . m ym1 ym2 · · · ymn ym. ȳm. y.. ȳ.. January 1, 1980 50 / 70
  • 51. Model for a Single-Factor Experiment
Yij = µ + τi + εij, i = 1, 2, · · · , m; j = 1, 2, · · · , n (7)
where Yij is a random variable denoting the jth observation on the ith treatment, µ is a parameter common to all treatments called the overall mean, τi is a parameter associated with the ith treatment called the ith treatment effect, and εij is a random error component.
ANOVA Table for Single Factor Experiment
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F
Between | SSbetween | m-1 | MSbetween | MSbetween/MSwithin
Within | SSwithin | m(n-1) | MSwithin |
Total | SST | mn-1 | |
January 1, 1980 51 / 70
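The decomposition in this ANOVA table can be verified by computing the between- and within-treatment sums of squares directly and comparing them with what aov() reports; the three treatment groups below are made up for illustration (m = 3 treatments, n = 4 replicates).

```r
y     <- c(5.1, 4.8, 5.5, 5.0,  6.2, 6.0, 6.5, 6.1,  4.0, 4.3, 3.8, 4.1)
treat <- gl(3, 4)   # treatment factor: 3 levels, 4 observations each

grand <- mean(y)    # overall mean, ybar..
ss.between <- sum(tapply(y, treat, function(g) length(g) * (mean(g) - grand)^2))
ss.within  <- sum(tapply(y, treat, function(g) sum((g - mean(g))^2)))

tab <- summary(aov(y ~ treat))[[1]]
all.equal(tab$`Sum Sq`, c(ss.between, ss.within))  # TRUE: hand-computed SS match aov()
```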
  • 52. Example: Bull Effect Data on Weight Gain of Calves A researcher is interested in determining if the average daily gain in weight of calves depends on the bull which sired the calf. The researcher has only five bulls. The five bulls are mated with randomly selected cows and the average daily gain in weight by the calves produced by the matings are recorded. The data are given below. Use these data to run an analysis of variance and test for a significant bull effect. Use α = 0.05. January 1, 1980 52 / 70
  • 53. R-code for Single Factorial Design for Bull Effect Example
Weight.gain=c(1.2,1.39,1.36,1.39,1.22,1.31,1.16,1.08,1.22,0.87,1.17,1.12,0.75,1.12,1.02,1.08,0.83,0.98,0.96,1.16,1.05,1.00,1.12,1.15,0.99,0.85,1.10,1.03,0.94,0.89)
bull=gl(5,6,labels=c(1,2,3,4,5))
tapply(Weight.gain,bull,sum) # treatment totals
. 1 2 3 4 5
7.87 6.62 5.78 6.44 5.80
tapply(Weight.gain,bull,mean) # treatment means
. 1 2 3 4 5
1.3116667 1.1033333 0.9633333 1.0733333 0.9666667
January 1, 1980 53 / 70
  • 54. boxplot(Weight.gain ~ bull, ylab="Daily Gain Weight", xlab="Bull Sired") Figure: Box Plot for Daily Weight Gain of Calves January 1, 1980 54 / 70
  • 55. R-code for ANOVA of Bull Example
summary(aov(Weight.gain ~ bull)) # analysis of variance
. Df Sum Sq Mean Sq F value Pr(>F)
bull 4 0.4839 0.12097 10.29 4.56e-05 ***
Residuals 25 0.2938 0.01175
— Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table in the R output, the column Pr(>F) = 4.56e-05 is the p-value, which is less than our level of significance α = 0.05. Thus, we reject the null hypothesis that there is no difference in the daily weight gain of calves due to the bull which sired the calf. Conclusion: There is a significant effect of the bull which sired the calf on its daily weight gain at the α = 5% significance level. January 1, 1980 55 / 70
  • 56. R-code for Bartlett Test of Homogeneity of Variances for ANOVA of Bull Example
bartlett.test(Weight.gain ~ bull) # Bartlett test for equality of variances
Bartlett test of homogeneity of variances
data: Weight.gain by bull
Bartlett’s K-squared = 2.4124, df = 4, p-value = 0.6604
Since p-value = 0.6604 > 0.05, we fail to reject homogeneity and conclude there is no evidence that the 5 variances differ.
Fitted Versus Residual and QQ-Normality Test Plots of Residuals. January 1, 1980 56 / 70
  • 57. plot(fitted(aov(Weight.gain ~ bull)), residuals(aov(Weight.gain ~ bull)), type="p", col=4, xlab="Fitted", ylab="Residuals")
abline(h = 0, lty = 2, col=2)
Figure: Fitted Versus Residual Plot of Fitted ANOVA Model for Daily Weight Gain of Calves January 1, 1980 57 / 70
  • 58. qqnorm(residuals(aov(Weight.gain ~ bull)), main="Normal Q-Q Plot of Residuals", col=4)
qqline(residuals(aov(Weight.gain ~ bull)), col=2)
Figure: QQ-Normality Plot of Residuals of Fitted ANOVA Model for Daily Weight Gain of Calves January 1, 1980 58 / 70
  • 59. Analysis of Variance with Two Factors Here there are two factors A and B with r levels of factor A and m levels of factor B, each replicate contains all rm treatment combinations. When you are interested in the effects of two or more factors on the response variable, the Two-Way ANOVA is an analysis method for a quantitative outcome and two categorical explanatory variables that are defined in such a way that each experimental unit (subject) can be exposed to any combination of one level of one explanatory variable and one level of the other explanatory variable. January 1, 1980 59 / 70
  • 60. Factor B Row Row 1 2 · · · m Totals Average y111, y112, y121, y122, · · · y1m1, y1m2, 1 · · · , y11n · · · , y12n · · · · · · , y1mn y1.. ȳ1.. Factor A y211, y212, y221, y222, · · · y2m1, y2m2, 2 · · · , y21n · · · , y22n · · · · · · , y2mn y2.. ȳ2.. . . . . . . . . . . . . . . . . . . . . . . . . . . . yr11, yr12, yr21, yr22, · · · yrm1, yrm2, r · · · , yr1n · · · , yr2n · · · · · · , yrmn yr.. ȳr.. Column y.1. y.2. · · · y.m. y... Totals Column ȳ.1. ȳ.2. · · · ȳ.m. ȳ... Average The observation in the ijth cell for the kth replicate is denoted by Yijk . In performing the experiment, the rmn observations would be run in random order. January 1, 1980 60 / 70
  • 61. Additive Model of Two Factorial Experiment
Yijk = µ + τi + βj + εijk, i = 1, 2, · · · , r; j = 1, 2, · · · , m; k = 1, 2, · · · , n (8)
Model of Two Factorial Experiment With Interaction
Yijk = µ + τi + βj + (τβ)ij + εijk, i = 1, 2, · · · , r; j = 1, 2, · · · , m; k = 1, 2, · · · , n (9)
where µ is the overall mean effect, τi is the effect of the ith level of factor A, βj is the effect of the jth level of factor B, (τβ)ij is the effect of the interaction between A and B, and εijk is assumed to be a random error component having a normal distribution with mean zero and variance σ2. Since there are two factors in the experiment, the test procedure is sometimes called the two-way analysis of variance. January 1, 1980 61 / 70
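Models (8) and (9) map directly onto R's formula notation: `A + B` fits the additive model and `A * B` adds the interaction term. A minimal sketch with a made-up balanced 2×3 layout (r = 2, m = 3, n = 2; the data-generating coefficients are arbitrary):

```r
set.seed(3)
A <- gl(2, 6)                      # factor A with r = 2 levels
B <- gl(3, 2, 12)                  # factor B with m = 3 levels, crossed with A
y <- 10 + as.numeric(A) + 0.5 * as.numeric(B) + rnorm(12)

additive.mod <- aov(y ~ A + B)     # model (8): mu + tau_i + beta_j
interact.mod <- aov(y ~ A * B)     # model (9): adds the (tau*beta)_ij term
anova(additive.mod, interact.mod)  # F-test for the interaction term
```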
  • 62. ANOVA Table for Two Factor A and B Experiment Source of Sum of Degree of Mean Variation Squares Freedom Square F A SSA r-1 MSA MSA MSError B SSB m-1 MSB MSB MSError AB SSAB (r-1)(m-1) MSAB MSAB MSError Error SSError rm(n-1) MSerror Total SST rmn-1 January 1, 1980 62 / 70
  • 63. Example: Two Factors Experiment: Assay-Lab Calcium Content A consumer product agency wants to evaluate the accuracy of determining the level of calcium in a food supplement. There are a large number of possible testing laboratories and a large number of chemical assays for calcium. The agency randomly selects three laboratories and three assays for use in the study. Each laboratory will use all three assays in the study. Eighteen samples containing 10 mg of calcium are prepared and each assay–laboratory combination is randomly assigned to two samples. The determinations of calcium content are given in Table below (numbers in parentheses are averages for the assay–laboratory combinations). January 1, 1980 63 / 70
  • 65. R-Code for Two Factors Experiment: Assay-Lab Calcium Content
Calcium=c(10.9,10.9,10.5,9.8,9.7,10.0,11.3,11.7,9.4,10.2,8.8,9.2,11.8,11.2,10.0,10.7,10.4,10.7)
Assay=gl(3, 6)
Lab=gl(3, 2, 18, labels = c(1, 2, 3))
tapply(Calcium, list(Assay, Lab), sum)
. 1 2 3
1 21.8 20.3 19.7
2 23.0 19.6 18.0
3 23.0 20.7 21.1
tapply(Calcium, list(Assay, Lab), mean)
. 1 2 3
1 10.9 10.15 9.85
2 11.5 9.80 9.00
3 11.5 10.35 10.55
level.Calcium=lm(Calcium ~ Assay*Lab)
January 1, 1980 65 / 70
  • 66. R-code for ANOVA of Two Factorial Experiment for Calcium Concentration Example
anova(level.Calcium)
Analysis of Variance Table
Response: Calcium
. Df Sum Sq Mean Sq F value Pr(>F)
Assay 2 1.56 0.7800 5.6613 0.025597 *
Lab 2 7.56 3.7800 27.4355 0.000148 ***
Assay:Lab 4 1.64 0.4100 2.9758 0.080332 .
Residuals 9 1.24 0.1378
— Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 January 1, 1980 66 / 70
  • 67. Interaction Plot for Two Factorial Experiment for Calcium Concentration Example
interaction.plot(Lab, Assay, Calcium, col=c(2,4,6), main="(a)", pch=20)
From the figure, we observe that the mean calcium content in chemical January 1, 1980 67 / 70
  • 68. Diagnosis Test for Assay-Lab Model
e=residuals(level.Calcium)
f=fitted(level.Calcium)
plot(Assay, e, xlab="Assay", ylab="Residual")
plot(Lab, e, xlab="Lab", ylab="Residual")
The coefficient of determination is one way to diagnose our model, estimating the percentage of the variability in calcium content explained by the different assay chemicals and laboratories. R2 = SSModel/SSTotal = 10.76/12 = 0.897, where SSModel = SSAssay + SSLab + SSAssay∗Lab = 1.56 + 7.56 + 1.64 = 10.76. That is, about 90% of the variability in the calcium content is explained by the different assay chemicals, the different laboratories, and their interaction. January 1, 1980 68 / 70
  • 69. Figure: Box Plot for Residual Versus Assay (left) and Residual Versus Lab (right) Figures (a) and (b) above plot the residuals against the three chemical assays used and the three laboratories where the experiment was run, respectively. Both plots indicate mild inequality of variance, with the treatment combination of Assay 3 and Lab 3 possibly having larger variance than the others. January 1, 1980 69 / 70
  • 70. plot(f, e, col=4, lty=0, lwd=6, pch=20, xlab="Fitted", ylab="Residual", main="(a) Fitted Versus Residual Plot")
abline(h = 0, lty = 2, lwd=3, col=2)
qqnorm(e, col=4, lty=0, lwd=6, main="(b) QQ-Normality Plot", pch=20)
qqline(e, col=2)
Figure: Fitted Versus Residual Plot and QQ-Normality Plot January 1, 1980 70 / 70