LINEAR
REGRESSION
THE SAGE QUANTITATIVE RESEARCH KIT
Beginning Quantitative Research by Malcolm Williams, Richard D. Wiggins, and the late W.
Paul Vogt is the first volume in The SAGE Quantitative Research Kit. This book can be used
together with the other titles in the Kit as a comprehensive guide to the process of doing
quantitative research, but it is equally valuable on its own as a practical introduction to
completing quantitative research.
Editors of The SAGE Quantitative Research Kit:
Malcolm Williams – Cardiff University, UK
Richard D. Wiggins – UCL Social Research Institute, UK
D. Betsy McCoach – University of Connecticut, USA
Founding editor:
The late W. Paul Vogt – Illinois State University, USA
LINEAR
REGRESSION:
AN INTRODUCTION TO
STATISTICAL MODELS
PETER MARTIN
THE SAGE QUANTITATIVE RESEARCH KIT
SAGE Publications Ltd
1 Oliver’s Yard
55 City Road
London EC1Y 1SP
SAGE Publications Inc.
2455 Teller Road
Thousand Oaks, California 91320
SAGE Publications India Pvt Ltd
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road
New Delhi 110 044
SAGE Publications Asia-Pacific Pte Ltd
3 Church Street
#10-04 Samsung Hub
Singapore 049483
Editor: Jai Seaman
Assistant editor: Charlotte Bush
Production editor: Manmeet Kaur Tura
Copyeditor: QuADS Prepress Pvt Ltd
Proofreader: Elaine Leek
Indexer: Cathryn Pritchard
Marketing manager: Susheel Gokarakonda
Cover design: Shaun Mercier
Typeset by: C&M Digitals (P) Ltd, Chennai, India
Printed in the UK
© Peter Martin 2021
This volume published as part of The SAGE Quantitative
Research Kit (2021), edited by Malcolm Williams,
Richard D. Wiggins and D. Betsy McCoach.
Apart from any fair dealing for the purposes of research,
private study, or criticism or review, as permitted under the
Copyright, Designs and Patents Act, 1988, this publication
may not be reproduced, stored or transmitted in any form,
or by any means, without the prior permission in writing of
the publisher, or in the case of reprographic reproduction,
in accordance with the terms of licences issued by the
Copyright Licensing Agency. Enquiries concerning
reproduction outside those terms should be sent to the
publisher.
Library of Congress Control Number: 2020949998
British Library Cataloguing in Publication data
A catalogue record for this book is available from the
British Library
ISBN 978-1-5264-2417-4
At SAGE we take sustainability seriously. Most of our products are printed in the UK using responsibly sourced
papers and boards. When we print overseas we ensure sustainable papers are used as measured by the
PREPS grading system. We undertake an annual audit to monitor our sustainability.
Contents
List of Figures, Tables and Boxes
About the Author
Acknowledgements
Preface
1 What Is a Statistical Model?
Kinds of Models: Visual, Deterministic and Statistical
Why Social Scientists Use Models
Linear and Non-Linear Relationships: Two Examples
First Approach to Models: The t-Test as a Comparison of Two Statistical Models
The Sceptic’s Model (Null Hypothesis of the t-Test)
The Power Pose Model: Alternative Hypothesis of the t-Test
Using Data to Compare Two Models
The Signal and the Noise
2 Simple Linear Regression
Origins of Regression: Francis Galton and the Inheritance of Height
The Regression Line
Regression Coefficients: Intercept and Slope
Errors of Prediction and Random Variation
The True and the Estimated Regression Line
Residuals
How to Estimate a Regression Line
How Well Does Our Model Explain the Data? The R² Statistic
Sums of Squares: Total, Regression and Residual
R² as a Measure of the Proportion of Variance Explained
R² as a Measure of the Proportional Reduction of Error
Interpreting R²
Final Remarks on the R² Statistic
Residual Standard Error
Interpreting Galton’s Data and the Origin of ‘Regression’
Inference: Confidence Intervals and Hypothesis Tests
Confidence Range for a Regression Line
Prediction and Prediction Intervals
Regression in Practice: Things That Can Go Wrong
Influential Observations
Selecting the Right Group
The Dangers of Extrapolation
3 Assumptions and Transformations
The Assumptions of Linear Regression
Investigating Assumptions: Regression Diagnostics
Errors and Residuals
Standardised Residuals
Regression Diagnostics: Application With Examples
Normality
Homoscedasticity and Linearity: The Spread-Level Plot
Outliers and Influential Observations
Independence of Errors
What if Assumptions Do Not Hold? An Example
A Non-Linear Relationship
Model Diagnostics for the Linear Regression of Life Expectancy on GDP
Transforming a Variable: Logarithmic Transformation of GDP
Regression Diagnostics for the Linear Regression With Predictor Transformation
Types of Transformations, and When to Use Them
Common Transformations
Techniques for Choosing an Appropriate Transformation
4 Multiple Linear Regression: A Model for Multivariate Relationships
Confounders and Suppressors
Spurious Relationships and Confounding Variables
Masked Relationships and Suppressor Variables
Multivariate Relationships: A Simple Example With Two Predictors
Multiple Regression: General Definition
Simple Examples of Multiple Regression Models
Example 1: One Numeric Predictor, One Dichotomous Predictor
Example 2: Multiple Regression With Two Numeric Predictors
Research Example: Neighbourhood Cohesion and Mental Wellbeing
Dummy Variables for Representing Categorical Predictors
What Are Dummy Variables?
Research Example: Highest Qualification Coded Into Dummy Variables
Choice of Reference Category for Dummy Variables
5 Multiple Linear Regression: Inference, Assumptions and Standardisation
Inference About Coefficients
Standard Errors of Coefficient Estimates
Confidence Interval for a Coefficient
Hypothesis Test for a Single Coefficient
Example Application of the t-Test for a Single Coefficient
Do We Need to Conduct a Hypothesis Test for Every Coefficient?
The Analysis of Variance Table and the F-Test of Model Fit
F-Test of Model Fit
Model Building and Model Comparison
Nested and Non-Nested Models
Comparing Nested Models: F-Test of Difference in Fit
Adjusted R² Statistic
Application of Adjusted R²
Assumptions and Estimation Problems
Collinearity and Multicollinearity
Diagnosing Collinearity
Regression Diagnostics
Standardisation
Standardisation and Dummy Predictors
Standardisation and Interactions
Comparing Coefficients of Different Predictors
Some Final Comments on Standardisation
6 Where to Go From Here
Regression Models for Non-Normal Error Distributions
Factorial Design Experiments: Analysis of Variance
Beyond Modelling the Mean: Quantile Regression
Identifying an Appropriate Transformation: Fractional Polynomials
Extreme Non-Linearity: Generalised Additive Models
Dependency in Data: Multilevel Models (Mixed Effects Models, Hierarchical Models)
Missing Values: Multiple Imputation and Other Methods
Bayesian Statistical Models
Causality
Measurement Models: Factor Analysis and Structural Equations
Glossary
References
Index
List of Figures, Tables and Boxes
List of figures
1.1 Child wellbeing and income inequality in 25 countries
1.2 Gross domestic product (GDP) per capita and life expectancy in 134 countries (2007)
1.3 Hypothetical data from a power pose experiment
1.4 Illustrating two statistical models for the power pose experiment
1.5 Partition of a statistical model into a systematic and a random part
2.1 Scatter plot of parents’ and children’s heights
2.2 Galton’s data with superimposed regression line
2.3 An illustration of the regression line, its intercept and slope
2.4 Illustration of residuals
2.5 Partition of the total outcome variation into explained and residual variation
2.6 Illustration of R² as a measure of model fit
2.7 Galton’s regression line compared to the line of equal heights
2.8 Regression line with 95% confidence range for mean prediction
2.9 Regression line with 95% prediction intervals
2.10 Misleading regression lines resulting from influential observations
2.11 The relationship between GDP per capita and life expectancy, in two different selections from the same data set
2.12 Linear regression of life expectancy on GDP per capita in the 12 Asian countries with highest GDP, with extrapolation beyond the data range
2.13 Checking the extrapolation from Figure 2.12 by including the points for the 12 Asian countries with the lowest GDP per capita
3.1 Illustration of the assumptions of normality and homoscedasticity in Galton’s regression
3.2 An illustration of the normal distribution
3.3 Histogram of standardised residuals from Galton’s regression, with a superimposed normal curve
3.4 Histograms of standardised residuals illustrating six distribution shapes
3.5 Normal q–q plot of standardised residuals from Galton’s regression
3.6 Normal q–q plots of standardised residuals for six distribution shapes
3.7 Spread-level plot: standardised residuals and regression predicted values from Galton’s regression
3.8 Spread-level plots and scatter plots for four simulated data sets
3.9 Illustration of a standard normal distribution, with conventional critical values
3.10 Observations with the largest Cook’s distances from Galton’s regression
3.11 Galton’s regression data with four hypothetical influential observations
3.12 Life expectancy by GDP per capita in 88 countries
3.13 Diagnostic plots for the linear regression of life expectancy on GDP per capita
3.14 Life expectancy and GDP per capita – illustrating a linear regression on the logarithmic scale
3.15 The curvilinear relationship between life expectancy and GDP per capita
3.16 Diagnostic plots for the linear regression of life expectancy on log₂(GDP)
3.17 The shape of the relationship between Y and X in three common transformations, with positive slope coefficient (top row) and negative slope coefficient (bottom row)
4.1 A hypothetical scatter plot of two apparently correlated variables
4.2 Illustration of a confounder causing a spurious association between X and Y
4.3 A hypothetical scatter plot of two apparently unrelated variables
4.4 Illustration of a suppressor variable masking the true relationship between X and Y
4.5 Five hypothetical data sets illustrating possible models for Mental Wellbeing predicted by Social Participation and Limiting Illness
4.6 The distributions of Mental Wellbeing, Social Participation and Limiting Illness
4.7 Scatter plot of Mental Wellbeing by Social Participation, grouped by Limiting Illness
4.8 Mental Wellbeing, Social Participation and Limiting Illness: an illustration of five possible models for the National Child Development Study data
4.9 Distributions of Neighbourhood Cohesion and Social Support scales
4.10 Three-dimensional representation of the relationship between Mental Wellbeing, Neighbourhood Cohesion and Social Support
4.11 Three regression lines for the prediction of Mental Wellbeing by Neighbourhood Cohesion, for different values of Social Support
4.12 Three regression lines for the prediction of Mental Wellbeing by Neighbourhood Cohesion, Social Support and their interaction
4.13 Comparing predictions of Mental Wellbeing from the unadjusted and adjusted models (Models 4.1 and 4.2)
4.14 Distribution of ‘Highest Qualification’
5.1 Fisher distribution with df₁ = 5 and df₂ = 7597, with critical region
5.2 Normal q–q plot for standardised residuals from Model 5.3
5.3 Spread-level plot of standardised residuals against predicted values from Model 5.3
List of tables
1.1 Testosterone change from a power pose experiment (hypothetical data)
2.1 Extract from Galton’s data on heights in 928 families
2.2 A typical regression results table (based on Galton’s data)
3.1 The largest positive and negative standardised residuals from Galton’s regression
3.2 Estimates from a simple linear regression of life expectancy on GDP per capita
3.3 Logarithms for bases 2, 10 and Euler’s number e
3.4 Calculating the base-2 logarithm for a selection of GDP per capita values
3.5 Raw and log-transformed GDP per capita values for six countries
3.6 Estimates from a simple linear regression of life expectancy on log₂(GDP)
4.1 Coefficient estimates for five models predicting Mental Wellbeing
4.2 Coefficient estimates from a regression of Mental Wellbeing on Neighbourhood Cohesion and Social Support
4.3 Coefficient estimates for the prediction of Mental Wellbeing by Neighbourhood Cohesion, Social Support and their interaction
4.4 Coefficient estimates, standard errors and confidence intervals for two regression models predicting Mental Wellbeing
4.5 A scheme for coding a categorical variable with three categories into two dummy variables
4.6 A scheme to represent Highest Qualification by five dummy variables
4.7 Hypothetical data set with five dummy variables representing the categorical variable Highest Qualification
4.8 Estimates from a linear regression predicting Mental Wellbeing, with dummy variables representing Highest Qualification (Model 4.3)
5.1 Coefficient estimates, standard errors and confidence intervals for a multiple regression predicting Mental Wellbeing (Model 5.1)
5.2 Estimated coefficients for a regression of Mental Wellbeing on four predictors and two interactions (Model 5.2)
5.3 Analysis of variance table for linear regression
5.4 Analysis of variance table for a multiple regression predicting Mental Wellbeing (Model 5.1)
5.5 Model comparison of Models 5.1 and 5.3
5.6 Analysis of variance table for Models 5.1 and 5.3
5.7 Multicollinearity diagnostics for Model 5.3
5.8 The largest standardised residuals from Model 5.3
5.9 Estimates from a linear regression predicting Mental Wellbeing (Model 5.3)
5.10 Unstandardised and standardised coefficient estimates from Model 5.3
List of boxes
2.1 Types of Variables
2.2 Galton and Eugenics
2.3 Various Names for the Variables Involved in a Regression Model
2.4 Finding the Slope and the Intercept for a Regression Line
2.5 How to Calculate a Confidence Range Around the Regression Line
2.6 How to Calculate a Prediction Interval
3.1 The Normal Distribution and the Standard Normal Distribution
3.2 Regression Diagnostics and Uncertainty
3.3 Further Properties of the Normal Distribution
3.4 Logarithms
4.1 Variables From the National Child Development Study Used in Example 1
4.2 Interactions in Regression Models
4.3 Measurement of Neighbourhood Cohesion and Social Support in the NCDS
5.1 Nested and Non-Nested Models
About the Author
Peter Martin is Lecturer in Applied Statistics at University College London. He
has taught statistics to students of sociology, psychology, epidemiology and other
disciplines since 2003. One of the joys of being a statistician is that it opens doors
to research collaborations with many people in diverse fields. Dr Martin has been
involved in investigations in life course research, survey methodology and the analysis
of racism. In recent years, his research has focused on health inequalities, psychotherapy
and the evaluation of healthcare services. He has a particular interest in
topics around mental health care.
Acknowledgements
Thanks to Richard D. Wiggins, Malcolm Williams and D. Betsy McCoach for inviting
me to write this book. To Amy Macdougall, Andy Ross, D. Betsy McCoach, Kalia
Cleridou, Praveetha Patalay and Richard D. Wiggins for generously providing feedback
on draft chapters. To the team at Sage for editorial support. To Brian Castellani
for suggesting a vital phrase. To my colleagues for giving me time. To the staff of
several East London cafés for space and warmth. To everyone I ever taught statistics
for helping me learn. To Richard D. Wiggins for generous advice and encouragement
over many years. To Pippa Hembry for being there.
Thanks also to
• The UNICEF MICS team for permission to use data from their archive (https://mics.unicef.org).
• The Gapminder Foundation for making available data on life expectancy and GDP
from around the world.
• The UK Data Archive for permission to use data from the National Child
Development Study.
The data analyses reported in this book were conducted using the R Software for
Statistical Computing (R Core Team, 2019) with the RStudio environment (RStudio
Team, 2016). All graphs were made in R, in most cases using the package ggplot2.
Other R packages used in the making of this book are catspec, gapminder, ggrepel,
grid, knitr, MASS, plyr, psych, reshape2, scales, scatterplot3d, tidyverse.
Preface
This is a book about statistical models as they are used in the social sciences. It gives
a first course in the type of models commonly referred to as linear regression models.
At the same time, it introduces many general principles of statistical modelling,
which are important for understanding more advanced methods.
Statistical models are useful when we have, or aim to collect, data about social
phenomena and wish to understand how different phenomena relate to one another.
Examples in this book are based on real social science research studies that have
investigated questions about:
• Sociology of community: Do neighbourhoods with a more cohesive community
spirit foster mental wellbeing for local people?
• Demography and economics: Is it necessary for a country to get richer and richer to
increase the health of its population?
• Inequality and wellbeing: Is a country’s income inequality related to the wellbeing
of its children?
• Psychology: Can some people increase their feelings of confidence by assuming
certain ‘power poses’?
This book won’t give conclusive answers to these questions. But it does introduce
some of the analytical methods that have been used to address them, and other questions
like them. Specifically, this book looks at linear regression, which is a method
for analysing continuous variables, such as a person’s height, a child’s score on a
measure of self-rated depression or a country’s average life expectancy. Other types of
outcome variables, such as categorical and count variables, are covered in The SAGE
Quantitative Research Kit, Volume 8.
Realistic data sets
The examples in this book are based on published social science studies, and most
analyses shown use the original data on which these source studies were conducted,
or subsets thereof. Since the statistical analysis uses realistic data, the results reported
are sometimes ambiguous, which is to say: what conclusions we should draw from
the analysis may remain debatable. This highlights an important point about statistical
models: in themselves, statistical models do not give you the answers to your
research questions. What statistical analysis does provide is a principled way to derive
evidence from data. This evidence is important, and you can use it in your argument
for or against a certain conclusion. But all statistical results need to be interpreted to
be meaningful.
Prior knowledge useful for understanding this book
This book is intended for those who have a thorough grounding in descriptive statistics,
as well as in the fundamentals of inferential statistics. I assume throughout that
you understand what I mean when I speak of means, standard deviations, percentiles,
histograms, and scatter plots, and that you know the basic ideas underlying a t-test, a
z-test and a confidence interval. Finally, I assume that you are familiar with some of
the ways social science data are collected or obtained – surveys, experiments, administrative
data sources, and so forth – and that you understand that all these methods
have strengths and weaknesses that affect the conclusions we can draw from any
analysis of the data. An excellent way to acquire the knowledge required to benefit
from this book is to study Volumes 1 to 6 of The SAGE Quantitative Research Kit.
Mathematics: equations, calculations, Greek symbols
This book is intended for social scientists and students of social science who wish to
understand statistical modelling from a practical perspective. Statistical models are
based on elaborate and advanced mathematical methods, but knowledge of advanced
mathematics is not needed to understand this book.
Nonetheless, this book does require you, and possibly challenges you, to learn to
recognise the essential equations that define statistical models, and to gain an intuitive
understanding of how they work. I believe that this is a valuable skill to have. For
example, it’s important to recognise the difference between the sort of equation that
defines a straight line and another sort that defines a curve. As you will see, this is
essential for the ability to choose an appropriate model for a given research question
and data set. Attempts at using statistical models without any mathematical understanding
carry a high risk of producing nonsensical and misleading results.
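As a small illustration of that difference (the symbols a, b, x and y are placeholders chosen here, not notation taken from this book), a straight line has the form

\[ y = a + b\,x \]

whereas a relationship such as

\[ y = a + b\,\log_2(x) \]

describes a curve: each doubling of x adds the same amount b to y, so the curve flattens as x grows.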
So there will be equations. There will be Greek symbols. But there will be careful
explanations of them all, along with graphs and illustrations to illuminate the maths.
Think of the maths as a language that it’s useful to get a working understanding of.
Suppose you decide to live in a foreign country for a while, and that you don’t yet
know the main language spoken in this country. Suppose further that enough people
in that country understand and speak your own language, so that most of the time
you can get by using a language you are familiar with. Nonetheless, you will understand
more about the country if you learn a little bit of its language. Even if you don’t
aspire to ever speak it fluently, or write poetry in it, you may learn enough of it to
enable you to understand a newspaper headline, read the menu in a restaurant and
have a good guess what the native speakers at the next table are talking about. In a
similar way, you don’t need to become an expert mathematician to understand a little
bit of the mathematical aspect of statistical modelling, and to use this understanding
to your advantage. So what’s needed to benefit from this book is not so much
mathematical skill, but rather an openness to considering the language of mathematics
as an aid to understanding the underlying logic of statistical modelling.
Software
This book is software-neutral. It can be read and understood without using any statistical
software. On the other hand, what you learn here can be applied using any
statistical software that can estimate regression models. In writing this book, I used
the free open-source software R (R Core Team, 2019). Other statistical packages often
used by social scientists for linear regression models are Stata, SPSS and SAS.
Web support pages with worked examples
It is generally a good idea to learn statistics by doing it – that is, to work with data
sets and statistical software and play around with fitting statistical models to the data.
To help with this, the support website for this book supplies data sets for most of the
examples used in this book and gives worked examples of the analyses.
The support website is written in the R software. R has the advantage that it can
be downloaded free of charge, and that it has a growing community of users who
write new add-on packages to extend its capability, publish tutorials, and exchange
tips and tricks online. However, if you prefer to use a different software, or if you
are required to learn a different software for a course you are attending, you can
download the data sets from the support website and read them into your software
of choice. Instructions for this, as well as instructions on how to download R
for free, are given on the support website. Head to: https://study.sagepub.com/quantitativekit
References
R Core Team. (n.d.). R: A language and environment for statistical computing. R Foundation
for Statistical Computing. www.R-project.org
RStudio Team. (2016). RStudio: Integrated development for R. RStudio. www.rstudio.com
1
What Is a Statistical Model?
Chapter Overview
Kinds of models: visual, deterministic and statistical
Why social scientists use models
Linear and non-linear relationships: two examples
First approach to models: the t-test as a comparison of two statistical models
The signal and the noise
What is a statistical model? This chapter gives a first introduction. We start by considering
the concept of a ‘model’ in areas other than statistics. I then give some examples
of how statistical models are applied in the social sciences. Finally, we see how a
simple parametric statistical hypothesis test, the t-test for independent samples, can
be understood as a systematic comparison of two statistical models. An important
aim of this chapter is to convey the notion that statistical models can be used to
investigate how well a theory fits the data. In other words, statistical models help us
to systematically evaluate the evidence for or against certain hypotheses we might
have about the social world.
Kinds of models: visual, deterministic and statistical
Models are simplified representations of systems, objects or theories that allow us to
understand things better. An architect builds a model house to help herself and her
clients imagine how the real house will look once it is built. The Paris metro map is
a model that helps passengers understand where they can catch a train, where they
can travel to and where they can change trains.
Some models come in the form of mathematical equations. For example, consider
an engineer who wants to design a submarine. The deeper the submarine dives, the
higher the water pressure is going to be. The engineer needs to account for this lest
the walls of his submarine crack and get crushed. As you may know, pressure is measured
in a unit called bar, and the air pressure at sea level is equal to 1 bar. A mathematical
model of the pressure experienced by a submarine under the surface of the
sea is as follows:
\[ \text{Pressure} = 1\,\text{bar} + 0.1\,\text{bar} \times \text{depth} \]
where depth is measured in metres. This equation expresses the insight that, with
every metre that the submarine dives deeper, the water pressure increases by 0.1 bar.
Using this model, we can calculate that the pressure at 100 m depth will be
\[ \text{Pressure} = 1\,\text{bar} + 0.1\,\text{bar} \times 100 = 11\,\text{bar} \]
Thus, if your job is to build a submarine that can dive to 100 m, you know you need
to build it so that its walls can withstand 11 bars of pressure (i.e. 11 times the pressure
at sea level).
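The pressure model can also be written as a few lines of code. The following is a minimal sketch in Python; the function name `pressure_bar` and the choice of depths are illustrative assumptions, and only the constants 1 bar and 0.1 bar per metre come from the text.

```python
def pressure_bar(depth_m: float) -> float:
    """Deterministic model from the text: 1 bar at sea level plus 0.1 bar per metre of depth."""
    if depth_m < 0:
        raise ValueError("depth must be zero or positive")
    return 1.0 + 0.1 * depth_m


# Evaluate the model at a few illustrative depths (in metres).
for depth in (0, 10, 50, 100):
    print(f"{depth:>4} m -> {pressure_bar(depth):.1f} bar")
```

Given a depth, the function always returns the same pressure, which is what makes the model deterministic.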
Statistical models also are expressed in the form of equations. As you will see,
a simple statistical model looks very similar to the mathematical model we just
considered. The difference between the two is how they deal with the differences
between what the model predicts about reality and observations from reality
itself. The engineer who uses the mathematical model of pressure might be
happy to ignore small differences between the model and reality. The pressure
at 100 m is taken to be 11 bar. If it is really 10.997 bar or 11.029 bar, so what?
The approximation is good enough for the engineer’s purposes. Such a model is
called deterministic, because according to the model, the depth determines the
pressure precisely.
In contrast, statistical models are used in situations where there is considerable
uncertainty about how accurate the model predictions are. This is almost always the
case in social science, because humans, and the societies they build, are complex,
complicated, and not predictable as precisely as some natural phenomena, such as
the relationship between depth and water pressure.
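The contrast between the two kinds of model can be sketched with a small simulation. This is an illustration under stated assumptions rather than anything taken from the book: the random deviations are drawn from a normal distribution, and the two spread values (0.02 and 3.0) are invented to stand for "negligible" and "considerable" uncertainty. The helper names are hypothetical.

```python
import random


def predicted(x: float, intercept: float = 1.0, slope: float = 0.1) -> float:
    """Systematic part: the same straight-line form as the pressure model."""
    return intercept + slope * x


def simulate_observation(x: float, noise_sd: float, rng: random.Random) -> float:
    """Prediction plus a random deviation (assumed normal, for illustration only)."""
    return predicted(x) + rng.gauss(0.0, noise_sd)


rng = random.Random(42)  # fixed seed so that repeated runs give the same numbers
xs = [10, 50, 100]

# Negligible noise: observations sit almost exactly on the predictions.
near_deterministic = [simulate_observation(x, 0.02, rng) for x in xs]
# Considerable noise: observations scatter widely around the predictions.
noisy = [simulate_observation(x, 3.0, rng) for x in xs]

for x, a, b in zip(xs, near_deterministic, noisy):
    print(f"x={x:>3}  predicted={predicted(x):5.1f}  small noise={a:6.2f}  large noise={b:6.2f}")
```

With the small spread, ignoring the deviations costs little, as in the engineer's case. With the large spread, the deviations are too big to ignore, and a statistical model has to describe them explicitly.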
All models are simplifications of reality. The architect’s model house lacks many
details. The Paris metro map does not accurately represent the distances between
the stations. Our (simplistic) mathematical model of underwater pressure ignores
that pressure at the same depth will not be the same everywhere the submarine goes
(e.g. because the waters of different oceans vary in salinity). Whether these imprecisions
matter depends on the purpose of the model. The Paris metro map is useful for
travellers but not detailed enough for an engineer who wishes to extend the existing
tunnels to accommodate a new metro line. In the same way, a statistical model may
be good for one purpose but useless for another.
Why social scientists use models
Social scientists use statistical models to investigate relationships between social phe-
nomena, such as:
• Diet and longevity: Is what you eat associated with how long you can expect to live?
• Unemployment and health: Is unemployment associated with poorer health?
• Inequality and crime: Do countries with a wide income gap between the highest
and the lowest earners have higher crime rates than more equal countries?
Of course, descriptive statistics provide important evidence for social research. Tables
and graphs, means and standard deviations, correlation coefficients and comparisons
of groups – all these are important tools of analysis. But statistical models go beyond
description in important ways:
• Models can serve as formalisations of theories about the social world. By
comparing how well two models fit a given set of data, we can rigorously assess
which of two competing theories is more consistent with empirical observations.
• Statistical models provide rigorous procedures for telling the signal from the noise:
for deciding whether a pattern we see in a table or a graph can be considered
evidence for a real effect, relationship, or regularity in the social world.
• We can also use statistical models to develop specific predictions that we can test
in a new data set.
• Finally, statistical models allow us to investigate the influences of several variables
on one or several others simultaneously.
Linear and non-linear relationships: two examples
So what sort of things do we use statistical models for? Have a look at Figure 1.1, which
shows data on income inequality and child wellbeing in 25 of the richest countries of
the world. Income inequality is measured by the Gini coefficient; a higher Gini coef-
ficient indicates more unequal incomes. Child wellbeing is measured by the UNICEF
(United Nations Children’s Fund) index; a higher number means better child well-
being across the domains health, education, housing and environment, and behaviours.
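[Figure 1.1: scatter plot of the UNICEF index of child wellbeing (vertical axis, −2 to 2) against the Gini coefficient, 2010 (horizontal axis, 0.24 to 0.36), with a fitted straight line. Each point is one labelled country: Austria, Belgium, Canada, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Latvia, Luxembourg, Netherlands, Norway, Poland, Portugal, Slovenia, Spain, Sweden, United Kingdom, USA.]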
Figure 1.1 Child wellbeing and income inequality in 25 countries
Note. Gini coefficient: a higher coefficient indicates more income inequality. UNICEF index of child
wellbeing: a higher number indicates better child wellbeing averaged over four dimensions: health,
education, housing and environment, and behaviours. Data for Gini coefficient: Organisation for
Economic Co-operation and Development (www.oecd.org/social/income-distribution-database.htm)
and child wellbeing: Martorano et al. (2014, Table 15). This graph is inspired by Figure 1 in Pickett and
Wilkinson (2007) but is based on more recent data. UNICEF = United Nations Children’s Fund.
One way to describe these data is to draw attention to the positions of individual
countries. For example, the Netherlands, Norway and Iceland are rated the highest
on UNICEF’s index of child wellbeing, while Latvia, the USA and Greece are rated the
lowest. The three countries with the highest income inequality are the USA, Latvia
and the UK. The most egalitarian countries in terms of income are Slovenia, Norway
and Denmark.
Figure 1.1 also demonstrates a general pattern. The distribution of countries sug-
gests that the more inequality there is in a country, the poorer the wellbeing of the
children. As you may remember from The SAGE Quantitative Research Kit, Volume 2,
this is called a negative relationship (as one variable goes up, the other tends to go
down), and it can be represented by a correlation coefficient, Pearson’s r. The observed
correlation between inequality and child wellbeing in Figure 1.1 is r = −0.70.
We might also want to illustrate the relationship by drawing a line, as I have done
in Figure 1.1. This line summarises the negative relationship we have just described.
The line describes how the wellbeing of children in a country depends on the degree
of a country’s economic inequality. The points don’t fall on the line exactly, but we
may argue that the line represents a fair summary of the general tendency observed
in this data set. This line is called a regression line, and it is a simple illustration of
linear regression, a type of statistical model that we will discuss in Chapter 2.
Every statistical model is based on assumptions. For example, by drawing the
straight line in Figure 1.1, we are assuming that there is a linear relationship between
inequality and child wellbeing. The word linear in the context of statistical models refers
to a straight line. Curved lines are not considered ‘linear’. Judging from Figure 1.1,
the assumption of linearity might seem reasonable in this case, but more generally
many things are related in non-linear ways. Consider, for example, Figure 1.2, which
shows the relationship between GDP (gross domestic product) per capita and life
expectancy in 134 countries.
The graph suggests that there is a strong relationship between GDP and life
expectancy. But this relationship is not linear; it is not well represented by a straight
line. Among the poorest countries, even relatively small differences in GDP tend
to make a big difference in life expectancy. For the richer countries, even relatively
large differences in GDP appear to affect life expectancy only a little, or maybe not
at all. We may try to represent this relationship by drawing a curved line, as shown
in Figure 1.2. This is a simple illustration of a non-linear model, representing a
non-linear relationship. Like the line in Figure 1.1, the line in Figure 1.2 does not
represent the relationship between GDP and life expectancy perfectly. For exam-
ple, there are at least six African countries whose life expectancy is much lower
than the line predicts based on these countries’ GDP. We will see later in the book
how cases that don’t appear to fit our model can help us to improve our analysis.
First approach to models: the t-test as a comparison of two
statistical models
The practice of modelling often involves investigating which of a set of models gives
the best account of the data. In this way, we might compare a linear model with a
non-linear one, a simpler model with a more complex one, or a model corresponding
to one theory with a model corresponding to another. As a first introduction to how
this works, I will show you how an elementary hypothesis test, the t-test for inde-
pendent samples, can be understood as a systematic comparison of two statistical
models. The example will also introduce you to some simple mathematical notation
that will be useful in understanding subsequent chapters.
[Figure 1.2: scatter plot of life expectancy in years (vertical axis, 30 to 80) against GDP per capita in US$ (horizontal axis, 0 to 50,000), with points marked by continent (Africa, Americas, Asia, Europe, Oceania) and a fitted curved line.]
Figure 1.2 Gross domestic product (GDP) per capita and life expectancy in 134 countries (2007)
Note. Data from the Gapminder Foundation (Bryan, 2017). See www.gapminder.org
The example concerns psychological aspects of the mind–body problem. Most of
us have experienced that the way we hold our body can reflect the state of mind
that we are in: when we are anxious our body is tense, when we are happy our body
is relaxed, and so forth. But does this relationship work the other way around? Can
we change our state of mind by assuming a certain posture? Carney et al. (2010)
published an experimental study about what they called power poses. An example of
a power pose is to sit on a chair with your legs stretched out and your feet resting on
your desk, your arms comfortably crossed behind your neck. Let’s call this the ‘boss
pose’. Carney et al. (2010) reported that participants who were instructed to hold a
power pose felt more powerful subjectively, assumed a more risk-taking attitude and
even had higher levels of testosterone in their bodies compared to other participants,
who were instructed to hold a ‘submissive pose’ instead. The study was small, involv-
ing 42 participants, but it was covered widely in the media and became the basis of a
popular TED talk by one of the co-authors.
A sceptic may have doubts about the study’s results. From a theoretical point of
view, one might propose that the mind–body connection is a bit more complicated
than the study appears to imply. Methodologically speaking, we may also note that
with such a small sample (n = 42), there is a lot of uncertainty in any estimates
derived from the data. Could it be that the authors are mistaking a chance finding for
a signal of scientific value?
A scientific way to settle such questions is to conduct a replication study. For
the sake of example, let’s focus on one question only: does assuming a power pose
increase testosterone levels in participants, compared to assuming a different kind
of pose?
To test this, let’s imagine we conduct a replication of Carney et al.’s (2010)
experiment. We will randomise respondents to one of two conditions: The
experimental group are instructed to assume a power pose, such as the ‘boss
pose’ described above. In contrast, the control group are asked to hold a sub-
missive pose – the opposite of a power pose – such as sitting hunched, looking
downwards, with hands folded between the thighs. Before assuming their pose,
the participants have their testosterone levels measured. They then hold their
assigned pose for 2 minutes, after which time testosterone is measured again. The
outcome variable is the difference in testosterone after holding the pose minus
testosterone before holding the pose. A positive value on this variable means that
testosterone was higher after posing than before. A negative number means the
opposite. Zero indicates no change.
Before we conduct the study, we might visualise what the data will look like.
Figure 1.3 shows hypothetical distributions of testosterone difference for the two
groups, the power pose group and the control group.
Let’s begin by turning the two competing theories about power poses – either power
poses can change testosterone levels or they cannot – into two different models that
aim to account for these hypothetical data.
The sceptic’s model (null hypothesis of the t-test)
The sceptic doesn’t believe that power poses influence testosterone levels. So she predicts
that, on average, the two groups have the same testosterone change. This is symbolised
by the line on the left panel of Figure 1.4: the means for the two groups are predicted to
be the same. The sceptic also recognises that not everyone may react to the experiment in the
same way, however, and so she expects individual variation around the mean testosterone
change. In brief, the sceptic says, ‘All we need to say about testosterone change in this experi-
ment is that there is random variation around the overall mean. Nothing else to see here.’
[Figure 1.3: change in testosterone (vertical axis, −50 to 40) plotted separately for the control group and the power pose group.]
Figure 1.3 Hypothetical data from a power pose experiment
[Figure 1.4: two panels, ‘Sceptic’s model’ (left) and ‘Power poses model’ (right), each showing change in testosterone (vertical axis, −50 to 40) for the control group and the power pose group.]
Figure 1.4 Illustrating two statistical models for the power pose experiment
Let’s now look at how we can formalise this model using mathematical notation.
The sceptic’s model can be written as follows:
Individual’s testosterone change = mean testosterone change + individual variation
In mathematical symbols, we might write the same equation as:
$$Y_i = \mu + \varepsilon_i$$
where
• $Y_i$ refers to the testosterone change of the ith individual:
  ◦ for example, $Y_1$ is the testosterone change of the first person, $Y_5$ is the
    testosterone change of the fifth person, and so on.
• µ is the population mean of testosterone change. This is denoted by the Greek letter
µ (‘mu’).
• $\varepsilon_i$ is the difference between the ith individual’s testosterone change and the mean
µ. For example, $\varepsilon_1$ is the difference between $Y_1$ and the mean µ. This is denoted
by the Greek letter ε (‘epsilon’).
The equation thus represents each participant’s testosterone change ($Y_i$) as a combination
of two components: the population mean µ and the participant’s individual
deviation from that mean, $\varepsilon_i$.
The $\varepsilon_i$ are called the errors. This might be considered a confusing name, as the
term error seems to imply that something has gone wrong. But this is not meant to be
implied here. If we rearrange the sceptic’s model equation, we can see that the errors
are simply the individual differences from the population mean:
$$\varepsilon_i = Y_i - \mu$$
The power pose model: alternative hypothesis of the t-test
Now let’s contrast the sceptic’s model with the power pose model. If you thought that
holding a power pose can increase testosterone, you would predict that the mean
change in testosterone is higher in the power pose group than in the control group.
This is illustrated by the lines in the right panel of Figure 1.4: the means for the two
groups are predicted to be different.
In mathematical notation, we can write this model as follows:
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
where
• $Y_i$ is, once again, the change in testosterone level for individual i.
• $X_i$ is a variable that indicates the group membership of individual i; this variable
can either be 0 or 1, where
  ◦ X = 0 indicates the control group and
  ◦ X = 1 indicates the power pose group.
• α is a coefficient, which in this case represents the mean of the control group.
• β is a coefficient that represents the difference in average testosterone change
between the power pose group and the control group.
• $\varepsilon_i$ represents individual variation in testosterone change around the group mean.
In this type of model, we call X the predictor variable, and Y the outcome variable. X is
used to predict Y. In our example, the power pose model proposes that knowledge of an
individual’s experimental group membership – did they adopt a power pose or not? – can
help us predict that individual’s testosterone levels. (In other publications, you may find
other names for X and Y: X may also be called the independent variable, the exposure or
the explanatory variable; Y may be called the dependent variable or the response.)
To understand how the model works, consider how the equation looks for the con-
trol group, where X = 0. We have
$$Y_i = \alpha + \beta \times 0 + \varepsilon_i = \alpha + \varepsilon_i$$
(Since the term β × 0 is always zero, it can be left out.)
So for the control group, the model equation reduces to $Y_i = \alpha + \varepsilon_i$. The coefficient
α thus represents the mean of the control group.
For the power pose group, where X = 1, the model looks like this:
$$Y_i = \alpha + \beta \times 1 + \varepsilon_i = \alpha + \beta + \varepsilon_i$$
So the power pose model predicts that the mean of the power pose group is different
from α by an amount β. Note that if β = 0, then there is no difference between
the means of the power pose group and the control group. In other words, if β = 0,
then the power pose model becomes the sceptic’s model (with α = µ). The power
pose hypothesis of course implies that the power pose group mean is higher than the
control group mean, which implies that β > 0.
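The way the 0/1 coding of X switches β on and off can be sketched in a few lines of code. This is an illustration only: the function name `predicted_mean` and the coefficient values are hypothetical and are not estimates from any data.

```python
def predicted_mean(x: int, alpha: float, beta: float) -> float:
    """Systematic part of the power pose model, alpha + beta * x, with x coded 0 or 1."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (control) or 1 (power pose)")
    return alpha + beta * x


# Hypothetical coefficient values, chosen only to illustrate the coding.
alpha, beta = 2.0, 5.0

print("control group mean:   ", predicted_mean(0, alpha, beta))  # alpha
print("power pose group mean:", predicted_mean(1, alpha, beta))  # alpha + beta

# With beta = 0 the two group means coincide, which is the sceptic's model.
print("same mean when beta = 0:",
      predicted_mean(0, alpha, 0.0) == predicted_mean(1, alpha, 0.0))
```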
Using data to compare two models
So we have two competing models: the sceptic’s model and the power pose model.
The sceptic’s model corresponds to the null hypothesis of a statistical hypothesis test;
the power pose model corresponds to the alternative hypothesis.
If we have conducted a study and observed data, we can estimate the coefficients
of each model. We will use the hypothetical data displayed in Figure 1.3. Table 1.1
shows them as raw data with some descriptive statistics.
Table 1.1 Testosterone change from a power pose experiment (hypothetical data)
                                   Group
                            Control      Power Poses
                                8.9         27.9
                               15.7         13.0
                              −52.8          7.6
                               16.9          6.7
                              −24.9        −24.9
                              −12.4         21.3
                               14.0         11.8
                               −9.9         41.2
                               38.2         34.9
                              −14.6          9.1
                               30.7         13.5
                                1.3         35.0
                               15.7         29.4
                               20.1         10.8
                               37.1          4.7
                              −21.5        −13.3
                              −36.5          3.7
                               28.6         −6.5
                               16.0          6.5
                                0.2        −36.0
Mean                           3.54         9.81
Standard deviation            24.84        19.63
Overall mean                         6.68
Overall standard deviation          22.33
Pooled standard deviation           22.39
We will now use these data to estimate the unknown coefficients in the power pose
model. We denote estimates of coefficients by putting a hat (^) on the coefficient
symbols. Thus, we use α̂ (read ‘alpha-hat’) to denote an estimate of α, and β̂ (‘beta-
hat’) to denote an estimate of β.
Recall that the power pose model is as follows:
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
I said above that the coefficient α represents the population mean testosterone
change of the control group in the power pose model. How best to estimate this coef-
ficient? Intuitively, it makes sense to estimate the population mean by the sample
mean we observe in our data. In this case,
$$\hat{\alpha} = \bar{Y}_{\text{Control}}$$
where $\bar{Y}_{\text{Control}}$ is the observed mean testosterone change in the control group.
Similarly, I said above that the coefficient β represents the difference between
the mean testosterone change in the power pose group and the mean change in the
control group. As the estimate of this, we are going to use the difference between the
sample means of our two groups:
$$\hat{\beta} = \bar{Y}_{\text{Power pose}} - \bar{Y}_{\text{Control}}$$
From the descriptive statistics given in Table 1.1, we can calculate these estimates:
$$\hat{\alpha} = \bar{Y}_{\text{Control}} = 3.54$$
$$\hat{\beta} = \bar{Y}_{\text{Power pose}} - \bar{Y}_{\text{Control}} = 9.81 - 3.54 = 6.27$$
So the control group mean is estimated to be 3.54, and the difference between the
experimental and the control group is estimated to be 6.27. Recall, however, that
these estimates are based on the assumption that the power pose model is correct.
The sceptic, who disagrees with the power pose model, would argue that a sim-
pler model is sufficient to account for the data. Recall that the sceptic’s model is
as follows:
$$Y_i = \mu + \varepsilon_i$$
According to this model, there is no difference between the group means in the pop-
ulation. All we need to estimate is the overall group mean µ. Again, to denote an
estimate of µ, we furnish it with a hat. And we will use the overall sample mean as
the estimator:
$$\hat{\mu} = \bar{Y}_{\text{all}}$$
From Table 1.1, we have
$$\hat{\mu} = 6.68$$
Under the assumption of the sceptic’s model, then, we estimate that, on average,
holding some pose for 2 minutes raises people’s testosterone levels by 6.68 (and it
doesn’t matter what kind of pose they are holding).
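The coefficient estimates for both models can be reproduced from the raw values in Table 1.1. The following is a minimal sketch using only the Python standard library; the variable names are my own. Because the table reports values rounded to one decimal place, results may differ from the text in the last digit.

```python
from statistics import mean

# Hypothetical testosterone changes, copied from Table 1.1.
control = [8.9, 15.7, -52.8, 16.9, -24.9, -12.4, 14.0, -9.9, 38.2, -14.6,
           30.7, 1.3, 15.7, 20.1, 37.1, -21.5, -36.5, 28.6, 16.0, 0.2]
power_pose = [27.9, 13.0, 7.6, 6.7, -24.9, 21.3, 11.8, 41.2, 34.9, 9.1,
              13.5, 35.0, 29.4, 10.8, 4.7, -13.3, 3.7, -6.5, 6.5, -36.0]

# Power pose model: alpha is the control group mean, beta the difference in means.
alpha_hat = mean(control)
beta_hat = mean(power_pose) - mean(control)

# Sceptic's model: a single overall mean for all participants.
mu_hat = mean(control + power_pose)

print(f"alpha-hat = {alpha_hat:.2f}")
print(f"beta-hat  = {beta_hat:.2f}")
print(f"mu-hat    = {mu_hat:.2f}")
```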
The power pose model and the sceptic’s model estimate different coefficients, and
the two models are contradictory: they cannot both be correct. Either power poses
make a difference to testosterone change compared to submissive poses, or they do
not. How do we decide which model is better?
We will use the data to test the two models against each other. The logic goes like
this:
• We write down the model equation of the more complex model. In our case, this
is the power pose model, and it is written as $Y_i = \alpha + \beta X_i + \varepsilon_i$.
• We hypothesise that the simpler model is true. The simpler model is the sceptic’s,
in our case. If the sceptic is right, this would imply that the coefficient β in the
model equation is equal to zero. So we wish to conduct a test of the hypothesis β = 0.
• We then make assumptions about the data and the distribution of the outcome
variable. These are the usual assumptions of the t-test for independent samples
(see The SAGE Quantitative Research Kit, Volume 3):
  ◦ Randomisation: Allocation to groups has been random.
  ◦ Independence of observations: There is no relationship between the individuals.
  ◦ Normality: In each group, the sampling distribution of the mean testosterone
    change is a normal distribution.
  ◦ Equality of variances: The population variance is the same in both groups.
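• If the null model is true and all assumptions hold, the statistic
$$t = \frac{\hat{\beta}}{s_{\hat{\beta}}}$$
has a central t-distribution with mean zero and degrees of freedom
$df = n_0 + n_1 - 2$, where $n_0$ and $n_1$ are the numbers of participants in the
control group and the power pose group. With $s_{\hat{\beta}}$, I denote the estimated
standard error of β̂.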
• We calculate the observed t-statistic from the data. Then, we compare the result
to the t-distribution under the null model. This allows us to calculate a p-value,
which is the probability of obtaining our observed t-statistic, or one further away
from zero, if the null hypothesis model is true.
I have already shown how to calculate β̂ from the data; in the previous section, we
found that β̂ = 6.27. But I haven’t shown how to calculate the estimated standard
error of β̂, which we denote by the symbol $s_{\hat{\beta}}$. This standard error is a measure of the
variability of β̂. We can estimate this standard error from our data. In Chapter 2, I
will show you how this is done. For now, I will ask you to accept that this is possible
to do and to believe me when I say that $s_{\hat{\beta}} = 7.08$ for our data. Approximately, this
means that if we conducted an infinite number of power pose experiments, each with
exactly the same design and sample size n = 40, our estimate β̂ would differ from the
true value β by 7.08 on average. The smaller the standard error, the
more precise our estimates. So a small standard error is desirable.
But now to conduct the test. Using our estimates β̂ and $s_{\hat{\beta}}$, we have
$$t = \frac{\hat{\beta}}{s_{\hat{\beta}}} = \frac{6.27}{7.08} = 0.89$$
So t = 0.89. With 38 degrees of freedom, this yields a two-sided p-value of 0.38 (or a
one-sided p-value of 0.19). Since the p-value is quite large, we would conclude that
there is little evidence, if any, for the power pose model from these data. Although in
our sample there is a small difference in testosterone change between the power pose
group and the control group, this is well within the range of random variability that
we would expect to see in an experiment of this size. In the logic of the t-test for inde-
pendent samples, we say that we have little evidence against the null hypothesis. In
the logic of statistical modelling, we might say that the sceptic’s simple model seems
to be sufficient to account for the data.
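The test can be retraced numerically from Table 1.1. The sketch below rests on assumptions of mine rather than on the book’s own derivation, which is deferred to Chapter 2. It computes the standard error with the usual pooled-variance formula of the equal-variance t-test, $s_{\hat{\beta}} = s_p \sqrt{1/n_0 + 1/n_1}$. It obtains the p-value by integrating the t density numerically with Simpson’s rule, so that only the Python standard library is needed. The helper names `t_density` and `two_sided_p` are hypothetical.

```python
import math
from statistics import mean, variance

# Hypothetical testosterone changes, copied from Table 1.1.
control = [8.9, 15.7, -52.8, 16.9, -24.9, -12.4, 14.0, -9.9, 38.2, -14.6,
           30.7, 1.3, 15.7, 20.1, 37.1, -21.5, -36.5, 28.6, 16.0, 0.2]
power_pose = [27.9, 13.0, 7.6, 6.7, -24.9, 21.3, 11.8, 41.2, 34.9, 9.1,
              13.5, 35.0, 29.4, 10.8, 4.7, -13.3, 3.7, -6.5, 6.5, -36.0]


def t_density(x: float, df: int) -> float:
    """Density of the central t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """P(|T| >= |t|), using Simpson's rule on the density between 0 and |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    return 1.0 - 2.0 * (h / 3.0) * total


n0, n1 = len(control), len(power_pose)
df = n0 + n1 - 2

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n0 - 1) * variance(control) + (n1 - 1) * variance(power_pose)) / df
se_beta = math.sqrt(pooled_var * (1 / n0 + 1 / n1))

beta_hat = mean(power_pose) - mean(control)
t_stat = beta_hat / se_beta
p_two = two_sided_p(t_stat, df)

print(f"df = {df}, standard error = {se_beta:.2f}, t = {t_stat:.2f}")
print(f"two-sided p = {p_two:.2f}")
```

In practice, a statistics library would normally replace the hand-written integration; it is used here only to keep the sketch self-contained.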
In the original experiment, Carney et al. (2010) did find evidence for an effect of
power poses on testosterone (i.e. their p-value was quite small). In the language of
model comparison, their conclusion was that β is larger than zero. I made up the
data used in this chapter, so these pages are not a contribution to the scientific lit-
erature on power poses. I do want to mention, however, that other research teams
have tried to replicate the power pose effect. For example, Ranehill et al. (2015) used
a sample of 200 participants to test the power pose hypothesis and found no evidence
for an effect of power poses on either risk taking, stress or testosterone, although they
did find evidence that power poses, on average, increase participants’ self-reported
feeling of power.
Further research may shift the weight of the overall evidence either way. These
future researchers may employ statistical hypothesis tests, such as a t-test, without
specifically casting their report in the language of statistical models. But underlying
the research will be the effort to try to establish which of two models explains better
how the world works: the power pose model, where striking a power pose can raise
your testosterone, or the sceptic’s model, according to which testosterone levels may
be governed by many things but where striking a power pose is not one of them.
The signal and the noise
Statistics is the science of reasoning about data. The central problem that statistics
tackles is uncertainty about the data generating process: we don’t know why the data
are the way they are. If there is regularity in the way the world works, then research
may generate data that make this regularity visible. For example, if it is the case that
children in countries with more equal income distributions fare better than children
in unequal countries, we would expect to see a relationship between a measure of
income (in)equality and a measure of child wellbeing.
But there are many other processes that influence how the data turn out.
Measurement errors may cause the data to be inaccurate. Also, random processes
may introduce variations. Examples of such random processes are random sampling
or small variations over time, such as year-on-year variations in a country’s GDP, that
are not related to the research problem at hand. Finally, other variables may interfere
and hide the true relationship between child wellbeing and income inequality. Or
they may interfere in the opposite way and bring about the appearance of a relation-
ship, when really there is none.
Let’s make a distinction between the signal and the noise (Silver, 2012). The signal
is the thing we are interested in, such as, say, the relationship between GDP and life
expectancy. The noise is what we are less interested in but what is nonetheless pre-
sent in the data: measurement errors, random fluctuations in GDP or life expectancy,
and influences of other variables whose importance we either don’t know about or
which we were unable to measure.
In fancier words, we call the signal the systematic part of the model and the noise
the random part of the model. Recall the model we considered for the power poses,
which is shown in Figure 1.5.
Figure 1.5 Partition of a statistical model into a systematic and a random part
$$Y_i = \underbrace{\alpha + \beta X_i}_{\text{Systematic part}} + \underbrace{\varepsilon_i}_{\text{Random part}}$$
The systematic part of the model is $\alpha + \beta X_i$. This specifies the relationship between
the predictor and the outcome. The random part, $\varepsilon_i$, collects individual variation in
the outcome that is not related to the predictor. It is this random part which distin-
guishes a statistical model from a deterministic one (e.g. the model of depth and
water pressure we considered in the section ‘Kinds of Models’). When using statistical
models, we aim to detect and describe the signal, but we also pay attention to the
noise and what influence it might have on what we can say about the signal.
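To see the two parts at work, we can simulate outcomes from the model. The sketch below is an illustration under assumptions of my own: the errors are drawn from a normal distribution with standard deviation `sigma`, and all coefficient values are hypothetical. The function name `simulate` is likewise not from the text.

```python
import random


def simulate(alpha, beta, sigma, x_values, seed=1):
    """Generate (x, systematic, noise, y) rows from y = alpha + beta * x + noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rows = []
    for x in x_values:
        systematic = alpha + beta * x   # the signal: determined entirely by x
        noise = rng.gauss(0.0, sigma)   # the random part: unrelated to x
        rows.append((x, systematic, noise, systematic + noise))
    return rows


# Five hypothetical participants per group; coefficient values are made up.
x_values = [0] * 5 + [1] * 5
for x, signal, noise, y in simulate(alpha=3.0, beta=6.0, sigma=20.0, x_values=x_values):
    print(f"x = {x}  systematic = {signal:5.1f}  random = {noise:6.1f}  y = {y:6.1f}")
```

Setting `sigma` to zero removes the random part and turns the simulation into a deterministic model, in which every participant in a group has exactly the same outcome.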
2
Simple Linear Regression
Chapter Overview
Origins of regression: Francis Galton and the inheritance of height
The regression line
Regression coefficients: intercept and slope
Errors of prediction and random variation
The true and the estimated regression line
Residuals
How to estimate a regression line
How well does our model explain the data? The R² statistic
Residual standard error
Interpreting Galton’s data and the origin of ‘regression’
Inference: confidence intervals and hypothesis tests
Confidence range for a regression line
Prediction and prediction intervals
Regression in practice: things that can go wrong
Further Reading
Linear regression is a statistical model that represents the relationship between two variables as
a straight line. In doing so, a distinction is made between the outcome variable and the predic-
tor variable, as we did in Chapter 1. Linear regression is appropriate for outcome variables that
are continuous and that are measured on an interval or ratio scale. Box 2.1 gives an overview
of the different kinds of variables that can feature in a statistical model. For measurement
levels (nominal, ordinal, interval and ratio), see The SAGE Quantitative Research Kit, Volume 2.
This chapter considers simple linear regression, which is a linear regression with
exactly one predictor variable. In Chapters 4 and 5, we will look at linear regres-
sion with more than one predictor.
Origins of regression: Francis Galton and the inheritance of height
The first regression in history was carried out in the late 19th century by Francis Galton,
a half-cousin of Charles Darwin. One interest of Galton’s was the study of biological
inheritance: how parents pass on their individual characteristics to their children.
Types of Variables
In statistics, variables are distinguished in various ways, according to their properties. An
important distinction is made between numeric variables and categorical variables.
Numeric variables have values that are numbers (1, 68.2, −15, 0.9, and so forth). Height is
one such numeric variable. The values of categorical variables are categories. Country of
birth is categorical, with values ‘Afghanistan’, ‘Albania’, ‘Algeria’ and so forth. Categori-
cal variables are sometimes represented by numbers in data sets (where, say, ‘1’ means
Afghanistan, ‘2’ means Albania, and so forth), but in that case, the number just acts as a
label for a category and doesn’t mean that the variable is truly numeric.
Numeric variables, in their turn, are divided into continuous and discrete variables. A con-
tinuous variable can take any value within its possible range. For example, age is a contin-
uous variable: a person can be 28 years old, 28.4 years old or even 28.397853 years old. Age
changes every day, every minute, every second, so our measurement of age is limited only by
how precise we can or wish to be. Another example of a continuous variable is human height.
In contrast, a discrete variable only takes particular numeric values. For example,
number of children is a discrete variable: you can have zero children, one child or seven
children, but not 1.5 children.
The outcome variable of a linear regression should be continuous. In practice, our
measurement of continuous variables may make them appear discrete – for example,
when we record height only to the nearest inch. This does not necessarily harm the esti-
mation of our regression model, as long as the discrete measurement is not too coarse.
The predictor in a linear regression should be numeric and may be discrete or con-
tinuous. In Chapter 4, we will see how we can turn categorical predictors into numeric
‘dummy variables’ to enable us to include them in a regression model.
Box 2.1
Among other things, he studied the relationship between the heights of parents and
their children (once the children had grown up). To this end, he collected data from 928
families. An extract of the data is shown in Table 2.1, and Figure 2.1 illustrates the data.¹
Table 2.1 Extract from Galton’s data on heights in 928 families
Family Number (i)     Height of Parents (Average)     Height of Adult Child
1                     66.5                            66.2
2                     69.5                            67.2
3                     68.5                            64.2
4                     68.5                            68.2
5                     70.5                            71.2
6                     68.5                            67.2
…                     …                               …
926                   69.5                            66.2
927                   69.5                            71.2
928                   68.5                            69.2
Mean                  68.30                           68.08
Standard deviation    1.81                            2.54
Variance              3.29                            6.44
Note. The means, standard deviations and variances deviate slightly from Galton’s original results,
because I have added a small random jiggle to the data to make illustration and explanation easier.
[Scatter plot. Horizontal axis: Average height of parents (inches), 62 to 74. Vertical axis: Height of child (inches), 62 to 74.]
Figure 2.1 Scatter plot of parents’ and children’s heights
Note. Data are taken from Galton (1886) via the ‘psych’ package for R (Revelle, 2020). Data points have
been jiggled randomly to avoid overlap.
¹ I use Galton’s original data, as documented in Revelle (2020), but I have added a slight modification.
Galton recorded the heights in categories of 1-inch steps. Thus, most combinations of parents’ and
child’s height occur more than once, which makes for an unattractive overlap of points in a scatter
plot, and would generally have made the analysis difficult to explain. I have therefore added a small
random jiggle to all data points. All my analyses are done on the jiggled data, not Galton’s original
data. Therefore, for example, the means and standard deviations shown in Table 2.1 differ slightly
from those in Galton’s original data. I have done this purely for didactic purposes.
LINEAR REGRESSION: AN INTRODUCTION TO STATISTICAL MODELS
20
Box 2.2
The scatter plot, Figure 2.1, provides a first look at the relationship between the
parents’ and the children’s heights. Each dot represents one pair of measurements: the
average height of the parents² and the height of the adult child.³ The scatter plot
suggests that there is a positive, moderately strong relationship between parents’ and
children’s heights. In general, taller parents tend to have taller children. Nonetheless,
for any given parental height, there is much variation among the children.
One way of describing the relationship between parents’ and children’s heights is to
calculate a correlation coefficient. If the relationship is linear, Pearson’s product moment
correlation coefficient provides a suitable description of the strength and direction of the
relationship. (You may remember this from The SAGE Quantitative Research Kit, Volume 2.)
From Figure 2.1, it looks as though there is a linear relationship between the heights
of parents and their children. Thus we are justified in calculating Pearson’s r for Galton’s
data. Doing so, we obtain a Pearson correlation of r = 0.456. This confirms the impression
gained from the scatter plot: this is a linear positive relationship of moderate strength.
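As a rough illustration of the calculation, the sketch below computes Pearson's r in plain Python for the six families listed at the top of Table 2.1. The function name pearson_r is our own choice, not something taken from the text. Because only six of the 928 families are used, the result should not be expected to match the r = 0.456 reported for the full data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# First six families from Table 2.1 (heights in inches)
parents = [66.5, 69.5, 68.5, 68.5, 70.5, 68.5]
children = [66.2, 67.2, 64.2, 68.2, 71.2, 67.2]

r_extract = pearson_r(parents, children)
print(f"r for the six-family extract: {r_extract:.3f}")
```

A correlation based on six cases is very unstable, so the extract serves only to show the arithmetic, not to describe Galton's data.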
Galton and Eugenics
Francis Galton’s interest in heredity was linked to his interest in eugenics: the belief that
human populations can and should be ‘improved’ by excluding certain groups from
having children, based on the idea that people with certain heritable characteristics
are less worthy of existence than others. Galton was a leading eugenicist of his time. In
fact, it was he who coined the term eugenics. Eugenicist ideas were widespread in the
Western world in the early 20th century and in many countries inspired discriminatory
policies such as forced sterilisation and marriage prohibition for people labelled ‘unfit
to reproduce’, which included people with mental or physical disabilities. Historically,
the eugenics movement had close ideological links with racism (Todorov, 1993), and
pursued the aim of ‘purifying’ a population by reducing its diversity. Eugenicist ideas
and practices were most strongly and ruthlessly adopted by the Nazi regime in Ger-
many, 1933–1945. Like many Europeans of his time, Galton also held strong racist views
about the supposed superiority of some ‘races’ over others. Galton thus leaves a com-
plicated legacy: he was a great scientist (his scientific achievements reach far beyond
regression), but he promoted ideas that were rooted in racist ideology and that helped
to promote racism and discrimination. For further information about Galton, and how
contemporary statisticians grapple with his legacy, see Langkjær-Bain (2019).
² From now on, I shall refer to the average of the parents’ height simply as ‘parents’ height’.
Galton himself used the term height of the mid-parents.
³ To make male and female heights comparable, Galton multiplied the heights of females in his
sample by 1.08.
The regression line
Now let’s consider how to develop a statistical model. This goes beyond the cor-
relation coefficient, as we now make a distinction between the outcome variable,
and the predictor variable. The outcome is the variable that we wish to explain
or predict. The predictor is the variable we use to do so. Different books and texts
use different names for the outcome and the predictor variables. Box 2.3 gives an
explanation.
In making this distinction between the predictor and the outcome, we do not nec-
essarily imply a causal relationship. Whether it is plausible to deduce a causal rela-
tionship from an observed correlation depends on many things, including knowl-
edge about the research design and data collection process, as well as evidence from
other studies and theoretical knowledge about the variables involved in the analysis.
In our example, Galton wished to understand why people have different heights (the
outcome) and thought he could find an explanation by considering the heights of
people’s parents (the predictor). Our current scientific knowledge suggests that
parents’ and children’s heights are indeed related due to common causes, including the
genes shared by parents with their children as well as environmental and social fac-
tors such as nutrition, which tend to be more similar within families than between
different families.
Because we have concluded that the relationship between parents’ and chil-
dren’s heights is approximately linear, we propose a linear model: we will draw a
straight line to represent the relationship in Galton’s data. Such a line is shown
in Figure 2.2.
Various Names for the Variables Involved in a Regression Model
The outcome of a regression model is also known as the dependent variable (DV),
or the response. The predictor is also sometimes called an independent variable (IV),
an exposure or an explanatory variable. The terminological fashion varies somewhat
between disciplines. For example, psychology prefers the terms DV and IV, while in
epidemiology, outcome and exposure are more commonly used. In this book, I shall
use the terms outcome and predictor consistently.
Box 2.3
[Scatter plot with a straight line drawn through the points. Horizontal axis: Average height of parents (inches), 62 to 74. Vertical axis: Height of child (inches), 62 to 74.]
Figure 2.2 Galton’s data with superimposed regression line
The regression line is a representation of the relationship we observe between a
predictor variable (here, parents’ height) and an outcome variable (here, the height
of the adult child). By convention, we call the predictor X and the outcome Y. The
algebraic expression of a regression line is an equation of the form
$$\hat{Y}_i = \alpha + \beta X_i$$
where
• Ŷi is the predicted value of the outcome Y for the ith person – in our case, the
predicted height of the child of the ith family (read Ŷ as ‘Y-hat’, the hat indicates
that this is a prediction).
• Xi is the value of the predictor variable for the ith person – here, the parents’
height in family i.
• α is called the intercept of the regression line; this is the value of Ŷ when X = 0.
• β is called the slope of the regression line; this is the predicted difference in Y for
a 1-unit difference in X – in our case, the predicted height difference between two
children whose parents’ heights differ by 1 inch.
To understand how the regression equation works, let’s look at the equation for the
line in Figure 2.2. This is:
$$\hat{Y}_i = 24.526 + 0.638 X_i$$
If it helps, you may write this equation as follows:
$$\text{Predicted child's height} = 24.526 + 0.638 \times \text{Parents' height}$$
We can use this equation to derive a predicted height for a child, if we are given the
parents’ height. For example, take a child whose parents’ height is 64.5 inches. Plug-
ging that number into the regression equation, we get:
$$\hat{Y} = 24.526 + 0.638 \times 64.5 = 65.7$$
A child of parents with height 64.5 inches is predicted to be 65.7 inches tall. In the equa-
tion above, the ‘hat’ over Y indicates that this result is a prediction, not the actual height
of the child. This is important because the prediction is not perfect: not every child is
going to have exactly the height predicted by the regression equation. The aim of the
regression equation is to be right on average, not necessarily for every individual case.
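The same calculation can be written as a short Python function. This is only a sketch: the coefficients 24.526 and 0.638 come from the text, while the names INTERCEPT, SLOPE and predict_child_height are hypothetical labels chosen for this example.

```python
INTERCEPT = 24.526  # estimated intercept from the text (inches)
SLOPE = 0.638       # estimated slope from the text


def predict_child_height(parents_height):
    """Predicted adult height of a child (inches), given the parents' height (inches)."""
    return INTERCEPT + SLOPE * parents_height


prediction = predict_child_height(64.5)
print(f"Predicted height: {prediction:.1f} inches")
```

For parents' height 64.5 inches, the function returns the 65.7 inches of the worked example once the result is rounded to one decimal place.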
Regression coefficients: intercept and slope
Let us have a closer look at the intercept (α) and the slope (β) of the regression equa-
tion. Jointly, they are referred to as the coefficients. The coefficients are unknown but
can be estimated from the data. This is analogous to using a sample mean to estimate
a population mean, or estimating a correlation from a sample data set. Figure 2.3
provides an illustration of how the coefficients define a regression line.
[Diagram of the line Ŷ = α + βX, with X on the horizontal axis and Y on the vertical axis. The intercept (α) and the slope (β) are marked.]
Figure 2.3 An illustration of the regression line, its intercept and slope
Note. Intercept: the value of Y when X is zero. Slope: the predicted difference in Y for a 1-unit difference in X.
The intercept is the predicted value of Y at the point when X is zero. In our exam-
ple, the intercept is equal to 24.526. Formally, this means that the predicted height of
a person whose parents have zero height is 24.526 inches. As a prediction, this obvi-
ously does not make sense, because parents of zero height don’t exist. The intercept
is of scientific interest only when X = 0 is a meaningful data point.
The slope determines by how much the line rises in the Y-direction for a 1-unit
step in the X-direction. In our example, the slope is equal to 0.638. This means that
a 1-inch difference in parents’ height is associated with a 0.638-inch difference in the
height of the children. For example, if the Joneses are 1 inch taller than the Smiths,
the Joneses’ children are predicted to be taller than the Smiths’ children by 0.638
inches on average. In general, a positive slope indicates a positive relationship, and a
negative slope indicates a negative relationship. If the slope is zero, there is no
linear relationship between X and Y.
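The two interpretations can be checked numerically. In the sketch below, the parents' heights of 68 and 69 inches for the Smiths and the Joneses are hypothetical values chosen only for illustration; any two heights that differ by 1 inch would give the same difference in predictions.

```python
ALPHA_HAT = 24.526  # estimated intercept from the text
BETA_HAT = 0.638    # estimated slope from the text


def predicted_height(parents_height):
    """Predicted child's height (inches) from the fitted line."""
    return ALPHA_HAT + BETA_HAT * parents_height


smiths = 68.0   # hypothetical parents' height (inches)
joneses = 69.0  # 1 inch taller than the Smiths

# The difference between the two predictions equals the slope
gap = predicted_height(joneses) - predicted_height(smiths)
print(f"Predicted difference between the children: {gap:.3f} inches")

# The prediction at X = 0 equals the intercept
at_zero = predicted_height(0.0)
print(f"Prediction at X = 0: {at_zero:.3f} inches")
```

The intercept drops out when the two predictions are subtracted, which is why the difference depends only on the slope.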
Errors of prediction and random variation
The regression line allows us to predict the value of an outcome, given information
about a predictor variable. But the regression line is not yet a full statistical model. If
we had only the regression line, our prediction of the outcome would be determinis-
tic, rather than statistical. A deterministic model would be appropriate if we believed
that the height of a child was precisely determined by the height of their parents. But
we know that is not true: if it was, all children born to the same parents would end up
having the same height as adults. In Galton’s data, we see that most children do not
have exactly the height predicted by the regression line. There is variation around the
prediction. That is why we need a statistical model, not a deterministic one.
As we saw in Chapter 1, a full statistical model includes two parts: a systematic part
that relates Y to X and a random part that represents the variation in Y unrelated to
X. The linear regression model looks like this:
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
where
• Yi is the Y value of the ith individual.
• Xi is the X value of the ith individual.
• α and β are the intercept and the slope as before.
• α + βXi is the systematic part of the model; in our example, this represents the part
of a child’s height that is determined by their parents’ heights.
• εi is called the error (of the ith individual): it is the difference between the
observed value (Yi) and the predicted value (Ŷi). The errors represent the random
part of our model: this is the part of a child’s height that is determined by things
other than their parents’ heights.
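To make the two parts of the model concrete, the following sketch generates artificial heights from the model. It rests on assumptions that the text has not made at this point: the errors are drawn from a normal distribution with a standard deviation of 2 inches, and the estimated coefficients are treated as if they were the true α and β. The function name simulate_heights is hypothetical.

```python
import random


def simulate_heights(parent_heights, alpha, beta, error_sd, seed=0):
    """Return (x, systematic part, error, y) for each parents' height."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    rows = []
    for x in parent_heights:
        systematic = alpha + beta * x      # part of Y related to X
        error = rng.gauss(0.0, error_sd)   # part of Y unrelated to X (assumed normal)
        rows.append((x, systematic, error, systematic + error))
    return rows


# Two of the four families share the same parents' height (68.5 inches)
rows = simulate_heights([64.5, 68.5, 68.5, 70.5], alpha=24.526, beta=0.638, error_sd=2.0)
for x, systematic, error, y in rows:
    print(f"X = {x:.1f}  systematic = {systematic:.2f}  error = {error:+.2f}  Y = {y:.2f}")
```

The two families with identical parents' height receive the same systematic part but different errors, so their simulated children differ in height. A deterministic model could not produce that.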
Note that this regression equation looks just the same as the equation of the power
pose model in Chapter 1. But there is one difference. In Chapter 1, X was a dichoto-
mous variable – that is, a variable that could assume one of two values: 0 (for the
control group) or 1 (for the experimental group). In the model for Galton’s data,
however, X is a continuous variable, which, in our data, takes values between 63.5
inches and 73.5 inches.
The systematic part of our model, α + βXi, describes the part of the outcome that
is related to the predictor. In our example, we might say that the systematic part of
Galton’s regression represents the part of a child’s height that is inherited from the
parents. Galton did not know about genes, but today we might assume that a child
might inherit their height from their parents through two kinds of processes: nature
(genes) and nurture (experiences – e.g. nutrition and other living conditions during
the growth period, which might have been similar for the parents and their children
– e.g. because they each grew up in the same social class).
The error term, εi, represents the variation in Y that is not related to X. In our
example, such variation might be due to things such as:
• Differences between living conditions in the parents’ and the children’s growth
periods (e.g. due to changes in society and culture, historical events such as
famines, or changes in family fortune)
• The vagaries of genetic inheritance (different children inherit different genes from
the same parents)
• Other influences, some of which we either do not understand or that might be
genuinely random (governed by a probabilistic natural process, rather than a
deterministic one)
Importantly, the errors are the differences between the observed values Yi and the
predicted values Ŷi. That is, the errors tell you by how much the regression
prediction is off for a particular case. To see this mathematically, rearrange the regression
equation as follows:
$$\varepsilon_i = Y_i - (\alpha + \beta X_i) = Y_i - \hat{Y}_i$$
We will later see that the full specification of the statistical model will require us to make
certain assumptions about the errors. These assumptions are the topic of Chapter 3.
The true and the estimated regression line
When we conceptualise a model in the abstract, the coefficients and errors are con-
ceptualised as properties of the ‘true’ regression model, which is valid for the popula-
tion. In practice, however, we will only ever have information from a sample, and
we use that sample to estimate the coefficients. This is called fitting a model
to a data set, or (equivalently) estimating a model on a data set. Our model,
Yi = α + βXi + εi, specifies the sort of relationship between X and Y that we propose, or
wish to investigate. When we fit this model to a data set, we obtain estimates of the
model parameters α and β. The predictive equation that contains these estimates,
Ŷi = 24.526 + 0.638Xi, is called the fitted model, or the estimated model.
So just as we distinguish between the population mean μ and the sample mean x̄,
and between the population standard deviation σ and its sample estimate s, we
also need to distinguish between the parameters α and β in the true (population)
model and the estimates of these parameters, which we will call α̂ and β̂ (read
these as ‘alpha-hat’ and ‘beta-hat’, respectively). We also need to distinguish between
the errors (the departures from the true regression line) and the estimates of the
errors. This is because a regression line fitted to a sample of data is just an estimate
of the ‘true’ regression line proposed by our model. For this reason, we never directly
observe the errors. We must make do with the departures from our estimated regres-
sion line. We call these departures the residuals.
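The distinction between errors and residuals can be seen in a small simulation, where (unlike in real research) the true model is known. The sketch below makes several assumptions that do not come from the text: the 'true' values α = 24.5 and β = 0.64 are invented, the errors are drawn from a normal distribution with standard deviation 2, and the line is fitted by ordinary least squares, a standard estimation method that this passage has not yet introduced.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares estimates (alpha_hat, beta_hat) of a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta_hat = sxy / sxx
    alpha_hat = mean_y - beta_hat * mean_x
    return alpha_hat, beta_hat


rng = random.Random(1)
ALPHA, BETA = 24.5, 0.64  # hypothetical 'true' population values

# Simulate a sample of 200 families from the true model
xs = [rng.uniform(63.5, 73.5) for _ in range(200)]
errors = [rng.gauss(0.0, 2.0) for _ in xs]  # departures from the true line
ys = [ALPHA + BETA * x + e for x, e in zip(xs, errors)]

# Fit the model to the sample and compute departures from the estimated line
alpha_hat, beta_hat = fit_line(xs, ys)
residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys)]

print(f"true line:      alpha = {ALPHA}, beta = {BETA}")
print(f"estimated line: alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
print(f"first family:   error = {errors[0]:+.3f}, residual = {residuals[0]:+.3f}")
```

The estimates are close to, but not equal to, the true values, so each residual differs somewhat from the corresponding error. With real data only the residuals are available.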
Residuals
The residuals are the differences between the observed values of Y and the predicted val-
ues of Y from an estimated regression equation. Let’s return to Galton’s data. The regres-
sion line predicts a child’s height, given the height of the parents, using the equation
$$\hat{Y}_i = 24.526 + 0.638 X_i$$
As we have noted, the prediction is not perfect: although a few dots are exactly on the
regression line, most are not.
Have a look at Figure 2.4. I have given names to two of the children in Galton’s
data: Francis and Florence. Let’s consider Francis first. His parents are 64.5 inches tall.
Given this information, the regression equation predicts Francis’s height to be 65.7
inches, as we saw above in the section ‘The Regression Line’. But Francis actually
measures in at 63.3 inches, a bit shorter than the model predicts. Between Francis’s
actual height and the prediction, there is a difference of 2.4 inches. We call this dif-
ference a residual.
Formally, a residual is defined as the difference between the observed value and
the predicted value of the dependent variable, where the prediction comes from a
regression line estimated from a sample. We can express this definition in algebraic
symbols:
$$e_i = Y_i - \hat{Y}_i$$
It is customary to represent residuals with the letter e, to distinguish them from the
errors ε. We write ei if we want to refer to any particular residual (the residual of
person i). In Francis’s case, the calculation would go as follows:
$$e_{\text{Francis}} = Y_{\text{Francis}} - \hat{Y}_{\text{Francis}} = 63.3 \text{ inches} - 65.7 \text{ inches} = -2.4 \text{ inches}$$
Francis’s residual is a negative number, because Francis is shorter than our model predicts.
Now consider Florence. Her parents’ height is 68.5 inches. From this, we can cal-
culate that her predicted height is 68.2 inches. But Florence is in fact 71.2 inches tall.
Because she is taller than predicted, her residual is a positive number:
$$e_{\text{Florence}} = Y_{\text{Florence}} - \hat{Y}_{\text{Florence}} = 71.2 \text{ inches} - 68.2 \text{ inches} = 3.0 \text{ inches}$$
This residual tells us that Florence is 3.0 inches taller than the model predicts.
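The two residual calculations can be reproduced in a few lines of Python. This is a minimal sketch: the heights and coefficients are those given in the text, and the function names predict and residual are hypothetical.

```python
def predict(parents_height):
    """Predicted child's height (inches) from the fitted model in the text."""
    return 24.526 + 0.638 * parents_height


def residual(observed_height, parents_height):
    """Observed minus predicted height: e = Y - Y-hat."""
    return observed_height - predict(parents_height)


# (parents' height, child's observed height), both in inches
children = {"Francis": (64.5, 63.3), "Florence": (68.5, 71.2)}

residuals = {
    name: residual(child, parents) for name, (parents, child) in children.items()
}
for name, e in residuals.items():
    print(f"{name}: {e:+.1f} inches")
```

The sign convention follows the definition: a child shorter than predicted has a negative residual, and a child taller than predicted has a positive one.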
How to estimate a regression line
Now that we understand residuals, we can consider how the estimates of coefficients are
found. A residual represents how wrong the regression prediction is for a given individual.
[Scatter plot with regression line. Horizontal axis: Average height of parents (inches), 62 to 74. Vertical axis: Height of child (inches), 62 to 74. Two residuals are marked: eFrancis = −2.4 and eFlorence = 3.0.]
Figure 2.4 Illustration of residuals
“When I s-spring one,” says he, “it’ll be a joke, you can bet. I
won’t just shoot off somethin’ on the chance somebody’ll laugh. I’ll
study over it some, and kind of try it out in my mind, and maybe
repeat it out loud to myself a couple of times to see how it sounds
when I say it. That’s the way to do with jokes. Jokes is like dollars. A
good dollar is worth a hundred cents, but a bad dollar is apt to get
you s-s-shut up in jail. Or eggs,” says he. “You don’t have to crack a
joke to tell if it’s bad, like you do an egg.”
“I suppose,” says I, “that was a joke?”
“There’s folks would call it sich,” says he.
“Aw, come on,” says Binney. “Quit your jawin’ like old wimmin at a
knittin’-bee and git to work. What’s goin’ to be done?”
“I wisht I knew,” says Mark.
“If we found him he wouldn’t come back,” says Binney. “He’d be
afraid of the sheriff.”
Mark slapped his leg. “There’s somethin’ for us to d-d-do,” says
he. “We kin fix it so George dast come back.”
So he sent Binney after the mail, and Tallow to order in a car to
make a shipment, and him and I went off to see the deputy sheriff,
whose name was Whoppleham. Mostly you could find him down by
the blacksmith shop pitching horseshoes. He was about the best
horseshoe-pitcher in the county. He was there, all right, pitching
with old Jim Battershaw, and they was down on their knees
measuring from the peg to a couple of horseshoes with a piece of
string to find out which was the nearest, and quarreling about it as if
it was the most important thing that had happened in the world
since Noah built his ark. We waited for them to decide which
horseshoe was nearest, but they couldn’t decide, and they wouldn’t
call it even. I calc’late they’d have gone for the county surveyor to
measure them up scientific if just then Battershaw’s setter-dog and
Whoppleham’s shepherd-dog hadn’t got tired of waiting and started
an argument of their own. It was quite considerable of an argument,
and it come swinging and clawing and snarling right across the lot to
where the horseshoes was and settled down to business there. The
way them dogs clawed into the ground and kicked up the dust was a
caution, and old Battershaw and Whoppleham dancing around the
edge of it, hollering like all-git-out and trying to stop it.
Well, all of a sudden the setter give up the ship and tucked his tail
between his legs and scooted, with the shepherd after him lickety-
split. When they was gone and we looked at the peg and the
horseshoes there wasn’t anything left to argue about. Those dogs
had kicked them galley west and come nigh to digging up the peg. It
was a fine thing for both those men, because it gave them
something to argue about all the rest of their lives, with no chance
of having the argument settled. I’ll bet that in ten years they’ll still
be slanging and sassing each other about that game, each of them
insisting his horseshoe was the nearest. That’s the kind of old coots
they are.
Well, it gave Mark his chance to speak to Whoppleham, and he
done so.
“Mr. Sheriff,” says he, “kin I s-s-speak to you for a m-minute?”
“I’m busy,” says the sheriff.
“This is official b-b-business,” says Mark.
“Oh!... Hum!... Official, eh? Somebody been breakin’ the law
hereabouts? Out with it, young feller. Sheriff Whoppleham’s the man
for you.” He pointed down to the star on his suspenders and says:
“The people has confidence in me, I guess, or they wouldn’t never
have put me into this here position of trust and confidence. I guess
they knew who would be able to clean out the criminals of these
parts. They knowed a venturesome man when they seen one, and a
man that wouldn’t stop at nothin’ in the int’rests of justice. What
crime’s been did, and who done it?”
“We want to s-s-speak about George Piggins,” says Mark.
“Have you seen that there crim’nal? Eh? Where’s he hidin’? I know
he’s dangerous and desprit, but be I hesitatin’? Be I timid? I guess
not. Sheriff Whoppleham would be willin’ to face Jesse James and
drag him to jail by the whiskers. Lucky for them Western bandits I
never went out there to mix in. I’d have cleaned ’em up perty quick.”
“We don’t know where he is,” says Mark, “but we want to talk to
you about f-f-fixin’ up that hog-stealin’ so he can come home and
not be molested.”
“Fix it? How?”
“Well, Mr. Hooker’s got back his hog and no harm’s been done.
We f-f-figgered maybe you would be willin’ to call it square and let
George come home if he promised never to do it again.”
“Huh!” says the sheriff. “What’s everybody so doggone int’rested
into George for, all of a sudden? Nobody was excited about him none
a spell back, but now it looks like everybody seen all to once that
there wasn’t no harm in him and he ought to be let home without
havin’ to suffer for bein’ a miscreant. What’s the meanin’ of it?”
“Has somebody else been to see about him?” says Mark.
“I should smile,” says the sheriff. “Why, this mornin’ there was a
reg’lar delegation, and who d’you s’pose come along with them but
Hooker himself? Yes, sir. And they wanted the charge should be
dropped and George let home. I says to ’em that my job was
ketchin’ dangerous crim’nals, not pardonin’ ’em, and that they’d have
to thrash it out with the prosecutin’ attorney. So they went off to do
that.... What I want to know is, how do they expect a officer of the
law to do his duty and bring crim’nals to justice if folks goes around
gettin’ ’em let off by prosecutin’ attorneys? How? Eh? Well, then.
They’re cuttin’ into my trade, that’s what, and I hain’t goin’ to stand
for it. I’m goin’ out to ketch George Piggins before he gits pardoned,
that’s what I be, and I’m a-goin’ to drag him to jail dead or alive.
When I git him there they can do like they please, but my duty’ll be
did.”
Well, we saw there wasn’t any good hanging around there, so we
went along, and Mark was looking pretty serious.
“Wiggamore means b-business,” says he. “He hain’t lettin’ any
grass grow under his feet, is he?”
“Calc’late it was Wiggamore that tried to get George out of
trouble?”
“Of course it was,” says he, “and he’ll do it, too. Well, let him.
That saves us the t-trouble. While he’s botherin’ with that, we can be
l-lookin’ for George.”
“I wonder if Miss Piggins knows where he is?”
“’Tain’t likely,” says Mark. “I don’t b’lieve it, but we kin keep an
eye on her. George was always a powerful hungry f-feller, and if she
knows and he’s anywheres around, we’ll see her sneakin’ out with a
basket of grub.”
“She’d do it at night,” says I.
“Yes,” says he.
“So there’s nothin’ for us to do but wait,” says I.
“You n-never make no money waitin’,” says Mark. “We got to be
d-d-doin’ somethin’.”
“We’ll be kept busy to-day loadin’ that car.”
“Yes, and if we g-g-git an order for bowls and things from that
firm Zadok told us about, why, we’ll be busier ’n ever,” says he.
So we went back to the mill, and Binney was there, and so was
Tallow. The mail had come and there was a letter giving us an order
for bowls and turned stuff and asking us to ship at once. Mark said
the prices was as good as he expected, and better, and that if we
could keep on getting such prices we would make a nice lot of
money.
“How about a car?” he says to Tallow.
“Can’t git none,” says Tallow.
“Why can’t we git one? We got to git one.”
“Nobody in Wicksville can git one, nor nobody on this branch,
seems like. Somethin’s happened somewheres and there hain’t no
cars, and if there was we couldn’t have any, because the railroad has
let on to the agent here that he dassen’t accept any shipments to
the city. He said it was an embargo.”
“Embargo,” says Mark, “I wonder what one of them is?”
“Why,” says Tallow, “an embargo means when the railroad won’t
let you ship to a place or from a place or somethin’ like that.”
“How long is it goin’ to l-last?”
“Maybe a week, maybe a month, maybe all the year,” says Tallow.
“There hain’t enough cars to go around, and the railroad yards in the
city is crowded with cars that they can’t git men to unload, and that
kind of thing.”
“Hum!” says Mark. “Perty kettle of fish. Embargo. How in tunket
be we g-g-goin’ to send out stuff, then, I’d like to know?”
“We hain’t goin’ to,” says Tallow.
“But we g-got to. We jest got to.”
“They won’t let us.”
“There must be some kind of a way. We got to ship as f-f-fast as
we manufacture, and get the money back, or we can’t pay the men
and keep goin’. If we was held back from shippin’ for two weeks we
would be b-busted.”
“And Wiggamore would get the dam and the mill,” says I.
“He hain’t got ’em yet,” says Mark, “and he hain’t g-goin’ to get
’em.”
“What’ll we do,” says I, “drag our chair stock and bowls and
things around in carts? It would take quite a spell to git a car-load to
the city, or even to Bostwick, that way.”
“I don’t know how we’re goin’ to do it, but we’re goin’ to. You f-
fellers git to work and I’ll go and f-figger on this. We got to hit on
some scheme, and we got to hit on it right off. These here goods
has got to be shipped immediate, because we got to have the
money.”
So he went and sat down in the office, and I could see him
pinching his cheek and pulling his ear like he always does when he is
puzzling out something. He kept at it more than an hour, and then I
saw him come out and get a piece of wood and take out his jack-
knife to whittle. At that I got scairt, for he never whittles till he’s in
the last ditch. When everything else fails he takes to his jack-knife,
and when he does that it’s time to get worried.
He whittled and whittled and whittled, and nothing come of it.
You see, he hadn’t ever had any experience with railroads, and he
didn’t know what kind of a scheme would work with them.
He didn’t go home to dinner, but just called to me to stop at his
house and fetch him a snack. I knew what a snack meant for him,
so I fetched back three ham sandwiches and three jelly sandwiches
and two apples and a banana and a piece of apple pie and a piece of
cherry pie and a hunk of cake and about a quart of milk. He went at
them sort of deliberate and gradual, but the way they disappeared
was enough to make you think he was some kind of a magician.
Before you knew it the whole lot was gone and he was looking down
into the basket kind of sorrowful.
“What’s the m-matter, Plunk?” says he. “Was they short of grub at
home? Seems like the edge hain’t hardly gone off’n my appetite.”
“You’ve et enough to keep me for a week,” says I.
“Huh!” says he. “Well, a f-feller kin think better when he’s hungry,
they say.”
Hungry! I swan to Betsy if he hadn’t et a square meal for three
grown men.
He went to whittling again. About three o’clock he come out and
says, “Plunk, we got to go to the city.”
“What for?” says I.
“To git f-freight-cars,” says he.
“And fetch ’em home in our pockets, I s’pose,” says I.
“Maybe,” says he. “Git enough clothes to stay all night. We’ll catch
the five-o’clock t-train.”
“But what you goin’ to do?”
“I hain’t sure. But there’s somebody up to those head offices of
the r-r-railroad company that’s got a right to give us cars. I’m goin’
to f-f-find out who it is, even if it’s the President of the United States,
and I’m goin’ to find some way to make him give ’em to us.”
“They wouldn’t ever let a couple of kids in to see the head men,”
says I.
“They will,” says he.
“How d’you know?” says I.
“Because,” says he, “I’ll make ’em.”
“Don’t bite off more ’n you kin chaw,” says I.
“Look here,” says he, “are you g-g-goin’ to lay down on this job?
Because if you be I kin take Tallow or Binney. They won’t git cold f-f-
feet.”
“I’ll stick,” says I, “but we hain’t got a chance.”
“Anybody’s always got a chance,” says he. “Folks can make
chances. Anything that’s p-p-possible kin be done if you stick to it
and use your head. This here is p-possible and it’s necessary. I’m
goin’ to git them freight-cars.”
That was just like him. You couldn’t scare him and you couldn’t
discourage him. He would stick to anything till you sawed him loose.
I guess maybe there was some bulldog in him, or something. Maybe
he had had a meal of glue some day and that made him stick to
things. I don’t think I’ve ever seen him when he showed that he was
discouraged, and I really don’t believe he ever was discouraged. No,
sir; he got so interested in trying to do whatever it was that he
wanted to do that he forgot all about how hard it was. And I guess
that’s a good idea.
CHAPTER XIII
I like to ride on the cars pretty well, and so does Mark. There are
always such a heap of things to see out of the window, and such a
lot of different kinds of people right on the cars. It was about four
hours’ ride to the city, but it didn’t seem half that long, and I was
sorry when we got there. It was pretty dark when we walked out of
the depot into the street.
“Now what?” says I.
“B-bed,” says he.
“Where?” says I.
“Hotel,” says he.
“There’s one,” I says, pointing right across the street, so we took
our satchels and went over. There was a fellow behind a counter,
and when we came up he sort of grinned and says good evening.
“How much does it cost to sleep here?” says Mark.
“Two dollars and a half is our cheapest room.”
“For both of us?”
“I guess I can make it three and a half for two.”
“I g-guess you can’t,” says Mark. “The way I look at it, no two
boys can do three d-d-dollars and a half worth of sleepin’ in one
night. Hain’t there no cheaper places?”
“Lots of ’em, young man. There’s a tramps’ lodging-house down
the street where you can stay for ten cents.”
“Um!... Well, I calc’late what we want is somethin’ betwixt and
between. Somethin’ where we kin stay for about a dollar apiece.”
That seemed like an awful lot to spend just for sleeping. Why, in
the morning our two dollars would be gone and we wouldn’t have
anything to show for it. It seems like when you spend money you
ought to git something. I nudged Mark and says to him that it was
cheaper to stay awake, and we could use our dollars to-morrow to
buy something we could touch. But he says we got to sleep to be
fresh for business.
“I’ll tell you,” says the man behind the counter. “I’ve got a little
room without a bath, and if you can sleep two in a bed, you can
have it for two-fifty.”
“All r-right,” says Mark. “Kin we have breakfast here?”
“If you’ve got the money to pay for it.”
“Um!... But there’s places where we can git g-g-good grub
cheaper ’n you sell it, hain’t there?”
“Why, yes! There’s a good serve-self lunch up the street where
you can get a lot to eat for fifty cents. Say, what are you kids up to?
Running away from home?”
“Not that you can n-notice,” says Mark. “We’re here on b-
business. We come to see the p-president of that railroad across the
street.”
“Oh,” says the man, and he laughed right out. “You come to see
him, did you? Was he expecting you?”
“No.”
“Um!... Well, from all accounts, he’s a nice man to see—I guess
not. They say he eats a couple of men for breakfast every morning.
He keeps a baseball-bat on his desk, and hits everybody that comes
to see him a lick over the head. I see him every little while, and,
believe me, I’m glad I don’t have to mix in with him any. I expect
he’s the grouchiest man in town.”
“Sorry to hear it,” says Mark, “but I guess we kin m-make out to
git along with him s-somehow.”
“Want to go to your room?”
“Yes.”
Well, a boy with a uniform picked up our satchels and showed us
into the elevator and then went into our room first and lighted the
lights. Then he sort of stood around and eyed us like there was
something he wanted to say, but he didn’t say a word. We looked at
him right back, because we weren’t going to let on that we cared a
rap what any kid with a uniform on did or said. Pretty soon Mark
says:
“Well, was there anythin’ you was n-needin’?”
“Huh!” says the kid.
“What you hangin’ around for, anyhow?”
“I guess you hain’t traveled much,” says the boy.
“It hain’t p-p-part of your job to tell us, is it?”
“Did you ever hear of a tip?” says he.
“Tip?” says Mark.
“Most generally gentlemen gives us bell-boys a tip when we carry
their bags to their room,” says he.
“Tip of what?” says I. “I hain’t got no tip unless it’s the tip of my
nose.”
“A tip is money,” says the boy.
“We hired this here room for two dollars and a half, didn’t we?”
“Yes,” says he.
“We didn’t make no b-bargain with you about carryin’ satchels,
nor with the man at the counter, did we?”
“No,” says he. “Nobody does. But everybody gives tips. You got to
give tips.”
“Hain’t you p-paid wages for doin’ what you do?”
“Yes, but they hain’t enough.”
“Then,” says Mark, “you ought to make the hotel raise your pay
and not go t-t-tryin’ to gouge it out of folks that stays here.”
“Everybody does it,” says the boy. “You can’t never git nothin’
done in a hotel if you don’t tip.”
“Do you git a tip every time you carry a satchel?”
“Yes.”
“Now you look here. I got an idee you’re tryin’ to git somethin’
out of us ’cause we’re kids and come from Wicksville. I’m g-g-goin’
to f-find out. If it’s the custom, why, I’ll give you a tip ’cause I want
to do what’s right. But if you’re t-tryin’ to do us out of money, why,
you won’t git it. I’m goin’ to ask the man behind the counter.”
And that’s what he done. He went right down and asked, and the
man laughed like all-git-out and told Mark all about tips, and Mark
told him what he thought about them, and then he give the boy a
dime and we went to bed.
We went to sleep in a minute and it seemed like it wasn’t more
than a minute before we was awake again. Mark woke up first and
gouged me in the ribs till I woke up. Then we dressed.
“It’s f-five o’clock,” says Mark. “We want to git our breakfast and
hustle. You kin bet a man with a big job on a r-r-railroad is down to
work early. He’d have to be. Maybe we kin s-see the man we want
about six o’clock and git an early train home.”
So we went to a serve-self place where you didn’t eat off of a
table, but off of the arm of your chair, and we et quite a good deal
and it was good. Then we came back to the railroad station and it
was just six o’clock. There wasn’t many folks around, but we found a
man in a uniform and Mark asked him who was boss of all the
freight-cars. The man told him he guessed the general freight agent
was, and Mark says, “Where’s his office?”
The man told him and Mark went there with me. It was shut up
tight. We waited and kept on waiting, and in about an hour a man
came along with overalls and a cap that said something on the front
of it.
“Hey, mister!” says Mark. “We’re waitin’ to see the general freight
agent. What’s the m-m-matter with him? Is he sick or somethin’?”
“Him!” says the man. “No, he hain’t sick. What makes you think
he is?”
“’Cause he hain’t down to work.”
“Did you expect to see him at seven o’clock in the mornin’?”
“To be sure.”
“Well, you come back again about nine and maybe he’ll be here
by that time. He usually gits around about nine.”
“Nine,” says Mark. “Why, that’s ’most n-noon.”
The man let out a laugh.
“How long does he work in the afternoon?” says Mark.
“Oh, he goes to lunch about one o’clock, and gets back around
half past two, and then he sticks to the job maybe till four.”
“Honest?” says Mark.
“Honest,” says the man.
“Well, I’ll be dinged!” says Mark. “And they pay him a r-r-reg’lar
day’s wages for that? Him workin’ maybe five hours a day?”
“If you got his salary, kid, you could buy a railroad for yourself.”
The man went along, and we kept on waiting, but Mark couldn’t
get it out of his head how a man with an important job could hang
onto it and do such a little mite of work. He said he guessed maybe
he’d get him a job like that some day where he just had to work five
hours. He said he’d do all that work in a stretch and then go out for
dinner, and in the afternoon he would have him another job just like
it, and work ten hours a day and make twice as much. I thought
that was a pretty good idea myself.
It was all of nine o’clock when that man came, though there was
folks working under him that came a little earlier. We kept asking if
he was there until a man told us we was a doggone nuisance and
that the boss wouldn’t see us, anyhow. And that’s just what
happened. When he got there we asked if we could see him, and the
man that was near the gate in the office asked what our business
was, and we told him, and he said we couldn’t bother the boss with
it. Mark said he guessed maybe the boss better be told we was
there, anyhow, and after quite a lot of fuss the man went and told
him, and then came back to say the boss was busy and couldn’t see
us. He told us there wasn’t any use hanging around, because we
wouldn’t ever get to see him.
That looked pretty bad, and Mark was as mad as could be. He
said we had a right to see that man, and that it wasn’t decent or
good business for him to refuse to see us. But that didn’t mend
matters. We could git as mad as we wanted to, but that wouldn’t get
us a minute’s talk with the freight agent.
“I’ll b-bet there’s somebody kin m-make him see us,” says Mark.
“The p-p-president of this railroad’s a bigger man than the freight
agent, and we’ll git him to fix it for us.”
I says to myself that if we couldn’t get to see one it was mighty
funny if we could get to see the biggest man of all; but Mark was
bound to try, so we found out where the president’s office was and
went up there. It was half past nine and he wasn’t to work yet.
“When’ll he be here?” says Mark.
“Maybe ten o’clock,” says a man that was working outside the
president’s door.
“Ten,” says Mark, “um!... And how long does he stay?”
“Oh, he’ll be around maybe till one, and then he gets lunch and
you can’t tell how long he’ll be out. Then he goes home mostly
about three or half past.”
“Goodness!” says Mark to me. “I hain’t goin’ to be any f-f-freight
man. I’m goin’ to be a p-p-president. Looks like he only works three
hours, and maybe he gets p-paid three or four thousand dollars for
it. Why, any feller could have three jobs like that, workin’ one right
on the end of the other, and doin’ nine hours’ work a day! I could git
rich doin’ that.”
So we waited some more, and after a while in come a slender
man with white hair and a cane, all dressed up like he was going to
a party instead of coming to work. Everybody acted like they was
afraid of him when he came in, and pertended to be mighty busy. He
didn’t speak to anybody, but just marched through into his own
room and scowled like anything. He looked like he was a regular
man-eater.
“Was that him?” says Mark.
“Yes.”
“Well, will you tell him that I want to t-t-talk to him?”
“Who are you and what do you want?”
Mark told him.
“I dassen’t bother him with that,” says the man. “He looks savage
to-day. He might discharge me right off.”
“But I’ve got to see him. It’s important. It’s awful important.”
“I’ll try it,” says the man, “but there isn’t a chance.”
So he went to the door and rapped and put in his head. We heard
a man roar.
“Get out of here!” he bellowed. “Shut that door! Get out! I won’t
see anybody this morning! Understand? Get out and stay out!”
The man came back and says, “There, you see.”
We did see, all right, and I was discouraged. Maybe Mark was,
too, but he didn’t show it. He just looked madder than ever.
“I’m goin’ to s-s-see that man,” says he, and we went out of that
room into the long corridor. There we stopped and stood looking out
of the window.
In about two minutes Mark says, “Dast you t-t-try it, Plunk?”
“Yes,” says I. “What?”
“Look at that fire-escape. See how it goes along right past that
room we were in. The p-president’s office is next and it goes p-p-
past his window. We kin git in that way.”
“He’d throw you off into the street,” says I.
“He couldn’t l-lift me,” says he, and grinned.
“Well,” says I, “I’m willin’ to go second if you’ll go first.”
“Come on,” says he.
In two jerks of a lamb’s tail we pushed up the window and got
onto the fire-escape. Then we skittered along it, ducking past
windows as quick as we could, until we were in front of a window
that we judged was in the president’s room. We looked in. Sure
enough, there he was leaning back in his chair and scowling and
smoking like a chimney. His window was up a little from the bottom,
but not enough for us to get in. We stood and watched him a
minute. Then Mark says, “Here goes.”
He rapped loud on the window and then pushed it up.
“Good m-m-mornin’!” says he. “Kin we come in?”
The president looked at us like he was seeing spooks or
something, and rubbed his eyes and jumped up, and Mark says:
“Don’t be scairt. We hain’t f-f-figgerin’ on hurtin’ you.”
With that both of us got into the room and walked over toward
him. He didn’t say a word, but just stared and scowled.
“We come to see you on b-b-business,” says Mark, “but they
wouldn’t let us in. We had to see you, so here we are.”
“I see you’re here,” says he, sharp and savage. “Now let me see
you get out again. Quick!”
I was ready to turn tail and skedaddle, but not Mark. He walked
right over to that president just like he was anybody common and
says:
“I’m s-sorry, sir, if we b-bother you. But I’ve got to t-talk to you a
minute. We can’t get to see anybody, and if we can’t get f-fixed up
we are goin’ to bust.”
The man scowled worse than ever and took a step toward Mark,
but Mark never give back an inch.
“I’ll have you thrown out,” says the man.
“If you say you won’t t-t-talk to us,” says Mark, “and if you can
feel down in your heart that you’re doin’ right, why, we’ll go without
b-bein’ thrown. But we was sure that a man couldn’t get to be p-
president of a whole railroad unless he was fair and square. That’s
why we come right to you. We sort of had confidence, sir, that you
was goin’ to see that what was right was done.... But if you don’t
feel that way about it, why, we’ll be g-g-goin’ along.”
He turned then and went over toward the door. The man didn’t
say a word till we were almost there, then he says, “Hold on there!”
We stopped.
“What do you know about what is fair and what isn’t, or what is
good business and what isn’t?”
“I may not know much about b-b-business,” says Mark, “but
anybody knows what’s f-fair. Here I am—a customer of your railroad
just like a man that buys a steak from a b-butcher is a customer of
the butcher. If folks wouldn’t use your railroad to send stuff on you
would have to go out of b-business. It looks to me like I was doing
something you ought to appreciate when I ship a car of freight, and
that when I come to see you about railroad b-business, that is goin’
to put m-money into your p-pocket, the least you could do and be
fair would be to l-listen. I’m always mighty anxious to keep my
customers feelin’ f-f-friendly toward me.”
“H’m!” says the president.
Mark went on along toward the door and never looked back.
“Just a minute,” says the president. “What’s your hurry?”
“We thought you wanted us to g-go.”
“Come back here,” says he. “Come back here. What do you mean,
anyhow, coming into my office and talking to me like this? How dare
you talk to me like this?”
I tell you I was pretty scared, but I looked at Mark and his eyes
were twinkling.
“I know I was right about you, sir,” says he.
“Right? What do you mean?”
“That you was f-fair and square, sir.”
“H’m!” says the president. “Sit down and be quick. I haven’t any
time to waste. Tell me what you want and tell it briefly. No beating
around the bush.” Anybody would have thought he was going to bite
our heads off.
So Mark told him the whole thing from beginning to end, and he
told it quick. I hadn’t any idea so much could be told to anybody in
such a short time; but then I might have known Mark could do it if
he wanted to. When he got right down to business he could be
mighty brief, I’ll tell you.
“And that’s what you’ve dared to break into my office to bother
me with, is it? For a cent I’d have you thrown out. I don’t know but I
ought to do worse.”
Mark he never said a word, but just looked at the president
respectful and confident.
The president turned around to his desk and wrote, and then he
fairly threw a paper at Mark. “There,” says he. “Now git out.”
Mark looked at the paper and I looked over his shoulder. It said:
To all officials and employees of the P. G. R. R.: See to it
that the bearer, Mark Tidd, is provided with freight-cars at
any point to be transported to any other point in the
United States within twelve hours of a request. This order
is superior to all other rules or embargoes that may be at
this time in force.
And his name was signed.
“Thank you, sir,” says Mark, “and good-by.”
He never looked up, and I thought he wasn’t even going to nod
his head when we went out, but he called us back again. “D’you
know why I gave you that order?” says he.
“I think so, sir,” says Mark.
“Well, you don’t,” says the president, “but I’ll tell you. It’s because
you’ve got the most tremendous crust in the world. It’s because you
weren’t afraid, and it was because you had the backbone to force
your way in here and compel me to talk to you. That’s why. Now git.”
We got.
CHAPTER XIV
“Now,” says Mark Tidd when we were on the train again, “I guess
we kin go to work l-l-lookin’ for George Piggins.”
“Somethin’ else is apt to happen,” says I. “You can’t never tell.”
“I guess ’most everything has h-happened,” says he. “There hain’t
much more left.” Then all of a sudden he give me a poke in the ribs
and says, “Tod Nodder.”
“Eh?” says I.
“Tod Nodder,” says he.
“What about him? Tod Nodder hain’t no reason for pokin’ me
black and blue.”
“Who was he always loafin’ around with?”
“Why, George Piggins!” says I.
“Never seen one without the other, did you?”
“Not that I know of.”
“Well?” says he.
“Well yourself,” says I, “and see how you like it.”
“I mean,” says he, “that if anybody in the world knows where
George is, the feller is Tod Nodder.”
“Maybe so, but what does that git us?”
“If he knows where George is,” says Mark, “maybe we kin git s-s-
somethin’ out of him some way.”
“It’s worth t-tryin’,” says I.
“Anythin’s worth t-tryin’,” says he, “and everythin’s worth tryin’
when you’re in the fix we’re in. For a spell we’ll leave Silas Doolittle
Bugg to run the mill. I guess he kin l-look after the manufacturin’
end with what help we kin give, and put all our time on f-findin’
George. We know Wiggamore’s l-lookin’ for him, and Wiggamore’s
got money to look with. He kin hire men to do his lookin’. All we got
is us and what b-brains we got.”
“Admittin’ we got any,” says I.
It was evening when we got home, but we got hold of Binney and
Tallow and told them what had happened and how we was going to
get all the freight-cars we needed; and we planned how we would
meet next morning early, and two of us would keep watch on Miss
Piggins’s house and the other two would lay for Tod Nodder. Mark
and I were going after Nodder. That left it so that if anything
happened one of each couple could stay to watch while the other
went for help or to do any following that was necessary. Mark said it
would be a pretty good idea to keep an eye on Wiggamore or any
men that he had hanging around town.
That’s the way it turned out. Binney stayed to watch Miss Piggins.
Tallow went mogging after a strange man with fancy clothes that let
on he was a detective and was working for Wiggamore, and Mark
and I went to hunt up Tod Nodder.
You could ’most always tell where to find Tod. It was the place
where nobody would be like to come along and offer him a job. Tod
was the kind that always complained about not having work, and
then took mighty good care to hide somewheres where work
couldn’t find him. Lazy! Whoo! Why, he was so lazy when he fished
he did it with a night line, and then he hated to pull it in to take off
the fish!
We stopped at the mill a minute, and Silas Doolittle come up to
us, all excited.
“Say,” says he, “somebody was monkeyin’ around this mill last
night. I was passin’ about nine o’clock and I seen a light. I come
rushin’ right down. It looked like the light was ’way up toward the
roof. Well, I busted right in and went rampagin’ up-stairs, and before
I knowed I rammed right into a feller on the stairs. He was comin’
down as fast as I was goin’ up, and the way we come together
would ’a’ made a railroad accident jealous. He got the best of it,
though, for he was a-comin’ down-stairs. Yes, sir. He lammed right
into me and clean upset me so’s I rolled all the way down, and
doggone it if I didn’t leave about a peck of skin on them steps. Then
he trompled right over the top of me and skedaddled. I couldn’t
ketch him and I couldn’t find no harm he’d done. But after this I
calc’late I’ll sleep right here into this mill. That’s what I’ll do, and if
anybody comes fussin’ around I guess they’ll find out they got Silas
Doolittle Bugg to reckon with.”
“Mighty good idee,” says Mark. “Say, we got two freight-cars
comin’ in this m-mornin’. Git ’em loaded so’s they’ll ketch the noon
freight.”
“Have to have help,” says Silas.
“Hire some of them grocery-store loafers to help,” says Mark. “Us
f-fellers has got somethin’ mighty important to look after.”
Well, Mark and I started out then to get our eyes on Tod Nodder
and to keep them on him. He wasn’t so easy to find as we thought
he would be. Maybe that was because there was a man in town
trying to hire folks to do some work on the railroad. Tod would hide
away from such a man harder than he would hide from a tribe of
scalping Indians. He wasn’t at any of the usual loafing-places, and at
the livery-stable where he ’most generally slept they said they hadn’t
seen him since daylight. They said he started off somewheres about
four o’clock in the morning. Now when a man like Tod Nodder goes
somewheres at four o’clock in the morning there are lots of things
he might go to do, but there hain’t but one thing he’s very likely to
go for, and that’s fish.
After we had rummaged all around and couldn’t come across him
Mark says, “Well, the s-s-skeezicks must’a’ gone f-f-fishin’.”
“Where?” says I.
“Tod’s one of these p-pickerel fishermen,” says Mark. “Seems like
pickerel and him is mighty fond of each other. So,” says he, “I
calc’late we better make for the bayou.”
The bayou was a kind of elbow of the Looking-glass River that
flows into the main river just below town. When the railroad came
along they built right across that elbow, shutting it off into a kind of
a lake shaped like a letter U, and the banks was mostly swampy and
all overgrown with underbrush. Seems like the pickerel was fond of
hanging around in there, and folks who knew how to fish was always
hauling regular whoppers out of there. There was places where the
banks were high and where you could take a long pole and fish right
from the shore. We sort of figured Tod would pick out one of those
places if he was there, on account of its being less work than to row
out a boat.
Mark was always thinking ahead a little, so what does he do but
go past his house and stop for a lunch. He wasn’t going to be caught
out in the country somewheres without anything to eat, not if he
knew himself. Then we started off for the bayou, which wasn’t far.
We started in at the railroad on one end and just skirted the shore,
keeping our eyes open every inch of the way, and, sure enough,
along about half-way around we saw a bamboo fish-pole sticking
out.
“Injuns,” says Mark Tidd.
“Where?” says I.
“Everywhere. All around us. They’re a r-r-raidin’ party gittin’ ready
to bust out on the town and scalp everybody and carry off the
wimmin and children. We got to creep up on ’em and f-f-find out
their plans and warn Wicksville.”
“I don’t understand no Injun language,” says I.
“I do,” says he. “I learned ’most all the Injun languages when I
was a captive among them some time back.”
“Um!” says I. “I forgot about that. Come to think of it, I was one
of them captives, too. I kin speak Choctaw and Hog Latin and a lot
of them languages myself.”
“Good!” says he. “Now cautious if you want to keep any hair g-g-
growin’ on your head.”
We did pretty good. In ten minutes we was lying not a hundred
foot from Tod Nodder, and he hadn’t the least idea in the world that
anybody was within a mile of him. At that distance we could whisper
without any danger, so Mark leans over and says to me:


Linear Regression An Introduction To Statistical Models Peter Martin

  • 1. Linear Regression An Introduction To Statistical Models Peter Martin download https://ebookbell.com/product/linear-regression-an-introduction- to-statistical-models-peter-martin-47176696 Explore and download more ebooks at ebookbell.com
  • 7. THE SAGE QUANTITATIVE RESEARCH KIT Beginning Quantitative Research by Malcolm Williams, Richard D. Wiggins, and the late W. Paul Vogt is the first volume in The SAGE Quantitative Research Kit. This book can be used together with the other titles in the Kit as a comprehensive guide to the process of doing quantitative research, but it is equally valuable on its own as a practical introduction to completing quantitative research. Editors of The SAGE Quantitative Research Kit: Malcolm Williams – Cardiff University, UK Richard D. Wiggins – UCL Social Research Institute, UK D. Betsy McCoach – University of Connecticut, USA Founding editor: The late W. Paul Vogt – Illinois State University, USA
  • 8. LINEAR REGRESSION: AN INTRODUCTION TO STATISTICAL MODELS PETER MARTIN THE SAGE QUANTITATIVE RESEARCH KIT
  • 9. SAGE Publications Ltd, 1 Oliver’s Yard, 55 City Road, London EC1Y 1SP
    SAGE Publications Inc., 2455 Teller Road, Thousand Oaks, California 91320
    SAGE Publications India Pvt Ltd, B 1/I 1 Mohan Cooperative Industrial Area, Mathura Road, New Delhi 110 044
    SAGE Publications Asia-Pacific Pte Ltd, 3 Church Street, #10-04 Samsung Hub, Singapore 049483
    Editor: Jai Seaman
    Assistant editor: Charlotte Bush
    Production editor: Manmeet Kaur Tura
    Copyeditor: QuADS Prepress Pvt Ltd
    Proofreader: Elaine Leek
    Indexer: Cathryn Pritchard
    Marketing manager: Susheel Gokarakonda
    Cover design: Shaun Mercier
    Typeset by: C&M Digitals (P) Ltd, Chennai, India
    Printed in the UK
    © Peter Martin 2021
    This volume published as part of The SAGE Quantitative Research Kit (2021), edited by Malcolm Williams, Richard D. Wiggins and D. Betsy McCoach.
    Apart from any fair dealing for the purposes of research, private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, 1988, this publication may not be reproduced, stored or transmitted in any form, or by any means, without the prior permission in writing of the publisher, or in the case of reprographic reproduction, in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publisher.
    Library of Congress Control Number: 2020949998
    British Library Cataloguing in Publication data: A catalogue record for this book is available from the British Library
    ISBN 978-1-5264-2417-4
    At SAGE we take sustainability seriously. Most of our products are printed in the UK using responsibly sourced papers and boards. When we print overseas we ensure sustainable papers are used as measured by the PREPS grading system. We undertake an annual audit to monitor our sustainability.
  • 10. Contents
    List of Figures, Tables and Boxes
    About the Author
    Acknowledgements
    Preface
    1 What Is a Statistical Model?
      Kinds of Models: Visual, Deterministic and Statistical
      Why Social Scientists Use Models
      Linear and Non-Linear Relationships: Two Examples
      First Approach to Models: The t-Test as a Comparison of Two Statistical Models
      The Sceptic’s Model (Null Hypothesis of the t-Test)
      The Power Pose Model: Alternative Hypothesis of the t-Test
      Using Data to Compare Two Models
      The Signal and the Noise
    2 Simple Linear Regression
      Origins of Regression: Francis Galton and the Inheritance of Height
      The Regression Line
      Regression Coefficients: Intercept and Slope
      Errors of Prediction and Random Variation
      The True and the Estimated Regression Line
      Residuals
      How to Estimate a Regression Line
      How Well Does Our Model Explain the Data? The R² Statistic
      Sums of Squares: Total, Regression and Residual
      R² as a Measure of the Proportion of Variance Explained
      R² as a Measure of the Proportional Reduction of Error
      Interpreting R²
      Final Remarks on the R² Statistic
      Residual Standard Error
      Interpreting Galton’s Data and the Origin of ‘Regression’
      Inference: Confidence Intervals and Hypothesis Tests
  • 11.
      Confidence Range for a Regression Line
      Prediction and Prediction Intervals
      Regression in Practice: Things That Can Go Wrong
      Influential Observations
      Selecting the Right Group
      The Dangers of Extrapolation
    3 Assumptions and Transformations
      The Assumptions of Linear Regression
      Investigating Assumptions: Regression Diagnostics
      Errors and Residuals
      Standardised Residuals
      Regression Diagnostics: Application With Examples
      Normality
      Homoscedasticity and Linearity: The Spread-Level Plot
      Outliers and Influential Observations
      Independence of Errors
      What if Assumptions Do Not Hold? An Example
      A Non-Linear Relationship
      Model Diagnostics for the Linear Regression of Life Expectancy on GDP
      Transforming a Variable: Logarithmic Transformation of GDP
      Regression Diagnostics for the Linear Regression With Predictor Transformation
      Types of Transformations, and When to Use Them
      Common Transformations
      Techniques for Choosing an Appropriate Transformation
    4 Multiple Linear Regression: A Model for Multivariate Relationships
      Confounders and Suppressors
      Spurious Relationships and Confounding Variables
      Masked Relationships and Suppressor Variables
      Multivariate Relationships: A Simple Example With Two Predictors
      Multiple Regression: General Definition
      Simple Examples of Multiple Regression Models
      Example 1: One Numeric Predictor, One Dichotomous Predictor
      Example 2: Multiple Regression With Two Numeric Predictors
      Research Example: Neighbourhood Cohesion and Mental Wellbeing
  • 12.
      Dummy Variables for Representing Categorical Predictors
      What Are Dummy Variables?
      Research Example: Highest Qualification Coded Into Dummy Variables
      Choice of Reference Category for Dummy Variables
    5 Multiple Linear Regression: Inference, Assumptions and Standardisation
      Inference About Coefficients
      Standard Errors of Coefficient Estimates
      Confidence Interval for a Coefficient
      Hypothesis Test for a Single Coefficient
      Example Application of the t-Test for a Single Coefficient
      Do We Need to Conduct a Hypothesis Test for Every Coefficient?
      The Analysis of Variance Table and the F-Test of Model Fit
      F-Test of Model Fit
      Model Building and Model Comparison
      Nested and Non-Nested Models
      Comparing Nested Models: F-Test of Difference in Fit
      Adjusted R² Statistic
      Application of Adjusted R²
      Assumptions and Estimation Problems
      Collinearity and Multicollinearity
      Diagnosing Collinearity
      Regression Diagnostics
      Standardisation
      Standardisation and Dummy Predictors
      Standardisation and Interactions
      Comparing Coefficients of Different Predictors
      Some Final Comments on Standardisation
    6 Where to Go From Here
      Regression Models for Non-Normal Error Distributions
      Factorial Design Experiments: Analysis of Variance
      Beyond Modelling the Mean: Quantile Regression
      Identifying an Appropriate Transformation: Fractional Polynomials
      Extreme Non-Linearity: Generalised Additive Models
      Dependency in Data: Multilevel Models (Mixed Effects Models, Hierarchical Models)
  • 13.
      Missing Values: Multiple Imputation and Other Methods
      Bayesian Statistical Models
      Causality
      Measurement Models: Factor Analysis and Structural Equations
    Glossary
    References
    Index
  • 14. List of Figures, Tables and Boxes
    List of figures
      1.1 Child wellbeing and income inequality in 25 countries
      1.2 Gross domestic product (GDP) per capita and life expectancy in 134 countries (2007)
      1.3 Hypothetical data from a power pose experiment
      1.4 Illustrating two statistical models for the power pose experiment
      1.5 Partition of a statistical model into a systematic and a random part
      2.1 Scatter plot of parents’ and children’s heights
      2.2 Galton’s data with superimposed regression line
      2.3 An illustration of the regression line, its intercept and slope
      2.4 Illustration of residuals
      2.5 Partition of the total outcome variation into explained and residual variation
      2.6 Illustration of R² as a measure of model fit
      2.7 Galton’s regression line compared to the line of equal heights
      2.8 Regression line with 95% confidence range for mean prediction
      2.9 Regression line with 95% prediction intervals
      2.10 Misleading regression lines resulting from influential observations
      2.11 The relationship between GDP per capita and life expectancy, in two different selections from the same data set
      2.12 Linear regression of life expectancy on GDP per capita in the 12 Asian countries with highest GDP, with extrapolation beyond the data range
      2.13 Checking the extrapolation from Figure 2.12 by including the points for the 12 Asian countries with the lowest GDP per capita
      3.1 Illustration of the assumptions of normality and homoscedasticity in Galton’s regression
      3.2 An illustration of the normal distribution
      3.3 Histogram of standardised residuals from Galton’s regression, with a superimposed normal curve
      3.4 Histograms of standardised residuals illustrating six distribution shapes
  • 15.
      3.5 Normal q–q plot of standardised residuals from Galton’s regression
      3.6 Normal q–q plots of standardised residuals for six distribution shapes
      3.7 Spread-level plot: standardised residuals and regression predicted values from Galton’s regression
      3.8 Spread-level plots and scatter plots for four simulated data sets
      3.9 Illustration of a standard normal distribution, with conventional critical values
      3.10 Observations with the largest Cook’s distances from Galton’s regression
      3.11 Galton’s regression data with four hypothetical influential observations
      3.12 Life expectancy by GDP per capita in 88 countries
      3.13 Diagnostic plots for the linear regression of life expectancy on GDP per capita
      3.14 Life expectancy and GDP per capita – illustrating a linear regression on the logarithmic scale
      3.15 The curvilinear relationship between life expectancy and GDP per capita
      3.16 Diagnostic plots for the linear regression of life expectancy on log2(GDP)
      3.17 The shape of the relationship between Y and X in three common transformations, with positive slope coefficient (top row) and negative slope coefficient (bottom row)
      4.1 A hypothetical scatter plot of two apparently correlated variables
      4.2 Illustration of a confounder causing a spurious association between X and Y
      4.3 A hypothetical scatter plot of two apparently unrelated variables
      4.4 Illustration of a suppressor variable masking the true relationship between X and Y
      4.5 Five hypothetical data sets illustrating possible models for Mental Wellbeing predicted by Social Participation and Limiting Illness
      4.6 The distributions of Mental Wellbeing, Social Participation and Limiting Illness
      4.7 Scatter plot of Mental Wellbeing by Social Participation, grouped by Limiting Illness
      4.8 Mental Wellbeing, Social Participation and Limiting Illness: an illustration of five possible models for the National Child Development Study data
      4.9 Distributions of Neighbourhood Cohesion and Social Support scales
  • 16.
      4.10 Three-dimensional representation of the relationship between Mental Wellbeing, Neighbourhood Cohesion and Social Support
      4.11 Three regression lines for the prediction of Mental Wellbeing by Neighbourhood Cohesion, for different values of Social Support
      4.12 Three regression lines for the prediction of Mental Wellbeing by Neighbourhood Cohesion, Social Support and their interaction
      4.13 Comparing predictions of Mental Wellbeing from the unadjusted and adjusted models (Models 4.1 and 4.2)
      4.14 Distribution of ‘Highest Qualification’
      5.1 Fisher distribution with df1 = 5 and df2 = 7597, with critical region
      5.2 Normal q–q plot for standardised residuals from Model 5.3
      5.3 Spread-level plot of standardised residuals against predicted values from Model 5.3
    List of tables
      1.1 Testosterone change from a power pose experiment (hypothetical data)
      2.1 Extract from Galton’s data on heights in 928 families
      2.2 A typical regression results table (based on Galton’s data)
      3.1 The largest positive and negative standardised residuals from Galton’s regression
      3.2 Estimates from a simple linear regression of life expectancy on GDP per capita
      3.3 Logarithms for bases 2, 10 and Euler’s number e
      3.4 Calculating the base-2 logarithm for a selection of GDP per capita values
      3.5 Raw and log-transformed GDP per capita values for six countries
      3.6 Estimates from a simple linear regression of life expectancy on log2(GDP)
      4.1 Coefficient estimates for five models predicting Mental Wellbeing
      4.2 Coefficient estimates from a regression of Mental Wellbeing on Neighbourhood Cohesion and Social Support
      4.3 Coefficient estimates for the prediction of Mental Wellbeing by Neighbourhood Cohesion, Social Support and their interaction
  • 17.
      4.4 Coefficient estimates, standard errors and confidence intervals for two regression models predicting Mental Wellbeing
      4.5 A scheme for coding a categorical variable with three categories into two dummy variables
      4.6 A scheme to represent Highest Qualification by five dummy variables
      4.7 Hypothetical data set with five dummy variables representing the categorical variable Highest Qualification
      4.8 Estimates from a linear regression predicting Mental Wellbeing, with dummy variables representing Highest Qualification (Model 4.3)
      5.1 Coefficient estimates, standard errors and confidence intervals for a multiple regression predicting Mental Wellbeing (Model 5.1)
      5.2 Estimated coefficients for a regression of Mental Wellbeing on four predictors and two interactions (Model 5.2)
      5.3 Analysis of variance table for linear regression
      5.4 Analysis of variance table for a multiple regression predicting Mental Wellbeing (Model 5.1)
      5.5 Model comparison of Models 5.1 and 5.3
      5.6 Analysis of variance table for Models 5.1 and 5.3
      5.7 Multicollinearity diagnostics for Model 5.3
      5.8 The largest standardised residuals from Model 5.3
      5.9 Estimates from a linear regression predicting Mental Wellbeing (Model 5.3)
      5.10 Unstandardised and standardised coefficient estimates from Model 5.3
    List of boxes
      2.1 Types of Variables
      2.2 Galton and Eugenics
      2.3 Various Names for the Variables Involved in a Regression Model
      2.4 Finding the Slope and the Intercept for a Regression Line
      2.5 How to Calculate a Confidence Range Around the Regression Line
      2.6 How to Calculate a Prediction Interval
      3.1 The Normal Distribution and the Standard Normal Distribution
      3.2 Regression Diagnostics and Uncertainty
      3.3 Further Properties of the Normal Distribution
      3.4 Logarithms
  • 18.
      4.1 Variables From the National Child Development Study Used in Example 1
      4.2 Interactions in Regression Models
      4.3 Measurement of Neighbourhood Cohesion and Social Support in the NCDS
      5.1 Nested and Non-Nested Models
  • 20. About the Author
    Peter Martin is Lecturer in Applied Statistics at University College London. He has taught statistics to students of sociology, psychology, epidemiology and other disciplines since 2003. One of the joys of being a statistician is that it opens doors to research collaborations with many people in diverse fields. Dr Martin has been involved in investigations in life course research, survey methodology and the analysis of racism. In recent years, his research has focused on health inequalities, psychotherapy and the evaluation of healthcare services. He has a particular interest in topics around mental health care.
  • 22. Acknowledgements
    Thanks to Richard D. Wiggins, Malcolm Williams and D. Betsy McCoach for inviting me to write this book. To Amy Macdougall, Andy Ross, D. Betsy McCoach, Kalia Cleridou, Praveetha Patalay and Richard D. Wiggins for generously providing feedback on draft chapters. To the team at Sage for editorial support. To Brian Castellani for suggesting a vital phrase. To my colleagues for giving me time. To the staff of several East London cafés for space and warmth. To everyone I ever taught statistics for helping me learn. To Richard D. Wiggins for generous advice and encouragement over many years. To Pippa Hembry for being there.
    Thanks also to
      • The UNICEF MICS team for permission to use data from their archive (https://mics.unicef.org).
      • The Gapminder Foundation for making available data on life expectancy and GDP from around the world.
      • The UK Data Archive for permission to use data from the National Child Development Study.
    The data analyses reported in this book were conducted using the R Software for Statistical Computing (R Core Team, 2019) with the RStudio environment (RStudio Team, 2016). All graphs were made in R, in most cases using the package ggplot2. Other R packages used in the making of this book are catspec, gapminder, ggrepel, grid, knitr, MASS, plyr, psych, reshape2, scales, scatterplot3d, tidyverse.
  • 24. Preface
    This is a book about statistical models as they are used in the social sciences. It gives a first course in the type of models commonly referred to as linear regression models. At the same time, it introduces many general principles of statistical modelling, which are important for understanding more advanced methods. Statistical models are useful when we have, or aim to collect, data about social phenomena and wish to understand how different phenomena relate to one another. Examples in this book are based on real social science research studies that have investigated questions about:
      • Sociology of community: Do neighbourhoods with a more cohesive community spirit foster mental wellbeing for local people?
      • Demography and economics: Is it necessary for a country to get richer and richer to increase the health of its population?
      • Inequality and wellbeing: Is a country’s income inequality related to the wellbeing of its children?
      • Psychology: Can some people increase their feelings of confidence by assuming certain ‘power poses’?
    This book won’t give conclusive answers to these questions. But it does introduce some of the analytical methods that have been used to address them, and other questions like them. Specifically, this book looks at linear regression, which is a method for analysing continuous variables, such as a person’s height, a child’s score on a measure of self-rated depression or a country’s average life expectancy. Other types of outcome variables, such as categorical and count variables, are covered in The SAGE Quantitative Research Kit, Volume 8.
    Realistic data sets
    The examples in this book are based on published social science studies, and most analyses shown use the original data on which these source studies were conducted, or subsets thereof. Since the statistical analysis uses realistic data, the results reported are sometimes ambiguous, which is to say: what conclusions we should draw from the analysis may remain debatable. This highlights an important point about statistical models: in themselves, statistical models do not give you the answers to your
  • 25. research questions. What statistical analysis does provide is a principled way to derive evidence from data. This evidence is important, and you can use it in your argument for or against a certain conclusion. But all statistical results need to be interpreted to be meaningful.
    Prior knowledge useful for understanding this book
    This book is intended for those who have a thorough grounding in descriptive statistics, as well as in the fundamentals of inferential statistics. I assume throughout that you understand what I mean when I speak of means, standard deviations, percentiles, histograms, and scatter plots, and that you know the basic ideas underlying a t-test, a z-test and a confidence interval. Finally, I assume that you are familiar with some of the ways social science data are collected or obtained – surveys, experiments, administrative data sources, and so forth – and that you understand that all these methods have strengths and weaknesses that affect the conclusions we can draw from any analysis of the data. An excellent way to acquire the knowledge required to benefit from this book is to study Volumes 1 to 6 of The SAGE Quantitative Research Kit.
    Mathematics: equations, calculations, Greek symbols
    This book is intended for social scientists and students of social science who wish to understand statistical modelling from a practical perspective. Statistical models are based on elaborate and advanced mathematical methods, but knowledge of advanced mathematics is not needed to understand this book. Nonetheless, this book does require you, and possibly challenges you, to learn to recognise the essential equations that define statistical models, and to gain an intuitive understanding of how they work. I believe that this is a valuable skill to have. For example, it’s important to recognise the difference between the sort of equation that defines a straight line and another sort that defines a curve. As you will see, this is essential for the ability to choose an appropriate model for a given research question and data set. Attempts at using statistical models without any mathematical understanding carry a high risk of producing nonsensical and misleading results. So there will be equations. There will be Greek symbols. But there will be careful explanations of them all, along with graphs and illustrations to illuminate the maths. Think of the maths as a language that it’s useful to get a working understanding of. Suppose you decide to live in a foreign country for a while, and that you don’t yet
  • 26. know the main language spoken in this country. Suppose further that enough people in that country understand and speak your own language, so that most of the time you can get by using a language you are familiar with. Nonetheless, you will understand more about the country if you learn a little bit of its language. Even if you don’t aspire to ever speak it fluently, or write poetry in it, you may learn enough of it to enable you to understand a newspaper headline, read the menu in a restaurant and have a good guess what the native speakers at the next table are talking about. In a similar way, you don’t need to become an expert mathematician to understand a little bit of the mathematical aspect of statistical modelling, and to use this understanding to your advantage. So what’s needed to benefit from this book is not so much mathematical skill, but rather an openness to considering the language of mathematics as an aid to understanding the underlying logic of statistical modelling.
    Software
    This book is software-neutral. It can be read and understood without using any statistical software. On the other hand, what you learn here can be applied using any statistical software that can estimate regression models. In writing this book, I used the free open-source software R (R Core Team, 2019). Other statistical packages often used by social scientists for linear regression models are Stata, SPSS and SAS.
    Web support pages with worked examples
    It is generally a good idea to learn statistics by doing it – that is, to work with data sets and statistical software and play around with fitting statistical models to the data. To help with this, the support website for this book supplies data sets for most of the examples used in this book and gives worked examples of the analyses. The support website is written in the R software. R has the advantage that it can be downloaded free of charge, and that it has a growing community of users who write new add-on packages to extend its capability, publish tutorials, and exchange tips and tricks online. However, if you prefer to use a different software, or if you are required to learn a different software for a course you are attending, you can download the data sets from the support website and read them into your software of choice. Instructions for this, as well as instructions on how to download R for free, are given on the support website. Head to: https://study.sagepub.com/quantitativekit
  • 27. References
    R Core Team. (n.d.). R: A language and environment for statistical computing. R Foundation for Statistical Computing. www.R-project.org
    RStudio Team. (2016). RStudio: Integrated development for R. RStudio. www.rstudio.com
  • 28. 1 What Is a Statistical Model?
    Chapter Overview
      Kinds of models: visual, deterministic and statistical
      Why social scientists use models
      Linear and non-linear relationships: two examples
      First approach to models: the t-test as a comparison of two statistical models
      The signal and the noise
  • 29. What is a statistical model? This chapter gives a first introduction. We start by considering the concept of a ‘model’ in areas other than statistics. I then give some examples of how statistical models are applied in the social sciences. Finally, we see how a simple parametric statistical hypothesis test, the t-test for independent samples, can be understood as a systematic comparison of two statistical models. An important aim of this chapter is to convey the notion that statistical models can be used to investigate how well a theory fits the data. In other words, statistical models help us to systematically evaluate the evidence for or against certain hypotheses we might have about the social world.
    Kinds of models: visual, deterministic and statistical
    Models are simplified representations of systems, objects or theories that allow us to understand things better. An architect builds a model house to help herself and her clients imagine how the real house will look once it is built. The Paris metro map is a model that helps passengers understand where they can catch a train, where they can travel to and where they can change trains. Some models come in the form of mathematical equations. For example, consider an engineer who wants to design a submarine. The deeper the submarine dives, the higher the water pressure is going to be. The engineer needs to account for this lest the walls of his submarine crack and get crushed. As you may know, pressure is measured in a unit called bar, and the air pressure at sea level is equal to 1 bar. A mathematical model of the pressure experienced by a submarine under the surface of the sea is as follows:
    $$\text{Pressure} = 1\ \text{bar} + 0.1\ \text{bar} \times \text{depth}$$
    where depth is measured in metres. This equation expresses the insight that, with every metre that the submarine dives deeper, the water pressure increases by 0.1 bar. Using this model, we can calculate that the pressure at 100 m depth will be
    $$\text{Pressure} = 1\ \text{bar} + 0.1\ \text{bar} \times 100 = 11\ \text{bar}$$
    Thus, if your job is to build a submarine that can dive to 100 m, you know you need to build it so that its walls can withstand 11 bars of pressure (i.e. 11 times the pressure at sea level). Statistical models also are expressed in the form of equations. As you will see, a simple statistical model looks very similar to the mathematical model we just considered. The difference between the two is how they deal with the differences between what the model predicts about reality and observations from reality itself. The engineer who uses the mathematical model of pressure might be
  • 30. happy to ignore small differences between the model and reality. The pressure at 100 m is taken to be 11 bar. If it is really 10.997 bar or 11.029 bar, so what? The approximation is good enough for the engineer’s purposes. Such a model is called deterministic, because according to the model, the depth determines the pressure precisely. In contrast, statistical models are used in situations where there is considerable uncertainty about how accurate the model predictions are. This is almost always the case in social science, because humans, and the societies they build, are complex, complicated, and not predictable as precisely as some natural phenomena, such as the relationship between depth and water pressure. All models are simplifications of reality. The architect’s model house lacks many details. The Paris metro map does not accurately represent the distances between the stations. Our (simplistic) mathematical model of underwater pressure ignores that pressure at the same depth will not be the same everywhere the submarine goes (e.g. because the waters of different oceans vary in salinity). Whether these imprecisions matter depends on the purpose of the model. The Paris metro map is useful for travellers but not detailed enough for an engineer who wishes to extend the existing tunnels to accommodate a new metro line. In the same way, a statistical model may be good for one purpose but useless for another.
    Why social scientists use models
    Social scientists use statistical models to investigate relationships between social phenomena, such as:
      • Diet and longevity: Is what you eat associated with how long you can expect to live?
      • Unemployment and health: Is unemployment associated with poorer health?
      • Inequality and crime: Do countries with a wide income gap between the highest and the lowest earners have higher crime rates than more equal countries?
    Of course, descriptive statistics provide important evidence for social research. Tables and graphs, means and standard deviations, correlation coefficients and comparisons of groups – all these are important tools of analysis. But statistical models go beyond description in important ways:
      • Models can serve as formalisations of theories about the social world. By comparing how well two models fit a given set of data, we can rigorously assess which of two competing theories is more consistent with empirical observations.
      • Statistical models provide rigorous procedures for telling the signal from the noise: for deciding whether a pattern we see in a table or a graph can be considered evidence for a real effect, relationship, or regularity in the social world.
  • 31.
      • We can also use statistical models to develop specific predictions that we can test in a new data set.
      • Finally, statistical models allow us to investigate the influences of several variables on one or several others simultaneously.
    Linear and non-linear relationships: two examples
    So what sort of things do we use statistical models for? Have a look at Figure 1.1, which shows data on income inequality and child wellbeing in 25 of the richest countries of the world. Income inequality is measured by the Gini coefficient; a higher Gini coefficient indicates more unequal incomes. Child wellbeing is measured by the UNICEF (United Nations Children’s Fund) index; a higher number means better child wellbeing across the domains health, education, housing and environment, and behaviours.
    [Figure 1.1: scatter plot of 25 labelled countries with a fitted straight line; x-axis: Gini coefficient (2010); y-axis: UNICEF index of child wellbeing]
    Figure 1.1 Child wellbeing and income inequality in 25 countries
    Note. Gini coefficient: a higher coefficient indicates more income inequality. UNICEF index of child wellbeing: a higher number indicates better child wellbeing averaged over four dimensions: health, education, housing and environment, and behaviours. Data for Gini coefficient: Organisation for Economic Co-operation and Development (www.oecd.org/social/income-distribution-database.htm) and child wellbeing: Martorano et al. (2014, Table 15). This graph is inspired by Figure 1 in Pickett and Wilkinson (2007) but is based on more recent data. UNICEF = United Nations Children’s Fund.
  • 32. One way to describe these data is to draw attention to the positions of individual countries. For example, the Netherlands, Norway and Iceland are rated the highest on UNICEF’s index of child wellbeing, while Latvia, the USA and Greece are rated the lowest. The three countries with the highest income inequality are the USA, Latvia and the UK. The most egalitarian countries in terms of income are Slovenia, Norway and Denmark. Figure 1.1 also demonstrates a general pattern. The distribution of countries suggests that the more inequality there is in a country, the poorer the wellbeing of the children. As you may remember from The SAGE Quantitative Research Kit, Volume 2, this is called a negative relationship (as one variable goes up, the other tends to go down), and it can be represented by a correlation coefficient, Pearson’s r. The observed correlation between inequality and child wellbeing in Figure 1.1 is r = −0.70. We might also want to illustrate the relationship by drawing a line, as I have done in Figure 1.1. This line summarises the negative relationship we have just described. The line describes how the wellbeing of children in a country depends on the degree of a country’s economic inequality. The points don’t fall on the line exactly, but we may argue that the line represents a fair summary of the general tendency observed in this data set. This line is called a regression line, and it is a simple illustration of linear regression, a type of statistical model that we will discuss in Chapter 2. Every statistical model is based on assumptions. For example, by drawing the straight line in Figure 1.1, we are assuming that there is a linear relationship between inequality and child wellbeing. The word linear in the context of statistical models refers to a straight line. Curved lines are not considered ‘linear’.
    Judging from Figure 1.1, the assumption of linearity might seem reasonable in this case, but more generally many things are related in non-linear ways. Consider, for example, Figure 1.2, which shows the relationship between GDP (gross domestic product) per capita and life expectancy in 134 countries. The graph suggests that there is a strong relationship between GDP and life expectancy. But this relationship is not linear; it is not well represented by a straight line. Among the poorest countries, even relatively small differences in GDP tend to make a big difference in life expectancy. For the richer countries, even relatively large differences in GDP appear to affect life expectancy only a little, or maybe not at all. We may try to represent this relationship by drawing a curved line, as shown in Figure 1.2. This is a simple illustration of a non-linear model, representing a non-linear relationship. Like the line in Figure 1.1, the line in Figure 1.2 does not represent the relationship between GDP and life expectancy perfectly. For example, there are at least six African countries whose life expectancy is much lower than the line predicts based on these countries’ GDP. We will see later in the book how cases that don’t appear to fit our model can help us to improve our analysis.
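    The contrast between a straight-line and a curved summary can be tried out in a few lines of code. The following is only a sketch under stated assumptions: the numbers are simulated, not the Gapminder data behind Figure 1.2, and the names gdp, life_exp, fit_line and r_squared are invented for this illustration. It fits a least-squares line once to GDP itself and once to the base-2 logarithm of GDP (the transformation listed for Chapter 3), and compares how much of the variation in the outcome each line accounts for.

    ```python
    import math
    import random
    import statistics


    def fit_line(x, y):
        """Least-squares intercept and slope for the line y = intercept + slope * x."""
        mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        slope = sxy / sxx
        return mean_y - slope * mean_x, slope


    def r_squared(x, y):
        """Share of the variation in y accounted for by a straight line in x."""
        intercept, slope = fit_line(x, y)
        mean_y = statistics.fmean(y)
        ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
        ss_total = sum((yi - mean_y) ** 2 for yi in y)
        return 1 - ss_residual / ss_total


    # Simulated, hypothetical data (NOT the Gapminder figures): 134 "countries"
    # whose life expectancy rises steeply at low GDP and flattens at high GDP.
    rng = random.Random(1)
    gdp = [500 * 100 ** rng.random() for _ in range(134)]  # roughly 500 to 50,000
    life_exp = [5 + 4.8 * math.log2(g) + rng.gauss(0, 3) for g in gdp]

    r2_straight = r_squared(gdp, life_exp)                      # line in raw GDP
    r2_log = r_squared([math.log2(g) for g in gdp], life_exp)   # line in log2(GDP)

    print(f"Straight line in GDP:       R^2 = {r2_straight:.2f}")
    print(f"Straight line in log2(GDP): R^2 = {r2_log:.2f}")
    ```

    Because the simulated relationship was built to flatten out at high GDP, the line fitted on the logarithmic scale should describe these made-up points better than the line fitted to raw GDP. With real data, which summary is more adequate is an empirical question.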
  • 33. [Figure 1.2 Gross domestic product (GDP) per capita and life expectancy in 134 countries (2007). Horizontal axis: GDP per capita (US$). Vertical axis: life expectancy (years). Points are marked by continent: Africa, Americas, Asia, Europe, Oceania. Note. Data from the Gapminder Foundation (Bryan, 2017). See www.gapminder.org]

First approach to models: the t-test as a comparison of two statistical models

The practice of modelling often involves investigating which of a set of models gives the best account of the data. In this way, we might compare a linear model with a non-linear one, a simpler model with a more complex one, or a model corresponding to one theory with a model corresponding to another. As a first introduction to how this works, I will show you how an elementary hypothesis test, the t-test for independent samples, can be understood as a systematic comparison of two statistical models. The example will also introduce you to some simple mathematical notation that will be useful in understanding subsequent chapters.

The example concerns psychological aspects of the mind–body problem. Most of us have experienced that the way we hold our body can reflect the state of mind that we are in: when we are anxious our body is tense, when we are happy our body is relaxed, and so forth. But does this relationship work the other way around? Can
  • 34. we change our state of mind by assuming a certain posture? Carney et al. (2010) published an experimental study about what they called power poses. An example of a power pose is to sit on a chair with your legs stretched out and your feet resting on your desk, your arms comfortably crossed behind your neck. Let’s call this the ‘boss pose’. Carney et al. (2010) reported that participants who were instructed to hold a power pose felt more powerful subjectively, assumed a more risk-taking attitude and even had higher levels of testosterone in their bodies compared to other participants, who were instructed to hold a ‘submissive pose’ instead. The study was small, involving 42 participants, but it was covered widely in the media and became the basis of a popular TED talk by one of the co-authors.

A sceptic may have doubts about the study’s results. From a theoretical point of view, one might propose that the mind–body connection is a bit more complicated than the study appears to imply. Methodologically speaking, we may also note that with such a small sample (n = 42), there is a lot of uncertainty in any estimates derived from the data. Could it be that the authors are mistaking a chance finding for a signal of scientific value?

A scientific way to settle such questions is to conduct a replication study. For the sake of example, let’s focus on one question only: does assuming a power pose increase testosterone levels in participants, compared to assuming a different kind of pose? To test this, let’s imagine we conduct a replication of Carney et al.’s (2010) experiment. We will randomise respondents to one of two conditions:
    • The experimental group are instructed to assume a power pose, such as the ‘boss pose’ described above.
    • In contrast, the control group are asked to hold a submissive pose – the opposite of a power pose – such as sitting hunched, looking downwards, with hands folded between the thighs.
Before assuming their pose, the participants have their testosterone levels measured. They then hold their assigned pose for 2 minutes, after which time testosterone is measured again. The outcome variable is the difference in testosterone: testosterone after holding the pose minus testosterone before holding the pose. A positive value on this variable means that testosterone was higher after posing than before. A negative number means the opposite. Zero indicates no change.

Before we conduct the study, we might visualise what the data will look like. Figure 1.3 shows hypothetical distributions of testosterone difference for the two groups, the power pose group and the control group. Let’s begin by turning the two competing theories about power poses – either power poses can change testosterone levels or they cannot – into two different models that aim to account for these hypothetical data.
  • 35. [Figure 1.3 Hypothetical data from a power pose experiment. Change in testosterone is plotted for the control group and the power pose group.]

The sceptic’s model (null hypothesis of the t-test)

The sceptic doesn’t believe that power poses influence testosterone levels. So she predicts that, on average, the two groups have the same testosterone change. This is symbolised by the line on the left panel of Figure 1.4: the means for the two groups are predicted to be the same. The sceptic also recognises that not everyone may react to the experiment in the same way, however, and so she expects individual variation around the mean testosterone change. In brief, the sceptic says, ‘All we need to say about testosterone change in this experiment is that there is random variation around the overall mean. Nothing else to see here.’

[Figure 1.4 Illustrating two statistical models for the power pose experiment. Left panel: sceptic’s model. Right panel: power poses model. Both panels plot change in testosterone for the control group and the power pose group.]
  • 36. Let’s now look at how we can formalise this model using mathematical notation. The sceptic’s model can be written as follows:

Individual’s testosterone change = mean testosterone change + individual variation

In mathematical symbols, we might write the same equation as:

$$Y_i = \mu + \varepsilon_i$$

where
    • $Y_i$ refers to the testosterone change of the ith individual:
        ◦ for example, $Y_1$ is the testosterone change of the first person, $Y_5$ is the testosterone change of the fifth person, and so on.
    • $\mu$ is the population mean of testosterone change. This is denoted by the Greek letter $\mu$ (‘mu’).
    • $\varepsilon_i$ is the difference between the ith individual’s testosterone change and the mean $\mu$. For example, $\varepsilon_1$ is the difference between $Y_1$ and the mean $\mu$. This is denoted by the Greek letter $\varepsilon$ (‘epsilon’).

The equation thus represents each participant’s testosterone change ($Y_i$) as a combination of two components: the population mean $\mu$ and the participant’s individual deviation from that mean, $\varepsilon_i$.

The $\varepsilon_i$ are called the errors. This might be considered a confusing name, as the term error seems to imply that something has gone wrong. But this is not meant to be implied here. If we rearrange the sceptic’s model equation, we can see that the errors are simply the individual differences from the population mean:

$$\varepsilon_i = Y_i - \mu$$

The power pose model: alternative hypothesis of the t-test

Now let’s contrast the sceptic’s model with the power pose model. If you thought that holding a power pose can increase testosterone, you would predict that the mean change in testosterone is higher in the power pose group than in the control group. This is illustrated by the lines in the right panel of Figure 1.4: the means for the two groups are predicted to be different. In mathematical notation, we can write this model as follows:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
  • 37. where
    • $Y_i$ is, once again, the change in testosterone level for individual $i$.
    • $X_i$ is a variable that indicates the group membership of individual $i$; this variable can either be 0 or 1, where
        ◦ $X = 0$ indicates the control group and
        ◦ $X = 1$ indicates the power pose group.
    • $\alpha$ is a coefficient, which in this case represents the mean of the control group.
    • $\beta$ is a coefficient that represents the difference in average testosterone change between the power pose group and the control group.
    • $\varepsilon_i$ represents individual variation in testosterone change around the group mean.

In this type of model, we call $X$ the predictor variable, and $Y$ the outcome variable. $X$ is used to predict $Y$. In our example, the power pose model proposes that knowledge of an individual’s experimental group membership – did they adopt a power pose or not? – can help us predict that individual’s testosterone levels. (In other publications, you may find other names for $X$ and $Y$: $X$ may also be called the independent variable, the exposure or the explanatory variable; $Y$ may be called the dependent variable or the response.)

To understand how the model works, consider how the equation looks for the control group, where $X = 0$. We have

$$Y_i = \alpha + \beta \times 0 + \varepsilon_i = \alpha + \varepsilon_i$$

(Since the term $\beta \times 0$ is always zero, it can be left out.) So for the control group, the model equation reduces to $Y_i = \alpha + \varepsilon_i$. The coefficient $\alpha$ thus represents the mean of the control group.

For the power pose group, where $X = 1$, the model looks like this:

$$Y_i = \alpha + \beta \times 1 + \varepsilon_i = \alpha + \beta + \varepsilon_i$$

So the power pose model predicts that the mean of the power pose group is different from $\alpha$ by an amount $\beta$. Note that if $\beta = 0$, then there is no difference between the means of the power pose group and the control group. In other words, if $\beta = 0$, then the power pose model becomes the sceptic’s model (with $\alpha = \mu$).
The power pose hypothesis of course implies that the power pose group mean is higher than the control group mean, which implies that $\beta > 0$.

Using data to compare two models

So we have two competing models: the sceptic’s model and the power pose model. The sceptic’s model corresponds to the null hypothesis of a statistical hypothesis test; the power pose model corresponds to the alternative hypothesis.
  • 38. If we have conducted a study and observed data, we can estimate the coefficients of each model. We will use the hypothetical data displayed in Figure 1.3. Table 1.1 shows them as raw data with some descriptive statistics.

Table 1.1 Testosterone change from a power pose experiment (hypothetical data)

                                      Group
                              Control    Power Poses
                                  8.9           27.9
                                 15.7           13.0
                                −52.8            7.6
                                 16.9            6.7
                                −24.9          −24.9
                                −12.4           21.3
                                 14.0           11.8
                                 −9.9           41.2
                                 38.2           34.9
                                −14.6            9.1
                                 30.7           13.5
                                  1.3           35.0
                                 15.7           29.4
                                 20.1           10.8
                                 37.1            4.7
                                −21.5          −13.3
                                −36.5            3.7
                                 28.6           −6.5
                                 16.0            6.5
                                  0.2          −36.0
Mean                             3.54           9.81
Standard deviation              24.84          19.63
Overall mean                            6.68
Overall standard deviation             22.33
Pooled standard deviation              22.39

We will now use these data to estimate the unknown coefficients in the power pose model. We denote estimates of coefficients by putting a hat (^) on the coefficient symbols. Thus, we use $\hat{\alpha}$ (read ‘alpha-hat’) to denote an estimate of $\alpha$, and $\hat{\beta}$ (‘beta-hat’) to denote an estimate of $\beta$. Recall that the power pose model is as follows:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
  • 39. I said above that the coefficient $\alpha$ represents the population mean testosterone change of the control group in the power pose model. How best to estimate this coefficient? Intuitively, it makes sense to estimate the population mean by the sample mean we observe in our data. In this case,

$$\hat{\alpha} = \bar{Y}_{\text{Control}}$$

where $\bar{Y}_{\text{Control}}$ is the observed mean testosterone change in the control group. Similarly, I said above that the coefficient $\beta$ represents the difference between the mean testosterone change in the power pose group and the mean change in the control group. As the estimate of this, we are going to use the difference between the sample means of our two groups:

$$\hat{\beta} = \bar{Y}_{\text{Power pose}} - \bar{Y}_{\text{Control}}$$

From the descriptive statistics given in Table 1.1, we can calculate these estimates:

$$\hat{\alpha} = \bar{Y}_{\text{Control}} = 3.54$$

$$\hat{\beta} = \bar{Y}_{\text{Power pose}} - \bar{Y}_{\text{Control}} = 9.81 - 3.54 = 6.27$$

So the control group mean is estimated to be 3.54, and the difference between the experimental and the control group is estimated to be 6.27. Recall, however, that these estimates are based on the assumption that the power pose model is correct.

The sceptic, who disagrees with the power pose model, would argue that a simpler model is sufficient to account for the data. Recall that the sceptic’s model is as follows:

$$Y_i = \mu + \varepsilon_i$$

According to this model, there is no difference between the group means in the population. All we need to estimate is the overall group mean $\mu$. Again, to denote an estimate of $\mu$, we furnish it with a hat. And we will use the overall sample mean as the estimator:

$$\hat{\mu} = \bar{Y}_{\text{all}}$$

From Table 1.1, we have

$$\hat{\mu} = 6.68$$

Under the assumption of the sceptic’s model, then, we estimate that, on average, holding some pose for 2 minutes raises people’s testosterone levels by 6.68 (and it doesn’t matter what kind of pose they are holding).
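The three estimates above can be recomputed from the raw values in Table 1.1. The following is a minimal Python sketch, not part of the original text. The variable names are my own, and the two lists are transcribed on the assumption that each table row pairs one control value with one power pose value, in the order listed. Because the table reports values rounded to one decimal place, the results may differ from the figures in the text in the second decimal.

```python
from statistics import mean

# Testosterone change, transcribed from Table 1.1 (hypothetical data).
# Assumption: each table row holds one control value and one power pose value.
control = [8.9, 15.7, -52.8, 16.9, -24.9, -12.4, 14.0, -9.9, 38.2, -14.6,
           30.7, 1.3, 15.7, 20.1, 37.1, -21.5, -36.5, 28.6, 16.0, 0.2]
power_pose = [27.9, 13.0, 7.6, 6.7, -24.9, 21.3, 11.8, 41.2, 34.9, 9.1,
              13.5, 35.0, 29.4, 10.8, 4.7, -13.3, 3.7, -6.5, 6.5, -36.0]

# Power pose model: Y_i = alpha + beta * X_i + eps_i
alpha_hat = mean(control)                     # sample mean of the control group
beta_hat = mean(power_pose) - mean(control)   # difference between the group means

# Sceptic's model: Y_i = mu + eps_i
mu_hat = mean(control + power_pose)           # overall sample mean

print(f"alpha_hat = {alpha_hat:.2f}")
print(f"beta_hat  = {beta_hat:.2f}")
print(f"mu_hat    = {mu_hat:.2f}")
```

Both models are estimated from the same 40 observations; they differ only in whether group membership is allowed to shift the mean.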
The power pose model and the sceptic’s model estimate different coefficients, and the two models are contradictory: they cannot both be correct. Either power poses
  • 40. make a difference to testosterone change compared to submissive poses, or they do not. How do we decide which model is better? We will use the data to test the two models against each other. The logic goes like this:
    • We write down the model equation of the more complex model. In our case, this is the power pose model, and it is written as $Y_i = \alpha + \beta X_i + \varepsilon_i$.
    • We hypothesise that the simpler model is true. The simpler model is the sceptic’s, in our case. If the sceptic is right, this would imply that the coefficient $\beta$ in the model equation is equal to zero. So we wish to conduct a test of the hypothesis $\beta = 0$.
    • We then make assumptions about the data and the distribution of the outcome variable. These are the usual assumptions of the t-test for independent samples (see The SAGE Quantitative Research Kit, Volume 3):
        ◦ Randomisation: Allocation to groups has been random.
        ◦ Independence of observations: There is no relationship between the individuals.
        ◦ Normality: In each group, the sampling distribution of the mean testosterone change is a normal distribution.
        ◦ Equality of variances: The population variance is the same in both groups.
    • If the null model is true and all assumptions hold, the statistic

      $$t = \frac{\hat{\beta}}{s_{\hat{\beta}}}$$

      has a central t-distribution with mean zero and degrees of freedom $df = n_0 + n_1 - 2$, where $n_0$ and $n_1$ are the numbers of participants in the control group and the power pose group. With $s_{\hat{\beta}}$, I denote the estimated standard error of $\hat{\beta}$.
    • We calculate the observed t-statistic from the data. Then, we compare the result to the t-distribution under the null model. This allows us to calculate a p-value, which is the probability of obtaining our observed t-statistic, or one further away from zero, if the null hypothesis model is true.

I have already shown how to calculate $\hat{\beta}$ from the data; in the previous section, we found that $\hat{\beta} = 6.27$. But I haven’t shown how to calculate the estimated standard error of $\hat{\beta}$, which we denote by the symbol $s_{\hat{\beta}}$.
This standard error is a measure of the variability of $\hat{\beta}$. We can estimate this standard error from our data. In Chapter 2, I will show you how this is done. For now, I will ask you to accept that this is possible to do and to believe me when I say that $s_{\hat{\beta}} = 7.08$ for our data. Approximately, this means that if we conducted an infinite number of power pose experiments, each with exactly the same design and sample size n = 40, our estimate $\hat{\beta}$ would differ from the true value $\beta$ by 7.08 on average. The smaller the standard error, the more precise our estimates. So a small standard error is desirable.

But now to conduct the test. Using our values of $\hat{\beta}$ and $s_{\hat{\beta}}$, we have

$$t = \frac{\hat{\beta}}{s_{\hat{\beta}}} = \frac{6.27}{7.08} = 0.89$$
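As a cross-check, the standard error, the t-statistic and the p-value can be recomputed from the raw values in Table 1.1. The sketch below is not from the original text and rests on two assumptions. First, it uses the usual pooled-variance formula for the standard error of a difference between two independent means, $s_{\hat{\beta}} = s_{\text{pooled}}\sqrt{1/n_0 + 1/n_1}$, since the text defers its own derivation to Chapter 2. Second, it obtains the two-sided p-value by numerically integrating the t-density with Simpson’s rule, so that only the Python standard library is needed; the helper names `t_density` and `two_sided_p` are hypothetical.

```python
import math
from statistics import mean, variance

# Testosterone change, transcribed from Table 1.1 (hypothetical data).
control = [8.9, 15.7, -52.8, 16.9, -24.9, -12.4, 14.0, -9.9, 38.2, -14.6,
           30.7, 1.3, 15.7, 20.1, 37.1, -21.5, -36.5, 28.6, 16.0, 0.2]
power_pose = [27.9, 13.0, 7.6, 6.7, -24.9, 21.3, 11.8, 41.2, 34.9, 9.1,
              13.5, 35.0, 29.4, 10.8, 4.7, -13.3, 3.7, -6.5, 6.5, -36.0]

n0, n1 = len(control), len(power_pose)
df = n0 + n1 - 2                                   # degrees of freedom

beta_hat = mean(power_pose) - mean(control)

# Pooled variance: relies on the equal-variance assumption of the t-test.
pooled_var = ((n0 - 1) * variance(control)
              + (n1 - 1) * variance(power_pose)) / df
se_beta = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
t_stat = beta_hat / se_beta

def t_density(x, df):
    """Density of the central t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and +|t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps                              # steps must be even
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    area_0_to_t = total * h / 3
    return 1.0 - 2.0 * area_0_to_t

p_value = two_sided_p(t_stat, df)

print(f"df = {df}")
print(f"standard error = {se_beta:.2f}")
print(f"t = {t_stat:.2f}")
print(f"two-sided p = {p_value:.2f}")
```

Because the table values are rounded, small differences in the second decimal from the figures quoted in the text are to be expected. In practice a statistics library would normally supply the p-value; the numerical integration here only keeps the sketch self-contained.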
  • 41. So t = 0.89. With 38 degrees of freedom, this yields a two-sided p-value of 0.38 (or a one-sided p-value of 0.19). Since the p-value is quite large, we would conclude that there is little evidence, if any, for the power pose model from these data. Although in our sample there is a small difference in testosterone change between the power pose group and the control group, this is well within the range of random variability that we would expect to see in an experiment of this size. In the logic of the t-test for independent samples, we say that we have little evidence against the null hypothesis. In the logic of statistical modelling, we might say that the sceptic’s simple model seems to be sufficient to account for the data.

In the original experiment, Carney et al. (2010) did find evidence for an effect of power poses on testosterone (i.e. their p-value was quite small). In the language of model comparison, their conclusion was that $\beta$ is larger than zero. I made up the data used in this chapter, so these pages are not a contribution to the scientific literature on power poses. I do want to mention, however, that other research teams have tried to replicate the power pose effect. For example, Ranehill et al. (2015) used a sample of 200 participants to test the power pose hypothesis and found no evidence for an effect of power poses on risk taking, stress or testosterone, although they did find evidence that power poses, on average, increase participants’ self-reported feeling of power.

Further research may shift the weight of the overall evidence either way. These future researchers may employ statistical hypothesis tests, such as a t-test, without specifically casting their report in the language of statistical models.
But underlying the research will be the effort to try to establish which of two models explains better how the world works: the power pose model, where striking a power pose can raise your testosterone, or the sceptic’s model, according to which testosterone levels may be governed by many things but where striking a power pose is not one of them.

The signal and the noise

Statistics is the science of reasoning about data. The central problem that statistics tackles is uncertainty about the data generating process: we don’t know why the data are the way they are. If there is regularity in the way the world works, then research may generate data that make this regularity visible. For example, if it is the case that children in countries with more equal income distributions fare better than children in unequal countries, we would expect to see a relationship between a measure of income (in)equality and a measure of child wellbeing. But there are many other processes that influence how the data turn out. Measurement errors may cause the data to be inaccurate. Also, random processes
  • 42. may introduce variations. Examples of such random processes are random sampling or small variations over time, such as year-on-year variations in a country’s GDP, that are not related to the research problem at hand. Finally, other variables may interfere and hide the true relationship between child wellbeing and income inequality. Or they may interfere in the opposite way and bring about the appearance of a relationship, when really there is none.

Let’s make a distinction between the signal and the noise (Silver, 2012). The signal is the thing we are interested in, such as, say, the relationship between GDP and life expectancy. The noise is what we are less interested in but what is nonetheless present in the data: measurement errors, random fluctuations in GDP or life expectancy, and influences of other variables whose importance we either don’t know about or which we were unable to measure.

In fancier words, we call the signal the systematic part of the model and the noise the random part of the model. Recall the model we considered for the power poses, which is shown in Figure 1.5.

Figure 1.5 Partition of a statistical model into a systematic and a random part

$$Y_i = \underbrace{\alpha + \beta X_i}_{\text{Systematic part}} + \underbrace{\varepsilon_i}_{\text{Random part}}$$

The systematic part of the model is $\alpha + \beta X_i$. This specifies the relationship between the predictor and the outcome. The random part, $\varepsilon_i$, collects individual variation in the outcome that is not related to the predictor. It is this random part which distinguishes a statistical model from a deterministic one (e.g. the model of depth and water pressure we considered in the section ‘Kinds of Models’). When using statistical models, we aim to detect and describe the signal, but we also pay attention to the noise and what influence it might have on what we can say about the signal.
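The split into a systematic and a random part can be made concrete with a small simulation. The Python sketch below is an illustration only and is not taken from the text: the coefficient values, the error standard deviation, the choice of normally distributed errors and the function name `simulate_participant` are all assumptions made for the example.

```python
import random

# Hypothetical values chosen only for illustration; not estimates from the text.
ALPHA = 3.0    # mean change in the control group
BETA = 6.0     # difference between the power pose and control group means
SIGMA = 20.0   # standard deviation of the random part (normal errors assumed)

def simulate_participant(x, rng):
    """Return (systematic part, random part, outcome) for group indicator x."""
    systematic = ALPHA + BETA * x      # signal: fully determined by x
    noise = rng.gauss(0.0, SIGMA)      # noise: unrelated to x
    return systematic, noise, systematic + noise

rng = random.Random(2024)              # fixed seed so that a run can be repeated
groups = [0] * 5 + [1] * 5             # five control, five power pose participants
rows = [(x, *simulate_participant(x, rng)) for x in groups]

for x, systematic, noise, y in rows:
    print(f"X={x}  systematic={systematic:5.1f}  random={noise:7.2f}  Y={y:7.2f}")
```

Within each group the systematic part is identical for every simulated participant; all of the variation between participants in the same group comes from the random part. With an error standard deviation this large relative to the group difference, individual outcomes in the two groups overlap heavily, which is why the signal can be hard to detect in a small sample.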
  • 44. 2 Simple Linear Regression

Chapter Overview
    • Origins of regression: Francis Galton and the inheritance of height
    • The regression line
    • Regression coefficients: intercept and slope
    • Errors of prediction and random variation
    • The true and the estimated regression line
    • Residuals
    • How to estimate a regression line
    • How well does our model explain the data? The $R^2$ statistic
    • Residual standard error
    • Interpreting Galton’s data and the origin of ‘regression’
    • Inference: confidence intervals and hypothesis tests
    • Confidence range for a regression line
    • Prediction and prediction intervals
    • Regression in practice: things that can go wrong
    • Further Reading
  • 45. Linear regression is a statistical model that represents the relationship between two variables as a straight line. In doing so, a distinction is made between the outcome variable and the predictor variable, as we did in Chapter 1. Linear regression is appropriate for outcome variables that are continuous and that are measured on an interval or ratio scale. Box 2.1 gives an overview of the different kinds of variables that can feature in a statistical model. For measurement levels (nominal, ordinal, interval and ratio), see The SAGE Quantitative Research Kit, Volume 2.

This chapter considers simple linear regression, which is a linear regression with exactly one predictor variable. In Chapters 4 and 5, we will look at linear regression with more than one predictor.

Origins of regression: Francis Galton and the inheritance of height

The first regression in history was carried out in the late 19th century by Francis Galton, a half-cousin of Charles Darwin. One interest of Galton’s was the study of biological inheritance: how parents pass on their individual characteristics to their children.

Types of Variables

In statistics, variables are distinguished in various ways, according to their properties. An important distinction is made between numeric variables and categorical variables. Numeric variables have values that are numbers (1, 68.2, −15, 0.9, and so forth). Height is one such numeric variable. The values of categorical variables are categories. Country of birth is categorical, with values ‘Afghanistan’, ‘Albania’, ‘Algeria’ and so forth. Categorical variables are sometimes represented by numbers in data sets (where, say, ‘1’ means Afghanistan, ‘2’ means Albania, and so forth), but in that case, the number just acts as a label for a category and doesn’t mean that the variable is truly numeric.

Numeric variables, in their turn, are divided into continuous and discrete variables.
A continuous variable can take any value within its possible range. For example, age is a continuous variable: a person can be 28 years old, 28.4 years old or even 28.397853 years old. Age changes every day, every minute, every second, so our measurement of age is limited only by how precise we can or wish to be. Another example of a continuous variable is human height. In contrast, a discrete variable only takes particular numeric values. For example, number of children is a discrete variable: you can have zero children, one child or seven children, but not 1.5 children.

The outcome variable of a linear regression should be continuous. In practice, our measurement of continuous variables may make them appear discrete – for example, when we record height only to the nearest inch. This does not necessarily harm the estimation of our regression model, as long as the discrete measurement is not too coarse.

The predictor in a linear regression should be numeric and may be discrete or continuous. In Chapter 4, we will see how we can turn categorical predictors into numeric ‘dummy variables’ to enable us to include them in a regression model.

Box 2.1
  • 46. Among other things, he studied the relationship between the heights of parents and their children (once the children had grown up). To this end, he collected data from 928 families. An extract of the data is shown in Table 2.1, and Figure 2.1 illustrates the data.¹

Table 2.1 Extract from Galton’s data on heights in 928 families

Family Number (i)     Height of Parents (Average)    Height of Adult Child
1                                 66.5                        66.2
2                                 69.5                        67.2
3                                 68.5                        64.2
4                                 68.5                        68.2
5                                 70.5                        71.2
6                                 68.5                        67.2
…                                  …                           …
926                               69.5                        66.2
927                               69.5                        71.2
928                               68.5                        69.2
Mean                              68.30                       68.08
Standard deviation                 1.81                        2.54
Variance                           3.29                        6.44

Note. The means, standard deviations and variances deviate slightly from Galton’s original results, because I have added a small random jiggle to the data to make illustration and explanation easier.

[Figure 2.1 Scatter plot of parents’ and children’s heights. Horizontal axis: average height of parents (inches). Vertical axis: height of child (inches). Note. Data are taken from Galton (1886) via the ‘psych’ package for R (Revelle, 2020). Data points have been jiggled randomly to avoid overlap.]

¹ I use Galton’s original data, as documented in Revelle (2020), but I have added a slight modification. Galton recorded the heights in categories of 1-inch steps. Thus, most combinations of parents’ and child’s height occur more than once, which makes for an unattractive overlap of points in a scatter plot, and would generally have made the analysis difficult to explain. I have therefore added a small random jiggle to all data points. All my analyses are done on the jiggled data, not Galton’s original data. Therefore, for example, the means and standard deviations shown in Table 2.1 differ slightly from those in Galton’s original data. I have done this purely for didactic purposes.
  • 47. The scatter plot, Figure 2.1, provides a first look at the relationship between the parents’ and the children’s heights. Each dot represents one pair of measurements: the average height of the parents² and the height of the adult child.³ The scatter plot suggests that there is a positive, moderately strong relationship between parents’ and children’s heights. In general, taller parents tend to have taller children. Nonetheless, for any given parental height, there is much variation among the children.

One way of describing the relationship between parents’ and children’s heights is to calculate a correlation coefficient. If the relationship is linear, Pearson’s product moment correlation coefficient provides a suitable description of the strength and direction of the relationship. (You may remember this from The SAGE Quantitative Research Kit, Volume 2.) From Figure 2.1, it looks as though there is a linear relationship between the heights of parents and their children. Thus we are justified in calculating Pearson’s r for Galton’s data. Doing so, we obtain a Pearson correlation of r = 0.456. This confirms the impression gained from the scatter plot: this is a linear positive relationship of moderate strength.

Box 2.2 Galton and Eugenics

Francis Galton’s interest in heredity was linked to his interest in eugenics: the belief that human populations can and should be ‘improved’ by excluding certain groups from having children, based on the idea that people with certain heritable characteristics are less worthy of existence than others. Galton was a leading eugenicist of his time. In fact, it was he who coined the term eugenics.
Eugenicist ideas were widespread in the Western world in the early 20th century and in many countries inspired discriminatory policies such as forced sterilisation and marriage prohibition for people labelled ‘unfit to reproduce’, which included people with mental or physical disabilities. Historically, the eugenics movement had close ideological links with racism (Todorov, 1993), and pursued the aim of ‘purifying’ a population by reducing its diversity. Eugenicist ideas and practices were most strongly and ruthlessly adopted by the Nazi regime in Germany, 1933–1945. Like many Europeans of his time, Galton also held strong racist views about the supposed superiority of some ‘races’ over others. Galton thus leaves a complicated legacy: he was a great scientist (his scientific achievements reach far beyond regression), but he promoted ideas that were rooted in racist ideology and that helped to promote racism and discrimination. For further information about Galton, and how contemporary statisticians grapple with his legacy, see Langkjær-Bain (2019).

² From now on, I shall refer to the average of the parents’ height simply as ‘parents’ height’. Galton himself used the term height of the mid-parents.

³ To make male and female heights comparable, Galton multiplied the heights of females in his sample by 1.08.
  • 48. The regression line

Now let’s consider how to develop a statistical model. This goes beyond the correlation coefficient, as we now make a distinction between the outcome variable and the predictor variable. The outcome is the variable that we wish to explain or predict. The predictor is the variable we use to do so. Different books and texts use different names for the outcome and the predictor variables. Box 2.3 gives an explanation.

In making this distinction between the predictor and the outcome, we do not necessarily imply a causal relationship. Whether it is plausible to deduce a causal relationship from an observed correlation depends on many things, including knowledge about the research design and data collection process, as well as evidence from other studies and theoretical knowledge about the variables involved in the analysis. In our example, Galton wished to understand why people have different heights (the outcome) and thought he could find an explanation by considering the heights of people’s parents (the predictor). Our current scientific knowledge suggests that parents’ and children’s heights are indeed related due to common causes, including the genes shared by parents with their children as well as environmental and social factors such as nutrition, which tend to be more similar within families than between different families.

Because we have concluded that the relationship between parents’ and children’s heights is approximately linear, we propose a linear model: we will draw a straight line to represent the relationship in Galton’s data. Such a line is shown in Figure 2.2.

Various Names for the Variables Involved in a Regression Model

The outcome of a regression model is also known as the dependent variable (DV), or the response. The predictor is also sometimes called an independent variable (IV), an exposure or an explanatory variable.
The terminological fashion varies somewhat between disciplines. For example, psychology prefers the terms DV and IV, while in epidemiology, outcome and exposure are more commonly used. In this book, I shall use the terms outcome and predictor consistently. Box 2.3
  • 49. LINEAR REGRESSION: AN INTRODUCTION TO STATISTICAL MODELS 22

[Figure 2.2: Galton’s data with superimposed regression line. Horizontal axis: average height of parents (inches); vertical axis: height of child (inches).]

The regression line is a representation of the relationship we observe between a predictor variable (here, parents’ height) and an outcome variable (here, the height of the adult child). By convention, we call the predictor X and the outcome Y. The algebraic expression of a regression line is an equation of the form

$$\hat{Y}_i = \alpha + \beta X_i$$

where

  • $\hat{Y}_i$ is the predicted value of the outcome Y for the ith person – in our case, the predicted height of the child of the ith family (read Ŷ as ‘Y-hat’, the hat indicates that this is a prediction).
  • $X_i$ is the value of the predictor variable for the ith person – here, the parents’ height in family i.
  • α is called the intercept of the regression line; this is the value of Ŷ when X = 0.
  • β is called the slope of the regression line; this is the predicted difference in Y for a 1-unit difference in X – in our case, the predicted height difference between two children whose parents’ heights differ by 1 inch.

To understand how the regression equation works, let’s look at the equation for the line in Figure 2.2. This is:

$$\hat{Y}_i = 24.526 + 0.638 X_i$$
  • 50. If it helps, you may write this equation as follows:

$$\text{Predicted child's height} = 24.526 + 0.638 \times \text{Parents' height}$$

We can use this equation to derive a predicted height for a child, if we are given the parents’ height. For example, take a child whose parents’ height is 64.5 inches. Plugging that number into the regression equation, we get:

$$\hat{Y} = 24.526 + 0.638 \times 64.5 = 65.7$$

A child of parents with height 64.5 inches is predicted to be 65.7 inches tall. In the equation above, the ‘hat’ over Y indicates that this result is a prediction, not the actual height of the child. This is important because the prediction is not perfect: not every child is going to have exactly the height predicted by the regression equation. The aim of the regression equation is to be right on average, not necessarily for every individual case.

Regression coefficients: intercept and slope

Let us have a closer look at the intercept (α) and the slope (β) of the regression equation. Jointly, they are referred to as the coefficients. The coefficients are unknown but can be estimated from the data. This is analogous to using a sample mean to estimate a population mean, or estimating a correlation from a sample data set. Figure 2.3 provides an illustration of how the coefficients define a regression line.

[Figure 2.3: An illustration of the regression line $\hat{Y} = \alpha + \beta X$, its intercept (α) and slope (β).]

Note. Intercept: the value of Y when X is zero. Slope: the predicted difference in Y for a 1-unit difference in X.
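The prediction worked through above can be reproduced in a few lines of Python. This is only a sketch: the two coefficients are the estimates quoted in the text, while the function name `predict_child_height` is made up for illustration.

```python
# Estimated coefficients quoted in the text for Galton's data.
INTERCEPT = 24.526  # predicted Y when X = 0
SLOPE = 0.638       # predicted difference in Y for a 1-inch difference in X


def predict_child_height(parents_height):
    """Return the predicted child's height (inches) from the fitted line.

    The function name is hypothetical; it simply evaluates
    Y-hat = 24.526 + 0.638 * X.
    """
    return INTERCEPT + SLOPE * parents_height


prediction = predict_child_height(64.5)
print(f"Predicted height: {prediction:.1f} inches")  # Predicted height: 65.7 inches
```

Comparing `predict_child_height(65.5)` with `predict_child_height(64.5)` gives a difference equal to the slope, 0.638, which is exactly what the slope is said to mean.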
  • 51. The intercept is the predicted value of Y at the point when X is zero. In our example, the intercept is equal to 24.526. Formally, this means that the predicted height of a person whose parents have zero height is 24.526 inches. As a prediction, this obviously does not make sense, because parents of zero height don’t exist. The intercept is of scientific interest only when X = 0 is a meaningful data point.

The slope determines by how much the line rises in the Y-direction for a 1-unit step in the X-direction. In our example, the slope is equal to 0.638. This means that a 1-inch difference in parents’ height is associated with a 0.638-inch difference in the height of the children. For example, if the Joneses are 1 inch taller than the Smiths, the Joneses’ children are predicted to be taller than the Smiths’ children by 0.638 inches on average. In general, a positive slope indicates a positive relationship, and a negative slope indicates a negative relationship. If the slope is zero, there is no linear relationship between X and Y.

Errors of prediction and random variation

The regression line allows us to predict the value of an outcome, given information about a predictor variable. But the regression line is not yet a full statistical model. If we had only the regression line, our prediction of the outcome would be deterministic, rather than statistical. A deterministic model would be appropriate if we believed that the height of a child was precisely determined by the height of their parents. But we know that is not true: if it was, all children born to the same parents would end up having the same height as adults. In Galton’s data, we see that most children do not have exactly the height predicted by the regression line. There is variation around the prediction. That is why we need a statistical model, not a deterministic one.
As we saw in Chapter 1, a full statistical model includes two parts: a systematic part that relates Y to X and a random part that represents the variation in Y unrelated to X. The linear regression model looks like this:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

where

  • $Y_i$ is the Y value of the ith individual.
  • $X_i$ is the X value of the ith individual.
  • α and β are the intercept and the slope as before.
  • $\alpha + \beta X_i$ is the systematic part of the model; in our example, this represents the part of a child’s height that is determined by their parents’ heights.
  • $\varepsilon_i$ is called the error (of the ith individual): it is the difference between the observed value ($Y_i$) and the predicted value ($\hat{Y}_i$). The errors represent the random part of our model: this is the part of a child’s height that is determined by things other than their parents’ heights.
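To see how the systematic and the random part combine, here is a small simulation sketch in Python. Several things in it are assumptions made purely for illustration and do not come from the text: the estimated coefficients are treated as if they were the true α and β, the errors are drawn from a normal distribution, and the error standard deviation of 2 inches is an arbitrary choice.

```python
import random

ALPHA = 24.526   # intercept; the estimate from the text, treated here as if true
BETA = 0.638     # slope; same caveat
ERROR_SD = 2.0   # assumed spread of the errors in inches (not from the text)


def simulate_child_height(parents_height, rng):
    """Return (systematic part, error, observed Y) for one simulated child.

    Normally distributed errors are an assumption of this sketch.
    """
    systematic = ALPHA + BETA * parents_height   # alpha + beta * X_i
    error = rng.gauss(0.0, ERROR_SD)             # epsilon_i
    return systematic, error, systematic + error # Y_i = systematic + error


rng = random.Random(1)  # fixed seed so repeated runs give the same draws
for _ in range(3):
    systematic, error, observed = simulate_child_height(68.5, rng)
    print(f"systematic={systematic:.2f}  error={error:+.2f}  observed={observed:.2f}")
```

All three simulated children have the same parents’ height, so their systematic part is identical; only the error differs. This mirrors the point that children of the same parents do not all end up with the same height.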
  • 52. Note that this regression equation looks just the same as the equation of the power pose model in Chapter 1. But there is one difference. In Chapter 1, X was a dichotomous variable – that is, a variable that could assume one of two values: 0 (for the control group) or 1 (for the experimental group). In the model for Galton’s data, however, X is a continuous variable, which, in our data, takes values between 63.5 inches and 73.5 inches.

The systematic part of our model, $\alpha + \beta X_i$, describes the part of the outcome that is related to the predictor. In our example, we might say that the systematic part of Galton’s regression represents the part of a child’s height that is inherited from the parents. Galton did not know about genes, but today we might assume that a child might inherit their height from their parents through two kinds of processes: nature (genes) and nurture (experiences – e.g. nutrition and other living conditions during the growth period, which might have been similar for the parents and their children – e.g. because they each grew up in the same social class).

The error term, $\varepsilon_i$, represents the variation in Y that is not related to X. In our example, such variation might be due to things such as:

  • Differences between living conditions in the parents’ and the children’s growth periods (e.g. due to changes in society and culture, historical events such as famines, or changes in family fortune)
  • The vagaries of genetic inheritance (different children inherit different genes from the same parents)
  • Other influences, some of which we either do not understand or that might be genuinely random (governed by a probabilistic natural process, rather than a deterministic one)

Importantly, the errors are the differences between the observed values $Y_i$ and the predicted values $\hat{Y}_i$. That is, the errors tell you by how much the regression prediction is off for a particular case.
To see this mathematically, rearrange the regression equation as follows:

$$\varepsilon_i = Y_i - (\alpha + \beta X_i) = Y_i - \hat{Y}_i$$

We will later see that the full specification of the statistical model will require us to make certain assumptions about the errors. These assumptions are the topic of Chapter 3.

The true and the estimated regression line

When we conceptualise a model in the abstract, the coefficients and errors are conceptualised as properties of the ‘true’ regression model, which is valid for the population. In practice, however, we will only ever have information from a sample, and
  • 53. we use that sample to estimate the coefficients. This is called fitting a model to a data set, or (equivalently) estimating a model on a data set.

Our model, $Y_i = \alpha + \beta X_i + \varepsilon_i$, specifies the sort of relationship between X and Y that we propose, or wish to investigate. When we fit this model to a data set, we obtain estimates of the model parameters α and β. The predictive equation that contains these estimates, $\hat{Y}_i = 24.526 + 0.638 X_i$, is called the fitted model, or the estimated model.

So just as we distinguish between the population mean μ and the sample mean $\bar{x}$, and between the population standard deviation σ and its sample estimate s, we also need to distinguish between the parameters α and β in the true (population) model and the estimates of these parameters, which we will call $\hat{\alpha}$ and $\hat{\beta}$ (read these as ‘alpha-hat’ and ‘beta-hat’, respectively).

We also need to distinguish between the errors (the departures from the true regression line) and the estimates of the errors. This is because a regression line fitted to a sample of data is just an estimate of the ‘true’ regression line proposed by our model. For this reason, we never directly observe the errors. We must make do with the departures from our estimated regression line. We call these departures the residuals.

Residuals

The residuals are the differences between the observed values of Y and the predicted values of Y from an estimated regression equation. Let’s return to Galton’s data. The regression line predicts a child’s height, given the height of the parents, using the equation

$$\hat{Y}_i = 24.526 + 0.638 X_i$$

As we have noted, the prediction is not perfect: although a few dots are exactly on the regression line, most are not. Have a look at Figure 2.4. I have given names to two of the children in Galton’s data: Francis and Florence. Let’s consider Francis first. His parents are 64.5 inches tall.
Given this information, the regression equation predicts Francis’s height to be 65.7 inches, as we saw above in the section ‘The Regression Line’. But Francis actually measures in at 63.3 inches, a bit shorter than the model predicts. Between Francis’s actual height and the prediction, there is a difference of 2.4 inches. We call this difference a residual.

Formally, a residual is defined as the difference between the observed value and the predicted value of the dependent variable, where the prediction comes from a regression line estimated from a sample. We can express this definition in algebraic symbols:

$$e_i = Y_i - \hat{Y}_i$$
  • 54. It is customary to represent residuals with the letter e, to distinguish them from the errors ε. We write $e_i$ if we want to refer to any particular residual (the residual of person i). In Francis’s case, the calculation would go as follows:

$$e_{\text{Francis}} = Y_{\text{Francis}} - \hat{Y}_{\text{Francis}} = 63.3\ \text{inches} - 65.7\ \text{inches} = -2.4\ \text{inches}$$

Francis’s residual is a negative number, because Francis is shorter than our model predicts.

Now consider Florence. Her parents’ height is 68.5 inches. From this, we can calculate that her predicted height is 68.2 inches. But Florence is in fact 71.2 inches tall. Because she is taller than predicted, her residual is a positive number:

$$e_{\text{Florence}} = Y_{\text{Florence}} - \hat{Y}_{\text{Florence}} = 71.2\ \text{inches} - 68.2\ \text{inches} = 3.0\ \text{inches}$$

This residual tells us that Florence is 3.0 inches taller than the model predicts.

How to estimate a regression line

Now that we understand residuals, we can consider how the estimates of coefficients are found. A residual represents how wrong the regression prediction is for a given individual.

[Figure 2.4: Illustration of residuals. Horizontal axis: average height of parents (inches); vertical axis: height of child (inches). Marked residuals: $e_{\text{Francis}} = -2.4$, $e_{\text{Florence}} = 3.0$.]
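The two residual calculations above can be checked with a short Python sketch. The heights and coefficients are the ones given in the text; the function names `predict` and `residual` and the small dictionary of children are invented for this illustration.

```python
# Estimated coefficients of the fitted line quoted in the text.
INTERCEPT = 24.526
SLOPE = 0.638


def predict(parents_height):
    """Predicted child's height (inches) from the fitted regression line."""
    return INTERCEPT + SLOPE * parents_height


def residual(observed_height, parents_height):
    """Residual e_i = observed Y_i minus predicted Y-hat_i."""
    return observed_height - predict(parents_height)


# name: (parents' height, child's observed height), both in inches
children = {"Francis": (64.5, 63.3), "Florence": (68.5, 71.2)}

for name, (parents_height, child_height) in children.items():
    print(f"{name}: predicted {predict(parents_height):.1f}, "
          f"residual {residual(child_height, parents_height):+.1f}")
# Francis: predicted 65.7, residual -2.4
# Florence: predicted 68.2, residual +3.0
```

The sign carries the interpretation: a negative residual means the child is shorter than the line predicts, and a positive one means the child is taller.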
  • 56. “When I s-spring one,” says he, “it’ll be a joke, you can bet. I won’t just shoot off somethin’ on the chance somebody’ll laugh. I’ll study over it some, and kind of try it out in my mind, and maybe repeat it out loud to myself a couple of times to see how it sounds when I say it. That’s the way to do with jokes. Jokes is like dollars. A good dollar is worth a hundred cents, but a bad dollar is apt to get you s-s-shut up in jail. Or eggs,” says he. “You don’t have to crack a joke to tell if it’s bad, like you do an egg.” “I suppose,” says I, “that was a joke?” “There’s folks would call it sich,” says he. “Aw, come on,” says Binney. “Quit your jawin’ like old wimmin at a knittin’-bee and git to work. What’s goin’ to be done?” “I wisht I knew,” says Mark. “If we found him he wouldn’t come back,” says Binney. “He’d be afraid of the sheriff.” Mark slapped his leg. “There’s somethin’ for us to d-d-do,” says he. “We kin fix it so George dast come back.” So he sent Binney after the mail, and Tallow to order in a car to make a shipment, and him and I went off to see the deputy sheriff, whose name was Whoppleham. Mostly you could find him down by the blacksmith shop pitching horseshoes. He was about the best horseshoe-pitcher in the county. He was there, all right, pitching with old Jim Battershaw, and they was down on their knees measuring from the peg to a couple of horseshoes with a piece of string to find out which was the nearest, and quarreling about it as if it was the most important thing that had happened in the world since Noah built his ark. We waited for them to decide which horseshoe was nearest, but they couldn’t decide, and they wouldn’t call it even. I calc’late they’d have gone for the county surveyor to measure them up scientific if just then Battershaw’s setter-dog and Whoppleham’s shepherd-dog hadn’t got tired of waiting and started an argument of their own. 
It was quite considerable of an argument, and it come swinging and clawing and snarling right across the lot to where the horseshoes was and settled down to business there. The
  • 57. way them dogs clawed into the ground and kicked up the dust was a caution, and old Battershaw and Whoppleham dancing around the edge of it, hollering like all-git-out and trying to stop it. Well, all of a sudden the setter give up the ship and tucked his tail between his legs and scooted, with the shepherd after him lickety- split. When they was gone and we looked at the peg and the horseshoes there wasn’t anything left to argue about. Those dogs had kicked them galley west and come nigh to digging up the peg. It was a fine thing for both those men, because it gave them something to argue about all the rest of their lives, with no chance of having the argument settled. I’ll bet that in ten years they’ll still be slanging and sassing each other about that game, each of them insisting his horseshoe was the nearest. That’s the kind of old coots they are. Well, it gave Mark his chance to speak to Whoppleham, and he done so. “Mr. Sheriff,” says he, “kin I s-s-speak to you for a m-minute?” “I’m busy,” says the sheriff. “This is official b-b-business,” says Mark. “Oh!... Hum!... Official, eh? Somebody been breakin’ the law hereabouts? Out with it, young feller. Sheriff Whoppleham’s the man for you.” He pointed down to the star on his suspenders and says: “The people has confidence in me, I guess, or they wouldn’t never have put me into this here position of trust and confidence. I guess they knew who would be able to clean out the criminals of these parts. They knowed a venturesome man when they seen one, and a man that wouldn’t stop at nothin’ in the int’rests of justice. What crime’s been did, and who done it?” “We want to s-s-speak about George Piggins,” says Mark. “Have you seen that there crim’nal? Eh? Where’s he hidin’? I know he’s dangerous and desprit, but be I hesitatin’? Be I timid? I guess not. Sheriff Whoppleham would be willin’ to face Jesse James and drag him to jail by the whiskers. Lucky for them Western bandits I never went out there to mix in. 
I’d have cleaned ’em up perty quick.”
  • 58. “We don’t know where he is,” says Mark, “but we want to talk to you about f-f-fixin’ up that hog-stealin’ so he can come home and not be molested.” “Fix it? How?” “Well, Mr. Hooker’s got back his hog and no harm’s been done. We f-f-figgered maybe you would be willin’ to call it square and let George come home if he promised never to do it again.” “Huh!” says the sheriff. “What’s everybody so doggone int’rested into George for, all of a sudden? Nobody was excited about him none a spell back, but now it looks like everybody seen all to once that there wasn’t no harm in him and he ought to be let home without havin’ to suffer for bein’ a miscreant. What’s the meanin’ of it?” “Has somebody else been to see about him?” says Mark. “I should smile,” says the sheriff. “Why, this mornin’ there was a reg’lar delegation, and who d’you s’pose come along with them but Hooker himself? Yes, sir. And they wanted the charge should be dropped and George let home. I says to ’em that my job was ketchin’ dangerous crim’nals, not pardonin’ ’em, and that they’d have to thrash it out with the prosecutin’ attorney. So they went off to do that.... What I want to know is, how do they expect a officer of the law to do his duty and bring crim’nals to justice if folks goes around gettin’ ’em let off by prosecutin’ attorneys? How? Eh? Well, then. They’re cuttin’ into my trade, that’s what, and I hain’t goin’ to stand for it. I’m goin’ out to ketch George Piggins before he gits pardoned, that’s what I be, and I’m a-goin’ to drag him to jail dead or alive. When I git him there they can do like they please, but my duty’ll be did.” Well, we saw there wasn’t any good hanging around there, so we went along, and Mark was looking pretty serious. “Wiggamore means b-business,” says he. “He hain’t lettin’ any grass grow under his feet, is he?” “Calc’late it was Wiggamore that tried to get George out of trouble?”
  • 59. “Of course it was,” says he, “and he’ll do it, too. Well, let him. That saves us the t-trouble. While he’s botherin’ with that, we can be l-lookin’ for George.” “I wonder if Miss Piggins knows where he is?” “’Tain’t likely,” says Mark. “I don’t b’lieve it, but we kin keep an eye on her. George was always a powerful hungry f-feller, and if she knows and he’s anywheres around, we’ll see her sneakin’ out with a basket of grub.” “She’d do it at night,” says I. “Yes,” says he. “So there’s nothin’ for us to do but wait,” says I. “You n-never make no money waitin’,” says Mark. “We got to be d-d-doin’ somethin’.” “We’ll be kept busy to-day loadin’ that car.” “Yes, and if we g-g-git an order for bowls and things from that firm Zadok told us about, why, we’ll be busier ’n ever,” says he. So we went back to the mill, and Binney was there, and so was Tallow. The mail had come and there was a letter giving us an order for bowls and turned stuff and asking us to ship at once. Mark said the prices was as good as he expected, and better, and that if we could keep on getting such prices we would make a nice lot of money. “How about a car?” he says to Tallow. “Can’t git none,” says Tallow. “Why can’t we git one? We got to git one.” “Nobody in Wicksville can git one, nor nobody on this branch, seems like. Somethin’s happened somewheres and there hain’t no cars, and if there was we couldn’t have any, because the railroad has let on to the agent here that he dassen’t accept any shipments to the city. He said it was an embargo.” “Embargo,” says Mark, “I wonder what one of them is?”
  • 60. “Why,” says Tallow, “an embargo means when the railroad won’t let you ship to a place or from a place or somethin’ like that.” “How long is it goin’ to l-last?” “Maybe a week, maybe a month, maybe all the year,” says Tallow. “There hain’t enough cars to go around, and the railroad yards in the city is crowded with cars that they can’t git men to unload, and that kind of thing.” “Hum!” says Mark. “Perty kettle of fish. Embargo. How in tunket be we g-g-goin’ to send out stuff, then, I’d like to know?” “We hain’t goin’ to,” says Tallow. “But we g-got to. We jest got to.” “They won’t let us.” “There must be some kind of a way. We got to ship as f-f-fast as we manufacture, and get the money back, or we can’t pay the men and keep goin’. If we was held back from shippin’ for two weeks we would be b-busted.” “And Wiggamore would get the dam and the mill,” says I. “He hain’t got ’em yet,” says Mark, “and he hain’t g-goin’ to get ’em.” “What’ll we do,” says I, “drag our chair stock and bowls and things around in carts? It would take quite a spell to git a car-load to the city, or even to Bostwick, that way.” “I don’t know how we’re goin’ to do it, but we’re goin’ to. You f- fellers git to work and I’ll go and f-figger on this. We got to hit on some scheme, and we got to hit on it right off. These here goods has got to be shipped immediate, because we got to have the money.” So he went and sat down in the office, and I could see him pinching his cheek and pulling his ear like he always does when he is puzzling out something. He kept at it more than an hour, and then I saw him come out and get a piece of wood and take out his jack- knife to whittle. At that I got scairt, for he never whittles till he’s in
  • 61. the last ditch. When everything else fails he takes to his jack-knife, and when he does that it’s time to get worried. He whittled and whittled and whittled, and nothing come of it. You see, he hadn’t ever had any experience with railroads, and he didn’t know what kind of a scheme would work with them. He didn’t go home to dinner, but just called to me to stop at his house and fetch him a snack. I knew what a snack meant for him, so I fetched back three ham sandwiches and three jelly sandwiches and two apples and a banana and a piece of apple pie and a piece of cherry pie and a hunk of cake and about a quart of milk. He went at them sort of deliberate and gradual, but the way they disappeared was enough to make you think he was some kind of a magician. Before you knew it the whole lot was gone and he was looking down into the basket kind of sorrowful. “What’s the m-matter, Plunk?” says he. “Was they short of grub at home? Seems like the edge hain’t hardly gone off’n my appetite.” “You’ve et enough to keep me for a week,” says I. “Huh!” says he. “Well, a f-feller kin think better when he’s hungry, they say.” Hungry! I swan to Betsy if he hadn’t et a square meal for three grown men. He went to whittling again. About three o’clock he come out and says, “Plunk, we got to go to the city.” “What for?” says I. “To git f-freight-cars,” says he. “And fetch ’em home in our pockets, I s’pose,” says I. “Maybe,” says he. “Git enough clothes to stay all night. We’ll catch the five-o’clock t-train.” “But what you goin’ to do?” “I hain’t sure. But there’s somebody up to those head offices of the r-r-railroad company that’s got a right to give us cars. I’m goin’ to f-f-find out who it is, even if it’s the President of the United States, and I’m goin’ to find some way to make him give ’em to us.”
  • 62. “They wouldn’t ever let a couple of kids in to see the head men,” says I. “They will,” says he. “How d’you know?” says I. “Because,” says he, “I’ll make ’em.” “Don’t bite off more ’n you kin chaw,” says I. “Look here,” says he, “are you g-g-goin’ to lay down on this job? Because if you be I kin take Tallow or Binney. They won’t git cold f-f- feet.” “I’ll stick,” says I, “but we hain’t got a chance.” “Anybody’s always got a chance,” says he. “Folks can make chances. Anything that’s p-p-possible kin be done if you stick to it and use your head. This here is p-possible and it’s necessary. I’m goin’ to git them freight-cars.” That was just like him. You couldn’t scare him and you couldn’t discourage him. He would stick to anything till you sawed him loose. I guess maybe there was some bulldog in him, or something. Maybe he had had a meal of glue some day and that made him stick to things. I don’t think I’ve ever seen him when he showed that he was discouraged, and I really don’t believe he ever was discouraged. No, sir; he got so interested in trying to do whatever it was that he wanted to do that he forgot all about how hard it was. And I guess that’s a good idea.
  • 63. CHAPTER XIII I like to ride on the cars pretty well, and so does Mark. There are always such a heap of things to see out of the window, and such a lot of different kinds of people right on the cars. It was about four hours’ ride to the city, but it didn’t seem half that long, and I was sorry when we got there. It was pretty dark when we walked out of the depot into the street. “Now what?” says I. “B-bed,” says he. “Where?” says I. “Hotel,” says he. “There’s one,” I says, pointing right across the street, so we took our satchels and went over. There was a fellow behind a counter, and when we came up he sort of grinned and says good evening. “How much does it cost to sleep here?” says Mark. “Two dollars and a half is our cheapest room.” “For both of us?” “I guess I can make it three and a half for two.” “I g-guess you can’t,” says Mark. “The way I look at it, no two boys can do three d-d-dollars and a half worth of sleepin’ in one night. Hain’t there no cheaper places?” “Lots of ’em, young man. There’s a tramps’ lodging-house down the street where you can stay for ten cents.” “Um!... Well, I calc’late what we want is somethin’ betwixt and between. Somethin’ where we kin stay for about a dollar apiece.” That seemed like an awful lot to spend just for sleeping. Why, in the morning our two dollars would be gone and we wouldn’t have anything to show for it. It seems like when you spend money you
  • 64. ought to git something. I nudged Mark and says to him that it was cheaper to stay awake, and we could use our dollars to-morrow to buy something we could touch. But he says we got to sleep to be fresh for business. “I’ll tell you,” says the man behind the counter. “I’ve got a little room without a bath, and if you can sleep two in a bed, you can have it for two-fifty.” “All r-right,” says Mark. “Kin we have breakfast here?” “If you’ve got the money to pay for it.” “Um!... But there’s places where we can git g-g-good grub cheaper ’n you sell it, hain’t there?” “Why, yes! There’s a good serve-self lunch up the street where you can get a lot to eat for fifty cents. Say, what are you kids up to? Running away from home?” “Not that you can n-notice,” says Mark. “We’re here on b- business. We come to see the p-president of that railroad across the street.” “Oh,” says the man, and he laughed right out. “You come to see him, did you? Was he expecting you?” “No.” “Um!... Well, from all accounts, he’s a nice man to see—I guess not. They say he eats a couple of men for breakfast every morning. He keeps a baseball-bat on his desk, and hits everybody that comes to see him a lick over the head. I see him every little while, and, believe me, I’m glad I don’t have to mix in with him any. I expect he’s the grouchiest man in town.” “Sorry to hear it,” says Mark, “but I guess we kin m-make out to git along with him s-somehow.” “Want to go to your room?” “Yes.” Well, a boy with a uniform picked up our satchels and showed us into the elevator and then went into our room first and lighted the lights. Then he sort of stood around and eyed us like there was
  • 65. something he wanted to say, but he didn’t say a word. We looked at him right back, because we weren’t going to let on that we cared a rap what any kid with a uniform on did or said. Pretty soon Mark says: “Well, was there anythin’ you was n-needin’?” “Huh!” says the kid. “What you hangin’ around for, anyhow?” “I guess you hain’t traveled much,” says the boy. “It hain’t p-p-part of your job to tell us, is it?” “Did you ever hear of a tip?” says he. “Tip?” says Mark. “Most generally gentlemen gives us bell-boys a tip when we carry their bags to their room,” says he. “Tip of what?” says I. “I hain’t got no tip unless it’s the tip of my nose.” “A tip is money,” says the boy. “We hired this here room for two dollars and a half, didn’t we?” “Yes,” says he. “We didn’t make no b-bargain with you about carryin’ satchels, nor with the man at the counter, did we?” “No,” says he. “Nobody does. But everybody gives tips. You got to give tips.” “Hain’t you p-paid wages for doin’ what you do?” “Yes, but they hain’t enough.” “Then,” says Mark, “you ought to make the hotel raise your pay and not go t-t-tryin’ to gouge it out of folks that stays here.” “Everybody does it,” says the boy. “You can’t never git nothin’ done in a hotel if you don’t tip.” “Do you git a tip every time you carry a satchel?” “Yes.”
  • 66. “Now you look here. I got an idee you’re tryin’ to git somethin’ out of us ’cause we’re kids and come from Wicksville. I’m g-g-goin’ to f-find out. If it’s the custom, why, I’ll give you a tip ’cause I want to do what’s right. But if you’re t-tryin’ to do us out of money, why, you won’t git it. I’m goin’ to ask the man behind the counter.” And that’s what he done. He went right down and asked, and the man laughed like all-git-out and told Mark all about tips, and Mark told him what he thought about them, and then he give the boy a dime and we went to bed. We went to sleep in a minute and it seemed like it wasn’t more than a minute before we was awake again. Mark woke up first and gouged me in the ribs till I woke up. Then we dressed. “It’s f-five o’clock,” says Mark. “We want to git our breakfast and hustle. You kin bet a man with a big job on a r-r-railroad is down to work early. He’d have to be. Maybe we kin s-see the man we want about six o’clock and git an early train home.” So we went to a serve-self place where you didn’t eat off of a table, but off of the arm of your chair, and we et quite a good deal and it was good. Then we came back to the railroad station and it was just six o’clock. There wasn’t many folks around, but we found a man in a uniform and Mark asked him who was boss of all the freight-cars. The man told him he guessed the general freight agent was, and Mark says, “Where’s his office?” The man told him and Mark went there with me. It was shut up tight. We waited and kept on waiting, and in about an hour a man came along with overalls and a cap that said something on the front of it. “Hey, mister!” says Mark. “We’re waitin’ to see the general freight agent. What’s the m-m-matter with him? Is he sick or somethin’?” “Him!” says the man. “No, he hain’t sick. What makes you think he is?” “’Cause he hain’t down to work.” “Did you expect to see him at seven o’clock in the mornin’?”
  • 67. “To be sure.” “Well, you come back again about nine and maybe he’ll be here by that time. He usually gits around about nine.” “Nine,” says Mark. “Why, that’s ’most n-noon.” The man let out a laugh. “How long does he work in the afternoon?” says Mark. “Oh, he goes to lunch about one o’clock, and gets back around half past two, and then he sticks to the job maybe till four.” “Honest?” says Mark. “Honest,” says the man. “Well, I’ll be dinged!” says Mark. “And they pay him a r-r-reg’lar day’s wages for that? Him workin’ maybe five hours a day?” “If you got his salary, kid, you could buy a railroad for yourself.” The man went along, and we kept on waiting, but Mark couldn’t get it out of his head how a man with an important job could hang onto it and do such a little mite of work. He said he guessed maybe he’d get him a job like that some day where he just had to work five hours. He said he’d do all that work in a stretch and then go out for dinner, and in the afternoon he would have him another job just like it, and work ten hours a day and make twice as much. I thought that was a pretty good idea myself. It was all of nine o’clock when that man came, though there was folks working under him that came a little earlier. We kept asking if he was there until a man told us we was a doggone nuisance and that the boss wouldn’t see us, anyhow. And that’s just what happened. When he got there we asked if we could see him, and the man that was near the gate in the office asked what our business was, and we told him, and he said we couldn’t bother the boss with it. Mark said he guessed maybe the boss better be told we was there, anyhow, and after quite a lot of fuss the man went and told him, and then came back to say the boss was busy and couldn’t see us. He told us there wasn’t any use hanging around, because we wouldn’t ever get to see him.
  • 68. That looked pretty bad, and Mark was as mad as could be. He said we had a right to see that man, and that it wasn’t decent or good business for him to refuse to see us. But that didn’t mend matters. We could git as mad as we wanted to, but that wouldn’t get us a minute’s talk with the freight agent. “I’ll b-bet there’s somebody kin m-make him see us,” says Mark. “The p-p-president of this railroad’s a bigger man than the freight agent, and we’ll git him to fix it for us.” I says to myself that if we couldn’t get to see one it was mighty funny if we could get to see the biggest man of all; but Mark was bound to try, so we found out where the president’s office was and went up there. It was half past nine and he wasn’t to work yet. “When’ll he be here?” says Mark. “Maybe ten o’clock,” says a man that was working outside the president’s door. “Ten,” says Mark, “um!... And how long does he stay?” “Oh, he’ll be around maybe till one, and then he gets lunch and you can’t tell how long he’ll be out. Then he goes home mostly about three or half past.” “Goodness!” says Mark to me. “I hain’t goin’ to be any f-f-freight man. I’m goin’ to be a p-p-president. Looks like he only works three hours, and maybe he gets p-paid three or four thousand dollars for it. Why, any feller could have three jobs like that, workin’ one right on the end of the other, and doin’ nine hours’ work a day! I could git rich doin’ that.” So we waited some more, and after a while in come a slender man with white hair and a cane, all dressed up like he was going to a party instead of coming to work. Everybody acted like they was afraid of him when he came in, and pertended to be mighty busy. He didn’t speak to anybody, but just marched through into his own room and scowled like anything. He looked like he was a regular man-eater. “Was that him?” says Mark.
  • 69. “Yes.” “Well, will you tell him that I want to t-t-talk to him?” “Who are you and what do you want?” Mark told him. “I dassen’t bother him with that,” says the man. “He looks savage to-day. He might discharge me right off.” “But I’ve got to see him. It’s important. It’s awful important.” “I’ll try it,” says the man, “but there isn’t a chance.” So he went to the door and rapped and put in his head. We heard a man roar. “Get out of here!” he bellowed. “Shut that door! Get out! I won’t see anybody this morning! Understand? Get out and stay out!” The man came back and says, “There, you see.” We did see, all right, and I was discouraged. Maybe Mark was, too, but he didn’t show it. He just looked madder than ever. “I’m goin’ to s-s-see that man,” says he, and we went out of that room into the long corridor. There we stopped and stood looking out of the window. In about two minutes Mark says, “Dast you t-t-try it, Plunk?” “Yes,” says I. “What?” “Look at that fire-escape. See how it goes along right past that room we were in. The p-president’s office is next and it goes p-p-past his window. We kin git in that way.” “He’d throw you off into the street,” says I. “He couldn’t l-lift me,” says he, and grinned. “Well,” says I, “I’m willin’ to go second if you’ll go first.” “Come on,” says he. In two jerks of a lamb’s tail we pushed up the window and got onto the fire-escape. Then we skittered along it, ducking past windows as quick as we could, until we were in front of a window that we judged was in the president’s room. We looked in. Sure enough, there he was leaning back in his chair and scowling and
  • 70. smoking like a chimney. His window was up a little from the bottom, but not enough for us to get in. We stood and watched him a minute. Then Mark says, “Here goes.” He rapped loud on the window and then pushed it up. “Good m-m-mornin’!” says he. “Kin we come in?” The president looked at us like he was seeing spooks or something, and rubbed his eyes and jumped up, and Mark says: “Don’t be scairt. We hain’t f-f-figgerin’ on hurtin’ you.” With that both of us got into the room and walked over toward him. He didn’t say a word, but just stared and scowled. “We come to see you on b-b-business,” says Mark, “but they wouldn’t let us in. We had to see you, so here we are.” “I see you’re here,” says he, sharp and savage. “Now let me see you get out again. Quick!” I was ready to turn tail and skedaddle, but not Mark. He walked right over to that president just like he was anybody common and says: “I’m s-sorry, sir, if we b-bother you. But I’ve got to t-talk to you a minute. We can’t get to see anybody, and if we can’t get f-fixed up we are goin’ to bust.” The man scowled worse than ever and took a step toward Mark, but Mark never give back an inch. “I’ll have you thrown out,” says the man. “If you say you won’t t-t-talk to us,” says Mark, “and if you can feel down in your heart that you’re doin’ right, why, we’ll go without b-bein’ thrown. But we was sure that a man couldn’t get to be p-president of a whole railroad unless he was fair and square. That’s why we come right to you. We sort of had confidence, sir, that you was goin’ to see that what was right was done.... But if you don’t feel that way about it, why, we’ll be g-g-goin’ along.” He turned then and went over toward the door. The man didn’t say a word till we were almost there, then he says, “Hold on there!” We stopped.
  • 71. “What do you know about what is fair and what isn’t, or what is good business and what isn’t?” “I may not know much about b-b-business,” says Mark, “but anybody knows what’s f-fair. Here I am—a customer of your railroad just like a man that buys a steak from a b-butcher is a customer of the butcher. If folks wouldn’t use your railroad to send stuff on you would have to go out of b-business. It looks to me like I was doing something you ought to appreciate when I ship a car of freight, and that when I come to see you about railroad b-business, that is goin’ to put m-money into your p-pocket, the least you could do and be fair would be to l-listen. I’m always mighty anxious to keep my customers feelin’ f-f-friendly toward me.” “H’m!” says the president. Mark went on along toward the door and never looked back. “Just a minute,” says the president. “What’s your hurry?” “We thought you wanted us to g-go.” “Come back here,” says he. “Come back here. What do you mean, anyhow, coming into my office and talking to me like this? How dare you talk to me like this?” I tell you I was pretty scared, but I looked at Mark and his eyes were twinkling. “I know I was right about you, sir,” says he. “Right? What do you mean?” “That you was f-fair and square, sir.” “H’m!” says the president. “Sit down and be quick. I haven’t any time to waste. Tell me what you want and tell it briefly. No beating around the bush.” Anybody would have thought he was going to bite our heads off. So Mark told him the whole thing from beginning to end, and he told it quick. I hadn’t any idea so much could be told to anybody in such a short time; but then I might have known Mark could do it if he wanted to. When he got right down to business he could be mighty brief, I’ll tell you.
  • 72. “And that’s what you’ve dared to break into my office to bother me with, is it? For a cent I’d have you thrown out. I don’t know but I ought to do worse.” Mark he never said a word, but just looked at the president respectful and confident. The president turned around to his desk and wrote, and then he fairly threw a paper at Mark. “There,” says he. “Now git out.” Mark looked at the paper and I looked over his shoulder. It said: To all officials and employees of the P. G. R. R.: See to it that the bearer, Mark Tidd, is provided with freight-cars at any point to be transported to any other point in the United States within twelve hours of a request. This order is superior to all other rules or embargoes that may be at this time in force. And his name was signed. “Thank you, sir,” says Mark, “and good-by.” He never looked up, and I thought he wasn’t even going to nod his head when we went out, but he called us back again. “D’you know why I gave you that order?” says he. “I think so, sir,” says Mark. “Well, you don’t,” says the president, “but I’ll tell you. It’s because you’ve got the most tremendous crust in the world. It’s because you weren’t afraid, and it was because you had the backbone to force your way in here and compel me to talk to you. That’s why. Now git.” We got.
  • 73. CHAPTER XIV “Now,” says Mark Tidd when we were on the train again, “I guess we kin go to work l-l-lookin’ for George Piggins.” “Somethin’ else is apt to happen,” says I. “You can’t never tell.” “I guess ’most everything has h-happened,” says he. “There hain’t much more left.” Then all of a sudden he give me a poke in the ribs and says, “Tod Nodder.” “Eh?” says I. “Tod Nodder,” says he. “What about him? Tod Nodder hain’t no reason for pokin’ me black and blue.” “Who was he always loafin’ around with?” “Why, George Piggins!” says I. “Never seen one without the other, did you?” “Not that I know of.” “Well?” says he. “Well yourself,” says I, “and see how you like it.” “I mean,” says he, “that if anybody in the world knows where George is, the feller is Tod Nodder.” “Maybe so, but what does that git us?” “If he knows where George is,” says Mark, “maybe we kin git s-s-somethin’ out of him some way.” “It’s worth t-tryin’,” says I. “Anythin’s worth t-tryin’,” says he, “and everythin’s worth tryin’ when you’re in the fix we’re in. For a spell we’ll leave Silas Doolittle Bugg to run the mill. I guess he kin l-look after the manufacturin’ end with what help we kin give, and put all our time on f-findin’ George. We know Wiggamore’s l-lookin’ for him, and Wiggamore’s
  • 74. got money to look with. He kin hire men to do his lookin’. All we got is us and what b-brains we got.” “Admittin’ we got any,” says I. It was evening when we got home, but we got hold of Binney and Tallow and told them what had happened and how we was going to get all the freight-cars we needed; and we planned how we would meet next morning early, and two of us would keep watch on Miss Piggins’s house and the other two would lay for Tod Nodder. Mark and I were going after Nodder. That left it so that if anything happened one of each couple could stay to watch while the other went for help or to do any following that was necessary. Mark said it would be a pretty good idea to keep an eye on Wiggamore or any men that he had hanging around town. That’s the way it turned out. Binney stayed to watch Miss Piggins. Tallow went mogging after a strange man with fancy clothes that let on he was a detective and was working for Wiggamore, and Mark and I went to hunt up Tod Nodder. You could ’most always tell where to find Tod. It was the place where nobody would be like to come along and offer him a job. Tod was the kind that always complained about not having work, and then took mighty good care to hide somewheres where work couldn’t find him. Lazy! Whoo! Why, he was so lazy when he fished he did it with a night line, and then he hated to pull it in to take off the fish! We stopped at the mill a minute, and Silas Doolittle come up to us, all excited. “Say,” says he, “somebody was monkeyin’ around this mill last night. I was passin’ about nine o’clock and I seen a light. I come rushin’ right down. It looked like the light was ’way up toward the roof. Well, I busted right in and went rampagin’ up-stairs, and before I knowed I rammed right into a feller on the stairs. He was comin’ down as fast as I was goin’ up, and the way we come together would ’a’ made a railroad accident jealous. He got the best of it, though, for he was a-comin’ down-stairs. Yes, sir. He lammed right
  • 75. into me and clean upset me so’s I rolled all the way down, and doggone it if I didn’t leave about a peck of skin on them steps. Then he trompled right over the top of me and skedaddled. I couldn’t ketch him and I couldn’t find no harm he’d done. But after this I calc’late I’ll sleep right here into this mill. That’s what I’ll do, and if anybody comes fussin’ around I guess they’ll find out they got Silas Doolittle Bugg to reckon with.” “Mighty good idee,” says Mark. “Say, we got two freight-cars comin’ in this m-mornin’. Git ’em loaded so’s they’ll ketch the noon freight.” “Have to have help,” says Silas. “Hire some of them grocery-store loafers to help,” says Mark. “Us f-fellers has got somethin’ mighty important to look after.” Well, Mark and I started out then to get our eyes on Tod Nodder and to keep them on him. He wasn’t so easy to find as we thought he would be. Maybe that was because there was a man in town trying to hire folks to do some work on the railroad. Tod would hide away from such a man harder than he would hide from a tribe of scalping Indians. He wasn’t at any of the usual loafing-places, and at the livery-stable where he ’most generally slept they said they hadn’t seen him since daylight. They said he started off somewheres about four o’clock in the morning. Now when a man like Tod Nodder goes somewheres at four o’clock in the morning there are lots of things he might go to do, but there hain’t but one thing he’s very likely to go for, and that’s fish. After we had rummaged all around and couldn’t come across him Mark says, “Well, the s-s-skeezicks must’a’ gone f-f-fishin’.” “Where?” says I. “Tod’s one of these p-pickerel fishermen,” says Mark. “Seems like pickerel and him is mighty fond of each other. So,” says he, “I calc’late we better make for the bayou.” The bayou was a kind of elbow of the Looking-glass River that flows into the main river just below town. 
When the railroad came along they built right across that elbow, shutting it off into a kind of
  • 76. a lake shaped like a letter U, and the banks was mostly swampy and all overgrown with underbrush. Seems like the pickerel was fond of hanging around in there, and folks who knew how to fish was always hauling regular whoppers out of there. There was places where the banks were high and where you could take a long pole and fish right from the shore. We sort of figured Tod would pick out one of those places if he was there, on account of its being less work than to row out a boat. Mark was always thinking ahead a little, so what does he do but go past his house and stop for a lunch. He wasn’t going to be caught out in the country somewheres without anything to eat, not if he knew himself. Then we started off for the bayou, which wasn’t far. We started in at the railroad on one end and just skirted the shore, keeping our eyes open every inch of the way, and, sure enough, along about half-way around we saw a bamboo fish-pole sticking out. “Injuns,” says Mark Tidd. “Where?” says I. “Everywhere. All around us. They’re a r-r-raidin’ party gittin’ ready to bust out on the town and scalp everybody and carry off the wimmin and children. We got to creep up on ’em and f-f-find out their plans and warn Wicksville.” “I don’t understand no Injun language,” says I. “I do,” says he. “I learned ’most all the Injun languages when I was a captive among them some time back.” “Um!” says I. “I forgot about that. Come to think of it, I was one of them captives, too. I kin speak Choctaw and Hog Latin and a lot of them languages myself.” “Good!” says he. “Now cautious if you want to keep any hair g-g-growin’ on your head.” We did pretty good. In ten minutes we was lying not a hundred foot from Tod Nodder, and he hadn’t the least idea in the world that anybody was within a mile of him. At that distance we could whisper without any danger, so Mark leans over and says to me: