Briggs Henan University 2010
1
The Standard Regression Model
and its Spatial Alternatives.
Relationships Between Variables
and
Building Predictive Models
Briggs Henan University 2010
2
Spatial Statistics
Descriptive Spatial Statistics: Centrographic Statistics
– single, summary measures of a spatial distribution
– Spatial equivalents of mean, standard deviation, etc.
Inferential Spatial Statistics: Point Pattern Analysis
Analysis of point location only--no quantity or magnitude (no attribute variable)
--Quadrat Analysis
--Nearest Neighbor Analysis, Ripley’s K function
Spatial Autocorrelation
– One attribute variable with different magnitudes at each location
The Weights Matrix
Global Measures of Spatial Autocorrelation
(Moran’s I, Geary’s C, Getis/Ord Global G)
Local Measures of Spatial Autocorrelation (LISA and others)
Prediction with Correlation and Regression
–Two or more attribute variables
Standard statistical models
Spatial statistical models



Bivariate and Multivariate
• All measures so far have focused
on one variable at a time
– univariate
• Often, we are interested in the
association or relationship
between two variables
– bivariate.
• Or more than two variables
– multivariate
Briggs Henan University 2010 3
[Diagrams: bivariate — education → income; multivariate — education and gender* → income. *Gender = male or female]
Correlation and Regression
The most commonly used techniques in science.
Review standard (non-spatial) approaches
Correlation
Regression
Spatial Regression
Why it is necessary.
How to do it.
Briggs Henan University 2010
4
Correlation and Regression
What is the difference?
• Mathematically, they are identical.
• Conceptually, very different.
Correlation
• Co-variation
• Relationship or association
• No direction or causation is implied
• Y ↔ X, X1 ↔ X2 (association only)
Regression
– Prediction of Y from X
– Implies, but does not prove, causation
– X (independent variable) → Y (dependent variable)
Briggs Henan University 2010 5
• The most common statistic in all of science
• measures the strength of the relationship (or “association”) between two
variables e.g. income and education
• Varies on a scale from –1 through 0 to +1
+1 implies a perfect positive association
• As values go up (↑) on one, they also go up (↑) on the other
• income and education
0 implies no association
–1 implies perfect negative association
• As values go up (↑) on one, they go down (↓) on the other
• price and quantity purchased
• Full name is the Pearson Product Moment correlation coefficient.
6
Correlation Coefficient (r)
Briggs Henan University 2010
[Scale: –1 … 0 … +1]
Examples of Scatter Diagrams and the Correlation Coefficient
[Scatter diagrams:
r = 1 — perfect positive
r = 0.72 — strong positive (Education vs Income)
r = 0.26 — weak positive
r = −0.71 — strong negative (Price vs Quantity)
r = −1 — perfect negative]
7
Briggs Henan University 2010
Correlation Coefficient: example
Briggs Henan University 2010 8
China: 29 provinces (excludes Xizang/Tibet, Macao, Hong Kong, Hainan, Taiwan, P'eng-hu)
Correlation coefficient
= 0.9458
(see later for calculation)
Briggs Henan University 2010 9
Pearson Product Moment Correlation Coefficient (r)
Where Sx and Sy are the standard deviations of X and Y, and X̄ and Ȳ are the means.
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n\, S_x S_y}$$

where the summed terms are the moments about the mean, and

$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}}$$
“product” is the result of
a multiplication
X * Y = P
Briggs UT-Dallas GISC 6382 Spring 2007 10
Calculation Formulae for
Correlation Coefficient (r)
Before the days of computers, these formulae were easier to do "by hand."
See next slide for an example.
Calculating r for urban v. rural income
11
Province UrbanIncome (x−x̄) (x−x̄)² RuralIncome (y−ȳ) (y−ȳ)² (x−x̄)(y−ȳ)
Anhui 14086 -2376 5646195 4504 -1202 1444970 2856323
Zhejiang 24611 8149 66403391 10008 4302 18506611 35055694
Jiangxi 14022 -2440 5954441 5075 -631 398248 1539917
Jiangsu 20552 4090 16726690 8004 2298 5280487 9398142
Jilin 14006 -2456 6032783 5450 -256 65571 628950
Qinghai 12692 -3770 14214200 3346 -2360 5569926 8897867
Fujian 19557 3095 9577958 6880 1174 1378114 3633114
Heilongjiang 12566 -3896 15180159 5207 -499 249070 1944459
Henan 14372 -2090 4368821 4807 -899 808325 1879209
Hebei 14718 -1744 3042137 5150 -556 309213 969880
Hunan 15084 -1378 1899359 4910 -796 633726 1097120
Hubei 14367 -2095 4389747 5035 -671 450334 1406005
Xinjiang 12258 -4204 17675066 4005 -1701 2893636 7151587
Gansu 11930 -4532 20540587 2980 -2726 7431452 12355015
Guangxi 15451 -1011 1022470 3980 -1726 2979314 1745353
Guizhou 12863 -3599 12954042 3005 -2701 7295774 9721613
Liaoning 15800 -662 438472 6000 294 86395 -194633
Nei Mongol 15849 -613 375980 4938 -768 589930 470959
Ningxia 14025 -2437 5939809 4048 -1658 2749193 4041000
Beijing 26738 10276 105592633 11986 6280 39437534 64531489
Shanghai 28838 12376 153161108 12324 6618 43797011 81902373
Shanxi 13997 -2465 6077075 4244 -1462 2137646 3604252
Shandong 17811 1349 1819336 6119 413 170512 556973
Shaanxi 14129 -2333 5443694 3438 -2268 5144137 5291796
Sichuan 13904 -2558 6544246 4462 -1244 1547708 3182543
Tianjin 21430 4968 24679311 10675 4969 24690276 24684793
Yunnan 14424 -2038 4154147 3369 -2337 5461891 4763349
Guangdong 21574 5112 26130781 6906 1200 1439834 6133841
Chongqing 15749 -713 508615 4621 -1085 1177375 773841
SUM 477403 0.00 546493254 165476 0.00 184124210 300022824
AVERAGE 16462 0.00 18844595 5706 0.00 6349111 10345615
Sx = √18844595 = 4341    Sy = √6349111 = 2520    r = 10345615 / (4341 × 2520) = 0.9458
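The table's arithmetic can be checked directly; a minimal Python sketch (numpy assumed available) using the income values from the table above:

```python
import numpy as np

# Urban and rural income per province, copied from the table above.
urban = np.array([14086, 24611, 14022, 20552, 14006, 12692, 19557, 12566,
                  14372, 14718, 15084, 14367, 12258, 11930, 15451, 12863,
                  15800, 15849, 14025, 26738, 28838, 13997, 17811, 14129,
                  13904, 21430, 14424, 21574, 15749])
rural = np.array([4504, 10008, 5075, 8004, 5450, 3346, 6880, 5207, 4807,
                  5150, 4910, 5035, 4005, 2980, 3980, 3005, 6000, 4938,
                  4048, 11986, 12324, 4244, 6119, 3438, 4462, 10675, 3369,
                  6906, 4621])

# Pearson r as on the slide: mean cross-product of deviations divided by
# the product of the (population) standard deviations.
sx, sy = urban.std(), rural.std()   # np.std divides by n, matching the slide
r = ((urban - urban.mean()) * (rural - rural.mean())).mean() / (sx * sy)
print(round(r, 4))                  # 0.9458
```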
Briggs Henan University 2010 12
Correlation Coefficient
example using
“calculation formulae”
Scatter Diagram
Source: Lee and Wong
Regression
• Simple regression
– Between two variables
• One dependent variable (Y)
• One independent variable (X)
• Multiple Regression
– Between three or more variables
• One dependent variable (Y)
• Two or more independent variables (X1, X2, …)
Briggs Henan University 2010 13
[Diagrams: simple — X → Y; multiple — X1 (education) and X2 (gender*) → Y (income)]
Briggs Henan University 2010 14
Simple Linear Regression
• Concerned with “predicting” one variable (Y - the dependent
variable) from another variable (X - the independent variable)
Y = a + bX + ε
ε = residual = error = Yi − Ŷi = Actual (Yi) − Predicted (Ŷi)
[Figure: scatter of points with fitted regression line Ŷ = a + bX]
a is the intercept —the value of Y when X = 0
b is the regression coefficient or slope of the line —the change in Y for a one unit change in X
Ordinary Least Squares (OLS)
--the standard criteria for obtaining the
regression line
Briggs Henan University 2010 15
The regression line minimizes the sum of the squared deviations between actual Yi and predicted Ŷi:

$$\text{Min} \sum_i (Y_i - \hat{Y}_i)^2$$
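In simple regression this minimization has a closed-form solution; a minimal numpy sketch (the data values are hypothetical):

```python
import numpy as np

# Hypothetical data: X = independent variable, Y = dependent variable.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# OLS closed form: b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2),
# a = Ybar - b * Xbar.  These values minimize sum((Yi - Yhat_i)^2).
b = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
a = Y.mean() - b * X.mean()

Y_hat = a + b * X          # predicted values on the regression line
residuals = Y - Y_hat      # actual minus predicted
print(a, b, (residuals ** 2).sum())   # intercept, slope, minimized SS
```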
Coefficient of Determination (r²)
• The coefficient of determination (r²) measures the proportion of the variance in Y (the dependent variable) which can be predicted or "explained by" X (the independent variable). Varies from 0 to 1.
• It equals the correlation coefficient (r) squared.
16
$$r^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} = \frac{\text{SS Regression (Explained Sum of Squares)}}{\text{SS Total (Total Sum of Squares)}}$$

Note:

$$\underbrace{\sum_i (Y_i - \bar{Y})^2}_{\text{SS Total}} = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{\text{SS Regression (Explained)}} + \underbrace{\sum_i (Y_i - \hat{Y}_i)^2}_{\text{SS Residual (Error)}}$$
Partitioning the Variance on Y
Briggs Henan University 2010 17
$$\underbrace{\sum_i (Y_i - \bar{Y})^2}_{\text{SS Total}} = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{\text{SS Regression (Explained)}} + \underbrace{\sum_i (Y_i - \hat{Y}_i)^2}_{\text{SS Residual (Error)}}$$

$$r^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}$$

[Figure: each observation's total deviation from Ȳ partitioned into an explained part (Ŷi − Ȳ) and a residual part (Yi − Ŷi)]
Standard Error of the Estimate (se)
Briggs Henan University 2010 18
Measures predictive accuracy: the bigger the standard error, the
greater the spread of the observations about the regression line,
thus the predictions are less accurate
se² = error mean square, or average squared residual
= variance of the estimate, variance about regression (called sigma-square in GeoDA)

$$s_e = \sqrt{\frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - k}}$$

The numerator is the sum of squared residuals; the denominator is the number of observations minus the degrees of freedom (for simple regression, degrees of freedom k = 2).
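As a quick sketch, the formula translates directly into code (y and y_hat are assumed to come from a fitted regression such as the one sketched above):

```python
import numpy as np

def standard_error_of_estimate(y, y_hat, k=2):
    """s_e = sqrt( sum((Yi - Yhat_i)^2) / (n - k) ), where k is the number
    of regression coefficients (k = 2 for simple regression: a and b)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    return np.sqrt(((y - y_hat) ** 2).sum() / (n - k))
```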
Coefficient of determination (r²), correlation coefficient (r), regression coefficient (b), and standard error (Se)
(Values are hypothetical and for illustration of relative change only; in the original figure the regression line is shown in blue.)

perfect positive: r² = r = 1, Se = 0.0 (Sy = 2), b = 2
very strong: r² = 0.94, r = 0.97, Se = 0.3
strong: r² = 0.51, r = 0.71, Se = 1.1, b = 1.1
moderate: r² = 0.26, r = 0.51, Se = 1.3, b = 0.8
weak: r² = 0.07, Se = 1.8, b = 0.1
none: r² = r = 0.00, Se = Sy = 2, b = 0

As the coefficient of determination gets smaller, the slope of the regression line (b) gets closer to zero.
As the coefficient of determination gets smaller, the standard error gets larger, approaching the standard deviation of the dependent variable Y (Sy = 2).
Sample Statistics, Population Parameters and
Statistical Significance tests
Yi = a + bXi + εi    a and b are sample statistics
which are estimates of
population parameters α and β
β (and b) measure the change in Y for a one unit change in X. If β = 0 then X has no effect on Y, therefore
Null Hypothesis (H0): in the population β = 0
Alternative Hypothesis (H1): in the population β ≠ 0
Thus, we test if our sample regression coefficient, b, is sufficiently different from zero to reject the Null Hypothesis and conclude that X has a statistically significant effect on Y.
Briggs Henan University 2010 20
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
Test Statistics in Simple Regression
Test statistic for b is distributed according to the Student’s t Distribution
(similar to normal):
where se² is the variance of the estimate,
with degrees of freedom = n – 2
A test can also be conducted on the coefficient of determination (r²) to test if it is
significantly greater than zero, using the F frequency distribution.
It is mathematically identical to the t test.
Briggs Henan University 2010 21
$$t = \frac{b}{SE(b)} = \frac{b}{\sqrt{s_e^2 \,/\, \sum_i (X_i - \bar{X})^2}}$$

$$F = \frac{\text{Regression S.S./d.f.}}{\text{Residual S.S./d.f.}} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2 \,/\, 1}{\sum_i (Y_i - \hat{Y}_i)^2 \,/\, (n - 2)}$$
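A sketch combining both tests for simple regression, following the formulas above (scipy is assumed to be available for the p-values):

```python
import numpy as np
from scipy import stats

def simple_regression_tests(x, y):
    """t test on b, and the mathematically identical F test on r^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = ((x - x.mean()) ** 2).sum()
    b = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    s2_e = ((y - y_hat) ** 2).sum() / (n - 2)   # variance of the estimate
    t = b / np.sqrt(s2_e / sxx)                 # t = b / SE(b)
    F = ((y_hat - y.mean()) ** 2).sum() / s2_e  # regression SS/1 over residual SS/(n-2)
    p_t = 2 * stats.t.sf(abs(t), n - 2)         # two-sided p-value for t
    p_F = stats.f.sf(F, 1, n - 2)               # equals p_t, since F = t^2
    return t, p_t, F, p_F
```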
Multiple regression
Briggs Henan University 2010 22
We can rewrite simple regression Y = α + βX + ε as:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \varepsilon$$
Multiple regression: Y is predicted from 2 or more independent variables
β0 is the intercept —the value of Y when values of all Xj = 0
β1… β m are partial regression coefficients which give the
change in Y for a one unit change in Xj, all other X variables held
constant
m is the number of independent variables
[Diagram: Y (income) predicted from X1 (education) and X2 (gender*)]
Multiple regression: least squares criteria
23
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \varepsilon
\qquad \text{or} \qquad
Y_i = \sum_{j=0}^{m} X_{ij}\,\beta_j + \varepsilon_i \quad (X_{i0} = 1)$$

Predicted values for Y (the regression hyperplane):

$$\hat{Y}_i = \sum_{j=0}^{m} X_{ij}\,b_j$$

Residuals (Actual − Predicted):

$$e_i = Y_i - \hat{Y}_i = Y_i - \sum_{j=0}^{m} X_{ij}\,b_j$$

Least squares criterion:

$$\text{Min} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
As in simple regression, the “least squares”
criteria is used. Regression coefficients bj are
chosen to minimize the sum of the squared
residuals (the deviations between actual Yi and
predicted Ŷi)
The difference is that Ŷi is predicted from 2 or
more independent variables, not one.
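A minimal numpy sketch of this criterion with two independent variables (the data are hypothetical); np.linalg.lstsq chooses the b_j that minimize the sum of squared residuals:

```python
import numpy as np

# Hypothetical data: predict income from education and a gender dummy.
X1 = np.array([8.0, 12.0, 16.0, 10.0, 14.0, 18.0])   # education (years)
X2 = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])        # gender (0/1)
Y = np.array([20.0, 35.0, 50.0, 28.0, 44.0, 55.0])   # income

# Design matrix with a leading column of 1s so b[0] is the intercept b0.
X = np.column_stack([np.ones_like(X1), X1, X2])

# lstsq minimizes sum((Y - X b)^2), the least squares criterion.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ b                      # predictions on the regression hyperplane
print(b, ((Y - Y_hat) ** 2).sum())
```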
Coefficient of Multiple Determination (R²)
• Similar to simple regression, the coefficient of multiple determination (R²) measures the proportion of the variance in Y (the dependent variable) which can be predicted or "explained by" all of the X variables in combination. Varies from 0 to 1.
24
$$R^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} = \frac{\text{SS Regression (Explained Sum of Squares)}}{\text{SS Total (Total Sum of Squares)}}$$

As with simple regression:

$$\underbrace{\sum_i (Y_i - \bar{Y})^2}_{\text{SS Total}} = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{\text{SS Regression (Explained)}} + \underbrace{\sum_i (Y_i - \hat{Y}_i)^2}_{\text{SS Residual (Error)}}$$
Formulae identical to simple regression
Reduced or Adjusted R²
• R² will always increase each time another independent variable is included
– an additional dimension is available for fitting the regression hyperplane (the multiple regression equivalent of the regression line)
• Adjusted R² is normally used instead of R² in multiple regression
Briggs Henan University 2010 25
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k}$$
k is the number of coefficients
in the regression equation,
normally equal to the number of
independent variables plus 1
for the intercept.
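The adjustment is a one-line formula; a small sketch (the example numbers are hypothetical):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k), where k is the number
    of coefficients (independent variables plus 1 for the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Example: R^2 = 0.60, n = 30 observations, 3 independent variables (k = 4)
print(adjusted_r2(0.60, 30, 4))  # about 0.554
```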
Interpreting partial regression coefficients
• The regression coefficients (bj) tell us the change in Y for a 1 unit
change in Xj, all other X variables “held constant”
• Can we compare these bj values to tell us the relative importance of
the independent variables in affecting the dependent variable?
– If b1 = 2 and b2 = 4, is the effect of X2 twice as big as the effect of X1?
• No, no, no in general!!!!
• The size of bj depends on the measurement scale used for each
independent variable
– if X1 is income, then a 1 unit change is $1
– but if X2 is in RMB or Euro (€) or even cents (₵), 1 unit is not the same!
– And if X2 is % population urban, 1 unit is very different
• Regression coefficients are only directly comparable if the units are all
the same: all $ for example
26
Standardized partial regression coefficients
Comparing the Importance of Independent Variables
• How do we compare the relative importance of independent variables?
• We know we cannot use partial regression coefficients to directly compare
independent variables unless they are all measured on the same scale
• However, we can use standardized partial regression coefficients (also
called beta weights, beta coefficients, or path coefficients).
• They tell us the number of standard deviation (SD) unit changes in Y for a one SD change in Xj
• They are the partial regression coefficients if we had measured every
variable in standardized form
27
$$\beta_{X_j Y} = b_j \left(\frac{s_{X_j}}{s_Y}\right)$$
Note the confusing use of β for both standardized partial regression
coefficients and for the population parameter they estimate.
Standardized form:

$$z_i = \frac{x_i - \bar{x}}{s_X}$$
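A sketch of the conversion, assuming b, X, and y come from a multiple regression that has already been fitted (b excludes the intercept):

```python
import numpy as np

def beta_weights(b, X, y):
    """Standardized partial regression coefficients:
    beta_j = b_j * (s_Xj / s_Y).
    b: partial regression coefficients (no intercept);
    X: matrix whose columns are the independent variables; y: dependent."""
    b = np.asarray(b, float)
    s_x = np.asarray(X, float).std(axis=0)   # SD of each independent variable
    s_y = np.asarray(y, float).std()         # SD of the dependent variable
    return b * s_x / s_y
```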
Test Statistics in Multiple Regression:
testing each independent variable
A test can be conducted for each partial regression coefficient bj
to test if the associated independent variable influences the
dependent variable. It is distributed according to the Student’s t
Distribution (similar to the normal frequency distribution):
Null Hypothesis (H0): bj = 0
Briggs Henan University 2010 28
$$t = \frac{b_j}{SE(b_j)}$$
with degrees of freedom = n – k, where k is the
number of coefficients in the regression equation,
normally equal to the number of independent
variables plus 1 for the intercept (m+1).
The formula for calculating the standard error (SE) of bj is more
complex than for simple regression , so it is not shown here.
Test Statistics in Multiple Regression:
testing the overall model
• We test the coefficient of multiple determination (R²) to see if it is significantly greater
than zero, using the F frequency distribution.
• It is an overall test to see if at least one independent variable, or two or more in
combination, affect the dependent variable.
• Does not test if each and every independent variable has an effect
• Similar to the F test in simple regression.
– But unlike simple regression, it is not identical to the t tests.
• It is possible (but unusual) for the F test to be significant but all t tests not significant.
29
$$F = \frac{\text{Regression S.S./d.f.}}{\text{Residual S.S./d.f.}} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2 \,/\, (k - 1)}{\sum_i (Y_i - \hat{Y}_i)^2 \,/\, (n - k)}$$
Again, k is the number of coefficients in the regression equation,
normally equal to the number of variables (m) plus 1.
Briggs Henan University 2010
30
Anscombe, Francis J. (1973). "Graphs in statistical analysis". The American
Statistician 27: 17–21.
Always look at your data
Don’t just rely on the statistics!
Anscombe's quartet
Summary statistics are the same for all four data sets:
mean of Y (7.5),
variance of Y (4.12),
correlation (0.816),
and regression line (y = 3 + 0.5x).
Briggs Henan University 2010
31
Waiting time between eruptions and the duration of the eruption for the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA. This chart suggests there are generally two "types" of eruptions: short-wait-short-duration and long-wait-long-duration.
Source: Wikipedia
Real data is almost
always more complex
than the simple,
straight line
relationship assumed in
regression.
Spurious relationships
Briggs Henan University 2010 32
Eating ice cream
inhibits swimming
ability.
--eat too much, you
cannot swim
Omitted variable
problem
--both are related to a
third variable not
included in the analysis
Summer temperatures:
--more people swim
(and some drown)
--more ice cream is sold
Help!
Regression does not prove direction or cause!
Income and Illiteracy
• Provinces with higher incomes can afford
to spend more on education, so illiteracy is
lower
– Higher Income>>>>Less Illiteracy
• The higher the level of literacy (and thus
the lower the level of illiteracy) the more
high income jobs.
– Less Illiteracy>>>>Higher Income
• Regression will not decide!
Briggs Henan University 2010 33
[Diagrams: Income → Illiteracy and Illiteracy → Income]
Spatial Regression
It doesn’t solve any of the problems just discussed!
You always must examine your data!
Briggs Henan University 2010
34
Spatial Autocorrelation & Correlation
Standard Correlation shows the association or relationship between two different variables.
Spatial Autocorrelation shows the association or relationship between the same variable in "near-by" areas.
[Scatterplots: correlation plots income against education; spatial autocorrelation plots education against education "next door" (in a neighboring or near-by area). Each point is a geographic location.]
35
Briggs Henan University 2010
36
If Spatial Autocorrelation exists:
• correlation coefficients and coefficients of
determination appear bigger than they really are
• biased upward
• You think the relationship is
stronger than it really is
• the variables in nearby areas
affect each other
• Standard errors appear smaller than they really are
• exaggerated precision
• You think your predictions are better than they really are
since standard errors measure predictive accuracy
• More likely to conclude
relationship is statistically significant.
Briggs Henan University 2010
(We discussed this in detail in the lecture on Spatial Autocorrelation concepts.)
$$t = \frac{b}{SE(b)}$$
How do I know if I have a problem?
For correlation, calculate Moran’s I for each variable and test its statistical
significance
– If Moran’s I is significant, you may have a problem!
For regression, calculate the residuals
Yi-Ŷi =Actual (Yi ) – Predicted (Ŷi )
Then:
(1) Map the residuals: do you see any spatial patterns?
--if yes, you may have a problem
(2) Calculate Moran's I for the residuals: is it statistically significant?
--if yes, you have a problem
Briggs Henan University 2010 37
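Step (2) translates directly from Moran's I formula; a minimal sketch, assuming the spatial weights matrix W has already been built (see the earlier lecture on the weights matrix):

```python
import numpy as np

def morans_i(residuals, W):
    """Moran's I = (n / sum of all weights) * (z' W z) / (z' z),
    where z are the residuals expressed as deviations from their mean
    and W is the (already constructed) spatial weights matrix."""
    z = np.asarray(residuals, float)
    z = z - z.mean()
    W = np.asarray(W, float)
    n = len(z)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)
```

Significance is usually judged against the expected value E[I] = −1/(n − 1), typically with a permutation test; packages such as PySAL's esda automate this.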
What do I do if SA exists?
• Acknowledge in your paper that SA exists
and that the calculated correlation
coefficients may be larger than their true
value, and may not be statistically
significant
• Try to fix the problem!
Briggs Henan University 2010 38
How do I fix SA?
Step 1:
Try to identify omitted variables and include them in a
multiple regression.
• Missing (omitted) variables may cause spatial autocorrelation
• Regression assumes all relevant variables influencing the
dependent variable are included
– If relevant variables are missing, model is misspecified
Step 2:
If additional variables cannot be identified, or SA still exists,
use a spatial regression model
Briggs Henan University 2010 39
Spatial Regression: 4 Options
1. Spatial Autoregressive Models
1. Lag model
2. Error model
2. Spatial Filtering
--based on eigenfunctions (Griffith)
3. Spatial Filtering
--based on Ripley's K and Getis-Ord G (Getis)
4. Others
We will consider the first option only.
– simpler and the more commonly used
– Getis and Griffith 2002 compare the first three
40
Getis, A. and Daniel Griffith (2002) Comparative Spatial Filtering in
Regression Analysis Geographical Analysis 34 (2) 130-140
41
Spatial Lag and Spatial Error Models: mathematical comparison
• Spatial lag model: values of the dependent variable in neighboring locations (WY) are included as an extra explanatory variable; these are the "spatial lag" of Y
Y = β0 + λWY + Xβ + ε
• Spatial error model: values of the residuals in neighboring locations (Wε) are included as an extra term in the equation; these are the "spatial error"
Y = β0 + Xβ + ρWε + ξ, where ξ is "white noise"
W is the spatial weights matrix
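To see what the lag model implies, note that solving Y = β0 + λWY + Xβ + ε for Y gives the reduced form Y = (I − λW)⁻¹(β0 + Xβ + ε). A hypothetical numpy sketch simulating data from a lag process (the ring-contiguity W and all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical weights matrix: a ring where each area has two
# equally weighted neighbors (row-standardized).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

beta0, beta, lam = 2.0, 1.5, 0.6        # illustrative parameters
X = rng.normal(size=n)
eps = rng.normal(size=n)

# Reduced form of the lag model: Y = (I - lambda W)^-1 (beta0 + X beta + eps)
Y = np.linalg.solve(np.eye(n) - lam * W, beta0 + beta * X + eps)
```

Fitting such models in practice is done by maximum likelihood in software such as GeoDA or PySAL rather than by OLS.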
Spatial Lag and Spatial Error Models:
conceptual comparison
Briggs Henan University 2010 42
OLS SPATIAL LAG SPATIAL ERROR
Baller, R., L. Anselin, S. Messner, G. Deane and D. Hawkins. 2001. Structural covariates of US County homicide rates: incorporating spatial effects. Criminology, 39, 561-590.
Ordinary Least Squares
No influence from
neighbors
Dependent variable
influenced by
neighbors
Residuals influenced
by neighbors
Lag or Error Model: Which to use?
• Lag model primarily controls spatial
autocorrelation in the dependent variable
• Error model controls spatial autocorrelation
in the residuals, thus it controls
autocorrelation in both the dependent and
the independent variables
• Conclusion: the error model is more robust
and generally the better choice.
• Statistical tests called the LM Robust test
can also be used to select
– Will not discuss these
Briggs Henan University 2010 43
Comparing our models
• Which model best predicts the dependent variable?
• Neither R² nor Adjusted R² can be used to compare different spatial regression models
• Instead, we use Akaike Information Criteria (AIC)
– the smaller the AIC value the better the model
Note: can only be used to compare models with the same dependent variable
Briggs Henan University 2010 44
$$AIC = 2k + n\,\ln(\text{Residual Sum of Squares})$$
k is the number of coefficients in the regression equation, normally equal
to the number of independent variables plus 1 for the intercept term.
Akaike, Hirotuga (1974) A new look at statistical model identification
IEEE Transactions on Automatic Control 19 (6) 716-723
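A small sketch implementing the slide's formula (the residuals are assumed to come from a fitted model, and k counts the coefficients):

```python
import numpy as np

def aic(residuals, k):
    """AIC = 2k + n * ln(residual sum of squares), as given on the slide;
    k = number of coefficients (independent variables plus 1)."""
    residuals = np.asarray(residuals, float)
    n = len(residuals)
    rss = (residuals ** 2).sum()
    return 2 * k + n * np.log(rss)

# Smaller AIC = better model (for the same dependent variable).
```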
Briggs Henan University 2010 45
Geographically Weighted Regression
• The idea of Local Indicators can also be applied to regression
• It's called geographically weighted regression
• It calculates a separate regression
for each polygon and its neighbors,
– then maps the parameters from the model, such as the regression
coefficient (b) and/or its significance value
• Mathematically, this is done by applying the spatial weights
matrix (Wij) to the standard formulae for regression
See Fotheringham, Brunsdon and Charlton Geographically Weighted
Regression Wiley, 2002
Problems with Geographically Weighted Regression
• Each regression is based on few observations
– the estimates of the regression
parameters (b) are unreliable
• Need to use more observations than just those with
shared border, but
– how far out do we go?
– How far out is the “local effect”?
• Need strong theory to explain why the regression
parameters are different at different places
• Serious questions about validity of statistical
inference tests since observations not independent
Briggs Henan University 2010 46
What have we learned today?
• Correlation and regression are very good tools
for science.
• Spatial data can cause problems with standard
correlation and regression
• The problems are caused by spatial
autocorrelation
• We need to use Spatial Regression Models
• Geographers and GIS specialists are experts on
spatial data
– They need to understand these issues!
Briggs Henan University 2010 47