GWR Presentation

Geographically Weighted Regression
CSDE Statistics Workshop
Christopher S. Fowler, PhD
February 1st, 2011

Significant portions of this workshop were culled from presentations prepared by Fotheringham, Charlton and Brunsdon and presented at the 2010 Advanced Workshop on Spatial Analysis at the University of California, Santa Barbara.
University of Washington
Center for Studies in Demography and Ecology
Outline for the Session
The motivation for GWR

Examples from YOUR discipline
Mapping OLS Residuals

A good baseline for why we need GWR
GWR
Definitions, basic concepts
Running GWR
A straightforward implementation in ArcGIS
GWR and some extensions
Basics of OLS
$y = X\beta + \varepsilon$
Assumes a stationary process
Same stimulus provokes the same response anywhere in the study area
Why might relationships vary spatially?

Sampling variation
Relationships intrinsically different across space (attitudes, preferences, contextual effects)
Model misspecification
Applications: Ecology
GWR works on trees
Could have been differentiated sampling pattern creates predictable and changing levels of interaction among observations
Applications: Public Health

Relationships vary systematically
The relationship between mortality and occupational segregation and between mortality and unemployment varies across Tokyo
Applications: Sociology/Public Policy

Missing variables (and they may very well be unknowable)
The link between multifamily housing and residential burglaries varies widely even when controlling for numerous socioeconomic and neighborhood factors
Back up: How do we know if we have nonstationarity in our model?
Map residuals and test them for spatial autocorrelation. If our model errs systematically with a spatial pattern, then we may be on to something.
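The residual check above can be sketched in a few lines of Python. This is a minimal illustration of global Moran's I, not the ArcGIS implementation used in the lab; the six toy residuals and the "adjacent positions are neighbors" weights matrix are assumptions made up for the example.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)              # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))   # weighted cross-products
    ss = sum(d * d for d in dev)                       # sum of squared deviations
    return (n / s0) * (cross / ss)

# Hypothetical residuals at six locations along a line: positive in the
# west, negative in the east, so the model errs with a spatial pattern.
residuals = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
n = len(residuals)
# Assumed neighbor definition: adjacent positions on the line.
w = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
moran = morans_i(residuals, w)
print(round(moran, 3))
```

A value well above zero, as here, says that similar residuals sit next to each other; a value near zero would suggest no spatial pattern in the errors.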
Poverty in the Southern U.S.
Our example Model

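$\text{Poverty} = \beta_0 + \beta_1\,\text{FemaleHeadedHousehold} + \beta_2\,\text{Unemployed} + \beta_3\,\text{Black} + \beta_4\,\text{65andOlder} + \beta_5\,\text{Metro} + \beta_6\,\text{AtLeastHighSchoolEducation} + \varepsilon$
Based on the work of Paul Voss and Katherine Curtis
These are all understood to be good predictors of poverty
What kinds of spatial structures influence this data set?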
Lab Part 1

Run our OLS model in ArcGIS

Examine model output
Map residuals
Calculate Moran's I and Local Moran's I
Our best aspatial model
So what now?
Add more missing variables and try again
Repeat the steps from the lab
Accept that there is something about certain places that makes them different (spatial heterogeneity)
Try GWR
Test variables meant to explore interactions taking place at short distances (spatial dependence)
Try Spatial Regression (Likely a spatial lag model)
Assume that the correlation is a nuisance and control for it in the error term
Try Spatial Regression (Likely a spatial error model)
Outline for Part II

What is GWR?
Weighting in GWR
Geographically Weighted Regression
Local statistical technique to analyze spatial variations in relationships
We are not content with global averages of spatial data (climate, for example)
Why should we be satisfied with global averages in a statistical analysis?
Put another way: Simpson's Paradox
If we think of these points as our data, grouped into colors by region, we can see that the global and local models differ significantly
Source: Rücker and Schumacher, BMC Medical Research Methodology 2008, 8:34, doi:10.1186/1471-2288-8-34
Basic definitions
Spatial nonstationarity exists when the same stimulus provokes a different response in different parts of the study region
Global models are statements about processes that are assumed to be stationary and, as such, are location independent
Local models are spatial disaggregations of global models, the results of which are location specific
Spatial heterogeneity refers to spatial patterns resulting from broad similarities, usually over time
Spatial dependence refers to spatial patterns that result from interactions among observations
Spatial Heterogeneity and Spatial Dependence
GWR and Spatial Processes
GWR is excellent at picking up broad scale regional differences

spatial heterogeneity
Not as effective at dealing with small scale interaction processes

Too much bias in each local model
That doesn't mean it won't try (and give you misleading results)
GWR in a nutshell
Global model
$y = X\beta + \varepsilon$
becomes
$y_i = X_i\beta_i + \varepsilon_i$
Where $i$ indicates that there is a set of coefficients estimated for every observation in our data set
The Key Difference
We estimate a set of regression coefficients for each observation
To do so we weight near observations more heavily than more distant ones. We may also estimate coefficients based on some local subset of observations
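A minimal sketch of this idea, assuming a single predictor, a Gaussian distance-decay weight, and made-up toy data in which the true slope is 1 in the west and 3 in the east. The function names and the bandwidth value are illustrative choices, not part of the workshop material.

```python
import math

def gaussian_weight(d, h):
    """Weight that decays with distance d for bandwidth h."""
    return math.exp(-0.5 * (d / h) ** 2)

def local_fit(i, coords, x, y, h):
    """Weighted least squares of y on x, centered on observation i.

    Returns (intercept, slope) for that one location.
    """
    w = [gaussian_weight(math.dist(coords[i], c), h) for c in coords]
    sw = sum(w)
    xbar = sum(wj * xj for wj, xj in zip(w, x)) / sw
    ybar = sum(wj * yj for wj, yj in zip(w, y)) / sw
    sxy = sum(wj * (xj - xbar) * (yj - ybar) for wj, xj, yj in zip(w, x, y))
    sxx = sum(wj * (xj - xbar) ** 2 for wj, xj in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Toy data (assumed): ten locations on a line; y = 1*x in the west,
# y = 3*x in the east.
coords = [(float(k), 0.0) for k in range(10)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0 * xv for xv in x[:5]] + [3.0 * xv for xv in x[5:]]

# One set of coefficients per observation.
coefs = [local_fit(i, coords, x, y, h=1.5) for i in range(len(x))]
for i, (b0, b1) in enumerate(coefs):
    print(i, round(b0, 2), round(b1, 2))
```

The local slopes near the western end stay close to 1 and those near the eastern end close to 3, whereas a single global regression would report one intermediate slope for the whole study area.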
Some advantages of GWR
Excellent tool for testing model specification

Where does model fit look good, where are you missing something?
Residuals generally lower and not spatially autocorrelated
Real values for $\beta$

.9 .8 .7 .6 .5
.8 .7 .6 .5 .4
.8 .6 .5 .4 .3
.7 .5 .4 .3 .2
.5 .4 .4 .2 .1
Estimated values of $\beta$ in global model
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
Residuals from global model

+ + + + 0
+ + + 0 -
+ + 0 - -
+ 0 - - -
0 - - - -
Reasons to use GWR

Identify model misspecification
Identify nonstationarity in relationships
Improved model fit (R², AIC, etc.)
Reduced spatial autocorrelation
Represent context
Address spatial heterogeneity when precise variables may not exist
You've convinced me. What next?
Run your aspatial model (as we did in 1st lab)

We will want the results and diagnostics to compare with what comes next.
Decide how you are going to weight your nearby locations

Fixed bandwidth
Variable bandwidth
User-defined bandwidth
It all comes down to how you weight the observations
We can use a fixed bandwidth h
$w_{ij} = \exp\left[-\tfrac{1}{2}\left(d_{ij}/h\right)^2\right]$
Number of observations will vary, but area they represent will remain constant
Weighting option 2
Or we can employ an adaptive bandwidth
$w_{ij} = \left[1 - \left(d_{ij}^2/h^2\right)\right]^2$ if $j$ is one of $i$'s $N$ nearest neighbors, and 0 otherwise
Number of observations will remain fixed, but area will not be the same
Kernels and Weights

Bandwidth specifies shape of weights curve
Kernel type tells us whether we will define our bandwidth based on distance (fixed) or number of neighbors (adaptive)
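The two weighting options can be compared side by side with a short sketch. It assumes the Gaussian form for the fixed kernel and the bisquare form for the adaptive kernel, and it assumes that the adaptive bandwidth h is the distance to the N-th nearest neighbor (one common convention, under which that N-th neighbor itself receives zero weight). The distances are invented for illustration.

```python
import math

def gaussian_fixed(dists, h):
    """Fixed kernel: w = exp(-0.5 * (d / h)^2), same bandwidth h everywhere."""
    return [math.exp(-0.5 * (d / h) ** 2) for d in dists]

def bisquare_adaptive(dists, n_neighbors):
    """Adaptive kernel: w = (1 - d^2 / h^2)^2 inside h, 0 outside.

    Assumption: h is the distance to the N-th nearest neighbor, so h
    changes from one regression point to the next.
    """
    h = sorted(dists)[n_neighbors - 1]
    return [(1 - (d / h) ** 2) ** 2 if d < h else 0.0 for d in dists]

# Hypothetical distances from one regression point to five observations.
dists = [1.0, 2.0, 4.0, 8.0, 16.0]
fixed = gaussian_fixed(dists, h=4.0)
adaptive = bisquare_adaptive(dists, n_neighbors=3)

print([round(wt, 3) for wt in fixed])
print([round(wt, 3) for wt in adaptive])
```

With the fixed kernel every observation keeps some weight and the area covered is constant; with the adaptive kernel the number of contributing observations is capped and the area covered depends on how tightly the neighbors are packed.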
So how do we know what bandwidth to use?
Judging the appropriate bandwidth
A tradeoff between
Bias: we include observations that are not part of the same spatial group
Variance: we don't have enough points in our model to say anything with conviction
[Figure: AIC versus bandwidth, with regions labeled Variance, Optimum, and Bias]
AICc or CV measure model fit. Optimize fit to obtain best bandwidth.
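As a rough illustration of bandwidth optimization, the sketch below scores a handful of candidate bandwidths by leave-one-out cross-validation (CV) and keeps the one with the lowest score. It assumes a single predictor, a fixed Gaussian kernel, and noise-free toy data whose slope drifts from west to east; the candidate list and function names are hypothetical, and real software typically searches the bandwidth continuously rather than over a short grid.

```python
import math

def loo_prediction(i, coords, x, y, h):
    """Predict y[i] from a Gaussian-weighted fit that leaves observation i out."""
    idx = [j for j in range(len(y)) if j != i]
    w = [math.exp(-0.5 * (math.dist(coords[i], coords[j]) / h) ** 2) for j in idx]
    sw = sum(w)
    xbar = sum(wj * x[j] for wj, j in zip(w, idx)) / sw
    ybar = sum(wj * y[j] for wj, j in zip(w, idx)) / sw
    sxy = sum(wj * (x[j] - xbar) * (y[j] - ybar) for wj, j in zip(w, idx))
    sxx = sum(wj * (x[j] - xbar) ** 2 for wj, j in zip(w, idx))
    slope = sxy / sxx
    return ybar + slope * (x[i] - xbar)

def cv_score(coords, x, y, h):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    return sum((y[i] - loo_prediction(i, coords, x, y, h)) ** 2
               for i in range(len(y)))

# Toy data (assumed): 20 locations on a line, slope rising from 1.0 to 2.9.
coords = [(float(k), 0.0) for k in range(20)]
x = [float((7 * k) % 5 + 1) for k in range(20)]
y = [(1.0 + 0.1 * k) * x[k] for k in range(20)]

candidates = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
scores = {h: cv_score(coords, x, y, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(best_h, round(scores[best_h], 3))
```

Because the toy data contain no noise, the search favors small bandwidths here; with noisy data, very small bandwidths are penalized by variance and the optimum moves toward the middle of the range.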
To sum
Weighting assumptions are very important to outcomes in GWR
Fixed distance kernel is more appropriate when the distribution of your observations is relatively stable across space (e.g. size, number of neighbors)
Adaptive kernel is appropriate when distribution varies across space (e.g. events are clustered or polygons are heterogeneous)
Once a kernel type is selected, optimization takes some of the guesswork out of it, but robustness checks are still needed
Residuals from the OLS model from last lesson
Looks reasonably good
Moran's I is still 0.22 and highly significant
Lab
Run GWR model
Check residuals
Check variation in coefficients
Further topics/issues in GWR

Where to go for next steps
General troubleshooting
Significance testing
Outlier problems
Poisson and Logistic model implementations
Mixed form models
Other software implementations of GWR

GWR 3.x (4.0 should be out soon)
R (spgwr package)
Stata
Matlab
Perhaps others I haven't heard of
General Troubleshooting
Regional dummies BAD

Eliminate them from the model: we are trying to show regional variation, not control for it
Binary and low probability count variables

Use caution, lack of variation may cause model to crash or have trouble finding a workable bandwidth
Significance Testing
How do I know if the variation I see in my coefficients is meaningful?
Could do a t-test, but you will run into problems with multiple (1,387) tests
Results in lots of false positives
Standard correction (Bonferroni) will make any significance finding nearly impossible
Best Method: Monte Carlo simulation
Randomly reassign all observation values (dependent and independent variables travel together) to different observation locations
Each county's data gets assigned randomly to a different county
Re-run GWR and record coefficients
Repeat lots of times (at least 100)
Define a distribution for coefficient values and compare your coefficients to this distribution
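The procedure can be sketched as a permutation test. This is a simplified stand-in, not the routine in any GWR package: it assumes a single predictor, a fixed Gaussian kernel, and it summarizes "variation in the coefficients" by the standard deviation of the local slopes, which is one possible choice of test statistic. The toy data, bandwidth, and number of simulations are all illustrative.

```python
import math
import random
import statistics

def local_slopes(coords, x, y, h):
    """Gaussian-weighted least-squares slope of y on x at every location."""
    slopes = []
    for ci in coords:
        w = [math.exp(-0.5 * (math.dist(ci, cj) / h) ** 2) for cj in coords]
        sw = sum(w)
        xbar = sum(a * b for a, b in zip(w, x)) / sw
        ybar = sum(a * b for a, b in zip(w, y)) / sw
        sxy = sum(a * (b - xbar) * (c - ybar) for a, b, c in zip(w, x, y))
        sxx = sum(a * (b - xbar) ** 2 for a, b in zip(w, x))
        slopes.append(sxy / sxx)
    return slopes

def monte_carlo_p(coords, x, y, h, n_sims=199, seed=0):
    """Share of runs whose slope variability is at least the observed one."""
    rng = random.Random(seed)
    observed = statistics.pstdev(local_slopes(coords, x, y, h))
    pairs = list(zip(x, y))        # dependent and independent values travel together
    at_least = 0
    for _ in range(n_sims):
        rng.shuffle(pairs)         # reassign each record to a random location
        xs, ys = zip(*pairs)
        if statistics.pstdev(local_slopes(coords, xs, ys, h)) >= observed:
            at_least += 1
    # Count the observed arrangement as one of the possible arrangements.
    return (at_least + 1) / (n_sims + 1)

# Toy data (assumed): slope drifts from west to east.
coords = [(float(k), 0.0) for k in range(20)]
x = [float((7 * k) % 5 + 1) for k in range(20)]
y = [(1.0 + 0.1 * k) * x[k] for k in range(20)]
p_value = monte_carlo_p(coords, x, y, h=3.0)
print(p_value)
```

A small p-value would indicate that the observed spatial variation in the coefficient is larger than what random reshuffling of the same records tends to produce.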
Other method: Fotheringham Significance Test

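$\alpha_{\text{Fotheringham}} = \dfrac{\alpha}{1 + p_e - \dfrac{p_e}{np}}$
$\alpha$ is the desired significance level, $p_e$ is the effective number of parameters, $p$ is the number of parameters, and $n$ is the number of observations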
Fotheringham Significance Test

$\alpha_{\text{Fotheringham}} = \dfrac{0.05}{1 + 37.97 - \dfrac{37.97}{1387 \times 8}} = 0.001283$
In Excel we can find the significant t-statistic using: TINV(.001283,1379)
In R we use: qt(1-(.001283/2),1379)
Either way we get a value of ~3.23
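The arithmetic on this slide can be checked with a few lines of Python. The function name is made up for the example; the inputs (0.05, 37.97, 1,387 and 8) are the values from the slide, and the formula is the adjusted significance level 0.05 / (1 + pe - pe / (n * p)). The critical t-value itself needs a t-distribution quantile function, which the Python standard library does not provide, so the sketch stops at the adjusted level.

```python
def fotheringham_alpha(alpha, p_e, n, p):
    """Adjusted significance level: alpha / (1 + p_e - p_e / (n * p))."""
    return alpha / (1 + p_e - p_e / (n * p))

# Values from the slide: 1,387 observations, 8 parameters,
# 37.97 effective parameters, desired level 0.05.
adjusted = fotheringham_alpha(alpha=0.05, p_e=37.97, n=1387, p=8)
bonferroni = 0.05 / 1387          # for comparison: one test per observation

print(f"{adjusted:.6f}")
print(f"{bonferroni:.6f}")
```

The adjusted level is far less severe than the Bonferroni level, which is why it leaves room for significance findings that Bonferroni would make nearly impossible.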
Results: Significant Nonstationarity for Percent Hispanic
Outlier problems
Outliers cause problems for everybody, but their impact is greater for local regressions, particularly when bandwidth keeps the number of observations low.
In standard OLS:
Run model and identify observations with high or low residuals (~ +/- 4)
Weight these observations less than 1
Re-run until none of the observations have extreme residuals
Now do your GWR with weights assigned
Poisson and Logistic model forms

Implementations exist in both R and GWR 3.x software
Both require much greater care with respect to collinearity and lack of variation
Mixed-form models
What if some of your variables are stationary and others have variation?
Mixed-form models allow you to hold some coefficients constant while allowing others to vary
Not yet implemented in any statistical package, but not that difficult from a technical standpoint
Concluding comments
What comes next?

Spatial regression
Multilevel models
