
Regression analysis

The spatial statistics toolbox provides effective tools for quantifying spatial patterns. Using the Hot Spot
Analysis tool, for example, you can ask questions like:
1. Are there places in the United States where people are persistently dying young?

2. Where are the hot spots for crime, 911 emergency calls (see graphic below), or fires?

3. Where do we find a higher than expected proportion of traffic accidents in a city?

Each of the questions above asks "where?". The next logical question for the types of analyses above
involves "why?":
1. Why are there places in the United States where people persistently die young? What might be
causing this?

2. Can we model the characteristics of places that experience lots of crime, 911 calls, or fire events in
order to help reduce these incidents?

3. What are the factors contributing to higher than expected traffic accidents? Are there policy
implications or mitigating actions that might reduce traffic accidents across the city and/or in
particular high accident areas?

Tools included in the Modeling Spatial Relationships toolset help users answer this second set of "why"
questions. These tools include Ordinary Least Squares (OLS) Regression and Geographically Weighted
Regression (GWR).

Regression Analysis
Regression analysis allows you to model, examine, and explore spatial relationships, and can help explain
the factors behind observed spatial patterns. Regression analysis is also used for prediction. You may want
to understand why people are persistently dying young in certain regions, for example, or may want to
predict rainfall where there are no rain gauges.
OLS is the best known of all regression techniques. It is also the proper starting point for all spatial
regression analyses. It provides a global model of the variable or process you are trying to understand or
predict (early death/rainfall); it creates a single regression equation to represent that process.
Geographically Weighted Regression (GWR) is one of several spatial regression techniques, increasingly used
in geography and other disciplines. GWR provides a local model of the variable or process you are trying to
understand/predict by fitting a regression equation to every feature in the dataset. When used properly,
these methods are powerful and reliable statistics for examining/estimating linear relationships.
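To make the global/local contrast concrete, the snippet below is a bare-bones sketch of the geographically
weighted idea, not the ArcGIS GWR tool: the function name gwr_fit, the Gaussian kernel, and the fixed
bandwidth are assumptions made only for illustration.

    import numpy as np

    def gwr_fit(coords, X, y, bandwidth):
        """Fit one weighted least-squares regression per feature, weighting
        nearby observations more heavily (fixed Gaussian kernel) - a conceptual sketch."""
        n = len(y)
        X1 = np.column_stack([np.ones(n), X])              # add an intercept column
        local_coefs = np.empty((n, X1.shape[1]))
        for i in range(n):
            d = np.linalg.norm(coords - coords[i], axis=1) # distances to feature i
            w = np.exp(-0.5 * (d / bandwidth) ** 2)        # Gaussian weights
            XtW = X1.T * w                                 # X'W without building a full weight matrix
            local_coefs[i] = np.linalg.solve(XtW @ X1, XtW @ y)
        return local_coefs                                 # one row of coefficients per feature

OLS, by contrast, fits a single equation for all features; a geographically weighted fit returns a set of
coefficients for every feature that you can map. In practice, choosing the bandwidth (how local "local" is)
matters a great deal.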
Linear relationships are either positive or negative. If you find that the number of search and rescue events
increases when daytime temperatures rise, the relationship is said to be positive; there is a positive
correlation. Another way to express this positive relationship is to say that search and rescue events
decrease as daytime temperatures decrease. Conversely, if you find that the number of crimes goes down as
the number of police officers patrolling an area goes up, the relationship is said to be negative. You can also
express this negative relationship by stating that the number of crimes increases as the number of patrolling
officers decreases. The graphic below depicts both positive and negative relationships, as well as the case
where there is no relationship between two variables:

Correlation analyses, and their associated graphics depicted above, test the strength of the relationship
between two variables. Regression analyses, on the other hand, make a stronger claim; they attempt to
demonstrate the degree to which one or more variables potentially promote positive or negative change in
another variable.
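A quick numeric illustration of these positive and negative relationships, using made-up values:

    import numpy as np

    temperature = np.array([10, 15, 20, 25, 30, 35])   # daytime temperatures
    rescues     = np.array([ 2,  3,  5,  6,  9, 11])   # search and rescue events
    officers    = np.array([ 5, 10, 15, 20, 25, 30])   # patrolling officers
    crimes      = np.array([90, 80, 65, 55, 40, 30])   # reported crimes

    print(np.corrcoef(temperature, rescues)[0, 1])     # near +1: positive relationship
    print(np.corrcoef(officers, crimes)[0, 1])         # near -1: negative relationship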

Using Regression Analysis


Regression analysis can be used for a large variety of applications:

Modeling fire frequency to determine high risk areas and to understand the factors that contribute
to high risk areas.

Modeling property loss from fire as a function of variables such as degree of fire department
involvement, response time, property value, etc. If you find that response time is the key factor,
you may need to build more fire stations. If you find that involvement is the key factor, you may
need to increase equipment/officers dispatched.

Modeling traffic accidents as a function of speed, road conditions, weather, etc. in order to inform
policy aimed at decreasing accidents.

There are three primary reasons you might want to use regression analysis:
1. To model some phenomenon in order to better understand it and possibly use that understanding to
affect policy or to make decisions about appropriate actions to take. Basic objective: to measure the
extent that changes in one or more variables jointly affect changes in another. Example: understand
the key characteristics of the habitat for some particular endangered species of bird (perhaps
precipitation, food sources, vegetation, predators) to assist in designing legislation aimed at
protecting that species.

2. To model some phenomenon in order to predict values for that phenomenon at other places or other
times. Basic objective: to build a prediction model that is consistent and accurate. Example: where
are real estate values likely to go up next year? Or: there are rain gauges at particular places and a
set of variables that explain the observed precipitation values; how much rain falls in places where
there are no gauges? (Regression may be used in cases where interpolation is not effective because
of insufficient sampling: there are no gauges on peaks or in valleys, for example.)

3. You can also use regression analysis to test hypotheses. Suppose you are modeling residential crime
in order to better understand it, and hopefully implement policy to prevent it. As you begin your
analysis you probably have questions or hypotheses you want to test:

o "Broken Window Theory" indicates that defacement of public property (graffiti, damaged
structures, etc.) invites other crimes. Will there be a positive relationship between vandalism
incidents and residential burglary?

o Is there a relationship between illegal drug use and burglary (might drug addicts steal to
support their habits)?

o Are burglars predatory? Might there be more incidents in residential neighborhoods with
higher proportions of elderly or female-headed households?

o Is a person at greater risk for burglary if they live in a rich or a poor neighborhood?

You can use regression analysis to test these relationships and answer your questions.

Regression Analysis components


It is impossible to discuss regression analysis without first becoming familiar with a few terms and basic
concepts specific to regression statistics:
Regression equation: this is the mathematical formula applied to the explanatory variables in order to
best predict the dependent variable you are trying to model. Unfortunately for those in the Geosciences who
think of X and Y as coordinates, the notation in regression equations for the dependent variable is always "y"
and for independent or explanatory variables is always "X". Each independent variable is associated with a
regression coefficient describing the strength and the sign of that variable's relationship to the dependent
variable. A regression equation might look like this (y is the dependent variable, the Xs are the explanatory
variables, and the βs are regression coefficients; each of these components of the regression equation is
explained further below):

y = β0 + β1X1 + β2X2 + ... + βnXn + ε
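As a minimal sketch with made-up numbers, evaluating the equation is just multiplying each explanatory
value by its coefficient and summing, starting from the intercept:

    import numpy as np

    beta = np.array([2.0, 0.5, -1.3])   # hypothetical intercept B0 and coefficients B1, B2
    x = np.array([1.0, 4.0, 2.5])       # 1 for the intercept, then explanatory values X1, X2

    y_predicted = beta @ x              # B0*1 + B1*X1 + B2*X2 = 2.0 + 2.0 - 3.25
    print(y_predicted)                  # 0.75, the predicted value of the dependent variable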

Dependent variable (y): this is the variable representing the process you are trying to predict or
understand (e.g., residential burglary, foreclosure, rainfall). In the regression equation, it appears
on the left side of the equal sign. While you can use regression to predict the dependent variable,
you always start with a set of known y values and use these to build (or to calibrate) the regression
model. The known y values are often referred to as observed values.

Independent/Explanatory variables (X): these are the variables used to model or to predict the
dependent variable values. In the regression equation, they appear on the right side of the equal
sign and are often referred to as explanatory variables. We say that the dependent variable is a
function of the explanatory variables. If you are interested in predicting annual purchases for a
proposed store, you might include in your model explanatory variables representing the number of
potential customers, distance to competition, store visibility, and local spending patterns, for
example.

Regression coefficients (β): coefficients are computed by the regression tool. They are values,
one for each explanatory variable, that represent the strength and type of relationship the
explanatory variable has to the dependent variable. Suppose you are modeling fire frequency as a
function of solar radiation, vegetation, precipitation and aspect. You might expect a positive
relationship between fire frequency and solar radiation (the more sun, the more frequent the fire
incidents). When the relationship is positive, the sign for the associated coefficient is also positive.
You might expect a negative relationship between fire frequency and precipitation (places with more
rain have fewer fires). Coefficients for negative relationships have negative signs. When the
relationship is a strong one, the coefficient is large. Weak relationships are associated with
coefficients near zero.
β0 is the regression intercept. It represents the expected value for the dependent variable if all of
the independent variables are zero.

P-Values: most regression methods perform a statistical test to compute a probability, called a p-value, for
the coefficients associated with each independent variable. The null hypothesis for this statistical test states
that a coefficient is not significantly different from zero (in other words, for all intents and purposes, the
coefficient is zero and the associated explanatory variable is not helping your model). Small p-values reflect
small probabilities, and suggest that the coefficient is, indeed, important to your model with a value that is
significantly different from zero (the coefficient is NOT zero). You would say that a coefficient with a p-value
of 0.01, for example, is statistically significant at the 99% confidence level; the associated variable is an
effective predictor. Variables with coefficients near zero do not help predict or model the dependent variable;
they are almost always removed from the regression equation, unless there are strong theoretical reasons
to keep them.
R2/R-Squared: Multiple R-Squared and Adjusted R-Squared are both statistics derived from the regression
equation to quantify model performance. The value of R-squared ranges from 0 to 1 (0 to 100 percent). If your
model fits the observed dependent variable values perfectly, R-squared is 1.0 (and you, no doubt, have
made an error; perhaps you've used a form of y to predict y). More likely, you will see R-squared values like
0.49, for example, which you can interpret by saying: this model explains 49% of the variation in the
dependent variable. To understand what the R-squared value is getting at, create a bar graph showing both
the estimated and observed y values sorted by the estimated values. Notice how much overlap there is. This
graphic provides a visual representation of how well the model's predicted values explain the variation in the
observed dependent variable values. View an illustration. The Adjusted R-Squared value is always a bit lower
than the Multiple R-Squared value because it reflects model complexity (the number of variables) as it
relates to the data.
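The R-squared calculation itself is simple; a sketch with made-up observed and predicted values:

    import numpy as np

    observed  = np.array([4.0, 7.0, 5.0, 9.0, 6.0])    # known dependent variable values
    predicted = np.array([4.5, 6.0, 5.5, 8.5, 6.5])    # values estimated by the model

    ss_res = np.sum((observed - predicted) ** 2)        # unexplained (residual) variation
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total variation in the dependent variable
    print(1 - ss_res / ss_tot)                          # proportion of variation explained by the model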
Residuals: these are the unexplained portion of the dependent variable, represented in the regression
equation as the random error term, ε. View an illustration. Known values for the dependent variable are
used to build and to calibrate the regression model. Using known values for the dependent variable (y) and
known values for all of the explanatory variables (the Xs), the regression tool constructs an equation that
will predict those known y values as well as possible. The predicted values will rarely match the observed
values exactly. The differences between the observed y values and the predicted y values are called the
residuals. The magnitude of the residuals from a regression equation is one measure of model fit. Large
residuals indicate poor model fit.
Building a regression model is an iterative process that involves finding effective independent variables to
explain the process you are trying to model/understand, then running the regression tool to determine
which variables are effective predictors, and then removing/adding variables until you find the best model
possible.
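Outside ArcGIS, that iterative loop is often prototyped with a general-purpose statistics package. The sketch
below uses Python's statsmodels; the variable names and the tiny data set are assumptions made only for
illustration.

    import pandas as pd
    import statsmodels.api as sm

    # hypothetical data: one row per census block
    data = pd.DataFrame({
        "burglaries":      [12, 3, 7, 20, 1, 9],             # dependent variable (y)
        "population":      [900, 300, 650, 1500, 120, 700],  # explanatory variables (X)
        "dist_urban_core": [1.2, 8.5, 4.0, 0.5, 12.0, 3.3],
    })

    y = data["burglaries"]
    X = sm.add_constant(data[["population", "dist_urban_core"]])  # explanatory variables plus intercept

    model = sm.OLS(y, X).fit()
    print(model.summary())   # coefficients, p-values, R-squared and Adjusted R-squared
    print(model.resid)       # residuals: observed minus predicted values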

Regression Analysis Issues


OLS regression is a straightforward method, has well-developed theory behind it, and has a number of
effective diagnostics to assist with interpretation and troubleshooting. OLS is only effective and reliable,
however, if your data and regression model meet/satisfy all of the assumptions inherently required by this
method (see the table below). Spatial data often violate the assumptions/requirements of OLS regression,
and so it is important to use regression tools in conjunction with appropriate diagnostic tools that can assess
whether or not regression is an appropriate method for your analysis, given the structure of the data and
the model being implemented.
How Regression Models Go Bad. A serious violation for many regression models is misspecification. A
misspecified model is one that is not complete - it is missing key/important explanatory variables and so it
does not adequately represent what you are trying to model or trying to predict (the dependent variable, y);
in other words, the regression model is not telling the whole story. Misspecification is evident whenever you
see statistically significant spatial autocorrelation in regression residuals, or said another way: whenever you
notice that the over and underpredictions (residuals) from your model tend to cluster spatially so that the
over predictions cluster together in some portions of the study area and the underpredictions cluster

together in others. Mapping regression residuals or the coefficients from a Geographically Weighted
Regression (GWR) analysis will often provide clues about what you've missed. Running a Hot Spot
Analysis on regression residuals may help reveal different spatial regimes that can be modeled in OLS with
regional variables or can be remedied using the Geographically Weighted Regression (GWR) method.
Suppose when you map your regression residuals you see that the model is always over predicting in the
mountain areas and under predicting in the valleys - you will likely conclude that your model is missing an
Elevation variable. There will be times, however, when the missing variable(s) are too complex to model, or
impossible to quantify, or too difficult to measure. In these cases, you may be able to move to GWR or to
another spatial regression method to get a well specified model.
The following table lists common problems with regression models, and the tools available in ArcGIS to help
address them:

Common Regression Problems, Consequences, and Solutions

Problem: Omitted explanatory variables (misspecification).
Consequence: When key explanatory variables are missing from a regression model, coefficients and their
associated p-values cannot be trusted.
Solution: Map and examine OLS residuals and GWR coefficients, or run Hot Spot Analysis on OLS regression
residuals, to see if this provides clues about possible missing variables.

Problem: Non-linear relationships. View an illustration.
Consequence: OLS and GWR are both linear models. If the relationship between any of the explanatory
variables and the dependent variable is non-linear, the resultant model will perform poorly.
Solution: Use the scatterplot matrix graphic to elucidate the relationships among all variables in the model.
Pay careful attention to relationships involving the dependent variable. Curvilinearity can often be remedied
by transforming the variables. View an illustration. Alternatively, use a non-linear regression method.

Problem: Data outliers. View an illustration.
Consequence: Influential outliers can pull modeled regression relationships away from their true best fit,
biasing regression coefficients.
Solution: Use the scatterplot matrix and other graphing tools to examine extreme data values. Correct or
remove outliers if they represent errors. When outliers are correct/valid values, they cannot/should not be
removed. Run the regression with and without the outliers to see how much they are affecting your results.

Problem: Non-stationarity. You might find that an INCOME variable, for example, has strong explanatory
power in region A, but is insignificant or even switches signs in region B. View an illustration.
Consequence: If relationships between your dependent and explanatory variables are inconsistent across
your study area, computed standard errors will be artificially inflated.
Solution: The OLS tool in ArcGIS automatically tests for problems associated with non-stationarity (regional
variation) and computes robust standard error values. View an illustration. When the probability associated
with the Koenker test is small (< 0.05, for example), you have statistically significant regional variation and
should consult the robust probabilities to determine if an explanatory variable is statistically significant or
not. You will improve model results by using Geographically Weighted Regression.

Problem: Multicollinearity: one or a combination of explanatory variables is redundant. View an illustration.
Consequence: Multicollinearity leads to an over-counting type of bias and an unstable/unreliable model.
Solution: The OLS tool in ArcGIS automatically checks for redundancy. Each explanatory variable is given a
computed VIF value. When this value is large (> 7.5, for example), redundancy is a problem and the
offending variables should be removed from the model or modified by creating an interaction variable or
increasing the sample size. View an illustration.

Problem: Inconsistent variance in residuals. It may be that the model predicts well for small values of the
dependent variable, but becomes unreliable for large values. View an illustration.
Consequence: When the model predicts poorly for some range of values, results will be biased.
Solution: The OLS tool in ArcGIS automatically tests for inconsistent residual variance (called
heteroskedasticity) and computes standard errors that are robust to this problem. When the probability
associated with the Koenker test is small (< 0.05, for example), you should consult the robust probabilities
to determine if an explanatory variable is statistically significant or not. View an illustration.

Problem: Spatially autocorrelated residuals. View an illustration.
Consequence: When there is spatial clustering of the under/over predictions coming out of the model, it
introduces an over-counting type of bias and renders the model unreliable.
Solution: Run the Spatial Autocorrelation tool on the residuals to ensure they do not exhibit statistically
significant spatial clustering. Statistically significant spatial autocorrelation is often a symptom of
misspecification (a key variable is missing from the model). View an illustration.

Problem: Normal distribution bias. View an illustration.
Consequence: When the regression model residuals are not normally distributed with a mean of zero, the
p-values associated with the coefficients are unreliable.
Solution: The OLS tool in ArcGIS automatically tests whether the residuals are normally distributed. When
the Jarque-Bera statistic is significant (< 0.05, for example), your model is likely misspecified (a key
variable is missing from the model). Examine the output residual map and perhaps GWR coefficient maps to
see if this reveals the key variables missing from the analysis.

It is important to test for each of the problems listed above. Results can be 100% wrong (180 degrees
different) if any of the problems above are ignored.
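Several of the diagnostics in the table can also be reproduced outside ArcGIS. For example, the VIF check
for multicollinearity might look like the following sketch in Python (the column names and data are
assumptions made only for illustration):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # hypothetical explanatory variables; population and employed are nearly redundant
    X = sm.add_constant(pd.DataFrame({
        "population":    [900, 300, 650, 1500, 120, 700, 820],
        "employed":      [450, 160, 330,  760,  60, 340, 410],
        "median_income": [42, 55, 38, 61, 47, 50, 44],
    }))

    for i, name in enumerate(X.columns):
        if name == "const":
            continue                                       # skip the intercept column
        print(name, variance_inflation_factor(X.values, i))
    # values larger than about 7.5 flag redundant (multicollinear) variables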

Spatial regression

Spatial data exhibit two properties that make it difficult (but not impossible) to meet the assumptions and
requirements of traditional (non-spatial) statistical methods, like OLS regression:
1. Geographic features are more often than not spatially autocorrelated; this means that features near
each other tend to be more similar than features that are farther away. This creates an over-count type of
bias for traditional (non-spatial) regression methods.

2. Geography is important, and often the processes most important to the model are non-stationary; these
processes behave differently in different parts of the study area. This characteristic of spatial data can be
referred to as regional variation or spatial drift.

True spatial regression methods were developed to be robust to these two characteristics of spatial data,
and even to incorporate these special qualities of spatial data in order to improve their ability to model data
relationships. Some spatial regression methods deal effectively with the first characteristic (spatial
autocorrelation), others deal effectively with the second (non-stationarity). At present, no spatial regression
methods are effective for both characteristics. For a properly specified GWR model, however, spatial
autocorrelation is typically not a problem.
Spatial Autocorrelation. There seems to be a big difference between how a traditional statistician views
spatial autocorrelation and how a spatial statistician views spatial autocorrelation. The traditional statistician
sees it as a bad thing that needs to be removed from the data (through resampling, for example) because
spatial autocorrelation violates underlying assumptions of many traditional (non-spatial) statistical methods.
For the geographer or GIS analyst, however, spatial autocorrelation is evidence of important underlying
spatial processes at work; it is an integral component of our data! Removing space removes data from their
spatial context; it is like getting only half the story. The spatial processes and spatial relationships evident
in our data are a primary interest, and one of the reasons we get so excited about spatial data analysis. To
avoid an over-counting type of bias in your model, however, you must identify the full set of explanatory
variables that will effectively capture the inherent spatial structure in your dependent variable. If you cannot
identify all of these variables, you will very likely see statistically significant spatial autocorrelation in the
model residuals. Unfortunately, you cannot trust your regression results until this is remedied. Use
the Spatial Autocorrelation tool to test for statistically significant spatial autocorrelation in your regression
residuals.
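As a conceptual sketch of what that test measures, the function below computes a global Moran's I for
regression residuals using simple inverse-distance weights; the weighting scheme, function name, and data
are assumptions made only for illustration (the ArcGIS tool offers several ways to define spatial
relationships, and also reports a z-score and p-value).

    import numpy as np

    def morans_i(values, coords):
        """Global Moran's I with inverse-distance weights (conceptual sketch)."""
        n = len(values)
        z = values - values.mean()
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)  # pairwise distances
        w = np.zeros_like(d)
        mask = d > 0
        w[mask] = 1.0 / d[mask]                   # inverse-distance weights, zero on the diagonal
        return (n / w.sum()) * np.sum(w * np.outer(z, z)) / np.sum(z ** 2)

    # made-up residuals: over predictions cluster in one corner, under predictions in another
    residuals = np.array([1.2, 0.8, 1.0, -0.9, -1.1, -0.7])
    coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
    print(morans_i(residuals, coords))            # clearly above 0: clustered over/under predictions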
There are at least three strategies for dealing with spatial autocorrelation in regression model residuals:
1. Resample until the input variables no longer exhibit statistically significant spatial autocorrelation.
While this does not ensure the analysis is free of spatial autocorrelation problems, such problems are far
less likely when spatial autocorrelation is removed from the dependent and explanatory variables. This is
the traditional statistician's approach to dealing with spatial autocorrelation and is only appropriate if
spatial autocorrelation is the result of data redundancy (the sampling scheme is too fine).

2. Isolate the spatial and non-spatial components of each input variable using a spatial filtering regression
method. Space is removed from each variable, but then it is put back into the regression model as a new
variable to account for spatial effects/spatial structure. Spatial filtering regression methods will be added
to ArcGIS in a future release.

3. Incorporate spatial autocorrelation into the regression model using spatial econometric regression
methods. Econometric spatial regression methods will be added to ArcGIS in a future release.

Regional Variation. Global models, like OLS regression, create equations that best describe the overall data
relationships in a study area. When those relationships are consistent across the study area, the OLS
regression equation models those relationships well. When those relationships behave differently in different
parts of the study area, however, the regression equation is more of an average of the mix of relationships
present, and in the case where those relationships represent two extremes, the global average will not
model either extreme well. When your explanatory variables exhibit non-stationary relationships (regional
variation), global models tend to fall apart unless robust methods are used to compute regression results.
Ideally, you will be able to identify a full set of explanatory variables to capture the regional variation

inherent in your dependent variable. If you cannot identify all of these spatial variables, however, you will
again notice statistically significant spatial autocorrelation in your model residuals and/or lower than
expected R-squared values. Unfortunately, you cannot trust your regression results until this is remedied.
There are at least four ways to deal with regional variation in OLS regression models:

1. Include a variable in the model that explains the regional variation. If you see that your model is always
over-predicting in the north and under-predicting in the south, for example, add a regional variable set to 1
for northern features and set to 0 for southern features (a minimal sketch of this strategy appears after
this list).

2. Use methods that incorporate regional variation into the regression model, such as Geographically
Weighted Regression (GWR).

3. Consult robust regression standard errors and probabilities to determine if variable coefficients are
statistically significant. See Interpreting OLS Regression Results. Geographically Weighted Regression is
still recommended.

4. Redefine/reduce the size of the study area so that the processes within it are all stationary - so they
no longer exhibit regional variation.
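A minimal sketch of strategy 1 using statsmodels; the column names and the north/south split are
assumptions made only for illustration:

    import pandas as pd
    import statsmodels.api as sm

    # hypothetical features with a y coordinate, the dependent variable, and one explanatory variable
    data = pd.DataFrame({
        "burglaries": [12, 3, 7, 20, 1, 9],
        "population": [900, 300, 650, 1500, 120, 700],
        "y_coord":    [10.2, 48.9, 52.1, 9.7, 47.3, 11.5],
    })

    # regional variable: 1 for northern features, 0 for southern features
    data["north"] = (data["y_coord"] > 30).astype(int)

    X = sm.add_constant(data[["population", "north"]])
    model = sm.OLS(data["burglaries"], X).fit()
    print(model.params)   # the 'north' coefficient absorbs the systematic regional over/under prediction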

Interpreting OLS results


Output generated from the OLS Regression tool includes:
Output feature class.

Message window report of statistical results.

Optional table of explanatory variable coefficients.

Optional table of regression diagnostics.

Each of these outputs is shown and described below as a series of steps for running OLS regression and
interpreting OLS results.
(A) Run the OLS tool:

You will need to provide an input feature class with a unique ID field, the dependent variable you want to
model/explain, and all of the explanatory variables. You will also need to provide a pathname for the output
feature class, and optionally, pathnames for the coefficient and diagnostic output tables. As the OLS tool
runs, statistical results are printed to the screen.
(B) Examine the statistical report using the numbered steps described below:

Dissecting the Statistical Report


1. Assess model performance. Both the Multiple R-Squared and Adjusted R-Squared values are measures of
model performance. Possible values range from 0.0 to 1.0. The Adjusted R-Squared value is always a bit
lower than the Multiple R-Squared value because it reflects model complexity (the number of variables) as
it relates to the data, and consequently is a more accurate measure of model performance. Adding an
additional explanatory variable to the model will likely increase the Multiple R-Squared value, but decrease
the Adjusted R-Squared value. Suppose you are creating a regression model of residential burglary (the
number of residential burglaries associated with each census block is your dependent variable, y). An
Adjusted R-Squared value of 0.84 would indicate that your model (your explanatory variables modeled using
linear regression) explains approximately 84% of the variation in the dependent variable, or said another
way: your model tells approximately 84% of the residential burglary "story".

2. Assess each explanatory variable in the model: Coefficient, Probability or Robust Probability, and
Variance Inflation Factor (VIF). The coefficient for each explanatory variable reflects both the strength and
type of relationship the explanatory variable has to the dependent variable. When the sign associated with
the coefficient is negative, the relationship is negative (e.g., the larger the distance from the urban core,
the smaller the number of residential burglaries). When the sign is positive, the relationship is positive
(e.g., the larger the population, the larger the number of residential burglaries). Coefficients are given in
the same units as their associated explanatory variables (a coefficient of 0.005 associated with a variable
representing population counts may be interpreted as 0.005 people). The coefficient reflects the expected
change in the dependent variable for every 1 unit change in the associated explanatory variable, holding all
other variables constant (e.g., a 0.005 increase in residential burglary is expected for each additional
person in the census block, holding all other explanatory variables constant). The T test is used to assess
whether or not an explanatory variable is statistically significant. The null hypothesis is that the coefficient
is, for all intents and purposes, equal to zero (and consequently is NOT helping the model). When the
probability or robust probability is very small, the chance of the coefficient being essentially zero is also
small. If the Koenker test (see below) is statistically significant, use the robust probabilities to assess
explanatory variable statistical significance. Statistically significant probabilities have an asterisk "*" next
to them. An explanatory variable associated with a statistically significant coefficient is important to the
regression model if theory/common sense supports a valid relationship with the dependent variable, if the
relationship being modeled is primarily linear, and if the variable is not redundant to any other explanatory
variables in the model. The variance inflation factor (VIF) measures redundancy among explanatory
variables. As a rule of thumb, explanatory variables associated with VIF values larger than about 7.5 should
be removed (one by one) from the regression model. If, for example, you have a population variable (the
number of people) and an employment variable (the number of employed persons) in your regression
model, you will likely find them to be associated with large VIF values, indicating that both of these
variables are telling the same "story"; one of them should be removed from your model.

3. Assess model significance. Both the Joint F-Statistic and Joint Wald Statistic are measures of overall
model statistical significance. The Joint F-Statistic is trustworthy only when the Koenker (BP) statistic (see
below) is not statistically significant. If the Koenker (BP) statistic is significant you should consult the Joint
Wald Statistic to determine overall model significance. The null hypothesis for both of these tests is that
the explanatory variables in the model are not effective. For a 95% confidence level, a p-value (probability)
smaller than 0.05 indicates a statistically significant model.

4. Assess stationarity. The Koenker (BP) Statistic (Koenker's studentized Breusch-Pagan statistic) is a test
to determine if the explanatory variables in the model have a consistent relationship to the dependent
variable (what you are trying to predict/understand) both in geographic space and in data space. When the
model is consistent in geographic space, the spatial processes represented by the explanatory variables
behave the same everywhere in the study area (the processes are stationary). When the model is
consistent in data space, the variation in the relationship between predicted values and each explanatory
variable does not change with changes in explanatory variable magnitudes (there is no heteroscedasticity
in the model). Suppose you want to predict crime and one of your explanatory variables is income. The
model would have problematic heteroscedasticity if the predictions were more accurate for locations with
small median incomes than they were for locations with large median incomes. The null hypothesis for this
test is that the model is stationary. For a 95% confidence level, a p-value (probability) smaller than 0.05
indicates statistically significant heteroscedasticity and/or non-stationarity. When results from this test are
statistically significant, consult the robust coefficient standard errors and probabilities to assess the
effectiveness of each explanatory variable. Regression models with statistically significant non-stationarity
are especially good candidates for GWR analysis (a sketch of comparable checks for this step and the next
appears after this list).

5. Assess model bias. The Jarque-Bera statistic indicates whether or not the residuals (the observed/known
dependent variable values minus the predicted/estimated values) are normally distributed. The null
hypothesis for this test is that the residuals are normally distributed, and so if you were to construct a
histogram of those residuals, they would resemble the classic bell curve, or Gaussian distribution. When
the p-value (probability) for this test is small (smaller than 0.05 for a 95% confidence level, for example),
the residuals are not normally distributed, indicating model misspecification (a key variable is missing from
the model). Results from a misspecified OLS model are not trustworthy.

6. Assess residual spatial autocorrelation. Always run the Spatial Autocorrelation (Moran's I) tool on the
regression residuals to ensure they are spatially random. Statistically significant clustering of high and/or
low residuals (model under and over predictions) indicates a key variable is missing from the model
(misspecification). OLS results cannot be trusted when the model is misspecified.

7. Finally, review the section titled "How Regression Models Go Bad" in the Regression Analysis Basics
document as a check that your OLS regression model is properly specified. Notice, too, that there is a
section titled "Notes on Interpretation" at the end of the OLS statistical report to help you remember the
purpose of each statistical test.

(C) Examine output feature class residuals. Over and under predictions for a properly specified regression
model will be randomly distributed. Clustering of over and/or under predictions is evidence that you are
missing at least one key explanatory variable. Examine the patterns in your model residuals to see if they
provide clues about what those missing variables are. Sometimes running Hot Spot Analysis on regression
residuals will help you see the broader patterns in over and under predictions.

(D) View the coefficient and diagnostic tables. Creating the coefficient and diagnostic tables is optional.
While you are in the process of finding an effective model, you may elect not to create these tables. The
model building process is iterative and you will likely try a large number of different models (different

explanatory variables) until you settle on a few good ones. You can use the Akaike Information Criterion
(AIC) on the report to compare different models. The model with the smaller AIC value is the better model
(that is, taking into account model complexity, the model with the smaller AIC provides a better fit to the
observed data). You should always create the coefficient and diagnostic tables for your final OLS models in
order to capture the most important elements of the OLS report including the list of explanatory variables
used in the model with their coefficients, standard errors, and probabilities, and results for each diagnostic
test. The diagnostic table includes a description of each test along with some guidelines for how to interpret
test results.
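For example, continuing the statsmodels sketch from earlier (the data frame and column names are
assumptions made only for illustration), two candidate models can be compared by AIC directly:

    import statsmodels.api as sm

    # two candidate models built from the same hypothetical data frame used earlier
    X1 = sm.add_constant(data[["population"]])
    X2 = sm.add_constant(data[["population", "dist_urban_core"]])

    model1 = sm.OLS(data["burglaries"], X1).fit()
    model2 = sm.OLS(data["burglaries"], X2).fit()

    print(model1.aic, model2.aic)   # prefer the model with the smaller AIC value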
