
LINEAR REGRESSION IN GEOGRAPHY

by

Rob Ferguson
(University of Stirling)

ISSN 0306-6142
ISBN 0 902246 87 9

CONCEPTS AND TECHNIQUES IN MODERN GEOGRAPHY No. 15

CATMOG
(Concepts and Techniques in Modern Geography)

CATMOG has been created to fill a teaching need in the field of quantitative
methods in undergraduate geography courses. These texts are admirable guides
for the teachers, yet cheap enough for student purchase as the basis of
classwork. Each book is written by an author currently working with the
technique or concept he describes.

1. An introduction to Markov chain analysis - L. Collins
2. Distance decay in spatial interactions - P.J. Taylor
3. Understanding canonical correlation analysis - D. Clark
4. Some theoretical and applied aspects of spatial interaction shopping
   models - S. Openshaw
5. An introduction to trend surface analysis - D. Unwin
6. Classification in geography - R.J. Johnston
7. An introduction to factor analytical techniques - J.B. Goddard & A. Kirby
8. Principal components analysis - S. Daultrey
9. Causal inferences from dichotomous variables - N. Davidson
10. Introduction to the use of logit models in geography - N. Wrigley
11. Linear programming: elementary geographical applications of the
    transportation problem - A. Hay
12. An introduction to quadrat analysis - R.W. Thomas
13. An introduction to time-geography - N.J. Thrift
14. An introduction to graph theoretical methods in geography - K.J. Tinkler
15. Linear regression in geography - R. Ferguson
16. Probability surface mapping. An introduction with examples and
    Fortran programs - N. Wrigley
17. Sampling methods for geographical research - C. Dixon & B. Leach
18. Questionnaires and interviews in geographical research -
    C. Dixon & B. Leach

Other titles in preparation

CONTENTS

I INTRODUCTION
  (i) Relationships between variables
  (ii) Aims and prerequisites
  (iii) Averages, variances, and correlations

II SIMPLE REGRESSION
  (i) The least squares trend
  (ii) The linear model

III MULTIPLE REGRESSION
  (i) The need
  (ii) Two explanatory variables
  (iii) Multiple and partial correlation
  (iv) More than two predictors
  (v) Statistical assumptions, residual checking, and transformation
      of variables
  (vi) Confidence limits and significance tests

IV SPECIAL APPLICATIONS
  (i) Causal models
  (ii) Special kinds of variables

BIBLIOGRAPHY

This series, Concepts and Techniques in Modern Geography, is produced by the
Study Group in Quantitative Methods of the Institute of British Geographers.
For details of membership of the Study Group, write to the Institute of
British Geographers, 1 Kensington Gore, London, S.W.7. The series is
published by Geo Abstracts, University of East Anglia, Norwich, NR4 7TJ,
to whom all other enquiries should be addressed.

Acknowledgements

To Joyce Bell and Keith Scurr of the Geography Department at Hull University,
for typing the text and drawing the diagrams; John Pethick for many
discussions; and Jim Thompson (Hull) and Pete Taylor (Newcastle) for
technical comments.

LINEAR REGRESSION IN GEOGRAPHY
I INTRODUCTION
(i) Relationships between variables
Most if not all systems studied by geographers involve variable quant-
ities that can be measured numerically, from the distances travelled by
Sunday trippers to the hydraulic characteristics of proglacial streams. Com-
parison of maps or time series graphs often shows that differences in one
variable, in space or over time, appear to be associated with differences in
other variables. For example both Sunday trips and proglacial streamflow may
vary according to temperature. Sometimes the apparent association is entirely
a matter of chance, and in other cases two variables may amount to altern-
ative definitions of the same thing, but frequently we suspect cause and
effect are at work. A relationship of this kind can be important in three ways.
It may confirm or refute theoretical notions about cause-effect processes in
the system under study. Alternatively it may highlight something not con-
sidered by existing theory and thus stimulate new ideas. Thirdly, it may be
useful for prediction of the response to future changes in conditions, or of
the present state of affairs in places where direct measurement is inconven-
ient or impossible.

Whether we are testing, generalising, or predicting, an objective method


for summarising the form and strength of apparent associations is useful.
Occasionally two or more variables are linked by a simple mathematical
equation, as in the 'laws' of physics. But most geographical relationships
are only broad trends, with individual cases departing from the norm because
of unique local circumstances and differences in other relevant factors. The
statistical techniques used to pick out general tendencies of this kind are
known as regression methods. It is not surprising that they are very widely
used in geography.

(ii) Aims and prerequisites


Regression analysis is a major branch of mathematical statistics and is
used throughout the social and environmental sciences as well as in many
branches of industry, business, and government. As a result computer programs
are widely available, there is a vast literature on the subject, and many
specialised variants of the basic techniques have been devised. Attention is
focused here on the simplest and commonest version, linear regression. It is
unrealistic to expect all users of regression methods, including those who
make judgments on the basis of other people's analyses, to understand in de-
tail how these tools work. But pitfalls surround the unwary user who has no
idea what is involved. I have therefore tried to explain how linear regression
works, what kind of underlying model it imposes on reality, and what can be
done to check the assumptions made, as well as explaining how to interpret
the results. The treatment is introductory with worked examples, graphical
illustrations, and no mathematics beyond simple algebra. More advanced treat-
ments, usually involving calculus and matrix algebra, can be found in a wide
range of textbooks of which a few are listed in the bibliography. The statis-
tical level is also kept low and familiarity with simple regression is not

essential. The reader is however assumed to have taken an introductory course
covering elementary descriptive statistics, including the correlation co-
efficient, and significance testing. Finally, the notation adopted here is
widely used and fairly obvious but the reader is warned that other authors
may use alternative symbols for the same things and the same symbols for
different things.

(iii) Averages, variances, and correlations

Most explanations of regression analysis obtain and present the main
results in terms of sums of squares of values of each variable and sums of
products of different variables. Here we work entirely in terms of the three
basic descriptive statistics: means, standard deviations, and correlations.
There are several advantages. These statistics should be familiar to all geo-
graphers. In most applications they will have been calculated and inspected
for their own interest before any further analysis is contemplated. Their
numerical values are immediately interpretable, and also generally small which
reduces the danger of roundoff errors in machine computation or shifted deci-
mal places and the like in hand calculations. And their use avoids a pro-
fusion of summation signs hereafter.

Means, standard deviations, and correlations are all defined in terms
of averages, and the operation of averaging plays a vital role later on. A
bar over a symbol denotes an average, so that $\bar{X}$ is the mean
of a numerical variable X. More complicated expressions can be averaged using
three simple rules: (1) multiply out any brackets; (2) the average of a sum
is a sum of separate averages; (3) constants can be moved outside averages.
For example, if k is a constant the mean of k(X + Y) is

$k(\bar{X} + \bar{Y})$

beyond which it cannot be simplified.

For a variable to merit the name its individual values must deviate
from their mean. Deviations are represented here by lower case letters, for
example $x = X - \bar{X}$. The mean deviation is necessarily zero, but the
mean squared deviation

$\overline{x^2} = \overline{(X - \bar{X})^2}$

is not. Application of the averaging rules converts this to

$\overline{x^2} = \overline{X^2} - \bar{X}^2$

which is one version of the familiar short-cut formula used to calculate
variances and standard deviations. But for our purposes it is more important
to recognise the mean squared deviation $\overline{x^2}$ as $s_X^2$,
the variance or squared standard deviation of the variable concerned.

Regression analysis is about relationships between variables and the
contributions of different factors to overall variability. It is therefore
useful to be able to split into components the variance of quantities like
X + Y. Rules (1) to (3) show that

$s_{X+Y}^2 = \overline{(x + y)^2} = s_X^2 + 2\overline{xy} + s_Y^2$

where $\overline{xy}$ is the mean product of corresponding
X and Y deviations. This is called the covariance of, or between, X and Y.
It can be shown that its maximum possible value is the product of the standard
deviations of the two variables, when corresponding X and Y deviations are
all in exactly the same proportion. The familiar correlation coefficient,
r, is simply the ratio of actual to maximum possible covariance:

$r = \overline{xy} / (s_X s_Y)$

More familiar formulae for r can be obtained using the averaging rules, but
the important point for our purposes is that the covariance of two variables
can be rewritten as their correlation times the product of their standard
deviations. So the variance of X + Y above reduces to

$s_{X+Y}^2 = s_X^2 + 2 r s_X s_Y + s_Y^2$

i.e. a sum of separate contributions plus a joint one that only disappears
if X and Y are uncorrelated.

The correlation coefficient is of course a measure of association. If
high X tends to go with high Y and low with low then r is positive, approach-
ing its limit of +1 the stronger the tendency. If high X goes with low Y and
vice versa r is negative, while if there is no pattern one way or the other
r is close to zero (see Fig. 1). A correlation can be calculated between
variables of almost any kind, but a relationship must
possess form, apart from being positive or negative, so regression analysis
is normally restricted to interval or ratio scale data. Some exceptions are
mentioned in the final chapter, but otherwise we will be concerned with re-
lationships between variables with a more or less continuous range of possible
values.
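
These definitions are easily turned into code. The following Python sketch
(the data values are invented purely for illustration) computes the three
basic statistics exactly as defined above, using the n-divisor descriptive
forms, and checks the variance decomposition of X + Y derived in the text.

    # Descriptive statistics as defined in the text (n-divisor versions).
    # The data below are hypothetical, for illustration only.
    X = [2.0, 4.0, 6.0, 8.0]
    Y = [1.0, 3.0, 2.0, 6.0]

    def mean(v):
        return sum(v) / len(v)

    Xbar, Ybar = mean(X), mean(Y)
    x = [v - Xbar for v in X]                      # deviations from the mean
    y = [v - Ybar for v in Y]
    sX = mean([d * d for d in x]) ** 0.5           # standard deviations
    sY = mean([d * d for d in y]) ** 0.5
    cov = mean([a * b for a, b in zip(x, y)])      # mean product of deviations
    r = cov / (sX * sY)                            # correlation coefficient

    # Variance of X + Y should equal s_X^2 + 2 r s_X s_Y + s_Y^2.
    lhs = mean([(a + b) ** 2 for a, b in zip(x, y)])
    rhs = sX ** 2 + 2 * r * sX * sY + sY ** 2
    print(r, lhs, rhs)                             # lhs and rhs agree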

II SIMPLE REGRESSION
(i) The least squares trend
The observed relationship between two numerical variables X and Y can
be represented exactly by a scatter diagram or graph of Y values against X
values, with one point for each pair of values (each site or area in a
spatial study, person or household in a social study, time or time interval
in a historical study, and so on). As an illustration Figure 2 shows the
relationship between long term average precipitation (Y) and height above
sea level (X) at twenty rain gauges in a west-east transect across southern
Scotland (the data are in Table 1).

If any association can be discerned in such a scatter diagram it can be


summarised with greater or lesser accuracy by the equation of a trend curve
passing through the cloud of points. Any type of curve can be chosen but the
simplest possibility, perfectly adequate in this example, is the straight
line

$Y = a + bX$    (1)

Here a is the predicted Y value at X = 0 (the intercept) and b is the slope
of the line, the amount by which the predicted Y changes when X in-
creases by one. In this case it is clear that rainfall increases rather than
decreases with elevation so b is positive.

Simple regression is the process of choosing appropriate values of the


regression coefficients a and b, or in effect choosing the particular straight
line that best describes the trend of the data. This is generally done by the
method of least squares, in which the coefficients are chosen to minimise
the sum of squares (or equivalently the mean square) of the residuals or
differences between observed and predicted Y values for each observed X value.

Minimisation of the residual scatter involves elementary calculus and


only the results are given here (for a proof see any advanced text). For a
given slope, b, changes in the intercept, a, shift the trend line up or down.
The residual scatter is found to be least when

$a = \bar{Y} - b\bar{X}$    (2a)

so that the line passes through the mean point of the data and positive and
negative residuals about
the line cancel out. This fixes the general level of the line, but its tilt
depends on b. It turns out that the residual scatter is least for a slope of

$b = r s_Y / s_X$    (2b)

Fig. 1 The correlation coefficient, r, as a measure of association.


Top to bottom: no association, strong positive association,
weak negative association
Table 1: Average precipitation and elevation across southern Scotland

Site  elevation     rainfall     Site  elevation     rainfall
No.   (m above OD)  (mm/yr)      No.   (m above OD)  (mm/yr)
1 240 1720 11 140 1460
2 430 2320 12 540 1860
3 420 2050 13 280 1670
4 470 1870 14 240 1580
5 300 1690 15 200 1490
6 150 1250 16 210 1420
7 520 2130 17 160 900
8 460 2090 18 270 1250
9 300 1730 19 320 1170
10 410 2040 20 230 1170

Source: British Rainfall (HMSO), selected raingauges between national
grid lines 600 and 601 km N. Sites are in west-east order

Table 2: Residuals from rainfall-elevation trend

Site  residual  distance E      Site  residual  distance E
No.   (mm/yr)   (km from        No.   (mm/yr)   (km from
                W. coast)                       W. coast)
 1      254       37            11      232       86
 2      402       43            12     -319       97
 3      156       48            13      109      100
 4     -143       49            14      114      103
 5       81       52            15      119      104
 6       -2       59            16       25      114
 7       -2       73            17     -376      138
 8      101       75            18     -287      151
 9      121       76            19     -486      153
10      170       77            20     -272      154

Fig. 2 Scatter diagram of rainfall against elevation for southern
Scotland. The trend line shown minimises the residual variance

The amount of residual scatter, measured by its variance, is now

$s_e^2 = s_Y^2 (1 - r^2)$    (2c)

and this must be less than for any other trend line through the same data.
The stronger the correlation between the variables the less the scatter about
the best fit trend line, until with r = 1 or -1 there is no scatter at all
and every data point lies exactly on the trend line. The correlation co-
efficient, or more precisely its square, is therefore a measure of the good-
ness of fit of a least squares simple regression.

Equations (2a) and (2b) are general formulae, called estimators, for
deciding the intercept and slope of the best straight-line description of
a simple trend. There is no need to go through the mathematics of minimising
the scatter, one simply inserts the values of the appropriate means, standard
deviations, and correlations in the formulae.
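
This routine substitution is easily mechanised. The Python sketch below
applies the estimators (2a-c) to the Table 1 data; it should reproduce, to
within rounding, the coefficients quoted in the next section and the
residuals of Table 2.

    # Least squares simple regression via estimators (2a-c), applied to
    # the Table 1 data: elevation X (m above OD), rainfall Y (mm/yr).
    X = [240, 430, 420, 470, 300, 150, 520, 460, 300, 410,
         140, 540, 280, 240, 200, 210, 160, 270, 320, 230]
    Y = [1720, 2320, 2050, 1870, 1690, 1250, 2130, 2090, 1730, 2040,
         1460, 1860, 1670, 1580, 1490, 1420, 900, 1250, 1170, 1170]

    def mean(v):
        return sum(v) / len(v)

    Xbar, Ybar = mean(X), mean(Y)
    sX = mean([(v - Xbar) ** 2 for v in X]) ** 0.5
    sY = mean([(v - Ybar) ** 2 for v in Y]) ** 0.5
    r = mean([(xi - Xbar) * (yi - Ybar) for xi, yi in zip(X, Y)]) / (sX * sY)

    b = r * sY / sX                     # (2b): about 2.38 mm/yr per m
    a = Ybar - b * Xbar                 # (2a): about 895 mm/yr
    se = sY * (1 - r * r) ** 0.5        # (2c): about 230 mm/yr
    residuals = [yv - (a + b * xv) for xv, yv in zip(X, Y)]   # cf. Table 2
    print(round(a), round(b, 2), round(se), round(r, 3))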

For example, the relevant statistics of rainfall and elevation for
the Scottish data of Table 1 are easily found to be as follows.

variable        units    mean   st. dev.   correlation
Y (rainfall)    mm/yr    1640   371        )
                                           ) 0.784
X (elevation)   m         314   122        )

With these figures b = 0.784 x 371/122 = 2.38 and a = 1640 - 2.38 x 314 = 895.
The best-fit trend line is therefore

$Y = 895 + 2.38X$    (3)

which shows that rainfall in this part of Scotland increases by nearly
240 mm/yr per 100 m of extra height, from a sea level value of just under
900 mm/yr (it is seldom worth looking beyond the first two significant
figures of any statistical coefficient). By (2c) the residual variance about
the trend is 100(1 - 0.784^2)% or only 39% of the total variability of rain-
fall, corresponding to a residual standard deviation $s_e$ of about 230 mm/yr.
Least squares regression provides an objective description of the form of
what is evidently a strong orographic tendency in our data.

(ii) The linear model

Minimum residual variance is only one of several possible criteria for
choosing a best-fit trend line, and goodness of fit may not be the only con-
sideration anyway. It can be argued that many other lines through the mean
point of a scatter diagram have only slightly greater scatter than the best-
fit regression, and that a round-number slope that generalises to other data
is the most useful description (Ehrenberg, 1975, chs. 7, 14). A further diff-
iculty is that unless the correlation is perfect (r = 1 or -1), when statis-
tical methods are unnecessary since the points lie on a straight line, the
least squares regressions of Y on X and X on Y give different trend lines.
One minimises the variance of residuals in the Y direction, the other that
of residuals in the X direction. Which are we to take? If the aim is simply
to describe the relationship the ambiguity is embarrassing, and as Ehrenberg
notes it is impossible for both regression lines to generalise to different
data.

The least squares method can however be justified if we are prepared to
accept one of two alternative statistical models for our data. In this con-
text a 'model' is a set of assumptions about the underlying nature of the
relationship that is supposed to have generated our sample data, and the
models are statistical because they involve chance. The two models make dif-
ferent assumptions and are applicable in different circumstances but the dis-
tinction is often blurred or ignored completely in the geographical liter-
ature. The first and less useful one assumes that paired values of X and Y
are random variates which jointly follow a bivariate normal probability dis-
tribution. For this reason it is generally called the joint or bivariate
normal model, though Poole and O'Farrell (1971) refer to it as the random-X
model. A set of X,Y pairs from this model should show an ellipse-shaped
scatter in an X-Y plot, densest in the middle and elongated in proportion to
the correlation coefficient which in this model is a parameter of the prob-
ability distribution. The least squares regressions of Y on X and X on Y can
now be shown to give the best possible estimates of, respectively, the value
of Y given X and that of X given Y (see for example Sprent, 1969, ch. 2).
So the regression of Y on X is appropriate for prediction of Y from X, and
vice versa. But this still does not tell us which to use to describe the re-
lationship, and for this purpose it seems more sensible to find the equation
of the major axis of the scatter ellipse, estimated by the bisector of the
two regression lines (see Till, 1973).

The joint model may be applicable when X and Y are different measures
of essentially the same thing, but it is technically invalid when X or Y or
both depart appreciably from a normal distribution, and logically inappro-
priate when there is reason to think that one variable (conventionally Y)
depends on or is in some way a function of the other (X). The linear regress-
ion model provides an alternative justification for fitting a trend by least
squares in these circumstances. To the extent that we rationalise the world
in cause-effect terms this is a more versatile approach to regression. It is
also technically more flexible in that X need not be either random or normally
distributed. Instead we could for example measure rainfall (Y) at preselected
and evenly-spaced heights (X).

The linear model assumes that the explanatory variable or predictor
X affects the dependent variable Y in a systematic fashion that is distorted
by more or less random scatter. Three possible reasons for this scatter are
errors of measurement, idiosyncrasies of the individuals to which the data
refer, and neglect of other relevant factors - i.e. the relationship would
hold exactly if other things were equal, but they are not. If one or more of
these applies, the underlying relationship must be obscured but to an extent
that we can only guess. We may suspect that our data are imperfect, indivi-
duality exists, and relevant variables have been overlooked - indeed the geo-
grapher can seldom be confident on any of these counts - but the size and
direction of the disturbances so introduced are unknown.

Without further information the only way ahead is to lump the three
kinds of complications into a single unobservable variable $\epsilon$ that
stands for all sources of variability in Y other than X. In the linear model
$\epsilon$ is taken to be added on to an unknown straight-line relationship
between Y and X: that is, the ith observed value of Y is

$Y_i = \alpha + \beta X_i + \epsilon_i$    (4a)

The contrast between this view of an underlying
systematic relationship and the more fundamentally probabilistic joint model
is sketched in Fig. 3. To proceed further we must assume that the mean dis-
turbance is zero, i.e.

$\bar{\epsilon} = 0$    (4b)

and that the disturbances are uncorrelated with X values, i.e.

$r_{X\epsilon} = 0$    (4c)

(4b) implies that the disturbances do not have the general effect of raising
or lowering the Y values overall, whilst (4c) implies the absence of any
systematic association between positive or negative disturbances and high
or low X values. The situation can be represented dia-
grammatically by drawing arrows from causes
to effects: both X and the disturbance $\epsilon$ point to Y, but no arrow
links X and $\epsilon$.

Fig. 3 Models for simple regression: (left) joint (right) linear.
Solid lines are least squares trends

The assumptions embodied in this model are sufficient to distinguish
between the systematic and disturbance effects and thereby to identify the
unknown coefficients. Averaging both sides of (4a) and using assumption (4b)
gives

$\alpha = \bar{Y} - \beta\bar{X}$    (5a)

It can also be subtracted from (4a) to give the model in deviation form,

$y_i = \beta x_i + \epsilon_i$

Next, the covariance of X and Y can be looked at in two ways which must be
consistent. The observed covariance is $r s_X s_Y$; averaging the product of
x with the deviation form of y, and using assumption (4c), the model
covariance is $\beta s_X^2$. Equating the two gives

$\beta = r s_Y / s_X$    (5b)

The unknown coefficient $\beta$ is thus determined by the standard deviations
and correlation of the two observed variables, and $\alpha$ now follows from
this and the observed means by (5a). Finally we can equate observed and model
variances of Y:

$s_Y^2 = \beta^2 s_X^2 + s_\epsilon^2$

or using (5b) and rearranging,

$s_\epsilon^2 = s_Y^2 (1 - r^2)$    (5c)

The alert reader will have recognised equations (5a-c). Apart from the
use of Greek letters for unobservable quantities, which is conventional,
they are the same as the least squares formulae (2a-c). It follows also that
the disturbances of the model correspond to the residuals from the least
squares trend. This equivalence of the two
approaches is very important. It shows that the least squares method amounts
to the assumption of a linear model with added zero mean disturbances un-
correlated with the systematic effect. Conversely the least squares trend
identifies the true coefficients and disturbances of the underlying systematic
relationship if these assumptions are correct. No probabilistic considerations
are involved apart from the plausibility of the model.

The position is different if the observed data represent only a sample
from some wider population of interest. This is the case in the Scottish
example, where we have measurements for only twenty out of an infinite number
of possible sites but would like to use the results to generalise about the
regional climate or to predict rainfall at ungauged locations. The linear
regression model (4) is now one possible view of the underlying relationship
in the population as a whole. Let us assume it is appropriate. If the popu-
lation means, standard deviations, and correlation were the same as the sample
ones, then equations (5a-b) would give an accurate description of the popu-
lation as well as the sample. But we have no means of telling whether this is
so. The odds are overwhelmingly against, since the characteristics of a small
sample seldom match exactly those of its parent population. Estimates of the
population relationship, and predictions made from it,
will be doubly uncertain, with doubt about the correctness of the trend line
added to the scatter of points around it.

Despite the inevitable uncertainty, detailed assessment of the probabil-
ities involved shows that substitution of sample statistics in the assumed
population model, or to put it a different way extrapolation of the sample
least squares trend to the population, is a consistent method of estimation
provided the disturbances behave in a
simple random way to be discussed later. The routine mechanics of regression
are therefore the same whether the aim is sample description, population
inference, or prediction.
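
What consistency means in practice can be seen by simulation. The Python
sketch below (all numbers hypothetical) draws repeated samples from a linear
model with known coefficients, fits each by least squares, and shows the
estimates of the slope clustering ever more tightly around the true value
as the sample size grows.

    import random

    def slope(X, Y):
        # least squares slope via (2b), in sum-of-products form
        n = len(X)
        Xbar, Ybar = sum(X) / n, sum(Y) / n
        sxy = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y))
        sxx = sum((x - Xbar) ** 2 for x in X)
        return sxy / sxx

    random.seed(1)
    alpha, beta = 10.0, 2.0                  # 'true' population coefficients
    for n in (10, 100, 1000):
        bs = []
        for _ in range(200):                 # 200 independent samples of size n
            X = [random.uniform(0.0, 50.0) for _ in range(n)]
            Y = [alpha + beta * x + random.gauss(0.0, 5.0) for x in X]
            bs.append(slope(X, Y))
        print(n, round(sum(bs) / len(bs), 3), round(max(bs) - min(bs), 3))
    # The average estimate stays near beta = 2 and the spread of estimates
    # shrinks as n increases.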

III MULTIPLE REGRESSION

(i) The need

The simple regression model just discussed assumes that differences in


a dependent variable are accounted for partly by the linear effect of a single
explanatory variable and partly by a disturbance term that lumps together
individuality, measurement error, and the effects of relevant but unconsidered
variables. It is not uncommon for the disturbance term $\epsilon$ to be numerically
more important than the explanatory variable X, in the sense that the residual
variance $s_e^2$ is more than half the size of the variance of Y. Naturally we
would like to do better than this and reduce the residual scatter as far as
possible towards zero.

Nothing can be done about genuinely unique individual circumstances,


and it may be impracticable to obtain more accurate measurements, so the main
scope for improvement usually lies in separating out from the disturbance
term other relevant explanatory variables and incorporating them explicitly
in a multi-variable or multiple regression model. Most environmental systems
involve a large number of interrelated variables so it is rare to find a
satisfactory explanation for a spatial or historical pattern in terms of a
single influence. Generally several conflicting or reinforcing effects are at
work and the variables responsible are themselves interrelated. In such cases
simple regression may be a very inadequate guide to the pattern of linkages
and multiple regression is needed to clarify the situation.
The identification of potentially relevant variables is largely a matter
of imagination and commonsense, guided by any relevant geographical theory.
Several additional variables may be candidates for inclusion and in a later
section we will see how to deal with them simultaneously, but for simplicity
we start with just one. The first step is to obtain measurements of it for
those individuals to which the original X and Y values referred. This is not
always possible, so we may have to settle for some approximate indicator of
the desired variable - a substitute or surrogate - and hope it behaves in
essentially the same way.

Whether the new variable is indeed relevant can be tested by comparing
the residuals from the original simple regression with corresponding values
of the new variable. As an
illustration, residuals from the regression of rainfall on elevation in
southern Scotland are listed in Table 2 in the same west-east order as the
original data of Table 1. It is clear that the residuals tend to be positive
in the west, negative in the east. In other words elevation on its own tends
to underestimate rainfall in the west but overestimate it in the east. If the
residuals are plotted against distance east (Fig. 4) a pronounced and approx-
imately linear trend is apparent. The effect of location can also be detected
in the original scatter diagram of rainfall against elevation if points re-
presenting sites west and east of the overall mean location are distinguished
(Fig. 5). The two sets of points are clearly staggered, rather than super-
imposed as would be expected if location made no difference once elevation
had been taken into account. All this suggests an oceanic effect as well as
the orographic one already considered.

Fig. 4 West-east trend in residuals (e) from least squares regression
of rainfall on elevation

An experimental scientist could separate the two effects by measuring
rainfall at a standard series of elevations at each of several different dis-
tances across the country. But it is rare for predictors to be completely
uncorrelated when existing data are taken as they stand, and a sampling de-
sign is no help if some combinations of conditions are absent. In our Scottish
example the correlation between elevation and distance east is not zero but
-0.35, i.e. the high ground tends to be in the west, and this must be taken
into account when attempting to separate the two variables' effects on rain-
fall. One way is to regress the residuals from the west-east trend of rain-
fall on those from the west-east trend of elevation, and those from the
altitudinal trend of rainfall on those from the altitudinal trend of easter-
liness. This sounds, and is, complicated. Fortunately there is an easier
method.
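
Checks of this kind are easily automated. Assuming the residuals and
distances of Table 2 are available as Python lists, the sketch below computes
the correlation between them; it comes out strongly negative, echoing the
west-east trend of Fig. 4.

    # Correlate simple-regression residuals (Table 2) with distance east.
    e = [254, 402, 156, -143, 81, -2, -2, 101, 121, 170,
         232, -319, 109, 114, 119, 25, -376, -287, -486, -272]
    d = [37, 43, 48, 49, 52, 59, 73, 75, 76, 77,
         86, 97, 100, 103, 104, 114, 138, 151, 153, 154]

    def corr(u, v):
        n = len(u)
        ub, vb = sum(u) / n, sum(v) / n
        su = (sum((a - ub) ** 2 for a in u) / n) ** 0.5
        sv = (sum((a - vb) ** 2 for a in v) / n) ** 0.5
        return sum((a - ub) * (c - vb) for a, c in zip(u, v)) / (n * su * sv)

    print(corr(e, d))   # markedly negative: location matters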

Fig. 5 Rainfall-elevation scatter diagram of Fig. 2 with western and
eastern sites distinguished to show oceanic effect

(ii) Two explanatory variables

Multiple regression disentangles the effects of correlated explanatory
variables after the event, statistically rather than experimentally. This is
of course achieved at a price: we have to assume the type of relationship
present, i.e. a statistical model for the data, and hope it is not so unreal-
istic that the results are misleading or meaningless.

The simplest and most commonly used model is linear and an extension of
that for simple regression. The dependent variable is taken to be made up of
a systematic effect plus a more or less random disturbance that averages out
to zero and is uncorrelated with the systematic effect. The latter is assumed
to be some linear function or weighted average of the two explanatory vari-
ables X1 and X2 (the term 'independent variables' should be avoided since
they may well be correlated with each other and with other variables). Values
of the dependent variable Y are therefore predicted by

$\hat{Y} = a + b_1 X_1 + b_2 X_2$    (6)

and are assumed to have been generated by the linear model

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$    (7)

The assumed cause-effect structure is
pictured alongside. The dependent variable
is affected by both explanatory variables,
whose own interrelationship is
shown by the double-headed arrow. At this
stage it does not matter whether their
correlation reflects a causal link or if
so which way round. As in simple regression
the underlying systematic relationship can
be identified by assuming the disturbances
average out to zero and are uncorrelated
with each explanatory variable. The first
assumption means that averaging (7) gives

$\alpha = \bar{Y} - \beta_1\bar{X}_1 - \beta_2\bar{X}_2$    (8a)

and consequently (7) can be rewritten in deviation form as

$y = \beta_1 x_1 + \beta_2 x_2 + \epsilon$

The second assumption can now be used to simplify the results of equating
observed covariances with those implied by the model:

$r_{Y1} s_Y s_1 = \beta_1 s_1^2 + \beta_2 r_{12} s_1 s_2$

and

$r_{Y2} s_Y s_2 = \beta_1 r_{12} s_1 s_2 + \beta_2 s_2^2$

These two simultaneous equations can be solved to express the unknown
coefficients in terms of the correlations
and standard deviations of the variables:

$\beta_1 = \frac{r_{Y1} - r_{Y2} r_{12}}{1 - r_{12}^2} \cdot \frac{s_Y}{s_1}$    (8b)

$\beta_2 = \frac{r_{Y2} - r_{Y1} r_{12}}{1 - r_{12}^2} \cdot \frac{s_Y}{s_2}$    (8c)

The systematic part of the regression model can therefore be identified from
simple descriptive statistics of the observed variables.

Several features of these formulae should be
noted. First, they are symmetric: one can be obtained from the other by
interchanging the subscripts 1 and 2. Second, if the explanatory variables
are uncorrelated (r12 = 0) the formulae reduce
to separate instances of formula (5b) for a simple regression coefficient.
In that special case the two effects do not overlap
and are identified correctly by simple regression. Third and most important,
the coefficients given by (8a-c) are precisely those that minimise the
residual variance about the fitted trend (proved by
calculus in most advanced texts). Thus if the linear model (7) is true of
the observed data the least squares trend is identical to the underlying
systematic relationship. If however the model is only true of a population
from which the data are a sample then the observed means, standard deviations,
and correlations are unlikely to be the same as those for the population as
a whole. As in simple regression the sample results remain consistent
estimates of the population relationship provided the disturbances behave
in the simple random manner discussed later.

As an example of the routine application of these estimation formulae,
consider again the Scottish rainfall data. We saw earlier that rainfall in-
creases with elevation but that distance from the west coast also seems to
be relevant. The two effects can be separated by multiple regression. All the
necessary information is included in the following table of descriptive
statistics calculated from the data of Tables 1 and 2.

                                        correlation with
variable           mean   st. dev.     Y      X1     X2
Y (rainfall)       1640   371          1
X1 (elevation)      314   122          .784   1
X2 (distance E)    89.4   36.7        -.730  -.353   1

Substituting these values in formulae (8b), (8c), and finally (8a) gives the
best-fit regression equation

$Y = 1540 + 1.83 X_1 - 5.2 X_2$    (9)

This can be interpreted in three ways. It is an objective description of the
observed relationship between the three variables. It is moreover an accurate
description of the underlying systematic relationship if the linear model (7)
and associated assumptions are true for the 20 sites. Alternatively it is
the best available estimate of the underlying relationship if the linear
model is true of the region as a whole and the assumptions to be described
later are true of the disturbances.

The estimated values of the coefficients show that average precipitation
tends to increase with altitude by about 180 mm/yr per 100m at a given dis-
tance from the coast, and to decrease by over 5 mm/yr for every kilometre
eastwards at any given height, from a base figure of 1540 mm/yr at sea level
on the Ayrshire coast. The suggested rate of increase of rainfall with ele-
vation is appreciably lower than in the simple regression (3) considered
earlier, so taking location into account makes a difference to the apparent
orographic effect as well as adding an oceanic term to the equation. We will
come back to this point later on in discussing the various ways three var-
iables can interrelate.

Just as the fitted simple regression could be represented by a sloping
trend line in a two-dimensional scatter diagram, so the fitted multiple re-
gression (8) can be seen as a tilted trend plane in a three-dimensional
diagram (Fig. 6). The plane slopes upwards with height and towards the west, so
the maximum contrast in rainfall is between high ground in the west and low
ground in the east, not simply west and east or high and low ground. These
two situations would be represented instead by planes with no slope in one
of the two directions, while if rainfall depended
on neither height nor location the regression plane would be horizontal with
both b's equal to zero.

Multiple regression in this case gives a more accurate description than
simple regression of the regional distribution of precipitation, but there
is of course still some residual scatter. In Fig. 6 the individual data points
would lie above and below the tilted plane: the local rainfall-
height trend is thus shifted up or down according to distance east. Clearly
this ought to improve the overall goodness of fit, and we consider this next.
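
The routine substitution can again be left to a machine. This Python sketch
plugs the descriptive statistics of the table into formulae (8a-c) and
reproduces equation (9) to within rounding.

    # Two-predictor least squares coefficients from formulae (8a-c).
    mY, sY = 1640.0, 371.0      # rainfall mean and st. dev. (mm/yr)
    m1, s1 = 314.0, 122.0       # elevation (m)
    m2, s2 = 89.4, 36.7         # distance east (km)
    rY1, rY2, r12 = 0.784, -0.730, -0.353

    denom = 1 - r12 ** 2
    b1 = (rY1 - rY2 * r12) / denom * sY / s1   # about +1.83 mm/yr per m
    b2 = (rY2 - rY1 * r12) / denom * sY / s2   # about -5.2 mm/yr per km
    a = mY - b1 * m1 - b2 * m2                 # close to the 1540 of (9)
    print(round(a), round(b1, 2), round(b2, 1))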

Fig. 6 Block diagram of multiple regression plane for Scottish rainfall (Y)
as function of elevation (X1) and distance east (X2). Slope in each
direction is proportional to partial regression coefficient

(iii) Multiple and partial correlation

The fundamental assumption of the linear regression model is that an
observed imperfect trend would be a perfect systematic relationship but for
more or less random disturbances. We have seen that the two components can
be separated, under the model's assumptions, using only the descriptive
statistics of the observed variables. It follows that the total scatter of
the dependent variable Y is made
up of two parts, that due to the systematic effect of the measured explanatory
variables and that due to unsystematic disturbances:

$s_Y^2 = s_{\hat{Y}}^2 + s_e^2$

or

total variance = explained variance + unexplained variance.

'Explained' means of course 'numerically accounted for' and need not imply
any understanding of why the relationship exists.

The relative sizes of the two variance components are obviously of great
interest to anyone carrying out a regression analysis, whether the aim is
prediction, hypothesis testing, or generalisation. The proportion of observed
Y variance that is explained, $R^2 = s_{\hat{Y}}^2 / s_Y^2$, is a measure
of goodness of fit, standardised to a 0-1 range. Its positive square root R,
the multiple correlation coefficient, is the same as the simple correlation
between the observed Y's and the Y's predicted by substituting observed X
values in the fitted regression equation. In simple regression this is the
ordinary correlation between Y and X, so that $R^2$ reduces to $r^2$.

It is not necessary to calculate individual residuals and find their
variance in order to measure goodness of fit. The explained variance can be
obtained directly as the variance
of $\hat{Y}$ by averaging over the observed data:

$s_{\hat{Y}}^2 = \beta_1 r_{Y1} s_1 s_Y + \beta_2 r_{Y2} s_2 s_Y$

which with (8b) and (8c) leads to

$R^2 = \frac{r_{Y1}^2 + r_{Y2}^2 - 2 r_{Y1} r_{Y2} r_{12}}{1 - r_{12}^2}$

For the Scottish data this gives $R^2$ = 0.85: elevation and distance east
together account for
85% of the observed variability in rainfall, leaving only 15% unexplained or
residual variance attributable to unmeasured complications.
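
In code the goodness-of-fit calculation is a one-liner using the correlations
already tabulated:

    # Multiple R^2 for two predictors, from the formula above.
    rY1, rY2, r12 = 0.784, -0.730, -0.353
    R2 = (rY1**2 + rY2**2 - 2 * rY1 * rY2 * r12) / (1 - r12**2)
    print(round(R2, 2))   # 0.85, i.e. 85% of rainfall variance explained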

Figures like this are difficult to assess in isolation. They provide
an objective measure of goodness of fit, but one's satisfaction with the re-
sult depends on temperament, past experience, and prior expectations. Since
$R^2$ is defined as a proportion of the total variability of Y the nature of
this base figure should also be kept in mind. A high coefficient of deter-
mination for a large set of data spanning a wide range of conditions - rain-
fall throughout Britain, perhaps - is more impressive than the same value
of $R^2$ in a more restricted analysis.

It is also possible to make internal comparisons to see how far the
inclusion of an extra explanatory variable improves the fit. In the Scottish
example the addition of distance east
has lifted the level of explanation from 61% to 85%. This too must be assessed
in comparative terms. Goodness of fit clearly cannot drop when an extra pre-
dictor is included: if the new variable is quite irrelevant its coefficient
and the gain in explanation will be zero, so $R^2$
will stay the same. At the other extreme the additional X may account com-
pletely for the previously unexplained variability, lifting $R^2$
right up to 1 or 100%. If the actual improvement is expressed as a fraction
of this maximum possible gain we have what might be called a coefficient of
extra determination,

$\frac{R^2 - r_{Y1}^2}{1 - r_{Y1}^2} = \frac{0.85 - 0.61}{1 - 0.61} = 0.61$

i.e. taking location into account gives a 61% improvement in explanation
compared to the simple regression of rainfall on elevation.

The square root of this quantity is called the partial correlation
between the dependent variable and the new predictor with the old one (ele-
vation) controlled or held constant. The partial correlation is always given
the same sign as the corresponding partial regression coefficient, in this
case negative:

$r_{Y2.1} = -0.78$

(the dot notation explained previously is used for partial correlations as
well as regressions).

Multiple regression in this case provides a statistical substitute for
the impractical experiment of flattening out southern Scotland to see more
clearly the eastwards trend in rainfall at constant height. The variables
can also be taken the other way round, starting with the simple correlation
of -0.73 between rainfall and distance east. A similar calculation shows

$r_{Y1.2} = +0.82$

and

$r_{Y1.2}^2 = 0.68$

i.e. the orographic effect accounts for 68% of the rain-
fall variability not already accounted for by the simple west-east trend.
In this way partial correlations supplement the partial b's as indications
of the direct importance of individual explanatory variables.

(iv) More than two predictors

There is no logical limit to the number of variables that may be linked
in a cause-effect web. There is however a statistical limit when regression
analysis is used to establish the causal structure: if as many predictors are
considered as there are individuals, we have one for each disturbance
$\epsilon_1, \epsilon_2,$
... and might as well admit individuals are unique. In practice few geo-
graphers (or statisticians) would confidently interpret a regression on more
than a few explanatory variables, so this indeterminate situation is unlikely
to arise.

Regression on three predictors, then four, and so on could be explained
by successively more complicated arguments of the same type as before, but
it is neater and quicker to discuss the general case of p predictors. The
prediction equation is

$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p$    (11)

which can be fitted by least squares or by considering the linear model

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_p X_{pi} + \epsilon_i$    (12)

Values of the dependent variable are
thus weighted averages of the corres-
ponding values of the explanatory
variables, plus a disturbance. The
arrow diagram alongside shows the
situation for p = 3 intercorrelated
predictors.
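
The partial correlations and the coefficient of extra determination quoted
above for the Scottish example can be verified with a few lines of Python,
using only the three simple correlations:

    # Partial correlations via the 'extra determination' route.
    rY1, rY2, r12 = 0.784, -0.730, -0.353          # Scottish data
    R2 = (rY1**2 + rY2**2 - 2 * rY1 * rY2 * r12) / (1 - r12**2)

    extra_loc = (R2 - rY1**2) / (1 - rY1**2)       # gain from adding distance E
    extra_elev = (R2 - rY2**2) / (1 - rY2**2)      # gain from adding elevation
    print(round(extra_loc, 2))                     # about 0.61
    print(round(extra_loc ** 0.5, 2))              # 0.78; b2 < 0, so r_Y2.1 = -0.78
    print(round(extra_elev ** 0.5, 2), round(extra_elev, 2))   # 0.82 and 0.68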

We can now equate observed covariances with those implied by the model:
for each predictor i = 1, 2, ..., p

$r_{Yi} s_Y s_i = \beta_i s_i^2 + \sum_{j \neq i} \beta_j r_{ij} s_i s_j$

These p simultaneous equations can be solved for the unknown coefficients,
though not by any simple general formula. The more
variables the longer the calculations, so a computer or desk calculator is
normally used if more than two predictors are involved. Even so roundoff
errors can lead to inaccurate results unless reliable equation-solving methods
are used (see Mather and Openshaw, 1974).

The proportion of Y variance accounted for by the regression equation
can also be found by an extension of the argument for two predictors.
From equation (11)

$s_{\hat{Y}}^2 = \sum_i \beta_i r_{Yi} s_i s_Y$, so that $R^2 = \sum_i \beta_i r_{Yi} s_i / s_Y$

The one- and two-predictor formulae given earlier are special cases of these
general results. In all cases only the means, standard deviations, and inter-
correlations of the observed variables are required.

Once again all this applies in the first place only to sample descrip-
tion. If the linear model is assumed to apply to some population from which
our data are a sample, the descriptive statistics of the latter are likely
to differ from those for the population and so therefore must the regression
coefficients. The next section examines more closely the statistical
assumptions involved.
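
For general p the safest course is to hand the simultaneous equations to a
library routine, in line with the warning above about roundoff. The sketch
below uses numpy; the correlation matrix and vector are those of the Scottish
two-predictor case, so the answer can be checked against equation (9).

    import numpy as np

    # Normal equations in standardised (correlation) form: Rxx * beta = rxy.
    Rxx = np.array([[1.0, -0.353],
                    [-0.353, 1.0]])        # intercorrelations of the predictors
    rxy = np.array([0.784, -0.730])        # correlations of each predictor with Y
    beta_std = np.linalg.solve(Rxx, rxy)   # standardised coefficients

    s = np.array([122.0, 36.7])            # predictor standard deviations
    sY = 371.0                             # st. dev. of rainfall
    print(beta_std * sY / s)               # about [1.83, -5.2], as in (9)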

(v) Statistical assumptions, residual checking, and transformation of
variables

Fig. 7 Scatter diagrams and residual plots to illustrate ideal
regression (linear with constant scatter) and departures
from it

To ensure that regressions can be fitted to data we have already had
to assume a certain type of relationship in which the dependent variable is
made up of a weighted average of explanatory variables plus a disturbance.
The weights or regression coefficients are constants for any one set of data,
so according to the model a unit increase in one predictor changes Y by a
fixed amount whatever the actual value of the predictor and irrespective of
the values of the other predictors. These two properties are known as
linearity and additivity (since if several X's are changed their effects on
Y are simply added together).

This is a fair approximation to many situations but, as Gould (1970)
points out, geographers often fail to consider the alternatives. A nonlinear
effect in simple regression is revealed by a curved rather than straight
trend in a scatter diagram, or equivalently in a plot of residuals against
X or against the Y values pre-
dicted by the fitted regression, as in Fig. 8 which shows no evidence of
nonlinearity in the Scottish rainfall regression. If an effect does turn out
to be nonlinear the usual remedy is to linearise it as far as possible by
using as X the square, square root, logarithm, or other appropriate trans-
formation of the measured variable. The model should also be linear in the
coefficients, and many curvilinear relationships can be fitted in this
way even though the corresponding X-Y plots can take on a variety of concave
and convex shapes (Fig. 9).

Nonadditivity occurs when the effect on Y of one variable depends on
the value of some other variable in an interactive, usually multiplicative,
way. If, for example, a 100m rise in elevation increased rainfall by a certain
percentage of its local value rather than a fixed amount then our additive
regression would be inappropriate, particularly at sites where both predictors
have extreme values. In the physical sciences multiplicative relationships
are common, and theory may suggest a suitable compound variable rather than a
simple predictor (see Haynes, 1973 for a rare example of this kind of fore-
thought in human geography). In less clear cut situations the most flexible
approach is to take logarithms of all variables.
This converts a multiplicative relationship such as

$Y = k X_1^{\beta_1} X_2^{\beta_2} \epsilon$

to the linear additive form

$\log Y = \log k + \beta_1 \log X_1 + \beta_2 \log X_2 + \log \epsilon$

which can be fitted by least squares in the
ordinary way. Note that the disturbance term in this is also interactive:
its effect is to alter by a certain proportion or percentage, rather than
absolute amount, the Y value determined by the systematic effects of the X's.
We shall return to this topic in a moment.

Apart from linearity and additivity we have assumed so far that the dis-
turbances average out to zero and are uncorrelated with the X's. These two
assumptions are not immediately testable since the least squares method
ensures they are both satisfied, but the second one is related to the question
of how many predictors to include in a regression. If we have omitted some
relevant explanatory variable, say Z, that is correlated with one or more of
the included X's, the true disturbance contains the 'lurking' variable Z and
is not uncorrelated with each predictor. The assumption that it is leads to
misleading estimates of the regression coefficients of those X's correlated
with Z (Box, 1966; and see 'specification error' in econometrics texts).
This problem does not arise if the omitted variable Z is uncorrelated with
the included X's, or if it has no direct effect on Y, but the only way to be
sure is to include potential lurking variables. They can always be abandoned
if found to be irrelevant.

Another problem to do with correlations between variables is that if
two or more predictors are very highly correlated
the variables concerned are said to be multicollinear and their effects can-
not be separated. The least squares method breaks down in these circumstances
and a regression cannot be fitted until the offending variable is omitted.
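
The logarithmic remedy described above is easily tried in practice. The
Python sketch below generates data from an invented multiplicative
relationship, fits a straight line to the logged variables, and recovers the
exponent; the constants 3 and 0.5 are hypothetical choices for the
illustration.

    import math
    import random

    # Hypothetical multiplicative relationship Y = 3 * X^0.5 * disturbance.
    random.seed(2)
    X = [random.uniform(1.0, 100.0) for _ in range(50)]
    Y = [3.0 * x ** 0.5 * math.exp(random.gauss(0.0, 0.1)) for x in X]

    # Ordinary least squares on the logged variables.
    lx = [math.log(v) for v in X]
    ly = [math.log(v) for v in Y]
    n = len(lx)
    xb, yb = sum(lx) / n, sum(ly) / n
    b = (sum((a - xb) * (c - yb) for a, c in zip(lx, ly))
         / sum((a - xb) ** 2 for a in lx))
    a = yb - b * xb
    print(round(math.exp(a), 2), round(b, 2))      # close to 3 and 0.5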

Some writers, including Poole and O'Farrell (1971) in an otherwise helpful
discussion, give the impression that any non-zero correlation between pre-
dictors is unacceptable. This is not so, and indeed multiple regression is
unnecessary when predictors are mutually uncorrelated. Strong intercorrelations
do however lead to greater uncertainty in regression estimates from samples,
as will be seen later.

The general inferential problem in regression has already been touched
upon several times. If sample data are used to estimate the coefficients of
a population model, then whatever the precise methods used the results are
bound to be uncertain to the extent that another sample would give different
estimates. Does any particular method of estimation minimise this uncertainty,
and if so which? Two kinds of error are involved. An estimation method may
be biased, i.e. systematically over- or underestimate the population co-
efficients; and it may have greater variance than another method, i.e. give
a wider scatter of estimates about the true value. An analogy may help make
this clear. The numbers at the top of a darts board are 5, 20, 1, 18. If 20
is the target, a player who throws three 1's shows bias; 5, 20, 1 is erratic;
and 20, 1, 18 reveals both bias and variance. The best grouping is of course
three 20's, with no bias and minimum variance.

If a linear additive model is an accurate description of a population
relationship and the disturbance term $\epsilon$ behaves in a simple random fashion,
least squares estimates of the regression coefficients using sample data are
best linear unbiased estimates (BLUE for short). Proofs can be found in the
advanced textbooks listed in the bibliography. The necessary assumptions
about $\epsilon$ are that individual disturbances behave as uncorrelated random var-
iables from probability distributions which all have a mean of zero and the
same variance. In terms of expectations (averages over probability distrib-
utions),

$E(\epsilon_i) = 0$ and $var(\epsilon_i) = \sigma^2$ for every individual i,
with $cov(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$

The first condition is acceptable if the relationship really is linear,
or has been made so by variable transformations, and if no systematic 'lurking
variable' has been left out. It is untestable unless there are several obser-
vations of Y at each combination of X values. The BLUE result holds
with any X, whether the latter has random or nonrandom (fixed) values, so
long as these are measured without error. Inaccuracy in Y is permissible
(indeed it was the original motive for developing regression methods), but
errors in X's reduce their correlations with Y and necessarily lead to bias
in the form of underestimation of the effect of each X.

The second condition, that of homoscedasticity, says that there is a
constant degree of scatter about the population relationship rather than
local regions of high or low scatter, in which case data points from high-
scatter regions would exert undue influence on the least squares estimates.
Nonconstant scatter or heteroscedasticity is commonly associated with disturb-
ances that are proportional rather than additive. This is a special case of
nonadditivity, as already discussed. It can be detected by inspection of the
scatter diagram of a simple regression, or by plotting the residuals from a
multiple regression against the predicted Y values. A systematic change in the
amount of scatter, as in the bottom graphs of Fig. 7, suggests heteroscedast-
icity. This time the multiple regression for Scottish rainfall does not pass
the test so clearly, for there is some tendency towards greater scatter in
the west where observed and predicted rainfalls are higher (Fig. 8). If this
were more pronounced the least squares estimates would be less reliable.
Proportional disturbances about a simple linear model can be accommodated by
regressing Y/X on 1/X. Taking logarithms of all variables is another altern-
ative, since as previously noted it converts multiplicative proportional dis-
turbances to additive homoscedastic ones. More complicated types of hetero-
scedasticity can be dealt with only by weighting each observation: see
'generalised least squares' in advanced texts.

The third condition, that disturbances are mutually uncorrelated and
therefore convey no information about each other, has been singled out as
especially dubious by Gould (1970) and other geographers on the grounds that
almost all geographical phenomena show positive spatial autocorrelation, with
nearby places more alike. However, it is not necessary to assume that the
values of any observed variable are mutually uncorrelated, only that the
measurement error or extraneous complications affecting one Y value are un-
related to those affecting any other individual. Pronounced spatial patterns
or trends in the X's or Y need not violate this assumption; what matters is
the pattern of the residuals, and several tests are available for checking
this formally (see Cliff and Ord, 1972). Much the simplest is the runs test
(described in most elementary statistics texts) in which the number of runs
of successive residuals with the same sign in some appropriate plot is com-
pared with the expected number for a random sequence. This is just over n/2,
so the 20 residuals from the combined eastwards and altitudinal trend in
Scottish rainfall pass the test comfortably with 11 runs in Fig. 8. Fewer and
longer runs might have been found had raingauges within 2 km of others not
been eliminated from the original sample, for local similarities in omitted
variables such as aspect could lead to similar departures from the regional
trend. Autocorrelation is therefore commoner in closely-spaced data. The
same applies in time series, where temporary disturbances may carry over from
one observation to the next if the interval is short. If substantial auto-
correlation is present regression estimates may remain unbiased but no longer
have minimum variance and are thus less reliable than usual. An intuitive
explanation for this is that some of the data points are more or less dup-
licating each other so that the effective sample size is reduced. Unwin and
Hepple (1974) discuss the problem further.

Visual inspection of residual plots generally provides the simplest and
in many ways the best check of each of the assumptions that justify the use
of linear least squares estimation. Residuals are, or can be, printed by most
computer programs for regression analysis, and in some cases the plots too
can be produced by machine, so residual checking is no great chore. It can
however be less straightforward than suggested above if predictors have very
skewed distributions. Scatter diagrams and residual plots then contain one
or a few isolated points well clear of the rest, making it difficult to dis-
tinguish between trend and scatter. Log transformation helps here too by re-
ducing positive skewness.
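
The runs test described above takes only a few lines of code. This sketch
counts runs of same-sign residuals in a supplied ordering and compares the
count with the expected value for a random sequence, 2*n1*n2/n + 1, which is
just over n/2 when the numbers of positive and negative residuals n1 and n2
are roughly equal. The example residuals are those of Table 2, in west-east
order.

    def runs_test(residuals):
        """Return (observed runs, expected runs for a random ordering)."""
        signs = [e > 0 for e in residuals if e != 0]
        runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
        n1 = sum(signs)                  # positive residuals
        n2 = len(signs) - n1             # negative residuals
        expected = 2 * n1 * n2 / (n1 + n2) + 1
        return runs, expected

    e = [254, 402, 156, -143, 81, -2, -2, 101, 121, 170,
         232, -319, 109, 114, 119, 25, -376, -287, -486, -272]
    print(runs_test(e))   # fewer runs than expected here, reflecting
                          # the west-east trend of Fig. 4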

We have considered the statistical assumptions behind regression esti-
mation at some length in order to see some of the problems that can arise in
practice. Opinions differ as to how seriously violations of these assumptions
should be taken. It is unfortunate that many geographical statistics books
still in use stress only some of the assumptions, and then not necessarily
the most important ones. Emphasis is often laid on the supposed need for nor-
mal distributions, but the distribution of each measured variable is irrele-
vant in linear regression and that of the residuals is relevant only to the
significance tests discussed in the next section. Yet the important question
of whether a multiplicative model with proportional disturbances is more
appropriate than an additive one is seldom mentioned. It can also be argued
that most geographical applications of regression analysis are exploratory,
so that violations of the strict statistical assumptions are far less serious
than in, say, the Treasury's predictions of economic variables such as un-
employment and spending. As so often, the care needed with the tools depends
on what is to be done with the finished product.

(vi) Confidence limits and significance tests

Regression coefficients obtained by the linear least squares method may


be the best possible estimates of underlying relationships, but how uncertain
are they? In particular, is the calculated b for a set of data sufficiently
close to some hoped-for population $\beta$ for the discrepancy to be only a matter
of unlucky sampling? And how accurate are predictions made using the re-
gression equation likely to be?

These are questions about confidence limits and significance tests, and
they are relevant whenever we want to infer something about a population
relationship from sample data. Even when we have data for every administrative
area in a region it is sometimes argued that the boundaries could have been
drawn in an infinite number of alternative ways, so that inferential questions
are still relevant (see Gudgin and Thornes, 1974).

Fig. 10 Assumed behaviour of Y in linear regression model. Shaded areas


indicate probability distributions for observed Y, offset according
to value of X but with same shape because of assumptions about $\epsilon$

The residual standard deviation might also seem to be a measure of the possible error in predicting


Y from X. But this is only so if the true regression line is known, for
otherwise predictions of Y are additionally uncertain to the extent that the
estimated regression coefficients may differ from the true ones. The various
sources of prediction error are shown in Fig. 11 for simple regression. The
shaded band in the lower left diagram indicates the range of Y values that
can be expected simply because of variations in c. But as the upper diagrams
show the true slope may be steeper or gentler than the estimated value, and
the sample and population centres of gravity through which the trend passes
may differ. The overall uncertainty in predicting Y is the sum of all these
components, and as shown in the lower right diagram it increases away from
the mean of the data. Extrapolation beyond the range of the data is thus
particularly uncertain, quite apart from the possibility that the relation-
ship is not linear outside the observed range.

Without further information these standard errors can only be interpreted
in relative terms. They all depend on the square root of n or something close
to n, so a fourfold increase in sample size halves the uncertainty of re-
gression estimates if other things are equal. But unless the probability dis-
tribution of the disturbances is specified we cannot say what proportion
of samples are likely to give estimates within, say, two standard errors of
the true value. The most convenient, and therefore commonest, assumption is
that the disturbances follow the familiar bell-shaped normal
distribution. This makes the least squares estimates of regression coeffic-
ients not just BLUE but also maximum likelihood estimates: they are the values
that maximise the overall probability of the sample data given the population
model. The assumption of normal disturbances implies nothing about the fre-
quency distribution of Y or of any X, so histograms of the measured variables
are completely irrelevant. The probability distribution of Y conditional on
the value of X is normal, but this is a different matter (Fig. 10). The only
relevant test is to compare the histogram of calculated residuals against the
normal curve, for example by the one-sample
Kolmogorov-Smirnov test described in most elementary statistics texts. Fig. 12
illustrates this for the Scottish rainfall regression.

Fig. 11 Uncertainty in simple regression. Y may depart from fitted trend


because of uncertainty in slope (b) and intercept (a) as well as
inherent scatter about trend (e)

Fig. 12 Histogram of residuals from rainfall multiple regression compared


to unit normal curve. Difference is not significant at 95% level
by one-sample Kolmogorov-Smirnov test

Confidence limits can be
attached to estimates by multiplying the appropriate standard error by the
tabulated t value for the desired confidence level, say 95% for which t is
close to 2 except for very small samples. The true value of a quantity est-
imated from a sample regression is therefore about 95% certain to lie within
2 standard errors either side of the estimated value. Heteroscedasticity or
autocorrelation in the disturbances increases the uncertainty and can make
calculated confidence limits dangerously misleading (see Unwin and Hepple,
1974), but the assumption of normality appears to be less critical (Gudgin
and Thornes, 1974).
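
Confidence limits of this kind take one line of code once the standard error
is known. The sketch below uses the Scottish slope and standard error quoted
in the next section, with t = 2.10 for 18 degrees of freedom from tables:

    # 95% confidence limits for a regression slope: b +/- t * se.
    b, se = 2.38, 0.44        # rainfall-elevation slope and its standard error
    t95 = 2.10                # Student's t for 18 d.f. at the 95% level
    print(b - t95 * se, b + t95 * se)   # roughly 1.46 to 3.30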

If the 95% (or 99%, or other) confidence limits for an estimated re-
gression coefficient b lie on opposite sides of some prespecified value then
we cannot safely say the true coefficient $\beta$ differs from this value, since
at least 5% (or 1%, etc.) of all possible samples from a population with the
hypothetical $\beta$ would yield at least as big a discrepancy as has been observed.
This is the idea behind significance tests of regression coefficients (see
Fig. 13). Any value of $\beta$ may be specified beforehand as a null hypothesis.
The obvious possibilities are (1) a value expected on theoretical grounds;
(2) the value found in a previous study; (3) zero.

An example of the first type is Ferguson's (1975) investigation of the
relationship between meander wavelength and streamflow for 19 British rivers.
It is generally thought that wavelength is proportional to the square root
of discharge. The best-fit regression between the logarithms of the two vari-
ables in this study had a slope of b = 0.58 with standard error 0.08. The
95% confidence interval is therefore 0.41 to 0.75, which includes the theoret-
ical value of 0.5, so the departure from a square-root relationship is not
statistically significant at the 5% level.

Testing against a previous empirical result can be illustrated for our
best-fit relationship Y = 895 + 2.38X (equation 3 above) between rainfall (Y)
and height (X) at 20 sites in Scotland. This compares well with
Y = 714 + 2.42X found by Bleasdale and Chan (1972) for over 6500 rain gauges
throughout the U.K. The standard error of b in the Scottish study turns out
to be 0.44, giving a 95% confidence interval of 1.46 to 3.30 which easily in-
cludes Bleasdale and Chan's value of 2.42, even allowing for the latter's
own standard error (which is very small because of the huge sample size). The
orographic tendency in southern Scotland is therefore not significantly differ-
ent from that for the whole U.K.

The third type of test, against the null hypothesis that $\beta$ = 0, is even
easier. It boils down to asking how many standard errors away from zero the
observed slope, b, lies. Most computer programs for regression print the
ratio of b to its standard error, generally labelled as 'T', and this can be
compared with tables of Student's t. Our multiple regression equation (9) for
Scottish rainfall has coefficient standard errors of about 0.30
and 1.02, giving t values of 6.0 and 5.1 (the sign is immaterial). These are
overwhelmingly significant even at the 0.1% level. In other words the chances
are far less than 1 in 1000 that we have simply an unrepresentative sample
from a regional rainfall distribution that shows no systematic dependence on
elevation and location. Tests against a null hypothesis of zero effect are
only useful when there are strong grounds for expecting a particular effect
to be absent or negligible, which of course takes us back to type (1) or (2)
null hypotheses. And if the sample size is big enough even a tiny and geo-
graphically unimportant difference between sample b and expected $\beta$ will be
statistically significant.

An alternative to t tests of individual coefficients is the F ratio, which
can be used to compare any two linear regressions fitted to the same data and
dependent variable. The numerator is proportional to the extra Y variance
explained per new predictor, the denominator to that still unexplained by p
predictors in all. The greater the extra explained
variance the larger the ratio becomes and the less likely it is that the im-
proved fit is a sampling fluke. If the ratio exceeds the tabulated value of
F at the chosen significance level the improvement is judged significant.
When a single predictor is added the ratio can be written

$F = \frac{(n - p - 1) \, r^2}{1 - r^2}$

where r is the partial correlation between Y and the new X with previous X's
held constant. The stronger the partial correlation the more significant the
effect of the new predictor. This test gives identical results to the t test
of the new predictor's coefficient.

There is nothing to stop us applying this F test to the improvement when
several predictors are added to the regression, or to the improvement over no
predictors at all, i.e. to the significance of the regression as a whole.
With a reasonable sample size $R^2$ has to be quite
small not to be significant (less than 0.2 for large samples at the 5% level).
If so the t or F tests of individual predictors are usually non-significant
too. But this does not always apply in reverse: the regression as a whole may
be significant, but not any of the individual effects. This paradoxical result
occurs when a pair or set of predictors are so highly intercorrelated that

34 35
controlling any one of them reduces their combined explanatory power to an
insignificant level. This is yet another manifestation of the
multicollinearity problem.

Partial t or F tests are often used as a means of screening variables and
selecting the 'best' regression with a given number of predictors out of a
wide choice (Draper and Smith, 1966, ch. 6). One way to do this is to work
forwards: the best single predictor is chosen first, and the pair formed by
adding whichever second X has the strongest partial b or r is now taken as a
base on top of which all possible third predictors are tried. The process
stops when the improvement on adding even the strongest extra X fails to
reach some preset significance level. Alternatively one can work backwards
and eliminate at each stage the X with the weakest partial b or r. A variant
known as stepwise regression combines both approaches, working forwards but
checking after each step to see whether any X has lost significance and
should be dropped.
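A minimal forward-selection sketch in Python (added for illustration only;
rss, forward_select, and the F-to-enter threshold f_in are hypothetical names,
and package programs offer more refined versions):

    import numpy as np

    def rss(X, y):
        # residual sum of squares of a least squares fit with an intercept
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return resid @ resid

    def forward_select(X, y, f_in=4.0):
        # add at each step the unused X with the largest partial F,
        # stopping when even the best falls below the threshold f_in
        n, k = X.shape
        chosen = []
        while len(chosen) < k:
            base = rss(X[:, chosen], y) if chosen else float(((y - y.mean()) ** 2).sum())
            best, best_f = None, f_in
            for j in range(k):
                if j in chosen:
                    continue
                new = rss(X[:, chosen + [j]], y)
                df = n - len(chosen) - 2        # residual d.f. after adding X_j
                f = (base - new) / (new / df)   # partial F for the new predictor
                if f > best_f:
                    best, best_f = j, f
            if best is None:
                break
            chosen.append(best)
        return chosen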
These selection methods are readily available in package programs and can
save much time and effort. But for this very reason they are all too often a
mechanical substitute for critical judgment about which variables ought to be
relevant or irrelevant in the light of theory or experience. Automatic
selection is also statistically suspect. It sometimes fails to find the best
subset of predictors of a given size, and the nominal significance levels
are technically invalid because they are not independent (see Mather and
Openshaw, 1974), nor can the true significance level be checked when large
numbers of regressions are tried and discarded. And, most serious of all, the
more significance tests are carried out the greater the chance of
capitalising on a statistically significant but inexplicable result that
really is a 1 in 20 or 1 in 100 sampling fluke. It must also be remembered
that the tighter we set our significance levels to avoid accepting
fortuitously strong effects, the more we are liable to reject real population
relationships that happen to be weak in our sample data. This is by far the
greater danger when samples are small. For all these reasons significance
tests are a poor substitute for prior knowledge and critical judgment.
IV SPECIAL APPLICATIONS

(i) Causal models

Multiple regression analysis can be carried out for various reasons. One
rather narrow application is the search for the best possible empirical
equation for predicting one variable from a reasonably small set of others
that are relatively easy or cheap to measure. Another is the estimation of a
particular partial regression coefficient, i.e. the effect of one variable
on another when extraneous complications are held constant, as a substitute
for a controlled experiment or to substantiate earlier findings.

In much if not most social science, however, and much environmental science
too despite its links with the 'exact' sciences, research is still
exploratory. There is no consensus of opinion on which variables are relevant
to which others, or how they compare in importance. Attention is focused on
the existence and relative strength of relationships, not their precise form
for which there are no empirical or theoretical yardsticks. This is a broader
application of regression analysis, and involves assessment of the realism
of alternative cause-effect models rather than calibration of one whose
applicability is not in question. As an introduction to this kind of
exploratory work we consider here the types of interrelationship that may
exist between three variables, and only outline the extension to more
complicated situations.

It will generally be clear which of three variables could in principle depend
on the other two; call it Y and the others the X's. Seven distinct situations
can be recognised according to whether neither, one, or both of the X's does
directly affect Y and whether they are themselves related. The cause-effect
linkages can be represented by an arrow diagram. If either X has no direct
systematic effect on Y the corresponding arrow can be omitted and the partial
regression and correlation coefficients b and r between Y and this X with the
other X fixed must be equal to zero. If the X's are unrelated they do not
need to be linked by an arrow and their simple correlation must be zero; if
they are related, either one directly affects the other or both depend on
some other variable not included in the analysis. These cases may be
represented diagrammatically by one-way and two-way arrows. If the three
direct links are known the simple regressions between Y and each X can be
found from relationships derived from the covariance equations which led to
formulae (8a-b) for the partial regression coefficients.
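The relationships themselves have not survived reproduction. A reconstruction
consistent with the covariance argument, writing $b_{Y1\cdot2}$ for the
partial slope of Y on $X_1$ with $X_2$ held constant and $b_{21}$ for the
simple slope of $X_2$ on $X_1$, is

$$ b_{Y1} = b_{Y1\cdot2} + b_{Y2\cdot1}\,b_{21}, \qquad
   b_{Y2} = b_{Y2\cdot1} + b_{Y1\cdot2}\,b_{12} $$

so that each simple coefficient combines the direct effect of that X with an
indirect effect transmitted through the other predictor.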
The following situations can be distinguished.

(1) If neither X has any direct effect on Y, all simple and partial
correlations and regressions between Y and either X are zero. The presence or
absence of a link between the X's does not alter the situation.

(2) The other situation with only one arrow is when one X has no effect on Y
and is uncorrelated with the other X. It is then completely irrelevant and
its partial and simple b's are both zero. The partial b and r of the relevant
X are the same as its simple coefficients. Something approaching this 'no
change' situation might be found if we controlled the regression of rainfall
on elevation for, say, the ages of the men who read the raingauges.

(6) Reinforcement occurs when both X's directly affect Y and the indirect
effect of each, transmitted through the other, has the same sign as its
direct effect. Each predictor then amplifies the direct effect of the other.
The Scottish rainfall example is a good illustration. If only simple
regressions are considered the orographic effect is exaggerated by the
oceanic influence also affecting most of the high ground, and vice versa. The
modest correlation of -0.35 between the predictors is sufficient to inflate
partial b's of 1.8 and -5.2 to simple b's of 2.4 and -7.4 for height and
distance east respectively. Whichever predictor is taken first, ignoring the
other leads to a considerable bias in the regression estimate. Multiple
regression is essential for a truer picture.
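An arithmetical check added here, using the reconstructed relationship
$b_{Y1} = b_{Y1\cdot2} + b_{Y2\cdot1}\,b_{21}$ quoted above with the rounded
values in the text: for height, $2.4 = 1.8 + (-5.2)\,b_{EH}$ gives
$b_{EH} \approx -0.12$; for distance east, $-7.4 = -5.2 + 1.8\,b_{HE}$ gives
$b_{HE} \approx -1.22$. Their product, about 0.14, is the squared correlation
between the predictors, so $r \approx -0.37$, tolerably close to the quoted
-0.35 once rounding of the b's is allowed for.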

(7) The opposite case is suppression, where direct and indirect effects are
in competition and tend to cancel out. This occurs whenever either just one
or all three direct links are negative. What matters here is the sign of the
partial (not simple) b describing the effect on Y of each X. A simple b can
conceivably have the opposite sign to its partial if the indirect effect
through the third variable outweighs the direct effect. It could even be zero
if the direct and indirect effects cancelled out exactly (though this cannot
happen to both simple regressions at once). Suppression would occur in the
Scottish rainfall example if the topography of the region were reversed so
that the high ground lay in the east rather than the west. There would now be
only one negative direct link, that between rainfall and distance east. This
would not affect the observed partial b's if the orographic and oceanic
tendencies at work really are additive and linear. But the simple b's would
drop to 1.3 and -3.1 and the corresponding simple correlations to 0.56 and
-0.41 instead of 0.78 and -0.73 as actually observed. This is because
orographic rainfall in the east would go some way to offsetting the oceanic
influence in the west, giving a more uniform overall distribution of
rainfall.

The different patterns of simple and partial b's, and to a lesser extent
partial r's, that characterise these seven situations can be used to
distinguish between alternative causal models for the observed relationship
of three variables. This can be done by trial and error or systematically
using the flow chart of Fig. 14. The diagnostic questions are whether
neither, one, or both partials are so close to zero as to be negligible, and
how the predictors are related. The first is a subjective matter when we only
have sample estimates to go on, though partials that are significant at
better than 1% cannot lightly be ignored if the necessary assumptions are
satisfied. On the other hand many investigators would not dismiss as
negligible a partial that has the expected sign but fails to reach the chosen
level of significance, since this could well reflect a sampling error of the
second kind.

If both partials are adjudged negligible, neither X has an appreciable effect
on Y; this is case (1) above. If one partial is negligible but the other not,
the first X is either irrelevant (case 2, if it is also more or less
uncorrelated with the other X) or only indirectly relevant (cases 3 or 4,
according to the likely direction of the link between the X's; this too is a
matter for the analyst's judgment). Finally if both X's are directly relevant
the existence and sign of the correlation between them determines whether
their effects are separate (case 5), reinforcing (case 6), or suppressing
(case 7).

Causal models for relationships among more than three variables can be
tested in a similar way if they can be represented by arrow diagrams without
feedback loops. The general principle, first noted by the sociologist
H.A. Simon, is that the absence of an arrow must be reflected in a near-zero
partial b or r when any intervening variables and/or common causes are held
constant. Repeated application of multiple regression or partial correlation
to each partly or wholly dependent variable will either confirm the model or
suggest necessary modifications. The classic text is that by Blalock (1961),
and Mercer (1975) gives a clear account of an application in urban social
geography.

The Simon-Blalock approach has much in common with the technique of
path analysis, which originated in biology and is described in social statis-
tics texts such as Heise (1975) and Kerlinger and Pedhazur (1973). The chief
difference is that path analyses generally use standardised partial regression
coefficients, which are b's measured in standard deviations of Y per standard
deviation of X (they occurred unannounced in our discussion of regression
on p predictors). The standardised form of a simple b is r, so the total
effect of one variable on another is their correlation. It can be found from
the arrow diagram as the sum, over all paths linking the variables, of the
product of standardised b's along each path. In this way the importance of
direct, indirect, and other paths between variables can be compared within
one study. Standardised b's should not be used for comparisons between studies
since they depend on sample variability, but the path analysis principle can
be applied to unstandardised regressions as in our discussion of indirect,
reinforcing, and suppressing effects.
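In symbols (a reconstruction added for clarity, writing $p_{Y1}$ and $p_{Y2}$
for the standardised partial b's and $r_{12}$ for the correlation between the
two predictors), the total effect of $X_1$ on Y is

$$ r_{Y1} = p_{Y1} + p_{Y2}\,r_{12} $$

the first term being the direct path from $X_1$ to Y and the second the
indirect path via $X_2$.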

(ii) Special kinds of variables

Regression analysis is mostly applied to variables that can potentially
take any value within some continuous range. Rainfall, altitude, and distance
from the coast must all be positive and have upper limits within any one
region, but otherwise the number of possible values of each is restricted
only by the imprecision of our measurements. This is not however true of all
variables of interest to geographers. Some phenomena are simply present or
absent, and others exist only in a few separate categories - types of house
tenure, for example. These are examples of two-level and multi-level classif-
ications. Classification is also commonly used for things that vary widely
but are difficult to quantify on any continuous numerical scale: rock type,
social class, and so on. At first sight relationships involving qualitative
variables of this kind cannot be investigated by regression methods, but this
is not necessarily so. We saw earlier on that the correlation coefficient r
can be calculated for binary variables, i.e. those taking values 0 or 1 only.
Since regression coefficients can be found from the correlations of predictors
with each other and with a dependent variable, any qualitative phenomenon
that can be expressed in binary form can be used in a regression analysis.
Binary variables created for this purpose are called dummy variables. Their
use enables several apparently different statistical techniques to be brought
into the framework of regression theory.
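As a small added illustration in Python (hypothetical data; the tenure
categories are only examples):

    import numpy as np

    tenure = np.array(['owned', 'rented', 'council', 'rented', 'owned'])

    # two 0/1 dummies represent the three categories ('owned' is the base)
    rented = (tenure == 'rented').astype(float)
    council = (tenure == 'council').astype(float)
    X = np.column_stack([rented, council])  # ready to enter a regression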

One such application is the prediction of whether some qualitative phenomenon
is present or absent, given values of one or more quantitative factors
thought to be relevant. In this case we need a dummy dependent variable
scored 1 where the phenomenon is present and 0 where it is absent; least
squares regression of it on the quantitative factors is then essentially
equivalent to linear discriminant analysis (see King, 1969, 205-7).

The main drawback is that the scatter about the regression cannot be
homoscedastic, so the least squares regression estimates are not as reliable
as they could be. Wrigley (1976) in another monograph in this series describes
a more sophisticated technique, logit analysis, that overcomes this problem
and can be extended to qualitative dependent variables with more than two
categories. The fitted values can be transformed back to get the best
possible prediction of the probability of presence
of whatever is represented by Y = 1. Wrigley gives the example of predicting
how likely people are to suffer from acute bronchitis given their cigarette
consumption.

Dummy variables can also be used as predictors in a multiple regression
alongside ordinary quantitative predictors. For example, the magnitude of
river floods is likely to increase with rainfall but may also depend on
geology since the less permeable the ground the quicker storm rainfall gets
into the river. Rock type can be taken into account by a dummy predictor
scored 1 for catchments on impermeable rock and 0 for the rest; its
regression coefficient estimates the difference in flood magnitude between
the two rock types at any given rainfall, so the fitted relationship is a
pair of parallel trend lines.
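A sketch of the flood example in Python (added for illustration; the data and
coefficients are entirely synthetic):

    import numpy as np

    rng = np.random.default_rng(0)
    rain = rng.uniform(20.0, 100.0, 40)          # storm rainfall at 40 imaginary sites
    rock = rng.integers(0, 2, 40).astype(float)  # dummy: 1 = impermeable, 0 = permeable
    flood = 5 + 0.8 * rain + 12 * rock + rng.normal(0, 4, 40)

    A = np.column_stack([np.ones(40), rain, rock])
    a, b1, b2 = np.linalg.lstsq(A, flood, rcond=None)[0]
    # the fit is a pair of parallel lines: flood = a + b1*rain on permeable
    # rock and flood = (a + b2) + b1*rain on impermeable rock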
This method can be extended to several parallel trend lines, which amounts to
analysis of covariance; to horizontal lines, which is equivalent to analysis
of variance; to lines with different slopes but the same intercept; and to
mixtures of all these cases. The multiple regression approach makes clear the
links between the different possibilities, and allows easy comparison of
their goodness of fit. Silk (1976) gives a detailed but readable account.
Further applications of dummy predictors are described by Draper and Smith
(1966, ch. 5) and Mather and Openshaw (1974).

Two further applications of multiple linear regression to special kinds of
variables should also be noted. One is trend surface analysis in which Y is
some spatially distributed variable and the X's are locational coordinates
(e.g. eastings and northings) and their powers and products up to some
maximum order. The aim is generally to see what order of surface adequately
describes the spatial pattern, the F test described earlier being used to
judge the improvement from one order to the next. Details and applications
are described by Unwin (1975) in another monograph in this series.
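A sketch of how the locational predictors might be constructed (Python, added
for illustration; trend_surface_terms is a hypothetical helper):

    import numpy as np

    def trend_surface_terms(e, n, order):
        # all terms e**i * n**j with 1 <= i + j <= order
        return np.column_stack([e**i * n**j
                                for i in range(order + 1)
                                for j in range(order + 1 - i)
                                if i + j > 0])

    # order 2, for instance, gives the columns n, n**2, e, e*n, e**2
    # for a quadratic surface fitted by ordinary multiple regression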
The final special case is the autoregressive modelling of time series
(Box and Jenkins, 1970). Here the X's are Y values one, two, etc. time
intervals ago. The carryover effects in the series at these different time
lags amount to partial regression coefficients of the series on its own past
and can be found from the correlations between Y and the X's, i.e. the
autocorrelations of the series with itself at different lags. Applications
include the study of fluctuations in economic, climatic, and hydrologic time
series. Spatial series in geomorphology have also been investigated in the
same way, with distance (downslope or downriver) replacing time.
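A sketch of such an autoregressive fit (Python, added for illustration;
fit_ar is a hypothetical helper):

    import numpy as np

    def fit_ar(y, p):
        # regress the series on its own values 1, 2, ..., p intervals ago
        y = np.asarray(y, dtype=float)
        n = len(y)
        lags = np.column_stack([y[p - k : n - k] for k in range(1, p + 1)])
        A = np.column_stack([np.ones(n - p), lags])
        coefs, *_ = np.linalg.lstsq(A, y[p:], rcond=None)
        return coefs  # intercept, then lag-1 ... lag-p coefficients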
Multiple linear regression in its basic form is very widely used. The special
applications to spatial patterns, time series, qualitative variables, and
causal models make it even more versatile. The geographer who understands the
fundamentals of linear regression is well placed to analyse most kinds of
geographical data and to appreciate published quantitative research.
BIBLIOGRAPHY

A. Applications

Bleasdale, A. and Chan, Y.K., (1972), Orographic influences on the
	distribution of precipitation. 322-333 in: Distribution of
	precipitation in mountainous areas, 2, World Meteorological Office
	(Geneva).
Champion, A.G., (1972), Urban densities in England and Wales: the
	significance of three factors. Area, 4, 187-192.
Ferguson, R.I., (1975), Meander irregularity and wavelength estimation.
	Journal of Hydrology, 26, 315-333.
Haynes, R.M., (1973), Crime rates and city size in America. Area, 5, 162-165.
Krumbein, W.C., (1959), The sorting out of geological variables illustrated
	by regression analysis of factors controlling beach firmness.
	Journal of Sedimentary Petrology, 29, 575-587.
Mercer, J., (1975), Metropolitan housing quality and an application of causal
	modelling. Geographical Analysis, 7, 295-302.
Parker, A.J., (1974), An analysis of retail grocery price variations. Area,
	6, 117-120.
Smith, G.C., (1976), The spatial information fields of urban consumers.
	Transactions, Institute of British Geographers, new series 1, 175-189.
Taaffe, E.J., Morrill, R.L., and Gould, P.R., (1963), Transport expansion in
	underdeveloped countries: a comparative analysis. Geographical
	Review, 53, 503-529.

B. Assumptions and inference

Gould, P., (1970), Is statistix inferens the geographical name for a wild
	goose? Economic Geography, 46, 439-448.
Gudgin, G., and Thornes, J.B., (1974), Probability in geographic research:
	applications and problems. The Statistician, 23, 157-177.
Mather, P., and Openshaw, S., (1974), Multivariate methods and geographical
	data. The Statistician, 23, 283-308.
Poole, M.A., and O'Farrell, P.N., (1971), The assumptions of the linear
	regression model. Transactions, Institute of British Geographers,
	52, 145-158.
Unwin, D.J., and Hepple, L.W., (1974), The statistical analysis of spatial
	series. The Statistician, 23, 211-227.

C. Advanced texts

Draper, N.R., and Smith, H., (1966), Applied regression analysis.
	(Wiley, New York).
Huang, D.S., (1970), Regression and econometric methods. (Wiley, New York).
Johnston, J., (1972), Econometric methods (2nd edition). (McGraw-Hill,
	New York).
Kerlinger, F.N., and Pedhazur, E.J., (1973), Multiple regression in
	behavioral research. (Holt, Rinehart and Winston, New York).
Surrey, M.J.C., (1974), An introduction to econometrics. (Clarendon
	Press, Oxford).

D. Other references
Blalock, H.M., (1961), Causal inferences in nonexperimental research.
(University of North Carolina Press, Chapel Hill, North Carolina).
Box, G.E.P., (1966), Use and abuse of regression. Technometrics, 8, 625-629.
Box, G.E.P., and Jenkins, G.M., (1970), Time series analysis, fore-
casting and control. (Holden-Day, San Francisco).
Cliff, A.D., and Ord, J.K., (1972), Testing for spatial autocorrelation among
regression residuals. Geographical Analysis, 4, 267-284.
Ehrenberg, A.S.C., (1975), Data reduction. (Wiley, London).
Heise, D.R., (1975), Causal analysis. (Wiley, London).
King, L.J., (1969), Statistical analysis in geography. (Prentice Hall,
Englewood Cliffs, New Jersey).
Silk, J., (1976), A comparison of regression lines using dummy variable
analysis. Geographical papers, Department of Geography,
University of Reading, 44.
Sprent, P., (1969), Models in regression. (Methuen, London).
Till, R., (1973), The use of linear regression in geomorphology. Area, 5,
	303-308.
Unwin, D.J., (1975), An introduction to trend surface analysis.
Concepts and techniques in modern geography, 5. (Geo Abstracts Ltd,
Norwich).
Wrigley, N., (1976), An introduction to the use of logit models in
geography. Concepts and techniques in modern geography, 10.
(Geo Abstracts Ltd, Norwich).
