Rob Ferguson
CONCEPTS AND TECHNIQUES IN MODERN GEOGRAPHY No 15
CATMOG
LINEAR REGRESSION IN GEOGRAPHY
I INTRODUCTION
(i) Relationships between variables
Most if not all systems studied by geographers involve variable quant-
ities that can be measured numerically, from the distances travelled by
Sunday trippers to the hydraulic characteristics of proglacial streams. Com-
parison of maps or time series graphs often shows that differences in one
variable, in space or over time, appear to be associated with differences in
other variables. For example both Sunday trips and proglacial streamflow may
vary according to temperature. Sometimes the apparent association is entirely
a matter of chance, and in other cases two variables may amount to altern-
ative definitions of the same thing, but frequently we suspect cause and
effect are at work. A relationship of this kind can be important in three ways.
It may confirm or refute theoretical notions about cause-effect processes in
the system under study. Alternatively it may highlight something not con-
sidered by existing theory and thus stimulate new ideas. Thirdly, it may be
useful for prediction of the response to future changes in conditions, or of
the present state of affairs in places where direct measurement is inconven-
ient or impossible.
essential. The reader is however assumed to have taken an introductory course covering elementary descriptive statistics, including the correlation coefficient, and significance testing. Finally, the notation adopted here is widely used and fairly obvious but the reader is warned that other authors may use alternative symbols for the same things and the same symbols for different things.

Regression analysis is about relationships between variables and the contributions of different factors to overall variability. It is therefore useful to be able to split into components the variance of quantities like X + Y. Rules (1) to (3) show that

    var(X + Y) = var(X) + var(Y) + 2cov(X,Y)
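This identity can be checked numerically on any paired sample. The values below are invented purely for illustration; any two equal-length arrays would do.

```python
# Numerical check of var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
# on a small made-up sample (values are illustrative only).
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([3.0, 5.0, 4.0, 9.0, 14.0])

# Population-style moments (divide by n), so the identity is exact.
var_x = np.var(x)
var_y = np.var(y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

lhs = np.var(x + y)
rhs = var_x + var_y + 2 * cov_xy
print(abs(lhs - rhs) < 1e-12)  # the two sides agree to rounding error
```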
II SIMPLE REGRESSION
(i) The least squares trend
The observed relationship between two numerical variables X and Y can
be represented exactly by a scatter diagram or graph of Y values against X
values, with one point for each pair of values (each site or area in a
spatial study, person or household in a social study, time or time interval
in a historical study, and so on). As an illustration Figure 2 shows the
relationship between long term average precipitation (Y) and height above
sea level (X) at twenty rain gauges in a west-east transect across southern
Scotland (the data is in Table 1).
creases by one. In this case it is clear that rainfall increases rather than decreases with elevation so b is positive. The intercept a is fixed by the requirement that positive and negative residuals about the line cancel out:

    a = Ȳ - bX̄                                          (2a)

This fixes the general level of the line, but its tilt depends on b. It turns out that the residual scatter is least for a slope of

    b = r s_Y/s_X                                        (2b)

so that the residual variance is a fraction (1 - r²) of the total variance of Y:

    s_e² = s_Y²(1 - r²)                                  (2c)
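Because (2a) and (2b) involve only means, standard deviations, and the correlation, the trend line can be computed from summary statistics alone. The figures below are those quoted for the Scottish rainfall data later in the text.

```python
# Least squares slope and intercept from summary statistics alone,
# using b = r * s_Y / s_X (2b) and a = Ybar - b * Xbar (2a).
# Summary figures are those quoted for the Scottish rainfall data.
r = 0.784                # correlation between elevation and rainfall
s_x, s_y = 122, 371      # st. dev. of elevation (m) and rainfall (mm/yr)
xbar, ybar = 314, 1640   # means

b = r * s_y / s_x        # slope, mm/yr per m of height
a = ybar - b * xbar      # intercept, mm/yr at sea level

# b comes out near 2.38; a comes out near the 895 mm/yr quoted in the
# text (small differences reflect rounding of the summary statistics).
print(round(b, 2), round(a))
```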
For example, the relevant statistics of rainfall and elevation for the Scottish data of Table 1 are easily found to be as follows.

    variable        units   mean   st. dev.   correlation
    Y (rainfall)    mm/yr   1640   371        )
                                              )  0.784
    X (elevation)   m       314    122        )

With these figures b = 0.784 x 371/122 = 2.38 and a = 1640 - 2.38 x 314 = 895. The best-fit trend line is therefore

    Y = 895 + 2.38X                                      (3)

which shows that rainfall in this part of Scotland increases by nearly 240 mm/yr per 100 m of extra height, from a sea level value of just under 900 mm/yr (it is seldom worth looking beyond the first two significant figures of any statistical coefficient). By (2c) the residual variance about the trend is 100(1 - 0.784²)% or only 39% of the total variability of rainfall, corresponding to a residual standard deviation s_e of about 230 mm/yr. Least squares regression provides an objective description of the form of what is evidently a strong orographic tendency in our data.

(ii) The linear model

Minimum residual variance is only one of several possible criteria for choosing a best-fit trend line, and goodness of fit may not be the only consideration anyway. It can be argued that many other lines through the mean point of a scatter diagram have only slightly greater scatter than the best-fit regression, and that a round-number slope that generalises to other data is the most useful description (Ehrenberg, 1975, chs. 7, 14). A further difficulty is that unless the correlation is perfect (r = 1 or -1), when statistical methods are unnecessary since the points lie on a straight line, the least squares regressions of Y on X and X on Y give different trend lines. One minimises the variance of residuals in the Y direction, the other that of residuals in the X direction. Which are we to take? If the aim is simply to describe the relationship the ambiguity is embarrassing, and as Ehrenberg notes it is impossible for both regression lines to generalise to different data.

The least squares method can however be justified if we are prepared to accept one of two alternative statistical models for our data. In this context a 'model' is a set of assumptions about the underlying nature of the relationship that is supposed to have generated our sample data, and the models are statistical because they involve chance. The two models make different assumptions and are applicable in different circumstances but the distinction is often blurred or ignored completely in the geographical literature. The first and less useful one assumes that paired values of X and Y are random variates which jointly follow a bivariate normal probability distribution. For this reason it is generally called the joint or bivariate normal model, though Poole and O'Farrell (1971) refer to it as the random-X model. A set of X,Y pairs from this model should show an ellipse-shaped scatter in an X-Y plot, densest in the middle and elongated in proportion to the correlation coefficient which in this model is a parameter of the probability distribution. The least squares regressions of Y on X and X on Y can now be shown to give the best possible estimates of, respectively, the value of Y given X and that of X given Y (see for example Sprent, 1969, ch. 2). So the regression of Y on X is appropriate for prediction of Y from X, and vice versa. But this still does not tell us which to use to describe the relationship, and for this purpose it seems more sensible to find the equation of the major axis of the scatter ellipse, estimated by the bisector of the two regression lines (see Till, 1973).

The joint model may be applicable when X and Y are different measures of essentially the same thing, but it is technically invalid when X or Y or both depart appreciably from a normal distribution, and logically inappropriate when there is reason to think that one variable (conventionally Y) depends on or is in some way a function of the other (X). The linear regression model provides an alternative justification for fitting a trend by least squares in these circumstances. To the extent that we rationalise the world in cause-effect terms this is a more versatile approach to regression. It is also technically more flexible in that X need not be either random or normally distributed. Instead we could for example measure rainfall (Y) at preselected and evenly-spaced heights (X).

The linear model assumes that the explanatory variable or predictor X affects the dependent variable Y in a systematic fashion that is distorted by more or less random scatter. Three possible reasons for this scatter are errors of measurement, idiosyncracies of the individuals to which the data refer, and neglect of other relevant factors - i.e. the relationship would hold exactly if other things were equal, but they are not. If one or more of these applies, the underlying relationship must be obscured but to an extent that we can only guess. We may suspect that our data are imperfect, individuality exists, and relevant variables have been overlooked - indeed the geographer can seldom be confident on any of these counts - but the size and direction of the disturbances so introduced are unknown.

Without further information the only way ahead is to lump the three kinds of complications into a single unobservable variable ε that stands for all sources of variability in Y other than X. In the linear model ε is taken to be added on to an unknown straight-line relationship between Y and X: that is, the ith observed value of Y is

    Yi = α + βXi + εi                                    (4a)

The contrast between this essentially systematic relationship and the more fundamentally probabilistic joint model is sketched in Fig. 3. To proceed further we must assume that the mean disturbance is zero, i.e.

    ε̄ = 0                                                (4b)

and that the disturbances are uncorrelated with X values, i.e.

    r(ε,X) = 0                                           (4c)

(4b) implies that the disturbances do not have the general effect of raising or lowering the Y values overall, whilst (4c) implies the absence of any
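The point that the Y-on-X and X-on-Y regressions differ unless |r| = 1 is easy to demonstrate numerically. The simulated data below are invented for illustration: when both lines are expressed as Y against X, the Y-on-X slope is r s_Y/s_X while the X-on-Y line has slope (s_Y/s_X)/r, which is always steeper for 0 < r < 1.

```python
# The two least squares lines: Y on X minimises vertical residuals,
# X on Y horizontal ones. Expressed as Y against X, their slopes are
# r*s_y/s_x and (s_y/s_x)/r respectively, which differ unless |r| = 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)   # imperfectly correlated pair

r = np.corrcoef(x, y)[0, 1]
b_y_on_x = r * y.std() / x.std()          # regression of Y on X
b_x_on_y_as_y = (y.std() / x.std()) / r   # X-on-Y line solved for Y

print(b_y_on_x < b_x_on_y_as_y)  # True whenever 0 < r < 1
```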
systematic association between positive or negative disturbances and high or low X values.

Fig. 3 Models for simple regression: (left) joint, (right) linear. Solid lines are least squares trends.

The assumptions embodied in this model are sufficient to distinguish between the systematic and disturbance effects and thereby to identify the unknown coefficients. Averaging both sides of (4a) and using (4b) gives

    Ȳ = α + βX̄                                           (5a)

It can also be subtracted from (4a) to give the model in deviation form, and equating the observed covariance of X and Y with that implied by the model, using (4c), then gives

    β = r s_Y/s_X                                        (5b)

The unknown coefficient β is thus determined by the standard deviations and correlation of the two observed variables, and α now follows from this and the observed means by (5a). Finally we can equate observed and model variances of Y:

    s_Y² = β²s_X² + s_ε²

which identifies the disturbance variance as s_ε² = s_Y²(1 - r²), in agreement with (2c). This correspondence between the least squares and linear model approaches is very important. It shows that the least squares method amounts to the assumption of a linear model with added zero mean disturbances uncorrelated with the systematic effect. Conversely the least squares trend identifies the true coefficients and disturbances of the underlying systematic relationship if these assumptions are correct. No probabilistic considerations are involved apart from the plausibility of the model.

The position is different if the observed data represent only a sample from some wider population of interest. This is the case in the Scottish example, where we have measurements for only twenty out of an infinite number of possible sites but would like to use the results to generalise about the regional climate or to predict rainfall at ungauged locations. The linear regression model (4) is now one possible view of the underlying relationship in the population as a whole. Let us assume it is appropriate. If the population means, standard deviations, and correlation were the same as the sample ones, then equations (5a-b) would give an accurate description of the population as well as the sample. But we have no means of telling whether this is so. The odds are overwhelmingly against, since the characteristics of a small sample are most unlikely to match those of its parent population exactly. Predictions from the fitted trend will be doubly uncertain, with doubt about the correctness of the trend line added to the scatter of points around it.
III MULTIPLE REGRESSION
The simplest and most commonly used model is linear and an extension of that for simple regression. The dependent variable is taken to be made up of a systematic effect plus a more or less random disturbance that averages out to zero and is uncorrelated with the systematic effect. The latter is assumed to be some linear function or weighted average of the two explanatory variables X1 and X2 (the term 'independent variables' should be avoided since they may well be correlated with each other and with other variables). Values of the dependent variable Y are therefore predicted by

    Y = a + b1X1 + b2X2                                  (6)

and are assumed to have been generated by the linear model

    Yi = α + β1X1i + β2X2i + εi                          (7)

The first assumption fixes the intercept from the means of the variables, as in simple regression. The second assumption can now be used to simplify the results of equating observed covariances with those implied by the model:

    r_Y1 s_Y = b1 s_1 + b2 r_12 s_2

and

    r_Y2 s_Y = b1 r_12 s_1 + b2 s_2

These can be solved for the partial regression coefficients in terms of the correlations and standard deviations of the variables:

    b1 = (r_Y1 - r_Y2 r_12)/(1 - r_12²) x s_Y/s_1        (8a)

    b2 = (r_Y2 - r_Y1 r_12)/(1 - r_12²) x s_Y/s_2        (8b)

with the intercept

    a = Ȳ - b1X̄1 - b2X̄2                                  (8c)

If the predictors are uncorrelated (r_12 = 0) these formulae reduce to the simple regression slopes, so that the separate effects of X1 and X2 do not overlap and are identified correctly by simple regression. Most important, the values given by (8a-c) minimise the residual variance about the trend (the proof is by calculus in most advanced texts). Thus if the linear model (7) is true of the observed data the least squares trend is identical to the underlying systematic relationship. If however the model is only true of a population from which the data are a sample then the observed means, standard deviations, and correlations are unlikely to be the same as those for the population as a whole, and the fitted coefficients are reliable estimates only if the disturbances behave in the simple random manner discussed later.

As an example of the routine application of these estimation formulae, consider again the Scottish rainfall data. We saw earlier that rainfall increases with elevation but that distance from the west coast also seems to be relevant. The two effects can be separated by multiple regression. All the necessary information is included in the following table of descriptive statistics calculated from the data of Tables 1 and 2.

    variable            mean    st. dev.    correlation with
                                            Y       X1      X2
    Y (rainfall)        1640    371         1
    X1 (elevation)      314     122         .784    1
    X2 (distance E)     89.4    36.7        -.730   -.353   1

The resulting least squares regression is

    Y = 1530 + 1.8X1 - 5.2X2                             (9)

This can be interpreted in three ways. It is an objective description of the observed relationship between the three variables. It is moreover an accurate description of the underlying systematic relationship if the linear model (7) and associated assumptions are true for the 20 sites. Alternatively it is the best available estimate of the underlying relationship if the linear model is true of the region as a whole and the assumptions to be described later are true of the disturbances.

Just as the fitted simple regression could be represented by a sloping trend line in a two-dimensional scatter diagram, so the fitted multiple regression (8) can be seen as a tilted trend plane in a three-dimensional scatter diagram. The fitted plane for the Scottish data rises with height and falls towards the east, so that the maximum contrast in rainfall is between high ground in the west and low ground in the east, not simply west and east or high and low ground. These two situations would be represented instead by planes with no slope in one direction or the other, and if rainfall depended on neither height nor location the regression plane would be horizontal with both b's equal to zero.

Multiple regression in this case gives a more accurate description than simple regression of the regional distribution of precipitation, but there is of course still some residual scatter. In Fig. 6 the individual data points are plotted along with the fitted relationship between rainfall and height at different distances east: the height trend is thus shifted up or down according to distance east. Clearly this ought to improve the overall goodness of fit, and we consider this next.
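The arithmetic of the two-predictor formulae (8a-b) can be checked directly from the table of descriptive statistics, recovering the partial b's of about 1.8 and -5.2 quoted in the text.

```python
# Partial regression coefficients from the correlation matrix, using the
# two-predictor formulae (8a-c), applied to the Scottish summary statistics.
r_y1, r_y2, r_12 = 0.784, -0.730, -0.353   # correlations from the table
s_y, s_1, s_2 = 371, 122, 36.7             # standard deviations
ybar, x1bar, x2bar = 1640, 314, 89.4       # means

d = 1 - r_12**2
b1 = (r_y1 - r_y2 * r_12) / d * s_y / s_1   # effect of elevation
b2 = (r_y2 - r_y1 * r_12) / d * s_y / s_2   # effect of distance east
a = ybar - b1 * x1bar - b2 * x2bar          # intercept by (8c)

print(round(b1, 1), round(b2, 1), round(a))  # about 1.8, -5.2, 1530
```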
It follows that the total scatter of the dependent variable Y is made up of two parts, that due to the systematic effect of the measured explanatory variables and that due to unsystematic disturbances:

    s_Y² = s_Ŷ² + s_e²

or

    total variance = explained variance + unexplained variance.

'Explained' means of course 'numerically accounted for' and need not imply any understanding of why the relationship exists.
The relative sizes of the two variance components are obviously of great interest to anyone carrying out a regression analysis, whether the aim is prediction, hypothesis testing, or generalisation. The proportion of observed variance that is explained, denoted R², is known as the coefficient of determination.

Figures like this are difficult to assess in isolation. They provide an objective measure of goodness of fit, but one's satisfaction with the result depends on temperament, past experience, and prior expectations. Since R² is defined as a proportion of the total variability of Y the nature of this base figure should also be kept in mind. A high coefficient of determination for a large set of data spanning a wide range of conditions - rainfall throughout Britain, perhaps - is more impressive than the same value of R² in a more restricted analysis.

It is also possible to make internal comparisons to see how far the individual predictors contribute to the overall fit. The partial correlation of rainfall with height, for example, measures how much of the rainfall variability not already accounted for by the simple west-east trend is explained by height. In this way partial correlations supplement the partial b's as indications of the direct importance of individual explanatory variables.

(iv) More than two predictors

There is no logical limit to the number of variables that may be linked in a cause-effect web. There is however a statistical limit when regression analysis is used to establish the causal structure: if as many predictors are considered as there are individuals, we have one for each disturbance ε1, ε2, ... and might as well admit individuals are unique. In practice few geographers (or statisticians) would confidently interpret a regression on more than a few explanatory variables, so this indeterminate situation is unlikely to arise.
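For the two-predictor case, R² can itself be computed from the correlations alone by a standard formula that follows from (8a-b); applied to the Scottish figures it gives the overall explained proportion of rainfall variance.

```python
# Coefficient of determination R² for a two-predictor regression,
# from the correlations alone (a standard consequence of (8a-b)):
# R² = (r_y1² + r_y2² - 2 r_y1 r_y2 r_12) / (1 - r_12²).
r_y1, r_y2, r_12 = 0.784, -0.730, -0.353   # Scottish correlations

r2 = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
print(round(r2, 2))  # proportion of rainfall variance explained
```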
We can now equate observed covariances with those implied by the model:

    r_Yi s_Y = b1 r_i1 s_1 + b2 r_i2 s_2 + ... + bp r_ip s_p

for each predictor i = 1, 2, ..., p. The one- and two-predictor formulae given earlier are special cases of these general results. In all cases only the means, standard deviations, and intercorrelations of the observed variables are required.

Once again all this applies in the first place only to sample description. If the linear model is assumed to apply to some population from which our data are a sample, the descriptive statistics of the latter are likely to differ from those for the population, and so therefore must the estimated regression coefficients; how far they can be trusted depends on the statistical assumptions involved.
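In matrix terms the p covariance equations form a linear system that any numerical routine can solve. The sketch below applies a general solver to the two-predictor Scottish case and reproduces the partial b's obtained earlier from the closed-form formulae.

```python
# General case: the standardised partial coefficients solve the system
# R_xx * beta = r_xy, where R_xx is the predictor correlation matrix and
# r_xy the vector of predictor-Y correlations. Rescaling gives the b's.
import numpy as np

R_xx = np.array([[1.0, -0.353],
                 [-0.353, 1.0]])      # correlations among predictors
r_xy = np.array([0.784, -0.730])      # correlations with rainfall
s = np.array([122.0, 36.7])           # predictor standard deviations
s_y = 371.0                           # rainfall standard deviation

beta = np.linalg.solve(R_xx, r_xy)    # standardised coefficients
b = beta * s_y / s                    # back to original units

print(np.round(b, 1))  # the partial b's, about 1.8 and -5.2
```

The same code handles any number of predictors once the correlation matrix is enlarged, which is why only means, standard deviations, and intercorrelations are ever needed.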
The model assumed so far embodies both linearity and additivity (since if several X's are changed their effects on Y are simply added together).

Nonadditivity occurs when the effect on Y of one variable depends on the value of some other variable in an interactive, usually multiplicative, way. If, for example, a 100 m rise in elevation increased rainfall by a certain percentage of its local value rather than a fixed amount then our additive regression would be inappropriate, particularly at sites where both predictors have extreme values. In the physical sciences multiplicative relationships are often expected on theoretical grounds, and the appropriate transformation can be chosen in advance so that each transformed variable enters as a simple predictor (see Haynes, 1973 for a rare example of this kind of forethought in human geography). In less clear cut situations the most flexible approach is to take logarithms of all variables. This converts a multiplicative relationship into one that is linear and additive in the logs, and so can be fitted in the least squares way even though the corresponding X-Y plots can take on a variety of concave and convex shapes (Fig. 9). The logged regression is then estimated in the ordinary way. Note that the disturbance term in this is also interactive: its effect is to alter by a certain proportion or percentage, rather than absolute amount, the Y value determined by the systematic effects of the X's. We shall return to this topic in a moment.

Apart from linearity and additivity we have assumed so far that the disturbances average out to zero and are uncorrelated with the systematic effect. These assumptions are not immediately testable since the least squares method ensures they are both satisfied, but the second one is related to the question of how many predictors to include in a regression. If we have omitted some relevant explanatory variable, say Z, that is correlated with one or more of the included X's, the true disturbance contains the 'lurking' variable Z and is not uncorrelated with each predictor. The assumption that it is leads to misleading estimates of the regression coefficients of those X's correlated with Z (Box, 1966; and see 'specification error' in econometrics texts). This problem does not arise if the omitted variable Z is uncorrelated with the included X's, or if it has no direct effect on Y, but the only way to be sure is to include potential lurking variables. They can always be abandoned if found to be irrelevant.

Another problem to do with correlations between variables is that if one predictor is perfectly correlated with another, or with some combination of the others, the variables concerned are said to be multicollinear and their effects cannot be separated. The least squares method breaks down in these circumstances and a regression cannot be fitted until the offending variable is omitted.
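The effect of logging all variables can be illustrated with simulated data. The exponents and sample below are invented for the purpose: a multiplicative relationship with a proportional disturbance becomes linear and additive in the logs, and ordinary least squares on the logged variables recovers the exponents.

```python
# A multiplicative relationship Y = k * X1^p1 * X2^p2 * e becomes linear
# and additive in the logs: log Y = log k + p1 log X1 + p2 log X2 + log e.
# Exponents and data here are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(50, 500, 200)
x2 = rng.uniform(10, 150, 200)
e = np.exp(rng.normal(0, 0.1, 200))    # proportional disturbance
y = 3.0 * x1**0.5 * x2**-0.2 * e

# Ordinary least squares on the logged variables recovers the exponents.
A = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(np.round(coef[1:], 1))  # close to the true exponents 0.5 and -0.2
```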
Some writers, including Poole and O'Farrell (1971) in an otherwise helpful discussion, give the impression that any non-zero correlation between predictors is unacceptable. This is not so, and indeed multiple regression is unnecessary when predictors are mutually uncorrelated. Strong intercorrelations do however lead to greater uncertainty in regression estimates from samples, as will be seen later.

The general inferential problem in regression has already been touched upon several times. If sample data are used to estimate the coefficients of a population model, then whatever the precise methods used the results are bound to be uncertain to the extent that another sample would give different estimates. Does any particular method of estimation minimise this uncertainty, and if so which? Two kinds of error are involved. An estimation method may be biased, i.e. systematically over- or underestimate the population coefficients; and it may have greater variance than another method, i.e. give a wider scatter of estimates about the true value. An analogy may help make this clear. The numbers at the top of a darts board are 5, 20, 1, 18. If 20 is the target, a player who throws three 1's shows bias; 5, 20, 1 is erratic; and 20, 1, 18 reveals both bias and variance. The best grouping is of course three 20's, with no bias and minimum variance.

If a linear additive model is an accurate description of a population relationship and the disturbance term ε behaves in a simple random fashion, least squares estimates of the regression coefficients using sample data are best linear unbiased estimates (BLUE for short). Proofs can be found in the advanced textbooks listed in the bibliography. The necessary assumptions about ε are that individual disturbances behave as uncorrelated random variables from probability distributions which all have a mean of zero and the same variance. In terms of expectations (averages over probability distributions),

    E(εi) = 0,  var(εi) = σ² for every i,  and cov(εi, εj) = 0 for all i ≠ j.

The first condition is satisfied if the relationship really is linear and additive, or has been made so by variable transformations, and if no systematic 'lurking variable' has been left out. It is untestable unless there are several observations of Y at each value of the X's. The condition is compatible with any X, whether the latter has random or nonrandom (fixed) values, so long as these are measured without error. Inaccuracy in Y is permissible (indeed it was the original motive for developing regression methods), but errors in X's reduce their correlations with Y and necessarily lead to bias in the form of underestimation of the effect of each X.

The second condition, that of homoscedasticity, says that there is a constant degree of scatter about the population relationship rather than local regions of high or low scatter, when data points from high-scatter regions would exert undue influence on the least squares estimates. Non-constant scatter or heteroscedasticity is commonly associated with disturbances that are proportional rather than additive. This is a special case of nonadditivity, as already discussed. It can be detected by inspection of the scatter diagram of a simple regression, or by plotting the residuals from a multiple regression against predicted values: a systematic change in the amount of scatter, as in the bottom graphs of Fig. 7, suggests heteroscedasticity. This time the multiple regression for Scottish rainfall does not pass the test so clearly, for there is some tendency towards greater scatter in the west where observed and predicted rainfalls are higher (Fig. 8). If this were more pronounced the least squares estimates would be less reliable. Proportional disturbances about a simple linear model can be accommodated by regressing Y/X on 1/X. Taking logarithms of all variables is another alternative, since as previously noted it converts multiplicative proportional disturbances to additive homoscedastic ones. More complicated types of heteroscedasticity can be dealt with only by weighting each observation: see 'generalised least squares' in advanced texts.

The third condition, that disturbances are mutually uncorrelated and therefore convey no information about each other, has been singled out as especially dubious by Gould (1970) and other geographers on the grounds that almost all geographical phenomena show positive spatial autocorrelation, with nearby places more alike. However, it is not necessary to assume that the values of any observed variable are mutually uncorrelated, only that the measurement error or extraneous complications affecting one Y value are unrelated to those affecting any other individual. Pronounced spatial patterns or trends in the X's or Y need not violate this assumption; what matters is whether the residuals themselves are autocorrelated, and there are several ways of testing this formally (see Cliff and Ord, 1972). Much the simplest is the runs test (described in most elementary statistics texts) in which the number of runs of successive residuals with the same sign in some appropriate plot is compared with the expected number for a random sequence. This is just over n/2, so the 20 residuals from the combined eastwards and altitudinal trend in Scottish rainfall pass the test comfortably with 11 runs in Fig. 8. Fewer and longer runs might have been found had raingauges within 2 km of others not been eliminated from the original sample, for local similarities in omitted variables such as aspect could lead to similar departures from the regional trend. Autocorrelation is therefore commoner in closely-spaced data. The same applies in time series, where temporary disturbances may carry over from one observation to the next if the interval is short. If substantial autocorrelation is present regression estimates may remain unbiased but no longer have minimum variance and are thus less reliable than usual. An intuitive explanation for this is that some of the data points are more or less duplicating each other so that the effective sample size is reduced. Unwin and Hepple (1974) discuss the problem further.

Visual inspection of residual plots generally provides the simplest and in many ways the best check of each of the assumptions that justify the use of linear least squares estimation. Residuals are, or can be, printed by most computer programs for regression analysis, and in some cases the plots too can be produced by machine, so residual checking is no great chore. It can however be less straightforward than suggested above if predictors have very skewed distributions. Scatter diagrams and residual plots then contain one or a few isolated points well clear of the rest, making it difficult to distinguish between trend and scatter. Log transformation helps here too by reducing positive skewness.
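The sign-counting at the heart of the runs test takes only a few lines. The residuals below are invented for illustration; in practice they would come from the fitted regression, ordered along the transect or time axis.

```python
# Sketch of the runs test on residual signs: count runs of consecutive
# residuals with the same sign and compare with the roughly n/2 + 1
# expected for a random ordering. Residual values here are invented.
def count_runs(signs):
    """Number of maximal blocks of equal consecutive values."""
    runs = 1
    for prev, cur in zip(signs, signs[1:]):
        if cur != prev:
            runs += 1
    return runs

residuals = [12, -5, -8, 30, 14, -2, 9, -11, -7, 4,
             -20, 6, 18, -3, 1, -9, 25, -14, 8, -6]   # 20 invented values
signs = [r > 0 for r in residuals]
n = len(residuals)

# Far fewer runs than expected would suggest positive autocorrelation.
print(count_runs(signs), n // 2 + 1)
```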
We have considered the statistical assumptions behind regression esti-
mation at some length in order to see some of the problems that can arise in
practice. Opinions differ as to how seriously violations of these assumptions
should be taken. It is unfortunate that many geographical statistics books
still in use stress only some of the assumptions, and then not necessarily
the most important ones. Emphasis is often laid on the supposed need for nor-
mal distributions, but the distribution of each measured variable is irrele-
vant in linear regression and that of the residuals is relevant only to the
significance tests discussed in the next section. Yet the important question
of whether a multiplicative model with proportional disturbances is more
appropriate than an additive one is seldom mentioned. It can also be argued
that most geographical applications of regression analysis are exploratory,
so that violations of the strict statistical assumptions are far less serious
than in, say, the Treasury's predictions of economic variables such as un-
employment and spending. As so often, the care needed with the tools depends
on what is to be done with the finished product.
These are questions about confidence limits and significance tests, and
they are relevant whenever we want to infer something about a population
relationship from sample data. Even when we have data for every administrative
area in a region it is sometimes argued that the boundaries could have been
drawn in an infinite number of alternative ways, so that inferential questions
are still relevant (see Gudgin and Thornes, 1974).
Without further information these standard errors can only be interpreted in relative terms. They all depend on the square root of n or something close to n, so a fourfold increase in sample size halves the uncertainty of regression estimates if other things are equal. But unless the probability distribution of the disturbances is specified we cannot say what proportion of samples are likely to give estimates within, say, two standard errors of the true value. The most convenient, and therefore commonest, assumption is that the disturbances are normally distributed.
If the 95% (or 99%, or other) confidence limits for an estimated regression coefficient b lie on opposite sides of some prespecified value then we cannot safely say the true coefficient β differs from this value, since at least 5% (or 1%, etc.) of all possible samples from a population with the hypothetical β would yield at least as big a discrepancy as has been observed. This is the idea behind significance tests of regression coefficients (see Fig. 13). Any value of β may be specified beforehand as a null hypothesis. The obvious possibilities are (1) a value expected on theoretical grounds; (2) the value found in a previous study; (3) zero.

The third type of test, against the null hypothesis that β = 0, is even easier. It boils down to asking how many standard errors away from zero the observed slope, b, lies. Most computer programs for regression print the ratio of b to its standard error, generally labelled as 't', and this can be compared with tables of Student's t. Our multiple regression equation (9) for Scottish rainfall has coefficient standard errors of 0.30 and 1.02, giving t values of 6.0 and 5.1 (the sign is immaterial). These are overwhelmingly significant even at the 0.1% level. In other words the chances are far less than 1 in 1000 that we have simply an unrepresentative sample from a regional rainfall distribution that shows no systematic dependence on height or distance east. A test against a zero null hypothesis is however only useful when there are strong grounds for expecting a particular effect to be absent or negligible, which of course takes us back to type (1) or (2) null hypotheses. And if the sample size is big enough even a tiny and geographically unimportant difference between sample b and expected β will be statistically significant.

A related F ratio can be used to compare any two linear regressions fitted to the same data and dependent variable. The numerator is proportional to the extra Y variance explained per new predictor, the denominator to that still unexplained by p predictors. The greater the extra explained variance the larger the ratio becomes and the less likely it is that the improved fit is a sampling fluke. If the ratio exceeds the tabulated value of F the improvement is significant at the chosen level. Applied to a complete regression the same test shows whether R² as a whole is significant, and R² must be quite small not to be significant (less than 0.2 for large samples at the 5% level). If so the t or F tests of individual predictors are usually non-significant too. But this does not always apply in reverse: the regression as a whole may be significant, but not any of the individual effects. This paradoxical result occurs when a pair or set of predictors are so highly intercorrelated that controlling any one of them reduces their combined explanatory power to an insignificant level. This is yet another manifestation of the multicollinearity problem.
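The t ratios quoted for equation (9) follow directly from the coefficients and standard errors, as the short check below shows.

```python
# t ratios for the two partial coefficients of equation (9): each
# estimate divided by its standard error, using the values in the text.
b = [1.8, -5.2]       # partial regression coefficients
se = [0.30, 1.02]     # their standard errors

t = [round(abs(bi) / si, 1) for bi, si in zip(b, se)]
print(t)  # t values of about 6.0 and 5.1; the sign is immaterial
```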
Partial t or F tests are often used as a means of screening variables and selecting the 'best' regression with a given number of predictors out of a wide choice (Draper and Smith, 1966, ch. 6). One way to do this is to fit many alternative regressions and keep whichever performs best, but the nominal significance levels can no longer be trusted or checked when large numbers of regressions are tried and discarded. And, most serious of all, the more significance tests are carried out the greater the chance of capitalising on a statistically significant but inexplicable result that really is a 1 in 20 or 1 in 100 sampling fluke. It must also be remembered that the tighter we set our significance levels to avoid accepting fortuitously strong effects, the more we are liable to reject real population relationships that happen to be weak in our sample data. This is by far the greater danger when samples are small. For all these reasons significance tests are a poor substitute for prior knowledge and critical judgment.

IV SPECIAL APPLICATIONS

(i) Causal models

Regression methods can also be used to explore the existence and relative strength of relationships, not their precise form for which there are no empirical or theoretical yardsticks. This is a broader application of regression analysis, and involves assessment of the realism of alternative cause-effect models rather than calibration of one whose applicability is not in question. As an introduction to this kind of exploratory work we consider here the types of interrelationship that may exist between three variables, and only outline the extension to more complicated situations.

The interpretation of the possible causal structures can be derived from the covariance equations which led to formulae (8a-b) for the partial regression coefficients. The following situations can be distinguished. In the simplest, neither X has any direct effect on Y, and all simple and partial correlations and regressions between Y and either X are zero; a correlation between the X's does not alter the situation.
Each predictor then amplifies the direct effect of the other. The Scottish
rainfall example is a good illustration. If only simple regressions are con-
sidered the orographic effect is exaggerated by the oceanic influence also
affecting most of the high ground, and vice versa. The modest correlation of
-.35 between the predictors is sufficient to inflate partial b's of 1.8 and
-5.2 to simple b's of 2.4 and -7.4 for height and distance east respectively.
Whichever predictor is taken first, ignoring the other leads to a considerable
bias in the regression estimate. Multiple regression is essential for a truer
picture.
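The inflation of simple b's by a correlation between the predictors can be sketched numerically with invented data (not the Scottish figures). The partial b's are found by solving the two-predictor normal equations on the variances and covariances, the route taken by the text's formulae (8a-b), which are not reproduced in this excerpt.

```python
# Invented data (not the Scottish figures). Partial b's are found by
# solving the two-predictor normal equations on the covariances.

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def partial_bs(x1, x2, y):
    """Partial regression coefficients of y on x1 and x2."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

x1 = [0, 1, 2, 3, 4]                           # 'height'
x2 = [4, 4, 2, 1, 0]                           # 'distance east': falls as x1 rises
y = [2 * a - 3 * b for a, b in zip(x1, x2)]    # exact plane: partial b's 2 and -3

b1, b2 = partial_bs(x1, x2, y)
simple_b1 = cov(x1, y) / cov(x1, x1)
simple_b2 = cov(x2, y) / cov(x2, x2)
print(round(b1, 6), round(b2, 6))                # recovers 2.0 and -3.0
print(round(simple_b1, 6), round(simple_b2, 6))  # 5.3 and -4.71875
```

Because the predictors are negatively correlated and their direct effects have opposite signs, each simple b overstates the corresponding direct effect, just as in the reinforcement case described above.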
(7) The opposite case is suppression, where direct and indirect effects are
in competition and tend to cancel out. This occurs whenever either just one
or all three direct links are negative. What matters here is the sign of the
partial (not simple) b describing the effect on Y of each X. A simple b can
conceivably have the opposite sign to its partial if the indirect effect
through the third variable outweighs the direct effect. It could even be zero
if the direct and indirect effects cancelled out exactly (though this cannot
happen to both simple regressions at once). Suppression would occur in the
Scottish rainfall example if the topography of the region were reversed so that the high ground lay in the east: there would now be only one negative direct link, that between rainfall and distance east. This would not affect the observed partial b's if the orographic
and oceanic tendencies at work really are additive and linear. But the simple
b's would drop to 1.3 and -3.1 and the corresponding simple correlations to
0.56 and -.41 instead of 0.78 and -.73 as actually observed. This is because
orographic rainfall in the east would go some way to offsetting the oceanic
influence in the west, giving a more uniform overall distribution of rainfall.
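A similar invented sketch shows the suppression case. The decomposition used is the path-analysis one noted later in this section: the simple b of a predictor equals its direct effect plus the indirect effect routed through the other predictor, and with a positive correlation and opposing direct effects the sign can reverse.

```python
# Invented sketch of suppression: positively correlated predictors with
# opposing direct effects. The simple b of x1 equals its direct effect
# plus the indirect effect through x2, and here the sign is reversed.

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

x1 = [0, 1, 2, 3, 4]
x2 = [0, 1, 3, 4, 3]                           # rises with x1: positive correlation
y = [2 * a - 3 * b for a, b in zip(x1, x2)]    # direct effects +2 and -3

direct = 2
indirect = -3 * cov(x1, x2) / cov(x1, x1)      # effect routed through x2
simple_b1 = cov(x1, y) / cov(x1, x1)
print(round(simple_b1, 6), round(direct + indirect, 6))  # both -0.7
```

The direct effect of x1 is +2, yet its simple b is -0.7: ignoring x2 would suggest the wrong sign of relationship altogether.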
The different patterns of simple and partial b's, and to a lesser extent partial r's, that characterise these seven situations can be used to distinguish between alternative causal models for the observed relationship of three variables. This can be done by trial and error or systematically using the flow chart of Fig. 14. The diagnostic questions are whether neither, one, or both partials are so close to zero as to be negligible, and how the predictors are related. The first is a subjective matter when we only have sample estimates; for the second, where both predictors are relevant the existence and sign of the correlation between them determines whether their effects are separate (case 5), reinforcing (case 6), or suppressing (case 7).

Causal models for relationships among more than three variables can be tested in a similar way if they can be represented by arrow diagrams without feedback loops. The general principle, first noted by the sociologist H.A. Simon, is that the absence of an arrow must be reflected in a near-zero partial b or r when any intervening variables and/or common causes are held constant. Repeated application of multiple regression or partial correlation to each partly or wholly dependent variable will either confirm the model or suggest necessary modifications. The classic text is that by Blalock (1961), and Mercer (1975) gives a clear account of an application in urban social geography.

The Simon-Blalock approach has much in common with the technique of path analysis, which originated in biology and is described in social statistics texts such as Heise (1975) and Kerlinger and Pedhazur (1973). The chief difference is that path analyses generally use standardised partial regression coefficients, which are b's measured in standard deviations of Y per standard deviation of X (they occurred unannounced in our discussion of regression on p predictors). The standardised form of a simple b is r, so the total effect of one variable on another is their correlation. It can be found from the arrow diagram as the sum, over all paths linking the variables, of the product of standardised b's along each path. In this way the importance of direct, indirect, and other paths between variables can be compared within one study. Standardised b's should not be used for comparisons between studies since they depend on sample variability, but the path analysis principle can be applied to unstandardised regressions as in our discussion of indirect, reinforcing, and suppressing effects.

(ii) Dummy variables

A qualitative dependent variable can be handled by regressing a dummy Y, coded 0 or 1, on the predictors; this is closely related to linear discriminant analysis (see King, 1969, 205-7). The main drawback is that the scatter about the regression cannot be homoscedastic, so the least squares regression estimates are not as reliable as they could be. Wrigley (1976) in another monograph in this series describes a more sophisticated technique, logit analysis, that overcomes this problem and can be extended to qualitative dependent variables with more than two categories. The fitted values can be transformed back to get the best possible prediction of the probability of presence of whatever is represented by Y = 1. Wrigley gives the example of predicting how likely people are to suffer from acute bronchitis given their cigarette consumption.
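The heteroscedasticity drawback can be seen in a minimal sketch with invented presence/absence data: the least-squares line through a 0/1 variable yields fitted 'probabilities', but the variance of a 0/1 variable is p(1 - p), which necessarily changes along the line, and the raw fitted values can even stray outside the 0-1 range, which is why a transformation such as the logit is attractive.

```python
# Least-squares regression of a 0/1 dummy Y on one predictor
# (invented data). The fitted values act as probabilities, but the
# residual scatter cannot be homoscedastic: Var(Y) at a given x is
# p(1 - p), which changes with the fitted p.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 1, 0, 1, 1, 1]   # hypothetical presence/absence

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
a = my - b * mx
for xi in x:
    p = a + b * xi                                # fitted 'probability'
    print(round(p, 3), round(p * (1 - p), 3))     # variance varies with p
# note: the first fitted value is slightly negative and the last exceeds
# one, so the crude linear fit strays outside the admissible 0-1 range
```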
Dummy variables can also be used as predictors in a multiple regression. For example, the magnitude of river floods is likely to increase with rainfall but may also depend on geology since the less permeable the ground the quicker storm rainfall gets into the river. Rock type can be taken into account by a dummy predictor coded 0 for one class of rock and 1 for the other; its coefficient shifts the intercept of the flood-rainfall regression, giving a pair of parallel lines. This method can be extended to several parallel trend lines, which amounts to analysis of covariance; to horizontal lines, which is equivalent to analysis of variance; to lines with different slopes but the same intercept; and to mixtures of all these cases. The multiple regression approach makes clear the links between the different possibilities, and allows easy comparison of their goodness of fit. Silk (1976) gives a detailed but readable account. Further applications of dummy predictors are described by Draper and Smith (1966, ch. 5) and Mather and Openshaw (1974).

Two further applications of multiple linear regression to special kinds of variables should also be noted. One is trend surface analysis in which Y is some spatially distributed variable and the X's are locational coordinates (e.g. eastings and northings) and their powers and products up to some maximum order. The aim is generally to see what order of surface adequately describes the broad spatial pattern, as judged by the improvement from one order to the next. Details and applications are described by Unwin (1975) in another monograph in this series.

The final special case is the autoregressive modelling of time series (Box and Jenkins, 1970). Here the X's are Y values one, two, etc. time intervals ago. The carryover effects in the series at these different time lags amount to partial regression coefficients of the series on its own past and can be found from the correlations between Y and the X's, i.e. the autocorrelations of the series with itself at different lags. Applications include the study of fluctuations in economic, climatic, and hydrologic time series. Spatial series in geomorphology have also been investigated in the same way, with distance (downslope or downriver) replacing time.

Multiple linear regression in its basic form is very widely used. The special applications to spatial patterns, time series, qualitative variables, and causal models make it even more versatile. The geographer who understands the fundamentals of linear regression is well placed to analyse most kinds of geographical data and to appreciate published quantitative research.

BIBLIOGRAPHY

A. Applications

Bleasdale, A. and Chan, Y.K., (1972), Orographic influences on the distribution of precipitation. 322-333 in: Distribution of precipitation in mountainous areas, 2, World Meteorological Organization (Geneva).

Champion, A.G., (1972), Urban densities in England and Wales: the significance of three factors. Area, 4, 187-192.

Ferguson, R.I., (1975), Meander irregularity and wavelength estimation. Journal of Hydrology, 26, 315-333.

Haynes, R.M., (1973), Crime rates and city size in America. Area, 5, 162-165.

Krumbein, W.C., (1959), The sorting out of geological variables illustrated by regression analysis of factors controlling beach firmness. Journal of Sedimentary Petrology, 29, 575-587.

Mercer, J., (1975), Metropolitan housing quality and an application of causal modelling. Geographical Analysis, 7, 295-302.

Parker, A.J., (1974), An analysis of retail grocery price variations. Area, 6, 117-120.

Smith, G.C., (1976), The spatial information fields of urban consumers. Transactions, Institute of British Geographers, new series 1, 175-189.

Taaffe, E.J., Morrill, R.L., and Gould, P.R., (1963), Transport expansion in underdeveloped countries: a comparative analysis. Geographical Review, 53, 503-529.

B. Assumptions and inference

Gould, P., (1970), Is statistix inferens the geographical name for a wild goose? Economic Geography, 46, 439-448.

Gudgin, G. and Thornes, J.B., (1974), Probability in geographic research: applications and problems. The Statistician, 23, 157-177.

Mather, P. and Openshaw, S., (1974), Multivariate methods and geographical data. The Statistician, 23, 283-308.

Poole, M.A. and O'Farrell, P.N., (1971), The assumptions of the linear regression model. Transactions, Institute of British Geographers, 52, 145-158.

Unwin, D.J. and Hepple, L.W., (1974), The statistical analysis of spatial series. The Statistician, 23, 211-227.

C. Advanced texts

Draper, N.R. and Smith, H., (1966), Applied regression analysis. (Wiley, New York).

Huang, D.S., (1970), Regression and econometric methods. (Wiley, New York).

Johnston, J., (1972), Econometric methods (2nd edition). (McGraw-Hill, New York).

Kerlinger, F.N. and Pedhazur, E.J., (1973), Multiple regression in behavioral research. (Holt, Rinehart and Winston, New York).

Surrey, M.J.C., (1974), An introduction to econometrics. (Clarendon Press, Oxford).

D. Other references

Blalock, H.M., (1961), Causal inferences in nonexperimental research. (University of North Carolina Press, Chapel Hill, North Carolina).

Box, G.E.P., (1966), Use and abuse of regression. Technometrics, 8, 625-629.

Box, G.E.P. and Jenkins, G.M., (1970), Time series analysis, forecasting and control. (Holden-Day, San Francisco).

Cliff, A.D. and Ord, J.K., (1972), Testing for spatial autocorrelation among regression residuals. Geographical Analysis, 4, 267-284.

Ehrenberg, A.S.C., (1975), Data reduction. (Wiley, London).

Heise, D.R., (1975), Causal analysis. (Wiley, London).

King, L.J., (1969), Statistical analysis in geography. (Prentice Hall, Englewood Cliffs, New Jersey).

Silk, J., (1976), A comparison of regression lines using dummy variable analysis. Geographical Papers, 44, Department of Geography, University of Reading.

Sprent, P., (1969), Models in regression. (Methuen, London).

Till, R., (1973), The use of linear regression in geomorphology. Area, 5, 303-308.

Unwin, D.J., (1975), An introduction to trend surface analysis. Concepts and techniques in modern geography, 5. (Geo Abstracts Ltd, Norwich).

Wrigley, N., (1976), An introduction to the use of logit models in geography. Concepts and techniques in modern geography, 10. (Geo Abstracts Ltd, Norwich).