International Journal of Geographical Information Science, 2003, Vol. 17, No. 1, 69–92
Review Article
Is inductive machine learning just another wild goose chase?
MARK GAHEGAN
GeoVISTA Center, Department of Geography, The Pennsylvania State
University, 302 Walker Building, University Park, PA 16802, USA;
e-mail: mng1@psu.edu
Abstract. The research reported here contrasts the roles, methodologies and
capabilities of statistical methods with those of inductive machine learning
methods, as they are used inferentially in geographical analysis. To this end,
various established problems with statistical inference applied in geographical
settings are reviewed, based on Gould's (1970) critique. Possible solutions to the
problems outlined by Gould are suggested via reviews of: (i) improved statistical
methods, and (ii) recent inductive machine learning techniques. Following this,
some newer problems with inference are described, emerging from the increased
complexity of geographical datasets and from the analysis tasks to which we put
them. Again, some solutions are suggested by pointing to newer methods. By way
of results, questions are posed, and answered, relating to the changes brought
about by adopting inductive machine learning methods for geographical analysis.
Specifically, these questions relate to analysis capabilities, methodologies, the role
of the geographer and consequences for teaching and learning. Conclusions argue
that there is now a strong need, motivated from many perspectives, to give
geographical data a stronger voice, thus favouring techniques that minimize the
prior assumptions made of a dataset.
1. Introduction
In his famous article critiquing the use of inferential statistics, "Is statistix inferens
the geographical name for a wild goose?", Peter Gould (1970) lays bare the many
premises upon which inferential statistical analysis is founded, alternatively ques-
tioning their validity and the blind faith placed in them by geographers. These
questions are revisited here in the light of a digital revolution that is providing
torrents of data where once was only a trickle (Miller and Han 2001). Consequently,
we are confronted with the difficulty of scaling up our analysis to embrace datasets
that are both voluminous in terms of numbers of records or samples represented (n),
and deep in terms of the number of separate attribute dimensions over which data
are gathered (p). As well as making additional demands on existing analysis methods,
these datasets also generate the need for new types of analysis procedure, to support
exploration, mining and knowledge discovery (Buttenfield et al. 2001, Gahegan et al.
2001). It is not always clear that traditional statistical techniques can address these
new challenges, and where they can, there may be severe consequences in terms of
International Journal of Geographical Information Science
ISSN 1365-8816 print/ISSN 1362-3087 online © 2003 Taylor & Francis Ltd
http://www.tandf.co.uk/journals
DOI: 10.1080/13658810210157778
70 M. Gahegan
computational burden, significance testing, demands for sample data and so forth.
Openshaw and Openshaw (1997, p. 3) describe the current situation thus: "Sadly,
nearly all of the available methods for analysis, modelling and processing to extract
value date from an earlier period of history where data were scarce and the analyst
had to rely on his or her intuitive skills aided by an intimate knowledge of what
little information was available to formulate analysis tasks."
Within the domain of geographical analysis, the use and capabilities of traditional
inferential statistics are here contrasted with an alternative form of computational
inference based on inductive machine learning. The discussion is restricted to infer-
ence used for predicting some unknown characteristics or properties, as opposed to
the identification of underlying processes or models. The latter is possible also with
machine learning, for example by utilizing tools to automatically construct Bayesian
Belief Networks, but falls outside the scope of this paper. Philosophically, statistical
inference and machine learning (ML) are based, to differing extents, around a style
of inference known as induction, allowing the analyst to infer some generic outcomes
from specific examples; to wit: "By induction, we conclude that facts, similar to
observed facts, are true in cases not examined" (Peirce 1878). This contrasts with
deduction, in which facts are asserted as true by computation against some a priori
model. Section 2 below describes the process of inductive inference in detail.
Machine learning and inferential statistics typically differ in their use of prior
knowledge. Inferential statistics uses observations to condition (shape) the form of a
distribution model that is usually provided by the analyst. This prior assumption
represents a self-imposed limit in terms of model complexity and the ability to adapt
to the data. By contrast, many machine learning techniques construct a distribution
model using evidence gleaned from the data alone, i.e. they are data-driven. This
difference leads to major methodological disparities affecting training, accuracy
analysis, goodness of fit and significance testing. Thus it can appear at first glance
that these two types of inference are for quite different purposes, yet we see a growing
trend to employ neural, genetic and rule-based induction methods in place of more
traditional forms of geographic analysis (Benediktsson et al. 1990, Byungyong and
Landgrebe 1991, Lees and Ritman 1991, Civco 1993, Openshaw 1993, Fisher 1994,
Yoshida and Omatu 1994, Paola and Schowengerdt 1995, Foody et al. 1995, German
and Gahegan 1996, Friedl and Brodley 1997, Fischer and Leung 1998, Bennett et al.
1999, Openshaw and Abrahart 2000). The reasons for this are largely concerned
with practicality.
Firstly, we can substitute a learned model, derived when needed from sample data,
for a model that must be provided beforehand. This can lead to greater
flexibility, and less reliance on expert knowledge for configuration. Such flex-
ibility may well prove crucial; as geographers integrate ever more data to study
complex phenomena such as human-environment interaction or population demo-
graphics and epidemiology, the difficulties in specifying a reliable model in advance
rise accordingly. Discovering, or inducing, such a model from a limited set of
observations may provide a practical alternative.
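As an illustration, inducing a predictor from examples rather than specifying a model beforehand can be sketched in a few lines of Python. The nearest-neighbour learner and the land-cover labels below are my own hypothetical illustration, not taken from the paper; the point is only that the target function is derived from the samples themselves.

```python
import math

def learn_nearest_neighbour(examples):
    """Induce a predictor from labelled examples: each query is
    assigned the label of its closest training sample in attribute
    space. No distributional model is supplied in advance."""
    def predict(query):
        nearest = min(examples, key=lambda ex: math.dist(ex[0], query))
        return nearest[1]
    return predict

# Hypothetical training samples: (attribute vector, class label).
samples = [((1.0, 1.0), "urban"), ((1.2, 0.8), "urban"),
           ((8.0, 9.0), "forest"), ((9.1, 8.4), "forest")]
classify = learn_nearest_neighbour(samples)
```

Here the "model" is simply the stored examples plus a distance rule; nothing about the shape of the class distributions has been assumed.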
Secondly, in many complex systems with non-axiomatic components, models
may either be too elaborate to define or else too susceptible to variation in
preconditions; for example, data gathered from a different place requires a different model.
Gould points out (p. 444) that a geographer should expect this latter problem since:
"...all phenomena of interest to the geographer are never independent in the
fundamental dimensions of his enquiry". We must then decide if this interdependence can
Is inductive machine learning just another wild goose chase? 71
for expanding our arsenal of inferential tools to include machine learning methods.
By doing so we are able to discard some problematic underlying assumptions. But
we must also modify and declare some in addition, all of which have a direct impact
on the questions we can investigate, the methodology we must use and our interpreta-
tion of the results produced (5). The conclusions present a summary of the findings
and outline the major research themes still to be addressed in this arena.
Figure 1. The inductive learning methodology. (a) The target function (V ) is learned from
examples, and (b) then applied to predict unknown values.
governed by the number of these small functions used, and the mechanisms by which
they are combined.
In many ML methods there is no requirement for the same overall functional
form to be used throughout the entire range of the data, nor indeed to assume that
just one functional form is adequate. Thus, irregular and multi-modal distributions
cause no additional complications, provided enough learning capacity is available
in the tool, since they can be constructed by the piecewise combination of more
primitive functions. The additional flexibility is very useful in situations where
relationships between variables are complex and/or unknown.
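The piecewise idea can be made concrete with a minimal Python sketch (the data and the choice of equal-width pieces are illustrative assumptions of mine): no single global functional form is fitted, and a bimodal relationship poses no difficulty because each interval carries its own trivial local model.

```python
def piecewise_model(xs, ys, n_pieces=4):
    """Fit a piecewise-constant model: split the x-range into equal
    intervals and use the local mean of y within each one. Capacity
    grows with n_pieces; no global form is assumed. Assumes every
    interval contains at least one sample."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_pieces
    means = []
    for i in range(n_pieces):
        a, b = lo + i * width, lo + (i + 1) * width
        vals = [y for x, y in zip(xs, ys)
                if a <= x < b or (i == n_pieces - 1 and x == hi)]
        means.append(sum(vals) / len(vals))
    def predict(x):
        i = min(int((x - lo) / width), n_pieces - 1)
        return means[i]
    return predict
```

Increasing `n_pieces` corresponds to increasing the learning capacity mentioned above; too many pieces and the model starts to chase noise, the over-training problem discussed later.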
other simplistic relationship, should be assumed. Gould argues (in 1970!) that with
improvements in computational capacity, and in associated software, there is no
longer a reason to strive for simplicity where it is not warranted. In the meantime,
research in statistics has made significant progress in the support provided for more
complex functions (McGarigal and Marks 1995), hierarchies of functions that better
integrate scale-based analysis (Kreft and DeLeeuw 1998, Johnson et al. 1999) and
extreme value theory to address very rare events (Smith 1990). Geographically
weighted regression (Brunsdon et al. 1998) addresses this same issue by making local
subsets where the functional form is the same, but the parameterization differs.
However, more simplistic statistical models are still in widespread use, possibly
reflecting the ease with which they can be applied and understood, rather than the
need for computational simplicity.
Large families of ML methods have also been developed to address the modelling
of complex functional forms. As described above in 2.2, complex functions can be
simulated by ML methods by the assumption of many simpler, low-level functions,
such as decision rules or hyperplanes. Neural networks are perhaps the most widely
used method in this regard. For example, the General Regression Neural Network
(GRNN: Specht 1991) provides a more flexible form of regression, where distances
from the fitted line are applied piecewise, locally rather than globally, allowing more
complex functional relationships to be modelled with ease.
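At its core the GRNN amounts to a kernel-weighted average of the training outputs, so a one-dimensional sketch fits in a few lines of Python. The bandwidth `sigma` plays the role of the network's smoothing parameter, and the interface shown is my own simplification rather than Specht's formulation.

```python
import math

def grnn_predict(train_x, train_y, query, sigma=1.0):
    """GRNN-style prediction: a Gaussian-kernel-weighted average of
    the training outputs, so the fitted surface bends locally to the
    data instead of following one global regression line."""
    weights = [math.exp(-(x - query) ** 2 / (2 * sigma ** 2))
               for x in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)
```

Nearby samples dominate each prediction, which is what allows distances from the fitted surface to act locally rather than globally.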
3.2. The sample
Assumptions include the randomness of sample selection, problems of generaliz-
ing from a sample to a population and the chances that the sample contains unwanted
bias of some sort. These problems still pervade spatial statistics; for example, a
semivariogram (a graphical tool for exploring spatial dependence in data) will produce
misleading results when samples are preferentially clustered or data show significant
heteroskedasticity (Isaaks and Srivastava 1989, p. 527). Improvements in sampling
strategies help to alleviate some of these problems (Kalton and Anderson 1986,
Thompson 1992) and simulation techniques such as the Monte Carlo method can
help explore for randomness and bias problems (Bremaud 1999). Using relative
variograms, or other locally-calculated measures of variance, can help offset the
effects of heteroskedasticity.
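For concreteness, a plain empirical semivariance for one lag class can be computed as below (pure Python, hypothetical samples; a relative variogram would additionally rescale each lag's value by a local mean, which is not shown). Preferential clustering biases which pairs fall into each lag class, which is one route to the misleading results noted above.

```python
import math
from itertools import combinations

def semivariance(points, values, lag, tol=0.5):
    """Empirical semivariance at one lag: half the mean squared
    difference over all sample pairs separated by roughly `lag`
    (within `tol`)."""
    sq = [(values[i] - values[j]) ** 2
          for i, j in combinations(range(len(points)), 2)
          if abs(math.dist(points[i], points[j]) - lag) <= tol]
    return 0.5 * sum(sq) / len(sq)
```

Computed over a range of lags, these values trace out the experimental semivariogram to which a model would then be fitted.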
In part, ML methods overcome this problem by avoiding assumptions about the
sample, though its representativeness is tacitly assumed. The whole area of sampling
theory, and the biases associated with both the data and the generalization methods
used, has formed a central strand in the development of machine learning methods
(Benjamin 1990, Briscoe and Caelli 1996), and is well understood.
Figure 2. For this distribution of samples, using only three hyperplanes or oblique decision
rules, the feature space cannot be subdivided so that a perfect classification results.
The two diamond samples inside the dashed oval will likely be mis-classified, since
this represents a minimization of error. Any bias in the distribution of such difficult
training samples will propagate into the result.
Solving bias problems requires careful initial calibration, to ensure enough learning
capacity is available, though only just enough, otherwise over-training may occur
(Gahegan 2000). Utgoff (1986) describes how the bias exhibited during training can
itself be learned, so that it might be better understood.
Figure 3. Comparing simple geometric shapes and fractional intersection of their volume in
a p dimensional feature space, after Scott (1992) and Landgrebe (1999).
the model used does not generalize too far beyond the observed properties of the
data. However, if p is increased, this ratio does not stay constant, but decreases
rapidly to a state where the surrounding box is almost entirely empty and is a very
poor representation of the data. By p=4 the ratio of the volumes is well below 50%,
and at p=7 the hypersphere only accounts for about 4% of the volume of the
hypercube. In other words, the hypercube is certainly no longer a useful approximator
of any spherical cluster of data points, since it is 96% empty.
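The figures quoted above can be checked directly: a radius-r hypersphere has volume π^(p/2) r^p / Γ(p/2 + 1) and its bounding hypercube has volume (2r)^p, so their ratio depends only on p. A short Python check (the formula is standard; the function name is mine):

```python
import math

def sphere_to_cube_ratio(p):
    """Fraction of a bounding hypercube's volume occupied by the
    inscribed p-dimensional hypersphere; the radius cancels out."""
    return math.pi ** (p / 2) / math.gamma(p / 2 + 1) / 2 ** p

# The ratio collapses as p grows: ~0.79 at p=2, ~0.31 at p=4,
# ~0.04 at p=7, i.e. the hypercube is then about 96% empty.
```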
Were this problem to be confined to only rectangular or orthonormal structures
then it would simply require that we choose statistical models with greater care as
p increases. But unfortunately, the same geometric problems occur with other distri-
bution functions too; in fact it can be generally shown that for an arbitrary shape,
as dimensionality is increased, more of the volume of the object becomes concentrated
in an outer shell, and less in the centre. So, when considering a Gaussian distribution,
the volume of the curve migrates quickly from the centre to the tails of the distribu-
tion, producing a rather counter-intuitive flat shape. Note that this eect is not a
result of a lack of training examples, high variance or poor model choice, but simply
a consequence of geometry. An insightful explanation of this phenomenon is given
by Landgrebe (1999), who also points out the following two important consequences:
that the space is largely empty and that the migration of volume to the outer shell
or corners causes great difficulties for multivariate density estimation (Scott 1992,
Wand and Jones 1995, Jimenez and Landgrebe 1998).
The point here is that familiar distributional forms do not perform well in high-
dimensional settings; they were never designed to. It becomes vital, instead, to take
a piecewise or hierarchical approach, tackling the problem by fragmenting the space
into lower dimensional partitions only where the feature space contains useful
information, and ignoring other empty portions. This is why neural networks and
decision trees often meet with success in these settings (2.2).
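This fragment-only-where-informative behaviour can be sketched with a toy axis-aligned decision tree in Python. The splitting criterion below, counting how mixed each side remains, is a deliberately crude stand-in of mine for real impurity measures, and the data are invented; the point is that pure or empty portions of the space are never subdivided further.

```python
def grow(samples, depth=3):
    """samples: list of (attribute_tuple, string_label). Returns a
    leaf label, or an (axis, threshold, left, right) split node.
    Regions that are already pure are left alone."""
    labels = [lab for _, lab in samples]
    if depth == 0 or len(set(labels)) == 1:
        return max(set(labels), key=labels.count)  # majority leaf
    best = None
    for axis in range(len(samples[0][0])):
        for t in {x[axis] for x, _ in samples}:
            left = [s for s in samples if s[0][axis] < t]
            right = [s for s in samples if s[0][axis] >= t]
            if not left or not right:
                continue
            # Crude impurity: distinct labels remaining on each side.
            score = len({l for _, l in left}) + len({l for _, l in right})
            if best is None or score < best[0]:
                best = (score, axis, t, left, right)
    if best is None:
        return max(set(labels), key=labels.count)
    _, axis, t, left, right = best
    return (axis, t, grow(left, depth - 1), grow(right, depth - 1))

def classify(node, x):
    """Follow splits down to a leaf label."""
    while isinstance(node, tuple):
        axis, t, left, right = node
        node = left if x[axis] < t else right
    return node
```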
more necessary than in spatial or spatio-temporal data mining where the physical
dimensions add considerably to the number of tests to be applied (Ester et al. 1998,
Koperski et al. 1999).
To summarize, traditional statistical methodologies can experience diculties in
exploratory settings where they are put to use in a manner for which they were
never designed. Machine learning researchers have tackled this vexing issue by
providing techniques that can summarize and generalize from learning outcomes,
thus avoiding a case-by-case assessment of significance (Gaines 1996, Brazdil and
Konolige 1990). Significance testing may also prove unreliable if distributions
cannot be conditioned accurately because of a lack of training examples, as
discussed next.
terms of data requirements, allowing them to extend to very large feature spaces
without acquiring a voracious appetite for data.
styles of analysis, concerns the requirement for prior knowledge. It is not necessary
to have a procedural understanding of a problem before using ML to predict or
infer new results.
By adopting machine induction, we move from an explicit model constructed by
a human expert (perhaps indirectly from observations or theory) to an implicit model
constructed directly from examples by an algorithm. Methodology changes accord-
ingly (2). In all cases, reliance on the human expert is never fully relinquished since
machine learning algorithms require a variety of hands-on intervention to assure
their correct functioning. While one goal is to remove this reliance, because it
demands a level of computational knowledge, another is to build expertise from
the user into the method, as it relates to the domain of application (German and
Gahegan 1999). These goals are not in conflict, though they may appear to be so at
first glance.
5.3. Are we able to examine new kinds of questions and if so, how?
Again the answer is yes; the ability to operate in the absence of prior knowledge
is enabled by substituting data for expertise (Openshaw 2000), with examples used
as a surrogate for this understanding. So, questions can be generated from our
extended ability to extract patterns from data, to categorize and to generalize. These
questions can take the form of hypotheses that shape the start of a more trad-
itional investigation. To this end, inductive learning is being applied within data
mining tools, to uncover previously unknown relationships and patterns in complex
geographical datasets (Ester et al. 1998).
used. By contrast, the error term in inferential statistics is a measure of the goodness-
of-fit of the data to the pre-defined model and not how appropriate the model itself
might be. The simplest way to account for variance in results of ML methods is to
compute an average value over several consecutive training and validation cycles.
Many appropriate measures have been proposed (Schaffer 1993).
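That simplest option, averaging over repeated training and validation cycles, looks like this in Python. The holdout proportion and the stand-in majority-class learner are illustrative choices of mine, not prescriptions from the paper.

```python
import random

def mean_validation_accuracy(samples, learner, cycles=5,
                             holdout=0.25, seed=0):
    """Average accuracy over several consecutive train/validate
    cycles, smoothing out the variance of any single learned model."""
    rng = random.Random(seed)
    scores = []
    for _ in range(cycles):
        data = samples[:]
        rng.shuffle(data)
        cut = int(len(data) * (1 - holdout))
        train, valid = data[:cut], data[cut:]
        model = learner(train)
        scores.append(sum(model(x) == y for x, y in valid) / len(valid))
    return sum(scores) / len(scores)

def majority_learner(train):
    """Trivial stand-in learner: always predict the most common
    training label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top
```

Any inductive learner with the same `learner(train) -> model` shape can be dropped in place of `majority_learner`.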
5.6. Are there implications for teaching and learning about geographical analysis?
One ramification for education is that learned models may be difficult to recover
and to communicate, even if they do lead to improvements in predictive power. The
simple parametric form of many common statistical functions makes the nature of
relationships easy to comprehend and to explain, whereas most machine learning
methods have little or no facility to describe the models they learn in any way that
makes immediate sense to a human. This is not an insurmountable problem; even a
complex model can be progressively reduced to a simpler, more generalized form
for presentation and examination: learning outcomes can be visualised and internal
structures can be summarized (Gaines 1996, Laan 1998, Ankerst et al. 1999).
However, one could also make the counter-argument, namely: is such simplifica-
tion ultimately helpful and/or does it act as a barrier to understanding, rather than
an aid? The complexity of learned models may well depict geography as inherently
complex, and thus challenge our tendency to simplify it. Clearly, there are pedagogic
consequences to face.
Table 1. Various analysis tasks with their statistical and machine learning counterparts.
substituted for their more established statistical counterparts as datasets and tasks
become more complex.
By increasing our reliance on induction we change the role of the expert, since
many initial assumptions need now not be made or tested, but we must instead rely
directly on the truth (representativeness) contained within the dataset. Although
such a goal is perhaps not entirely laudable, since it is probably a good thing to be
intimately familiar with one's data, this is an increasingly impractical requirement
due to the escalating size and complexity of datasets (Openshaw and Openshaw
1997, p. 3).
Difficulty of use is still a real issue with many forms of machine learning; it is
not always straightforward to make informed choices regarding parameter config-
uration. However, this situation is also common for more advanced spatial analysis
tools. Configuration of neural networks, for instance, is no more complex a task
than conducting a geostatistical interpolation: the appropriate use of kriging requires
quite a deep knowledge of available methods, as well as selection of suitable
transformations (spherical, etc.).
To make the descriptions clearer I have contrasted the simpler techniques from
statistics and machine learning. There are many other techniques that merit descrip-
tion, but space considerations have precluded their mention. It is important to point
out that there is by now a good deal of convergence between statistics and machine
learning, especially with more advanced techniques where the need to search through
solution spaces efficiently is a common thread in both disciplines (Moller 1993,
Stewart et al. 1994, Simoudis et al. 1996). For example, Kernel Discriminant Analysis
(Lissoir and Rasson 1998), a statistical classification technique, constructs decision
boundaries by employing a non-linear mapping of the data into some feature space,
via a series of kernel transformation functions. This new space introduces distortions
to allow a cleaner delineation of the classes. Although the theoretical foundation
differs from that of a neural classifier, the functionality and many of the configuration
and training issues are similar. This trend towards convergence between machine
learning and statistical analysis is likely to continue, so their distinction will become
less clear as time passes.
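To make this convergence concrete, the sketch below classifies by comparing per-class Gaussian-kernel density estimates. It mimics the flavour of kernel-based discrimination rather than reproducing Lissoir and Rasson's exact method, and the bandwidth and data are invented; note how close its configuration issues (bandwidth choice, training set) are to those of a kernel-style neural classifier.

```python
import math

def kernel_discriminant(train, bandwidth=1.0):
    """Kernel discriminant classifier sketch: estimate each class's
    density with a Gaussian kernel sum and assign a query to the
    class whose estimated density at that point is highest."""
    by_class = {}
    for x, label in train:
        by_class.setdefault(label, []).append(x)
    def density(xs, q):
        return sum(math.exp(-math.dist(x, q) ** 2 / (2 * bandwidth ** 2))
                   for x in xs) / len(xs)
    def predict(q):
        return max(by_class, key=lambda label: density(by_class[label], q))
    return predict
```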
Acknowledgments
This paper is dedicated to the memory of Peter Robin Gould (1929–2000), whose
many insights are a continuing source of inspiration.
References
Abler, R., Adams, J. S., and Gould, P., 1971, Spatial Organization: The Geographer's View
of the World (Englewood Cliffs, New Jersey: Prentice Hall).
Ankerst, M., Elsen, C., Ester, M., and Kriegel, H.-P., 1999, Visual classification: an
interactive approach to decision tree construction. In KDD '99, Proc. Fifth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (New
York: ACM Press), pp. 392–396.
Anselin, L., 1988, Spatial Econometrics: Methods and Models (Dordrecht: Kluwer).
Anselin, L., 1995, Local indicators of spatial association—LISA. Geographical Analysis, 27,
93–115.
Asimov, D., 1985, The grand tour: a tool for viewing multidimensional data. SIAM Journal
on Scientific and Statistical Computing, 6, 128–143.
Assunção, R. M., and Reis, E. A., 1999, A new proposal to adjust Moran's I for population
density. Statistics in Medicine, 18, 2147–2162.
Bailey, T. C., 1994, A review of statistical spatial analysis in geographical information systems.
Dai, X., and Khorram, S., 1999, Data fusion using artificial neural networks: a case study
on multitemporal change analysis. Computers, Environment and Urban Systems, 23,
19–31.
Dietterich, T. G., 1997, Machine learning research: four current directions. AI Magazine,
Winter, pp. 97–136.
Diggle, P. J., 1983, Statistical Analysis of Spatial Point Patterns (London: Academic Press).
Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L., 1989, A general lower bound
on the number of examples needed for learning. Information and Computation, 82,
247–261.
Ester, M., Kriegel, H.-P., and Sander, J., 1998, Algorithms for characterization and trend
detection in spatial databases. In Proceedings of 4th International Conference on
Knowledge Discovery and Data Mining (KDD '98), New York, USA (Menlo Park, CA:
American Association for Artificial Intelligence), pp. 44–50.
Fisher, P. F., 1994, Probable and fuzzy models of the viewshed operation. In Innovations in
GIS 1, edited by M. Worboys (London: Taylor and Francis), pp. 161–175.
Fischer, M. M., and Leung, Y., 1998, A genetic-algorithms based evolutionary computational
neural network for modeling spatial interaction data. Annals of Regional Science,
32, 437–458.
Fischer, M. M., and Staufer, P., 1999, Optimization in an error backpropagation neural
network environment with a performance test on a pattern classification problem.
Geographical Analysis, 31, 89–108.
Fitzgerald, R. W., and Lees, B. G., 1994, Assessing the classification accuracy of multisource
remote sensing data. Remote Sensing of Environment, 47, 362–368.
Foody, G. M., McCulloch, M. B., and Yates, W. B., 1995, Classification of remotely sensed
data by an artificial neural network: issues relating to training data characteristics.
Photogrammetric Engineering and Remote Sensing, 61, 391–401.
Friedl, M. A., and Brodley, C. E., 1997, Decision tree classification of land cover from
remotely sensed data. International Journal of Remote Sensing, 18, 711–725.
Fukunaga, K., 1990, Introduction to Statistical Pattern Recognition (San Diego, California:
Academic Press).
Gahegan, M., 2000, On the application of inductive machine learning tools to geographical
analysis. Geographical Analysis, 32, 113–139.
Gahegan, M., German, G., and West, G., 1999, Some solutions to neural network configuration
problems for the classification of complex geographic datasets. Geographical
Systems, 6, 3–22.
Gahegan, M., Harrower, M., Rhyne, T.-M., and Wachowicz, M., 2001, The integration
of geographic visualization with databases, data mining, knowledge construction
and geocomputation. Cartography and Geographic Information Science, 28, 29–44.
Gaines, B. R., 1996, Transforming rules and trees into comprehensive knowledge
structures. In Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad,
G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Cambridge, MA: AAAI/MIT
Press), pp. 205–228.
Getis, A., and Boots, B., 1978, Models of Spatial Processes (Cambridge, UK: Cambridge
University Press).
Gehrke, J., Ganti, V., Ramakrishnan, R., and Loh, W.-Y., 1999, BOAT—optimistic
decision tree construction. In Proc. SIGMOD 1999 (New York: ACM Press), pp. 169–180.
German, G., and Gahegan, M., 1996, Neural network architectures for the classification of
temporal image sequences. Computers and Geosciences, 22, 969–979.
Glymour, C., Madigan, D., Pregibon, D., and Smyth, P., 1996, Statistical inference and
data mining. Communications of the ACM, 39, 35–41.
Goetz, A. F. H., and Curtiss, B., 1996, Hyperspectral imaging of the earth: remote analytical
chemistry in an uncorrelated environment. Field Analytical Chemistry and Technology,
1, 67–76.
Gould, P. R., 1970, Is Statistix Inferens the geographical name for a wild goose? Economic
Geography, 46, 439–448.
Gould, P. R., 1999, Becoming a Geographer (Syracuse, New York: Syracuse University Press).
Haining, R. P., 1990, Spatial Data Analysis in the Social and Environmental Sciences
(Cambridge: Cambridge University Press).