1 Introduction
1.1 Environmental niche modeling
Modeling species distributions for biodiversity conservation and obtaining a better
understanding of the relationship between environmental factors and species distribution
are important tasks (Funk & Richardson 2002; Thuiller 2004). With the increasing
availability of digital ecological data (Graham et al. 2004; Wieczorek et al. 2004),
environmental niche modeling has gained much attention for various ecological
applications (Pearce & Boyce 2006). For example, niche models have been used to study potential
distributions of invasive species (Guo et al. 2005; Thuiller et al. 2005), study the response
of species distribution to climate change (Kueppers et al. 2005; Broenniman et al. 2006),
and perform biodiversity assessments (Feria & Peterson 2002; Chefaoui et al. 2005).
A range of environmental niche models have been proposed for studying species
distributions such as BioClim (Busby 1986), Domain (Carpenter et al. 1993), linear,
multivariate and logistic regressions (Mladenoff et al. 1995; Bian & West 1997; Kelly et
al. 2001; Felicisimo et al. 2002; Fonseca et al. 2002), generalized linear modeling and
generalized additive modeling (Frescino et al. 2001; Guisan et al. 2002a), discriminant
analysis (Livingston et al. 1990; Fielding & Haworth 1995; Manel et al. 1999a),
classification and regression tree analysis (De'ath & Fabricius 2000; Fabricius & De'ath
2001; Kelly 2002), genetic algorithms (Stockwell & Peters 1999), artificial neural
networks (Manel et al. 1999a; Spitz & Lek 1999; Moisen & Frescino 2002), and support
vector machines (Guo et al. 2005).
3. Status bar
In ModEco, the status bar includes three panes: a progress bar to indicate the
progress of a time-consuming operation, a coordinate pane to show the geographical
coordinate of the current mouse position, and an attribute pane to show the attribute value
in a data layer associated with the current mouse position.
4. Output window
The output window displays the run-time information of a model,
which is especially useful when the model is time-consuming.
5. Menu bar and tool bar
They are similar to those of common Windows applications.
2 Project management
2.1 Project structure
In ModEco, the data are managed using an XML-based project file with the file
extension .sml. It includes the following six elements (Fig. 3):
1. Environmental factor group
A factor data group encapsulates a set of factor layers in raster format used in
predicting a species distribution. The raster layers may be temperature or precipitation
maps that influence the species distribution. However, nominal data (such as soil
type) are not supported in this version of the package.
More than one factor group can be managed in a project. Thus, the user may train
a model using one group (e.g. the historic data), but predict using another group (e.g. the
contemporary data).
The following metadata are necessary for a factor layer:
Name
Data unit
Key words
If the user wants to train a model using one group but predict based on another
group, their data unit and key words must be consistent. This tool also provides a
function to match the layers in different groups automatically.
When importing factor layers, the acceptable raster formats are BIP files and
ArcGIS ASCII files. A BIP file requires a header file (*.hdr) that specifies the size and
geographic extent of the layer.
2. Species data points
Species data points are managed using a simple text file (*.smp) similar to the
following:
-119.809924 37.485305 lobata
-119.804719 37.471919 lobata
-119.936354 37.407816 lobata
-119.951348 37.371039 lobata
-118.632031 36.529882 lobata
-116.927998 34.273594 N/A
-122.627998 40.673594 N/A
-121.677998 38.273594 N/A
-117.277998 34.523594 N/A
-115.977998 35.873594 N/A
where the first two columns specify the geographical coordinates of a species
point. If the third column reads N/A, no species was observed at this point;
otherwise, it is the name of a species. A species layer that includes more than one species
is a multi-species layer; otherwise, it is a single-species layer. A single-species layer
without N/A rows is a one-class layer; one with N/A rows is a two-class layer. Different
models are suitable for different types of species point layers.
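The layer types described above can be illustrated with a short script. This is only a sketch of how such a file could be read outside ModEco, not the tool's own loader; the names parse_smp and classify_layer are hypothetical, and the columns are assumed to be separated by whitespace as in the sample.

```python
from collections import namedtuple

# One row of a species point file: two coordinates and a species name.
# species is None when the third column reads N/A (an absence point).
SpeciesPoint = namedtuple("SpeciesPoint", "x y species")


def parse_smp(text):
    """Parse rows of the form 'x y name' (hypothetical helper, not ModEco code)."""
    points = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or incomplete rows
        name = parts[2].strip()
        species = None if name == "N/A" else name
        points.append(SpeciesPoint(float(parts[0]), float(parts[1]), species))
    return points


def classify_layer(points):
    """Return 'multi-species', 'two-class' or 'one-class' for a point layer."""
    species = {p.species for p in points if p.species is not None}
    has_absence = any(p.species is None for p in points)
    if len(species) > 1:
        return "multi-species"
    return "two-class" if has_absence else "one-class"


sample = "-119.809924 37.485305 lobata\n-116.927998 34.273594 N/A\n"
print(classify_layer(parse_smp(sample)))
```

Dropping the N/A row from the sample turns the same layer into a one-class layer.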
3. Model instance
In a project, the model instance manages the parameters and the trained result of a
model. After being trained, a model instance can be invoked at any time.
4. Result data
Based on a factor group and a trained model, the result map representing the
spatial distribution of a species can be obtained. The result map is a nominal raster data
layer. Generally, in a result layer, the value 0 stands for absence of a species, while
value 255 stands for a null value.
5. Base Map
In ModEco, the base maps are vector layers in ArcGIS shapefile format that can be
loaded and overlaid on a factor layer, species point layer, or prediction result layer.
6. Preferences
Some global settings are needed to implement certain projects. For example, the
extent, which defines the study area of interest, is important for input environmental data
layers that are greater than the study area of interest. If not defined properly, the extent
could also affect the performance of several analyses. For example, histogram
comparisons between the observed species distribution and background distribution
highly depend on the definition of extent, which may change the results dramatically.
Other analyses, such as principal component analysis and scatter plots, also depend on the
extent. In addition, the extent could also significantly affect the pseudo absence data
generation for the niche models. Consequently, we developed the extent as a global
preference for ModEco.
Fig. 8 Dialog box for importing raster files as species point layer
After a species layer has been imported, a window will be opened to show all the
points in this layer (Fig. 9).
When the user wants to close the application, a dialog box (Fig. 10) will appear to
prompt the user to save the unsaved components in the current project.
3 Interactive display
For convenience in analyzing the data, this tool provides several features for viewing
the associated maps.
In the project, different species are symbolized with different colors. The user may
modify the default color set by double-clicking the species name in the dialog box shown
in Fig. 12.
3.3 Overlay species data points [View > Overlay species data points...]
The species point layers can be overlaid on a factor layer or result map (Fig. 13). Thus,
we can view the relative distribution of species points and roughly estimate the prediction
accuracy. This function can also be accessed from the toolbar.
4 Factor analysis
Before we start to predict species distribution, it is important to examine the input
environmental and species data. In ModEco, we provide some basic functions that allow
users to visualize the relationship between observed species localities and environmental
features. Functions include the factor histogram, scatter plot, and factor importance
analysis.
Figure 17 Distribution of black oak in feature space (annual precipitation vs. annual
temperature)
important one for the maximum likelihood model. Note that factor importance analysis
can be algorithm-sensitive: different models may lead to different results.
BioClim
Domain
Generalized linear model (GLM)
Maximum likelihood classification
Artificial neural network (ANN) trained using back-propagation algorithm
Artificial neural network (ANN) trained using particle swarm optimization
(PSO) algorithm
Rough set
Classification and Regression Tree (CART)
Maximum Entropy (MaxEnt)
Ensemble model
Different models are suitable for different training data. First of all, some models
(one-class SVM, BioClim, and Domain) are one-class models, so the species
point layer involved can contain only one species. Two-class species layers are acceptable;
however, the absence points are filtered out before training. The other models support two or
more classes, so one-class species layers are not acceptable for them. To
employ such models with a one-class layer, the pseudo-absence approach (Section 5.4) can be
adopted to make the species point layer two-class.
concept of the BioClim model is very straightforward and often provides reasonably good
results (Rissler et al. 2006). Another advantage of the BioClim model is that it only
requires one free parameter (i.e. percentile), and can be easily implemented in a
geographic information system. Thus, the BioClim model is often used to provide a base
result for comparisons among other advanced niche models (Elith et al. 2006).
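The role of the percentile parameter can be sketched as follows. The code assumes the usual envelope formulation of BioClim, in which a site is predicted as suitable when every factor value lies within the central percentile range of the training presences; the exact percentile rule used inside ModEco is not stated here, so fit_envelope, predict and the symmetric-tail rule are illustrative assumptions.

```python
def percentile(sorted_vals, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def fit_envelope(presence, pct=95.0):
    """Per-factor (low, high) bounds covering the central pct% of presences.

    Assumption: the tails are trimmed symmetrically, (100 - pct) / 2 on each side.
    """
    tail = (100.0 - pct) / 2.0
    envelope = []
    for j in range(len(presence[0])):
        col = sorted(row[j] for row in presence)
        envelope.append((percentile(col, tail), percentile(col, 100.0 - tail)))
    return envelope


def predict(envelope, site):
    """True if every factor value of the site falls inside the envelope."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(site, envelope))


# Toy training data: two factors observed at 101 presence points.
presence = [(float(t), 2.0 * t) for t in range(101)]
env = fit_envelope(presence, pct=95.0)
print(env, predict(env, (50.0, 100.0)), predict(env, (1.0, 100.0)))
```

A larger percentile widens the envelope and therefore enlarges the predicted area.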
Domain
The Domain model is considered an improvement over the BioClim model
(Carpenter et al. 1993). The Domain procedure assigns a classification value to an
unknown site based on its distance to the most similar known site in environmental space. The
similarity metric is the only free parameter needed in the Domain model. Essentially,
the Domain model is analogous to nearest-neighbor classification, which is commonly
used in spatial interpolation and image classification. In a recent method comparison
(Rissler et al. 2006), the Domain model was shown to be very competitive
in terms of performance and ease of implementation.
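The nearest-neighbor idea can be sketched in a few lines. The example assumes a range-standardized Gower similarity, the metric commonly associated with the Domain procedure; whether ModEco uses exactly this metric is an assumption, and the function names are hypothetical.

```python
def fit_ranges(presence):
    """Range of each environmental factor over the training presences."""
    ranges = []
    for j in range(len(presence[0])):
        col = [row[j] for row in presence]
        span = max(col) - min(col)
        ranges.append(span if span > 0 else 1.0)  # guard against constant factors
    return ranges


def gower_similarity(a, b, ranges):
    """1 minus the mean range-standardized absolute difference between two sites."""
    dist = sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)
    return 1.0 - dist


def domain_score(site, presence, ranges):
    """Similarity of a site to its most similar training presence."""
    return max(gower_similarity(site, p, ranges) for p in presence)


presence = [(10.0, 100.0), (20.0, 300.0)]
ranges = fit_ranges(presence)
print(domain_score((12.0, 100.0), presence, ranges))
```

A site would then be mapped as suitable when its score reaches a user-chosen similarity threshold.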
Generalized Linear Model (GLM)
GLM is a generalization of the general linear model that relaxes the assumptions about the
error distribution and constant variance commonly required by traditional linear
models such as linear regression. The GLM is commonly used to model dependent
variables that follow discrete distributions and are nonlinearly related to independent
variables through a link function (Guisan et al. 2002). Consequently, the GLM is
particularly suitable for predicting species distributions, and has proven
successful in various ecological applications (Guisan et al. 2002; Latimer et al. 2006).
Three link functions are currently implemented in ModEco: logit link, log-log link, and
complementary log-log link.
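The three link functions can be written out explicitly. The sketch below uses one common convention for each link (sign conventions for the log-log link vary between texts) and a hypothetical predict_probability helper; it shows how a linear predictor is mapped to a presence probability and is not ModEco's own code.

```python
import math


def inv_logit(eta):
    """Inverse of logit(p) = log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_loglog(eta):
    """Inverse of loglog(p) = -log(-log(p)) (one common sign convention)."""
    return math.exp(-math.exp(-eta))


def inv_cloglog(eta):
    """Inverse of cloglog(p) = log(-log(1 - p))."""
    return 1.0 - math.exp(-math.exp(eta))


INVERSE_LINKS = {"logit": inv_logit, "loglog": inv_loglog, "cloglog": inv_cloglog}


def predict_probability(intercept, coefs, factors, link="logit"):
    """Map the linear predictor of a fitted GLM to a probability of presence."""
    eta = intercept + sum(c * x for c, x in zip(coefs, factors))
    return INVERSE_LINKS[link](eta)


# The same linear predictor gives different probabilities under each link.
for name in ("logit", "loglog", "cloglog"):
    print(name, round(predict_probability(0.0, [0.5], [1.0], link=name), 4))
```

The logit link is symmetric around a probability of 0.5, whereas the two log-log links are asymmetric.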
Maximum Likelihood Classification (MLC)
MLC is one of the most popular classification methods in remote sensing
(Richards & Jia 1999). The idea of MLC is to assign an unknown location to the class
(either presence or absence) with the maximum likelihood. The likelihood is defined as the
posterior probability of the unknown location belonging to either presence or absence.
The MLC method relies heavily on each environmental factor being normally distributed,
and it takes into account the variance and covariance of the environmental factors for the
presence and absence data by using a covariance matrix. The MLC method is considered
to be one of the most accurate classifiers if the data meet the assumptions (Richards & Jia
1999; Duda et al. 2001).
Artificial Neural Networks (ANNs)
ANNs, which were originally inspired by the central nervous system, have been
commonly used to model complex relationships between dependent and
independent variables and to mine patterns in data. The idea of ANNs is to extract
linear combinations of the input variables as derived features, and to model the output as a
nonlinear function of these derived features (Hastie et al. 2001). ANNs have been
successfully used to predict species distributions (Manel et al. 1999a; Maravelias et al.
2003). ModEco implements a four-layer feed-forward ANN (one input layer, one output layer, and
two hidden layers) that can be trained using either the backpropagation algorithm (Werbos 1994)
or the Particle Swarm Optimization (PSO) algorithm (Eberhart & Kennedy 1995).
Rough Set
The idea of rough sets was proposed by Pawlak (1991) as a new mathematical tool
to deal with vague concepts. Rough-set-based reduction is particularly useful for rule
generation and feature selection in data mining. At present, rough sets are viewed
as a theoretical basis for some problems in machine learning, and they have been widely used in
pattern recognition, classification, and other related areas. In ModEco, we use a
heuristic algorithm for rule reduction, since it is an NP-hard problem.
Classification and Regression Tree (CART)
CART seeks to recursively partition the response variable into increasingly pure
binary subsets using splits and stopping criteria (Venables & Ripley 2002). A tree can overgrow
until it fits the training data exactly. The method has several advantages: 1) it can handle any
combination of categorical and continuous data in classification and regression;
for example, in this study, we can use aspect directly in the classification tree; 2)
the results from CART are presented as a set of if-then logical splits that allow for
accurate prediction and classification of cases, so the
CART results are easy to interpret; and 3) it can capture hierarchical and nonlinear relationships
among predictor variables (De'ath & Fabricius 2000).
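A single partitioning step can be sketched as follows. The example assumes the Gini index as the purity measure, which is a common textbook choice for CART; the criterion actually used in ModEco is not stated in this section, and gini and best_split are hypothetical names.

```python
def gini(labels):
    """Gini impurity of a set of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Find the threshold on one factor that gives the purest binary split.

    Returns (threshold, weighted_impurity); threshold is None if no split helps.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical factor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy example: a single factor separates absence (0) from presence (1).
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

A full tree repeats this search over all factors within each resulting subset until a stopping criterion is met, which is where the if-then rules come from.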
Maximum Entropy (MaxEnt)
MaxEnt was first proposed by Jaynes (1957) and has been widely used in natural
language processing, text segmentation, part-of-speech tagging, prepositional phrase
attachment, and niche modeling. Entropy is a fundamental concept in information theory;
it measures how much choice is involved in the selection of an event (Shannon
1948). The principle of maximum entropy states that the distribution model that
satisfies any given constraints should be as uniform as possible. Such a model agrees with
everything that is known, but carefully avoids assuming anything that is not known.
MaxEnt has been shown to be a promising approach to modeling species distributions
(Phillips et al. 2004; Elith et al. 2006).
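The statement that the least-committed distribution is the most uniform one can be checked numerically. The snippet below computes Shannon entropy for a few toy distributions over four cells; it illustrates the entropy measure only and does not implement the MaxEnt fitting procedure.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

# With no constraints, the uniform distribution has the highest entropy.
for name, dist in (("uniform", uniform), ("peaked", peaked), ("certain", certain)):
    print(name, round(shannon_entropy(dist), 4))
```

In niche modeling the constraints come from the environmental conditions at the presence points, and the fitted distribution is the most uniform one that still satisfies them.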
BP-ANN
Momentum. Default value: 0.3.
Learning rate. Default value: 0.1.
PSO-ANN
The number of particles. Default value: 10.
Rough set
The number of discretized grades. Default value: 10.
The option to choose the lower approximation or the upper approximation to train
the model.
Classification Tree
The number of trials. Default value: 10.
Window size. Default value: 20.
Pruning confidence level. Default value: 0.25.
Maximum Entropy
Validation Set. Default: 25%.
Empirical Threshold that is used to generate the binary output. Default: 0.05
omission rate.
Fig. 26 demonstrates the prediction result maps using the eight models for the
same species.
Fig. 26 Prediction result maps using the eight models for the same species
Fig. 28 Dialog box for running a two-class model based on pseudo-absence points
6 Accuracy assessment
6.1 Accuracy of prediction result [Analysis > Accuracy of prediction]
Once we have obtained the predicted distribution map, we can compute the accuracy of
the prediction by an overlay operation. To perform this function, the user needs to select
the result map and the relevant species point layer. The computed accuracy is shown in
Fig. 29. It includes:
The error matrix;
The Kappa value;
The true positive rate;
The area of the predicted species distribution.
If the species point layer is one-class, the error matrix and Kappa value cannot be
computed.
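The first three measures can be reproduced from the overlay of the result map and the species points. The sketch below uses the standard definitions of the error matrix, the true positive rate and Cohen's Kappa for a two-class layer; the function names and the boolean input format are assumptions made for illustration.

```python
def error_matrix(predicted, observed):
    """Count (tp, fp, fn, tn) from paired boolean predictions and observations."""
    tp = fp = fn = tn = 0
    for p, o in zip(predicted, observed):
        if p and o:
            tp += 1
        elif p and not o:
            fp += 1
        elif o:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def true_positive_rate(tp, fn):
    """Share of observed presences that the map predicts as present."""
    return tp / (tp + fn)


def kappa(tp, fp, fn, tn):
    """Cohen's Kappa: agreement beyond what is expected by chance."""
    n = tp + fp + fn + tn
    observed_agreement = (tp + tn) / n
    chance_agreement = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed_agreement - chance_agreement) / (1.0 - chance_agreement)


# Map value at each species point (True = predicted present) and the field record.
predicted = [True, True, True, False, False, False, True, False]
observed = [True, True, True, True, False, False, False, False]
tp, fp, fn, tn = error_matrix(predicted, observed)
print((tp, fp, fn, tn), true_positive_rate(tp, fn), kappa(tp, fp, fn, tn))
```

With no absence points, fp and tn are both zero and Kappa carries no information, which is consistent with the restriction on one-class layers.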
Figure 32: ROC curve and AUC value for predicting Black Oaks in California based on
the BioClim model.
6.4 True positive rate/prediction area plot [Model > True positive rate/prediction area plot]
For real presence-only data (one-class species point layers), which are very common in
ecological observation data, the above accuracy measures are not applicable. One
solution is to generate pseudo-absence data and treat them as real absence data, so that
conventional accuracy assessment methods such as the Kappa value and
cross-validation can be applied. Alternatively, Engler et al. (2004) proposed that a
good prediction model for presence-only data should predict a potential area as small as
possible while still covering the maximum number of sites where the species occurs. Guo et al.
(2005) demonstrated this concept for selecting the parameters of a one-class SVM. ModEco
allows users to plot the true positive rate against the prediction area to help them select
appropriate parameters for the model.
This feature is only available for one-class species point layers. Its
implementation is similar to that of the ROC curve. The difference is that
the prediction area is used instead of the true negative rate, which
cannot be computed for one-class data (Fig. 35).
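The points of such a plot can be computed as sketched below. The example assumes that the model produces a continuous suitability score for each presence point and each raster cell, and that the prediction area is expressed as the fraction of cells predicted present; both are assumptions made for illustration, as is the name tpr_area_curve.

```python
def tpr_area_curve(presence_scores, cell_scores, thresholds):
    """Return (threshold, true_positive_rate, area_fraction) for each threshold.

    presence_scores: model score at each observed presence point.
    cell_scores: model score at every cell inside the study extent.
    """
    curve = []
    for thr in thresholds:
        tpr = sum(s >= thr for s in presence_scores) / len(presence_scores)
        area = sum(s >= thr for s in cell_scores) / len(cell_scores)
        curve.append((thr, tpr, area))
    return curve


presence_scores = [0.9, 0.8, 0.6, 0.4]
cell_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
for thr, tpr, area in tpr_area_curve(presence_scores, cell_scores, [0.5, 0.7]):
    print(thr, tpr, area)
```

Following the criterion of Engler et al. (2004), a good parameter setting keeps the true positive rate high while keeping the predicted area fraction small.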
Fig. 35 Dialog box for showing the true positive rate/ prediction area plot
maximum Kappa value may differ from the value obtained when assessing the accuracy of the
result map (Section 6.1), since cross-validation is used in this function.
7 Tutorial
In this section, we demonstrate how to use ModEco. After installing
the software, you can open a project called Species.sml under the directory where
ModEco is installed (e.g. c:\program files\ModEco\Data\Species.sml). The project file is
saved in XML format, so users can open it in a text editor. For example,
Notepad++ is able to read and edit the XML file (available at: http://notepadplus.sourceforge.net/uk/site.htm). The Species.sml file contains data for the following species in
California:
Black oak
Coast oak
Calibay
Fremont CW
Interior oak
Canyon oak
Valley oak
Tan oak
Madrone
Oregon oak.
The species data are originally presence-only; the pseudo-absence data were generated using
the function "Create Pseudo-absence points". For one-class methods (such as BioClim,
Domain, and one-class SVM) and for background vs. presence-only models (such as MaxEnt,
GLM, and SVM), only the presence data are used in the analysis, and the pseudo-absence points
are ignored. Note that for background vs. presence-only models,
ModEco generates the background points for the model.
Fig. 37 Factor importance analysis result for Calibay based on the BioClim model
According to the result, three factors (DEM, pt_pm7, and ta_pm1) are less
important. We can thus use the other eight factors to predict the species distribution. Note
that the analysis result is specific to the BioClim model with a percentile of 95 and to the
species Calibay. For other models and species, the result may be different.
8 Acknowledgement
This research is partially supported by the National Science Foundation (BDI-0742986)
and BioGeomancer project from the Gordon and Betty Moore Foundation. The
development of ModEco also benefited from discussion and comments from John
Wieczorek and Craig Moritz at the Museum of Vertebrate Zoology at UC Berkeley. We
also thank Otto Alvarez, Hong Yu, Miguel Fernandez, and Andrew Zumkehr for help with
software debugging and website maintenance.
9 References
Anderson, R.P. and Martinez-Meyer, E., 2004, "Modeling species' geographic
distributions for preliminary conservation assessments: An implementation with
the spiny pocket mice (heteromys) of ecuador", Biological Conservation, 116:
167-179.
Bian, L. and West, E., 1997, "Gis modeling of elk calving habitat in a prairie
environment with statistics", Photogrammetric Engineering & Remote Sensing,
63: 161-167.
Broenniman, O., Thuiller, W., Hughes, G., Midgley, G.F., Alkemade, J.M.R. and Guisan,
A., 2006, "Do geographic distribution, niche property and life form explain
plants' vulnerability to global change?" Global Change Biology, 12: 1079-1093.
Busby, J.R., 1986, "A biogeoclimatic analysis of nothofagus cunninghamii (hook.) oerst.
In southeastern australia", Australian Journal of Ecology, 11: 1-7.
Carpenter, G., Gillison, A.N. and Winter, J., 1993, "Domain - a flexible modeling
procedure for mapping potential distributions of plants and animals", Biodiversity
and Conservation, 2: 667-680.
Chefaoui, R.M., Hortal, J. and Lobo, J.M., 2005, "Potential distribution modelling, niche
characterization and conservation status assessment using gis tools: A case study
of iberian copris species", Biological Conservation, 122: 327-338.
Cristianini, N. and Scholkopf, B., 2002, "Support vector machines and kernel methods: The
new generation of learning machines", AI Magazine, 23: 31-41.
De'ath, G. and Fabricius, K., 2000, "Classification and regression trees: A powerful yet
simple technique for ecological data analysis", Ecology, 81: 3178-3192.
Duda, R.O., Hart, P.E. and Stork, D.G., 2001, Pattern classification, New York: John
Wiley & Sons.
Eberhart, R.C. and Kennedy, J., 1995, A new optimizer using particle swarm theory.
Proceedings of the Sixth International Symposium on Micromachine and Human
Science, Japan: Nagoya, 39-43.
Elith, J., Graham, C.H., Anderson, R.P., Dudik, M., Ferrier, S., Guisan, A., Hijmans, R.J.,
Huettmann, F., Leathwick, J.R., Lehmann, A., Li, J., Lohmann, L.G., Loiselle,
B.A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J.M.,
Peterson, A.T., Phillips, S.J., Richardson, K., Scachetti-Pereira, R., Schapire,
R.E., Soberon, J., Williams, S., Wisz, M.S. and Zimmermann, N.E., 2006,
"Novel methods improve prediction of species' distributions from occurrence
data", Ecography, 29: 129-151.
Engler, R., Guisan, A. and Rechsteiner, L., 2004, "An improved approach for predicting
the distribution of rare and endangered species from occurrence and
pseudo-absence data", Journal of Applied Ecology, 41: 263-274.
Fabricius, K. and De'ath, G., 2001, "Environmental factors associated with the spatial
distribution of crustose coralline algae on the great barrier reef", Coral Reefs, 19:
303-309.
Felicisimo, A.M., Frances, E., Fernandez, J.M., Gondalez-Diez, A. and Varas, J., 2002,
"Modeling the potential distribution of forests with a gis", Photogrammetric
Engineering & Remote Sensing, 68: 455-462.
Feria, T.P. and Peterson, A.T., 2002, "Prediction of bird community composition based
on point-occurrence data and inferential algorithms: A valuable tool in
biodiversity assessments", Diversity and Distributions, 8: 49-56.
Fielding, A.H. and Haworth, P.F., 1995, "Testing the generality of bird-habitat models",
Conservation Biology, 9: 1466-1481.
Fonseca, M.S., Whitfield, P.E., Kelly, N.M. and Bell, S.S., 2002, "Statistical modeling of
seagrass landscape pattern and associated ecological attributes in relation to
hydrodynamic gradients", Ecological Applications, 12: 218-237.
Frescino, T.S., Edwards, T.C. and Moisen, G.G., 2001, "Modeling spatially explicit
forest structural attributes using generalized additive models", Journal of
Vegetation Science, 12: 15-26.
Funk, V.A. and Richardson, K.S., 2002, "Systematic data in biodiversity studies: Use it
or lose it", Systematic Biology, 51: 303-316.
Graham, C.H., Ferrier, S., Huettman, F., Moritz, C. and Peterson, A.T., 2004, "New
developments in museum-based informatics and applications in biodiversity
analysis", Trends in Ecology & Evolution, 19: 497-503.
Graham, C.H., Moritz, C. and Williams, S.E., 2006, "Habitat history improves prediction
of biodiversity in rainforest fauna", Proceedings of the National Academy of
Sciences of the United States of America, 103: 632-636.
Guisan, A., Edwards, T.C. and Hastie, T., 2002, "Generalized linear and generalized
additive models in studies of species distributions: Setting the scene", Ecological
Modelling, 157: 89-100.
Guo, Q.H., Kelly, M. and Graham, C.H., 2005, "Support vector machines for predicting
distribution of sudden oak death in california", Ecological Modelling, 182: 75-90.
Hastie, T., Tibshirani, R. and Friedman, J., 2001, The elements of statistical learning:
Data mining, inference and prediction., New York: Springer.
Iguchi, K., Matsuura, K., McNyset, K.M., Peterson, A.T., Scachetti-Pereira, R., Powers,
K.A., Vieglais, D.A., Wiley, E.O. and Yodo, T., 2004, "Predicting invasions of
north american basses in japan using native range data and a genetic algorithm",
Transactions of the American Fisheries Society, 133: 845-854.
Jaynes, E.T., 1957, "Information theory and statistical mechanics", Physical Review,
106: 620-630.
Kelly, N.M., 2002, Monitoring sudden oak death in california using high-resolution
imagery. In: pp. 799-810. USDA-Forest Service.
Kelly, N.M., Fonseca, M. and Whitfield, P., 2001, "Predictive mapping for management
and conservation of seagrass beds in north carolina", Aquatic Conservation:
Marine and Freshwater Ecosystems, 11: 437-451.
Kueppers, L.M., Snyder, M.A., Sloan, L.C., Zavaleta, E.S. and Fulfrost, B., 2005,
"Modeled regional climate change and california endemic oak ranges", Proceedings of
the National Academy of Sciences of the United States of America, 102:
16281-16286, doi: 10.1073/pnas.0501427102.
Lai, C., Tax, D., Duin, R., Pekalska, E. and Paclik, P., 2002, On combining one-class
classifiers for image database retrieval. In: Roli, F. & Kittler, J. (eds.) Multiple
classifier systems, pp. 212-221. Springer-Verlag, Berlin.
Latimer, A.M., Wu, S.S., Gelfand, A.E. and Silander, J.A., 2006, "Building statistical
models to analyze species distributions", Ecological Applications, 16: 33-50.
Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J. and Aulagnier, S., 1996,
"Application of neural networks to modelling nonlinear relationships in ecology",
Ecological Modelling, 1 -13.
Livingston, S.A., Todd, C.S., Krohn, W.B. and Owen, R.B., 1990, "Habitat models for
nesting bald eagles in maine", Journal of Wildlife Management, 54: 644-665.
Loiselle, B.A., Howell, C.A., Graham, C.H., Goerck, J.M., Brooks, T., Smith, K.G. and
Williams, P.H., 2003, "Avoiding pitfalls of using species distribution models in
conservation planning", Conservation Biology, 17: 1591-1600.
Manel, S., Dias, J.M., Buckton, S.T. and Ormerod, S.J., 1999a, "Alternative methods for
predicting species distribution: An illustration with himalayan river birds",
Journal of Applied Ecology, 36: 734-747.
Manel, S., Dias, J.M. and Ormerod, S.J., 1999b, "Comparing discriminant analysis,
neural networks and logistic regression for predicting species distributions: A case
study with a himalayan river bird", Ecological Modelling, 120: 337-347.
Manevitz, L.M. and Yousef, M., 2002, "One-class svms for document classification",
Journal of Machine Learning Research, 2: 139-154.
Venables, W.N. and Ripley, B.D., 2002, Modern applied statistics with S, New York:
Springer.
Werbos, P.J., 1994, The Roots of Backpropagation, Wiley.
Wieczorek, J., Guo, Q.G. and Hijmans, R.J., 2004, "The point-radius method for
georeferencing locality descriptions and calculating associated uncertainty",
International Journal of Geographical Information Science, 18: 745-767.
Wieland, R., Voss, M., Holtmann, X., Mirschel, W. and Ajibefun, I., 2006, "Spatial
analysis and modeling tool (samt): 1. Structure and possibilities", Ecological
Informatics, 1: 67-76.