
Neural Networks Page 1 of 220

Neural Networks
Neural Networks Overviews

STATISTICA Neural Networks Program Overview

Neural Networks Introductory Overview

Preface

The Biological Inspiration

The Basic Artificial Model

Using a Neural Network

Gathering Data for Neural Networks

Pre- and Post-processing

Multilayer Perceptrons

Multilayer Perceptrons - Part II

Radial Basis Function Networks

Probabilistic Neural Networks

Generalized Regression Neural Networks

Linear Networks

SOFM Networks

Classification Problems

Regression Problems

Time Series Prediction in STATISTICA Neural Networks

Variable Selection and Dimensionality Reduction

Ensembles and Resampling

Recommended Textbooks

Specifying the Neural Networks Analysis

Extended Dot Product Training

Extended Dot Product Training Dialog

Extended Dot Product Training - Quick Tab

Extended Dot Product Training - End Tab

Extended Dot Product Training - Decay Tab

Extended Dot Product Training - BP (1)/(2) Tab

Extended Dot Product Training - QP (1)/(2) Tab

Extended Dot Product Training - DBD (1)/(2) Tab

file:///C:/Users/CERIn%2047/AppData/Local/Temp/~hhDEAB.htm 09/10/2018

Extended Kohonen Training

Extended Kohonen Training Dialog

Neural Networks (Startup Panel)

STATISTICA Neural Networks (SNN) Startup Panel

Neural Networks Startup Panel - Quick Tab

Neural Networks Startup Panel - Advanced Tab

Neural Networks Startup Panel - Networks/Ensembles Tab

Intelligent Problem Solver

Intelligent Problem Solver Overview

How the Intelligent Problem Solver Works

Intelligent Problem Solver Dialog

Intelligent Problem Solver - Quick Tab

Intelligent Problem Solver - Retain Tab

Intelligent Problem Solver - Types Tab

Intelligent Problem Solver - Complexity Tab

Intelligent Problem Solver - Thresholds Tab

Intelligent Problem Solver - Time Series Tab

Intelligent Problem Solver - MLP Tab

Intelligent Problem Solver - Feedback Tab

Intelligent Problem Solver - Progress dialog

Custom Network Designer

Custom Network Designer Dialog

Custom Network Designer - Quick Tab

Custom Network Designer - Units Tabs

Custom Network Designer - PNN Tab

Custom Network Designer - Time Series Tab

Create/Edit Ensemble

Create/Edit Ensemble Dialog

Multiple Model Selection

Select Networks and/or Ensembles Dialog


Select Networks and/or Ensembles - Models Tab

Select Networks and/or Ensembles - Options Tab

Code Generator

Run Code Generator

Retraining Networks

Select Neural Network to Retrain

Network Set Editor

Neural Network File Editor Dialog

Neural Network File Editor - File Details Tab

Neural Network File Editor - Networks Tab

Neural Network File Editor - Ensembles Tab

Neural Network File Editor - Replacement Options Tab

Neural Network File Editor - Advanced Tab

Select Network File

Select Pre-processing Network

Neural Network Editor

Neural Network Editor Dialog

Neural Network Editor - Quick Tab

Neural Network Editor - Variables Tab

Neural Network Editor - Layers Tab

Neural Network Editor - Weights Tab

Neural Network Editor - Time Series Tab

Neural Network Editor - Advanced Tab

Neural Network Editor - Pruning Tab

Neural Network Editor - Thresholds Tab

Add Variables to Network

Add Units to Network

Delete Units from Network

Nominal Definition


Generalized Regression Neural Network

Generalized Regression Neural Network Training Overview

Train Generalized Regression Networks Dialog

Train Generalized Regression Networks - Quick Tab

Train Generalized Regression Networks - Pruning Tab

Train Generalized Regression Networks - Classification Tab

Clustering Network

Cluster Network Training Overview

Train Cluster Network Dialog

Train Cluster Network - Quick Tab

Train Cluster Network - LVQ Tab

Train Cluster Network - Pruning Tab

Train Cluster Network - Classification Tab

Train Cluster Network - Interactive Tab

Linear

Linear Network Training Overview

Train Linear Network Dialog

Train Linear Network - Quick Tab

Train Linear Network - Pruning Tab

Train Linear Network - Classification Tab

Multiple Network Selection

Select Neural Networks

Multilayer Perceptron

Multilayer Perceptron Training Overview

Train Multilayer Perceptron Dialog

Train Multilayer Perceptron - Quick Tab

Train Multilayer Perceptron - Start Tab

Train Multilayer Perceptron - End Tab

Train Multilayer Perceptron - Decay Tab

Train Multilayer Perceptron - Interactive Tab

Train Multilayer Perceptron - BP Tab


Train Multilayer Perceptron - QP Tab

Train Multilayer Perceptron - DBD Tab

Train Multilayer Perceptron - Classification Tab

Principal Components

Principal Component Analysis Dialog

Radial Basis Function

Radial Basis Function Training Overview

Train Radial Basis Function Dialog

Train Radial Basis Function - Quick Tab

Train Radial Basis Function - Pruning Tab

Train Radial Basis Function - Classification Tab

Probabilistic Neural Network

Probabilistic Neural Network Training Overview

Train Probabilistic Neural Networks Dialog

Train Probabilistic Neural Networks - Quick Tab

Train Probabilistic Neural Networks - Priors Tab

Train Probabilistic Neural Networks - Loss Matrix Tab

Train Probabilistic Neural Networks - Pruning Tab

Train Probabilistic Neural Networks - Thresholds Tab

Self Organizing Feature Map

Self Organizing Feature Map Training Overview

Train Self Organizing Feature Map Dialog

Train Self Organizing Feature Map - Quick Tab

Train Self Organizing Feature Map - Start Tab

Train Self Organizing Feature Map - Classification Tab

Train Self Organizing Feature Map - Interactive Tab

Train Radial Layer

Train Radial Layer Dialog

Train Radial Layer - Sample Tab

Train Radial Layer - Deviation Tab

Train Radial Layer - Labels Tab


Train Radial Layer - Kohonen Tab

Train Radial Layer - Start(K) Tab

Train Radial Layer - LVQ Tab

Train Radial Layer - Interactive Tab

Train Dot Product Layers

Train Dot Product Layers Dialog

Train Dot Product Layers - Train Tab

Train Dot Product Layers - Start Tab

Train Dot Product Layers - End Tab

Train Dot Product Layers - Decay Tab

Train Dot Product Layers - Interactive Tab

Train Dot Product Layers - BP (1)+(2) Tab

Train Dot Product Layers - QP (1)+(2) Tab

Train Dot Product Layers - DBD (1)+(2) Tab

Training in Progress

Training in Progress Dialog

Training in Progress - Graph Tab

Training in Progress - Dynamic Options Tab

Training in Progress - Static Options Tab

Sampling of Case Subsets for Training

Sampling of Case Subsets for Training Dialog

Sampling of Case Subsets for Training - Quick Tab

Sampling of Case Subsets for Training - Advanced Tab

Sampling of Case Subsets for Training - Cross Validation Tab

Sampling of Case Subsets for Training - Bootstrap Tab

Sampling of Case Subsets for Training - Random Tab

Select Cases

Select Cases

Sampling of Case Subsets for Intelligent Problem Solver

Sampling of Case Subsets for Intelligent Problem Solver Dialog

Sampling of Case Subsets for Intelligent Problem Solver - Quick Tab


Sampling of Case Subsets for Intelligent Problem Solver - Random Tab

Sampling of Case Subsets for Intelligent Problem Solver - Bootstrap Tab

Sampling of Case Subsets for Feature Selection

Feature (Independent Variable) Selection Dialog

Feature Selection - Quick Tab

Feature Selection - Advanced Tab

Feature Selection - Genetic Algorithm Tab

Feature Selection - Interactive Tab

Feature Selection - Finish Tab

Feature Selection in Progress Dialog

Sampling of Case Subsets for Feature Selection Dialog

Training String

Profile String

Reviewing Neural Networks Results

Results

Results Dialog

Discard Results Prompt

Results Dialog - Quick Tab

Results Dialog - Advanced Tab

Results Dialog - Predictions Tab

Results Dialog - Residuals Tab

Results Dialog - Sensitivity Tab

Results Dialog - Plot Tab

Results Dialog - Descriptive Statistics Tab

User Defined Case Prediction

User Defined Case Prediction Dialog

User Defined Case Prediction - Quick Tab

User Defined Case Prediction - Advanced Tab

Response Graph

Response Graph Dialog

Response Graph - Quick Tab


Response Graph - Advanced Tab

Response Graph - Fixed Independent Tab

Response Surface

Response Surface Dialog

Response Surface - Quick Tab

Response Surface - Advanced Tab

Response Surface - Fixed Independents Tab

Topological Map

Topological Map Dialog

Topological Map - Topological Map Tab

Topological Map - Custom Case Tab

Topological Map - Advanced Tab

Topological Map - Win Frequencies Tab

Network Illustration

Network Illustration Dialog

Network Illustration - Illustration Tab

Network Illustration - Custom Case Tab

Time Series Projection

Time Series Projection Dialog

Time Series Projection - Quick Tab

Time Series Projection - Advanced Tab

Neural Networks Technical Notes

Activation Functions

Back Propagation

Classification

Classification by Labeled Exemplars

Classification Statistics

Classification Thresholds

Class Labeling

Conjugate Gradient Descent

Delta-Bar-Delta


Deviation Assignment Algorithms

Ensembles

Error Function

Joining Networks

Intelligent Problem Solver - Replication of Results

K-Means Algorithm

Kohonen Algorithm

Learned Vector Quantization

Levenberg-Marquardt

Loss Matrix

Model Profiles

Model Summary Details

Network Sets

Perceptron

Pseudo-Inverse (Singular Value Decomposition)

Quasi-Newton

Quick Propagation

Radial Sampling

Receiver Operating Characteristic (ROC) Curve

Regression

Regression Statistics

Resampling

Sensitivity Analysis

Stopping Conditions

Synaptic Functions

Time Series

Topological Map

Unit Types

Unsupervised Learning

Weigend Weight Regularization

Neural Networks Examples

STATISTICA Neural Networks Step-by-Step Examples

SNN Example 1: Preliminaries (Data File and Analysis)


SNN Example 2: Creating a Neural Network

SNN Example 3: Testing a Neural Network

SNN Example 4: Using the Network Set Editor

SNN Example 5: The Iris Problem

SNN Example 6: Advanced Use of the IPS

SNN Example 7: Classification Confidence Thresholds

SNN Example 8: Regression Problems

SNN Example 9: Case Subsets

SNN Example 10: PNN and GRNN Networks

SNN Example 11: Creating a Custom Neural Network

SNN Example 12: Training the Network

SNN Example 13: Stopping Conditions

SNN Example 14: Radial Basis Function Networks

SNN Example 15: Linear Models

SNN Example 16: SOFM Networks

SNN Example 17: PNNs and GRNNs Revisited

SNN Example 18: Ensembles and Sampling

SNN Example 19: Input Variable Selection

SNN Example 20: Time Series

Neural Networks Overviews


STATISTICA Neural Networks Program Overview

Neural Networks Introductory Overview

Preface

The Biological Inspiration

The Basic Artificial Model

Using a Neural Network

Gathering Data for Neural Networks

Pre- and Post-processing

Multilayer Perceptrons

Multilayer Perceptrons - Part II

Radial Basis Function Networks

Probabilistic Neural Networks

Generalized Regression Neural Networks


Linear Networks

SOFM Networks

Classification Problems

Regression Problems

Time Series Prediction in STATISTICA Neural Networks

Variable Selection and Dimensionality Reduction

Ensembles and Resampling

Recommended Textbooks

STATISTICA Neural Networks Program Overview


STATISTICA Neural Networks (SNN) is a comprehensive, state-of-the-art, powerful, and extremely fast neural network data
analysis package, featuring:

l Integrated pre- and post-processing, including data selection, nominal-value encoding, scaling, normalization, and
missing value substitution, with interpretation for classification, regression, and time series problems.

l Exceptional ease of use coupled with unsurpassed analytic power; for example, a unique wizard-style Intelligent Problem
Solver can guide you step by step through the procedure of creating a variety of different networks and choosing the
network with the best performance (a task that would otherwise require a lengthy "trial and error" process and a solid
background in the underlying theory).

l Powerful exploratory and analytic techniques, including Input Feature Selection algorithms (choosing the right input
variables in exploratory data analysis, which is a typical application of neural networks, is often a time-consuming
process; STATISTICA Neural Networks can also do this for you).

l State-of-the-art, highly optimized training algorithms (including Conjugate Gradient Descent and Levenberg-Marquardt);
full control over all aspects that influence the network performance such as activation and error functions, or network
complexity.

l Support for combinations of networks and network architectures of practically unlimited sizes organized in Network Sets;
selective training of network segments; merging and saving of network sets in separate files.

l Comprehensive graphical and statistical feedback that facilitates interactive exploratory analyses.

l Integration with the STATISTICA system, including direct transfer of data and graphs to STATISTICA for further analysis
and customization of results (STATISTICA Neural Networks can also be used as a stand-alone package).


l API (Application Programming Interface) support for embedded solutions using Visual Basic, Delphi, C, C++ and other
languages.

Tackling the Real Issues in Neural Computing

Using neural networks involves more than simply feeding data to a neural network.

STATISTICA Neural Networks has the functionality to help you through the critical design stages, including not only state-
of-the-art Neural Network Architectures and Training Algorithms, but also innovative new approaches to Input Parameter
Selection and Network Design. Moreover, software developers and those users who experiment with customized
applications will appreciate the fact that once your prototyping experiments are completed using SNN's simple, intuitive
interface, the STATISTICA Neural Networks API allows you to embed neural technology into your own specialized software
simply and at low cost.

Input Data

STATISTICA Neural Networks stores its data in the STATISTICA data format; thus, your data can easily be transferred from
STATISTICA or converted to the STATISTICA data format using any of the wide variety of STATISTICA data import
facilities. Alternatively, you can create a data set in STATISTICA Neural Networks by entering the data into the editor,
pasting them from the Clipboard, or importing from ASCII (text) files (both TAB-delimited and comma-delimited formats
are supported). SNN automatically identifies nominal valued variables and missing values during Import. Once data are
inside STATISTICA Neural Networks, they can easily be edited in STATISTICA Neural Networks' data set Editor, using a
familiar spreadsheet-like interface; editing operations can include the labeling of cases and variables, adding and removing
cases and/or variables, identification of input and output variables, or division of cases into Training, Verification, and Test
Sets. You can also temporarily ignore selected cases and/or variables.

Input Selection and Dimensionality Reduction

Once you have a data set prepared, you will need to decide which variables to use in your neural network. Larger numbers of
input variables require larger neural networks, with consequent increases in storage and training time requirements and the
need for greater numbers of training cases. Lack of data and correlations between variables make the selection of important
input variables, and the compression of information into smaller numbers of variables, issues of critical importance in many
neural applications.

Input feature selection algorithms. STATISTICA Neural Networks includes backward and
forward stepwise selection algorithms. In addition, the Neuro-Genetic Input Selection algorithm
uniquely combines the technologies of Genetic Algorithms and PNN/GRNNs (PNN stands for
Probabilistic Neural Networks, and GRNN stands for Generalized Regression Neural Network)
to automatically search for optimal combinations of input variables, even where there are
correlations and nonlinear interdependencies. The near-instantaneous training time of
PNN/GRNN not only allows the Neuro-Genetic Input Selection algorithm to operate, it also
allows you, in conjunction with the SNN data set Editor's simple variable suppression facilities,
to conduct your own input sensitivity experiments on a realistic time scale. STATISTICA Neural
Networks also includes built-in Principal Components Analysis (PCA and Autoassociative networks for "nonlinear PCA") to
extract smaller numbers of dimensions from the raw inputs. Note that a wide variety of statistical tools for data reduction are
available in STATISTICA.
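As a rough illustration of the underlying linear technique (not SNN's own implementation; the data and function names below are purely illustrative, and NumPy is assumed), PCA projects the centered inputs onto the directions of greatest variance, obtained here from the singular value decomposition:

```python
import numpy as np

def pca_project(X, n_components):
    """Project data onto its leading principal components.

    A minimal sketch of linear PCA for dimensionality reduction:
    center the data, take the SVD, and keep the first n_components
    directions of greatest variance.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # center each input variable
    # Right singular vectors of the centered data are the
    # principal component directions, sorted by variance explained.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T           # scores in the reduced space

# Toy data: three correlated inputs that really vary along two directions.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 2))
X = np.column_stack([t[:, 0], 2 * t[:, 0] + 0.1 * t[:, 1], t[:, 1]])
Z = pca_project(X, 2)
```

Because the three toy inputs are built from two latent variables, the two retained components recover essentially all of the variation, which is exactly the compression the text describes.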

Data Scaling and Nominal Value Preparation

Data must be specially prepared for input into a network, and it is also important that the network output can be interpreted
correctly. STATISTICA Neural Networks includes automatic data scaling (including Minimax and Mean/SD scaling) for both
inputs and outputs; there is also automatic recoding of Nominal valued variables (e.g., Sex={Male,Female}), including one-
of-N encoding. STATISTICA Neural Networks also has facilities to handle missing data. Normalization functions such as
Unit Sum, Winner-takes-all, and Unit Vector are also supported. There are special data preparation and interpretation
facilities for use with Time Series. A large number of relevant tools are also included in STATISTICA.
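The two preparation steps just described can be sketched in a few lines (illustrative code, not SNN's own implementation; the function names are invented): one-of-N encoding turns a nominal value such as Sex={Male,Female} into a binary vector, and minimax scaling maps a numeric variable linearly onto a target range:

```python
def one_of_n(values, categories):
    """Encode each nominal value as a one-of-N binary vector."""
    index = {c: i for i, c in enumerate(categories)}
    return [[1.0 if index[v] == i else 0.0 for i in range(len(categories))]
            for v in values]

def minimax_scale(xs, lo=0.0, hi=1.0):
    """Linearly map a numeric variable onto the range [lo, hi]."""
    mn, mx = min(xs), max(xs)
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in xs]

# A nominal variable becomes one-of-N columns; a numeric one is rescaled.
sex = one_of_n(["Male", "Female", "Female"], ["Male", "Female"])
age = minimax_scale([20, 35, 50])
```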

For classification problems, you can set confidence limits, which SNN uses to assign cases to classes. In combination with
STATISTICA Neural Networks' specialized Softmax activation function and cross-entropy error functions, this supports a
principled, probabilistic approach to classification.
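The probabilistic interpretation rests on two standard definitions, sketched here in plain Python (illustrative code, not SNN's implementation): Softmax converts output-unit activations into values that are positive and sum to one, so they can be read as class probabilities, and cross-entropy measures the error of those probabilities against a one-of-N target:

```python
import math

def softmax(z):
    """Softmax activation: outputs are positive and sum to 1."""
    m = max(z)                          # subtract the max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(probs, target_index):
    """Cross-entropy error for one case with a one-of-N target."""
    return -math.log(probs[target_index])

# Three output activations become class probabilities; the error is
# small when the probability of the true class is close to 1.
p = softmax([2.0, 1.0, 0.1])
err = cross_entropy(p, 0)
```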

Selecting a Neural Network Model

The range of neural network models and the number of parameters that must be decided upon (including network size, and
training algorithm control parameters) can seem bewildering (the wizard-style Intelligent Problem Solver is available to
guide you through the selection process). STATISTICA Neural Networks supports the most important classes of neural
networks for real world problem solving, including:

Multilayer Perceptrons (Feedforward Networks)

Radial Basis Function Networks


Kohonen Self-Organizing Feature Maps

Probabilistic (Bayesian) Neural Networks

Generalized Regression Neural Networks

Linear Modeling.

STATISTICA Neural Networks has numerous facilities to aid in selecting an appropriate network architecture. STATISTICA
Neural Networks' statistical and graphical feedback includes Bar Charts, Matrices and Graphs of individual and overall case
errors, summaries of classification/misclassification performance, and vital statistics such as Regression Error Ratios - all
automatically calculated.

For data visualization, SNN can also display Scatterplots and 3D Response Surfaces to help you
understand the network's "behavior."

Naturally, you can export information from any of these sources to a file or to the Clipboard, or
directly to STATISTICA, for inclusion in your reports, further analysis, or customization.

STATISTICA Neural Networks automatically retains a copy of the best network found as you
experiment on a problem, which can be retrieved at any time. The usefulness and predictive validity
of the network can automatically be assessed by including verification cases, and by evaluating the
size and efficiency of the network as well as the cost of misclassification. SNN's automatic Cross
Verification and Weigend Weight Regularization procedures also allow you to quickly assess
whether a network is too complex or not complex enough for the problem at hand.

For enhanced performance, STATISTICA Neural Networks supports a number of network customization options. You can
specify a Linear output layer for networks used in Regression problems, or Softmax Activation functions for probability-
estimation in Classification problems. If your data suffers badly from outliers, you can replace the standard Error function
used in training with the less sensitive City-Block error function. Cross-entropy error functions, based on information-theory
models, are also included, and there is a range of specialized activation functions, including Step, Ramp, and Sine functions.
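The difference between the standard sum-squared error and the City-Block error can be sketched in a few lines (illustrative code, not SNN's implementation). Note how a single outlying residual dominates the squared error but contributes only its absolute size to the city-block error:

```python
def sum_squared_error(targets, outputs):
    """Standard error function: squaring magnifies large residuals,
    so a single outlier can dominate training."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

def city_block_error(targets, outputs):
    """City-Block (L1) error: absolute residuals, far less sensitive
    to an outlying case."""
    return sum(abs(t - o) for t, o in zip(targets, outputs))

# Three small residuals of 0.1 and one outlier with residual 10.
targets = [1.0, 2.0, 3.0, 4.0]
outputs = [1.1, 2.1, 2.9, 14.0]
sse = sum_squared_error(targets, outputs)   # outlier contributes 100
cbe = city_block_error(targets, outputs)    # outlier contributes 10
```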

The Intelligent Problem Solver (an Easy-to-Use Wizard for Network Creation)

Included with STATISTICA Neural Networks is an Intelligent Problem Solver (accessible via a toolbar button), which will
guide you through the network creation process.

The Intelligent Problem Solver can create networks using data whose cases are
independent (standard networks) as well as networks that predict future observations
based on previous observations of the same variable (time series networks).

Each wizard dialog requests general information regarding the data set and/or network,
explains the available choices, and suggests default values. After the information has been
gathered, SNN automatically displays all specified results.

A significant amount of time during the design of a neural network is spent on the selection of appropriate variables, and
then optimizing the network architecture by heuristic search. STATISTICA Neural Networks takes the pain out of the process
by automatically conducting a heuristic search for you.

Specifically, the Intelligent Problem Solver is an extremely effective tool that uses sophisticated
nonlinear optimization techniques (including Simulated Annealing) to search automatically for
an optimal network architecture. Why labor over a terminal for hours, when you can let
STATISTICA Neural Networks do the work for you?

The Intelligent Problem Solver can also be used in a process of model building when
STATISTICA Neural Networks is used in conjunction with some modules of the main
STATISTICA system to identify the most relevant variables (e.g., the best predictors to be
included and then tested in some Nonlinear Estimation model).

Training a Neural Network

As you experiment with architectures and network types, you rely critically on the quality and speed of the network training
algorithms. STATISTICA Neural Networks supports the best known state-of-the-art training algorithms.

For Multilayer Perceptrons, SNN naturally includes Back Propagation with time-varying
learning rate and momentum, case-presentation order shuffling, and additive Noise for robust
generalization. However, STATISTICA Neural Networks also includes two fast, second-order
training algorithms: Conjugate Gradient Descent and Levenberg-Marquardt. Levenberg-Marquardt is a very powerful,
modern nonlinear optimization algorithm, and it is strongly recommended. However, as Levenberg-Marquardt is limited in
application to fairly small networks with a single output variable, STATISTICA Neural Networks also includes Conjugate
Gradient Descent for more difficult problems. Both of these algorithms typically converge far more quickly than Back
Propagation, and frequently to a far better solution.
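The weight-update rule behind Back Propagation with a learning rate and momentum term can be illustrated on a one-dimensional quadratic error surface (a deliberately simplified sketch with made-up values, not a real network or SNN code):

```python
def descend(grad, w0, lr=0.1, momentum=0.9, steps=300):
    """Gradient descent with momentum on a single weight.

    Each step adds a fraction of the previous update (the momentum
    term) to the current gradient step, smoothing the search path.
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)
        w = w + velocity
    return w

# Error surface E(w) = (w - 3)^2 with gradient 2(w - 3); the descent
# should settle near the minimum at w = 3.
w_final = descend(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Second-order methods such as Conjugate Gradient Descent and Levenberg-Marquardt replace this fixed-step rule with steps informed by curvature, which is why they typically converge in far fewer epochs.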

STATISTICA Neural Networks' iterative training procedures are complemented by automatic tracking of both the Training
error and an independent Verification error, including a Graph of the overall errors and a Bar Chart for individual case
errors. Training can be aborted at any point by the click of a button, and you can also specify Stopping Conditions when
training should be prematurely aborted, for example, when a target error level is reached or when the Verification error
deteriorates over a given number of epochs (indicating Over-learning). If over-learning occurs, you needn't worry; SNN
automatically retains a copy of the Best Network discovered, which can be retrieved at the click of a button. When training
is finished, you can check performance against the independent Test Set.
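The stopping rule described above (retain the best network seen so far, and stop when the Verification error deteriorates over a given number of epochs) can be sketched as follows; the error curve here is simulated, not real SNN output:

```python
def train_with_early_stopping(verification_errors, patience=3):
    """Return the epoch and value of the best verification error,
    stopping once it has failed to improve for `patience` epochs."""
    best_err, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, err in enumerate(verification_errors):
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1             # verification error deteriorating
            if bad_epochs >= patience:
                break                   # over-learning: abort training
    return best_epoch, best_err

# A typical curve: verification error falls, bottoms out, then rises
# as the network begins to over-learn the training cases.
errs = [1.0, 0.6, 0.4, 0.35, 0.4, 0.5, 0.7, 0.9]
epoch, err = train_with_early_stopping(errs)
```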

STATISTICA Neural Networks also includes a range of training algorithms for other
network architectures. Radial Basis Function and Generalized Regression networks can
have Radial exemplar units and smoothing factors assigned by a variety of algorithms,
including Kohonen training, Sub-Sampling, K-Means, Isotropic, and Nearest Neighbor
techniques. The Linear output layers of Radial Basis Function networks can be fully
optimized using Singular Value Decomposition, as can Linear networks.
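K-Means, one of the center-assignment techniques listed above, alternates between assigning each case to its nearest center and moving each center to the mean of its cases; in a Radial Basis Function network the resulting centers serve as the radial units' exemplar vectors. A minimal one-dimensional sketch (illustrative code, not SNN's implementation):

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Place k centers among 1-D points by the K-Means algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)     # start from k random cases
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each case to its nearest center.
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned cases.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated clusters; expect centers near 0 and near 10.
data = [0.0, 0.2, -0.1, 0.1, 9.9, 10.1, 10.0, 9.8]
centers = k_means(data, 2)
```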

Hybridization of Network Structures. STATISTICA Neural Networks also supports hybridization of network structures; for
example, a modified Radial Basis Function network could have a first layer trained by Kohonen's algorithm, and a nonlinear
second layer trained by Levenberg-Marquardt.

Probing and Testing a Neural Network

Once you have trained a network, you'll want to test its performance and explore its characteristics. STATISTICA Neural
Networks uses a range of on-screen statistics and graphical facilities.

All statistics are generated independently for the Training, Verification, and Test Sets. You
can view the individual weights and activations in the network in convenient data sheet
format; one click of a button can also transfer them into STATISTICA as spreadsheets.
Copying these data to the Clipboard or saving them to a file can also be accomplished by
clicking a button. A range of output types can be viewed: post-processed outputs, output
neuron activations, and codebook vectors. Activations can also be viewed in Bar Chart form.
The results of running individual cases, or the entire set, can also be viewed in data sheet
form.

Overall statistics calculated include mean network error, the so-called Confusion matrix for Classification problems (which
summarizes correct and incorrect classification across all classes), and the Regression Error Ratio for Regression problems -
all automatically calculated. Kohonen networks include a Topological Map window, which allows you to visually inspect
unit activations, and to relabel cases and units during data analysis. There is also a Win Frequencies window to instantly
locate clusters in the Topological Map. Cluster analysis can also be conducted using conventional networks together with
STATISTICA Neural Networks' Cluster Diagram. For example, you can train a Principal Components
Analysis network, and plot data through the first two Principal Components.
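The Confusion matrix mentioned above simply cross-tabulates actual against predicted classes, with correct classifications on the diagonal. A minimal sketch (illustrative code; the class labels merely echo the classic Iris species):

```python
def confusion_matrix(actual, predicted, classes):
    """Build a confusion matrix: rows are actual classes, columns are
    predicted classes; correct cases fall on the diagonal."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        m[index[a]][index[p]] += 1
    return m

actual    = ["setosa", "setosa", "virginica", "virginica", "versicolor"]
predicted = ["setosa", "virginica", "virginica", "virginica", "versicolor"]
cm = confusion_matrix(actual, predicted,
                      ["setosa", "versicolor", "virginica"])
```

Here one setosa case is misclassified as virginica, which shows up as the single off-diagonal count in the first row.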

Network Editing, Modification, and Pipelining

STATISTICA Neural Networks includes intelligent facilities to prune existing networks, and to join networks together. Entire
layers can be deleted, networks with compatible numbers of Inputs and Outputs can be pipelined together, and individual
neurons can be added or removed. These facilities allow SNN to support Dimensionality Reduction (for preprocessing) by
the use of Autoassociative networks, and Loss Matrices (for minimum-cost decision making). Loss matrices are
automatically included with Probabilistic Neural Networks.

Embedded Solutions (Custom Applications that Use the STATISTICA Neural Networks Engines)


STATISTICA Neural Networks' simple and efficient user interface allows you to rapidly prototype neural network solutions
to your problems.

In some applications, you may want to embed these solutions in your own systems and,
for example, build them into some larger computing environments (such as predesigned
procedures built into enterprise-wide computing systems). STATISTICA Neural Networks'
API (Application Programming Interface) allows you to do just that. Two versions of the
API are provided. The Limited API allows networks created and trained in STATISTICA
Neural Networks to be executed from other applications, including programs written in C,
C++, Delphi, and Visual Basic. The Full API gives access to the whole power of
STATISTICA Neural Networks' sophisticated Neural Kernel, including creation, editing,
loading and saving of data sets and Networks, and the training of Networks using the
algorithms described above. After purchasing the appropriate number of Run-time Licenses,
you can embed this powerful functionality in your own applications. The API is provided using DLLs for Windows
platforms.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview


STATISTICA Neural Networks (SNN) implements neural networks, powerful non-linear modeling techniques. Neural
networks originally grew out of research aimed at mimicking the fault tolerance and distributed learning structure of the
human brain. However, the models implemented in SNN bear only a loose resemblance to their biological counterparts, and are better thought of as
advanced pattern recognition algorithms.

Neural network analysis can be used as a stand-alone application or as a perfect complement to traditional statistical
analyses, and SNN supplements and completes the various techniques for classification and prediction available in
STATISTICA and STATISTICA Data Miner (see for example the General Linear Models (GLM), General Regression Models
(GRM), Generalized Linear/Nonlinear Models (GLZ), General Discriminant Function Analysis (GDA), Generalized
Additive Models (GAM), General Classification and Regression Trees Models (GC&RT), General CHAID Models
(GCHAID), Time Series, and Association Rules, to name only a few of the powerful methods for prediction and
classification available in STATISTICA). Most traditional statistical analyses focus on building useful models based upon a
number of assumptions and theoretical considerations (e.g., that the underlying relationship is linear or that a particular
variable is normally distributed in the population). A neural network approach is free of the traditional assumptions, is well-
suited for complex non-linear relationships, and is perfect for exploratory analyses where the goal is to establish if any
relationship exists among a set of variables. Moreover, SNN's unique and powerful Intelligent Problem Solver removes most
of the problems usually associated with neural network design, automatically selecting suitable network types, complexity
and input variables.

The major types of analyses available in SNN are:

Intelligent Problem Solver - A specialized tool to analyze the data and generate neural networks for you, requiring minimal
intervention on your behalf and conducting all necessary phases of the analysis.

Custom Network Designer - A lower level tool allowing you to choose individual network architectures and training
algorithms to exact specifications.

Ensemble Editor - Forms networks into cooperating ensembles that may have superior generalization performance
(prediction on new data).

Feature Selection - Runs the auxiliary feature selection algorithms. SNN has feature selection (i.e. choice of independent
(input) variables) built into the Intelligent Problem Solver and Custom Network Designer. This additional analysis includes
specialized, slower algorithms that sometimes yield better results.

Code Generator - Generates a code version of your neural networks and ensembles, either in the "C" computer language or
in STATISTICA Visual Basic, for integration into your own programs. (This is an optional, separately licensed, feature.)

STATISTICA Neural Networks supports the most practical types of neural networks known for real-world problem-solving
today, and includes the latest state-of-the-art techniques for fast training, automatic design and variable selection. The user
interface of SNN is designed to make these methods accessible to experts and non-experts alike, and to quickly yield useful
results, ready for deployment, for predicting or classifying new observations efficiently. SNN is also fully integrated into the STATISTICA
environment, with all its analytical and graphical capabilities. Also, practically all functionality of SNN is fully automated,
and accessible via STATISTICA Visual Basic or other programs that support the standard COM object interface.

Applications for Neural Networks

file:///C:/Users/CERIn%2047/AppData/Local/Temp/~hhDEAB.htm 09/10/2018
Neural Networks Página 17 de 220

Neural networks have seen an explosion of interest over the last few years, and are being successfully applied across an
extraordinary range of problem domains, in areas as diverse as business, medicine, engineering, geology and physics.
Indeed, anywhere that there are problems of prediction, classification, or control, neural networks are being introduced.

Neural networks are applicable in virtually every situation in which a relationship between the predictor variables
(independents, inputs) and predicted variables (dependents, outputs) exists, even when that relationship is very complex and
not easy to articulate in the usual terms of "correlations" or "differences between groups." A few representative examples of
problems to which neural network analysis has been applied successfully are:

Detection of medical phenomena. A variety of health-related indices (e.g., a combination of heart rate, levels of various
substances in the blood, respiration rate) can be monitored. The onset of a particular medical condition could be associated
with a very complex (e.g., nonlinear and interactive) combination of changes on a subset of the variables being monitored.
Neural networks have been used to recognize this predictive pattern so that the appropriate treatment can be prescribed.

Stock market prediction. Fluctuations of stock prices and stock indices are another example of a complex,
multidimensional, but in some circumstances at least partially-deterministic phenomenon. Neural networks are being used
by many technical analysts to make predictions about stock prices based upon a large number of factors such as past
performance of other stocks and various economic indicators.

Credit assignment. A variety of pieces of information are usually known about an applicant for a loan. For instance, the
applicant's age, education, occupation, and many other facts may be available. After training a neural network on historical
data, neural network analysis can identify the most relevant characteristics and use those to classify applicants as good or
bad credit risks.

Monitoring the condition of machinery. Neural networks can be instrumental in cutting costs by bringing additional
expertise to scheduling the preventive maintenance of machines. A neural network can be trained to distinguish between the
sounds a machine makes when it is running normally ("false alarms") versus when it is on the verge of a problem. After this
training period, the expertise of the network can be used to warn a technician of an upcoming breakdown, before it occurs
and causes costly unforeseen "downtime."

Engine management. Neural networks have been used to analyze the input of sensors from an engine. The neural network
controls the various parameters within which the engine functions, in order to achieve a particular goal, such as minimizing
fuel consumption.

The following links lead to a series of Neural Networks Overviews:

Preface

The Biological Inspiration

The Basic Artificial Model

Using a Neural Network

Gathering Data for Neural Networks

Pre- and Post-processing

Multilayer Perceptrons

Radial Basis Function Networks

Probabilistic Neural Networks

Generalized Regression Neural Networks

Linear Networks

SOFM Networks

Classification Problems

Regression Problems

Time Series Prediction in STATISTICA Neural Networks

Variable Selection and Dimensionality Reduction

Ensembles and Resampling

Recommended Textbooks

Neural Networks Introductory Overview - Preface


Neural networks have seen an explosion of interest over the last few years, and are being successfully applied across an
extraordinary range of problem domains, in areas as diverse as finance, medicine, engineering, geology and physics. Indeed,
anywhere that there are problems of prediction, classification, or control, neural networks are being introduced. This
sweeping success can be attributed to a few key factors:

Power. Neural networks are very sophisticated modeling techniques capable of modeling extremely complex functions. In
particular, neural networks are nonlinear (a term discussed in more detail later in this section). For many years, linear
modeling has been the commonly used technique in most modeling domains since linear models have well-known
optimization strategies. Where the linear approximation was not valid (which was frequently the case) the models suffered
accordingly. Neural networks also keep in check the curse of dimensionality problem that bedevils attempts to model
nonlinear functions with large numbers of variables.

Ease of use. Neural networks learn by example. The neural network user gathers representative data, and then invokes
training algorithms to automatically learn the structure of the data. Although the user does need to have some heuristic
knowledge of how to select and prepare data, how to select an appropriate neural network, and how to interpret the results,
the level of user knowledge needed to successfully apply neural networks is much lower than would be the case using (for
example) some more traditional nonlinear statistical methods.

Neural networks are also intuitively appealing, based as they are on a crude low-level model of biological neural systems. In
the future, the development of this neurobiological modeling may lead to genuinely intelligent computers. Meanwhile, the
simple neural networks modeled by STATISTICA Neural Networks already add a significant weapon to the armory of the
applied statistician.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - The Biological Inspiration

Neural networks grew out of research in artificial intelligence; specifically, attempts to mimic the fault-tolerance and
capacity to learn of biological neural systems by modeling the low-level structure of the brain (see Patterson, 1996). The
main branch of artificial intelligence research in the 1960s -1980s produced expert systems. These are based upon a high-
level model of reasoning processes (specifically, the concept that our reasoning processes are built upon manipulation of
symbols). It became rapidly apparent that these systems, although very useful in some domains, failed to capture certain key
aspects of human intelligence. According to one line of speculation, this was due to their failure to mimic the underlying
structure of the brain. In order to reproduce intelligence, it would be necessary to build systems with a similar architecture.

The brain is principally composed of a very large number (circa 10,000,000,000) of neurons, massively interconnected (with
an average of several thousand interconnects per neuron, although this varies enormously). Each neuron is a specialized cell
that can propagate an electrochemical signal. The neuron has a branching input structure (the dendrites), a cell body, and a
branching output structure (the axon). The axons of one cell connect to the dendrites of another via a synapse. When a
neuron is activated, it fires an electrochemical signal along the axon. This signal crosses the synapses to other neurons,
which may in turn fire. A neuron fires only if the total signal received at the cell body from the dendrites exceeds a certain
level (the firing threshold).

The strength of the signal received by a neuron (and therefore its chances of firing) critically depends on the efficacy of the
synapses. Each synapse actually contains a gap, with neurotransmitter chemicals poised to transmit a signal across the gap.
One of the most influential researchers into neurological systems (Donald Hebb) postulated that learning consisted
principally in altering the strength of synaptic connections. For example, in the classic Pavlovian conditioning experiment,
where a bell is rung just before dinner is delivered to a dog, the dog rapidly learns to associate the ringing of a bell with the
eating of food. The synaptic connections between the appropriate part of the auditory cortex and the salivation glands are
strengthened, so that when the auditory cortex is stimulated by the sound of the bell the dog starts to salivate.
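Hebb's postulate can be stated compactly: a synapse is strengthened in proportion to the correlated activity of the neurons on either side of it. A minimal sketch of the idea (the function name and rate constant are illustrative, not part of any SNN facility):

```python
def hebbian_update(weight, pre, post, rate=0.1):
    """Hebb's rule: strengthen a connection in proportion to the
    product of pre- and post-synaptic activity."""
    return weight + rate * pre * post
```

When both neurons fire together (pre = post = 1), the weight grows; if either is silent, the weight is unchanged.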

Thus, from a very large number of extremely simple processing units (each performing a weighted sum of its inputs, and
then firing a binary signal if the total input exceeds a certain level) the brain manages to perform extremely complex tasks.
Of course, there is a great deal of complexity in the brain that has not been discussed here, but it is interesting that artificial
neural networks can achieve some remarkable results using a model not much more complex than this.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - The Basic Artificial Model

To capture the essence of biological neural systems, an artificial neuron is defined as follows:

It receives a number of inputs (either from original data, or from the output of other neurons in the neural network). Each
input comes via a connection that has a strength (or weight); these weights correspond to synaptic efficacy in a biological
neuron. Each neuron also has a single threshold value. The weighted sum of the inputs is formed, and the threshold
subtracted, to compose the activation of the neuron (also known as the post-synaptic potential, or PSP, of the neuron).

The activation signal is passed through an activation function (also known as a transfer function) to produce the output of the
neuron.

If the step activation function is used (i.e., the neuron's output is 0 if the input is less than zero, and 1 if the input is greater
than or equal to 0) then the neuron acts just like the biological neuron described earlier (subtracting the threshold from the
weighted sum and comparing with zero is equivalent to comparing the weighted sum to the threshold). Actually, the step
function is rarely used in artificial neural networks, as will be discussed. Note also that weights can be negative, which
implies that the synapse has an inhibitory rather than excitatory effect on the neuron: inhibitory neurons are found in the
brain.
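The neuron just described can be sketched in a few lines of code (an illustrative model only; the function name and the example weights are hypothetical):

```python
def neuron_output(inputs, weights, threshold):
    """Form the weighted sum of the inputs, subtract the threshold,
    and pass the activation through a step function."""
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return 1 if activation >= 0 else 0
```

With weights of 0.5 on each of two inputs and a threshold of 0.6, the neuron fires only when both inputs are active, behaving as a logical AND.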

This describes an individual neuron. The next question is: how should neurons be connected together? If a network is to be
of any use, there must be inputs (which carry the values of variables of interest in the outside world) and outputs (which
form predictions, or control signals). Inputs and outputs correspond to sensory and motor nerves such as those coming from
the eyes and leading to the hands. However, there also can be hidden neurons that play an internal role in the network. The
input, hidden, and output neurons need to be connected together.

The key issue here is feedback (Haykin, 1994). A simple network has a feedforward structure: signals flow from inputs,
forwards through any hidden units, eventually reaching the output units. Such a structure has stable behavior. However, if
the network is recurrent (contains connections back from later to earlier neurons) it can be unstable, and has very complex
dynamics. Recurrent networks are very interesting to researchers in neural networks, but so far it is the feedforward
structures that have proved most useful in solving real problems, and it is these types of neural networks that STATISTICA
Neural Networks models.

A typical feedforward network has neurons arranged in a distinct layered topology. The input layer is not really neural at all:
these units simply serve to introduce the values of the input variables. The hidden and output layer neurons are each
connected to all of the units in the preceding layer. Again, it is possible to define networks that are partially-connected to
only some units in the preceding layer; however, for most applications fully-connected networks are better, and this is the
type of network supported by STATISTICA Neural Networks.

When the network is executed (used), the input variable values are placed in the input units, and then the hidden and output
layer units are progressively executed. Each of them calculates its activation value by taking the weighted sum of the outputs
of the units in the preceding layer, and subtracting the threshold. The activation value is passed through the activation
function to produce the output of the neuron. When the entire network has been executed, the outputs of the output layer act
as the output of the entire network.
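This layer-by-layer execution can be sketched as follows (a simplified illustration using the logistic activation function; SNN's internal representation differs):

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def execute_network(inputs, layers):
    """Each layer is a list of units; each unit is a (weights, threshold) pair.
    Values are propagated forward from the inputs through every layer,
    and the final layer's outputs are the outputs of the network."""
    values = inputs
    for layer in layers:
        values = [logistic(sum(w * v for w, v in zip(weights, values)) - threshold)
                  for weights, threshold in layer]
    return values
```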

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Using a Neural Network

The previous section describes in simplified terms how a neural network turns inputs into outputs. The next important
question is: how do you apply a neural network to solve a problem?

The type of problem amenable to solution by a neural network is defined by the way they work and the way they are trained.
Neural networks work by feeding in some input variables, and producing some output variables. They can therefore be used
where you have some known information, and would like to infer some unknown information (see Patterson, 1996; Fausett,
1994). Some examples are:

Stock market prediction. You know last week's stock prices and today's DOW, NASDAQ, or FTSE index; you want to
know tomorrow's stock prices.

Credit assignment. You want to know whether an applicant for a loan is a good or bad credit risk. You usually know
applicants' income, previous credit history, etc. (because you ask them these things).

Control. You want to know whether a robot should turn left, turn right, or move forward in order to reach a target; you
know the scene that the robot's camera is currently observing.

Needless to say, not every problem can be solved by a neural network. You may want to know next week's lottery result, and
know your shoe size, but there is no relationship between the two. Indeed, if the lottery is being run correctly, there is no fact
you could possibly know that would allow you to infer next week's result. Many financial institutions use, or have
experimented with, neural networks for stock market prediction, so it is likely that any trends predictable by neural
techniques are already discounted by the market, and (unfortunately), unless you have a sophisticated understanding of that
problem domain, you are unlikely to have any success there either!

Therefore, another important requirement for the use of a neural network is that you know (or at least strongly suspect) that
there is a relationship between the proposed known inputs and unknown outputs. This relationship may be noisy (you
certainly would not expect that the factors given in the stock market prediction example above could give an exact
prediction, as prices are clearly influenced by other factors not represented in the input set, and there may be an element of
pure randomness) but it must exist.

In general, if you use a neural network, you won't know the exact nature of the relationship between inputs and outputs - if
you knew the relationship, you would model it directly. The other key feature of neural networks is that they learn the
input/output relationship through training. There are two types of training used in neural networks, with different types of
networks using different types of training. These are supervised and unsupervised training, of which supervised is the most
common and will be discussed in this section (unsupervised learning is described in a later section).

In supervised learning, the network user assembles a set of training data. The training data contains examples of inputs
together with the corresponding outputs, and the network learns to infer the relationship between the two. Training data is
usually taken from historical records. In the above examples, this might include previous stock prices and DOW, NASDAQ,
or FTSE indices, records of previous successful loan applicants, including questionnaires and a record of whether they
defaulted or not, or sample robot positions and the correct reaction.

The neural network is then trained using one of the supervised learning algorithms (of which the best known example is
back propagation, devised by Rumelhart et al., 1986), which uses the data to adjust the network's weights and thresholds so
as to minimize the error in its predictions on the training set. If the network is properly trained, it has then learned to model
the (unknown) function that relates the input variables to the output variables, and can subsequently be used to make
predictions where the output is not known.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Gathering Data for Neural Networks

Once you have decided on a problem to solve using neural networks, you will need to gather data for training purposes. The
training data set includes a number of cases, each containing values for a range of input and output variables. The first
decisions you will need to make are: which variables to use, and how many (and which) cases to gather.

The choice of variables (at least initially) is guided by intuition. Your own expertise in the problem domain will give you
some idea of which input variables are likely to be influential. Once in STATISTICA Neural Networks (SNN), you can select
and deselect variables, and STATISTICA Neural Networks can also experimentally determine useful variables. As a first
pass, you should include any variables that you think could have an influence - part of the design process will be to whittle
this set down.

Neural networks process numeric data in a fairly limited range. This presents a problem if data is in an unusual range, if
there is missing data, or if data is non-numeric. Fortunately, there are methods built into STATISTICA Neural Networks to
deal with each of these problems. Numeric data is scaled into an appropriate range for the network, and missing values can
be substituted using the mean value (or another statistic) of that variable across the other available training cases (see
Bishop, 1995).
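Mean substitution can be sketched as follows (an illustration of the idea, not SNN's actual routine; missing values are represented here by None):

```python
def impute_mean(column):
    """Replace each missing entry with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]
```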

Handling non-numeric data is more difficult. The most common form of non-numeric data consists of nominal-value
variables such as Gender = {Male, Female}. Nominal-valued variables can be represented numerically, and STATISTICA
Neural Networks has facilities to support them. However, neural networks do not tend to perform well with nominal
variables that have a large number of possible values.

For example, consider a neural network being trained to estimate the value of houses. The price of houses depends critically
on the area of a city in which they are located. A particular city might be subdivided into dozens of named locations, and so
it might seem natural to use a nominal-valued variable representing these locations. Unfortunately, it would be very difficult
to train a neural network under these circumstances, and a more credible approach would be to assign ratings (based on
expert knowledge) to each area; for example, you might assign ratings for the quality of local schools, convenient access to
leisure facilities, etc.

Other kinds of non-numeric data must either be converted to numeric form, or discarded. Dates and times, if important, can
be converted to an offset value from a starting date/time. Currency values can be converted easily. Unconstrained text fields
(such as names) cannot be handled and should be discarded.

The number of cases required for neural network training frequently presents difficulties. There are some heuristic
guidelines, which relate the number of cases needed to the size of the network (the simplest of these says that there should be
ten times as many cases as connections in the network). Actually, the number needed is also related to the (unknown)
complexity of the underlying function that the network is trying to model, and to the variance of the additive noise. As the
number of variables increases, the number of cases required increases nonlinearly, so that with even a fairly small number of
variables (perhaps fifty or less) a huge number of cases are required. This problem is known as the curse of dimensionality
(see Variable Selection and Dimensionality Reduction).

For most practical problem domains, the number of cases required will be hundreds or thousands. For very complex
problems, more may be required, but it would be a rare (and trivial) problem that required fewer than a hundred cases. If your
data is sparser than this, you really don't have enough information to train a network, and the best you can do is probably to
fit a linear model (which STATISTICA Neural Networks can also do for you; see Linear Networks). If you have a larger, but
still restricted, data set, you can compensate to some extent by forming an ensemble of networks, each trained using a
different resampling of the available data, and then averaging across the predictions of the networks in the ensemble.
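The resample-and-average idea can be sketched as follows (illustrative only; SNN's ensemble and resampling facilities implement this more fully):

```python
import random

def bootstrap_resamples(data, n_models, seed=0):
    """Draw n_models resamples of the data, each the same size as the
    original, sampled with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]

def ensemble_predict(models, x):
    """Average the predictions of the individually trained models."""
    return sum(model(x) for model in models) / len(models)
```

Each network would be trained on one resample; averaging their predictions tends to reduce the variance that comes from any single small training set.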

Many practical problems suffer from data that is unreliable: some variables may be corrupted by noise, or values may be
missing altogether. STATISTICA Neural Networks has special facilities to handle missing values (they can be patched using
the Mean variable value, or other statistics), so if you are short of data you can include cases with missing values (although
obviously this is not ideal if you can avoid it). Neural networks are also noise tolerant. However, there is a limit to this
tolerance; if there are occasional outliers far outside the range of normal values for a variable, they may bias the training.
The best approach to such outliers is to identify and remove them (either discarding the case, or converting the outlier into a
missing value). If outliers are difficult to detect, STATISTICA Neural Networks does have features to make training more
outlier-tolerant (use of city block error function; see Bishop, 1995), but this outlier-tolerant training is generally less
effective than the standard approach.

Summary

- Choose variables that you believe may be influential.

- Numeric and nominal variables can be handled directly by STATISTICA Neural Networks. Convert other variables to one
of these forms, or discard.

- Hundreds or thousands of cases are required; the more variables, the more cases. STATISTICA Neural Networks has
facilities to help identify useful variables, so initially include even those you're not sure about.

- Cases with missing values can be used, if necessary, but outliers may cause problems - check your data. Remove outliers
if possible. If you have sufficient data, discard cases with missing values.

- If the volume of data available is small, consider using ensembles and resampling.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Pre- and Post-processing

All neural networks take numeric input and produce numeric output. The transfer function of a unit is typically chosen so
that it can accept input in any range, and produces output in a strictly limited range (it has a squashing effect). Although the
input can be in any range, there is a saturation effect so that the unit is only sensitive to inputs within a fairly limited range.
One of the most common transfer functions is the logistic function (also sometimes referred to as
the sigmoid function, although strictly speaking it is only one example of a sigmoid - S-shaped - function). In this case, the
output is in the range (0,1), and the unit is sensitive to inputs in a range not much larger than (-1,+1). The function is also smooth
and easily differentiable, facts that are critical in allowing the network training algorithms to operate (this is the reason why
the step function is not used in practice).
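The logistic function and its derivative can be written down directly (a sketch; the function names are illustrative):

```python
import math

def logistic(a):
    """The logistic transfer function: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def logistic_derivative(a):
    """The derivative is smooth and defined everywhere, which is what
    gradient-based training relies on (unlike the discontinuous step)."""
    y = logistic(a)
    return y * (1.0 - y)
```

Note the saturation effect: for inputs of large magnitude the output is pinned very close to 0 or 1 and the derivative is nearly zero.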

The limited numeric response range, together with the fact that information has to be in numeric form, implies that neural
solutions require preprocessing and post-processing stages to be used in real applications (see Bishop, 1995). These facilities
are built into STATISTICA Neural Networks. Two issues need to be addressed:

Scaling. Numeric values have to be scaled into a range that is appropriate for the network. Typically, raw variable values are
scaled linearly. STATISTICA Neural Networks includes minimax and mean/SD algorithms, which automatically calculate
scaling values to transform numeric values into the desired range.
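Both schemes are simple linear transformations; an illustrative sketch (not the SNN routines themselves):

```python
def minimax_scale(values, lo=0.0, hi=1.0):
    """Map the observed range of the values linearly onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

def mean_sd_scale(values):
    """Shift to zero mean and scale to unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]
```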

In some circumstances, nonlinear scaling may be appropriate (for example, if you know that a variable is exponentially
distributed, you might take the logarithm). Nonlinear scaling is not supported in STATISTICA Neural Networks. Instead, you
should apply scaling using data analysis packages in STATISTICA before analyzing the transformed data in STATISTICA
Neural Networks.

Nominal. Nominal variables can be two-state [e.g., Gender = (Male, Female)] or many-state (i.e., more than two states). A
two-state nominal variable is easily represented by transformation into a numeric value (e.g., Male = 0, Female = 1). Many-
state nominal variables are more difficult to handle. They can be represented using an ordinal encoding (e.g., Dog = 0,
Budgie = 1, Cat = 2) but this implies a (probably) false ordering on the nominal values - in this case, that Budgies are in
some sense midway between Dogs and Cats. A better approach, known as one-of-N encoding, is to use a number of numeric
variables to represent the single nominal variable. The number of numeric variables equals the number of possible values;
one of the N variables is set, and the others cleared [e.g., Dog = (1,0,0), Budgie = (0,1,0), Cat = (0,0,1)]. STATISTICA
Neural Networks has facilities to convert both two-state and many-state nominal variables for use in the neural network.
Unfortunately, a nominal variable with a large number of states would require a prohibitive number of numeric variables for
one-of-N encoding, driving up the network size and making training difficult. In such a case it is possible (although
unsatisfactory) to model the nominal variable using a single numeric ordinal; a better approach is to look for a different way
to represent the information.
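One-of-N encoding is straightforward to express (a sketch of the encoding itself; SNN performs this conversion automatically):

```python
def one_of_n(value, categories):
    """Represent a nominal value as a one-of-N vector: the variable for
    the matching category is set to 1, and all others are cleared to 0."""
    return [1 if value == category else 0 for category in categories]
```

Note that the length of the encoded vector equals the number of categories, which is why variables with many states inflate the network's input count.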

Prediction problems may be divided into two main categories:

Classification. In classification, the objective is to determine to which of a number of discrete classes a given input case
belongs. Examples include credit assignment (is this person a good or bad credit risk), cancer detection (tumor, clear),
signature recognition (forgery, true). In all these cases, the output required is clearly a single nominal variable. The most
common classification tasks are (as above) two-state, although many-state tasks also arise.

Regression. In regression, the objective is to predict the value of a (usually) continuous variable: tomorrow's stock price, the
fuel consumption of a car, next year's profits. In this case, the output required is a single numeric variable.

Neural networks can actually perform a number of regression and/or classification tasks at once, although commonly each
network performs only one. In the vast majority of cases, therefore, the network will have a single output variable, although
in the case of many-state classification problems, this may correspond to a number of output units (the post-processing stage
takes care of the mapping from output units to output variables). If you do define a single network with multiple output
variables, it may suffer from cross-talk (the hidden neurons experience difficulty learning, as they are attempting to model at
least two functions at once). The best solution is usually to train separate networks for each output, then to combine them
into an ensemble so that they can be run as a unit.

STATISTICA Neural Networks deals with all these issues by including special pre- and post-processing facilities that
transform the raw data into a numeric form suitable for the neural network, and transform the outputs of the neural network
back to a form compatible with the raw data. The network is sandwiched between the pre/post-processing layers, and results
are presented in the desired form (for example, output classes are reported directly by name in classification problems).
However, STATISTICA Neural Networks does also allow you to access the internal activations of the network if you want.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Multilayer Perceptrons

The multilayer perceptron is perhaps the most popular network architecture in use today, due originally to Rumelhart and
McClelland (1986) and discussed at length in most neural network textbooks (e.g., Bishop, 1995). This is the type of
network discussed briefly in previous sections: the units each perform a biased weighted sum of their inputs and pass this
activation level through a transfer function to produce their output, and the units are arranged in a layered feedforward
topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds
(biases) the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the
number of layers, and the number of units in each layer, determining the function complexity. Important issues in Multilayer
Perceptrons (MLP) design include specification of the number of hidden layers and the number of units in these layers (see
Haykin, 1994; Bishop, 1995).

The number of input and output units is defined by the problem (there may be some uncertainty about precisely which inputs
to use, a point to which we will return later. However, for the moment we will assume that the input variables are intuitively
selected and are all meaningful). The number of hidden units to use is far from clear. As good a starting point as any is to use
one hidden layer, with the number of units equal to half the sum of the number of input and output units. Again, we will
discuss how to choose a sensible number later.

Training Multilayer Perceptrons

Once the number of layers, and number of units in each layer, has been selected, the network's weights and thresholds must
be set so as to minimize the prediction error made by the network. This is the role of the training algorithms. The historical
cases that you have gathered are used to automatically adjust the weights and thresholds in order to minimize this error. This
process is equivalent to fitting the model represented by the network to the training data available. The error of a particular
configuration of the network can be determined by running all the training cases through the network, comparing the actual
output generated with the desired or target outputs. The differences are combined together by an error function to give the
network error. The most common error functions are the sum-squared error (used for regression problems), where the
individual errors of output units on each case are squared and summed together, and the cross entropy functions (used for
maximum likelihood classification). STATISTICA Neural Networks reports the RMS (the above normalized for the number
of cases and variables, then square-rooted); this neatly summarizes the error over the entire training set and set of output
units.
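These error calculations can be sketched directly from the definitions above (an illustration of the formulas as stated, not STATISTICA's code; the target and output values are arbitrary):

```python
import math

def sum_squared_error(targets, outputs):
    # Individual errors of output units on each case, squared and summed.
    return sum((t - o) ** 2
               for t_row, o_row in zip(targets, outputs)
               for t, o in zip(t_row, o_row))

def rms_error(targets, outputs):
    # The SSE normalized for the number of cases and output units,
    # then square-rooted, as described in the text.
    n_cases = len(targets)
    n_outputs = len(targets[0])
    return math.sqrt(sum_squared_error(targets, outputs) / (n_cases * n_outputs))

# Two cases, two output units each.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.9, 0.2], [0.1, 0.7]]
print(sum_squared_error(targets, outputs))  # 0.01 + 0.04 + 0.01 + 0.09 = 0.15
print(rms_error(targets, outputs))
```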

In traditional modeling approaches (e.g., linear modeling) it is possible to algorithmically determine the model configuration
that absolutely minimizes this error. The price paid for the greater (non-linear) modeling power of neural networks is that
although we can adjust a network to lower its error, we can never be sure that the error could not be lower still.

A helpful concept here is the error surface. Each of the N weights and thresholds of the network (i.e., the free parameters of
the model) is taken to be a dimension in space. The N+1th dimension is the network error. For any possible configuration of
weights the error can be plotted in the N+1th dimension, forming an error surface. The objective of network training is to
find the lowest point in this many-dimensional surface.

In a linear model with sum-squared error function, this error surface is a parabola (a quadratic), which means that it is a
smooth bowl-shape with a single minimum. It is therefore "easy" to locate the minimum.

Neural network error surfaces are much more complex, and are characterized by a number of unhelpful features, such as
local minima (which are lower than the surrounding terrain, but above the global minimum), flat-spots and plateaus, saddle-
points, and long narrow ravines.

It is not possible to analytically determine where the global minimum of the error surface is, and so neural network training
is essentially an exploration of the error surface. From an initially random configuration of weights and thresholds (i.e., a
random point on the error surface), the training algorithms incrementally seek the global minimum. Typically, the
gradient (slope) of the error surface is calculated at the current point, and used to make a downhill move. Eventually, the
algorithm stops in a low point, which may be a local minimum (but hopefully is the global minimum).

The Back Propagation Algorithm

The best-known example of a neural network training algorithm is back propagation (see Patterson, 1996; Haykin, 1994;
Fausett, 1994). Modern second-order algorithms such as conjugate gradient descent and Levenberg-Marquardt (see Bishop,
1995; Shepherd, 1997) (both included in STATISTICA Neural Networks) are substantially faster (e.g., an order of magnitude
faster) for many problems, but back propagation still has advantages in some circumstances, and is the easiest algorithm to
understand. We will introduce this now, and discuss the more advanced algorithms later. There are also heuristic
modifications of back propagation that work well for some problem domains, such as quick propagation (Fahlman, 1988)
and Delta-bar-Delta (Jacobs, 1988), which are also included in STATISTICA Neural Networks.

In back propagation, the gradient vector of the error surface is calculated. This vector points in the direction of steepest
descent from the current point, so we know that if we move along it a "short" distance, we will decrease the error. A
sequence of such moves (slowing as we near the bottom) will eventually find a minimum of some sort. The difficult part is
to decide how large the steps should be.

Large steps can converge more quickly, but can also overstep the solution or (if the error surface is very eccentric) go off in
the wrong direction. A classic example of this in neural network training is where the algorithm progresses very slowly
along a steep, narrow, valley, bouncing from one side across to the other. In contrast, very small steps may go in the correct
direction, but they also require a large number of iterations. In practice, the step size is proportional to the slope (so that the
algorithm settles down in a minimum) and to a special constant: the learning rate. The correct setting for the learning rate is
application-dependent, and is typically chosen by experiment; it may also be time-varying, getting smaller as the algorithm
progresses.

The algorithm is also usually modified by inclusion of a momentum term: this encourages movement in a fixed direction, so
that if several steps are taken in the same direction, the algorithm picks up speed, which gives it the ability to (sometimes)
escape local minima, and also to move rapidly over flat spots and plateaus.

The algorithm therefore progresses iteratively, through a number of epochs. On each epoch, the training cases are each
submitted in turn to the network, and target and actual outputs compared and the error calculated. This error, together with
the error surface gradient, is used to adjust the weights, and then the process repeats. The initial network configuration is
random, and training stops when a given number of epochs elapses, or when the error reaches an acceptable level, or when
the error stops improving (you can select which of these stopping conditions to use).
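A minimal sketch of this iterative process - a step proportional to the gradient and the learning rate, plus a momentum term - applied, for simplicity, to a single linear unit rather than a full multilayer network (the data, learning rate, momentum, and stopping threshold are all arbitrary choices):

```python
# Fit y = 2x + 1 with a single linear unit (out = w*x + b) by gradient
# descent with momentum; the targets are noise-free, so the error can fall
# essentially to zero.
cases = [(i / 10.0, 2 * (i / 10.0) + 1) for i in range(-10, 11)]

w, b = 0.0, 0.0                      # initial network configuration
learning_rate, momentum = 0.1, 0.5
dw_prev, db_prev = 0.0, 0.0

for epoch in range(200):             # each pass over the cases is one epoch
    sse = 0.0
    for x, target in cases:          # cases submitted in turn
        err = (w * x + b) - target   # actual output minus target output
        sse += err * err
        # Step proportional to the slope and the learning rate, plus a
        # momentum term that encourages movement in a consistent direction.
        dw = -learning_rate * err * x + momentum * dw_prev
        db = -learning_rate * err + momentum * db_prev
        w, b = w + dw, b + db
        dw_prev, db_prev = dw, db
    if sse < 1e-9:                   # stop when error is acceptably low
        break

print(round(w, 3), round(b, 3))      # converges near w = 2, b = 1
```

Real back propagation chains this same gradient calculation backwards through the hidden layers; the update rule itself is unchanged.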

Over-learning and Generalization

One major problem with the approach outlined above is that it doesn't actually minimize the error that we are really
interested in - which is the expected error the network will make when new cases are submitted to it. In other words, the
most desirable property of a network is its ability to generalize to new cases. In reality, the network is trained to minimize
the error on the training set, and short of having a perfect and infinitely large training set, this is not the same thing as
minimizing the error on the real error surface - the error surface of the underlying and unknown model (see Bishop, 1995).

The most important manifestation of this distinction is the problem of over-learning, or over-fitting. It is easiest to
demonstrate this concept using polynomial curve fitting rather than neural networks, but the concept is precisely the same.

A polynomial is an equation with terms containing only constants and powers of the variables. For example:

y = 2x + 3

y = 3x² + 4x + 1

Different polynomials have different shapes, with larger powers (and therefore larger numbers of terms) having steadily
more eccentric shapes. Given a set of data, we may want to fit a polynomial curve (i.e., a model) to explain the data. The
data is probably noisy, so we don't necessarily expect the best model to pass exactly through all the points. A low-order
polynomial may not be sufficiently flexible to fit close to the points, whereas a high-order polynomial is actually too
flexible, fitting the data exactly by adopting a highly eccentric shape that is actually unrelated to the underlying function. See
the illustration below.
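The polynomial example can be reproduced numerically (a sketch using NumPy; the noise level, sample size, and polynomial orders are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an underlying quadratic (the second polynomial above).
x = np.linspace(-1, 1, 15)
y_true = 3 * x**2 + 4 * x + 1
y = y_true + rng.normal(0.0, 0.3, size=x.shape)

for order in (1, 2, 9):
    coeffs = np.polyfit(x, y, order)
    fit = np.polyval(coeffs, x)
    # The error against the noisy data always falls as the order grows,
    # but the error against the true underlying function does not: the
    # high-order fit chases the noise.
    fit_err = np.sqrt(np.mean((fit - y) ** 2))
    true_err = np.sqrt(np.mean((fit - y_true) ** 2))
    print(order, round(float(fit_err), 3), round(float(true_err), 3))
```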

Neural networks have precisely the same problem. A network with more weights models a more complex function, and is
therefore prone to over-fitting. A network with fewer weights may not be sufficiently powerful to model the underlying
function. For example, a network with no hidden layers actually models a simple linear function.

How then can we select the right complexity of network? A larger network will almost invariably achieve a lower error
eventually, but this may indicate over-fitting rather than good modeling.

The answer is to check progress against an independent data set, the selection set. Some of the cases are reserved, and not
actually used for training in the back propagation algorithm. Instead, they are used to keep an independent check on the
progress of the algorithm. It is invariably the case that the initial performance of the network on training and selection sets is
the same (if it is not at least approximately the same, the division of cases between the two sets is probably biased). As
training progresses, the training error naturally drops, and providing training is minimizing the true error function, the
selection error drops too. However, if the selection error stops dropping, or indeed starts to rise, this indicates that the
network is starting to overfit the data, and training should cease (STATISTICA Neural Networks is, by default, configured to
stop automatically once over-learning starts to occur). When over-fitting occurs during the training process like this, it is
called over-learning. In this case, it is usually advisable to decrease the number of hidden units and/or hidden layers, as the
network is over-powerful for the problem at hand. In contrast, if the network is not sufficiently powerful to model the
underlying function, over-learning is not likely to occur, and neither training nor selection errors will drop to a satisfactory
level.
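The stopping rule can be sketched as follows (the selection-error values here are synthetic, purely to illustrate the rule; STATISTICA applies its own stopping criteria internally, and the patience value is an arbitrary choice):

```python
def early_stop(selection_errors, patience=3):
    # Stop when the selection error has failed to improve for `patience`
    # consecutive epochs; return the best epoch and its error.
    best_err = float("inf")
    best_epoch = 0
    for epoch, err in enumerate(selection_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_err

# Synthetic curve: selection error falls, then rises as over-learning begins.
sel = [0.50, 0.35, 0.25, 0.20, 0.18, 0.19, 0.22, 0.27, 0.33]
print(early_stop(sel))  # identifies epoch 4, the minimum, and stops soon after
```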

The problems associated with local minima, and decisions over the size of network to use, imply that using a neural network
typically involves experimenting with a large number of different networks, probably training each one a number of times
(to avoid being fooled by local minima), and observing individual performances. The key guide to performance here is the
selection error. However, following the standard scientific precept that, all else being equal, a simple model is always
preferable to a complex model, you can also select a smaller network in preference to a larger one with a negligible
improvement in selection error.

A problem with this approach of repeated experimentation is that the selection set plays a key role in selecting the model,
which means that it is actually part of the training process. Its reliability as an independent guide to performance of the
model is therefore compromised - with sufficient experiments, you may just hit upon a lucky network that happens to
perform well on the selection set. To add confidence in the performance of the final model, it is therefore normal practice (at
least where the volume of training data allows it) to reserve a third set of cases - the test set. The final model is tested with
the test set data, to ensure that the results on the selection and training set are real, and not artifacts of the training process.
Of course, to fulfill this role properly the test set should be used only once - if it is in turn used to adjust and reiterate the
training process, it effectively becomes selection data!

This division into multiple subsets is very unfortunate, given that we usually have less data than we would ideally desire
even for a single subset. We can get around this problem by resampling. Experiments can be conducted using different
divisions of the available data into training, selection, and test sets. There are a number of approaches to this subset
resampling in STATISTICA Neural Networks, including random (Monte Carlo) resampling, cross-validation, and bootstrap. If
we make design decisions, such as the best configuration of neural network to use, based upon a number of experiments with
different subset samples, the results will be much more reliable. We can then either use those experiments solely to guide the
decision as to which network types to use, and train such networks from scratch with new samples (this removes any
sampling bias); or, we can retain the best networks found during the sampling process, but average their results in an
ensemble, which at least mitigates the sampling bias.
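The subset divisions themselves are straightforward to sketch (illustrative index-splitting only; the fractions and fold count are arbitrary choices):

```python
import random

def monte_carlo_split(n_cases, fractions=(0.5, 0.25, 0.25), seed=None):
    # Random (Monte Carlo) resampling: shuffle case indices and cut them
    # into training, selection, and test subsets.
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n_cases)
    n_sel = int(fractions[1] * n_cases)
    return idx[:n_train], idx[n_train:n_train + n_sel], idx[n_train + n_sel:]

def cross_validation_folds(n_cases, k=5):
    # k-fold cross-validation: each case serves in the held-out fold once.
    idx = list(range(n_cases))
    return [(
        [i for i in idx if i % k != fold],   # cases used for fitting
        [i for i in idx if i % k == fold],   # held-out cases
    ) for fold in range(k)]

train, sel, test = monte_carlo_split(100, seed=0)
print(len(train), len(sel), len(test))       # 50 25 25
folds = cross_validation_folds(100, k=5)
print(len(folds), len(folds[0][1]))          # 5 20
```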

To summarize, network design (once the input variables have been selected) follows a number of stages:

- Select an initial configuration (typically, one hidden layer with the number of hidden units set to half the sum of the
number of input and output units; STATISTICA Neural Networks will default to this configuration).

- Iteratively conduct a number of experiments with each configuration, retaining the best network (in terms of selection
error) found. A number of experiments are required with each configuration to avoid being fooled if training locates a
local minimum, and it is also best to resample.

- On each experiment, if under-learning occurs (the network doesn't achieve an acceptable performance level) try adding
more neurons to the hidden layer(s). If this doesn't help, try adding an extra hidden layer.

- If over-learning occurs (selection error starts to rise) try removing hidden units (and possibly layers).

- Once you have experimentally determined an effective configuration for your networks, resample and generate new
networks with that configuration.

Since repeated heuristic experimentation is at best tedious, STATISTICA Neural Networks includes automatic search
algorithms to perform this process for you. The Intelligent Problem Solver experiments with different numbers of hidden
units (and indeed combinations of input variables), performs a number of training runs with each network architecture tested,
selecting a sample of the best networks on the basis of selection error. The Intelligent Problem Solver's sophisticated search
algorithms, using advanced concepts such as regularization and sensitivity analysis, can test hundreds of combinations of
networks, concentrating on particularly promising network architectures, and can also find a rough-and-ready solution quite
quickly. Thus, STATISTICA Neural Networks removes much of the pain of continuous experimentation from you.

See also, Multilayer Perceptrons - Part II. See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Multilayer Perceptrons - Part II


Data Selection

All the stages rely on a key assumption. Specifically, the training, selection, and test data must be representative of the
underlying model (and, further, the three sets must be independently representative). The old computer science adage
garbage in, garbage out could not apply more strongly than in neural modeling. If training data is not representative, then the
model is at best compromised. At worst, it may be useless. It is worth spelling out the kind of problems that can corrupt a
training set:

The future is not the past. Training data is typically historical. If circumstances have changed, relationships that held in the
past may no longer hold.

All eventualities must be covered. A neural network can only learn from cases that are present. If people with incomes
over $100,000 per year are a bad credit risk, and your training data includes nobody over $40,000, you cannot expect it to
make a correct decision when it encounters one of the previously-unseen cases. Extrapolation is dangerous with any model,
but some types of neural network may make particularly poor predictions in such circumstances.

A network learns the easiest features it can. A classic (possibly apocryphal) illustration of this is a vision project designed
to automatically recognize tanks. A network is trained on a hundred pictures including tanks, and a hundred not. It achieves
a perfect 100% score. When tested on new data, it proves hopeless. The reason? The pictures of tanks are taken on dark,
rainy days; the pictures without tanks on sunny days. The network learns to distinguish the (trivial matter of) differences in overall
light intensity. To work, the network would need training cases including all weather and lighting conditions under which it
is expected to operate - not to mention all types of terrain, angles of shot, distances...

Unbalanced data sets. Since a network minimizes an overall error, the proportion of types of data in the set is critical. A
network trained on a data set with 900 good cases and 100 bad will bias its decision towards good cases, as this allows the
algorithm to lower the overall error (which is much more heavily influenced by the good cases). If the representation of good
and bad cases is different in the real population, the network's decisions may be wrong. A good example would be disease
diagnosis. Perhaps 90% of patients routinely tested are clear of a disease. A network is trained on an available data set with a
90/10 split. It is then used in diagnosis on patients complaining of specific problems, where the likelihood of disease is
50/50. The network will react over-cautiously and fail to recognize disease in some unhealthy patients. In contrast, if trained
on the complainants, and then tested on routine data, the network may raise a high number of false positives. In such
circumstances, the data set may need to be crafted to take account of the distribution of data (e.g., you could replicate the
less numerous cases, or remove some of the numerous cases), or the network's decisions modified by the inclusion of a loss
matrix (Bishop, 1995). Often, the best approach is to ensure even representation of different cases, then to interpret the
network's decisions accordingly.
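Replicating the less numerous cases, as suggested above, can be sketched as follows (a hypothetical helper, not a STATISTICA function; sampling with replacement is an assumption here):

```python
import random

def replicate_minority(cases, labels, seed=0):
    # Rebalance by replicating the less numerous class (sampling with
    # replacement) until all classes are equally represented.
    rng = random.Random(seed)
    by_class = {}
    for case, label in zip(cases, labels):
        by_class.setdefault(label, []).append(case)
    target = max(len(members) for members in by_class.values())
    out_cases, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for case in members + extra:
            out_cases.append(case)
            out_labels.append(label)
    return out_cases, out_labels

# 9 "good" cases to 1 "bad", the 900/100 example scaled down.
cases = list(range(10))
labels = ["good"] * 9 + ["bad"]
balanced, bal_labels = replicate_minority(cases, labels)
print(bal_labels.count("good"), bal_labels.count("bad"))  # 9 9
```

If even representation is enforced this way, remember (as the text notes) to interpret the trained network's decisions in light of the true class proportions.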

Insights into MLP Training

More key insights into MLP behavior and training can be gained by considering the type of functions they model. Recall that
the activation level of a unit is the weighted sum of the inputs, plus a threshold value. This implies that the activation level is
actually a simple linear function of the inputs. The activation is then passed through a sigmoid (S-shaped) curve. The
combination of the multi-dimensional linear function and the one-dimensional sigmoid function gives the characteristic
sigmoid cliff response of a first hidden layer MLP unit (the figure below illustrates the shape plotted across two inputs. An
MLP unit with more inputs has a higher-dimensional version of this functional shape). Altering the weights and thresholds
alters this response surface. In particular, both the orientation of the surface, and the steepness of the sloped section, can be
altered. A steep slope corresponds to large weight values: doubling all weight values gives the same orientation but a
different slope.
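The claim that doubling all weight values gives the same orientation but a different slope can be checked directly (an illustrative single unit with arbitrary weights):

```python
import math

def unit_output(x1, x2, w1, w2, theta):
    # Sigmoid-cliff response of a single first-layer unit across two inputs:
    # a linear activation passed through the sigmoid.
    activation = w1 * x1 + w2 * x2 - theta
    return 1.0 / (1.0 + math.exp(-activation))

# The cliff edge lies where the activation is zero (output 0.5).
print(unit_output(0.5, 0.5, 1.0, 1.0, 1.0))   # on the cliff edge: 0.5
print(unit_output(0.5, 0.5, 2.0, 2.0, 2.0))   # doubled weights: still 0.5
# Off the edge, doubled weights give a steeper (more committed) response.
print(unit_output(1.0, 1.0, 1.0, 1.0, 1.0))
print(unit_output(1.0, 1.0, 2.0, 2.0, 2.0))
```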

A multi-layered network combines a number of these response surfaces together, through repeated linear combination and
non-linear activation functions. The next figure illustrates a typical response surface for a network with only one hidden
layer, of two units, and a single output unit, on the classic XOR problem. Two separate sigmoid surfaces have been
combined into a single U-shaped surface.

During network training, the weights and thresholds are first initialized to small, random values. This implies that the units'
response surfaces are each aligned randomly with low slope: they are effectively uncommitted. As training progresses, the
units' response surfaces are rotated and shifted into appropriate positions, and the magnitudes of the weights grow as they
commit to modeling particular parts of the target response surface.

In a classification problem, an output unit's task is to output a strong signal if a case belongs to its class, and a weak signal if
it doesn't. In other words, it is attempting to model a function that has magnitude one for parts of the pattern-space that
contain its cases, and magnitude zero for other parts.

This is known as a discriminant function in pattern recognition problems. An ideal discriminant function could be said to
have a plateau structure, where all points on the function are either at height zero or height one.

If there are no hidden units, then the output can only model a single sigmoid-cliff, with areas on one side at low height and
areas on the other at high height. There will always be a region in the middle (on the cliff) where the height is in-between, but as
weight magnitudes are increased, this area shrinks.

A sigmoid-cliff like this is effectively a linear discriminant. Points to one side of the cliff are classified as belonging to the
class, points to the other as not belonging to it. This implies that a network with no hidden layers can only classify linearly-
separable problems (those where a line - or, more generally in higher dimensions, a hyperplane - can be drawn which
separates the points in pattern space).

A network with a single hidden layer has a number of sigmoid-cliffs (one per hidden unit) represented in that hidden layer,
and these are in turn combined into a plateau in the output layer. The plateau is convex (i.e., there are no dents in it,
and no holes inside it), although it may extend to infinity in some directions (like an extended
peninsula). Such a network is in practice capable of modeling adequately most real-world classification problems. The
figure above shows the plateau response surface developed by an MLP to solve the XOR problem: as can be seen, this neatly
sections the space along a diagonal.

A network with two hidden layers has a number of plateaus combined together - the number of plateaus corresponds to the
number of units in the second layer, and the number of sides on each plateau corresponds to the number of units in the first
hidden layer. A little thought shows that you can represent any shape (including concavities and holes) using a sufficiently
large number of such plateaus.

A consequence of these observations is that an MLP with two hidden layers is theoretically sufficient to model any problem
(there is a more formal proof, the Kolmogorov Theorem). This does not necessarily imply that a network with more layers
might not more conveniently or easily model a particular problem. In practice, however, most problems seem to yield to a
single hidden layer, with two an occasional resort and three practically unknown.

A key question in classification is how to interpret points on or near the cliff. The standard practice is to adopt some
confidence levels (the accept and reject thresholds) that must be exceeded before the unit is deemed to have made a decision.
For example, if accept/reject thresholds of 0.95/0.05 are used, an output unit with an output level in excess of 0.95 is deemed
to be on, below 0.05 it is deemed to be off, and in between it is deemed to be undecided.
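The accept/reject interpretation amounts to a simple rule (a sketch using the 0.95/0.05 thresholds from the example):

```python
def classify(output, accept=0.95, reject=0.05):
    # Interpret an output unit's level via accept/reject thresholds:
    # above accept the unit is "on", below reject it is "off",
    # and anywhere in between the unit is deemed undecided.
    if output > accept:
        return "on"
    if output < reject:
        return "off"
    return "undecided"

print(classify(0.97), classify(0.02), classify(0.60))
```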

A more subtle (and perhaps more useful) interpretation is to treat the network outputs as probabilities. In this case, the
network gives more information than simply a decision: it tells us how sure (in a formal sense) it is of that decision. There
are modifications to MLPs (supported by STATISTICA Neural Networks) that allow the neural network outputs to be
interpreted as probabilities, which means that the network effectively learns to model the probability density function of the
class. However, the probabilistic interpretation is only valid under certain assumptions about the distribution of the data
(specifically, that it is drawn from the family of exponential distributions; see Bishop). Ultimately, a classification decision
must still be made, but a probabilistic interpretation allows a more formal concept of minimum cost decision making to be
evolved.

Other MLP Training Algorithms

Earlier in this section, we discussed how the back propagation algorithm performs gradient descent on the error surface.
Speaking loosely, it calculates the direction of steepest descent on the surface, and jumps down the surface a distance
proportional to the learning rate and the slope, picking up momentum as it maintains a consistent direction. As an analogy, it
behaves like a blindfolded kangaroo hopping in the most obvious direction. Actually, the descent is calculated independently
on the error surface for each training case, and in random order, but this is actually a good approximation to descent on the
composite error surface. Other MLP training algorithms work differently, but all use a strategy designed to travel towards a
minimum as quickly as possible.

More sophisticated techniques for non-linear function optimization have been in use for some time. STATISTICA Neural
Networks includes three of these: conjugate gradient descent, quasi-Newton, and Levenberg-Marquardt (see Bishop, 1995;
Shepherd, 1997), which are very successful forms of two types of algorithm: line search and model-trust region approaches.
They are collectively known as second order training algorithms.

A line search algorithm works as follows: pick a sensible direction to move in the multi-dimensional landscape. Then project
a line in that direction, locate the minimum along that line (it is relatively trivial to locate a minimum along a line, by using
some form of bisection algorithm), and repeat. What is a sensible direction in this context? An obvious choice is the
direction of steepest descent (the same direction that would be chosen by back propagation). Actually, this intuitively
obvious choice proves to be rather poor. Having minimized along one direction, the next line of steepest descent may spoil
the minimization along the initial direction (even on a simple surface like a parabola a large number of line searches may be
necessary). A better approach is to select conjugate or non-interfering directions - hence conjugate gradient descent (Bishop,
1995).

The idea here is that, once the algorithm has minimized along a particular direction, the second derivative along that
direction should be kept at zero. Conjugate directions are selected to maintain this zero second derivative on the assumption
that the surface is parabolic (speaking roughly, a nice smooth surface). If this condition holds, N epochs are sufficient to
reach a minimum. In reality, on a complex error surface the conjugacy deteriorates, but the algorithm still typically requires
far fewer epochs than back propagation, and also converges to a better minimum (to settle down thoroughly, back propagation
must be run with an extremely low learning rate).
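On a truly parabolic surface, the conjugate-direction property can be demonstrated with the standard linear conjugate gradient iteration (a sketch on a 3-dimensional quadratic, not STATISTICA's implementation; here N = 3 steps suffice to reach the minimum):

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_steps):
    # Linear conjugate gradient on the quadratic surface
    # E(x) = 0.5 x^T A x - b^T x. Directions are chosen to be mutually
    # conjugate (non-interfering), so on an N-dimensional parabolic
    # surface N exact line minimizations reach the minimum.
    x = x0.astype(float)
    r = b - A @ x                       # negative gradient (steepest descent)
    d = r.copy()
    for _ in range(n_steps):
        alpha = (r @ r) / (d @ A @ d)   # exact minimization along the line
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d            # next conjugate direction
        r = r_new
    return x

# A 3-dimensional quadratic bowl: the minimum satisfies A x = b.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b, np.zeros(3), n_steps=3)
print(np.round(x, 6))  # matches the direct solution of A x = b
```

On a real network's error surface the conjugacy deteriorates, as the text notes, so more than N epochs are typically needed in practice.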

Quasi-Newton training is based on the observation that the direction pointing directly towards the minimum on a quadratic
surface is the so-called Newton direction. This is very expensive to calculate analytically, but quasi-Newton iteratively
builds up a good approximation to it. Quasi-Newton is usually a little faster than conjugate gradient descent, but has
substantially larger memory requirements and is occasionally numerically unstable.

A model-trust region approach works as follows: instead of following a search direction, assume that the surface is a simple
shape such that the minimum can be located (and jumped to) directly - if the assumption is true. Try the model out and see
how good the suggested point is. The model typically assumes that the surface is a nice well-behaved shape (e.g., a
parabola), which will be true if sufficiently close to a minimum. Elsewhere, the assumption may be grossly violated, and the
model could choose wildly inappropriate points to move to. The model can only be trusted within a region of the current
point, and the size of this region isn't known. Therefore, choose new points to test as a compromise between that suggested
by the model and that suggested by a standard gradient-descent jump. If the new point is good, move to it, and strengthen the
role of the model in selecting a new point; if it is bad, don't move, and strengthen the role of the gradient descent step in
selecting a new point (and make the step smaller). Levenberg-Marquardt uses a model that assumes that the underlying
function is locally linear (and therefore has a parabolic error surface).

Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963; Bishop, 1995) is typically the fastest of the training algorithms
supported in STATISTICA Neural Networks, although unfortunately it has some important limitations, specifically: it can
only be used on single output networks, can only be used with the sum-squared error function, and has memory requirements
proportional to W² (where W is the number of weights in the network; this makes it impractical for reasonably big
networks). Conjugate gradient descent is nearly as good, and doesn't suffer from these restrictions.

Back propagation can still be useful, not least in providing a quick (if not overwhelmingly accurate) solution. It is also a
good choice if the data set is very large, and contains a great deal of redundant data. Back propagation's case-by-case error
adjustment means that data redundancy does it no harm (for example, if you double the data set size by replicating every
case, each epoch will take twice as long, but have the same effect as two of the old epochs, so there is no loss). In contrast,
Levenberg-Marquardt, quasi-Newton, and conjugate gradient descent all perform calculations using the entire data set, so
increasing the number of cases can significantly slow each epoch, but does not necessarily improve performance on that
epoch (not if data is redundant; if data is sparse, then adding data will make each epoch better). Back propagation can also
be equally good if the data set is very small, for there is then insufficient information to make a highly fine-tuned solution
appropriate (a more advanced algorithm may achieve a lower training error, but the selection error is unlikely to improve in
the same way). Finally, the second order training algorithms seem to be very prone to stick in local minima in the early
phases - for this reason, we recommend the practice of starting with a short burst of back propagation, before switching to a
second order algorithm.

STATISTICA Neural Networks also includes two variations on back propagation (quick propagation, Fahlman, 1988, and
Delta-bar-Delta, Jacobs, 1988) that are designed to deal with some of the limitations of this technique. In most cases, they
are not significantly better than back propagation, and sometimes they are worse (relative performance is application-
dependent). They also require more control parameters than any of the other algorithms, which makes them more difficult to
use, so they are not described in further detail in this section.

See the Neural Networks Overviews Index for other SNN Overviews.


Neural Networks Introductory Overview - Radial Basis Function Networks
We have seen in the last section how an MLP models the response function using the composition of sigmoid-cliff functions
- for a classification problem, this corresponds to dividing the pattern space up using hyperplanes. The use of hyperplanes to
divide up space is a natural approach - intuitively appealing, and based on the fundamental simplicity of lines.

An equally appealing and intuitive approach is to divide up space using circles or (more generally) hyperspheres. A
hypersphere is characterized by its center and radius. More generally, just as an MLP unit responds (non-linearly) to the
distance of points from the line of the sigmoid-cliff, in a radial basis function network (Broomhead and Lowe, 1988; Moody
and Darken, 1989; Haykin, 1994) units respond (non-linearly) to the distance of points from the center represented by the
radial unit. The response surface of a single radial unit is therefore a Gaussian (bell-shaped) function, peaked at the center,
and descending outwards. Just as the steepness of the MLPs sigmoid curves can be altered, so can the slope of the radial
unit's Gaussian. See the next illustration below.
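The response of a single radial unit can be sketched directly. The exp(-r²/(2σ²)) Gaussian form below is the standard textbook choice, assumed here since this overview does not quote the exact formula used by the package:

```python
import math

def radial_response(x, center, deviation):
    """Gaussian response of a single radial unit: peaks at 1.0 when the
    input coincides with the center, and falls off with squared distance.
    The exp(-r^2 / (2*sigma^2)) form is an assumption (standard textbook choice)."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-r2 / (2.0 * deviation ** 2))

# Response is maximal at the center and decays symmetrically around it.
print(radial_response([0.0, 0.0], [0.0, 0.0], 1.0))  # 1.0
print(radial_response([1.0, 0.0], [0.0, 0.0], 1.0))  # ~0.61
```

A larger deviation flattens the bell (the "slope" alteration mentioned above); a smaller one makes it spikier.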

MLP units are defined by their weights and threshold, which together give the equation of the defining line, and the rate of
fall-off of the function from that line. Before application of the sigmoid activation function, the activation level of the unit is
determined using a weighted sum, which mathematically is the dot product of the input vector and the weight vector of the
unit; in STATISTICA Neural Networks, these units are therefore referred to as dot product units. In contrast, a radial unit is
defined by its center point and radius. A point in N dimensional space is defined using N numbers, which exactly
corresponds to the number of weights in a dot product unit, so the center of a radial unit is stored in STATISTICA Neural
Networks as weights. The radius (or deviation) value is stored as the threshold. It is worth emphasizing that the weights and
thresholds in a radial unit are actually entirely different from those in a dot product unit, and the terminology is misleading if
you don't remember this: Radial weights really form a point, and a radial threshold is really a deviation.

A radial basis function network (RBF), therefore, has a hidden layer of radial units, each actually modeling a Gaussian
response surface. Since these functions are nonlinear, it is not actually necessary to have more than one hidden layer to
model any shape of function: sufficient radial units will always be enough to model any function. The remaining question is
how to combine the hidden radial unit outputs into the network outputs. It turns out to be quite sufficient to use a linear
combination of these outputs (i.e., a weighted sum of the Gaussians) to model any non-linear function. The standard RBF
therefore has an output layer containing dot product units with identity activation function (see Haykin, 1994; Bishop, 1995).

RBF networks have a number of advantages over MLPs. First, as previously stated, they can model any nonlinear function
using a single hidden layer, which removes some design decisions about the number of layers. Second, the simple linear
transformation in the output layer can be optimized fully using traditional linear modeling techniques, which are fast and do
not suffer from problems such as local minima which plague MLP training techniques. RBF networks can therefore be
trained extremely quickly (i.e., orders of magnitude faster than MLPs).

On the other hand, before linear optimization can be applied to the output layer of an RBF network, the number of radial
units must be decided, and then their centers and deviations must be set. Although faster than MLP training, the algorithms
to do this are equally prone to discover sub-optimal combinations. In compensation, the STATISTICA Neural Networks
Intelligent Problem Solver can perform the inevitable experimental stage for you.

Other features that distinguish RBF performance from MLPs are due to the differing approaches to modeling space, with
RBFs "clumpy" and MLPs "planey."

Experience indicates that the RBF's more eccentric response surface requires a lot more units to adequately model most
functions. Of course, it is always possible to draw shapes that are most easily represented one way or the other, but the
balance does not favor RBFs. Consequently, an RBF solution will tend to be slower to execute and more space consuming
than the corresponding MLP (but it was much faster to train, which is sometimes more of a constraint).

The clumpy approach also implies that RBFs are not inclined to extrapolate beyond known data: the response drops off
rapidly towards zero if data points far from the training data are used. Often the RBF output layer optimization will have set
a bias level, hopefully more or less equal to the mean output level, so in fact the extrapolated output is the observed mean - a


reasonable working assumption. In contrast, an MLP becomes more certain in its response when far-flung data is used.
Whether this is an advantage or disadvantage depends largely on the application, but on the whole the MLP's uncritical
extrapolation is regarded as a bad point: extrapolation far from training data is usually dangerous and unjustified.

RBFs are also more sensitive to the curse of dimensionality, and have greater difficulties if the number of input units is
large: this problem is discussed further in a later section.

As mentioned earlier, training of RBFs takes place in distinct stages. First, the centers and deviations of the radial units must
be set; then the linear output layer is optimized.

Centers should be assigned to reflect the natural clustering of the data. The two most common methods are:

Sub-sampling. Randomly chosen training points are copied to the radial units. Since they are randomly selected, they will
represent the distribution of the training data in a statistical sense. However, if the number of radial units is not large, the
radial units may actually be a poor representation (Haykin, 1994).

K-Means algorithm. This algorithm (Bishop, 1995) tries to select an optimal set of points that are placed at the centroids of
clusters of training data. Given K radial units, it adjusts the positions of the centers so that:

- Each training point belongs to a cluster center, and is nearer to this center than to any other center;

- Each cluster center is the centroid of the training points that belong to it.
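These two conditions are exactly the fixed point of the standard K-means iteration. A minimal dependency-free sketch (plain Lloyd's algorithm - not necessarily the exact variant implemented in the package):

```python
def k_means(points, centers, iterations=20):
    """Alternate the two steps until the centers stabilize:
    1. assign each point to its nearest center;
    2. move each center to the centroid of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            clusters[nearest].append(p)
        centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[k]
            for k, c in enumerate(clusters)
        ]
    return centers

# Two obvious clusters, around (0, 0) and (10, 10); the two centers
# settle at the cluster centroids.
data = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
print(sorted(k_means(data, [[0, 0], [5, 5]])))
```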

Once centers are assigned, deviations are set. The size of the deviation (also known as a smoothing factor) determines how
spiky the Gaussian functions are. If the Gaussians are too spiky, the network will not interpolate between known points, and
the network loses the ability to generalize. If the Gaussians are very broad, the network loses fine detail. This is actually
another manifestation of the over/under-fitting dilemma. Deviations should typically be chosen so that Gaussians overlap
with a few nearby centers. Methods available are:

Explicit. Choose the deviation yourself.

Isotropic. The deviation (same for all units) is selected heuristically to reflect the number of centers and the volume of space
they occupy (Haykin, 1994).

K-Nearest Neighbor. Each unit's deviation is individually set to the mean distance to its K nearest neighbors (Bishop,
1995). Hence, deviations are smaller in tightly packed areas of space, preserving detail, and higher in sparse areas of space
(interpolating where necessary).
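The K-Nearest Neighbor rule can be sketched as follows; averaging the plain Euclidean distances to the K nearest fellow centers is an assumption about the details:

```python
import math

def knn_deviations(centers, k=2):
    """For each center, average the distances to its k nearest other
    centers; tightly packed centers get small deviations (preserving
    detail), isolated centers get large ones (interpolating)."""
    devs = []
    for i, c in enumerate(centers):
        dists = sorted(
            math.dist(c, other) for j, other in enumerate(centers) if j != i
        )
        devs.append(sum(dists[:k]) / k)
    return devs

# The isolated center at x=10 receives a much larger deviation.
print(knn_deviations([[0.0], [1.0], [2.0], [10.0]], k=2))  # [1.5, 1.0, 1.5, 8.5]
```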

Once centers and deviations have been set, the output layer can be optimized using the standard linear optimization
technique: the pseudo-inverse (singular value decomposition) algorithm (Haykin, 1994; Golub and Kahan, 1965).
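As a sketch of this final stage: with the radial activations fixed, the output weights solve an ordinary linear least-squares problem. The code below solves the equivalent normal equations with a small Gaussian elimination purely to stay dependency-free; the SVD-based pseudo-inverse the package uses is preferred in practice for numerical stability:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_output_layer(H, t):
    """Least-squares output weights: solve (H^T H) w = H^T t, where H
    holds the hidden (radial) activations, one row per training case."""
    n = len(H[0])
    HtH = [[sum(row[i] * row[j] for row in H) for j in range(n)] for i in range(n)]
    Htt = [sum(row[i] * ti for row, ti in zip(H, t)) for i in range(n)]
    return solve(HtH, Htt)

# Hidden activations (with a constant 1.0 bias column) and targets t = 2*h + 1.
H = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
print(fit_output_layer(H, [1.0, 3.0, 5.0]))  # ~[2.0, 1.0]
```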

However, RBFs as described above suffer similar problems to Multilayer Perceptrons if they are used for classification - the
output of the network is a measure of distance from a decision hyperplane, rather than a probabilistic confidence level. We
may therefore choose to modify the RBF by including an output layer with logistic or softmax (normalized exponential)
outputs, which is capable of probability estimation. We lose the advantage of fast linear optimization of the output layer;
however, the non-linear output layer still has a relatively well-behaved error surface, and can be optimized quite quickly
using a fast iterative algorithm such as conjugate gradient descent.

Radial basis functions can also be hybridized in a number of ways. The radial layer (the hidden layer) can be trained using
the Kohonen and Learning Vector Quantization training algorithms, which are alternative methods of assigning centers to
reflect the spread of data, and the output layer (whether linear or otherwise) can be trained using any of the iterative dot
product algorithms in STATISTICA Neural Networks.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Probabilistic Neural Networks
Elsewhere, we briefly mentioned that, in the context of classification problems, a useful interpretation of network outputs
was as estimates of probability of class membership, in which case the network was actually learning to estimate a
probability density function (p.d.f.). A similar useful interpretation can be made in regression problems if the output of the
network is regarded as the expected value of the model at a given point in input-space. This expected value is related to the
joint probability density function of the output and inputs.

Estimating probability density functions from data has a long statistical history (Parzen, 1962), and in this context fits into


the area of Bayesian statistics. Conventional statistics can, given a known model, inform us what the chances of certain
outcomes are (e.g., we know that an unbiased die has a 1/6th chance of coming up with a six). Bayesian statistics turns this
situation on its head, by estimating the validity of a model given certain data. More generally, Bayesian statistics can
estimate the probability density of model parameters given the available data. To minimize error, the model is then selected
whose parameters maximize this p.d.f.

In the context of a classification problem, if we can construct estimates of the p.d.f.s of the possible classes, we can compare
the probabilities of the various classes, and select the most-probable. This is effectively what we ask a neural network to do
when it learns a classification problem - the network attempts to learn (an approximation to) the p.d.f.

A more traditional approach is to construct an estimate of the p.d.f. from the data. The most traditional technique is to
assume a certain form for the p.d.f. (typically, that it is a normal distribution), and then to estimate the model parameters.
The normal distribution is commonly used as the model parameters (mean and standard deviation) can be estimated using
analytical techniques. The problem is that the assumption of normality is often not justified.

An alternative approach to p.d.f. estimation is kernel-based approximation (see Parzen, 1962; Specht, 1990; Specht, 1991;
Bishop, 1995; Patterson, 1996). We can reason loosely that the presence of a particular case indicates some probability density
at that point: a cluster of cases close together indicates an area of high probability density. Close to a case, we can have high
confidence in some probability density, with a lesser and diminishing level as we move away. In kernel-based estimation,
simple functions are located at each available case, and added together to estimate the overall p.d.f. Typically, the kernel
functions are each Gaussians (bell-shapes). If sufficient training points are available, this will indeed yield an arbitrarily
good approximation to the true p.d.f.

This kernel-based approach to p.d.f. approximation is very similar to radial basis function networks, and motivates the
probabilistic neural network (PNN) and generalized regression neural network (GRNN), both devised by Specht (1990 and
1991). PNNs are designed for classification tasks, and GRNNs for regression. These two types of network are really kernel-
based approximation methods cast in the form of neural networks.

In the PNN, there are at least three layers: input, radial, and output layers. The radial units are copied directly from the
training data, one per case. Each models a Gaussian function centered at the training case. There is one output unit per class.
Each is connected to all the radial units belonging to its class, with zero connections from all other radial units. Hence, the
output units simply add up the responses of the units belonging to their own class. The outputs are each proportional to the
kernel-based estimates of the p.d.f.s of the various classes, and by normalizing these to sum to 1.0 estimates of class
probability are produced.
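The PNN computation described above reduces to a few lines: one Gaussian kernel per training case, summed into the output unit of its class, then normalized. The kernel form and a single shared smoothing factor are assumptions:

```python
import math

def pnn_classify(train, x, smoothing=1.0):
    """train: list of (input_vector, class_label) pairs.
    Sums a Gaussian kernel (one per training case) into its class's
    output unit, then normalizes the class sums into probabilities."""
    sums = {}
    for vec, label in train:
        r2 = sum((a - b) ** 2 for a, b in zip(vec, x))
        sums[label] = sums.get(label, 0.0) + math.exp(-r2 / (2 * smoothing ** 2))
    total = sum(sums.values())
    return {label: s / total for label, s in sums.items()}

train = [([0.0], "A"), ([1.0], "A"), ([5.0], "B"), ([6.0], "B")]
probs = pnn_classify(train, [0.5])
print(max(probs, key=probs.get))  # A
```

Note that "training" is nothing more than storing the cases, which is why PNN training is nearly instantaneous.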

The basic PNN can be modified in two ways:

First, the basic approach assumes that the proportional representation of classes in the training data matches the actual
representation in the population being modeled (the so-called prior probabilities). For example, in a disease-diagnosis
network, if 2% of the population has the disease, then 2% of the training cases should be positives. If the prior probability is
different from the level of representation in the training cases, then the network's estimate will be invalid. To compensate for
this, prior probabilities can be given (if known), and the class weightings are adjusted to compensate.

Second, any network making estimates based on a noisy function will inevitably produce some misclassifications (there may
be disease victims whose tests come out normal, for example). However, some forms of misclassification may be regarded
as more expensive mistakes than others (for example, diagnosing somebody healthy as having a disease, which simply leads
to exploratory surgery, may be inconvenient but not life-threatening; whereas failing to spot somebody who is suffering from
disease may lead to premature death). In this case, the raw probabilities generated by the network can be weighted by loss
factors, which reflect the costs of misclassification. In STATISTICA Neural Networks, a fourth layer can be specified in
PNNs, which includes a loss matrix. This is multiplied by the probability estimates in the third layer, and the class with
lowest estimated cost is selected. (Loss matrices can also be attached to other types of classification network.)
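The loss-matrix layer can be sketched as a matrix-vector product followed by picking the cheapest class; the disease-diagnosis loss values below are purely illustrative:

```python
def min_cost_class(probs, loss):
    """probs[i]: estimated probability of class i.
    loss[i][j]: cost of deciding class j when the true class is i.
    Expected cost of deciding j is sum_i probs[i] * loss[i][j];
    the class with the lowest expected cost is selected."""
    n = len(probs)
    costs = [sum(probs[i] * loss[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: costs[j])

# Classes: 0 = healthy, 1 = diseased.  Missing a diseased case (true 1,
# decided 0) costs 10; a false alarm (true 0, decided 1) costs 1.
loss = [[0, 1],
        [10, 0]]
# Even with only 20% estimated probability of disease, "diseased" is the
# cheaper decision: expected costs are 0.2*10 = 2.0 versus 0.8*1 = 0.8.
print(min_cost_class([0.8, 0.2], loss))  # 1
```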

The only control factor that needs to be selected for probabilistic neural network training is the smoothing factor (i.e., the
radial deviation of the Gaussian functions). As with RBF networks, this factor needs to be selected to cause a reasonable
amount of overlap - too small deviations cause a very spiky approximation that cannot generalize, too large deviations
smooth out detail. An appropriate figure is easily chosen by experiment, by selecting a number that produces a low selection
error, and fortunately PNNs are not too sensitive to the precise choice of smoothing factor.

The greatest advantages of PNNs are the fact that the output is probabilistic (which makes interpretation of output easy), and
the training speed. Training a PNN actually consists mostly of copying training cases into the network, and so is as close to
instantaneous as can be expected.

The greatest disadvantage is network size: a PNN network actually contains the entire set of training cases, and is therefore
space-consuming and slow to execute.

PNNs are particularly useful for prototyping experiments (for example, when deciding which input parameters to use), as the
short training time allows a great number of tests to be conducted in a short period of time. STATISTICA Neural Networks


itself uses PNNs in its Feature Selection algorithms, which automatically search for useful inputs.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Generalized Regression Neural Networks
Generalized Regression Neural Networks (GRNNs) work in a similar fashion to PNNs, but perform regression rather than
classification tasks (see Specht, 1991; Patterson, 1996; Bishop, 1995). As with the PNN, Gaussian kernel functions are
located at each training case. Each training case can be regarded as evidence that the response surface is a given height
at that point in input space, with progressively decaying evidence in the immediate vicinity. The GRNN copies the training
cases into the network to be used to estimate the response on new points. The output is estimated using a weighted average
of the outputs of the training cases, where each weighting is related to the distance of the training point from the point being estimated
(so that points nearby contribute most heavily to the estimate).

The first hidden layer in the GRNN contains the radial units. A second hidden layer contains units that help to estimate the
weighted average. This is a specialized procedure. Each output has a special unit assigned in this layer that forms the
weighted sum for the corresponding output. To get the weighted average from the weighted sum, the weighted sum must be
divided through by the sum of the weighting factors. A single special unit in the second layer calculates the latter value. The
output layer then performs the actual divisions (using special division units). Hence, the second hidden layer always has
exactly one more unit than the output layer. In regression problems, typically only a single output is estimated, and so the
second hidden layer usually has two units.
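Putting the layers together, the GRNN is a kernel-weighted average: the second-layer units accumulate the weighted sum and the sum of weights, and the output unit divides one by the other. A sketch, with the Gaussian kernel form assumed as before:

```python
import math

def grnn_predict(train, x, smoothing=1.0):
    """train: list of (input_vector, target) pairs.
    The numerator unit accumulates kernel * target, the extra
    denominator unit accumulates the kernels alone, and the output
    unit performs the final division."""
    num = den = 0.0
    for vec, target in train:
        r2 = sum((a - b) ** 2 for a, b in zip(vec, x))
        k = math.exp(-r2 / (2 * smoothing ** 2))
        num += k * target
        den += k
    return num / den

train = [([0.0], 1.0), ([1.0], 3.0)]
# Midway between the two cases the kernels are equal, so the prediction
# is simply the mean of the two targets (2.0).
print(grnn_predict(train, [0.5]))
```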

The GRNN can be modified by assigning radial units that represent clusters rather than each individual training case: this
reduces the size of the network and increases execution speed. Centers can be assigned using any appropriate algorithm (e.g.,
sub-sampling, K-means, or Kohonen), and STATISTICA Neural Networks adjusts the internal weightings accordingly.

GRNNs have advantages and disadvantages broadly similar to PNNs - the difference being that GRNNs can only be used for
regression problems, whereas PNNs are used for classification problems. A GRNN trains almost instantly, but tends to be
large and slow (although, unlike PNNs, it is not necessary to have one radial unit for each training case, the number still
needs to be large). Like an RBF network, a GRNN does not extrapolate.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Linear Networks
A general scientific principle is that a simple model should always be chosen in preference to a complex model if the latter
does not fit the data better. In terms of function approximation, the simplest model is the linear model, where the fitted
function is a hyperplane. In classification, the hyperplane is positioned to divide the two classes (a linear discriminant
function); in regression, it is positioned to pass through the data. A linear model is typically represented by a weight matrix
and a bias vector (for N inputs and M outputs, an MxN matrix and an Mx1 bias vector).

A neural network with no hidden layers, and an output with dot product synaptic function and identity activation function,
actually implements a linear model. The weights correspond to the matrix, and the thresholds to the bias vector. When the
network is executed, it effectively multiplies the input by the weights matrix then adds the bias vector.
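For example (a hypothetical 2-input, 2-output network):

```python
def linear_network(weights, bias, x):
    """Executes a linear network: one output per row of the weight
    matrix, output = W*x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# A 2-input, 2-output linear network (illustrative values).
W = [[1.0, 2.0],
     [0.0, -1.0]]
b = [0.5, 1.0]
print(linear_network(W, b, [3.0, 4.0]))  # [11.5, -3.0]
```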

In STATISTICA Neural Networks, you can define a linear network, and train it using the standard pseudo-inverse (SVD)
linear optimization algorithm (Golub and Kahan, 1965). Of course, linear optimization is widely available; however, the
STATISTICA Neural Networks linear network has the advantage of allowing you to compare performance with real neural
networks within a single environment.

The linear network provides a good benchmark against which to compare the performance of your neural networks. It is
quite possible that a problem that is thought to be highly complex can actually be solved just as well by linear techniques as
by neural networks. If you have only a small number of training cases, you are probably not justified in using a more
complex model anyway.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - SOFM Networks


Self Organizing Feature Map (SOFM, or Kohonen) networks are used quite differently from the other networks in
STATISTICA Neural Networks. Whereas all the other networks are designed for supervised learning tasks, SOFM networks
are designed primarily for unsupervised learning (see Kohonen, 1982; Haykin, 1994; Patterson, 1996; Fausett, 1994).

Whereas in supervised learning the training data set contains cases featuring input variables together with the associated
outputs (and the network must infer a mapping from the inputs to the outputs), in unsupervised learning the training data set
contains only input variables.

At first glance this may seem strange. Without outputs, what can the network learn? The answer is that the SOFM network
attempts to learn the structure of the data.

One possible use is therefore in exploratory data analysis. The SOFM network can learn to recognize clusters of data, and
can also relate similar classes to each other. The user can build up an understanding of the data, which is used to refine the
network. As classes of data are recognized they are labeled, so that the network becomes capable of classification tasks.
SOFM networks can also be used for classification when output classes are immediately available - the advantage in this
case is their ability to highlight similarities between classes.

A second possible use is in novelty detection. SOFM networks learn to recognize clusters in the training data, and respond to
them. If new data unlike any previous case is encountered, the network fails to recognize it, which indicates novelty.

A SOFM network has only two layers: the input layer, and an output layer of radial units (also known as the topological map
layer). The units in the topological map layer are laid out in space - typically in two dimensions (although STATISTICA
Neural Networks also supports one-dimensional SOFM networks).

SOFM networks are trained using an iterative algorithm. Starting with an initially random set of radial centers, the algorithm
gradually adjusts them to reflect the clustering of the training data. At one level, this compares with the sub-sampling and K-
means algorithms used to assign centers in RBF and GRNN networks, and indeed the SOFM algorithm can be used to assign
centers for these types of networks. However, the algorithm also acts on a different level.

The iterative training procedure also arranges the network so that units representing centers close together in the input space
are also situated close together on the topological map. You can think of the network's topological layer as a crude two-
dimensional grid, which must be folded and distorted into the N-dimensional input space, so as to preserve as far as possible
the original structure. Clearly any attempt to represent an N-dimensional space in two dimensions will result in loss of detail;
however, the technique can be worthwhile in allowing you to visualize data that might otherwise be impossible to
understand.

The basic iterative Kohonen algorithm simply runs through a number of epochs, on each epoch executing each training case
and applying the following algorithm:

Select the winning neuron (the one whose center is nearest to the input case);

Adjust the winning neuron to be more like the input case (a weighted sum of the old neuron center and the training case).

The algorithm uses a time-decaying learning rate, which is used to perform the weighted sum and ensures that the alterations
become subtler as the epochs pass. This ensures that the centers settle down to a compromise representation of the cases that
cause that neuron to win.

The topological ordering property is achieved by adding the concept of a neighborhood to the algorithm. The neighborhood
is a set of neurons surrounding the winning neuron. The neighborhood, like the learning rate, decays over time, so that
initially quite a large number of neurons belong to the neighborhood (perhaps almost the entire topological map); in the
latter stages the neighborhood will be zero (i.e., consists solely of the winning neuron itself). In the Kohonen algorithm, the
adjustment of neurons is actually applied to all the members of the current neighborhood, not just to the winning neuron.

The effect of this neighborhood update is that initially quite large areas of the network are dragged towards training cases -
and dragged quite substantially. The network develops a crude topological ordering, with similar cases activating clumps of
neurons in the topological map. As epochs pass the learning rate and neighborhood both decrease, so that finer distinctions
within areas of the map can be drawn, ultimately resulting in fine-tuning of individual neurons. Typically, training is
deliberately conducted in two distinct phases: a relatively short phase with high learning rates and neighborhood, and a long
phase with low learning rate and zero or near-zero neighborhood.
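The whole procedure, including the decaying learning rate and shrinking neighborhood, fits in a short sketch. This assumes a one-dimensional map and simple linear decay schedules; the package exposes these schedules as training parameters:

```python
import random

def train_sofm_1d(data, n_units, epochs=100, seed=0):
    """One-dimensional Kohonen map.  For each case: find the winning
    unit (nearest center), then pull the winner and its current
    neighborhood towards the case.  Learning rate and neighborhood
    radius both decay (here linearly, an assumed schedule) to zero."""
    rng = random.Random(seed)
    centers = [[rng.random() for _ in data[0]] for _ in range(n_units)]
    for epoch in range(epochs):
        rate = 0.5 * (1 - epoch / epochs)
        radius = int(round((n_units / 2) * (1 - epoch / epochs)))
        for case in data:
            winner = min(range(n_units),
                         key=lambda u: sum((c - x) ** 2
                                           for c, x in zip(centers[u], case)))
            for u in range(max(0, winner - radius),
                           min(n_units, winner + radius + 1)):
                centers[u] = [c + rate * (x - c)
                              for c, x in zip(centers[u], case)]
    return centers

# Two clusters of 1-D cases; after training, units at the ends of the
# map settle near the two clusters.
data = [[0.0], [0.1], [0.9], [1.0]]
centers = train_sofm_1d(data, 4)
print([round(c[0], 2) for c in centers])
```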

Once the network has been trained to recognize structure in the data, it can be used as a visualization tool to examine the
data. Win Frequencies (counts of the number of times each neuron wins when training cases are executed) can be examined
to see if distinct clusters have formed on the map. Individual cases are executed and the topological map observed, to see if
some meaning can be assigned to the clusters (this usually involves referring back to the original application area, so that the
relationship between clustered cases can be established). Once clusters are identified, neurons in the topological map are
labeled to indicate their meaning (sometimes individual cases may be labeled, too). Once the topological map has been built
up in this way, new cases can be submitted to the network. If the winning neuron has been labeled with a class name, the


network can perform classification. If not, the network is regarded as undecided.

SOFM networks also make use of an accept threshold, when performing classification. Since the activation level of a neuron
in a SOFM network is the distance of the neuron from the input case, the accept threshold acts as a maximum recognized
distance. If the activation of the winning neuron is greater than this distance, the SOFM network is regarded as undecided.
Thus, by labeling all neurons and setting the accept threshold appropriately, a SOFM network can act as a novelty detector
(it reports undecided only if the input case is sufficiently dissimilar to all radial units).

SOFM networks are inspired by some known properties of the brain. The cerebral cortex is actually a large flat sheet (about
0.5m squared; it is folded up into the familiar convoluted shape only for convenience in fitting into the skull) with known
topological properties (for example, the area corresponding to the hand is next to the arm, and a distorted human frame can
be topologically mapped out in two dimensions on its surface).

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Classification Problems
In classification problems, the purpose of the network is to assign each case to one of a number of classes (or, more
generally, to estimate the probability of membership of the case in each class). In STATISTICA Neural Networks (SNN),
nominal output variables are used to indicate a classification problem. The nominal values correspond to the various classes.

STATISTICA Neural Networks can perform classification using the following network types: MLP, RBF, SOFM, PNN,
Cluster, and Linear.

Nominal variables are normally represented in STATISTICA Neural Networks using one of two techniques: two-state and
one-of-N, the first of which is available only for two-state variables. In two-state representation,
a single node corresponds to the variable, and a value of 0.0 is interpreted as one state, and a value of 1.0 as the other. In
one-of-N encoding, one unit is allocated for each state, with a particular state represented by 1.0 on that particular unit, and
0.0 on the others.
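The two encodings can be sketched with hypothetical helper functions (the package performs this conversion internally):

```python
def encode_two_state(value, states):
    """Two-state: a single node, 0.0 for the first state, 1.0 for the second."""
    return [float(states.index(value))]

def encode_one_of_n(value, states):
    """One-of-N: one node per state, 1.0 on the matching node, 0.0 elsewhere."""
    return [1.0 if s == value else 0.0 for s in states]

print(encode_two_state("male", ["female", "male"]))        # [1.0]
print(encode_one_of_n("green", ["red", "green", "blue"]))  # [0.0, 1.0, 0.0]
```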

Input nominal variables are easily converted by STATISTICA Neural Networks using the above methods, both during
training and during execution. Target outputs for units corresponding to nominal variables are also easily determined during
training. However, more effort is required to determine the output class assigned by a network during execution.

The output units each have continuous activation values between 0.0 and 1.0. In order to definitely assign a class from the
outputs, the network must decide if the outputs are reasonably close to 0.0 and 1.0. If they are not, the class is regarded as
undecided.

STATISTICA Neural Networks optionally uses confidence levels (the accept and reject thresholds) to decide how to interpret
the network outputs. These thresholds can be adjusted to make the network more or less fussy about when to assign a
classification. The interpretation differs slightly for two-state and one-of-N representation:

Two-state. If the unit output is above the accept threshold, the 1.0 class is deemed to be chosen. If the output is below the
reject threshold, the 0.0 class is chosen. If the output is between the two thresholds, the class is undecided.

One-of-N. A class is selected if the corresponding output unit is above the accept threshold and all the other output units are
below the reject threshold. If this condition is not met, the class is undecided.

For one-of-N encoding, the use of thresholds is optional. If not used, the "winner-takes-all" algorithm is used (the highest
activation unit gives the class, and the network is never undecided).

There is one peculiarity when dealing with one-of-N encoding. On first reading, you might expect that a network with accept
and reject thresholds set to 0.5 would be equivalent to a "winner-takes-all" network. Actually, this is not the case for one-of-N encoded
networks (it is the case for two-state). You can actually set the accept threshold lower than the reject threshold, and only a
network with accept 0.0 and reject 1.0 is equivalent to a winner-takes-all network. This is true since STATISTICA Neural
Networks' algorithm for assigning a class is actually:

Select the unit with the highest output. If this unit has output greater than or equal to the accept threshold, and all other
units have output less than the reject threshold, assign the class represented by that unit.

With an accept threshold of 0.0, the winning unit is bound to be accepted, and with a reject threshold of 1.0, none of the
other units can possibly be rejected, so the algorithm reduces to a simple selection of the winning unit. In contrast, if both
accept and reject are set to 0.5, the network may return undecided (if the winner is below 0.5, or any of the losers are above
0.5).
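This assignment rule translates directly into code (with None standing for "undecided"):

```python
def assign_class(outputs, accept, reject):
    """Select the unit with the highest output; accept its class only if
    it reaches the accept threshold and every other unit stays below
    the reject threshold.  Returns the winning index, or None."""
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    if outputs[winner] < accept:
        return None
    if any(outputs[i] >= reject for i in range(len(outputs)) if i != winner):
        return None
    return winner

# accept=0.0, reject=1.0 reduces to winner-takes-all:
print(assign_class([0.2, 0.3, 0.1], accept=0.0, reject=1.0))  # 1
# With accept=reject=0.5 the same outputs give "undecided":
print(assign_class([0.2, 0.3, 0.1], accept=0.5, reject=0.5))  # None
```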


Although this concept takes some getting used to, it does allow you to set some subtle conditions. For example, accept/reject
0.3/0.7 can be read as: select the class using the winning unit, provided it has an output level at least 0.3, and none of the
other units have activation above 0.7 - in other words, the winner must show some significant level of activation, and the
losers mustn't, for a decision to be reached.

If the network's output unit activations are probabilities, the range of possible output patterns is of course restricted, as they
must sum to 1.0. In that case, winner-takes-all is equivalent to setting accept and reject both to 1/N, where N is the number
of classes.

However, not all classification neural networks actually output probabilities.

The above discussion covers the assignment of classifications in most types of network: MLPs, RBFs, linear, and Cluster.
However, SOFM networks work quite differently.

In a SOFM network, the winning node in the topological map (output) layer is the one with the lowest activation level
(which measures the distance of the input case from the point stored by the unit). Some or all of the units in the topological
map may be labeled, indicating an output class. If the distance is small enough, then the case is assigned to the class (if one
is given). In STATISTICA Neural Networks, the accept threshold indicates the largest distance which will result in a positive
classification. If an input case is further than this distance away from the winning unit, or if the winning unit is unlabeled
(or its label doesn't match one of the output variable's nominal values), then the case is unclassified. The reject threshold is
not used in SOFM networks.
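
In outline, the SOFM rule can be sketched as follows (a hypothetical illustration; the function name, stored points, and labels are ours, not STATISTICA's):

```python
import math

def sofm_classify(case, units, labels, accept):
    """units: stored points; labels: class label or None for each unit."""
    dists = [math.dist(case, u) for u in units]
    winner = dists.index(min(dists))          # lowest activation = nearest unit
    if dists[winner] <= accept and labels[winner] is not None:
        return labels[winner]
    return None  # winner too far away, or winning unit unlabeled

units = [(0.0, 0.0), (1.0, 1.0)]
labels = ["A", None]
print(sofm_classify((0.1, 0.0), units, labels, accept=0.5))  # -> A
print(sofm_classify((0.9, 1.0), units, labels, accept=0.5))  # -> None (unlabeled winner)
```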

The discussion on non-SOFM networks above has assumed that a positive classification is indicated by a figure close to 1.0,
and a negative classification by a figure close to 0.0. This is true if the logistic output activation function is used, and is
convenient as probabilities range from 0.0 to 1.0. However, in some circumstances a different range may be used. Also,
sometimes ordering is reversed, with smaller outputs indicating higher confidence. STATISTICA Neural Networks can deal
with both of these situations.

First, the range values used are actually the min/mean and max/SD values stored for each variable. With a logistic output
activation function, the default values 0.0 and 1.0 are fine. Some authors actually recommend using the hyperbolic tangent
activation function, which has the range (-1.0,+1.0). Training performance may be enhanced because this function (unlike
the logistic function) is symmetrical. In such a case, alter the min/mean and max/SD values, and STATISTICA Neural
Networks will automatically interpret classes correctly. Alternatively (and we recommend this practice), use the hyperbolic
tangent activation function in the hidden layers, but not in the output layer.

Ordering is typically reversed in two situations. We have just discussed one of these: SOFM networks, where the output is a
distance measure, with a small value indicating greater confidence. The same is true in the closely-related Cluster networks.
The second circumstance is the use of a loss matrix (which can be added at creation time to PNNs, and also manually joined
to other types of network). When a loss matrix is used, the network outputs indicate the expected cost if each class is
selected, and the objective is to select the class with the lowest cost. In this case, we would normally expect the accept
threshold to be smaller than the reject threshold.

Classification Statistics

When selecting accept/reject thresholds, and assessing the classification ability of the network, the most important indicator
is the classification summary spreadsheet. This shows how many cases were correctly classified, incorrectly classified, or
unclassified. You can also use the confusion matrix spreadsheet to break down how many cases belonging to each class were
assigned to another class. All these figures can be independently reported for the training, selection and test sets.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Regression Problems

In regression problems, the objective is to estimate the value of a continuous output variable, given the known input
variables. Regression problems can be solved using the following network types in STATISTICA Neural Networks: MLP,
RBF, GRNN, and Linear. Regression problems are represented in STATISTICA Neural Networks by data sets with non-
nominal (standard numeric) output(s).

A particularly important issue in regression is output scaling, and extrapolation effects.

The most common neural network architectures have outputs in a limited range (e.g., (0,1) for the logistic activation
function). This presents no difficulty for classification problems, where the desired output is in such a range. However, for
regression problems there clearly is an issue to be resolved, and some of the consequences are quite subtle.

This subject is discussed below.

As a first pass, we can apply a scaling algorithm to ensure that the network's output will be in a sensible range. The simplest
scaling function in STATISTICA Neural Networks is minimax: this finds the minimum and maximum values of a variable in
the training data, and performs a linear transformation (using a shift and a scale factor) to convert the values into the target
range (typically [0.0,1.0]). If this is used on a continuous output variable, then we can guarantee that all training values will
be converted into the range of possible outputs of the network, and so the network can be trained. We also know that the
network's output will be constrained to lie within this range. This may or may not be regarded as a good thing, which brings
us to the subject of extrapolation.
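
A minimal sketch of minimax scaling as just described (the function names are ours): find the training minimum and maximum, then apply a shift and a scale factor to map values into the target range.

```python
# Fit a linear transformation mapping [min, max] of the training data
# onto the target range [lo, hi].

def minimax_fit(values, lo=0.0, hi=1.0):
    vmin, vmax = min(values), max(values)
    scale = (hi - lo) / (vmax - vmin)
    shift = lo - vmin * scale
    return scale, shift

def minimax_apply(x, scale, shift):
    return x * scale + shift

train = [10.0, 20.0, 30.0]
scale, shift = minimax_fit(train)
print([minimax_apply(v, scale, shift) for v in train])  # -> [0.0, 0.5, 1.0]
```

A narrower target range such as [0.1, 0.9] is simply minimax_fit(train, lo=0.1, hi=0.9).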

Consider the figure above. Here, we are trying to estimate the value of y from the value of x. A curve has to be fitted that
passes through the available data points. We can probably easily agree on the illustrated curve, which is approximately the
right shape, and this will allow us to estimate y given inputs in the range represented by the solid line where we can
interpolate.

However, what about a point well to the right of the data points? There are two possible approaches to estimating y for this
point. First, we might decide to extrapolate: projecting the trend of the fitted curve onwards. Second, we might decide that
we don't really have sufficient evidence to assign any value, and therefore assign the mean output value (which is probably
the best estimate we have lacking any other evidence).

Let us assume that we are using an MLP. Using minimax as suggested above is highly restrictive. First, the curve is not
extrapolated, however close to the training data we may be (if we are only a little bit outside the training data, extrapolation
may well be justified). Second, it does not estimate the mean either - it actually saturates at either the minimum or
maximum, depending on whether the estimated curve was rising or falling as it approached this region.

There are a number of approaches to correct this deficiency in an MLP:

First, we can replace the logistic output activation function with a linear activation function, which simply passes on the
activation level unchanged (N.B. only the activation functions in the output layer are changed; the hidden layers still use
logistic or hyperbolic activation functions). The linear activation function does not saturate, and so can extrapolate further
(the network will still saturate eventually as the hidden units saturate). A linear activation function in an MLP can cause
some numerical difficulties for the back propagation algorithm, however, and if this is used a low learning rate (below 0.1)
must be used. This approach is appropriate if you want to extrapolate, and is the default in STATISTICA Neural Networks.

Second, you can alter the target range for the minimax scaling function (for example, to [0.1,0.9]). The training cases are
then all mapped to levels that correspond to only the middle part of the output units' output range. Interestingly, if this range
is small, with both figures close to 0.5, it corresponds to the middle part of the sigmoid curve that is nearly linear, and the
approach is then quite similar to using a linear output layer. Such a network can perform limited extrapolation, but
eventually saturates. This has quite a nice intuitive interpretation: extrapolation is justified for a certain distance, and then
should be curtailed. This form of encoding is also available in STATISTICA Neural Networks.

It may have occurred to you that if the first approach is used, and linear units are placed in the output layer, there is no need
to use a scaling algorithm at all, since the units can achieve any output level without scaling. It is indeed possible to turn off
scaling entirely in STATISTICA Neural Networks, for efficiency reasons. However, in reality the entire removal of scaling
presents difficulties to the training algorithms. It implies that different weights in the network operate on very different
scales, which makes both initialization of weights and (some) training more complex. It is therefore not recommended that
you turn off scaling unless the output range is actually very small and close to zero. The same argument actually justifies the
use of scaling during preprocessing for MLPs (where, in principle, the first hidden layer weights could simply be adjusted to
perform any scaling required).

The above discussion focused on the performance of MLPs in regression, and particularly their behavior with respect to
extrapolation. Networks using radial units (RBFs and GRNNs) perform quite differently, and need different treatment.

Radial networks are inherently incapable of extrapolation. As the input case gets further from the points stored in the radial
units, so the activation of the radial units decays and (ultimately) the output of the network decays. An input case located far
from the radial centers will generate a zero output from all hidden units. The tendency not to extrapolate can be regarded as
good (depending on your problem-domain and viewpoint), but the tendency to decay to a zero output (at first sight) is not. If
we decide to eschew extrapolation, then what we would like to see reported at highly novel input points is the mean. In fact,
the RBF has a bias value on the output layer, and sets this to a convenient value, which hopefully approximates the sample
mean. Then, the RBF will always output the mean if asked to extrapolate.

In fact, STATISTICA Neural Networks uses the mean/SD scaling function with radial networks in regression problems. The
training data is scaled so that its output mean corresponds to 0.0, with other values scaled according to the output standard
deviation, and the bias is expected to be approximately zero. As input points are executed outside the range represented in
the radial units, the output of the network tends back towards the mean.

The performance of a regression network can be examined in a number of ways:

1. The output of the network for each case (or for any new case you choose to test) can be examined. For cases that are part
of the data set, the residual errors can also be generated.

2. Summary statistics can be generated. These include the mean and standard deviation of both the training data values and
the prediction error. One would generally expect to see a prediction error mean extremely close to zero (it is, after all,
possible to get a zero prediction error mean simply by estimating the mean training data value, without any recourse to the
input variables or a neural network at all). The most significant value is the prediction error standard deviation. If this is
no better than the training data standard deviation, then the network has performed no better than a simple mean estimator.
STATISTICA Neural Networks also reports the ratio of the prediction error SD to the training data SD. A ratio
significantly below 1.0 indicates good regression performance, with a level below 0.1 often said (heuristically) to indicate
good regression. This regression ratio (or, more accurately, one minus this ratio) is sometimes referred to as the explained
variance of the model.

The regression statistics also include the Pearson-R correlation coefficient between the network's prediction and
the observed values. In linear modeling, the Pearson-R correlation between the predictor variable and the predicted
variable is often used to express correlation - if a linear model is fitted, this is identical to the correlation between the model's
prediction and the observed values (or, to the negative of it). Thus, this gives you a convenient way to compare the neural
network's accuracy with that of your linear models.

3. A view of the response surface can be generated. The network's actual response surface is, of course, constructed in N+1
dimensions, where N is the number of input units, and the last dimension plots the height. It is clearly impossible to
directly visualize this surface where N is anything greater than two (which it invariably is). However, STATISTICA Neural
Networks can display the response surface plotted across any two of the input units. In order to do this, all other inputs are
held at a fixed value, while the two inputs to be examined are varied. The other inputs can be held at any value desired (by
default, STATISTICA Neural Networks holds them at their mean values), and the two examined inputs can be varied in
any range (by default, across the range represented in the training data). You can also view a response graph, where all the
inputs bar one are held fixed.
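
The error-SD ratio and Pearson-R described in point 2 can be computed directly. The data below are made up for illustration, not STATISTICA output:

```python
import math, statistics

observed  = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.1]
errors = [p - o for p, o in zip(predicted, observed)]

# Ratio of prediction-error SD to observed-data SD: well below 1.0 means the
# model beats a simple mean estimator.
sd_ratio = statistics.stdev(errors) / statistics.stdev(observed)

# Pearson-R between the prediction and the observed values.
mo, mp = statistics.fmean(observed), statistics.fmean(predicted)
cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
pearson_r = cov / math.sqrt(sum((o - mo) ** 2 for o in observed) *
                            sum((p - mp) ** 2 for p in predicted))

print(round(sd_ratio, 3), round(pearson_r, 3))
```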

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Time Series Prediction in STATISTICA Neural Networks

In time series problems, the objective is to predict ahead the value of a variable that varies in time, using previous values of
that and/or other variables (see Bishop, 1995).

Typically the predicted variable is continuous, so that time series prediction is usually a specialized form of regression.
However, this restriction is not built into STATISTICA Neural Networks, which can also do time series prediction of
nominal variables (i.e., classification).

It is also usual to predict the next value in a series from a fixed number of previous values (looking ahead a single time step).
STATISTICA Neural Networks can actually be used to look ahead any number of steps. When the next value in a series is
generated, further values can be estimated by feeding the newly-estimated value back into the network together with other
previous values: time series projection. If single-step lookahead is used, STATISTICA Neural Networks can also do this
projection. Obviously, the reliability of projection drops the more steps ahead one tries to predict, and if a particular distance
ahead is required, it is probably better to train a network specifically for that degree of lookahead.

In STATISTICA Neural Networks, any type of network can be used for time series prediction (the network type must,
however, be appropriate for regression or classification, depending on the problem type). A network is configured for time
series prediction by setting its Steps and Lookahead parameters. The Steps parameter indicates how many cases should be fed
in as inputs, and the Lookahead parameter how far ahead the prediction should be made. The network can also have any
number of input and output variables. However, most commonly there is a single variable that is both the input and (with the
lookahead taken into account) the output. Configuring a network for time series usage alters the way that data is pre-
processed (i.e., it is drawn from a number of sequential cases, rather than a single case), but the network is executed and
trained just as for any other problem.

The time series training data set therefore typically has a single variable, and this has type input/output (i.e., it is used both
for network input and network output).
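
The way Steps and Lookahead turn a series into input/output patterns can be sketched as follows (our own illustration of the scheme described, not STATISTICA internals):

```python
# Build (inputs, output) patterns from a series: each pattern draws `steps`
# consecutive cases as inputs and predicts the case `lookahead` steps after
# the last input.

def make_patterns(series, steps, lookahead):
    patterns = []
    for t in range(steps + lookahead - 1, len(series)):
        inputs = series[t - lookahead - steps + 1 : t - lookahead + 1]
        patterns.append((inputs, series[t]))
    return patterns

series = [10, 11, 12, 13, 14]
# Steps=2, Lookahead=1: the first usable pattern's output is the third case.
print(make_patterns(series, steps=2, lookahead=1))
# -> [([10, 11], 12), ([11, 12], 13), ([12, 13], 14)]
```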

The most difficult concept in STATISTICA Neural Networks' time series handling is the interpretation of training, selection,
test and ignored cases. For standard data sets, each case is independent, and these meanings are clear. However, with a time
series network each pattern of inputs and outputs is actually drawn from a number of cases, determined by the network's
Steps and Lookahead parameters. There are two consequences of this:

The input pattern's type is taken from the type of the output case. For example, in a data set containing some cases, the first
two ignored and the third test, with Steps=2 and Lookahead=1, the first usable pattern has type Test, and draws its inputs
from the first two cases, and its output from the third. Thus, the first two cases are used in the test set even though they are
marked Ignore. Further, any given case may be used in three patterns, and these may be any of training, selection and test
patterns. In some sense, data actually leaks between training, selection and test sets. To isolate the three sets entirely,
contiguous blocks of training, selection and test cases would need to be constructed, separated by the appropriate number of
ignore cases.

The first few cases can only be used as inputs for patterns. When selecting patterns for time series use, the case number
selected is always that of the output case. The first few cases clearly cannot be selected (as this would require further cases
before the beginning of the data set, which are not available).

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Variable Selection and Dimensionality Reduction

The most common approach to dimensionality reduction is principal components analysis (see Bishop, 1995; Bouland and
Kamp, 1988). This is a linear transformation that locates directions of maximum variance in the original input data, and
rotates the data along these axes. Typically, the first principal components contain most information. Principal component
analysis can be represented in a linear network, and STATISTICA Neural Networks includes a simple network type to do
PCA. PCA can often extract a very small number of components from quite high-dimensional original data and still retain
the important structure.
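
As a sketch of the idea (a generic PCA via NumPy's eigendecomposition, not STATISTICA's network-based implementation):

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)              # center each variable
    cov = np.cov(Xc, rowvar=False)       # covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]       # directions of maximum variance first
    components = vecs[:, order[:n_components]]
    return Xc @ components               # rotate data onto those directions

# Three input variables, two of which are strongly correlated: two
# components retain essentially all of the structure.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + 0.01 * rng.normal(size=100),
                     rng.normal(size=100)])
Z = pca(X, 2)
print(Z.shape)  # -> (100, 2)
```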

The preceding sections on network design and training have all assumed that the input and output layers are fixed; that is,
that we know what variables will be input to the network, and what output is expected. The latter is always (at least, for
supervised learning problems) known. However, the selection of inputs is far more difficult (see Bishop, 1995). Often, we do
not know which of a set of candidate input variables are actually useful, and the selection of a good set of inputs is
complicated by a number of important considerations:

Curse of dimensionality. Each additional input unit in a network adds another dimension to the space in which the data
cases reside. We are attempting to fit a response surface to this data. Thought of in this way, there must be sufficient data
points to populate an N dimensional space sufficiently densely to be able to see the structure. The number of points needed
to do this properly grows very rapidly with the dimensionality (roughly, in proportion to 2^N for most modeling techniques).
Most forms of neural network (in particular, MLPs) actually suffer less from the curse of dimensionality than some other
methods, as they can concentrate on a lower-dimensional section of the high-dimensional space (for example, by setting the
outgoing weights from a particular input to zero, an MLP can entirely ignore that input). Nevertheless, the curse of
dimensionality is still a problem, and the performance of a network can certainly be improved by eliminating unnecessary
input variables. Indeed, even input variables that carry a small amount of information may sometimes be better eliminated if
this reduces the curse of dimensionality.

Interdependency of variables. It would be extremely useful if each candidate input variable could be independently
assessed for usefulness, so that the most useful ones could be extracted. Unfortunately, it is seldom possible to do this, and
two or more interdependent variables may together carry significant information that a subset would not. A classic example
is the two-spirals problem, where two classes of data are laid out in an interlocking spiral pattern in two dimensions. Either
variable alone carries no useful information (the two classes appear wholly intermixed), but with the two variables together
the two classes can be perfectly distinguished. Thus, variables cannot, in general, be independently selected.

Redundancy of variables. Often, a number of variables can carry to some extent or other the same information. For
example, the height and weight of people might in many circumstances carry similar information, as these two variables are
correlated. It may be sufficient to use as inputs some subset of the correlated variables, and the choice of subset may be
arbitrary. The superiority of a subset of correlated variables over the full set is a consequence of the curse of dimensionality.

Selection of input variables is therefore a critical part of neural network design. You can use a combination of your own
expert knowledge of the problem domain, and standard statistical tests to make some selection of variables before starting to
use STATISTICA Neural Networks. Once in STATISTICA Neural Networks, various combinations of inputs can be tried.
STATISTICA Neural Networks includes facilities to ignore some variables, building networks that do not use those variables
as inputs. You can experimentally add and remove various combinations, building new networks for each. You can also
conduct Sensitivity Analysis, which rates the importance of variables with respect to a particular model, and the Intelligent
Problem Solver will automatically select variables for you using a variety of regularization and search techniques.

When experimenting in this fashion, the probabilistic and generalized regression networks are extremely useful. Although
slow to execute, compared with the more compact MLPs and RBFs, they train almost instantaneously - and when iterating
through a large number of input variable combinations, you will need to repeatedly build networks. Moreover, PNNs and
GRNNs are both (like RBFs) examples of radially-based networks (i.e., they have radial units in the first layer, and build
functions from a combination of Gaussians). This is an advantage when selecting input variables because radially-based
networks actually suffer more from the curse of dimensionality than linearly-based networks.

To explain this statement, consider the effect of adding an extra, perfectly spurious input variable to a network. A linearly-
based network such as an MLP can learn to set the outgoing weights of the spurious input unit to 0, thus ignoring the
spurious input (in practice, the initially-small weights will just stay small, while weights from relevant inputs diverge). A
radially-based network such as a PNN or GRNN has no such luxury: clusters in the relevant lower-dimensional space get
smeared out through the irrelevant dimension, requiring larger numbers of units to encompass the irrelevant variability.
Counterintuitively, a network type that suffers badly from poor inputs is therefore at an advantage when the goal is to
identify and eliminate such inputs.

Since this form of experimentation is time-consuming, STATISTICA Neural Networks also contains facilities to do this for
you. Several feature selection algorithms are available, including the genetic algorithm (Goldberg, 1989). Genetic
Algorithms are very good at this kind of problem, having a capability to search through large numbers of combinations
where there may be interdependencies between variables.

Another approach to dealing with dimensionality problems, which may be an alternative or a complement to variable
selection, is dimensionality reduction. In dimensionality reduction, the original set of variables is processed to produce a
new and smaller set of variables that contains (one hopes) as much information as possible from the original set. As an
example, consider a data set where all the points lie on a plane in a three dimensional space. The intrinsic dimensionality of
the data is said to be two (as all the information actually resides in a two-dimensional sub-space). If this plane can be
discovered, the neural network can be presented with a lower dimensionality input, and stands a better chance of working
correctly.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Ensembles and Resampling

We have already discussed the problem of over-learning, which can compromise the ability of neural networks to generalize
successfully to new data. An important approach to improve performance is to form ensembles of neural networks. The
member networks' predictions are averaged (or combined by voting) to form the ensemble's prediction. Frequently,
ensemble formation is combined with resampling of the data set. This approach can significantly improve generalization
performance. Resampling can also be useful for improved estimation of network generalization performance.

To explain why resampling and ensembles are so useful, it is helpful to formulate the neural network training process in
statistical terms (Bishop, 1995). We regard the problem as that of estimating an unknown nonlinear function, which has
additive noise, on the basis of a limited data set of examples, D. There are several sources of error in our neural network's
predictions. First, and unavoidably, even a "perfect" network that exactly modeled the underlying function would make
errors due to the noise. However, there is also error due to the fact that we need to fit the neural network model using the
finite sample data set, D.

This remaining error can be split into two components, the model bias and variance. The bias is the average error that a
particular model training procedure will make across different particular data sets (drawn from the unknown function's
distribution). The variance reflects the sensitivity of the modeling procedure to a particular choice of data set.

We can trade off bias versus variance. At one extreme, we can arbitrarily select a function that entirely ignores the data. This
has zero variance, but presumably high bias, since we have not actually taken into account the known aspects of the problem
at all. At the opposite extreme, we can choose a highly complex function that can fit every point in a particular data set, and
thus has zero bias, but high variance as this complex function changes shape radically to reflect the exact points in a given
data set. The high bias, low variance solutions can have low complexity (e.g. linear models), whereas the low bias, high
variance solutions have high complexity. In neural networks, the low complexity models have smaller numbers of units.

How does this relate to ensembles and resampling? We necessarily divide the data set into subsets for training, selection, and
test. Intuitively, this is a shame, as not all the data gets used for training. If we resample, using a different split of data each
time, we can build multiple neural networks, and all the data gets used for training at least some of them. If we then form the
networks into an ensemble, and average the predictions, an extremely useful result occurs. Averaging across the models
reduces the variance, without increasing the bias. Arguably, we can afford to build lower bias models than we would
otherwise tolerate (i.e., higher complexity models), on the basis that ensemble averaging can then mitigate the resulting
variance.

The generalization performance of an ensemble can be better than that of the best member network, although this does
depend on how good the other networks in the ensemble are. Unfortunately, it is not possible to show whether this is actually
the case for a given ensemble. However, there are some reassuring pieces of theory to back up the use of ensembles.

First, it can be shown (Bishop, 1995) that, on the assumption that the ensemble members' errors have zero mean and are
uncorrelated, the ensemble reduces the error by a factor of N, where N is the number of members. In practice, of course,
these errors are not uncorrelated. An important corollary is that an ensemble is more effective when the members are less
correlated, and we might intuitively expect that to be the case if diverse network types and structures are used. STATISTICA
Neural Networks (SNN) does in fact have facilities to select just such diverse networks, and to include them in an ensemble.
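
The factor-of-N claim is easy to verify numerically for the idealized uncorrelated case (a simulation with synthetic zero-mean noise, not real network errors):

```python
import random, statistics

# Averaging N members with uncorrelated zero-mean errors cuts the error
# variance by a factor of N.
random.seed(1)
N, trials = 10, 20000

single = [random.gauss(0, 1) for _ in range(trials)]
ensemble = [statistics.fmean(random.gauss(0, 1) for _ in range(N))
            for _ in range(trials)]

print(round(statistics.variance(single), 2))    # ~1.0
print(round(statistics.variance(ensemble), 2))  # ~0.1, i.e. 1/N
```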

Second, and perhaps more significantly, it can be shown that the expected error of the ensemble is at least as good as the
average expected error of the members, and usually better. Typically, some useful reduction in error does occur. There is of
course a cost in processing speed, but for many applications this is not particularly problematic.

There are a number of approaches to resampling available in STATISTICA Neural Networks.

The simplest approach is random (Monte Carlo) resampling, where the training, selection and test sets are simply drawn at
random from the data set, keeping the sizes of the subsets constant. Alternatively, you can sometimes resample the training
and selection set, but keep the test set the same, to support a simple direct comparison of results.

The second approach supported in STATISTICA Neural Networks is the popular cross-validation algorithm. Here, the data
set is divided into a number of equal sized divisions. A number of neural networks are created. For each of these, one
division is used for the test data, and the others are used for training and selection. In the most extreme version of this
algorithm, leave-one-out cross validation, N divisions are made, where N is the number of cases in the data set, and on each
division the network is trained on all bar one of the cases, and tested on the single case that is omitted. This allows the
training algorithm to use virtually the entire data set for training, but is obviously very intensive.

The third approach is bootstrap sampling. In the bootstrap, a new training set is formed by sampling with replacement from
the available data set. In sampling with replacement, cases are drawn at random from the data set, with equal probability,
and any one case may be selected any number of times. Typically the bootstrap set has the same number of cases as the data
set, although this is not a necessity. Due to the sampling process, it is likely that some of the original cases will not be
selected, and these can be used to form a test set, whereas other cases will have been duplicated.
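
A bootstrap split can be sketched in a few lines (our own illustration; the cases that are never drawn serve as the out-of-bag test set):

```python
import random

def bootstrap_split(cases, seed=0):
    rng = random.Random(seed)
    # Same size as the original set, sampled with replacement: any one case
    # may be drawn any number of times.
    train = [rng.choice(cases) for _ in cases]
    # Cases never drawn form the test set.
    test = [c for c in cases if c not in train]
    return train, test

cases = list(range(10))
train, test = bootstrap_split(cases)
print(len(train))                               # -> 10
print(sorted(set(train) | set(test)) == cases)  # every case lands somewhere -> True
```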

The bootstrap procedure replicates, insofar as is possible with limited data, the idea of drawing multiple data sets from the
original distribution. Once again, the effect can be to generate a number of models with low bias, and to average out the
variance.

Ensembles can also be beneficial at averaging out bias. If we include different network types and configurations in an
ensemble, it may be that different networks make systematic errors in different parts of the input space. Averaging these
differently configured networks may iron out some of this bias.

See the Neural Networks Overviews Index for other SNN Overviews.

Neural Networks Introductory Overview - Recommended Textbooks

Further details are available from References, and references to specific topics are included throughout the Electronic
Manual. There also are a number of textbooks that give a good, comprehensive summary of the subject. Among these, we
would particularly recommend:

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press. Extremely well-written, up-to-date.
Requires a good mathematical background, but rewards careful reading, putting neural networks firmly into a statistical
context.

Carling, A. (1992). Introducing Neural Networks. Wilmslow, UK: Sigma Press. A relatively gentle introduction. Starting to
show its age a little, but still a good starting point.

Fausett, L. (1994). Fundamentals of Neural Networks. New York: Prentice Hall. A well-written book, with very detailed
worked examples to explain how the algorithms function.

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. New York: Macmillan Publishing. A comprehensive
book, with an engineering perspective. Requires a good mathematical background, and contains a great deal of background
theory.

Patterson, D. (1996). Artificial Neural Networks. Singapore: Prentice Hall. Good wide-ranging coverage of topics, although
less detailed than some other books.

file:///C:/Users/CERIn%2047/AppData/Local/Temp/~hhDEAB.htm 09/10/2018
Neural Networks Página 41 de 220

Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press. A very good advanced
discussion of neural networks, firmly putting them in the wider context of statistical modeling.

See the Neural Networks Overviews Index for other SNN Overviews.

Specifying the Neural Networks Analysis


Extended Dot Product Training

Extended Dot Product Training Dialog

Extended Dot Product Training - Quick Tab

Extended Dot Product Training - End Tab

Extended Dot Product Training - Decay Tab

Extended Dot Product Training - BP (1)/(2) Tab

Extended Dot Product Training - QP (1)/(2) Tab

Extended Dot Product Training - DBD (1)/(2) Tab

Extended Kohonen Training

Extended Kohonen Training Dialog

Neural Networks (Startup Panel)

STATISTICA Neural Networks (SNN) Startup Panel

Neural Networks Startup Panel - Quick Tab

Neural Networks Startup Panel - Advanced Tab

Neural Networks Startup Panel - Networks/Ensembles Tab

Intelligent Problem Solver

Intelligent Problem Solver Overview

How the Intelligent Problem Solver Works

Intelligent Problem Solver Dialog

Intelligent Problem Solver - Quick Tab

Intelligent Problem Solver - Retain Tab

Intelligent Problem Solver - Types Tab

Intelligent Problem Solver - Complexity Tab

Intelligent Problem Solver - Thresholds Tab

Intelligent Problem Solver - Time Series Tab

Intelligent Problem Solver - MLP Tab

Intelligent Problem Solver - Feedback Tab

Intelligent Problem Solver - Progress dialog


Custom Network Designer

Custom Network Designer Dialog

Custom Network Designer - Quick Tab

Custom Network Designer - Units Tabs

Custom Network Designer - PNN Tab

Custom Network Designer - Time Series Tab

Create/Edit Ensemble

Create/Edit Ensemble Dialog

Multiple Model Selection

Select Networks and/or Ensembles Dialog

Select Networks and/or Ensembles - Models Tab

Select Networks and/or Ensembles - Options Tab

Code Generator

Run Code Generator

Retraining Networks

Select Neural Network to Retrain

Network Set Editor

Neural Network File Editor Dialog

Neural Network File Editor - File Details Tab

Neural Network File Editor - Networks Tab

Neural Network File Editor - Ensembles Tab

Neural Network File Editor - Replacement Options Tab

Neural Network File Editor - Advanced Tab

Select Network File

Select Pre-processing Network

Neural Network Editor

Neural Network Editor Dialog

Neural Network Editor - Quick Tab


Neural Network Editor - Variables Tab

Neural Network Editor - Layers Tab

Neural Network Editor - Weights Tab

Neural Network Editor - Time Series Tab

Neural Network Editor - Advanced Tab

Neural Network Editor - Pruning Tab

Neural Network Editor - Thresholds Tab

Add Variables to Network

Add Units to Network

Delete Units from Network

Nominal Definition

Generalized Regression Neural Network

Generalized Regression Neural Network Training Overview

Train Generalized Regression Networks Dialog

Train Generalized Regression Networks - Quick Tab

Train Generalized Regression Networks - Pruning Tab

Train Generalized Regression Networks - Classification Tab

Clustering Network

Cluster Network Training Overview

Train Cluster Network Dialog

Train Cluster Network - Quick Tab

Train Cluster Network - LVQ Tab

Train Cluster Network - Pruning Tab

Train Cluster Network - Classification Tab

Train Cluster Network - Interactive Tab

Linear

Linear Network Training Overview

Train Linear Network Dialog

Train Linear Network - Quick Tab

Train Linear Network - Pruning Tab

Train Linear Network - Classification Tab


Multiple Network Selection

Select Neural Networks

Multilayer Perceptron

Multilayer Perceptron Training Overview

Train Multilayer Perceptron Dialog

Train Multilayer Perceptron - Quick Tab

Train Multilayer Perceptron - Start Tab

Train Multilayer Perceptron - End Tab

Train Multilayer Perceptron - Decay Tab

Train Multilayer Perceptron - Interactive Tab

Train Multilayer Perceptron - BP Tab

Train Multilayer Perceptron - QP Tab

Train Multilayer Perceptron - DBD Tab

Train Multilayer Perceptron - Classification Tab

Principal Components

Principal Component Analysis Dialog

Radial Basis Function

Radial Basis Function Training Overview

Train Radial Basis Function Dialog

Train Radial Basis Function - Quick Tab

Train Radial Basis Function - Pruning Tab

Train Radial Basis Function - Classification Tab

Probabilistic Neural Network

Probabilistic Neural Network Training Overview

Train Probabilistic Neural Networks Dialog

Train Probabilistic Neural Networks - Quick Tab

Train Probabilistic Neural Networks - Priors Tab

Train Probabilistic Neural Networks - Loss Matrix Tab

Train Probabilistic Neural Networks - Pruning Tab

Train Probabilistic Neural Networks - Thresholds Tab

Self Organizing Feature Map


Self Organizing Feature Map Training Overview

Train Self Organizing Feature Map Dialog

Train Self Organizing Feature Map - Quick Tab

Train Self Organizing Feature Map - Start Tab

Train Self Organizing Feature Map - Classification Tab

Train Self Organizing Feature Map - Interactive Tab

Train Radial Layer

Train Radial Layer Dialog

Train Radial Layer - Sample Tab

Train Radial Layer - Deviation Tab

Train Radial Layer - Labels Tab

Train Radial Layer - Kohonen Tab

Train Radial Layer - Start(K) Tab

Train Radial Layer - LVQ Tab

Train Radial Layer - Interactive Tab

Train Dot Product Layers

Train Dot Product Layers Dialog

Train Dot Product Layers - Train Tab

Train Dot Product Layers - Start Tab

Train Dot Product Layers - End Tab

Train Dot Product Layers - Decay Tab

Train Dot Product Layers - Interactive Tab

Train Dot Product Layers - BP (1)+(2) Tab

Train Dot Product Layers - QP (1)+(2) Tab

Train Dot Product Layers - DBD (1)+(2) Tab

Training in Progress

Training in Progress Dialog

Training in Progress - Graph Tab

Training in Progress - Dynamic Options Tab

Training in Progress - Static Options Tab

Sampling of Case Subsets for Training


Sampling of Case Subsets for Training Dialog

Sampling of Case Subsets for Training - Quick Tab

Sampling of Case Subsets for Training - Advanced Tab

Sampling of Case Subsets for Training - Cross Validation Tab

Sampling of Case Subsets for Training - Bootstrap Tab

Sampling of Case Subsets for Training - Random Tab

Select Cases

Select Cases

Sampling of Case Subsets for Intelligent Problem Solver

Sampling of Case Subsets for Intelligent Problem Solver Dialog

Sampling of Case Subsets for Intelligent Problem Solver - Quick Tab

Sampling of Case Subsets for Intelligent Problem Solver - Random Tab

Sampling of Case Subsets for Intelligent Problem Solver - Bootstrap Tab

Sampling of Case Subsets for Feature Selection

Feature (Independent Variable) Selection Dialog

Feature Selection - Quick Tab

Feature Selection - Advanced Tab

Feature Selection - Genetic Algorithm Tab

Feature Selection - Interactive Tab

Feature Selection - Finish Tab

Feature Selection in Progress Dialog

Sampling of Case Subsets for Feature Selection Dialog

Training String

Profile String

Extended Dot Product Training


Extended Dot Product Training Dialog

Extended Dot Product Training - Quick Tab

Extended Dot Product Training - End Tab

Extended Dot Product Training - Decay Tab

Extended Dot Product Training - BP (1)/(2) Tab

Extended Dot Product Training - QP (1)/(2) Tab


Extended Dot Product Training - DBD (1)/(2) Tab

Extended Dot Product Training Dialog


Click the Extend button on the Training in Progress dialog to display the Extended Dot Product Training dialog. Tabs on
this dialog can include: Quick, End, Decay, BP, QP, and DBD. The options on these tabs support extended iterative dot
product training. You can extend training using the same or a different iterative algorithm, and can alter the algorithm's
control parameters.

OK. Click the OK button to extend iterative dot product training with the options specified on the tabs. Control is returned
immediately to the Training in Progress dialog.

Cancel. Click this button to exit this dialog without making any changes.

Extended Dot Product Training - Quick Tab


Select the Quick tab of the Extended Dot Product Training dialog to access the options described here.

Phase one/Phase two. Select a two-phase iterative algorithm by selecting both of these check boxes. Clear one or the other
to specify a single-phase algorithm. The algorithm, and all the parameters of the algorithm, can be chosen independently for
each phase.

Algorithm. Select the iterative training algorithm from these drop-down lists for each phase. The available algorithms are:

Back propagation. Back propagation is a simple algorithm with a large number of tuning parameters; often slow terminal
convergence, but good initial convergence. STATISTICA Neural Networks implements the on-line version of the algorithm;
see Technical Details.

Conjugate gradient descent. Conjugate gradient descent is a good generic algorithm with generally fast convergence.

Quasi-Newton (BFGS). Quasi-Newton is a powerful second order training algorithm with very fast convergence but high
memory requirements.

Levenberg-Marquardt. Levenberg-Marquardt is an extremely fast algorithm in the right circumstances - low-noise regression problems with the standard sum-squared error function.

Quick propagation. Quick Propagation is an older algorithm with comparable performance to back propagation in most
circumstances, although it seems to perform noticeably better on some problems. Not usually recommended.

Delta-bar-delta. Delta-bar-Delta is another variation on back propagation that occasionally seems to have better
performance. Not usually recommended.

Epochs. Specify the number of epochs of iterative training of the network in a given phase.

Learning rate. Specify the main learning rate for back propagation, quick propagation, or Delta-bar-Delta training.

Extended Dot Product Training - End Tab


Select the End tab of the Extended Dot Product Training dialog to access options that control the end of training.
Specifically, there are stopping conditions to determine if training should be terminated before the full number of epochs has
expired.

Stopping conditions. The following options are in the Stopping conditions group box.

Target error: Training/Selection. You can specify target error values here. If the error on the training or selection set
drops below the given target values, the network is considered to have trained sufficiently well, and training is terminated.
The error never drops to zero or below, so the default value of zero is equivalent to not having a target error.

Minimum improvement in error. The following options are in this group box.

Training/Selection. Specify a minimum improvement (drop) in error that must be made; if the rate of improvement drops below this level, training is terminated. The default value of zero implies that training will be terminated if the error deteriorates. You can also specify a negative improvement rate, which is equivalent to giving a maximum rate of deterioration that will be tolerated. The improvement is measured across a number of epochs, called the "window" (see below).

Window. Specifies the number of epochs across which improvement is measured. Some algorithms, including back
propagation, show noise on the training and selection errors, and all the algorithms may show noise in the selection error. It
is therefore not usually a good idea to halt training on the basis of a failure to achieve the desired improvement in error rate
over a single epoch. The window specifies a number of epochs over which the error rates are monitored for improvement.
Training is only halted if the error fails to improve for that many epochs.
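The window-based check can be sketched as follows. This is our illustration of the rule described above, not the program's code:

```python
def should_stop(errors, window, min_improvement=0.0):
    """Stop if the error failed to improve by at least
    `min_improvement` over the last `window` epochs.
    `errors` is the per-epoch error history so far."""
    if len(errors) <= window:
        return False                      # not enough history yet
    # Improvement = error at the start of the window minus current error.
    improvement = errors[-window - 1] - errors[-1]
    return improvement < min_improvement
```

With the default minimum improvement of zero, training halts only when the error over the window actually deteriorates.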

Extended Dot Product Training - Decay Tab


Select the Decay tab of the Extended Dot Product Training dialog to access options to specify the use of Weigend weight
decay regularization. This option encourages the development of smaller weights, which tends to reduce the problem of
over-fitting, thereby potentially improving generalization performance of the network, and also allowing you to prune the
network (see also, the description of the End tab). Weight decay works by modifying the network's error function to penalize
large weights - the result is an error function that compromises between performance and weight size. Consequently, too
large a weight decay term may damage network performance unacceptably, and experimentation is generally needed to
determine an appropriate weight decay factor for a particular problem domain.

Phase one decay factors/Phase two decay factors. Weight decay can be applied separately to the two phases of a two-
phase algorithm.

Apply weight decay regularization. Select this check box to enable weight decay regularization on a particular phase.

Decay factor. Specify the decay factor; see Weigend Weight Regularization for a mathematical definition of the decay
factor. A study of decay factors and weight pruning shows that the number of inputs and units pruned is approximately
proportional to the logarithm of the decay factor, and this should be borne in mind when altering the decay factor. For
example, if you use a decay factor of 0.001 and this has insufficient effect on the weights, you might try 0.01 next, rather
than 0.002; conversely, if the weights were over-adjusted resulting in too much pruning, try 0.0001.

Scale factor. A secondary factor in weight decay, which is usually left at the default value of 1.0; see Weigend Weight
Regularization for more details.
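The compromise between performance and weight size can be sketched as follows. The exact penalty is defined under Weigend Weight Regularization; this illustration assumes the commonly cited Weigend form:

```python
def weigend_penalty(weights, decay_factor, scale=1.0):
    """Weight-decay penalty added to the network error.  Each large
    weight contributes nearly `decay_factor`; small weights contribute
    almost nothing, so training favors smaller weights."""
    return decay_factor * sum(
        (w / scale) ** 2 / (1.0 + (w / scale) ** 2) for w in weights
    )

def regularized_error(raw_error, weights, decay_factor, scale=1.0):
    # The modified error function compromises between performance
    # (raw_error) and weight size.
    return raw_error + weigend_penalty(weights, decay_factor, scale)
```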

Extended Dot Product Training - BP (1)/(2) Tab


Select the BP tab of the Extended Dot Product Training dialog to access additional options for Back Propagation training.
There are separate tabs for the first and second phase algorithms (either or both of which might be back propagation); the
appropriate tab is available only if the algorithm is in use for that phase.

Adjust learning rate and momentum each epoch. By default, STATISTICA Neural Networks uses a fixed learning rate and
momentum throughout training. Some authors recommend altering these rates on each epoch (specifically, by reducing the
learning rate - this is often counter-balanced by increasing momentum). See Back Propagation for more details.

Learning rate. The learning rate used to adjust the weights. A higher learning rate may converge more quickly, but may
also exhibit greater instability. Values of 0.1 or lower are reasonably conservative - higher rates are tolerable on some
problems, but not on all (especially on regression problems, where a higher rate may actually cause catastrophic divergence
of the weights).

If the learning rate is being adjusted on each epoch, both fields are enabled, and specify the starting and finishing learning
rates.

If a fixed learning rate is being used, the first field specifies this, and the second is disabled.

Momentum. Momentum is used to compensate for slow convergence if weight adjustments are consistently in one direction
- the adjustment "picks up speed." Momentum usually increases the speed of convergence of Back Propagation considerably,
and a higher rate may allow you to decrease the learning rate to increase stability without sacrificing much in the way of
convergence speed.

Shuffle presentation order of cases each epoch. STATISTICA Neural Networks uses the on-line version of back
propagation, which adjusts the weights of the network as each training case is presented (rather than the batch approach,
which calculates an average adjustment across all training cases and applies a single adjustment at the end of the epoch). If
the Shuffle check box is selected, the order of presentation is changed each epoch. This makes the algorithm somewhat less prone to sticking in local minima, and partially accounts for back propagation's greater robustness in this respect than the more advanced second-order training algorithms.


Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. Specify here the standard deviation of the Gaussian noise added to the target output during training.
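Putting the options on this tab together, one on-line training epoch can be sketched as follows. This is a minimal illustration, not the program's implementation; `grad` stands in for whatever computes the error gradient for a single case:

```python
import random

def backprop_epoch(weights, cases, grad, lr, momentum,
                   prev_delta, noise_sd=0.0, rng=random.Random(0)):
    """One epoch of on-line back propagation: weights are adjusted as
    each case is presented, optionally with shuffled presentation
    order and Gaussian noise added to the target output."""
    rng.shuffle(cases)                     # shuffle presentation order
    for x, t in cases:
        if noise_sd > 0.0:
            t = t + rng.gauss(0.0, noise_sd)
        g = grad(weights, x, t)
        # Learning-rate step plus momentum from the previous change.
        delta = [-lr * gi + momentum * di for gi, di in zip(g, prev_delta)]
        weights = [w + d for w, d in zip(weights, delta)]
        prev_delta = delta
    return weights, prev_delta
```

For example, fitting a single weight w in y = w·x under squared error with this loop converges to the least-squares value within a few hundred epochs.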

Extended Dot Product Training - QP (1)/(2) Tab


Select the QP tab of the Extended Dot Product Training dialog to access options for Quick Propagation training. There are
separate tabs for the first and second phase algorithms (either or both of which might be quick propagation); the appropriate
tab is available only if the algorithm is in use for that phase.

Learning rate. The initial learning rate, applied in the first epoch; subsequently, the quick propagation algorithm determines
weight changes independently for each weight.

Acceleration. This gives the maximum rate of geometric increase in the weight change that is permitted. For example, an
acceleration of two will permit the weight change to no more than double on each epoch. This prevents numerical
difficulties otherwise caused by non-concave error surfaces.

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. Specify here the standard deviation of the Gaussian noise added to the target output during training.
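The acceleration limit can be sketched for a single weight as follows. This is our illustration of the standard quick propagation step; edge cases (such as equal successive gradients) are omitted:

```python
def quickprop_step(prev_delta, prev_grad, grad, lr, acceleration):
    """One quick-propagation weight change.  The parabolic step is
    clipped so it never exceeds `acceleration` times the previous
    change in magnitude."""
    if prev_delta == 0.0:
        return -lr * grad              # first epoch: plain gradient step
    # Jump toward the minimum of the parabola fitted through the two
    # most recent gradient measurements.
    step = grad / (prev_grad - grad) * prev_delta
    limit = acceleration * abs(prev_delta)
    return max(-limit, min(limit, step))
```

With an acceleration of two, the weight change can at most double from one epoch to the next, as described above.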

Extended Dot Product Training - DBD (1)/(2) Tab


Select the DBD tab of the Extended Dot Product Training dialog to access additional options for Delta-bar-Delta training.
There are separate tabs for the first and second phase; the appropriate tab is available only if the algorithm is in use for that
phase.

Learning rate. The following options are available under Learning rate:

Initial. The initial learning rate used for all weights on the first epoch. Subsequently, each weight develops its own learning
rate.

Increment. The linear increment added to a weight's learning rate if the slope remains in a consistent direction.

Decay. The geometric decay factor used to reduce a weight's learning rate if the slope changes direction.

Smoothing. The smoothing coefficient is used to update the bar-Delta smoothed gradient. It must lie in the range [0,1).

If the smoothing factor is high, the bar-Delta value is updated only slowly to take into account changes in gradient. On a
noisy error surface, this allows the algorithm to maintain a high learning rate consistent with the underlying gradient;
however, it may also lead to overshoot of minima, especially on an already smooth error surface.

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. Specify here the standard deviation of the Gaussian noise added to the target output during training.
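The per-weight rate adaptation described above can be sketched as follows. This is a hedged illustration of the Delta-bar-Delta rule, not the program's exact code:

```python
def dbd_update(rate, bar_delta, grad, increment, decay, smoothing):
    """Adapt one weight's learning rate.  A gradient consistent with
    the smoothed history adds a linear increment; a sign change
    applies the geometric decay.  `smoothing` must lie in [0, 1)."""
    if bar_delta * grad > 0.0:
        rate += increment            # consistent direction
    elif bar_delta * grad < 0.0:
        rate *= decay                # direction change
    # Update the smoothed ("bar-Delta") gradient for the next epoch.
    bar_delta = (1.0 - smoothing) * grad + smoothing * bar_delta
    return rate, bar_delta
```

A high smoothing value makes `bar_delta` track the raw gradient only slowly, which is what allows the algorithm to keep a high learning rate on a noisy error surface.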

Extended Kohonen Training


Extended Kohonen Training Dialog

Extended Kohonen Training


Click the Extend button on the Training in Progress dialog while the Kohonen algorithm is running to display the Extended
Kohonen Training dialog. This dialog contains one tab: Quick. The options on this dialog are used to perform additional
epochs of the Kohonen algorithm training.


Phase one/Phase two. Clear either one of these check boxes to specify a single-phase algorithm.

Epochs. These boxes contain the number of epochs used in each phase.

Learning rate. The Kohonen learning rate is altered linearly from the first to last epochs. You can specify a Start and End
value. Normal practice is to use different rates in the two phases:

In the first phase use an initially high learning rate (e.g., 0.9 to 0.1), combined with a large neighborhood (e.g., 2 to 1) and
small number of epochs (e.g., 100).

In the second phase, use a low learning rate throughout (e.g., 0.01) combined with a small neighborhood (e.g., 0) and large
number of epochs (e.g., 10,000).

Neighborhood. This is the "radius" of a square neighborhood centered on the winning unit. For example, a neighborhood
size of 2 specifies a 5x5 square.

If the winning node is placed near or on the edge of the topological map, the neighborhood is clipped to the edge.

The neighborhood is scaled linearly from the Start value to the End value given.

The neighborhood size is stored and scaled as a real number. However, for the sake of determining neighbors, the nearest
integral value is taken. Thus, the actual neighborhood used decreases in a number of discrete steps. It is not uncommon to
observe a sudden change in the performance of the algorithm as the neighborhood changes size. The neighborhood is
specified as a real number since this gives you greater flexibility in determining when exactly the changes should occur.
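The linear schedules described above can be sketched as follows (an illustration only; the rounding step shows why the neighborhood shrinks in discrete jumps):

```python
def linear_schedule(epoch, epochs, start, end):
    """Interpolate a parameter (learning rate or neighborhood radius)
    linearly from `start` on the first epoch to `end` on the last."""
    if epochs <= 1:
        return start
    return start + (end - start) * epoch / (epochs - 1)

def neighborhood_side(radius):
    # The stored radius is real-valued, but neighbors are determined
    # by the nearest integer, so the square changes size in steps.
    return 2 * round(radius) + 1        # e.g., radius 2 -> a 5x5 square
```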

OK. Click the OK button to run the Kohonen algorithm, with the control parameters entered. Control is returned to the
Training in Progress dialog.

Cancel. Click this button to exit this dialog without making any changes.

Neural Networks (Startup Panel)


STATISTICA Neural Networks (SNN) Startup Panel

Neural Networks Startup Panel - Quick Tab

Neural Networks Startup Panel - Advanced Tab

Neural Networks Startup Panel - Networks/Ensembles Tab

STATISTICA Neural Networks (SNN) Startup Panel


Select Neural Networks from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup
Panel. This dialog contains three tabs: Quick, Advanced, and Networks/Ensembles.

The Quick tab provides convenient access to the most common types of data analysis problems: Regression (prediction of
continuous variables), Classification (prediction of categorical or grouping variables), Time Series (predictions in the time
domain), and Cluster Analysis (detection and evaluation of clusters in the data).

The options on the Advanced tab can be used to specify variables more flexibly (as categorical and continuous variables, and
as input and output variables); the Advanced tab also contains options for retraining, applying, or further editing of existing
networks.

The Networks/Ensembles tab contains options for saving and retrieving network architectures in files, and for combining
multiple network architectures into ensembles of networks for prediction.

Managing Networks in SNN: The General Structure of the Program

The general management of multiple networks in SNN follows a "workbench" model: You can load existing networks or
network ensembles from files, or create one or more new networks with the options available on the Quick and Advanced
tabs of the STATISTICA Neural Networks Startup Panel. After starting up an instance of SNN, when you open a file with an
existing network or create a new network by applying, for example, the Intelligent Problem Solver, STATISTICA will create
internally a "workbench" (temporary working file) of current networks. These networks can be modified, further edited,
retrained, etc. In various places throughout the program you can select the networks in the current workspace that are of
interest. You can review the currently available networks and network ensembles on the Networks/Ensembles tab of the STATISTICA Neural Networks (SNN) Startup Panel; you can also use the options on that tab to save the networks in a permanent file, for later use or deployment (application to new data).

The Startup Panel as the main "analysis hub." You can think of the Startup Panel as the main "hub" for your analyses
with SNN: Here you can select to retrain existing networks, apply them to new data, edit them, review them, or create C or
STATISTICA Visual Basic code containing the trained networks (i.e., all parameter values that are necessary, and the
computer code applying the respective architectures), ready for inclusion in your own (C or STATISTICA Visual Basic)
programs. For example, when you want to add network architectures to the set of network architectures on the current
"workbench," you would select, for example, the Custom Network Designer option, train the desired type of network, and
exit the Results dialog by clicking the OK button to add the newly trained network(s) to the set of currently available
networks.

Selected Variables. This group box contains descriptions of the variables in the two variable lists: Dependent or output
variables, and Independent or input variables. These variables will be used to train new or existing networks and to compute
predicted and residual values. In most general terms, dependent or output variables are predicted from the independent or
predictor variables.

Selecting variables to apply existing networks. The Advanced tab of the Startup Panel contains options to run existing
models, i.e., to apply existing trained networks to new variables or data. In STATISTICA Neural Networks, the variables selected from the data file are matched by name to the input and output variables expected by the respective networks. In other words, if a network expects variables (variable names) "MyInputA" and "MyInputB" as two
input variables, and you select two variables from a new input data file named "NewInputA" and "NewInputB," you will
have a mismatch. The consequence of the mismatch will be that no data will be read, and when you get to the Results dialog,
only those options are accessible that allow you to inspect the network and network architecture, or to run user-defined
values through the network. However, you cannot compute predicted values for the values in the new input data file, based
on the values found in variables "NewInputA" and "NewInputB." On the other hand, when the variable names of selected
variables do match the labels for the inputs and outputs for a given network, then the order in which these variables are
selected into a list will not matter. Hence, by enforcing this "matching-of-variables-by-names" convention, the program
protects the user from selecting the wrong variables for a network, or the right variables in the wrong order, and thus from
computing erroneous results. Of course, the names of the input and output variables expected by networks in the current
workspace can be changed via the Neural Network Editor dialog.

Selected variables. This group box displays the variables currently selected for the analyses.

Dependent. The variables listed here will be treated as the output variables for the neural networks. These types of variables
are also referred to as dependent variables, because they are to be predicted from the input or independent variables, and
hence are presumed to depend on those input or independent variables to some extent.

Independent. The variables listed here will be treated as input variables for the neural networks. These types of variables
are also referred to as the independent variables, because within the neural network architectures these variables are not
predicted by or dependent on any other variables.

Variable types. The variable list in this group box identifies the variables that are categorical or continuous in nature; if a
subset variable was selected, it will also be shown here. In general, STATISTICA Neural Networks can handle continuous
and categorical inputs and outputs in most network architectures. Hence, variables included in analyses with SNN can either
be continuous or categorical in nature, and can either appear as input variables or output variables for the network. An
optional categorical subset variable can be specified if you want to assign each case in the sample to one of four groups: Train,
Selection, Test, and Ignore (see below).

Continuous. These variables typically describe continuous measurements such as weight or height.

Categorical. Categorical variables are also referred to as grouping variables or class variables. A typical example of a
categorical variable would be Gender with the two categories, groups, or classes Male and Female.

Selector. You can specify a Selector variable to determine which cases in the sample belong to the Training set, Selection
set, Test set, or are Ignored during the training and final evaluation of the network. When you specify a Selector variable,
you also need to select the specific codes (text values) used in that variable to identify to which group or set each
observation belongs. In short, observations in the Training set will be used to train the network (i.e., to estimate the network
weights and other parameters); cases in the Selection set will be used to perform an "independent check" of the network
performance during training, to avoid over-fitting the data (i.e., to determine when to terminate training the network); cases
in the Test set will not be used during training of the network (estimation procedure) at all, but the fully trained network will
be applied to those cases as a final independent check of the final network performance; cases in the Ignore set will be
ignored during training or in the final evaluation of the network, but can be used later to execute the trained network for
producing results.
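The effect of a Selector variable amounts to a simple partition of the cases, which can be sketched as follows (the codes "T", "S", "X", and "I" below are hypothetical; you choose the actual text values used in your data):

```python
def split_by_selector(cases, selector, codes):
    """Assign each case to the Train/Selection/Test/Ignore subset
    according to the code found in the selector variable."""
    subsets = {name: [] for name in codes}
    for case, code in zip(cases, selector):
        for name, value in codes.items():
            if code == value:
                subsets[name].append(case)
    return subsets

codes = {"Train": "T", "Selection": "S", "Test": "X", "Ignore": "I"}
subsets = split_by_selector([1, 2, 3, 4, 5],
                            ["T", "T", "S", "X", "I"], codes)
```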

OK. Click OK to begin the chosen analysis, as specified on the Startup Panel.


Cancel. Click the Cancel button to close the Neural Networks Startup Panel without performing an analysis.

Options. Click the Options button to display the following menu commands:

Output. Select Output to display the Analysis/Graph Output Manager dialog, which is used to customize the current
analysis output management of STATISTICA.

Display. Select Display to display the Analysis/Graph Display Options dialog, which is used to customize the current
analysis display of STATISTICA.

Create Macro. Select Create Macro to display the New Macro dialog. When running analyses in STATISTICA, all options
and output choices are automatically recorded; when you click Create Macro, the complete recording of all your actions will
be translated into a STATISTICA Visual Basic program that can be run to recreate the analysis. See Macro (STATISTICA
Visual Basic) Overview for further details.

Close Analysis. Select Close Analysis to close all dialogs associated with the analysis. Note that results spreadsheets/graphs
will not be closed, only analysis dialogs will close.

Open Data. Click the Open Data button to display the Select Spreadsheet dialog, which is used to choose the spreadsheet on
which to perform the analysis. The Select Spreadsheet dialog contains a list of the spreadsheets that are currently active.

Select Cases. Click the Select Cases button to display the Analysis/Graph Case Selection Conditions dialog, which is used
to create conditions for which cases will be included (or excluded) in the current analysis. More information is available in
the case selection conditions overview, syntax summary, and dialog description.

W. Click the W (Weight) button to display the Analysis/Graph Case Weights dialog, which is used to adjust the contribution
of individual cases to the outcome of the current analysis by "weighting" those cases in proportion to the values of a selected
variable.

Neural Networks Startup Panel - Quick Tab


Select the Quick tab of the STATISTICA Neural Networks (SNN) Startup Panel to access options to specify the most common
types of analyses where neural network methods are commonly applied. Refer also to the description of the STATISTICA
Neural Networks (SNN) Startup Panel for details regarding the general structure of the SNN module, how sets of networks
are managed in an active workspace, and how continuous and categorical variables can be selected as the input or output
(independent or dependent) variables for the analyses. If the type of analysis that you want to perform does not easily fit
inside the Problem types available on this tab, use the options on the Advanced tab to specify the input and output variables.

Problem type. Use the options in this group box to select the general type of analysis you want to perform. These selections
will constrain the types of variables you can select as input and output (independent and dependent) variables, to conform to
the respective common type of analysis. Note that you can also specify variables, and hence the general type of analysis
problem, with the options available on the Advanced tab; those options will not constrain the variable selections to conform
to one of these (common) analysis types, so for example, you could specify simultaneously continuous and categorical
output variables (a mixture of a regression and classification problem).

Regression. Select this option button when your output (dependent) variables of interest are continuous in nature (e.g.,
weight, temperature, height, length, etc.), and when no lagged predictors are involved in the model (i.e., when the problem
type is not a time series analysis).

Classification. Select this option button when your output (dependent) variables are categorical in nature (e.g., Gender).

Time Series. Select this option when your output (dependent) variables are continuous in nature, and may involve lagged
(over time) predictions. When this problem type is selected, the variable selection dialog will permit selection of identical
continuous variables as both input and output (independent and dependent variable) for the network.

Cluster Analysis. Select this option to perform unsupervised learning, to detect clusters in the data. When this option is
chosen, one or more output variables can be selected that will be used to label the clusters determined in the analyses.

Variables. Click the Variables button to display the variable selection dialog. The types and numbers of variables that you
can select here depend on the Problem type selected at the point when you chose the Variables option. See the description of
the Problem type options (above) for additional details.

Select analysis. The Quick tab offers two types of analyses that can be performed. Review the options on the Advanced tab
to see all available operations that can be requested to create new networks, apply existing networks, manage networks,
generate code, etc.


Intelligent Problem Solver. The Intelligent Problem Solver is a specialized tool to analyze the data and generate neural
networks for you, requiring minimal intervention on your behalf, and conducting all necessary phases of the analysis.

Custom Network Designer. The Custom Network Designer is a general tool allowing you to choose individual network
architectures and training algorithms to exact specifications.

Specify Selector variable codes. These options are only available (not dimmed) if a Selector variable was chosen for the
analysis. In that case, you can specify codes that identify the respective samples (as described below) for the analyses. As
with all single-code selection fields in STATISTICA modules, you can double-click on the respective edit field to review all
codes (text values) available in the Selector variable, and choose the desired code from the Variable Code Window. If no
Selector variable was chosen, then these options will not be available, and STATISTICA will typically apply random case
selection procedures during the training of the networks, to determine the samples necessary for the computations.

Train. Observations in the Training set will be used to train the network (i.e., to estimate the network weights and other
parameters).

Selection. Observations in the Selection set will be used to perform an "independent check" of the network performance
during training, to avoid over-fitting the data (i.e., to determine when to terminate training the network).

Test. Observations in the Test set will not be used during training of the network (estimation procedure) at all; instead,
the fully trained network will be applied to those cases as a final independent check of network performance.

Ignore. Observations in the Ignore set will be ignored both during training and in the final evaluation of the network, but
can be used later to execute the trained network for producing results.
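The four-way assignment of cases described above can be sketched as a simple random partition of case indices. This is an illustration only: the function name and the default proportions below are assumptions for the sketch, not the actual defaults used by STATISTICA Neural Networks.

```python
import random

def partition_cases(n_cases, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly assign case indices to Train/Selection/Test subsets.

    `fractions` gives the (train, selection, test) proportions; these
    values are illustrative assumptions, not the program's defaults.
    Cases not listed in any subset would correspond to the Ignore set.
    """
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_train = int(fractions[0] * n_cases)
    n_sel = int(fractions[1] * n_cases)
    return {
        "Train": sorted(indices[:n_train]),
        "Selection": sorted(indices[n_train:n_train + n_sel]),
        "Test": sorted(indices[n_train + n_sel:]),
    }
```

For example, partition_cases(100) yields three disjoint subsets of 50, 25, and 25 cases that together cover all 100 case indices.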

Neural Networks Startup Panel - Advanced Tab


Select the Advanced tab of the STATISTICA Neural Networks (Startup Panel) to access options to specify various types of
analyses. This tab also provides access to many advanced features of SNN. When specifying variables via the options on this
tab, as opposed to the Variables option on the Quick tab, no constraints are imposed on the nature of variables that can serve
as input or output for the networks. Refer also to the description of the SNN Startup Panel for details regarding the general
structure of the SNN module, how sets of networks are managed in an active workspace, and how continuous and categorical
variables can be selected as the input or output (independent or dependent) variables for the analyses.

Note: Selecting variables to apply existing networks. The Select analysis list contains the option to run existing models,
i.e., to apply existing trained networks to new variables or data. In STATISTICA Neural Networks, variables in the list of
variables selected from the data file, and variables in the respective networks to identify the inputs and outputs, are matched
by name. In other words, if a network expects variables (variable names) "MyInputA" and "MyInputB" as two input
variables, and you select two variables from a new input data file named "NewInputA" and "NewInputB," you will have a
mismatch. The consequence of the mismatch will be that no data will be read, and when you get to the Results dialog, only
those options are accessible that allow you to inspect the network and network architecture, or to run user-defined values
through the network. However, you cannot compute predicted values for the values in the new input data file based on the
values found in variables "NewInputA" and "NewInputB." On the other hand, when the variable names of selected variables
do match the labels for the inputs and outputs for a given network, then the order in which these variables are selected into a
list will not matter. Hence, by enforcing this "matching-of-variables-by-names" convention, the program protects the user
from selecting the wrong variables for a network, or the right variables in the wrong order, and thus from computing
erroneous results. Of course, the names of the input and output variables expected by networks in the current workspace can
be changed via the Neural Network Editor dialog, accessed by selecting Model Editor.
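The matching-of-variables-by-names convention described above can be sketched as follows. The helper below is hypothetical (not part of STATISTICA); it simply shows that matching is by name, that selection order does not matter, and that a name mismatch prevents the data from being used.

```python
def match_variables(network_inputs, selected):
    """Match selected data-file variables to a network's expected inputs
    by name. The order of `selected` does not matter; a missing name is
    a mismatch, in which case no data can be run through the network."""
    available = set(selected)
    missing = [name for name in network_inputs if name not in available]
    if missing:
        raise ValueError("No match for network inputs: %s" % missing)
    # Return the matched names in the order the network expects them.
    return list(network_inputs)
```

With this sketch, selecting "MyInputB" and "MyInputA" (in either order) satisfies a network expecting "MyInputA" and "MyInputB", whereas selecting "NewInputA" and "NewInputB" raises a mismatch.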

Variable Types. Click this button to display a standard three-variable selection dialog. On this dialog you can specify the
variables that are to be used in the analyses as continuous variables (e.g., Weight, Length, etc.) and as categorical or class
variables (e.g., Gender), and a subset variable. Refer also to the SNN Startup Panel for a description of the different types of
variables that can be specified. Note that you must first specify the variable types for the variables in the analyses before
assigning them to either the input or output for neural networks.

Input-Output Variables. If no Variable Types have been specified when you choose this option, the Variable Types
variable selection dialog will first be displayed. After identifying the continuous and categorical variables to be used in the
analysis, you can then specify which ones are to be used for the output of neural networks (i.e., which ones are to be
predicted), and which ones are to be used as input for the neural networks.

Specify the subset variable codes. These options are only available (not dimmed) if a subset variable is chosen for the
analysis. In this case, you can specify codes that identify the respective samples (as described below) for the analyses. As
with all single-code selection fields in STATISTICA modules, you can double-click on the respective edit field to review all
codes (text values) available in the subset variable, and choose the desired code from the Variable Code Window. If no
subset variable was chosen, then these options will be dimmed, and the program will typically apply random case selection
procedures during the training of the networks, to determine the samples necessary for the computations.


Train. Observations in the Training set will be used to train the network (i.e., to estimate the network weights and other
parameters).

Selection. Observations in the Selection set will be used to perform an "independent check" of the network performance
during training, to avoid over-fitting the data (i.e., to determine when to terminate training the network).

Test. Observations in the Test set will not be used during training of the network (estimation procedure) at all; instead,
the fully trained network will be applied to those cases as a final independent check of network performance.

Ignore. Observations in the Ignore set will be ignored both during training and in the final evaluation of the network, but
can be used later to execute the trained network for producing results.

Issue message if missing cases are found. If this check box is selected, you will be alerted if missing values are
encountered in the variables selected from the data set.

Maximum data size, in MB. Select this check box to limit the maximum data size that can be processed; note that very
large data problems may require significant memory and processing resources; modify the default only as needed.

Select analysis. Select the analysis you want to perform from the Select analysis list:

Intelligent Problem Solver. The Intelligent Problem Solver is a specialized tool to analyze the data and generate neural
networks for you, requiring minimal intervention on your behalf, and conducting all necessary phases of the analysis.

Custom Network Designer. The Custom Network Designer is a general tool allowing you to choose individual network
architectures and training algorithms to exact specifications.

Create New Ensemble. Choose this option to form networks into cooperating ensembles, which may have superior
generalization performance (prediction on new data).

Run Existing Model. Choose this option to apply existing networks or ensembles to new cases or user-defined values (see
also the note above on Selecting variables to apply existing networks).

Feature Selection. Choose this option to run the auxiliary feature selection algorithms. SNN has feature selection (i.e. choice
of independent (input) variables) built into the Intelligent Problem Solver and Custom Network Designer. This additional
analysis includes specialized, slower algorithms that sometimes yield better results.

Code Generator. Choose this option to generate a C-language or STATISTICA Visual Basic code version of your neural
networks and ensembles for integration into your own programs. This is an optional, separately licensed feature.

Retrain Network. Choose this option to retrain an existing network.

Network Set Editor. Choose this option to display the Neural Network File Editor dialog, which is used to view and edit
the current network set. Summary details on the networks and ensembles are available on the Networks/Ensembles tab.

Model Editor. Choose this option to edit a model; either the Neural Network Editor or the Edit Ensemble dialog is
displayed. The Network Editor includes a Train button. Clicking this button is equivalent to using the Custom Network
Designer, but with the first stage (to create the network) omitted.

Neural Networks Startup Panel - Networks/Ensembles Tab

Select the Networks/Ensembles tab of the STATISTICA Neural Networks (SNN) Startup Panel to access options to review the
networks and network ensembles currently available in the workspace (see the Startup Panel for details regarding the general
structure of the SNN module) and to save them to a network (.snn) file. Use the Open Network File option to open an
existing network file into the current workspace.

Save Network File As... Click this button to save the networks or ensembles of networks in the current workspace to a new
network file (file name extension .snn).

Save Network File. Click this button to save the networks or ensembles of networks in the current workspace to the network
file that was previously opened, or to which you previously saved the current networks (via the Save Network File As...
option).

Open Network File. Click this button to open an existing network file. The networks and ensembles of networks in this file
will replace the networks and ensembles of networks currently in the workspace. Therefore, be careful to save any current
work that you want to retain for later work.

Networks/Summary of Networks. The currently available networks are displayed in the list under the title Networks. To
view the summary statistics for the networks in a standard results spreadsheet, click the Summary of Networks button. The
summary statistics for networks consist of:

Index. This is a unique life-long number assigned to each neural network when it is created. The indices are assigned in
chronological order.

Lock. Indicates whether the network is locked to prevent accidental deletion.

S/A. Indicates whether the network is standalone. Standalone networks are models in their own right, which can optionally
also be part of an ensemble. Non-standalone networks are members of ensembles that are not considered to be models in
their own right.

Refs. The number of references to a network; that is, the number of ensembles that contain the network. A single network
may be a member of several ensembles. Not applicable to ensembles.

Profile. This is the most useful summary statistic, packing a great deal of information into a short piece of text. It tells you
the network type, the number of input and output variables, the number of layers, and the number of neurons in each layer.
The general format is:

<type> <inputs>:<layer1>-<layer2>-<layer3>:<outputs>

where the number of layers may vary. For example, the profile MLP 2:2-3-1:1 signifies a Multilayer Perceptron with two
input variables and one output variable, and three layers of 2, 3, and 1 units respectively. For simple networks, the number of
input variables and output variables may match the number of neurons in the input and output layers, but this is not always
so, as we shall see in due course. See the description of the Profile string for additional details.
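As a rough illustration of the format above, a profile string can be unpacked into its parts like this. This is a sketch only, not code from the product, and it assumes the simple <type> <inputs>:<layers>:<outputs> layout described here.

```python
import re

def parse_profile(profile):
    """Parse a network profile such as 'MLP 2:2-3-1:1' into its parts:
    network type, input variable count, units per layer, output count."""
    m = re.fullmatch(r"(\w+)\s+(\d+):([\d-]+):(\d+)", profile)
    if not m:
        raise ValueError("Unrecognized profile: %r" % profile)
    net_type, inputs, layers, outputs = m.groups()
    return {
        "type": net_type,
        "inputs": int(inputs),
        "layers": [int(units) for units in layers.split("-")],
        "outputs": int(outputs),
    }
```

Applied to the example profile MLP 2:2-3-1:1, this yields type "MLP", two inputs, one output, and three layers of 2, 3, and 1 units.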

Train Perf/Select Perf/Test Perf. These columns give the performance of the networks on the training, selection, and test
subsets respectively. You should not give too much credence to the performance rate reported on the training set, which is
often deceptively good (indicating over-learning). Also, you should avoid using the test set performance to select models, as
that defeats the object of having it (which is to maintain some data not used for training or model selection, so that a
dispassionate final assessment of performance can be made). The meaning of the performance measure depends on the
network type. For a classification network, it is the proportion of cases in the subset correctly classified. For a regression
network, it is the ratio of the standard deviation of the prediction errors to the standard deviation of the observations (so
lower values are better).
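The two performance measures could be computed along the following lines. This is a sketch; in particular, reading the regression measure as the error standard deviation divided by the observation standard deviation is an interpretation of the description above, not a verified formula from the product.

```python
import statistics

def classification_performance(observed, predicted):
    """Proportion of cases in the subset correctly classified."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)

def sd_ratio(observed, predicted):
    """Regression performance as assumed here: standard deviation of the
    prediction errors over the standard deviation of the observations.
    0 means perfect prediction; 1 means no better than the mean."""
    errors = [o - p for o, p in zip(observed, predicted)]
    return statistics.pstdev(errors) / statistics.pstdev(observed)
```

For example, predicting every observation exactly gives an S.D. ratio of 0, while predicting the mean for every case gives a ratio of 1.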

Train Error/Select Error/Test Error. Neural network training algorithms optimize an error function (e.g., the RMS error or
the cross entropy between observed and predicted outputs). These columns report the error rates on the respective subsets.
The error rate is less directly interpretable than the performance measure, but is of more significance to the training
algorithms themselves.
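For concreteness, two error functions of the kind mentioned above can be sketched as follows; these are standard textbook definitions, offered as illustrations rather than the product's exact formulas.

```python
import math

def rms_error(observed, predicted):
    """Root-mean-square error between observed and predicted outputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def cross_entropy(observed, predicted, eps=1e-12):
    """Mean binary cross entropy between 0/1 targets and predicted
    probabilities; `eps` guards against log(0)."""
    total = 0.0
    for o, p in zip(observed, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total += -(o * math.log(p) + (1 - o) * math.log(1 - p))
    return total / len(observed)
```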

Training. This is a brief description of the training algorithm used to train the network. A typical code might read:
BP200CG102b, which signifies "two hundred epochs of back propagation, followed by one hundred and two epochs of
conjugate gradient descent, at which point training was terminated due to over-learning and the best network in the training
run retrieved." Refer to Training String for additional details.
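A training string of the kind shown above could be split into its algorithm phases like so. This is a sketch based solely on the example given; in particular, reading the trailing "b" as "best network in the run retrieved" is an assumption, not a documented rule.

```python
import re

def parse_training_string(s):
    """Split a training string such as 'BP200CG102b' into
    (algorithm, epochs) phases. The trailing 'b' is interpreted here
    (an assumption) as 'best network in the training run retrieved'."""
    best_retrieved = s.endswith("b")
    if best_retrieved:
        s = s[:-1]
    phases = [(alg, int(epochs)) for alg, epochs in re.findall(r"([A-Z]+)(\d+)", s)]
    return phases, best_retrieved
```

Applied to "BP200CG102b", this yields 200 epochs of back propagation followed by 102 epochs of conjugate gradient descent, with the best network retrieved.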

Note. A short text description that you can enter using the Network Editor dialog.

Inputs. The number of input variables to the model (also displayed as part of the profile).

Hidden (1)/Hidden (2). The number of hidden units in the network.

Ensembles/Summary of Ensembles. The currently available network ensembles are listed under the header Ensembles; the
summary details reported there can also be displayed in a standard results spreadsheet by clicking the Summary of
Ensembles button. These summary details are for the most part identical to those reported for individual networks. However,
in addition the program will report:

Members. This is a list of the networks in an ensemble.

Intelligent Problem Solver


Intelligent Problem Solver Overview

How the Intelligent Problem Solver Works

Intelligent Problem Solver Dialog


Intelligent Problem Solver - Quick Tab

Intelligent Problem Solver - Retain Tab

Intelligent Problem Solver - Types Tab

Intelligent Problem Solver - Complexity Tab

Intelligent Problem Solver - Thresholds Tab

Intelligent Problem Solver - Time Series Tab

Intelligent Problem Solver - MLP Tab

Intelligent Problem Solver - Feedback Tab

Intelligent Problem Solver - Progress dialog

Intelligent Problem Solver Overview


The Intelligent Problem Solver (IPS) is a sophisticated tool to help you create and test neural networks for your data analysis
and prediction problems. It designs a number of networks to solve the problem, copies these into the current Network Set,
and then selects those networks into the Results dialog, allowing you to test their performance in a variety of ways.

To launch the Intelligent Problem Solver, select Intelligent Problem Solver from the Neural Networks Startup Panel Quick
tab or Advanced tab, select variables from the variable selection dialog, and click the OK button. The Quick tab of the IPS
contains the minimum number of options you need to start using the tool, and is designed to be usable even if you have no
previous experience with neural networks.

Neural networks are much more complex models than the linear techniques used in traditional statistical modeling. They are
more difficult to optimize, and there are difficult design decisions to make, such as the right type and complexity of network
for the problem and the right input variables to use. The Intelligent Problem Solver uses some highly sophisticated
algorithms to solve these problems automatically. You simply specify the type of problem, the variables to use, the amount
of time to be spent designing the network, and the number of networks to be retained and added to the network set. The IPS
does the rest for you.

The other tabs contain options to control the design process more closely, including the assignment of classification
confidence levels and the selection of the type and complexity of networks created. These advanced options are suitable for
users already familiar with neural networks who will value the ability of the IPS to automatically conduct a large number of
experiments and select the best networks. The IPS is also more effective if the complexity of the search is constrained, and
advanced users may get better results by specifying some aspects of the design rather than allowing the basic version to
examine all aspects.

The Intelligent Problem Solver includes facilities to perform feature selection (that is, to determine which of the available
input variables should be used). This is an extremely difficult task in itself, and STATISTICA Neural Networks contains other
facilities for advanced users that can also be used to support feature selection; see Feature Selection, Sensitivity Analysis,
and Weigend Regularization.

See Replication of Results if you want to use networks created by the Intelligent Problem Solver as a basis for further
experiments that you conduct yourself or if you want to replicate the results obtained by the IPS.

See also, How the Intelligent Problem Solver Works.

How the Intelligent Problem Solver Works


In contrast to traditional linear techniques in statistics, there is no method currently known that will automatically locate the
optimal neural network to fit a particular data set. Neural network designers therefore traditionally run training algorithms a
number of times with a given "neural network design," selecting the best network (or perhaps a few of the best). Further, a
"design" must also be selected (i.e., the type of neural network, the number of input variables and hidden units, and the
settings of various control parameters in the training algorithms that can affect the final performance of the network).
Therefore, a number of experiments with different designs are conducted, and the best networks selected. During this
experimental process, the designer must guard against overlearning, using techniques such as "early stopping." Special
techniques such as regularization and sensitivity analysis can be deployed to help in the design process.

The Intelligent Problem Solver (IPS) follows a similar process, although in this case the "heuristic expertise" of a neural


network designer is replaced by search algorithms that use state-of-the-art techniques to determine the selection of inputs,
the number of hidden units, and other key factors in the network design. These search algorithms are interleaved so that the
IPS searches for optimal networks of different types (for example, Multilayer Perceptrons and Radial Basis Functions)
simultaneously.

The IPS can search indefinitely, although after some a priori unknown period of time, it is unlikely to make further progress.
(Exception: in certain simple cases, such as linear networks, the search may terminate itself.)

The IPS requires (and makes good use of) much more time when doing certain tasks, particularly feature selection
(automatic determination of inputs) and, to a lesser extent, complexity determination (automatic determination of number of
hidden units). If you have a large problem, with tens of potential input variables and thousands or tens of thousands of cases,
you may even find it beneficial to set the IPS to run overnight using the timed facility.

Click the OK button to run the Intelligent Problem Solver. If real-time feedback is enabled (by default, it is enabled) then the
Intelligent Problem Solver Progress dialog is displayed, which updates a spreadsheet showing you summary details of new
or improved models as they are created. You can terminate the IPS at any time by clicking the Finish button on that dialog.

When the IPS finishes, the Results dialog is displayed.

Intelligent Problem Solver


Select Intelligent Problem Solver from the Neural Networks Startup Panel Quick tab or Advanced tab, select variables from
the variable selection dialog, and click the OK button to display the Intelligent Problem Solver dialog. This dialog contains
eight tabs: Quick, Retain, Type, Complexity, Thresholds, Time Series, MLP, and Feedback. The options described here are
available regardless of which tab is selected. See also, Intelligent Problem Solver - Overview and How the Intelligent
Problem Solver Works.

OK. Click the OK button to run the IPS. If real-time feedback is enabled (by default, it is enabled) then the Intelligent
Problem Solver Progress dialog is displayed, which updates a spreadsheet showing you summary details of new or
improved models as they are created. When the IPS finishes, the Results (Run Models) dialog is displayed.

Cancel. Click the Cancel button to exit the Intelligent Problem Solver. Any selections made will be ignored.

Options. Click the Options button to display the Options menu.

Sampling. Click the Sampling button to display the Sampling of Case Subsets for Intelligent Problem Solver dialog, which
is used to specify how the cases in the data set should be distributed among the training, selection, and test subsets.

Intelligent Problem Solver - Quick Tab


Select the Quick tab of the Intelligent Problem Solver dialog to access the options described here.

Optimization time. Specify how long the Intelligent Problem Solver (IPS) should run. You can specify the duration by two
methods:

Networks tested. Select this option button and specify, in the adjacent box, how many network tests (iterations) the IPS
should perform. Type the amount into the box or use the microscrolls. A larger number of iterations allows the IPS to
conduct a more thorough search.

Irrespective of the number of iterations you specify, the IPS always uses the same search strategy. It interleaves independent
searches for each network type. Each search is conducted in a number of phases (specific to the network type, and also
dependent on some of the parameters specified, e.g., whether feature selection is occurring). The searches are structured so
that the early phases perform "rough and ready" selection that can find a reasonable solution quickly.

If you are unsure how long to run the IPS, you can always specify a large number of tests, then monitor the progress report
and terminate it when a satisfactory solution is reached or no progress is made for a significant period of time.

Hours/minutes. Select this option button to search for a specified duration. This feature is particularly useful in allowing
you to exploit time when you would otherwise not use your computer - for example, you can run the Intelligent Problem
Solver overnight, during lunch or meetings, or in other periods of absence. This is particularly recommended for large scale
problems. For smaller problems, especially those with few cases, you should be wary of running the Intelligent Problem
Solver for extremely long periods in order to select networks with marginal improvements in selection performance. In this
case, a random sampling effect kicks in - if sufficient models are tested, some with unusually high selection performance
may be discovered, which does not indicate any true improvement in performance. You can verify whether this effect has
occurred by checking that the test set performance has not deteriorated in comparison with networks trained using a quicker
search.
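The random sampling effect described above can be illustrated numerically: even if every candidate model has the same true skill and its measured selection performance is pure noise, the best observed score still rises as more models are tested. The simulation below is an illustrative assumption (arbitrary mean and spread), not a model of the IPS itself.

```python
import random

def max_selection_score(n_models, seed=0):
    """Best observed 'selection performance' among n_models candidates,
    where each score is pure noise around the same true skill (0.7).
    Any apparent improvement with larger n_models is a sampling artifact."""
    rng = random.Random(seed)
    return max(rng.gauss(0.7, 0.05) for _ in range(n_models))
```

Comparing, say, max_selection_score(10) with max_selection_score(1000) shows the larger search reporting a better "best" score with no genuine gain in skill, which is exactly why the test set performance should be checked.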

Networks retained. In this box, specify how many of the networks tested by the IPS should be retained (for testing, and
then insertion into the current network set).

By default, the IPS tries to maintain a diverse set of networks (i.e., representatives of different types and different
complexities, where complexity refers to the number of input variables and hidden units). You can customize the retention
algorithm on the Retain tab.

If the number of networks you've specified would result in the network set becoming full, STATISTICA Neural Networks
will query whether you want to continue. If you do, STATISTICA Neural Networks may delete some existing models or
discard some of the new ones, depending on the options selected on the Neural Network File Editor - Replacement Options
tab.

Form an ensemble from retained networks. Ensembles are collections of neural networks that cooperate (by averaging or
voting) to form predictions. They often have better generalization performance than individual networks. If this check box is
selected, the output of the IPS is an ensemble containing the networks it generated, rather than the individual networks.
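The averaging and voting mentioned above amount to something like the following sketch (hypothetical helpers, shown only to make the two combination rules concrete).

```python
from collections import Counter

def ensemble_average(member_outputs):
    """Regression ensemble: average the member networks' predictions."""
    return sum(member_outputs) / len(member_outputs)

def ensemble_vote(member_classes):
    """Classification ensemble: majority vote over member predictions."""
    return Counter(member_classes).most_common(1)[0][0]
```

Averaging tends to cancel the uncorrelated errors of individual members, which is one reason ensembles often generalize better than any single network.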

Select a subset of independent variables. Select this check box to specify that the IPS should determine which variables to
use as inputs. The variables are chosen from the independent variables selected using the Variables button. Each network
tested may use a different combination of inputs. This option specifies that the IPS should conduct feature selection for you.

Intelligent Problem Solver - Retain Tab


Select the Retain tab of the Intelligent Problem Solver dialog to access the options described here.

Networks retained. In this box, specify how many of the networks tested by the Intelligent Problem Solver (IPS) should be
retained (for testing, and then insertion into the current network set). Type the amount into the box or use the microscrolls.

Criteria to select retained networks. Specify how to select which networks to retain.

Lowest error (on selection subset). Select this option button to keep the networks with the lowest error (i.e. the error
figure, as reported in the Network Set Datasheet, measured on the selection subset). However, this is frequently not the best
idea.

First, some network types may perform consistently better than others, and that may mean that you get no networks of
certain types, even though you would like to know how they perform.

Second, it is often worthwhile to trade a little marginal performance for a reduction in network complexity - the number of
input variables and hidden units, especially the number of input variables. Smaller networks are more efficient, generalize
more reliably, and, if they have fewer input variables, may also be cheaper to deploy.

Balance error against diversity. It is also instructive to compare the performance of networks with different input
variables. You can, therefore, specify that the Intelligent Problem Solver balance error against type and diversity by
selecting this option button, in which case it will preserve networks with a range of types and performance/complexity trade-
offs. See the Neural Network File Editor - Replacement Options tab for a description of how diversity is maintained.

If the network file is full. (Action if the new model is inferior to the candidate for replacement) Specify what to do if the
network set is full.

Increase the network file size. Select this option button to specify that the network set be increased in maximum size to
accommodate the new networks.

Replace selected models. Select this option button to specify that networks be conditionally replaced, using the
Replacement Options specified on the Network File Editor. Some or all of the networks discovered by the Intelligent
Problem Solver might be discarded entirely in this case, if they do not meet the standards of the networks currently in the
network set.

Save copy of all the networks generated in log file (*IPSLog.snn). If this check box is selected, every network trained by
the IPS is saved to a special network file, which has the same name as the data file, except with the postfix "_IPSLog.snn" in
place of the data file extension.

The log file may be useful if you conduct extremely long runs of the IPS, where there is a danger of the machine crashing as
a result of other activities happening at the same time. You will always be able to retrieve the work created by the IPS up to
the last network successfully tested.

file:///C:/Users/CERIn%2047/AppData/Local/Temp/~hhDEAB.htm 09/10/2018
Neural Networks Página 59 de 220

Intelligent Problem Solver - Types Tab


Select the Types tab of the Intelligent Problem Solver dialog to access the options described here.

Network types to test. Specify which types of network the Intelligent Problem Solver (IPS) should create and test by
selecting the respective check boxes. The IPS conducts separate interleaved searches for each type of network, using
optimization strategies appropriate for each type.

Linear. Linear networks implement a general linear model. Training is quite fast, and the networks are compact, but
performance will suffer if the problem domain exhibits non-linearities. They are mainly provided to give a point of
comparison for the "real" neural networks. This is a very important (and often neglected) step when building neural
networks. If your data is linearly related or very close to linearly related, then a Linear model will be superior to any neural
network, and smaller and more reliable as well. Quite a number of problem domains are approximately linear (otherwise the
technique would not be so widely used), and you are advised to stick with linear models if they prove superior to neural
networks.

PNN or GRNN. PNNs and GRNNs (Probabilistic Neural Networks and Generalized Regression Neural Networks) have
extremely fast training times (at least if the data file does not contain too many cases - less than a thousand, say), and
reasonable performance, but are large and slow in execution. They share and magnify the advantages and disadvantages of
Radial Basis Function Networks, in comparison to Multilayer Perceptrons. STATISTICA Neural Networks automatically
decides between the two types depending on whether this is a classification problem (in which case it uses a Probabilistic
Neural Network) or a regression problem (in which case it uses a Generalized Regression Neural Network).

Radial basis function. Radial Basis Function networks tend to be larger and slower to execute than Multilayer Perceptrons (see
below), and often have worse performance, but they train extremely quickly. They are also usually less effective than multilayer
perceptrons if you have a large number of input variables (they are more sensitive to the inclusion of unnecessary inputs).

Three-layer perceptron/Four-layer perceptron. The multilayer perceptron is the most common form of network. This
requires iterative training, which may be quite slow, but the networks are quite compact, execute quickly once trained, and in
most problems yield better results than the other types of networks.

Intelligent Problem Solver - Complexity Tab


Select the Complexity tab of the Intelligent Problem Solver dialog to access the options described here.

Number of hidden units. These boxes are available only if Radial basis function or one of the multilayer perceptron types is
selected on the Intelligent Problem Solver - Types tab. For each network type, specify the complexity of networks to be tested in terms
of a range of figures for the number of hidden units.

Specifying the number of hidden units exactly (i.e. by setting the minimum equal to the maximum) may be beneficial if you
know, or have good cause to suspect, the optimal number. In this case, it allows the Intelligent Problem Solver to
concentrate its search algorithms, and is particularly beneficial if you also turn off automatic determination of input
variables.

What effect does the number of hidden units have? In general, increasing the number of hidden units increases the
modeling power of the neural network (it can model a more convoluted, complex underlying function), but also makes it
larger, more difficult to train, slower to operate, and more prone to over-fitting (modeling noise instead of the underlying
function). Decreasing the number of hidden units has the opposite effect.

If your data is from a fairly simple function, or is very noisy, or you have too few cases, a network with relatively few
hidden units is preferable. If, in experimenting with different numbers of hidden units you find that larger networks have
better training performance, but worse selection performance, then you are probably over-fitting and should revert to smaller
networks.
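
The rule of thumb above can be sketched in a few lines (a hypothetical illustration with invented error figures, not IPS output): of the hidden-unit counts tried, keep the one with the lowest selection error rather than the lowest training error.

```python
def pick_hidden_units(trials):
    """trials: list of (hidden_units, training_error, selection_error).
    Select on hold-out (selection) error, not training error."""
    return min(trials, key=lambda t: t[2])[0]

trials = [
    (2, 0.30, 0.32),   # under-fitting: both errors high
    (5, 0.18, 0.20),   # good trade-off
    (12, 0.08, 0.27),  # over-fitting: training error falls, selection error rises
]
print(pick_hidden_units(trials))  # -> 5
```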

Intelligent Problem Solver - Thresholds Tab


Select the Thresholds tab of the Intelligent Problem Solver dialog to access the options described here. Classification neural
networks must translate the numeric level on the output neuron(s) to a nominal output variable. The options on this tab
govern the techniques used to make this translation. The technique is described in more detail in classification thresholds.

Classification thresholds.

Assign to highest confidence. (No thresholds.) This option is available only if the dependent variable is nominal with three
or more values. It indicates a "winner takes all" network.


Use the thresholds specified below. Specify the accept and reject thresholds explicitly here; see classification thresholds for
an explanation of these thresholds.

Accept/Reject. Specify the accept and reject thresholds here; these are used only if the Use the thresholds specified below option is selected.

Calculate minimum loss threshold. This is available only if the dependent variable is nominal with two values. A single
threshold (accept=reject) is determined to minimize expected loss (see classification thresholds). If PNNs are created, this
option is interpreted as Assign to highest confidence (PNNs always have multiple output neurons).

Loss. This is the loss coefficient used if the threshold is being automatically determined to minimize loss.
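
The single-threshold case can be sketched as a grid search (an illustrative simplification, not the algorithm SNN actually uses): each candidate threshold is costed on the data, with false negatives weighted by the loss coefficient, and the cheapest threshold wins.

```python
def min_loss_threshold(confidences, labels, loss=1.0, grid=101):
    """Two-class sketch: classify a case as positive when its confidence
    meets the threshold; a false negative costs `loss` times a false positive."""
    best_t, best_cost = 0.5, float("inf")
    for i in range(grid):
        t = i / (grid - 1)
        cost = sum(
            loss if (c < t and y == 1) else       # false negative
            1.0 if (c >= t and y == 0) else 0.0   # false positive
            for c, y in zip(confidences, labels)
        )
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

t = min_loss_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(0.2 < t <= 0.8)  # -> True (any threshold separating the classes is costless)
```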

Intelligent Problem Solver - Time Series Tab


Select the Time Series tab of the Intelligent Problem Solver dialog to access options to specify whether the networks are
designed for Time Series analysis (i.e. predict a variable from time-lagged (earlier) copies of the same and/or different
variables).

Treat problem as time series. Select this check box to specify that the networks be used for time series analysis. If this
check box is not selected, the Steps fields (see below) are disabled and ignored, and the value 1 is used for the steps.

Range for steps (number of time steps used as inputs).

Minimum/Maximum. Specify the range of values to be used for the steps parameter, which determines how many time-
lagged copies of the independent variables are fed as input to the neural networks.

Note. The IPS sets the lookahead parameter (the number of time steps after the input values that the output is predicted) to 1
for time series problems, thus performing single step ahead prediction.
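
The effect of the steps and lookahead parameters can be illustrated with a small helper that builds training cases from a univariate series (a hypothetical sketch, not SNN code):

```python
def make_lagged_cases(series, steps, lookahead=1):
    """Turn a series into (inputs, target) cases: `steps` consecutive lagged
    values predict the value `lookahead` steps after the last input."""
    cases = []
    for i in range(len(series) - steps - lookahead + 1):
        cases.append((series[i:i + steps], series[i + steps + lookahead - 1]))
    return cases

# steps=3, lookahead=1: values at t-3, t-2 and t-1 predict the value at t.
print(make_lagged_cases([1, 2, 3, 4, 5], steps=3))
# -> [([1, 2, 3], 4), ([2, 3, 4], 5)]
```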

Intelligent Problem Solver - MLP Tab


Select the MLP tab of the Intelligent Problem Solver dialog to access options to specify the form of encoding to be used for
the output of Multilayer Perceptrons. There are separate approaches available for classification and regression networks.

Classification output encoding. Specify the error function to be used in training a classification network. The available
options are:

Entropy based. Generates networks using cross entropy error functions. Such networks perform maximum likelihood
optimization, assuming that the data is drawn from the exponential family of distributions. This supports a probabilistic
interpretation of the output confidence levels generated by the network.

Sum-squared, logistic. This option has less statistical justification than cross entropy. The network learns a discriminant
function, and although the outputs can be treated as confidence measures, they are not probability estimates (and indeed may
not even sum to 1.0). On the other hand, such networks sometimes train more quickly than the entropic equivalents, the
training process is more stable, and the network may achieve a higher classification rate.
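
The two error functions can be written down directly for a single case with 1-of-N target encoding (a sketch of the textbook definitions; the function names are illustrative, not SNN's):

```python
import math

def cross_entropy(targets, outputs):
    # Maximum-likelihood error for 1-of-N targets and probability-like outputs.
    return -sum(t * math.log(o) for t, o in zip(targets, outputs) if t > 0)

def sum_squared(targets, outputs):
    # Discriminant-style error; the outputs need not sum to 1.0.
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

target, output = [1.0, 0.0, 0.0], [0.7, 0.2, 0.1]
print(round(cross_entropy(target, output), 3))  # -> 0.357
print(round(sum_squared(target, output), 3))    # -> 0.14
```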

Regression output encoding. Two approaches can be used to map the output variable in a regression problem. In both
cases, the sum-squared error function is used.

Linear. An identity activation function is used. This supports a substantial amount of extrapolation, although not an
unlimited amount (the hidden units will saturate eventually).

Logistic. A logistic activation function is used, with scaling factors determined so that the output range encountered in the
training set is mapped to 90% of the logistic function's output range. Consequently, only a small amount of extrapolation can
occur (of course, significant extrapolation from data is usually unjustified anyway).
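
The scaling can be computed from the training-set output range; a sketch, assuming the 90% figure means the central [0.05, 0.95] band of the logistic function's (0, 1) output range:

```python
def logistic_output_scaling(y_min, y_max, proportion=0.9):
    """Return (scale, offset) mapping [y_min, y_max] onto the central
    `proportion` of the logistic (0, 1) output range."""
    lo = (1.0 - proportion) / 2.0   # 0.05 for proportion 0.9
    hi = 1.0 - lo                   # 0.95
    scale = (hi - lo) / (y_max - y_min)
    return scale, lo - y_min * scale

scale, offset = logistic_output_scaling(10.0, 20.0)
print(round(10.0 * scale + offset, 6), round(20.0 * scale + offset, 6))  # -> 0.05 0.95
```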

Intelligent Problem Solver - Feedback Tab


Select the Feedback tab of the Intelligent Problem Solver dialog to access the options described here.

Summary details of networks. Specify how frequently the Intelligent Problem Solver (IPS) should output summary details
of the networks tested. The options are as follows:


None. There is no summary details feedback.

Final networks only. When the IPS terminates, a results spreadsheet is generated containing summary details on the
networks retained, provided that the Generate spreadsheet of summary details when finished option (see below) is also
selected.

Improved networks (real time). The IPS Progress dialog is displayed when the IPS is run. Whenever a network is
discovered that is an improvement (in terms of selection error) over any previous network discovered of the same type, a
row is added to the summary spreadsheet on the IPS Progress dialog.

All networks tested (real time). The IPS Progress dialog is displayed when the IPS is run. A row is added to the summary
spreadsheet for every network tested.

Generate spreadsheet of summary details when finished. When the IPS finishes, the summary spreadsheet is copied to a
results spreadsheet. This option has no effect if the Summary details of networks option (see above) is set to None.

Display summary message when finished. If this check box is selected, a message summarizing progress is generated and
displayed in a message dialog when the Intelligent Problem Solver finishes. The final message includes a summary of the
performance of the best network discovered by the Intelligent Problem Solver. This may, depending on the type of network,
be: the proportion of the selection cases correctly classified (e.g. 0.971 indicates that 97.1% of the selection cases not used in
training were correctly classified by the network), the S.D. ratio, and standard correlation coefficient between the predicted
and actual output values (for regression problems). In some classification problems, the area under the ROC curve is also
reported.

Copy summary message to Clipboard when finished. Copies the message described above to the system Clipboard for
pasting into another application (e.g. a word processing package).

Intelligent Problem Solver Progress Dialog


Click OK on the Intelligent Problem Solver dialog to display the Intelligent Problem Solver Progress dialog. This dialog is
displayed while the Intelligent Problem Solver runs, provided that you have selected the Improved networks (real time) or
All networks tested (real time) from Summary details of networks on the Feedback tab of the Intelligent Problem Solver
dialog.

The major feature of the dialog is a spreadsheet that reports summary details of networks as they are tested. There are also
buttons to prematurely terminate training, either canceling the run or stopping further testing and committing the results so
far.

While training is in progress, the Intelligent Problem Solver Progress dialog also displays, in addition to the summary of
networks, the percentage of training accomplished and the time taken to achieve it.

Progress spreadsheet. The spreadsheet contains summary details of the networks so far tested by the Intelligent Problem
Solver. Only a subset of the full summary details is included - those that are relevant to network training. Depending on the
setting in the Summary details of networks on the Feedback tab of the Intelligent Problem Solver dialog, a line may be added
to this spreadsheet either for each network tested, or only for networks with lower selection error than previously-generated
networks of the same type.

Finish. Click this button at any point during training to prematurely finish training. The last test will be completed, networks
selected from those so-far generated, and the Results dialog displayed.

Cancel. Click this button at any point to abort training. The Intelligent Problem Solver stops as soon as possible, any
networks already generated are discarded, and the Intelligent Problem Solver dialog is displayed again.

Custom Network Designer


Custom Network Designer Dialog

Custom Network Designer - Quick Tab

Custom Network Designer - Units Tab

Custom Network Designer - PNN Tab

Custom Network Designer - Time Series Tab


Custom Network Designer


Select Custom Network Designer from the Neural Networks Startup Panel Quick tab or Advanced tab, select Output and
Input Variables from the variable selection dialog, and click the OK button to display the Custom Network Designer dialog.
This dialog contains three tabs: Quick, Units or PNN (depending on the network type specified on the Quick tab), and Time
Series. If a Linear network is specified, only the Quick tab and Time Series tab are displayed. The Custom Network Designer
is used to specify the type and create individual neural networks. The options described here are available regardless of
which tab is selected.

OK. Click the OK button to create the network and display a training dialog that corresponds to the network type.

Cancel. Click the Cancel button to exit the Custom Network Designer dialog without creating a network.

Options. Click the Options button to display the Options menu.

Edit. Further customization is available by clicking the Edit button to display the Neural Network Editor dialog. Use the
options on this dialog to fine-tune details such as variable conversion functions and network synaptic functions.

Custom Network Designer - Quick Tab


Select the Quick tab of the Custom Network Designer dialog to access options to specify the network type.

Network type. Select the Network type from the options in this group box.

Multilayer Perceptron. Multilayer Perceptrons are one of the most popular network types, and in many problem domains
seem to offer the best possible performance. They are trained using iterative algorithms, of which the best known is back
propagation. See the Multilayer Perceptrons Overview for more details.

Radial Basis Function. Radial Basis Function networks combine a single radial hidden layer with a dot product output
layer. The hidden layer neurons act as cluster centers, grouping similar training cases, and the output layer forms a
discriminant function or regression. Since the clustering transformation is non-linear, a linear output layer is sufficient to
perform an overall non-linear function. See the Radial Basis Function Overview for more details.

Probabilistic Neural Network. Probabilistic Neural Networks form a kernel-based estimation of the class in classification
problems, using the training cases as exemplars. Every training case is copied to the hidden layer of the network, which
applies a Gaussian kernel. The output layer is then reduced to a simple addition of the kernel-based estimates from each
hidden unit. Optionally, a fourth layer can be added that contains the coefficients of a loss matrix. See the Probabilistic
Neural Networks Overview for more details.
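
The summation described above can be sketched in a few lines (a toy illustration of the kernel idea, with a hypothetical smoothing factor; not SNN's implementation):

```python
import math

def pnn_classify(x, exemplars, smoothing=0.5):
    """exemplars: list of (vector, class_label), one hidden unit per case.
    Sum a Gaussian kernel per class and return the class with the largest total."""
    totals = {}
    for v, label in exemplars:
        d2 = sum((a - b) ** 2 for a, b in zip(x, v))
        totals[label] = totals.get(label, 0.0) + math.exp(-d2 / (2 * smoothing ** 2))
    return max(totals, key=totals.get)

train = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([1.0, 1.0], "B")]
print(pnn_classify([0.2, 0.1], train))  # -> A
```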

Generalized Regression Neural Network. Generalized Regression Neural Networks (GRNN) form a kernel-based
estimation of the regression surface. The output variable is usually numeric (Probabilistic Neural Networks (PNN) is used
for classification problems). Typically, like PNNs, GRNNs have one hidden unit per training case. However, unlike PNNs, it
is possible to train a GRNN with a smaller number of hidden units, which represent the centroids of clusters of known data.
These centers are typically assigned using K-Means. It is also of course permissible to sub-sample the available training
data. See the Generalized Regression Neural Networks Overview for more details.
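
The kernel-based regression estimate amounts to a distance-weighted average of the stored targets (a Nadaraya-Watson style sketch with a hypothetical smoothing factor; not SNN's implementation):

```python
import math

def grnn_predict(x, centers, smoothing=0.5):
    """centers: list of (vector, target); return the kernel-weighted
    average of the stored target values."""
    num = den = 0.0
    for v, target in centers:
        d2 = sum((a - b) ** 2 for a, b in zip(x, v))
        w = math.exp(-d2 / (2 * smoothing ** 2))
        num += w * target
        den += w
    return num / den

print(grnn_predict([0.5], [([0.0], 1.0), ([1.0], 3.0)]))  # -> 2.0 (equidistant centers)
```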

Self Organizing Feature Map. Self Organizing Feature Maps (SOFMs) are rather different from the other networks
available in STATISTICA Neural Networks, in that they are designed primarily for unsupervised learning. They are usually
trained using the Kohonen algorithm, which does not require any examples of the output variable in the data set. However,
other center assignment algorithms (e.g. K-Means) can also be used, and if a labeled output variable is available in the data
set, this can be used to apply center labels. See the Self Organizing Feature Maps Overview for more details.

Linear. Linear neural networks implement a basic linear model, used principally for regression (although they can be used
for classification, forming a simple linear discriminant). Linear models are equivalent to simple forms of neural network,
with no hidden layer. They are included in STATISTICA Neural Networks principally to allow you to compare linear and
non-linear modeling within the same framework. See the Linear Neural Networks Overview for more details.

Principal Components. Principal Components networks are simply linear networks used to perform Principal Components
Analysis. PCA is useful to reduce the dimensionality of data sets with many input variables. The output of the PCA network
can be copied to the data set and used to train a simpler network, which can subsequently be joined to the PCA network.

Clustering Network. Cluster Networks are actually a non-neural model presented in a neural form for convenience. A
cluster network consists of a number of class-labeled exemplar vectors (each represented by a radial neuron). The vectors are
assigned centers by clustering algorithms such as K-Means, and then labeled using nearby cases. After labeling, the center
positions can be fine-tuned using Learned Vector Quantization.


Note: Self Organizing Feature Map, Clustering, and Probabilistic Neural Network networks can only have a single output,
which must be nominal.

Custom Network Designer - Units Tab


Select the Units tab of the Custom Network Designer dialog to access the options described here. The options available
depend on the network type specified on the Quick tab.

Options available for Multilayer Perceptron networks:

Number of hidden layers. In this box, specify the number of hidden layers (zero, one, two, or three) in each network.
STATISTICA Neural Networks can create multilayer perceptron networks with more than three hidden layers by joining
networks in the Network Set Editor dialog.

Number of units per layer. In this box, specify the number of units in each layer.

Classification error function. Specify the error function to be used in training the network (classification problems only).
The output neuron activation levels provide confidence estimates for the output classes. It is desirable to be able to interpret
these confidence levels as probabilities.

Entropy. If a cross entropy error function is used (select the Entropy option button), the network performs maximum
likelihood optimization, assuming that the data is drawn from the exponential family of distributions.

Sum-squared. The alternative approach is to use a sum-squared error function (combined with the logistic output activation
function). This has less statistical justification - the network learns a discriminant function, and although the outputs can be
treated as confidence measures, they are not probability estimates (and indeed may not even sum to 1.0). On the other hand,
such networks sometimes train more quickly, the training process is more stable, and the network can achieve a higher
classification rate.

Regression output function. Two approaches can be used to map the output variable in a regression problem. In both cases,
the sum-squared error function is used.

Linear. The first approach uses an Identity activation function. This supports a substantial amount of extrapolation, although
not unlimited (the hidden units will saturate eventually).

Logistic-range. In the second approach a logistic activation function is used, with scaling factors determined so that the
range encountered in the training set is mapped to a specified proportion of the logistic function's (0,1) range (e.g. proportion
0.9 corresponds to [0.05, 0.95]). This allows a small amount of extrapolation (significant extrapolation from data is usually
unjustified anyway). Using the logistic function makes training stable.

Options available for Radial Basis Function networks:

Normalize numeric input variables and codebook vectors. The radial units in the hidden layer of the network act as
exemplar, or codebook, vectors representing cluster centers in the training data. By selecting this check box, you can specify
whether to normalize numeric input variables to a consistent scale (zero to one, using minimax mapping) before presentation
to the network (both during training and execution). Normalized input variables tend to produce better predictions, as scaling
differences between the variables are discounted. However, normalized variables are also less interpretable.
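
Minimax mapping is the usual linear rescaling to [0, 1] using the variable's training-set minimum and maximum; a one-function sketch:

```python
def minimax(values):
    """Rescale a numeric variable to [0, 1] from its observed min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(minimax([10.0, 15.0, 20.0]))  # -> [0.0, 0.5, 1.0]
```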

Hidden units. In this box, enter the number of hidden (radial) units.

Classification error function. See the description above, under Options available for Multilayer Perceptron network.

Options available for Generalized Regression Neural Networks:

Normalize numeric input variables and codebook vectors. See description above, under Options available for Radial
Basis Function network.

Hidden units. The number of hidden units in a GRNN is usually overridden to equal the number of training cases, using the
option on the GRNN Training dialog. Enter a number in the Hidden units box if you want to explicitly select the number of
hidden units.

Options available for Self Organizing Feature Map networks:

Normalize numeric input variables and codebook vectors. See description above, under Options available for Radial
Basis Function network.


Topological map dimensions - Width/Height. Enter the numbers in these boxes (or use the microscrolls) to specify the
dimensions of the topological map (output layer), which is laid out as a rectangular lattice.

Options available for Principal Components networks:

Normalize numeric input variables and codebook vectors. See description above, under Options available for Radial
Basis Function network.

Number of principal components. In this box, specify the number of principal components to be extracted (identical to the
number of output units in the PCA network).

Options available for Clustering networks:

Normalize numeric input variables and codebook vectors. See description above, under Options available for Radial
Basis Function network.

Number of cluster units. In this box, specify the number of cluster units in the output layer of the cluster network.

Custom Network Designer - PNN Tab


Select the PNN tab of the Custom Network Designer dialog to access the options described here. This tab is available only if
a Probabilistic Neural Network type is specified.

Normalize numeric input variables and codebook vectors. The radial units in the hidden layer of the network act as
exemplar, or codebook, vectors representing cluster centers in the training data. By selecting this check box, you can specify
whether to normalize numeric input variables to a consistent scale (zero to one, using minimax mapping) before presentation
to the network (both during training and execution). Normalized input variables tend to produce better predictions, as scaling
differences between the variables are discounted. However, normalized variables are also less interpretable.

Include loss matrix in PNN. Select this check box to specify that the PNN should include a loss matrix. The loss matrix is
stored as the fourth layer of the network, and can be edited on the PNN Training dialog.

Custom Network Designer - Time Series Tab


Select the Time Series tab of the Custom Network Designer dialog to access options to specify whether a network is
designed for time series analysis (i.e. predicts a variable from time-lagged (earlier) copies of the same and/or different
variables).

Create network for time series prediction. Select this check box to specify that the network be used for time series
analysis. If this box is not checked, the other fields on this tab are disabled, and the values 1 and 0 respectively are used
(which corresponds to a non-time series network).

Predict X steps ahead. Specify the number of time steps ahead of the lagged input values that the predicted output lies.

Steps used to predict. Specify the number of lagged time series values to provide as input to the network.

Example. To predict a variable at time t, from lagged values at t-3, t-2 and t-1, specify Predict X steps ahead 1, Steps used
to predict 3.

Note. Most applications of time series analysis use Predict X steps ahead 1, and the input variables are identical to the
output variables (usually, there is only one variable). In this case, the output(s) of the network can be combined with
previous input values, shifted one time step, and repeated predictions made; see Time Series Projection.
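
The feedback loop described in the note can be sketched as follows (a hypothetical helper, with a toy one-step model standing in for the trained network):

```python
def project_series(history, predict_one_step, n_steps, window=3):
    """Repeated single-step-ahead prediction: each prediction is appended
    to the series and the input window shifts one step forward."""
    series = list(history)
    for _ in range(n_steps):
        series.append(predict_one_step(series[-window:]))
    return series[len(history):]

# Toy stand-in for the network: predict the mean of the last window.
mean_model = lambda xs: sum(xs) / len(xs)
print(project_series([1.0, 2.0, 3.0], mean_model, n_steps=2))
# first projected value is mean(1, 2, 3) = 2.0; the next uses (2, 3, 2)
```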

Create/Edit Ensemble
Create/Edit Ensemble Dialog

Create/Edit Ensemble
You can use the Create/Edit Ensemble dialog for both creating and editing ensembles. To create a new ensemble, select
Create New Ensemble on the Advanced tab of the Startup Panel, and click OK. To edit an existing ensemble, select Model
Editor on the Advanced tab of the Startup Panel, and click OK. In the latter case, if no ensembles are selected from the
Networks/Ensembles tab of the Startup Panel, the Select Network or Ensemble dialog will be displayed first, from which you
can choose the ensemble you want to edit.

This dialog contains one tab: Create/Edit Ensemble. Use the options on this dialog to specify the member networks and the
ensemble type.

Type. In this drop-down list, select the ensemble type: Output or Confidence.

Output. Output ensembles are the most common and flexible type. They perform averaging for regression problems and
voting for classification problems, and can also combine networks with different output variables to produce multiple
variable predictions.

Confidence. Confidence ensembles perform averaging across output neuron activation levels and, consequently, can only be
used if all their member networks use the same output variable encoding. They are designed for classification problems, and
have the advantage that a composite confidence estimation can be formed (output ensembles can only provide a prediction,
as they use a voting procedure).

There is no reason to create a confidence ensemble for a regression problem, as the decision process is equivalent to an
output ensemble (the ensemble averaging is simply performed at a different stage).
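
The two combination rules for output ensembles can be sketched directly (illustrative helpers, with optional member weights):

```python
def ensemble_regression(predictions, weights=None):
    """Output ensemble, regression: weighted average of member predictions."""
    weights = weights or [1.0] * len(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

def ensemble_classification(predictions, weights=None):
    """Output ensemble, classification: weighted vote over class labels."""
    weights = weights or [1.0] * len(predictions)
    tally = {}
    for w, label in zip(weights, predictions):
        tally[label] = tally.get(label, 0.0) + w
    return max(tally, key=tally.get)

print(ensemble_regression([2.0, 4.0, 6.0]))      # -> 4.0
print(ensemble_classification(["A", "B", "A"]))  # -> A
```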

Note. In this box, specify a short textual note used to identify the ensemble in summary details lists.

Lock. Select this check box to prevent the ensemble being automatically deleted to make way for higher performance or
newer models when the network file is full. See Neural Network File Editor - Replacement Options tab for more details on
automatic deletion.

Weights. This box specifies the weight of the member network currently selected in the Members list (see below). By
default, all members have equal weight (1.0). However, you can adjust the weight. Network weights are used to bias the
voting or averaging procedure used to determine the ensemble's predictions from those of the individual members.

Members list. The list box at the bottom of the dialog displays summary details of the member networks of the ensemble.
The summary includes the general summary statistics of the network, as well as the weight. Select a network from the list
box to edit the weight.

Networks. Click this button to display the Select Neural Networks dialog, which is used to select the networks to be
included as members of the ensemble.

OK. Click the OK button to accept the option settings on this dialog and display the Results dialog.

Cancel. Click the Cancel button to return to the STATISTICA Neural Networks (SNN) Startup Panel. Any options selected
on the Edit Ensemble dialog will be ignored.

Options. Click the Options button to display the Options menu.

Multiple Model Selection


Select Networks and/or Ensembles Dialog

Select Networks and/or Ensembles - Models Tab

Select Networks and/or Ensembles - Options Tab

Select Networks and/or Ensembles Dialog


Click the Models button on the Results dialog or the Run Code Generator dialog to display the Select Networks and/or
Ensembles dialog. This dialog contains two tabs: Models and Options. Use the options on these tabs to select one or more
models. You can select the models on the Quick tab either by clicking on them in the list box or typing in the range of
indices in the text box. On the Options tab, you can specify which networks are shown on the list and whether selecting an
ensemble is equivalent to selecting the networks it contains as members.

OK. Click the OK button to confirm the current selection.

Cancel. Click the Cancel button to close the dialog and return to the previous dialog.

Select all. Click this button to select all the models displayed in the list box.


Select Networks and/or Ensembles - Models Tab


Select the Models tab of the Select Networks and/or Ensembles dialog to access the options described here.

Models list. This list displays summary details of all models (networks and ensembles) in the current network file, as
qualified by the options specified on the Options tab.

To select a single model, click on it. To select a range of models, click on the first one in the range, then hold down the SHIFT
key and click on the last one in the range. To extend an existing selection by adding a new model, or to remove a model
from the current selection, hold down the CTRL key and click on the model.

Range. This text box lists the currently selected models by index, separated by spaces. If there are several models with
contiguous indices, you can specify them as a range; e.g. 2-4 means models with indices 2, 3, and 4.
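For illustration, the range syntax can be sketched with a small parser (this is not the actual SNN code; the function name and token handling are assumptions):

```python
# Hypothetical sketch of the Range syntax: indices separated by spaces,
# with contiguous runs written as "lo-hi" (so "2-4" means 2, 3 and 4).
def parse_range(text):
    indices = []
    for token in text.split():
        if "-" in token:
            lo, hi = token.split("-", 1)
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            indices.append(int(token))
    return indices

print(parse_range("1 2-4 7"))  # [1, 2, 3, 4, 7]
```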

Select Networks and/or Ensembles - Options Tab


Select the Options tab of the Select Networks and/or Ensembles dialog to access the options described here.

Networks shown. Specify which networks should be listed for selection. By default, only Standalone networks (those that
are intended for use as models in their own right) are listed. Networks that are part of an ensemble, and not intended for use
in their own right, are not listed. However, you can alter this default so that All networks are listed, irrespective of standalone
status.

Selecting an ensemble. STATISTICA Neural Networks ensembles allow multiple networks to be grouped together to
form a group prediction, usually with lower generalization error than the individual members. Such an ensemble is usually
treated as a model in its own right. However, you can also use ensembles as a convenient way to group related networks, and
you may sometimes want to compare the performance of the member networks in an ensemble. Using this option, you can
specify whether selecting an ensemble selects just the ensemble itself for execution, the member networks for separate
execution (Select the network in the ensemble), or both the member networks and the ensemble (Select the
ensemble and its network).

Code Generator
Run Code Generator

Run Code Generator


The STATISTICA Neural Network Code Generator is an optional extra feature. It generates a source code version of a neural
network, which can then be compiled and integrated into your own programs for royalty-free distribution. However, you
must notify StatSoft before distributing programs that use generated code to your customers to ensure compliance with
licensing restrictions. Contact StatSoft for further details.

Select Code Generator from the Neural Networks Startup Panel - Advanced tab to display the Run Code Generator dialog.
This dialog contains one tab: Quick. Note that you need to have an existing neural network in your current analysis in order
to start the code generator. If you do not have a network in your current analysis, you can create one using either the
Intelligent Problem Solver or the Custom Network Designer. Alternatively you can load a network(s) from a previously
saved network file using the options on the Networks/Ensembles tab of the Startup Panel.

Language. Specify the programming language to which the network should be converted.

C/C++. Select this option button to generate network code in C language.

Generate test harness main procedure. If this check box is selected, in addition to the functions that execute the models,
the code generator will write a simple interactive program (main procedure) that allows you to test the models.

If you choose to generate a C language test harness, compile the output file as a Windows Console program.

STATISTICA Visual Basic (SVB). Select the STATISTICA Visual Basic (SVB) option button to generate network code in
Visual Basic language. You can place the code directly into a Visual Basic editor by clicking the Copy to editor/report
button (see below).

Predictive Model Markup Language (PMML). Select this option button to generate network code in Predictive Model
Markup Language (PMML) which is an XML-based language for storing information about fully trained (parameterized)
models, and for sharing those models with other applications. STATISTICA and WebSTATISTICA contain facilities to use
this information to compute predicted values or classifications, i.e., to quickly and efficiently deploy models (typically in
the context of data mining projects).

Generate code. Select an option in the Generate code group box to specify the output destination.

Copy to clipboard. Click this button to send the code generator output to the Clipboard (so that you can paste it into the
application of your choice).

Copy to editor/report. Click this button to send the code generator output to a report if you use the "C" language (see
Language options above), or to a Visual Basic editor should you choose to generate the code in Visual Basic.

Save to single file. Click this button to display a standard Save As dialog to send the output of the code generator to a single
file with a name that you specify.

Save to multiple files. Click this button to generate code for multiple models in multiple files. If you choose to generate
multiple files, when the Save As dialog is displayed, the default file name has a dollar '$' character in it. If you specify a
name with a dollar character, the multiple files generated will replace the dollar character with the index number of the
respective model. If you do not specify a dollar sign, the file names will each be given the model index number as a postfix.
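The file naming rule can be sketched as follows (illustrative only; the exact placement of the appended index relative to the file extension is an assumption):

```python
import os

# Hypothetical sketch of the multiple-file naming rule: '$' in the chosen
# name is replaced by each model's index; with no '$', the index is
# appended to the base name as a postfix.
def output_name(template, index):
    base, ext = os.path.splitext(template)
    if "$" in base:
        return base.replace("$", str(index)) + ext
    return base + str(index) + ext

print(output_name("net$.c", 2), output_name("net.c", 2))  # net2.c net2.c
```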

Cancel. Click the Cancel button to close the Run Code Generator dialog and return to the STATISTICA Neural Networks
(SNN) Startup Panel.

Options. Click the Options button to display the Options menu.

Select models. Click the Select models button to display the Select Networks and/or Ensembles dialog, where you select the
models for which code is to be generated.

Note: Accuracy. In the vast majority of cases, the code-generated network will perform, to all intents and purposes, just like
the original network. However, there may be some numerical inaccuracy in the results, and very occasionally this may be
significant.

The reason for the inaccuracy is that the floating point numbers stored in the binary network file cannot be written in text
form in the generated code file with absolute precision (in fact, they are precise to about 20 significant digits). During
execution of the network, inaccuracies may accumulate as the series of calculations are performed. In the majority of cases,
any accumulated errors remain in the last few significant digits and can be safely ignored.

An exception may occur if the weights in the network are very large, with detailed balance (that is, some very large positive
weights finely balance out very large negative ones). In particular, this situation can sometimes occur if you use radial basis
function networks for classification problems, with the output layer trained using the pseudo-inverse technique.
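The numerical effect can be demonstrated with a small sketch (the weight values are invented for illustration):

```python
# Demonstration of the detailed-balance problem: when large positive and
# negative weights nearly cancel, rounding them to a fixed number of text
# digits can destroy the small net contribution entirely.
true_pos = 1.0e12 + 0.5   # large positive weight
true_neg = -1.0e12        # large negative weight that nearly cancels it
exact = true_pos + true_neg              # 0.5

# Simulate writing the weights out with only 10 significant digits.
written_pos = float("%.10g" % true_pos)  # the +0.5 is rounded away
written_neg = float("%.10g" % true_neg)
approx = written_pos + written_neg       # 0.0: the result is lost entirely

print(exact, approx)  # 0.5 0.0
```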

The solution in this case is to retrain the output layer of the network using an iterative algorithm such as conjugate gradient
descent. Although somewhat slower than pseudo-inverse, the results are likely to be equally good, and the problem of
detailed balance should not occur.

Retraining Networks
Select Neural Network to Retrain

Select Neural Network to Retrain


Select Retrain Network from the Neural Networks Startup Panel - Advanced tab to display the Select neural network to
retrain dialog. This dialog contains one tab: Select Neural Network.

Summary details of the available models are displayed in the spreadsheet at the top of the dialog.

OK. Click the OK button to display a specification dialog in which you select options to retrain the network.

Cancel. Click the Cancel button to close the dialog and return to the Startup Panel.

Network Set Editor


Neural Network File Editor Dialog


Neural Network File Editor - File Details Tab

Neural Network File Editor - Networks Tab

Neural Network File Editor - Ensembles Tab

Neural Network File Editor - Replacement Options Tab

Neural Network File Editor - Advanced Tab

Select Network File

Select Pre-processing Network

Neural Network File Editor Dialog


Select Network Set Editor from the Neural Networks Startup Panel - Advanced tab to display the Neural Network File Editor
dialog, which is used to view and edit the current network set. This dialog contains five tabs: File Details, Networks,
Ensembles, Replacement Options, and Advanced.

Summary details on the networks and ensembles are available on the Networks and Ensembles tabs. Details of networks
include: unique index, profile (type, number of input and output variables, neurons in each layer, and time series parameters
if a time series network), error and performance measures, and training algorithms used on the network. Summary details for
ensembles are the same as for networks, except that the indices of the member networks are listed in place of the training
algorithm.

The options on the Replacement Options tab are used to prevent network sets from growing uncontrollably in size; it is easy
to create additional models in STATISTICA Neural Networks, and each can be stored in the network set. You can specify a
maximum number of models to be stored in the set, and specify what action should be taken if you try to generate more
models when the set is full (which can include warning you and/or selecting an old, poorly performing, or commonplace
model for deletion).

OK. Click the OK button to accept the options selected on this dialog.

Cancel. Click the Cancel button to exit this dialog, ignoring any changes made.

Options. Click the Options button to display the Options menu.

Neural Network File Editor - File Details Tab


Select the File Details tab of the Neural Network File Editor dialog to access the options described here.

Limit number of models (standalone networks and ensembles) allowed in the file. Select this check box to impose a
maximum number of models on the set. If the set already contains more models than the maximum, no models are deleted,
but no more can be added. The maximum refers to the number of standalone models (see below for an explanation of the
standalone concept).

Maximum. The entry in this box indicates the maximum number of models allowed in the file, if the check box described
above is selected. The Maximum refers to the number of standalone models. A network that is not in any ensemble is a
standalone model, as is an ensemble. A network that is in at least one ensemble can optionally be a standalone model, in
which case it is considered a model in its own right in addition to being a constituent part of an ensemble. See Ensembles.

Note. Enter in this box a short line of descriptive text to include in the network file.

Neural Network File Editor - Networks Tab


Select the Networks tab of the Neural Network File Editor dialog to access the options described here.

Specify in the drop-down list at the top of the dialog which networks should be shown in the network list box. You can
choose to display only Standalone networks (those treated as models in their own right), All networks, or only networks that
are In ensemble(s).

Network list. This list box shows summary details of the networks in the network file (i.e. those networks selected for
display in the drop-down option described above).

Click on a network in the network list to select it. When a model is selected, its note, training note, lock and standalone
status are transferred to the fields below the list (described below), and can be edited.

The details listed include:

Index. A unique identifier assigned when the network is created and preserved throughout its lifetime.

Lock. Indicates whether the network is locked to prevent accidental deletion.

S/A. Indicates whether the network is standalone. Standalone networks are models in their own right, which can optionally
also be part of an ensemble. Non-standalone networks are members of ensembles that are not considered to be models in
their own right.

Refs. The number of references to the network; that is, the number of ensembles that contain the network. A single network
can be a member of several ensembles.

Profile. A summary of the network structure. See Model Profiles.

Train Perf/Select Perf/Test Perf. The performance of the network on the subsets used during training. The performance
measure depends on the type of network output variable. For continuous variables (regression networks), the performance
measure is the Standard Deviation Ratio (see Regression Statistics). For nominal variables (classification outputs), the
performance measure is the proportion of cases correctly classified (see Classification Statistics). This takes no account of
doubt options, and so a classification network with conservative Accept and Reject thresholds (confidence limits) may have
a low apparent performance, as many cases are not correctly classified.

The vast majority of neural networks have only a single output variable, and the performance measures are reported on this
assumption. If you have a network with multiple output variables, the performance measure is with respect to the first output
variable only.
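As a rough sketch of the regression measure, assuming the Standard Deviation Ratio is the standard deviation of the residuals divided by that of the targets (see Regression Statistics for the authoritative definition):

```python
import statistics

def sd_ratio(targets, predictions):
    # Residual s.d. over target s.d.: 0 indicates a perfect fit, and a
    # value near 1 indicates no better than predicting the mean.
    residuals = [t - p for t, p in zip(targets, predictions)]
    return statistics.stdev(residuals) / statistics.stdev(targets)

print(sd_ratio([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 (perfect)
print(sd_ratio([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 1.0 (mean predictor)
```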

Train Error/Select Error/Test Error. The error of the network on the subsets used during training. This is less
interpretable than the performance measure, but is the figure actually optimized by the training algorithm (at least, for the
training subset). This is the RMS of the network errors on the individual cases, where the individual errors are generated by
the network error function, which is a function of the observed and expected output neuron activation levels (usually a
sum-squared or a cross-entropy measure); see Error Function for more details.
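One plausible reading of that computation can be sketched as follows, assuming the sum-squared error function (the cross-entropy case would substitute a different per-case function):

```python
import math

def case_error_sse(observed, target):
    # sum-squared error over a single case's output neurons
    return sum((o - t) ** 2 for o, t in zip(observed, target))

def subset_error(all_observed, all_targets):
    # RMS of the per-case error-function values across the subset
    per_case = [case_error_sse(o, t)
                for o, t in zip(all_observed, all_targets)]
    return math.sqrt(sum(per_case) / len(per_case))

print(subset_error([[1.0], [2.0]], [[1.0], [0.0]]))  # sqrt((0+4)/2)
```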

Training. Gives a brief summary of the training algorithm(s) used to optimize the neural network; described in more detail
below.

Note. A brief line of text that you can attach to a network for informational purposes.

Inputs. The number of input variables in the network. This is also reported in the profile.

Hidden 1/Hidden 2. The number of hidden units in the first and second layer. This is also reported in the profile. The
number of hidden units, together with the number of input variables, defines the complexity of the network. As a general
rule, it is desirable to use networks with as low a complexity as possible, consistent with good performance. Some network
types have no hidden layers (Linear, SOFM), or only one hidden layer (three layer Multilayer Perceptron, Radial Basis
Function), in which case a dash is displayed in one or both of these fields.

Note. In this box, you can edit the note (short line of descriptive text) associated with the current network.

Training. This box contains a concise description of the training algorithms used to optimize the network. It contains a
number of codes, which are followed by the number of epochs for which the algorithm ran (if an iterative algorithm), and an
optional terminal code indicating how the final network was selected. For example, the code CG213b indicates that the
Conjugate Gradient Descent algorithm was run for 213 epochs, and that the best network discovered during that run (the
one with the lowest selection error) was restored.

The codes are:

BP Back Propagation
CG Conjugate Gradient Descent
QN Quasi-Newton
LM Levenberg-Marquardt
QP Quick Propagation
DD Delta-Bar-Delta
SS (sub)Sample
KM K-Means (Center Assignment)
EX Explicit (Deviation Assignment)
IS Isotropic (Deviation Assignment)
KN K-Nearest Neighbor (Deviation Assignment)
PI Pseudo-Invert (Linear Least Squares Optimization)
KO Kohonen (Center Assignment)
PN Probabilistic Neural Network Training
GR Generalized Regression Neural Network Training
PC Principal Components Analysis

The terminal codes are:

"b" Best Network (the network with lowest selection error in the run was restored)
"s" Stopping Condition (the training run was stopped before the total number of epochs elapsed as a stopping condition
was fulfilled)
"c" Converged (the algorithm stopped early because it had converged; that is, reached and detected a local or global
minimum. Note that only some algorithms can detect stoppage in a local minimum, and that this is an advantage
not a disadvantage!)

The field is editable, so you can add other information if you want.
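For illustration, such a code string could be decoded mechanically as follows (a hypothetical helper; only a few of the algorithm codes are shown):

```python
import re

# Subset of the algorithm and terminal codes listed above.
ALGORITHMS = {"BP": "Back Propagation", "CG": "Conjugate Gradient Descent",
              "QN": "Quasi-Newton", "LM": "Levenberg-Marquardt",
              "QP": "Quick Propagation", "DD": "Delta-Bar-Delta"}
TERMINALS = {"b": "best network restored", "s": "stopping condition",
             "c": "converged"}

def decode_training_code(code):
    # two-letter algorithm, optional epoch count, optional terminal code
    m = re.fullmatch(r"([A-Z]{2})(\d+)?([bsc])?", code)
    if m is None:
        raise ValueError("unrecognized training code: " + code)
    algo, epochs, term = m.groups()
    return (ALGORITHMS.get(algo, algo),
            int(epochs) if epochs else None,
            TERMINALS.get(term))

print(decode_training_code("CG213b"))
# ('Conjugate Gradient Descent', 213, 'best network restored')
```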

Lock (prevents deletion/replacement until unlocked). Select this check box to prevent accidental deletion of the network.
If you have specified a maximum set size, and the network set becomes full, STATISTICA Neural Networks may look for
networks to replace when new models are generated. A locked network is never replaced in this fashion. You can still
explicitly delete locked networks.

Available as a stand alone model. Select this check box to specify whether a network that belongs to at least one ensemble
is also available as a standalone model in its own right. Networks that do not belong to any ensembles are always standalone,
and the check box is checked and disabled in this case.

A standalone network is listed by default in the models available to generate results, whereas a non-standalone network is
not. If you delete an ensemble, a non-standalone member network that is not a member of any other ensembles is
automatically deleted when the ensemble is deleted. A standalone network is not deleted when an ensemble containing it is
deleted.

Summary. Click this button to generate a spreadsheet containing the summary details listed in the network list box.

Clone. Click this button to create a copy of the current network. The copied network has its own index, and is unlocked and
standalone, but in all other respects is identical to the original network.

Neural Network File Editor - Ensembles Tab


Select the Ensembles tab of the Neural Network File Editor dialog to access details of the ensembles in the network file.

Ensemble list. The ensembles are listed in the ensemble list box at the top of the tab. Select an ensemble from this list box
by clicking on it. The selected ensemble's note and lock status are displayed below the list, as is a list of the members
of the ensemble.

The ensemble list contains the following summary details of the ensembles:

Index. A unique identifier assigned when the ensemble is created and preserved throughout its lifetime.

Lock. Indicates whether the ensemble is locked to prevent accidental deletion.

Profile. A summary of the ensemble structure. See Model Profiles.

Train Perf/Select Perf/Test Perf. The weighted average performance of the networks in the ensemble (on the subsets used
in training). Note that each network can actually use a different division into subsets; see Ensembles for further details.

Train Error/Select Error/Test Error. The weighted average error of the networks in the ensemble.

Training. Lists the indices of the networks included in the ensemble.


Note. A brief line of text that you can attach to an ensemble for informational purposes.

No. Members. The number of networks in the ensemble. This is also part of the Profile.

Note. In this box, you can edit the note (a brief line of descriptive text) associated with the current ensemble.

Lock. Select this check box to prevent accidental deletion of the ensemble when the file is full and additional models are
inserted, which may cause STATISTICA Neural Networks to look for models that can be removed to make space for the new
ones.

Network list. The network list displays details of the networks included in the currently selected ensemble. The majority of
these summary details are the same as in the Network list on the Networks tab. In addition, the weight of the network is
listed, which specifies the contribution made to the ensemble's prediction by each network; see Ensembles for more details.
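The weighted contribution can be sketched as follows (assuming, for illustration, a simple weighted average of member outputs for a regression ensemble):

```python
def ensemble_prediction(member_outputs, weights):
    # Weighted average of the member networks' outputs; each weight is
    # that member's contribution to the ensemble's prediction.
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, member_outputs)) / total

print(ensemble_prediction([1.0, 3.0], [1.0, 1.0]))  # 2.0
print(ensemble_prediction([1.0, 3.0], [3.0, 1.0]))  # 1.5
```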

Ensembles Summary. Click this button to transfer the details in the ensembles list to a results spreadsheet.

Member details. Click this button to transfer the details of the networks in the currently selected ensemble to a results
spreadsheet.

Clone. Click this button to make a copy of the current ensemble.

Neural Network File Editor - Replacement Options Tab


Select the Replacement Options tab of the Neural Network File Editor dialog to specify options to help control the growth of
the network file, in conjunction with the Limit number… check box and Maximum box on the File Details tab. These options
are only applicable if the Limit number… check box is selected.

If the network file is full (maximum is reached), analyses that generate additional models may need to either replace existing
models or be discarded. Using these options, you can specify that a warning message be displayed before any such special
action is taken; specify the criterion used to select a candidate existing model for replacement, and specify what to do if the
new model is inferior to the existing one.

When adding a model (network or ensemble) to a file that is full (maximum reached)

Inform user in advance of creating the model. If this check box is selected, a warning message is displayed before the
analysis is executed in such circumstances.

Criterion to select candidate model for possible replacement. Specify in this group box the criterion used to select a
candidate existing model for replacement. The available options are:

Try to maintain diversity. STATISTICA Neural Networks will attempt to preserve a balanced mix of network and ensemble
types and architectures, including diverse numbers of input variables and network sizes. Secondary to maintaining diversity,
poor performance models will be selected for replacement. The algorithm is explained in more detail in Technical Details.

Replace the oldest model. Models are selected according to their index, which indicates the creation order, with oldest
models replaced first.

Replace the highest error model. Models are selected for replacement according to the selection subset error (or, if there is
no selection subset, the training subset error); the highest error models are replaced first.

Action if the new model is inferior to the candidate for replacement. The options in this group box govern the action
taken if the model selected for replacement is actually superior (in terms of performance) to the new model potentially
replacing it. In this circumstance, you can choose either to have the existing model Replaced anyway, or to Discard the new
model.

Neural Network File Editor - Advanced Tab


Select the Advanced tab of the Neural Network File Editor dialog to access the options described here.

Delete models. Click this button to launch the model selection dialog: Select Networks and/or Ensembles. Select the
standalone models (networks and ensembles) that you want to delete, and click OK. The selected standalone models are
deleted.

l If you delete standalone networks that are also members of ensembles, these networks are not actually deleted, but their
standalone status is removed.


l If you delete all the ensembles containing a non-standalone network, then the network is also deleted.

l If you want to delete non-standalone networks, you should instead use the Edit Ensemble dialog to remove them from the
ensembles; once removed from all ensembles, a non-standalone network is automatically deleted.

Join networks. Click this button to join two networks together in a pipelined fashion. See Joining Networks for more
details.

Merge. Click this button to display the Select network file dialog, in which you can open a second network file, extract all
the models from that file, and append them to the current network set. The merged models are given new indices, but are
otherwise identical to the original models.

Note. If you have old version 3 ".NET" files, you can use the Merge option to import them into the latest version of
STATISTICA Neural Networks.

Technical Details. The diversity-maintaining algorithm works as follows:

The algorithm first "adds" the new network to the network set. It then looks for a network to delete, which may in fact be the
"added" network. As a consequence of this approach, the new network is treated on an equal footing with all other networks.

A list of candidates for removal is maintained. This is progressively narrowed down as described below.

Only unlocked networks are candidates for removal. Locked networks are not placed in the candidate list.

Underrepresented network types are removed from the candidate list. Only the most numerous network types (that is, the
joint maxima) are considered. As a consequence, if the new network is of the most numerous type, only networks of that
type are considered as candidates for removal.

Networks that have not been trained with selection data are preferred for removal, as the results of training are far less
reliable in this case. Consequently, if any candidates have not been trained with selection data, then those that have been
trained with selection data are removed from the candidate list.

The last stage attempts to maintain an "interesting" performance/complexity trade-off. If one network has both better
performance and lower complexity than another, then it is said to dominate that other network. If there exist any networks
that are dominated by others, then the candidate list is reduced to those networks that are dominated by others and do not
themselves dominate any.

The algorithm finally considers a list of non-dominating networks (that is, networks which, if listed in order of increasing
complexity, also have improving performance). The network with the best performance is never removed. All the remaining
networks might be considered valuable, given that they present a genuine trade-off between reduced complexity and reduced
performance. The algorithm attempts to maintain diversity, by calculating the performance versus complexity trade-off of
each such network with respect to the next most complex. The network with the worst trade-off is removed. The effect is to
prevent networks "bunching" with similar complexity and marginally different performance. In particular, if there are two
candidates with the same complexity, the inferior of the two will always be replaced rather than a less complex network with
lower performance.

In judging "complexity", the primary determinant is the number of input variables, as a reduction in the number of input
variables makes a network more practical (less need to gather data) and more informative (a better idea of which variables
are important) in addition to being more efficient and less prone to over-learning. The secondary determinant of complexity
is the number of hidden units, and this is used to break ties between networks with the same number of input variables.
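The dominance test described above can be sketched as follows (the model tuples and tie-breaking order are illustrative; performance is taken as higher-is-better):

```python
# A model is (performance, n_inputs, n_hidden); the number of inputs is
# the primary complexity determinant, hidden units the secondary one.
def complexity(model):
    return (model[1], model[2])

def dominates(a, b):
    # a dominates b: strictly better performance AND lower complexity
    return a[0] > b[0] and complexity(a) < complexity(b)

models = [(0.90, 3, 4), (0.85, 5, 8), (0.80, 5, 8)]
dominated = [m for m in models
             if any(dominates(o, m) for o in models)]
print(dominated)  # [(0.85, 5, 8), (0.8, 5, 8)]
```

Note that a model with equal complexity but better performance does not dominate here; as described above, same-complexity candidates are handled by the trade-off comparison instead.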

Select Network File


Click the Merge button on the Neural Network File Editor - Advanced tab to display the Select network file dialog, a
standard Open file dialog.

From this dialog, you can open any document of a compatible type. Using the Look in list box, select the drive and directory
location of the desired file. Select the file and click Open, or double-click the file name. You can also enter the complete
path of the document in the File name box.

Select Pre-processing Network


Click the Join networks button on the Neural Network File Editor dialog Advanced tab to display the Select pre-processing
network dialog. For details about joining networks and pre-processing, see Joining Networks and Neural Networks
Introductory Overview - Pre- and Post-processing.


Select neural network. Select the pre-processing network (which provides the input and succeeding layers of the joined
network) from this list.

OK. Click the OK button on the Select pre-processing network dialog to display the Select post-processing network dialog
(see below for option descriptions for this dialog).

Cancel. Click the Cancel button to close this dialog without selecting a pre-processing network and return to the Neural
Network File Editor dialog.

Select Post-processing Network

Click the OK button on the Select pre-processing network dialog to display the Select post-processing network dialog.

Select neural network. Select the post-processing network (which provides the output and preceding layers) from this
list.

OK. Click the OK button to join the networks.

Cancel. Click the Cancel button to close this dialog without selecting networks to join and return to the Neural Network File
Editor dialog.

Neural Network Editor


Neural Network Editor Dialog

Neural Network Editor Dialog

Neural Network Editor - Quick Tab

Neural Network Editor - Variables Tab

Neural Network Editor - Layers Tab

Neural Network Editor - Weights Tab

Neural Network Editor - Time Series Tab

Neural Network Editor - Advanced Tab

Neural Network Editor - Pruning Tab

Neural Network Editor - Thresholds Tab

Add Variables to Network

Add Units to Network

Delete Units from Network

Nominal Definition

Neural Network Editor Dialog


Neural Network Editor Dialog

Neural Network Editor - Quick Tab

Neural Network Editor - Variables Tab

Neural Network Editor - Layers Tab

Neural Network Editor - Weights Tab

Neural Network Editor - Time Series Tab


Neural Network Editor - Advanced Tab

Neural Network Editor - Pruning Tab

Neural Network Editor - Thresholds Tab

Neural Network Editor Dialog


Select Model Editor from the Neural Networks Startup Panel - Advanced tab, or click the Edit button on the Custom Network
Designer dialog to display the Neural Network Editor dialog. In the latter case, you can customize the network in detail
before proceeding to training. The Network Editor contains eight tabs: Quick, Variables, Layers, Weights, Time Series,
Advanced, Pruning, and Thresholds. Use the options on these tabs to directly edit a neural network, right down to the
detailed level of individual weights. You can also generate a variety of useful summary statistics about the network.

The summary details of the selected network are displayed in the list box at the top of the dialog. Although there is always
only a single network in this dialog, the list box is used to display the details for consistency with the similar list boxes used
to present summary information on networks in other dialogs.

OK. Confirm the edits and end the analysis by clicking the OK button.

Cancel. Click the Cancel button to close this dialog and return to the Custom Network Designer dialog.

Options. Click the Options button to display the Options menu.

Train. Click the Train button to confirm the edits and proceed to a training dialog appropriate to the network type.

Neural Network Editor - Quick Tab


Select the Quick tab of the Neural Network Editor to access the options described here.

Note. This box displays the short textual note that is presented in the summary details for each network.

Training. This box displays the training note that is presented in the summary details of each network. Although the training
note is automatically generated, you can alter it (however, if you retrain the network, any changes to the text that you make
will be overwritten).

Lock (prevents deletion/replacement until unlocked). Select this check box to lock the network against accidental
deletion; see Neural Network File Editor - Replacement Options tab for details of STATISTICA Neural Networks
replacement strategy when a network file is full.

Available as a stand alone model. Select this check box to specify that a network that belongs to at least one ensemble is
also available as a standalone model in its own right. Networks that do not belong to any ensembles are always standalone,
and the check box is disabled in this case.

A standalone network is listed by default in the models available to generate results, whereas a non-standalone network is
not. If you delete an ensemble, a non-standalone member network that is not a member of any other ensembles is
automatically deleted when the ensemble is deleted. A standalone network is not deleted when an ensemble containing it is
deleted.

Neural Network Editor - Variables Tab


Select the Variables tab of the Neural Network Editor to view and edit details of the input and output variables of the
network. The input and output variables are listed in separate list boxes, which display summary details on each variable.
You can select a variable from the list boxes and then edit the details using the options described below. There are also
buttons to add or remove variables.

The most common customization is to alter the Conversion function of some variables.

On execution, STATISTICA Neural Networks first converts input variables to numeric values for insertion into the network,
which involves conversion of nominal values, scaling of numeric values, substitution of missing values, and replication of
values if predicting time series values from a data set. This produces a numeric vector with one entry for each network input.
If an input neuron normalization technique has been defined (on the Neural Networks Startup Panel - Advanced tab - this is
extremely rare) it is then applied to transform the input vector. The input values are then fed into the neural network. On
completion, the output values of the network are de-scaled to produce numeric output variables, or interpreted to produce
nominal output variables. On the extremely rare occasions that one has been defined, the neuron normalization function is
applied to the network outputs before de-scaling and nominal variable interpretation.

During training of networks, input preprocessing is performed as above. Outputs are preprocessed before being used in
training. Output neuron normalization (from the Advanced tab - extremely rarely used) is ignored during training (neuron
normalization functions are typically not invertible, and so cannot be "reversed" for data preparation during training).

Scaling of inputs and training outputs involves multiplication by a scale factor, followed by the addition of a shift factor.

After network execution, de-scaling of outputs follows the reverse procedure: subtraction of the shift factor, followed by
division by the scale factor.
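The scaling and de-scaling described above are simple linear operations. A minimal sketch (hypothetical helper names, not the actual STATISTICA internals):

```python
def scale(raw, scale_factor, shift):
    # Network input: multiply by the scale factor, then add the shift factor.
    return raw * scale_factor + shift

def descale(scaled, scale_factor, shift):
    # Network output: subtract the shift factor, then divide by the scale factor.
    return (scaled - shift) / scale_factor

x = 8.0
assert descale(scale(x, 0.25, 1.5), 0.25, 1.5) == x  # exact round trip
```

The round trip holds because de-scaling applies the two linear steps in reverse order.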

You can provide explicit shift and scale factors if you want; however, usually conversion algorithms are defined that
calculate them automatically. Conversion algorithms are automatically executed by STATISTICA Neural Networks at the
beginning of training runs.

Nominal values require special handling on output. STATISTICA Neural Networks must determine whether the pattern of
corresponding network outputs is sufficiently like the pattern expected for a given nominal class. This decision is guided by
accept and reject thresholds.

Input/Output. Select a single variable from either the input variable or output variable list for editing by using the options
on the lower section of the dialog.

Name. This box displays the name of the variable. STATISTICA Neural Networks uses the name to match network inputs
and outputs to data set variables, and the variable names are automatically assigned when networks are created. Therefore, it
is unusual to edit the variable name. However, you may occasionally need to do so, for example, if you have renamed a data
set variable and want to update a network to use the new name.

Conversion function. STATISTICA Neural Networks converts each variable before use in neural network execution or
training. Different conversion methods are available for nominal and numeric variables. Nominal variables must be
converted to a suitable numeric form, and numeric variables must be scaled into a suitable range.

Most of the techniques use control factors to determine precisely how the conversion should take place: these control factors
are the Min/Mean and Max/SD factors (the names reflecting the fact that they have different meanings depending on the
conversion technique used).

The techniques available are:

One-of-N. Used with nominal variables. A variable with N possible values is converted into a set of N numeric values, with
one value set to indicate the type. For example, if the variable is Color ={Red, Green, Blue}; Red can be represented as
{1,0,0}, Green as {0,1,0} and Blue as {0,0,1}. The set and clear values used (1 and 0 in the above example) can be
customized using the Max/SD and Min/Mean fields respectively.

Two-state. Used with nominal variables that have only two states (e.g., Gender = {Male, Female}). The variable is converted
to a single numeric value, either set or clear (by default, 1 and 0 respectively; these values may be customized using the
Max/SD and Min/Mean fields).

Minimax. Used primarily with numeric variables. The raw values will be scaled linearly, so that the smallest value in the
training set is scaled to the Min value, and the largest value in the training set to the Max value. This is usually necessary as,
although most neural networks can accept input values in any range, they are only sensitive to inputs in a far smaller range.

Minimax can also be used with nominal variables, which are then represented by an ordinal value (1, 2, 3, etc.) to which linear
scaling is applied. This is occasionally appropriate if the number of nominal values for a variable is large, since the
alternative (one-of-N encoding) would create an extremely large network.

Mean/SD. For numeric variables only. The raw values will be scaled linearly, so that the mean value in the training set is
scaled to the mean value specified, and the standard deviation of the training set is scaled to the specified standard deviation.
This method can be used in preference to Minimax encoding, to which it is closely related (both methods ultimately perform
simple linear scaling; only the method of determining the scale factors differs).

Explicit. For numeric variables. The raw values will be scaled linearly, by addition of a shift factor followed by
multiplication by a scale factor. These linear factors are not automatically determined (by default, they are 0 and 1
respectively).

Explicit conversion can also be used on nominal variables, which can be thought of as ranging from 1 to N before
conversion takes place.

None. For numeric variables. The raw values are used directly. Can also be used on nominal variables, which are set to
values ranging from 1 to N.
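The conversion techniques listed above can be sketched as follows (illustrative code with hypothetical function names, not the actual STATISTICA implementation):

```python
def one_of_n(value, levels, set_val=1.0, clear_val=0.0):
    # One numeric value per level; the matching entry gets the "set" value
    # (Max/SD field), all others the "clear" value (Min/Mean field).
    return [set_val if lv == value else clear_val for lv in levels]

def minimax(raw, train_min, train_max, target_min=0.0, target_max=1.0):
    # Linear scaling: training-set minimum -> Min, training-set maximum -> Max.
    return target_min + (raw - train_min) * (target_max - target_min) / (train_max - train_min)

def mean_sd(raw, train_mean, train_sd, target_mean=0.0, target_sd=1.0):
    # Linear scaling: training-set mean -> target mean, training-set SD -> target SD.
    return target_mean + (raw - train_mean) * target_sd / train_sd

print(one_of_n("Green", ["Red", "Green", "Blue"]))  # [0.0, 1.0, 0.0]
print(minimax(5.0, 0.0, 10.0))                      # 0.5
```

Both Minimax and Mean/SD reduce to the same shift-and-scale form; only the way the factors are derived from the training set differs.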

Missing value fn. STATISTICA Neural Networks has the capability to deal with missing data values, both during network
training and execution. Missing values are handled by substituting a special value (or, in the case of one-of-N encoded
nominal variables, a set of values).

The special values used to patch missing values are determined from the training set when the network is trained.

The available methods are:

Mean. For numeric variables, the mean of the variable's values in the training set is used. For nominal variables, the
proportion of training cases in each class is used. This is usually the most appropriate method.

Median. For numeric variables, the median of the values in the training set is used. For nominal variables, this is treated
identically to mean substitution. May be superior to mean substitution for a numeric variable that contains strong outliers.

Minimum. For numeric variables, the minimum of the values in the training set. For nominal variables, the class assigned is
that with the lowest frequency in the training set.

Maximum. For numeric variables, the maximum of the values in the training set. For nominal variables, the class assigned
is that with the highest frequency in the training set.

Zero. A zero value is substituted.
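For a numeric variable, the substitution values these methods would derive from a training column can be sketched as follows (hypothetical helper; the real computation is performed by STATISTICA Neural Networks when the network is trained):

```python
import statistics

def missing_value(train_values, method):
    # Substitution value for a numeric variable, computed from the training set.
    if method == "mean":
        return statistics.mean(train_values)
    if method == "median":
        return statistics.median(train_values)
    if method == "minimum":
        return min(train_values)
    if method == "maximum":
        return max(train_values)
    if method == "zero":
        return 0.0
    raise ValueError(method)

col = [1.0, 2.0, 3.0, 100.0]         # strong outlier: median is more robust
print(missing_value(col, "mean"))    # 26.5
print(missing_value(col, "median"))  # 2.5
```

The example illustrates why median substitution may be preferred when a variable contains strong outliers.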

Missing code. In the data set, a missing value is typically indicated by a special numeric code. The missing value code for a
specific network input is recorded in this field.

Min/Mean, Max/SD. These values are used by the conversion methods to determine the target range of data. The default
values of zero and one are usually adequate.

Shift, Scale. The linear scaling factors used in transforming input and output variables to/from network input/output
neurons. With most conversion functions these are automatically generated by STATISTICA Neural Networks, based on the
range of the variable in the data set and the neuron type. If you select the Explicit conversion function for a variable, then
you should set the Shift and Scale factors manually.

Add. Click the Add button to display the Add Variables to Network dialog, which is used to add new variables to the
network. The new variables are added before or after the current variable. Therefore, select an input variable from the list
before clicking Add if you want to add input variables, or select an output variable first to add output variables. Adding
variables invalidates existing weights values in the network, which will consequently need retraining.

Delete. Click the Delete button to display the Delete variables dialog, which is used to delete variables from the network.
Select an input variable before clicking the button if you want to delete inputs; select an output variable first to delete
outputs. Select the variables you want to delete from the list on this dialog, and click OK to delete them.

Edit Nominals. Click this button to display the Nominal Editor dialog, to allow editing of the nominal definition of a
network variable. Select the variable from the list box before clicking this button.

Summary. Click this button to generate a summary spreadsheet that contains a copy of the information displayed in the
variable list boxes.

Neural Network Editor - Layers Tab


Select the Layers tab of the Neural Network Editor to edit the details of the layers in the network. You can alter the
activation function and add or delete neurons.

The Layer list displays details of all the layers in the network, including the number of units in the layer, and the synaptic
and activation functions. Select a layer from the list to edit the details.

Synaptic function. This drop-down list displays the synaptic function of the currently selected layer. See Synaptic
Functions.

Note. To permit maximum flexibility, it is possible to change the synaptic function. However, the consequences of this may
be unexpected, and it is not recommended that you do so.

Activation function. This drop-down list displays the activation function of the currently selected layer. You can select a
different activation function. Sensible choices of activation function are discussed in Activation Functions.

Delete units. Click this button to display the Delete Units from Network dialog.

Add units. Click this button to display the Add Units to Network dialog.

Delete layer. Click this button to delete a hidden layer.

Neural Network Editor - Weights Tab


The Weights tab of the Neural Network Editor displays the weights of the network. It is seldom possible to interpret
individual weights in a meaningful fashion; however, on occasions inspecting the weights may help you in diagnosing
problems with a particular network or training algorithm.

Weights spreadsheet. Each weight or threshold occupies a cell in the spreadsheet. The neuron feeding its output through the
weighted connection is shown in the column, and the neuron receiving input from the weighted connection is shown in the
row. Thus, a single row shows the fan-in to a particular neuron, and a single column the fan-out from a particular neuron.
Neurons are identified in the style L.N, where L is the layer number and N the neuron number within the layer.

If there is no connection between two neurons, the corresponding cell is empty. You can edit cells that are not empty,
directly altering a network weight or threshold.

Weights shown. Specify whether only the weights fanning in to a given layer should be shown (select the Layer option
button and specify in the accompanying field the particular layer you want displayed in the spreadsheet), or all the weights in
the network (select the All layers option button). In the latter case, significant parts of the matrix are empty (as there are
weights only between neurons in adjacent layers). If you choose to display the weights for a single layer, and then select
layer 1, no weights are shown (as layer 1, the input layer, does not have any fan-in weights).

Edit weights. Click this button to display a General User Entry Spreadsheet where you can edit the network weights
individually. You can only edit the non-empty cells. An empty cell indicates that there is no weight connection between two
neurons. Therefore, STATISTICA Neural Networks will ignore any editing made to the empty cells.

Weights spreadsheet. Click this button to transfer the contents of the weights spreadsheet to a STATISTICA results
spreadsheet.

Weights histogram. Click this button to generate a histogram of the weights displayed in the weights spreadsheet. This can
give some insight into the training of the network.

Neural Network Editor - Time Series Tab


Select the Time Series tab of the Neural Network Editor to access options to alter the time series factors of the network. For
a non-time series network, the Steps factor is 1, and the Lookahead is 0.

Steps. The steps factor specifies how many time-lagged copies of the input variable(s) are used as inputs to the network.

Lookahead. The lookahead factor specifies how far ahead of the last input factor the output is predicted.

Example: a time series network has Steps 3, Lookahead 2. If the input to the network is taken from time steps 9,10 and 11,
the output predicts the value at time step 13.
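The example above can be checked with a small sketch that builds the lagged input window and locates the target (illustrative only; STATISTICA performs this replication automatically during data conversion):

```python
def ts_case(series, last_input_step, steps, lookahead):
    # Inputs: `steps` consecutive time-lagged copies ending at last_input_step;
    # target: the value `lookahead` steps past the last input.
    inputs = [series[t] for t in range(last_input_step - steps + 1,
                                       last_input_step + 1)]
    target = series[last_input_step + lookahead]
    return inputs, target

series = list(range(100, 120))          # series[t] == 100 + t, for clarity
x, y = ts_case(series, last_input_step=11, steps=3, lookahead=2)
print(x)  # values at time steps 9, 10, 11 -> [109, 110, 111]
print(y)  # value at time step 13 -> 113
```

With Steps 3 and Lookahead 2, inputs taken from steps 9, 10, and 11 predict the value at step 13, exactly as in the example.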

Neural Network Editor - Advanced Tab


Select the Advanced tab of the Neural Network Editor to access specialized network factors that can occasionally be altered.

Note: The effects of altering these parameters can be unexpected - you should alter them only if you are entirely familiar
with these concepts.

Error function. This box specifies the network's error function.

Output interpretation. This box specifies whether the activation level of an output neuron is treated as a confidence or
error term. This is used to qualify the meaning of the accept and reject terms in classification, and should rarely if ever be
altered manually.

Pre-processing function/Post-processing function (neuron normalization functions). Neuron normalization functions
provide some further capabilities, which are not usually needed. They allow STATISTICA Neural Networks to model some
rare types of neural network that may be of interest to some neural network theoreticians. However, typically they are not
needed, and both the input and output neuron normalization functions default to None.

These neuron normalization functions act upon the entire vector of inputs or outputs, rather than upon individual input or
output variables. They are strictly numeric procedures.

Input neuron normalization is performed after input conversion/missing value substitution and before the network is
executed: it is responsible for further transformation of the numeric input values before submission to the network.

Output neuron normalization is performed before output scaling. Output neuron normalization is only performed during
network execution. Output neuron normalization is not used during network training since the functions used cannot (in
general) be inverted.

Although STATISTICA Neural Networks supports the combination of input and output neuron normalization and conversion,
in the specialized circumstances where input neuron normalization is used it is unlikely that conversion will also be
employed, and the conversion function None is typically selected for all variables.

Different forms of neuron normalization are typically used for inputs and outputs.

The techniques supported are:

None. The scaled data is passed directly to/from the network.

Unit length. The data is treated as a vector, and the components scaled so that the overall vector length is 1.0. This
technique is sometimes recommended for input cases to networks with radial units in the first layer; for example, SOFM
networks.

Unit sum. The data is scaled so that the sum of the values is 1.0. This may occasionally be appropriate with inputs if the
significant information is the relative size of the input values. It is useful with outputs if you want to interpret the output
values as probabilities of membership of a number of classes. In this case, of course, you must also ensure that the activation
function in the output layer guarantees that all outputs will be positive. However, STATISTICA Neural Networks supports
this type of interpretation more efficiently and easily using nominal output variables with One-of-N, Two-state, or Minimax
conversion function.

Maximum. Output neuron normalization only. The highest component is set to 1.0, and all others to 0.0. Sometimes known
as a "winner-takes-all" network.

Minimum. Output neuron normalization only. The lowest output value is set to 1.0, and all others to 0.0. Like Maximum,
but useful where outputs are interpreted as errors rather than confidence measures; for example, in a SOFM network.

Step. Output neuron normalization only. Positive outputs are set to 1.0, and negative ones to 0.0. A linear network with Step
neuron normalization acts like an ADALINE neural network.

Neural Network Editor - Pruning Tab


Select the Pruning tab of the Neural Network Editor to access pruning algorithms, which can be used to remove some input
variables and/or hidden units from a network.

Pruning threshold. Specify the threshold below which the fan-out weights of a neuron must lie in order for it to be pruned.

Prune input variables. Click this button to prune any input variables where the fan-out weights of all the variable's
associated input layer neurons are below the threshold. (A specific input variable may have a number of input neurons,
either because it is a nominal variable with a one-of-N conversion function, or because it is part of a time series network; it
is not meaningful to prune individual input neurons in such a case.)

Prune hidden units. Click this button to prune any hidden units where the fan-out weights all have magnitude below the
pruning threshold.

Summary of fan-out statistics. Click this button to generate a spreadsheet giving summary statistics on the fan-out weights
of each neuron.

Each row of the spreadsheet corresponds to a particular neuron. Details include the minimum, maximum, mean, and
standard deviation of the fan-out weights, and the minimum threshold that would lead to that neuron being pruned. The first
column of the spreadsheet contains an asterisk (*) against neurons that would not be pruned using the current threshold.
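The pruning rule above can be sketched as follows: a neuron is prunable only when every one of its fan-out weights has magnitude below the threshold (hypothetical helper names, illustrative only):

```python
def prunable(fan_out_weights, threshold):
    # True when all fan-out weight magnitudes lie below the pruning threshold.
    return all(abs(w) < threshold for w in fan_out_weights)

def min_pruning_threshold(fan_out_weights):
    # Smallest threshold that would lead to this neuron being pruned:
    # any value just above the largest fan-out magnitude.
    return max(abs(w) for w in fan_out_weights)

w = [0.02, -0.15, 0.004]
print(prunable(w, 0.1))          # False (|-0.15| is not below 0.1)
print(prunable(w, 0.2))          # True
print(min_pruning_threshold(w))  # 0.15
```

This mirrors the summary spreadsheet: the reported minimum threshold for a neuron is determined by its largest fan-out weight magnitude.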

Neural Network Editor - Thresholds Tab


Select the Thresholds tab of the Neural Network Editor to access the thresholds used to determine the classification of
nominal output variables. The precise meaning of the thresholds depends on the network type.

Use classification threshold. Select this check box to indicate that the thresholds should be used. If not selected, the
network has no "doubt option"; i.e., it will always assign a class in a "winner takes all" fashion.

Accept. The threshold used to indicate that the activation of an output neuron is sufficiently good to assign a class.

Reject. The threshold used to indicate that the activation of an output neuron is sufficiently low to indicate that the class is
definitely not as represented by that neuron.

Add Variables to Network


Click the Add button on the Neural Network Editor - Variables tab to display the Add Variables to Network dialog. Adding
variables invalidates any previous training of the network, and you should retrain it afterwards.

Add variable. The following two options are in this group box.

Current variable. This field indicates the name of the current variable in the Network Editor, and whether it is an input or
output variable.

Number to add. Specify the number of new variables to add. Initially, the new variables are all numeric (you can convert
them to nominals on the Network Editor).

Insert position. Specify whether to add the variable Before current variable or After current variable.

Add Units to Network


Click the Add units button on the Neural Network Editor - Layers tab to display the Add Units to Network dialog. Use the
options on this dialog to add new units to a network. Adding units invalidates any previous training of the network, and you
should retrain it afterwards.

Layer. This field displays the layer to which the units will be added.

Units. This field displays how many units there are currently in the selected layer.

Number to add. Specify the number of new units to add.

Current unit. Select a "current unit" - the new units are inserted either before or after this unit.

Insert position. Specify whether the new units should be inserted Before or After the current unit.

Delete Units from Network


Click the Delete units button on the Neural Network Editor - Layers tab to display the Delete Units from Network dialog.
Use the options on this dialog to delete units from a network. Deleting units invalidates any previous training of the network,
and you should retrain it afterwards.

Unit Editor. The following options are in the Unit editor group box.

Layer. This field displays the layer from which the units will be deleted.

Units. This field displays how many units there are currently in the selected layer.

From. In this box, specify the start of the range of units to delete.

To. In this box, specify the end of the range of units to delete.

Nominal Definition

Click the Edit nominals button on the Variables tab of the Neural Network Editor, or click the Edit class list button on the
Topological Map dialog to display the Nominal definition dialog. Use this dialog to change the definition of nominal
variables (the number and names of the nominal values). In a network output variable, a nominal (categorical) variable
corresponds to a classification problem, and the nominal values to the classes. You can rename existing nominals, delete
existing ones, and add new ones. Nominal values must be distinct, and the dialog will check that you have given distinct
names.

It is quite unusual to directly edit the nominal definition of a network, which is typically automatically derived from the
corresponding variable in the data set, and which must retain the same names as the data set variable in order to function.
However, it is occasionally useful for unsupervised networks, or if you alter the data file variable definition and want to
change an existing network to match. Note, however, that the existing weight structure of the network will be lost, and the
network will require retraining.

Nominals List. Lists the nominal values (text). Click on a nominal to edit it (the name is displayed in the field below for
editing), or prior to clicking Add in order to insert newly added nominals before or after an existing value. You can move up
and down the list by pressing the UP ARROW and DOWN ARROW keys.

Edit field. Situated below the nominals list. Use this field to alter the nominal name.

For convenience, you can move up and down the nominal list without leaving this field, by pressing ALT + UP ARROW or
ALT + DOWN ARROW. This is useful if you want to edit several nominal values.

Position field. Situated to the right of the Edit field. This field gives the position of the current nominal value within the list.
Alter this value to reorder the list. The most convenient method is to click the microscroll up arrow or down arrow until the
nominal value is in the desired place in the list.

OK. Click this button to accept any changes you have made and exit this dialog.

Cancel. Click this button to exit this dialog without making any changes.

Add. Displays the Add New Items dialog. On this dialog, specify the number of new nominal values to be added, and
whether to insert them before or after the currently selected nominal value in the list.

Delete. Displays the Delete Items dialog. On this dialog, select the nominal values to be removed, and click OK. Removing
all nominal values converts a nominal variable into a numeric variable.

Generalized Regression Neural Network


Generalized Regression Neural Network Training Overview

Train Generalized Regression Networks Dialog

Train Generalized Regression Networks - Quick Tab

Train Generalized Regression Networks - Pruning Tab

Train Generalized Regression Networks - Classification Tab

Generalized Regression Neural Network Training Overview


Generalized Regression Neural Networks (GRNNs) form a kernel-based estimation of the regression surface. The output
variable is usually numeric [a Probabilistic Neural Network (PNN) is used for classification problems]. Typically, like
PNNs, GRNNs have one hidden unit per training case. However, unlike PNNs, it is possible to train a GRNN with a smaller
number of hidden units, which represent the centroids of clusters of known data. These centers are typically assigned using
K-Means. It is also, of course, permissible to sub-sample the available training data.

GRNN centers can be assigned using any of the algorithms (e.g., Kohonen training) on the generic Train Radial Layer
dialog, available by clicking the Custom button on the Train Generalized Regression Networks dialog. However, you should
proceed with caution, as it is assumed that the centers are the centroids of clusters, and if you use an algorithm that violates
this assumption (e.g. Learned Vector Quantization), the resulting network is likely to be invalid.

Once centers have been assigned, the balance of GRNN training is extremely simple and takes a minimal period of time.

Train Generalized Regression Networks Dialog


Select the Generalized Regression Neural Network option button on the Custom Network Designer - Quick tab and click the
OK button to display the Train Generalized Regression Networks dialog. This dialog contains three tabs: Quick, Pruning,
and Classification. The options described here are available regardless of which tab is selected. See also, Generalized
Regression Neural Network (GRNN) Training Overview.

OK. Click the OK button to run the center assignment algorithm and to set the regression factors in the third and fourth layer
of the network.

Cancel. Click the Cancel button to close this dialog and return to the Custom Network Designer dialog.

Options. Click the Options button to display the Options menu.

Custom. Click the Custom button to display the Train Radial Layer dialog, which is used to assign GRNN centers. After
assigning centers using the algorithm of your choice, click OK on the radial training dialog. Layers three and four of the
GRNN will then be optimized, and the pruning algorithms run.

Sampling. Click the Sampling button to display the Sampling of Case Subsets for Training dialog, which is used to specify
how the cases in the data set should be distributed among the training, selection, and test subsets.

Train Generalized Regression Networks - Quick Tab


Select the Quick tab of the Train Generalized Regression Networks dialog to access the options described here.

Assignment of radial centers. Specify the algorithm used to locate the centers. The available options are:

Random sampling. Use random sampling for radial center assignment. If the number of hidden units is equal to the number
of training cases (which is the default situation), then Random Sampling is equivalent to simply using all the training cases.

K-Means clustering. Select this option button to use K-Means algorithm for radial center assignment.

Smoothing. A Generalized Regression Neural Network (GRNN) estimates the regression surface by adding together a
number of Gaussian (bell-shaped) curves located at each training case. The smoothing factor determines the width of the
Gaussians, and the training case's target output its height.

As with Probabilistic Neural Networks (PNNs), too small a factor leads to very sharply peaked Gaussians - in the limit of
0.0, the GRNN has a spike at each training case, and so (barring identical points in separate classes) achieves an exactly
correct prediction on the training set, but predicts 0.0 for all new cases. On the other hand, a large factor leads to an
over-smoothed surface with little detail. Typically, a figure between 0.1 and 3 yields reasonable results, although this is
problem-dependent.
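The estimate described above is a kernel regression: the prediction is the Gaussian-weighted average of the training targets, with the smoothing factor acting as the kernel width. A one-dimensional sketch (illustrative, not the STATISTICA implementation):

```python
import math

def grnn_predict(x, train_x, train_y, smoothing):
    # Gaussian kernel centered at each training case; the smoothing factor
    # sets the kernel width.
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * smoothing ** 2))
               for xi in train_x]
    # Prediction: weighted average of the training targets.
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
print(grnn_predict(1.0, xs, ys, smoothing=0.1))   # ~1.0: sharply peaked, fits training point
print(grnn_predict(1.0, xs, ys, smoothing=10.0))  # ~3.5: over-smoothed toward the mean
```

A very small smoothing factor reproduces the training targets almost exactly (the over-learning extreme), while a very large one flattens the estimate toward the overall mean of the targets.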

Note. Although all forms of neural network exhibit the problem of over-learning to some degree, manifested by a training
error much lower than the selection error, GRNNs and PNNs demonstrate the most extreme example of over-learning: the
training error can be arbitrarily lowered simply by reducing the smoothing factor. It is therefore particularly important to use
a selection set and not to be guided by the training error, when using a GRNN or PNN.

Set number of hidden units to equal number of training cases. If this check box is selected, the number of hidden units in
the GRNN is reset to equal the number of training cases (which is standard practice for GRNNs).

Clear this check box if you want to explicitly specify a number of hidden units less than the number of training cases (if you
have more hidden units than training cases, STATISTICA Neural Networks will remove the excess units).

Train Generalized Regression Networks - Pruning Tab


Select the Pruning tab of the Train Generalized Regression Networks dialog to access the options described here.

Prune inputs with low sensitivity after training. Select this check box to apply sensitivity-analysis based pruning after
training. A sensitivity analysis is run after the network is trained, and input variables with training and selection sensitivity
ratios below the threshold are pruned.

Ratio. (The sensitivity pruning ratio threshold.) An input variable with a sensitivity of 1.0 actually makes no contribution to
the network's decision, and can be pruned without any detriment. An input variable with sensitivity below 1.0 actually
damages network performance, and should definitely be pruned (perhaps surprisingly, inputs with sensitivity below 1.0 on
the selection data are not an uncommon occurrence, a by-product of over-learning). If you specify a threshold above 1.0,
then there is some deterioration in performance, but this may be acceptable in order to reduce the network size.
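The sensitivity ratio referred to above compares the network error with the input unavailable against the baseline error; a sketch of the pruning decision (hypothetical helper, assuming the ratios have already been computed):

```python
def inputs_to_prune(sensitivity_ratios, threshold):
    # Ratio = error with the input "removed" / baseline error.
    # 1.0: the input contributes nothing; below 1.0: it actively hurts.
    return [name for name, ratio in sensitivity_ratios.items()
            if ratio < threshold]

ratios = {"x1": 2.4, "x2": 1.0, "x3": 0.93}
print(inputs_to_prune(ratios, threshold=1.05))  # ['x2', 'x3']
```

Here x1 clearly helps (removing it more than doubles the error), x2 contributes nothing, and x3 actually damages performance, so a threshold slightly above 1.0 prunes the latter two.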

Train Generalized Regression Networks - Classification Tab


It is not recommended practice to employ a GRNN for a classification problem; the closely related PNN is better suited.
Therefore, this tab can usually be ignored. However, as it is feasible to use a GRNN for classification, the tab is available.

Classification neural networks must translate the numeric level on the output neuron(s) to a nominal output variable. This tab
governs the techniques used to make this translation. The technique is described in more detail in classification thresholds.

Assign to highest confidence (no thresholds). This option button is available only for multiple output neuron networks. It
indicates a "winner takes all" network.

Use the thresholds specified below. Specify explicit accept and reject thresholds; see classification thresholds for more
details.

Accept/Reject. Specify the accept and reject thresholds if the option to explicitly specify thresholds is selected.

Calculate optimum thresholds. Available only for single output neuron, two-class problem networks. The thresholds are
determined to minimize expected loss (see classification thresholds).

Loss. The loss coefficient used if the threshold is being automatically determined to minimize loss.
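
As a hedged illustration of the single output neuron case, assuming the common convention that outputs at or above the accept threshold take the positive class and outputs at or below the reject threshold take the negative class (see classification thresholds for the authoritative rule):

```python
def classify(output, accept, reject, pos="Class 1", neg="Class 0"):
    """Single-output, two-class decision with accept/reject thresholds.

    Sketch of one common convention, not necessarily STATISTICA's exact
    rule: anything between the two thresholds is reported as Unknown.
    """
    if output >= accept:
        return pos
    if output <= reject:
        return neg
    return None   # "Unknown"
```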

Clustering Network
Cluster Network Training Overview

Train Cluster Network Dialog

Train Cluster Network - Quick Tab

Train Cluster Network - LVQ Tab

Train Cluster Network - Pruning Tab

Train Cluster Network - Classification Tab

Train Cluster Network - Interactive Tab

Cluster Network Training Overview


A cluster network is actually a non-neural model presented in a neural form for convenience. A cluster network consists of a
number of class-labeled exemplar vectors (each represented by a radial neuron). The vectors are assigned centers by
clustering algorithms such as K-Means, and then labeled using nearby cases. After labeling, the centers' positions can be
fine-tuned using Learned Vector Quantization.

Cluster networks are closely related to Kohonen networks, with a few differences. Cluster networks are intended for
supervised learning situations where class labels are available in the training data, and cluster networks do not have a
topologically organized output layer.

Training consists of center assignment, followed by labeling. Optionally, this can be followed by LVQ training to improve
center location. After training, the network can be pruned and classification factors set, including an acceptance threshold
and K,L factors for KL nearest neighbor classification.

Train Cluster Network Dialog


Select the Clustering Network option button on the Custom Network Designer - Quick tab and click the OK button to display
the Train Cluster Network dialog. This dialog can contain up to five tabs: Quick, LVQ, Pruning, Classification, and
Interactive (if LVQ training is enabled). See also, Cluster Network Training - Overview.


OK. Click this button to train the cluster network. Training consists of center assignment followed by labeling.

Cancel. Click this button to exit this dialog and return to the Custom Network Designer.

Options. Click the Options button to display the Options menu.

Custom. Click this button to display the Train Radial Layer dialog. Any of the radial training algorithms in STATISTICA
Neural Networks can be applied to cluster networks. When you click OK on the Train Radial Layer dialog, control is
returned to the Train Cluster Network dialog, which will apply pruning algorithms and set classification factors before
displaying the Results dialog.

Sampling. Click this button to display the Sampling of Case Subsets for Training dialog.

Train Cluster Network - Quick Tab


Select the Quick tab of the Train Cluster Network dialog to access the options described here.

Assign Centers. Specify the algorithm to be used in assigning centers. The options are: random sampling, K-Means, and
random values.

Label centers (class). Select an algorithm to assign class labels to the cluster network. The class label of a neuron is based
on the class of nearby training cases. The class can be based either on the K nearest training cases to the neuron, or on those
training cases that are nearer to the neuron than to any other neuron (the Voronoi neighbors).

KL Nearest Cases.

K/L. Specify the K and L factors used to recover the neuron class. The K nearest training cases to the neuron are located.
The most common class label among the K is assigned to the neuron, provided that at least L cases share that label. If fewer
than L share the label, or there is a tie, the neuron is unlabeled (indicating "Unknown" class).
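
The K/L labeling rule described above can be sketched as follows (a hypothetical helper; tie-breaking details may differ in STATISTICA Neural Networks):

```python
import numpy as np
from collections import Counter

def label_neuron(center, cases, labels, K=3, L=2):
    """Label a radial neuron from its K nearest training cases.

    The most common class among the K neighbors is assigned, provided
    at least L of them agree; otherwise the neuron stays unlabeled
    (None, i.e. "Unknown").
    """
    d = np.linalg.norm(cases - center, axis=1)
    nearest = labels[np.argsort(d)[:K]]
    (top, count), *rest = Counter(nearest).most_common()
    if count < L or (rest and rest[0][1] == count):   # too few, or a tie
        return None
    return top
```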

Voronoi Neighbors. This algorithm counts the cases that are assigned to each unit (that is, are nearer to that unit than to any
other unit), and labels the unit using the most common class of these assigned cases, provided that they exceed a stated
proportion of the assigned cases. If fewer than the given proportion are in the most common class, a blank label is applied,
indicating "Unknown." See Voronoi Neighbors for more details.

Learned vector quantization. Select this check box to enable LVQ training; detailed control parameters are available on
the LVQ tab.

Train Cluster Network - LVQ Tab


Select the LVQ tab of the Train Cluster Network dialog to access the options described here.

Learned vector quantization. Select this check box to enable LVQ training.

Assign centers. Use the options in this group box to specify the algorithm variant to be used. The variants are briefly
described below; see Learned Vector Quantization for more details.

LVQ 1. Select this option button to adjust the nearest exemplar vector to the training case, moving it either toward the
training case if they have the same class label or away from it if they do not.

LVQ 3. Select this option button to also compare the two nearest exemplars. Similar to LVQ 2.1, it moves one toward and
one away from the training case only if one is of the correct class and the distances are approximately equal. However, it
also moves both exemplars toward the training case a smaller distance if they are both correctly labeled. This small
movement of both exemplars is controlled by the Beta parameter, described below.

LVQ 2.1. Select this option button to compare the two nearest exemplars. It adjusts them only if they are approximately the
same distance from the training case and only one of them is of the correct class. In this case, the correct one is moved
toward the training case and the incorrect one away. The definition of "approximately the same distance" involves the
Epsilon parameter, described below.
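
As a rough sketch of the simplest variant, LVQ 1, assuming Euclidean distance and in-place updates (the other variants add the two-exemplar logic described above):

```python
import numpy as np

def lvq1_step(exemplars, ex_labels, x, x_label, rate):
    """One LVQ 1 update: move the nearest exemplar toward the training
    case if the class labels match, away from it if they do not.
    A sketch of the rule described above, not STATISTICA's code.
    """
    d = np.linalg.norm(exemplars - x, axis=1)
    i = int(np.argmin(d))
    sign = 1.0 if ex_labels[i] == x_label else -1.0
    exemplars[i] += sign * rate * (x - exemplars[i])
    return i
```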

Epochs. Specify the number of epochs over which the algorithm will run. On each epoch, the entire training set is fed
through the network and used to adjust the network weights and thresholds.

Learning rate. The learning rate is altered linearly from the first to last epochs. You can specify a start and end value. The
usual practice is to decay the learning rate over time, so that the algorithm "settles down" to a solution.
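
The linear alteration of the learning rate can be sketched as (a minimal illustration, assuming the rate is interpolated from the first epoch to the last):

```python
def learning_rate(epoch, epochs, start, end):
    """Linearly interpolate the rate from `start` (epoch 0) to `end`
    (the last epoch)."""
    if epochs == 1:
        return start
    return start + (end - start) * epoch / (epochs - 1)
```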


Epsilon. Specify the Epsilon value here. In LVQ 2.1 and LVQ 3, the definition of "approximately the same distance" is
controlled by this parameter. Epsilon is typically between 0 and 1 and usually less than 0.5.

Beta. Specify the Beta value here. In LVQ 3, when both nearest exemplars are of the same class as the training case, they are
both moved toward the training case. This movement is more subtle than when mismatched exemplars are found - the usual
learning rate is multiplied by Beta, which is greater than 0 but significantly less than 1.

Train Cluster Network - Pruning Tab


Select the Pruning tab of the Train Cluster Network dialog to access the options described here.

Prune inputs with low sensitivity after training. Select this check box to apply sensitivity-analysis based pruning after
training. A sensitivity analysis is run after the network is trained, and input variables with training and selection sensitivity
ratios below the threshold described below are pruned.

Ratio. Specify the sensitivity pruning ratio threshold. An input variable with a sensitivity of 1.0 makes no contribution to the
network's decision, and can be pruned without any detriment. An input variable with sensitivity below 1.0 actually damages
network performance, and should definitely be pruned (perhaps surprisingly, inputs with sensitivity below 1.0 on the
selection data are not an uncommon occurrence, a by-product of over-learning). If you specify a threshold above 1.0, there is
some deterioration in performance, but this may be acceptable in order to reduce the network size.

Train Cluster Network - Classification Tab


Select the Classification tab of the Train Cluster Network dialog to access the options described here. Cluster networks
perform classification by storing labeled exemplar vectors. See Classification by Labeled Exemplars for more details.

Classification threshold. Use the options in this group box to select whether to use a threshold.

No threshold. Select this option button if you do not want to specify a threshold. The cluster network will always use the
closest neuron(s), irrespective of the distance.

Use the threshold specified below. Select this option button if you want to specify a threshold.

Accept. Specify the classification threshold to use.

K/L. Specify the K and L nearest neighbor control factors. By default K is 1 and L is 0, corresponding to the standard
Kohonen "winner takes all" algorithm.

Train Cluster Network - Interactive Tab


Select the Interactive tab of the Train Cluster Network dialog to access options that relate to interactive training. In
interactive training, a dialog is displayed during training showing the progress of the error function (see Training in Progress
dialog), which includes options to stop training early and to extend a training run with additional epochs of the same or
different training algorithms. This tab is available only if Learned Vector Quantization training is enabled on the Train
Cluster Network - LVQ tab.

Interactive training (display graph during training). Select this check box to enable interactive training. When you click
the OK button to execute the training algorithm, the Training in Progress dialog is displayed.

Keep lines from previous training runs on graph (for comparison). If this check box is selected, the progress graph is not
cleared when training starts, allowing you to compare the results of your training run with previous runs.

Automatically close graph window when training finishes. By default, when interactive training finishes, the Training in
Progress dialog stays visible so that you can extend training or choose to transfer the graph to the results workbook. If this
check box is selected, the progress dialog is closed automatically when training finishes. Select this check box if you want to
see the graph for purposes of instantaneous feedback but do not want to extend training.

Error lines plotted. Specify which lines are plotted on the progress graph. By default the training and selection errors are
both plotted; however, you can choose to plot only one to reduce clutter.

Sampling interval (in epochs). Specify how often (in epochs) the error should be plotted on the graph. Specify a sampling
interval greater than one if you are running for a large number of epochs to reduce memory requirements and the drawing
time for the graph.


Linear
Linear Network Training Overview

Train Linear Network Dialog

Train Linear Network - Quick Tab

Train Linear Network- Pruning Tab

Train Linear Network - Classification Tab

Linear Network Training Overview


Linear Neural Networks implement a basic linear model, used principally for regression (although they can be used for
classification, forming a simple linear discriminant). Linear models are equivalent to simple forms of neural network, with
no hidden layer. They are included in STATISTICA Neural Networks principally to allow you to compare linear and non-
linear modeling within the same framework.

The Occam's Razor principle dictates that the simplest model that fits the data should be used. Linear models have been used
very successfully in science for many years, not least because many problems are linear or approximately linear. If the
problem is linear, then it is pointless to deploy a complex non-linear model such as a neural network, which cannot perform
any better and is prone to over-fitting problems. It is therefore recommended always to benchmark your neural network's
performance against a linear network.

Linear networks are optimized using the standard Singular Value Decomposition (Pseudo Inverse) procedure. However, as
they use a Dot Product synaptic function, they can also be optimized using any of the generic Dot Product training
algorithms.

Clicking the OK button on the Train Linear Network dialog optimizes the linear network using SVD.
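
A minimal sketch of the pseudo-inverse optimization, using NumPy's SVD-based `pinv` (this illustrates the standard technique, not STATISTICA's internal implementation):

```python
import numpy as np

def fit_linear_network(X, y):
    """Optimize a linear (no-hidden-layer) network with the pseudo-inverse.

    Appends a bias column and solves the least-squares problem via SVD,
    which is what the Singular Value Decomposition (Pseudo Inverse)
    procedure amounts to.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # bias input
    return np.linalg.pinv(Xb) @ y                   # SVD-based solve

def predict_linear(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W
```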

Train Linear Network Dialog


Select the Linear option button on the Custom Network Designer - Quick tab and click the OK button to display the Train
Linear Network dialog. This dialog contains three tabs: Quick, Pruning, and Classification. The options described here are
available regardless of which tab is selected. See also, Linear Network Training Overview.

OK. Click the OK button to optimize the linear network using Singular Value Decomposition (SVD).

Cancel. Click the Cancel button to close this dialog and return to the Custom Network Designer dialog.

Options. Click the Options button to display the Options menu.

Custom. Click the Custom button to display the Train Dot Product Layers dialog. Any of the algorithms on this dialog can
be used to optimize a linear network. SVD is usually quicker, but occasionally another algorithm, such as Conjugate
Gradient Descent, is useful as SVD can be numerically unstable.

Sampling. Click the Sampling button to display the Sampling of Case Subsets for Training dialog, which is used to specify
how the cases in the data set should be distributed among the training, selection, and test subsets.

Train Linear Network - Quick Tab


There are no options on the Train Linear Network - Quick tab. Simply click the OK button to run the standard algorithm
(optimization by singular value decomposition).

Train Linear Network - Pruning Tab


Select the Pruning tab of the Train Linear Network dialog to access the options described here.

Prune inputs/units with small fan-out weights. Select this check box to apply a pruning algorithm at the end of training. A
neuron with small magnitude fan-out weights (i.e. weights leading to the next level) makes little contribution to the
activations of the next layer, and can be pruned, leading to a compact, faster network with equivalent performance. Specify
whether input variables, hidden units, or both should be pruned, and the threshold at which pruning takes place.

Prune Input Variables. Select this check box to specify that input variables with small fan-out weights should be pruned.
Each input variable has one or more associated input layer neurons (more than one for some nominal variables and for time
series networks), and the fan out on all these input neurons must be less than the pruning threshold for the input to be
pruned.

Prune Hidden Units. Select this check box to specify that hidden units with small fan-out weights should be pruned.

Threshold. If all a unit's fan-out weights have smaller magnitude than the threshold entered in this box, it is a candidate for
pruning.
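
The fan-out pruning test can be sketched as follows; the weight-matrix layout is an assumption for illustration:

```python
import numpy as np

def prunable_units(W, threshold):
    """Indices of units whose fan-out weights are all below threshold.

    W[i, j] is assumed to hold the weight from unit i in this layer to
    unit j in the next layer (a hypothetical layout). A unit is a
    pruning candidate only if every one of its fan-out weights has
    magnitude smaller than the threshold.
    """
    return [i for i in range(W.shape[0])
            if np.all(np.abs(W[i]) < threshold)]
```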

Prune inputs with low sensitivity after training. Select this check box to apply sensitivity-analysis based pruning after
training. A sensitivity analysis is run after the network is trained, and input variables with training and selection sensitivity
ratios below the threshold (see below) are pruned.

Ratio. (The sensitivity pruning ratio threshold.) An input variable with a sensitivity of 1.0 makes no contribution to
the network's decision, and can be pruned without any detriment. An input variable with sensitivity below 1.0 actually
damages network performance, and should definitely be pruned (perhaps surprisingly, inputs with sensitivity below 1.0 on
the selection data are not uncommon, a by-product of over-learning). If you specify a threshold above 1.0, then there is
some deterioration in performance, but this may be acceptable in order to reduce the network size.

Train Linear Network - Classification Tab


Select the Classification tab of the Train Linear Network dialog to access the options described here. Classification neural
networks must translate the numeric level on the output neuron(s) to a nominal output variable. The options described here
govern the techniques used to make this translation. The technique is described in more detail in classification thresholds.

Assign to highest confidence (no thresholds). This option is available only for multiple output neuron networks. Select this
option to indicate a "winner takes all" network.

Use the thresholds specified below. Select this option to use accept and reject thresholds; the precise interpretation of the
thresholds depends on whether this is a single output neuron or multiple output neuron network (see classification
thresholds).

Accept, Reject. Specify the accept and reject thresholds if the option to explicitly specify thresholds is used.

Calculate optimum thresholds. This option is available only for single output neuron, two-class problem, networks. The
thresholds are determined so as to minimize expected loss (see classification thresholds).

Loss. Specify the loss coefficient used if the threshold is being automatically determined to minimize loss.

Multiple Network Selection


Select Neural Networks

Select Neural Networks


Click the Networks button on the Edit Ensemble dialog to display the Select Neural Networks dialog. This dialog contains
one tab: Select Neural Networks. Use the options on this dialog to select one or more neural networks either by clicking on
the network displayed in the list box, or typing in the range of indices in the text box at the bottom of the dialog.

Networks list. The list displays summary details of all networks in the current network file. Networks are listed in ascending
order of index, which corresponds to the order of creation.

To select a single network, click on it. To select a range of networks, click on the first one in the range, then hold down the
SHIFT key and click on the last one in the range. To extend an existing selection by adding a new network, or to remove a
network from the current selection, hold down the CTRL key and click on the network.

Range. This text box lists the currently selected networks by index, separated by spaces. If there are several networks with
contiguous indices, you can specify them as a range; e.g. 2-4 means networks with indices 2, 3, and 4.

Select all. Click this button to select all the networks shown in the list.


OK. Click the OK button to confirm the current selection and return to the previous dialog.

Cancel. Click the Cancel button to close this dialog, ignoring any selections, and return to the previous dialog.

Multilayer Perceptron
Multilayer Perceptron Training Overview

Train Multilayer Perceptron Dialog

Train Multilayer Perceptron - Quick Tab

Train Multilayer Perceptron - Start Tab

Train Multilayer Perceptron - End Tab

Train Multilayer Perceptron - Decay Tab

Train Multilayer Perceptron - Interactive Tab

Train Multilayer Perceptron - BP Tab

Train Multilayer Perceptron - QP Tab

Train Multilayer Perceptron - DBD Tab

Train Multilayer Perceptron - Classification Tab

Multilayer Perceptron Training Overview


Multilayer Perceptrons are one of the most popular network types, and in many problem domains seem to offer the best
possible performance. They are trained using iterative algorithms, of which the best known is back propagation.

A considerable amount of research has been conducted into improved algorithms for training multilayer perceptrons. The
most influential of these are the second-order optimization algorithms (Conjugate Gradient Descent, Quasi-Newton, and
Levenberg-Marquardt). These algorithms are usually described as converging far more quickly than back propagation (one or
two orders of magnitude faster). However, experience indicates that their initial convergence is usually much slower than the
on-line version of back propagation, and that they seem to be more inclined to suffer an important training problem in neural
networks - convergence to a local minimum.

If the training data is sparse for the complexity of the underlying function, then multilayer perceptrons are inclined to suffer
the second major problem of neural networks - overfitting (in effect, the neural network estimates an over-complex function,
modeling the noise in the data set rather than the underlying function). Overfitting can be addressed by applying weight
regularization (which penalizes the large weights that correspond to complex functions), and by early stopping (double
checking the network's performance against the selection data subset during training). Both of these techniques are available
in STATISTICA Neural Networks. Since second-order training algorithms are more powerful than Back Propagation, they are
also more inclined to overfit the data, which is a second reason why sometimes the simpler Back Propagation algorithm
actually proves superior.

The standard training procedure recommended for multilayer perceptrons in STATISTICA Neural Networks is designed to
address these problems. It consists of a two-phase process. The first phase is a quite short burst of back propagation, with a
moderate training rate. This performs the "gross convergence" stage, and for some simple problems may actually be
sufficient in itself. The second phase is a longer run of conjugate gradient descent, a much more powerful algorithm, which
is less likely to encounter convergence problems because back propagation has already done the initial work.

The two other second order algorithms, Quasi-Newton and Levenberg-Marquardt, are both generally considered to be faster
than Conjugate Gradient Descent. However, both have other drawbacks. Quasi-Newton has significant memory
requirements (proportional to the square of the number of weights in the network), and so is not suitable for large networks.
It is also prone to occasional problems of numerical instability. Levenberg-Marquardt is suitable only for low-noise
regression problems using the sum-squared error function, but can be very fast in those circumstances. Conjugate Gradient
Descent is a highly effective generic algorithm with low memory requirements and good stability.

You can plot a graph of the development of the error function when training a multilayer perceptron. The graph shows the
error on the training and selection sets on each epoch. This can be quite informative. In general, you would expect the
training error to drop on every epoch, as the algorithm is attempting to minimize this error score (in fact, the back
propagation algorithm is not guaranteed to always decrease the error, particularly if you use an over-high learning
rate). The selection subset is not used to train the network, but acts as an independent check on training. A common
occurrence is that at first the selection error drops at about the same rate as the training error (if the data set is quite
small, the selection error may be significantly noisier). Then, the selection error begins to fall behind the training error,
perhaps even beginning to rise quite sharply.

This demonstrates the phenomenon of over-learning, or over-training, very clearly. The initially random network has small
weights, which correspond to a low-complexity model. As training begins, the weights are increased, and the function
modeled by the neural network is fit to the data. The selection error decreases as the model at this stage fits the underlying
function that generated both training and selection data. However, as training continues, the weights diverge further, and the
very powerful and flexible neural model is fit increasingly closely to the data points, beginning to model the noise rather
than the underlying function. The training error naturally continues to fall, but the selection error rises as the match with the
increasingly eccentric model becomes poor.

You can at least partially mitigate this problem by taking a number of steps: reducing the number of input variables,
particularly if some are not significant (see Sensitivity Analysis and Feature Selection), reducing the number of
hidden units, introducing a Weigend weight decay term, and (if at all possible) increasing the number of data cases.

Another useful approach is to stop training once the selection error starts to rise - a technique known as early stopping.
STATISTICA Neural Networks can perform early stopping automatically for you. The algorithm is stopped when the error
fails to improve for a certain number of epochs. As at this point some deterioration will have set in, STATISTICA Neural
Networks also restores the best network (i.e. the one with the lowest selection error) discovered during the training process.

Train Multilayer Perceptron Dialog


Select the Multilayer Perceptron option button on the Custom Network Designer - Quick tab and click the OK button to
display the Train Multilayer Perceptron dialog. This dialog contains the Quick, Start, End, Decay, and Interactive tabs.
Depending on the options selected on the Quick tab, there can also be BP, QP, and DBD tabs, and there is an additional
Classification tab for classification neural networks.

OK. Click this button to start the training algorithm(s) with the given parameters. A progress-monitoring dialog is displayed,
and the algorithm can be interrupted if desired.

Cancel. Click this button to exit this dialog and return to the Custom Network Designer.

Options. Click the Options button to display the Options menu.

Custom. Click this button to display the generic Train Dot Product Layers dialog. In addition to the iterative algorithms
described in this section, the Train Dot Product Layers dialog is used to apply principal component analysis training to the
first layer, and linear optimization (singular value decomposition) to the output layer. The former is appropriate if you have a
large number of input variables, and the latter if the network is for a regression problem and uses the Identity activation
function.

If you select custom training, the options on the Start tab of the Multilayer Perceptron dialog are applied before the Train
Dot Product Layers dialog is displayed, and the options on the End tab are applied after you click OK on the Train Dot
Product Layers dialog.

Sampling. Click this button to display the Sampling of Case Subsets for Training dialog.

See also, Multilayer Perceptron Training - Overview.

Train Multilayer Perceptron - Quick Tab


Select the Quick tab of the Train Multilayer Perceptron dialog to access the options described here.

Phase one/Phase two. Select a two-phase algorithm by selecting both of these check boxes. Clear one or the other to specify
a single-phase algorithm. The algorithm, and all the parameters of the algorithm, can be chosen independently for each
phase.

Select the training algorithm from the drop-down list for each phase. The available algorithms are:

Back propagation. A simple algorithm with a large number of tuning parameters, often slow terminal convergence, but
good initial convergence. STATISTICA Neural Networks implements the on-line version of the algorithm.

Conjugate gradient descent. A good generic algorithm with generally fast convergence.


Quasi-Newton (BFGS). A powerful second order training algorithm with very fast convergence but high memory
requirements.

Levenberg-Marquardt. An extremely fast algorithm in the right circumstances - low-noise regression problems with the
standard sum-squared error function.

Quick propagation. An older algorithm with comparable performance to back propagation in most circumstances, although
it seems to perform noticeably better on some problems. Not usually recommended.

Delta-bar-delta. Another variation on Back propagation, which occasionally seems to have better performance. Not usually
recommended.

Epochs. Specify the number of epochs of training of the network in a given phase.

Learning rate. Specify the main learning rate for Back propagation, Quick propagation, or Delta-bar-delta training.

Train Multilayer Perceptron - Start Tab


Select the Start tab of the Train Multilayer Perceptron dialog to access options that are applied at the start of training,
specifically, to randomly assign initial weights.

Reinitialize network before training. This check box is selected by default. Small random weights are usually assigned to
a neural network before training. You may occasionally want to clear this box if you are subjecting a preexisting network to
further training.

Initialization method. Use these options to specify exactly how the weights should be initialized at the beginning of
training.

Random uniform. The weights are initialized to uniformly-distributed random values, within a range whose minimum and
maximum are given by the Minimum/Mean and Maximum/S.D. fields respectively.

Random Gaussian. The weights are initialized to normally-distributed random values, with mean and standard deviation
given by the Minimum/Mean and Maximum/S.D. fields respectively.

Control parameters for random initialization.

Minimum/Mean. Range control for Random uniform and Random Gaussian Initialization methods. If the method is
Uniform, this field gives the range minimum. If the method is Gaussian, this field gives the mean.

Maximum/S.D. Range control for Random uniform and Random Gaussian Initialization methods. If the method is Random
uniform, this field gives the range maximum. If the method is Random Gaussian, this field gives the standard deviation.
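
The two initialization methods can be sketched as follows; the shared meaning of the two fields is mirrored by the `a`/`b` parameters (a hypothetical helper, not STATISTICA's code):

```python
import numpy as np

def init_weights(shape, method="uniform", a=-0.01, b=0.01, rng=None):
    """Random weight initialization.

    For 'uniform', a/b are the range minimum and maximum; for
    'gaussian', they are reinterpreted as the mean and standard
    deviation, mirroring the shared Minimum/Mean and Maximum/S.D.
    fields described above.
    """
    if rng is None:
        rng = np.random.default_rng()
    if method == "uniform":
        return rng.uniform(a, b, size=shape)
    if method == "gaussian":
        return rng.normal(loc=a, scale=b, size=shape)
    raise ValueError(method)
```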

Train Multilayer Perceptron - End Tab


Select the End tab of the Train Multilayer Perceptron dialog to access options that control the end of training and are applied
at the end of training. Specifically, there are stopping conditions to determine if training should be terminated before the full
number of epochs has expired, and pruning algorithms to remove superfluous network inputs and/or hidden units.

Track and restore best network. This check box is selected by default. When selected, STATISTICA Neural Networks
keeps a copy of the network with the lowest selection error during training, and this network is the final outcome of training,
not the network remaining at the end of the last epoch. You may occasionally want to clear this check box if you are
convinced that the training process for your problem domain is extremely stable so that the final network will be the best, or
if you want to observe training and are not concerned with getting the best possible network. Clearing this check box speeds
training (since STATISTICA Neural Networks does not need to make a copy of the network each time an improvement is
made).

Stopping conditions.

Target error (Training/Selection).

Specify target error values here. If the error on the training or selection set drops below the given target value, the network
is considered to have trained sufficiently well, and training is terminated. The error never drops to zero or below, so the
default value of zero is equivalent to not having a target error.

Minimum improvement in error.


Training/Selection.

Specify a minimum improvement (drop) in error that must be made; if the rate of improvement drops below this level,
training is terminated. The default value of zero implies that training will be terminated if the error deteriorates. You can
also specify a negative improvement rate, which is equivalent to giving a maximum rate of deterioration that will be
tolerated. The improvement is measured across a number of epochs, called the "window" (see below).

Window. Specifies the number of epochs across which improvement is measured. Some algorithms, including back
propagation, demonstrate noise on the training and selection errors, and all the algorithms may show noise in the selection
error. It is therefore not usually a good idea to halt training on the basis of a failure to achieve the desired improvement in
error rate over a single epoch. The window specifies a number of epochs over which the error rates are monitored for
improvement. Training is only halted if the error fails to improve for that many epochs. If the window is zero, the minimum
improvement threshold is not used at all.
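As an illustration only (not the product's actual code), the windowed stopping rule described above can be sketched as follows, where `errors` holds the per-epoch error values:

```python
def should_stop(errors, window, min_improvement=0.0):
    """Halt training if the error has failed to improve by at least
    `min_improvement` over the last `window` epochs.
    A window of 0 disables the check entirely."""
    if window == 0 or len(errors) <= window:
        return False
    best_before = min(errors[:-window])   # best error before the window
    best_recent = min(errors[-window:])   # best error within the window
    return best_before - best_recent < min_improvement
```

With the default improvement of zero, training stops only once the error actually deteriorates across the whole window.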

Prune inputs/units with small fan-out weights. Select this check box to apply a pruning algorithm at the end of training. A
neuron with small magnitude fan-out weights (i.e. weights leading to the next level) makes little contribution to the
activations of the next layer and can be pruned, leading to a compact, faster network with equivalent performance. This
option is particularly useful in conjunction with Weigend weight decay (see Decay tab) that encourages the development of
small weights precisely so that they can be pruned.

Prune input variables. Specifies that input variables with small fan-out weights should be pruned. Each input variable has
one or more associated input layer neurons (more than one for some nominal variables, and for time series networks), and
the fan out on all these input neurons must be less than the threshold for the input to be pruned.

Prune hidden units. Specifies that hidden units with small fan-out weights should be pruned.

Pruning threshold. If all a unit's fan-out weights have smaller magnitude than this threshold, it is a candidate for pruning.
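The fan-out pruning criterion amounts to a simple magnitude test, sketched here for illustration (names are hypothetical, not the product's internals):

```python
def prune_candidates(fan_out, threshold):
    """Return the units all of whose fan-out weights are smaller in
    magnitude than `threshold`, making them candidates for pruning.
    `fan_out` maps a unit name to its list of outgoing weights."""
    return [unit for unit, weights in fan_out.items()
            if all(abs(w) < threshold for w in weights)]
```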

Prune inputs with low sensitivity after training. Select this check box to apply sensitivity-analysis based pruning after
training. A sensitivity analysis is run after the network is trained, and input variables with training and selection sensitivity
ratios below the threshold described below are pruned.

Ratio. The sensitivity pruning ratio threshold. An input variable with a sensitivity of 1.0 makes no contribution to the
network's decision, and can be pruned without any detriment. An input variable with sensitivity below 1.0 actually damages
network performance, and should definitely be pruned (perhaps surprisingly, inputs with sensitivity below 1.0 on the
selection data are not an uncommon occurrence, a by-product of over-learning). If you specify a threshold above 1.0, then
there is some deterioration in performance, but this may be acceptable in order to reduce the network size.
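The sensitivity ratio is commonly computed as the network error with an input disabled (e.g. replaced by its training-set mean) divided by the baseline error with all inputs present; the sketch below assumes that definition:

```python
def sensitivity_ratio(error_without_input, baseline_error):
    """Ratio of the error with one input disabled to the baseline error.
    1.0 means the input contributes nothing; below 1.0 the input is
    actively harmful; above 1.0 it carries useful information."""
    return error_without_input / baseline_error

def inputs_to_prune(ratios, threshold=1.0):
    """Inputs whose sensitivity ratio falls below the threshold."""
    return [name for name, r in ratios.items() if r < threshold]
```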

Train Multilayer Perceptron - Decay Tab


Select the Decay tab of the Train Multilayer Perceptron dialog to access the options described here. These options are used
to specify the use of Weigend weight decay regularization. This option encourages the development of smaller weights,
which tends to reduce the problem of over-fitting, thereby potentially improving generalization performance of the network,
and also allowing you to prune the network (see End tab). Weight decay works by modifying the network's error function to
penalize large weights - the result is an error function that compromises between performance and weight size.
Consequently, too large a weight decay term may damage network performance unacceptably, and experimentation is
generally needed to determine an appropriate weight decay factor for a particular problem domain.
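Weigend's weight-elimination penalty is commonly written as decay × Σ (w/w₀)² / (1 + (w/w₀)²), where w₀ is the scale factor; each term approaches 1 for large weights and 0 for small ones, which is what drives weights towards zero ready for pruning. A minimal sketch, assuming that standard formulation:

```python
def weigend_penalty(weights, decay_factor, scale=1.0):
    """Weigend weight-elimination penalty added to the network error.
    Large weights each contribute roughly `decay_factor`; small
    weights contribute almost nothing."""
    return decay_factor * sum(
        (w / scale) ** 2 / (1.0 + (w / scale) ** 2) for w in weights)
```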

Weight decay can be applied separately to the two phases of a two-phase algorithm.

Phase one decay factors/Phase two decay factors.

Apply weight decay regularization. Select this check box to enable weight decay regularization on a particular phase.

Decay Factor. Specify the decay factor; see Weigend Weight Regularization for a mathematical definition of the decay
factor. A study of decay factors and weight pruning shows that the number of inputs and units pruned is approximately
proportional to the logarithm of the decay factor, and this should be borne in mind when altering the decay factor. For
example, if you use a decay factor of 0.001 and this has insufficient effect on the weights, you might try 0.01 next, rather
than 0.002; conversely, if the weights were over-adjusted resulting in too much pruning, try 0.0001.

Scale factor. A secondary factor in weight decay, which is usually left at the default value of 1.0; see Weigend Weight
Regularization for more details.

Train Multilayer Perceptron - Interactive Tab


Select the Interactive tab of the Train Multilayer Perceptron dialog to access options that relate to interactive training. In
interactive training, a special dialog is displayed during training showing the progress of the error function (see Training in
Progress dialog), which includes options to stop training early and to extend a training run with additional epochs of the
same or different training algorithms.

Interactive training (display graph during training). Select this check box to enable interactive training. When the OK
button is clicked to execute the training algorithm, the Training in Progress dialog will be displayed.

Keep lines from previous training runs on graph (for comparison). If this check box is selected, the progress graph is not
cleared when training starts, allowing you to compare the results of your training run with previous runs.

Automatically close graph window when training finishes. By default, when interactive training finishes, the Training in
Progress dialog stays visible so that you can extend training or choose to transfer the graph to the results workbook. If this
check box is selected, the Training in Progress dialog is closed automatically when training finishes. Select this check box if
you want to see the graph for purposes of instantaneous feedback but do not want to extend training.

Error lines plotted. Specify which lines are plotted on the progress graph. By default the Training and selection errors are
both plotted; however, you can choose to plot only one to reduce clutter.

Sampling interval (in epochs). Specify how often (in epochs) the error should be plotted on the graph. Specify a sampling
interval greater than one if you are running for a large number of epochs to reduce memory requirements and the drawing
time for the graph.

Train Multilayer Perceptron - BP Tab


Select the BP tab of the Train Multilayer Perceptron dialog to access the options described here. This tab is available only if
Back propagation was selected on the Quick tab. There are separate tabs for the first and second phase algorithms; the
appropriate tab is only available if the algorithm is in use for that phase. This tab contains additional options for Back
Propagation training.

Adjust learning rate and momentum each epoch. By default, STATISTICA Neural Networks uses a fixed learning rate and
momentum throughout training. Some authors, however, recommend altering these rates on each epoch (specifically, by
reducing the learning rate - this is often counter-balanced by increasing momentum). Select this check box to be able to
adjust the learning rate and momentum. See Back Propagation for more details.

Learning rate. Specify the learning rate used to adjust the weights. A higher learning rate may converge more quickly, but
may also exhibit greater instability. Values of 0.1 or lower are reasonably conservative - higher rates are tolerable on some
problems, but not on all (especially on regression problems, where a higher rate may actually cause catastrophic divergence
of the weights).

Initial. Specify the initial learning rate here.

Final. Specify the final learning rate here.

If the learning rate is being adjusted on each epoch, both fields are enabled, and give the starting and finishing learning rates.

If a fixed learning rate is being used, the first field specifies this, and the second is disabled.

Momentum. Momentum is used to compensate for slow convergence if weight adjustments are consistently in one direction
- the adjustment "picks up speed." Momentum usually increases the speed of convergence of Back Propagation considerably,
and a higher rate can allow you to decrease the learning rate to increase stability without sacrificing much in the way of
convergence speed.

Initial. Specify the initial momentum here.

Final. Specify the final momentum here.
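The weight update with learning rate and momentum, and the per-epoch adjustment of the rates, can be sketched as follows. This is an illustration only: the product does not document its exact schedule, so a linear interpolation from the initial to the final value is assumed here.

```python
def interpolate(initial, final, epoch, n_epochs):
    """Linearly interpolate a rate from `initial` to `final` over
    training (an assumed schedule, for illustration)."""
    if n_epochs <= 1:
        return initial
    t = epoch / (n_epochs - 1)
    return initial + t * (final - initial)

def update_weight(w, grad, prev_delta, lr, momentum):
    """One on-line back propagation step with momentum:
    delta = -lr * grad + momentum * prev_delta."""
    delta = -lr * grad + momentum * prev_delta
    return w + delta, delta
```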

Shuffle presentation order of cases each epoch. STATISTICA Neural Networks uses the on-line version of Back
Propagation, which adjusts the weights of the network as each training case is presented (rather than the batch approach,
which calculates an average adjustment across all training cases, and applies a single adjustment at the end of the epoch). If
the shuffle option is checked, the order of presentation is adjusted each epoch. This makes the algorithm somewhat less
prone to stick in local minima, and partially accounts for Back Propagation's greater robustness than the more advanced
second-order training algorithms in this respect.

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. Specify the standard deviation of the Gaussian noise added to the target output during training.

Train Multilayer Perceptron - QP Tab


Select the QP tab of the Train Multilayer Perceptron dialog to access the options described here. This tab is available only if
Quick propagation is selected on the Quick tab. There are separate tabs for the first and second phase algorithms; the
appropriate tab is only available if the algorithm is in use for that phase. This tab contains additional options for Quick
Propagation training.

Learning Rate. Specify the initial learning rate, applied in the first epoch; subsequently, the quick propagation algorithm
determines weight changes independently for each weight.

Acceleration. Specify the maximum rate of geometric increase in the weight change which is permitted. For example, an
acceleration of two will permit the weight change to no more than double on each epoch. This prevents numerical
difficulties otherwise caused by non-concave error surfaces.
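One common formulation of Fahlman's quick propagation rule, with the acceleration cap applied, is sketched below; the product's exact implementation is not documented here, so treat this as an approximation:

```python
def quickprop_step(prev_delta, prev_grad, grad, acceleration=2.0, lr=0.1):
    """One quick propagation weight change: fit a parabola through the
    two most recent gradients and jump towards its minimum, but never
    grow the step by more than `acceleration` times the previous step."""
    if prev_delta == 0.0:
        return -lr * grad  # first epoch: plain gradient-descent step
    delta = prev_delta * grad / (prev_grad - grad)
    limit = acceleration * abs(prev_delta)
    return max(-limit, min(limit, delta))
```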

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. Specify the standard deviation of the Gaussian noise added to the target output during training.

Train Multilayer Perceptron - DBD Tab


Select the DBD tab of the Train Multilayer Perceptron dialog to access the options described here. This tab is available only
if Delta-bar-delta is selected on the Quick tab. There are separate tabs for the first and second phase algorithms; the
appropriate tab is only available if the algorithm is in use for that phase. This tab contains additional options for Delta-bar-
Delta training.

Learning rate. The following options are available in the Learning rate group box:

Initial. Specify the initial learning rate used for all weights on the first epoch. Subsequently, each weight develops its own
learning rate.

Increment. Specify the linear increment added to a weight's learning rate if the slope remains in a consistent direction.

Decay. Specify the geometric decay factor used to reduce a weight's learning rate if the slope changes direction.

Smoothing. Specify the smoothing coefficient used to update the bar-Delta smoothed gradient. It must lie in the range (0,1).

If the smoothing factor is high, the bar-Delta value is updated only slowly to take into account changes in gradient. On a
noisy error surface this allows the algorithm to maintain a high learning rate consistent with the underlying gradient;
however, it may also lead to overshoot of minima, especially on an already-smooth error surface.
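Jacobs' Delta-bar-Delta rule, which these parameters control, can be sketched for a single weight as follows (an illustrative formulation; the product's exact update is not documented here):

```python
def dbd_update(lr, bar_delta, grad, increment, decay, smoothing):
    """One Delta-bar-Delta learning-rate update for a single weight:
    grow the rate linearly while the gradient agrees in sign with its
    smoothed history, shrink it geometrically on a sign change, then
    update the smoothed (bar-Delta) gradient."""
    if grad * bar_delta > 0:
        lr = lr + increment       # consistent direction: linear growth
    elif grad * bar_delta < 0:
        lr = lr * decay           # direction change: geometric decay
    bar_delta = (1.0 - smoothing) * grad + smoothing * bar_delta
    return lr, bar_delta
```

A high smoothing value makes `bar_delta` track the gradient history slowly, which is exactly the behavior the paragraph above describes.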

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. Specify the standard deviation of the Gaussian noise added to the target output during training.

Train Multilayer Perceptron - Classification Tab


Select the Classification tab of the Train Multilayer Perceptron dialog to access the options described here. Classification
neural networks must translate the numeric level on the output neuron(s) to a nominal output variable. The options on this
tab govern the techniques used to make this translation. The technique is described in more detail in Classification
Thresholds.

Assign to highest confidence (no thresholds). This option is available only for multiple output neuron networks. It
indicates a "winner takes all" network.

Use the thresholds specified below. Uses explicitly defined accept and reject thresholds; the precise interpretation of the
thresholds depends whether this is a single output neuron or multiple output neuron network (see Classification Thresholds).


Accept. Specify the accept threshold.

Reject. Specify the reject threshold.
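For a single output neuron network, the accept/reject interpretation can be sketched as follows (class names here are placeholders, for illustration only):

```python
def classify_single_output(activation, accept, reject,
                           pos="class 1", neg="class 0"):
    """Accept/reject thresholding for a single-output two-class
    network: high activations map to the positive class, low ones to
    the negative class, and the band in between is left unclassified."""
    if activation >= accept:
        return pos
    if activation <= reject:
        return neg
    return "unknown"
```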

Calculate minimum loss threshold. Available only for single output neuron networks on two-class problems. The thresholds
are determined so as to minimize expected loss (see Classification Thresholds).

Loss. The loss coefficient used if the threshold is being automatically determined to minimize loss.

Principal Components
Principal Component Analysis Dialog

Principal Component Analysis


Select the Principal Components option button on the Custom Network Designer - Quick tab and click the OK button to
display the Principal Component Analysis dialog. This dialog contains one tab: Quick.

Principal Components networks are simply linear networks used to perform Principal Component Analysis (PCA). PCA is
useful to reduce the dimensionality of data sets with many input variables. The output of the PCA network can be copied to
the data set, and used to train a simpler network, which can subsequently be joined to the PCA network.
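As a rough illustration of what such a network computes, the sketch below extracts the leading principal component of normalized data by power iteration. This is not the product's algorithm, merely a dependency-free demonstration of PCA on normalized (zero-mean, unit-variance) data; it assumes no input column is constant.

```python
def first_principal_component(data, iters=200):
    """Leading principal component of normalized data, via power
    iteration on the covariance matrix. `data` is a list of rows."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in data) / n) ** 0.5
            for j in range(d)]
    # normalize each column to zero mean and unit variance
    X = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in data]
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(d)]
         for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):  # power iteration towards the top eigenvector
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigval = sum(v[a] * sum(C[a][b] * v[b] for b in range(d))
                 for a in range(d))
    return eigval, v
```

Running this on unnormalized data would give different components, which is the point made in the note under the OK button below.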

Generate spreadsheet of eigenvalues. The eigenvalues generated during principal components analysis give an indication
of the amount of information carried by each component. If this check box is selected, a results spreadsheet is generated
when you click the OK button.

OK. Click the OK button to calculate the Principal Components, which are stored in the weights of the output neurons. Note
that these are the principal components of the normalized data, which are likely to differ from principal components
calculated using unnormalized data. The number of principal components recovered is equal to the number of output
neurons.

Cancel. Click the Cancel button to close this dialog and return to the Custom Network Designer dialog.

Options. Click the Options button to display the Options menu.

Sampling. Click the Sampling button to display the Sampling of Case Subsets for Training dialog.

Radial Basis Function


Radial Basis Function Training Overview

Train Radial Basis Function Dialog

Train Radial Basis Function - Quick Tab

Train Radial Basis Function - Pruning Tab

Train Radial Basis Function - Classification Tab

Radial Basis Function Training Overview


Radial Basis Function (RBF) networks combine a single radial hidden layer with a dot product output layer. The hidden
layer neurons act as cluster centers, grouping similar training cases, and the output layer forms a discriminant function or
regression. Since the clustering transformation is non-linear, a linear output layer is sufficient to produce an overall
non-linear model.

RBF networks use a two stage training process - first, assignment of the radial centers and their deviations; second,
optimization of the output layer. A classic RBF uses the identity activation function in the output layer, in which case linear
optimization (pseudo-inverse, SVD) can be used, which is relatively quick compared with Multilayer Perceptron training.
However, for classification problems an entropy error function is often combined with a non-linear (logistic or softmax)
activation function, and the slower conjugate gradient descent algorithm is used.
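The radial hidden layer described above can be sketched as follows (an illustration of the standard Gaussian form, not the product's code); the output layer then takes a linear combination of these activations, with the weights found by pseudo-inverse or conjugate gradient descent:

```python
import math

def rbf_hidden_activations(x, centers, deviations):
    """Radial hidden layer of an RBF network: each unit outputs a
    Gaussian function of the distance between the input vector and
    the unit's center, with width given by its deviation."""
    acts = []
    for c, s in zip(centers, deviations):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        acts.append(math.exp(-d2 / (2.0 * s * s)))
    return acts
```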


The standard training procedure therefore involves selection of algorithms for center and deviation assignment, with the
output optimization stage assumed. In addition, some generic procedures for network pruning and classification threshold
assignment are available.

However, it is actually possible to use any of the STATISTICA Neural Networks extensive range of center assignment
algorithms (including for example the Kohonen algorithm, and learned vector quantization), and indeed to use iterative dot
product training algorithms such as Back Propagation to optimize the output layer. These additional procedures are available
via the Custom button on the Radial Basis Function dialog.

Train Radial Basis Function Dialog


Select the Radial Basis Function option button on the Custom Network Designer - Quick tab and click the OK button to
display the Train Radial Basis Function dialog. This dialog contains two tabs: Quick and Pruning. There is an additional
Classification tab for classification neural networks. See also, Radial Basis Function Training Overview.

OK. Click the OK button to start the standard training algorithm for Radial Basis Function networks. This applies the
selected radial assignment and radial spread algorithms in turn, followed by optimization of the output layer (by singular
value decomposition or conjugate gradient descent, depending on the activation function).

Cancel. Click this button to exit this dialog and return to the Custom Network Designer.

Options. Click the Options button to display the Options menu.

Custom. Click this button to display the generic Train Radial Layer dialog for optimization of the radial layer.

When the OK button is clicked on the Train Radial Layer dialog, the generic Train Dot Product Layers dialog is displayed
to allow optimization of the output layer. When OK is clicked on that dialog, the options on the Pruning tab and thresholds
are applied.

Sampling. Click this button to display the Sampling of Case Subsets for Training dialog.

Train Radial Basis Function - Quick Tab


Select the Quick tab of the Train Radial Basis Function dialog to access the options described here.

Radial assignment. Select the algorithm used to assign radial centers.

Sample training cases. Select this option button to specify randomly sampling the training cases.

K-Means. Select this option button to assign radial centers by the K-Means algorithm.

Radial spread. Select the algorithm to assign the radial deviation.

Set to this value. Enter the value to be used in the associated field.

Isotropic, scale by. This option uses the Isotropic Deviation Assignment algorithm; the resulting deviation is multiplied by
the scaling factor entered in the associated field.

K-Nearest neighbors. This option sets each unit's deviation to the average distance to its K nearest neighboring centers;
enter the value of K in the associated field.

Train Radial Basis Function - Pruning Tab


Select the Pruning tab of the Train Radial Basis Function dialog to access the options described here.

Prune hidden units with small fan-out weights. Select this check box to apply a pruning algorithm at the end of training.
A hidden neuron with small magnitude fan-out weights (i.e. weights leading to the output layer) makes little contribution to
the activations of the output layer, and can be pruned, leading to a compact, faster network with equivalent performance.

Threshold. This is the threshold for the weight-based pruning algorithm. If all a hidden unit's fan-out weights have smaller
magnitude than this threshold, it is pruned.

Prune inputs with low sensitivity after training. Select this check box to apply sensitivity-analysis based pruning after
training. A sensitivity analysis is run after the network is trained, and input variables with training and selection sensitivity
ratios below the threshold described below are pruned.

Ratio. (Sensitivity pruning ratio threshold) An input variable with a sensitivity of 1.0 makes no contribution to the network's
decision, and can be pruned without any detriment. An input variable with sensitivity below 1.0 actually damages network
performance, and should definitely be pruned (perhaps surprisingly, inputs with sensitivity below 1.0 on the selection data
are not an uncommon occurrence, a by-product of over-learning). If you specify a threshold above 1.0, then there is some
deterioration in performance, but this can be acceptable in order to reduce the network size.

Train Radial Basis Function - Classification Tab


Select the Classification tab of the Train Radial Basis Function dialog to access the options described here.

Classification thresholds. Classification neural networks must translate the numeric level on the output neuron(s) to a
nominal output variable. The options here govern the techniques used to make this translation. The technique is described in
more detail in Classification Thresholds.

Assign to highest confidence (no thresholds). This option is available only for multiple output neuron networks. It
indicates a "winner takes all" network.

Use the thresholds specified below. Uses accept and reject thresholds; the precise interpretation of the thresholds depends
whether this is a single output neuron or multiple output neuron network (see Classification Thresholds).

Accept. Specify the accept threshold here.

Reject. Specify the reject threshold here.

Calculate optimum thresholds. Available only for single output neuron networks on two-class problems. The thresholds are
determined to minimize expected loss (see Classification Thresholds).

Loss. Specify the loss coefficient used if the threshold is being automatically determined to minimize loss.

Probabilistic Neural Network


Probabilistic Neural Network Training Overview

Train Probabilistic Neural Networks Dialog

Train Probabilistic Neural Networks - Quick Tab

Train Probabilistic Neural Networks - Priors Tab

Train Probabilistic Neural Networks - Loss Matrix Tab

Train Probabilistic Neural Networks - Pruning Tab

Train Probabilistic Neural Networks - Thresholds Tab

Probabilistic Neural Network Training Overview


Probabilistic Neural Networks (PNN) form a kernel-based estimate of the class probability density in classification problems, using the
training cases as exemplars. Every training case is copied to the hidden layer of the network, which applies a Gaussian
kernel. The output layer is then reduced to a simple addition of the kernel-based estimates from each hidden unit. Optionally,
a fourth layer can be added, which contains the coefficients of a loss matrix.

PNN training is extremely fast, consisting mainly of copying the training cases (post-normalisation) to the network.
However, since they contain a neuron for every training case, PNNs can be extremely large. This makes them slow to
execute. After training, STATISTICA Neural Networks tests all cases on a network in order to calculate the error and
performance rates on the training, selection and test subsets. For a PNN, this requires a number of operations approximately
proportional to the square of the number of training cases, and if you have a large number of cases, this can actually equal
the time taken to train other network types that are usually described as being far slower to train (e.g. multilayer
perceptrons).


PNN training can incorporate prior probabilities (of class distribution) if these are known and different from the frequency
distribution of the training set. If the priors are unknown, the frequency distribution at least provides a reasonable estimate.

Train Probabilistic Neural Networks Dialog


Select the Probabilistic Neural Network option button on the Custom Network Designer - Quick tab and click the OK button
to display the Train Probabilistic Neural Networks dialog. This dialog contains four tabs: Quick, Priors, Pruning, and
Thresholds. The options described here are available regardless of which tab is selected. See also, Probabilistic Neural
Network Training Overview.

OK. Click the OK button to start the training algorithm(s) with the given parameters. First, a progress-monitoring dialog is
displayed, and then, the Results (Run Models) dialog.

Cancel. Click the Cancel button to close this dialog and return to the Custom Network Designer dialog.

Options. Click the Options button to display the Options menu.

Sampling. Click the Sampling button to display the Sampling of Case Subsets for Training dialog, which is used to specify
how the cases in the data set should be distributed among the training, selection, and test subsets.

Train Probabilistic Neural Networks - Quick Tab


Select the Quick tab of the Train Probabilistic Neural Networks dialog to access the option described here.

Smoothing. A Probabilistic Neural Network (PNN) essentially constructs an estimate of the probability density function of
each class by adding together Gaussian (bell-shaped) curves located at each point in the training set. The smoothing factor
determines the width of these Gaussians.

Too small a factor leads to very sharply peaked Gaussians - in the limit of 0.0, the PNN has a spike at each training case, and
so (barring identical points in separate classes) achieves a 100% classification rate on the training set, and classifies all other
cases as "unknown." On the other hand, a large factor leads to blurred class boundaries and poor discrimination. Typically a
figure between 0.1 and 3 yields reasonable results, although this is problem-dependent.
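The kernel estimate described above can be sketched as follows (an illustrative implementation of the basic PNN idea, omitting priors and the loss matrix):

```python
import math

def pnn_classify(x, training_cases, smoothing):
    """Probabilistic Neural Network output: sum a Gaussian kernel of
    width `smoothing` over the training cases of each class, and pick
    the class with the largest total.
    `training_cases` is a list of (features, class_label) pairs."""
    totals = {}
    for features, label in training_cases:
        d2 = sum((a - b) ** 2 for a, b in zip(x, features))
        kernel = math.exp(-d2 / (2.0 * smoothing ** 2))
        totals[label] = totals.get(label, 0.0) + kernel
    return max(totals, key=totals.get)
```

Shrinking `smoothing` towards zero sharpens each Gaussian into a spike at its training case, which is exactly the over-learning behavior the note below warns about.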

Note. Although all forms of neural networks exhibit the problem of over-learning to some degree, manifested by a training
error much lower than the selection error, PNNs demonstrate the most extreme example, as the training error can be lowered
arbitrarily simply by reducing the smoothing factor. It is therefore particularly important to use a selection set, and not to be
guided by the training error, when using a PNN.

Train Probabilistic Neural Networks - Priors Tab


Select the Priors tab of the Train Probabilistic Neural Networks dialog to access options to specify, optionally, prior class
probabilities.

Use specified prior probabilities. Select this check box to enable the prior probabilities spreadsheet.

Prior probabilities spreadsheet. Displays the list of prior probabilities. To edit this list, see Edit Priors, below.

Edit priors. Click this button to display a General User Entry Spreadsheet, where you can edit the prior probabilities. The
priors should sum to 1.0; however, if they do not, STATISTICA Neural Networks will scale them accordingly. All priors
should be greater than 0.0.

Train Probabilistic Neural Networks - Loss Matrix Tab


The Loss matrix tab of the Train Probabilistic Neural Networks dialog is displayed only for four-layer PNNs, which include
a loss matrix in the last layer.

The list displays the relative cost of each possible misclassification (the leading diagonal cannot be edited, and the loss
coefficients are always zero, as there is no cost in correctly classifying a case). The cost coefficients represent the cost of
misclassifying a case that is actually of the class given by the corresponding row, by erroneously classifying it as belonging
to the corresponding column.
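The decision rule implied by such a loss matrix can be sketched as follows (an illustration of the standard minimum-expected-loss computation, not the product's internals):

```python
def min_loss_class(probabilities, loss, classes):
    """Pick the class with minimum expected loss: for each candidate
    class j, sum probability[i] * loss[i][j] over the true classes i,
    where loss[i][i] == 0 (no cost for a correct classification)."""
    n = len(classes)

    def expected_loss(j):
        return sum(probabilities[i] * loss[i][j] for i in range(n))

    return classes[min(range(n), key=expected_loss)]
```

For instance, if misclassifying class "A" is ten times as costly as misclassifying class "B", the network will prefer "A" even when its estimated probability is somewhat lower.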

Edit loss matrix. Click this button to display a general user entry spreadsheet, where you can edit the loss matrix coefficients.


Train Probabilistic Neural Networks - Pruning Tab


Select the Pruning tab of the Train Probabilistic Neural Networks dialog to access the options described here.

Prune inputs with low sensitivity after training. Select this check box to apply sensitivity-analysis based pruning after
training. A sensitivity analysis is run after the network is trained, and input variables with training and selection sensitivity
ratios below the threshold are pruned.

Ratio. (Sensitivity pruning ratio threshold) An input variable with a sensitivity of 1.0 makes no contribution to the
network's decision, and can be pruned without any detriment. An input variable with sensitivity below 1.0 actually damages
network performance, and should definitely be pruned (perhaps surprisingly, inputs with sensitivity below 1.0 on the
selection data are not an uncommon occurrence, a by-product of over-learning). If you specify a threshold above 1.0, then
there is some deterioration in performance, but this may be acceptable in order to reduce the network size.

Train Probabilistic Neural Networks - Thresholds Tab


Select the Thresholds tab of the Train Probabilistic Neural Networks dialog to access the options described here.

Classification thresholds. Classification neural networks must translate the numeric level on the output neuron(s) to a
nominal output variable. The options on this tab govern the techniques used to make this translation. The techniques are
described in more detail in Classification Thresholds.

Assign to highest confidence (no thresholds). This option indicates a "winner takes all" network.

Use the thresholds specified below. Specify explicit accept and reject thresholds; see Classification Thresholds for more
details.

Accept/Reject. Enter the accept and reject thresholds if the option to explicitly specify thresholds is selected.

Self Organizing Feature Map


Self Organizing Feature Map Training Overview

Train Self Organizing Feature Map Dialog

Train Self Organizing Feature Map - Quick Tab

Train Self Organizing Feature Map - Start Tab

Train Self Organizing Feature Map - Classification Tab

Train Self Organizing Feature Map - Interactive Tab

Self Organizing Feature Map Training Overview


Self Organizing Feature Maps (SOFMs) are rather different to the other networks available in STATISTICA Neural
Networks, in that they are designed primarily for unsupervised learning. They are usually trained using the Kohonen
algorithm, which does not require any examples of the output variable in the data set. However, other center assignment
algorithms (e.g. K-Means) can also be used, and if a labeled output variable is available in the data set, this can be used to
apply center labels.

Another unique feature of the SOFM is the Topological Map. The output layer neurons are arranged on a two dimensional
lattice, and the Kohonen training algorithm is designed to encourage the formation of clusters of similar cases at nearby
positions in the lattice. This allows independent clusters to be identified and labeled even in the absence of labeled training
data. Class labels are assigned to neurons, and it is the class label of the "winning" neuron during execution that forms the
class estimation of the SOFM.

Train Self Organizing Feature Map


Select the Self Organizing Feature Map option button on the Custom Network Designer - Quick tab and click the OK button
to display the Train Self Organizing Feature Map dialog. This dialog contains four tabs: Quick, Start, Classification, and
Interactive. The options described here are available regardless of which tab is selected.

OK. Click the OK button to run the Kohonen training algorithm on the Self Organizing Feature Map (SOFM). After training
is complete, the Topological Map dialog is displayed, allowing you to test the network and add class labels to neurons.

Cancel. Click the Cancel button to close this dialog and return to the Custom Network Designer dialog.

Options. Click the Options button to display the Options menu.

Custom. Click the Custom button to display the generic Train Radial Layer dialog. You can use any of the algorithms for
center assignment and labeling. If you have a target output variable available in the data set, the labeling algorithms (KL
Nearest Neighbors and Voronoi Neighbors) are particularly useful, and can be run after applying the Kohonen algorithm.
When you click OK on the Radial Training dialog, training is completed and the Topological Map dialog is displayed.

Sampling. Click the Sampling button to display the Sampling of Case Subsets for Training dialog, which is used to specify
how the cases in the data set should be distributed among the training, selection, and test subsets.

Train Self Organizing Feature Map - Quick Tab


The Quick tab of the Train Self Organizing Feature Map dialog contains the parameters of the Kohonen algorithm. The
algorithm is typically run in two stages, although you can disable one phase if you want. Training is iterative, taking a
number of epochs, using learning rate and neighborhood factors that are adjusted on each epoch.

Phase one/Phase two. Clear either one of these check boxes to specify a single-phase algorithm.

Epochs. The number of epochs that training should last in each phase.

Learning rate. The Kohonen learning rate is altered linearly from the first to last epochs. You can specify a Start and End
value. Normal practice is to use different rates in the two phases:

In the first phase, use an initially high learning rate (e.g., 0.9 to 0.1), combined with a large neighborhood (e.g., 2 to 1) and
a small number of epochs (e.g., 100).

In the second phase, use a low learning rate throughout (e.g., 0.01) combined with a small neighborhood (e.g., 0) and large
number of epochs (e.g., 10,000).

Neighborhood. This is the "radius" of a square neighborhood centered on the winning unit. For example, a neighborhood
size of 2 specifies a 5x5 square.

If the winning node is placed near or on the edge of the topological map, the neighborhood is clipped to the edge.

The neighborhood is scaled linearly from the Start value to the End value given.

The neighborhood size is stored and scaled as a real number. However, for the sake of determining neighbors, the nearest
integral value is taken. Thus, the actual neighborhood used decreases in a number of discrete steps. It is not uncommon to
observe a sudden change in the performance of the algorithm as the neighborhood changes size. The neighborhood is
specified as a real number since this gives you greater flexibility in determining when exactly the changes should occur.
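The schedule described above can be sketched as a simple linear interpolation per epoch, with the neighborhood rounded to the nearest integer. This is an illustrative sketch with example phase-one values, not the package's code:

```python
# Linear per-epoch schedule for the Kohonen learning rate and
# neighborhood (illustrative sketch; values are examples, not defaults).
def schedule(epoch, epochs, start, end):
    """Linearly interpolate from start (first epoch) to end (last epoch)."""
    if epochs <= 1:
        return end
    return start + (end - start) * epoch / (epochs - 1)

epochs = 100
for epoch in (0, 49, 99):
    lr = schedule(epoch, epochs, start=0.9, end=0.1)  # phase-one learning rate
    # The neighborhood is stored as a real number but applied as the
    # nearest integer, so it shrinks in discrete steps.
    nbhd = round(schedule(epoch, epochs, start=2.0, end=1.0))
    print(epoch, lr, nbhd)
```

A neighborhood of 2 corresponds to a 5x5 square around the winning unit, clipped at the map edges.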

Train Self Organizing Feature Map - Start Tab


Select the Start tab of the Train Self Organizing Feature Map dialog to access the options described here. These options are
applied at the start of training, specifically, to randomly assign initial weights.

Reinitialize network before training. This check box is selected by default. Small random weights are usually assigned to
a neural network before training. You may occasionally want to clear this check box if you are subjecting a preexisting
network to further training.

Initialization method. Specify exactly how the weights should be initialized at the beginning of training. The possible
values are:

Random uniform. The weights are initialized to a uniformly-distributed random value, within a range whose minimum and
maximum values are given by the Minimum/Mean and Maximum/S.D. fields respectively.

Random Gaussian. The weights are initialized to a normally-distributed random value, with mean and standard deviation
given by the Minimum/Mean and Maximum/S.D. fields respectively.

Control parameters for random initialization.

Minimum/Mean. Range control for the Random uniform and Random Gaussian initialization methods. If the method is
Random uniform, this box displays the range minimum. If the method is Random Gaussian, this box displays the mean.

Maximum/S.D. Range control for the Random uniform and Random Gaussian initialization methods. If the method is Random
uniform, this box displays the range maximum. If the method is Random Gaussian, this box displays the standard deviation.
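The two initialization methods can be sketched as follows (an illustrative sketch; `init_weights` is an invented name, not part of the package):

```python
import random

# Random weight initialization (illustrative sketch). For "uniform",
# a/b are the range minimum/maximum; for "gaussian", the mean/S.D.
def init_weights(n, method="uniform", a=0.0, b=1.0, rng=random):
    if method == "uniform":
        return [rng.uniform(a, b) for _ in range(n)]
    return [rng.gauss(a, b) for _ in range(n)]

w = init_weights(5, "uniform", a=-0.1, b=0.1)  # small random weights
assert all(-0.1 <= x <= 0.1 for x in w)
```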

Train Self Organizing Feature Map - Classification Tab


Select the Classification tab of the Train Self Organizing Feature Map dialog to access the options described here. Self
Organizing Feature Maps (SOFMs) perform classification by storing labeled exemplar vectors. See Classification by
Labeled Exemplars for more details.

Classification threshold. Specify whether or not to use a threshold (select No threshold or Use the threshold specified
below). If a threshold is not used, the SOFM will always use the closest neuron(s), irrespective of the distance.

Accept. Specify the classification threshold to use.

K/L. Specify the K/L nearest neighbor control factors. By default, K is 1 and L is 0, corresponding to the standard Kohonen
"winner takes all" algorithm.

Train Self Organizing Feature Map - Interactive Tab


Select the Interactive tab of the Train Self Organizing Feature Map dialog to access the options described here. These
options relate to interactive training. In interactive training, a special dialog is displayed during training showing the
progress of the error function (see Training in Progress), which includes options to stop training early and to extend a
training run with additional epochs of the same or different training algorithms.

Interactive training (display graph during training). Select this check box to enable interactive training. When the OK
button is clicked to execute the training algorithm, the Training in Progress dialog will be displayed.

Keep lines from previous training runs on graph (for comparison). If this check box is selected, the progress graph is not
cleared when training starts, allowing you to compare the results of your training run with previous runs.

Automatically close graph window when training finishes. By default, when interactive training finishes, the Training in
Progress dialog stays visible so that you can extend training or choose to transfer the graph to the results workbook. If this
check box is selected, the Training in Progress dialog is closed automatically when training finishes. Select this check box if
you want to see the graph for purposes of instantaneous feedback but do not want to extend training.

Error lines plotted. Specify which lines are plotted on the progress graph. By default, the Training and selection error
option button is selected to plot both lines; however, you can choose to plot only one (select the Training error or Selection
error option button) to reduce clutter.

Sampling interval (in epochs). In this box, specify how often (in epochs) the error should be plotted on the graph. Specify a
sampling interval greater than one if you are running for a large number of epochs, to reduce memory requirements and the
drawing time for the graph.

Train Radial Layer


Train Radial Layer Dialog

Train Radial Layer - Sample Tab

Train Radial Layer - Deviation Tab

Train Radial Layer - Labels Tab

Train Radial Layer - Kohonen Tab

Train Radial Layer - Start(K) Tab

Train Radial Layer - LVQ Tab

Train Radial Layer - Interactive Tab

Train Radial Layer Dialog


Click the Custom button on any of the training dialogs for network types with radial units to display the Train Radial Layer
dialog. This dialog supports custom training of any network with a radial layer. It gives individual access to all STATISTICA
Neural Networks training algorithms for radial layers, including center-assignment, deviation setting, and labeling
algorithms. This dialog contains seven tabs: Sample, Deviation, Labels, Kohonen, Start(K), LVQ, and Interactive.

To run a specific algorithm, click the corresponding button on the tabs of the dialog. To complete the custom training
process, click the OK button.

OK. The OK button finishes custom training of the radial layer, and returns control to the type-specific training dialog that
launched it.

If that was the Radial Basis Function training dialog, the effect is to display the Train Dot Product Layers dialog, which
conducts the second stage of custom training.

For all other network types, the effect is to complete custom training, and the training dialog performs any end of training
actions (e.g. pruning) before passing control to the Results dialog.

Cancel. Click the Cancel button to close this dialog, ignoring any changes, and return to the previous dialog.

Train Radial Layer - Sample Tab


Select the Sample tab of the Train Radial Layer dialog to access the options described here.

Sample. Click this button to copy randomly selected cases to exemplar vectors and label the exemplars using the class labels
in the data set (i.e., the output variable values). If a cluster network is created with a number of output units equal to the
number of training cases, all the training cases are copied into the network, and it implements a standard KL Nearest
Neighbor classifier. The disadvantage of this approach is that the resulting classifier is large and slow to execute, as there are
many exemplar vectors. Alternative approaches look for exemplar vectors that represent clusters of training data. However,
as these exemplars are not expected to coincide exactly with training cases, additional algorithms are needed to apply class
labels to the exemplars. See Radial Sampling for more details.

K Means. Click this button to assign exemplar vectors using the K-Means algorithm, which assigns a limited number of
cluster centers. Vectors are assigned so that each one has a group of training cases "assigned" to it (i.e. closer to it than any
other exemplar), and so that each exemplar is the centroid of its assigned training cases. The vectors are not class labeled,
and any class labels on the training cases are ignored. This may lead to sub-optimal placement of cluster centers lying on the
boundary between two classes. See K-Means for more details.
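The K-Means assignment just described can be sketched with Lloyd's algorithm over plain Python tuples (illustrative only, not the package's implementation):

```python
# K-Means center assignment (illustrative sketch): each center collects
# the training cases nearest to it, then moves to their centroid.
def k_means(cases, centers, iterations=10):
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in cases:
            # Assign the case to its nearest center (squared distance).
            nearest = min(range(len(centers)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(x, centers[j])))
            groups[nearest].append(x)
        for j, group in enumerate(groups):
            if group:  # move the center to the centroid of its cases
                centers[j] = tuple(sum(x[d] for x in group) / len(group)
                                   for d in range(len(centers[j])))
    return centers

cases = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
centers = k_means(cases, [(0.0, 0.1), (1.0, 0.9)])  # settles on the two cluster centroids
```

Note that class labels play no part in the placement, which is why centers that land on a class boundary can be labeled sub-optimally afterwards.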

Random. When you click this button, centers are assigned to random positions in the unit hypercube (as positions are
usually normalized to the unit hypercube by the input pre-processing step).

Train Radial Layer - Deviation Tab


Select the Deviation tab of the Train Radial Layer dialog to access algorithms to set the deviations of the radial units.

Explicit. Click this button to set the deviation of the radial units to the figure specified in the adjacent box.

Isotropic. Click this button to use the Isotropic Deviation Assignment algorithm; the resulting deviation is multiplied by the
scaling factor specified in the adjacent box.

K-Nearest. Click this button to assign the deviation to the average distance of the K nearest neighbors, specified in the
adjacent box.
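The K-Nearest deviation rule can be sketched as follows (illustrative only; `k_nearest_deviations` is an invented name):

```python
import math

# Set each radial unit's deviation to the mean distance from its center
# to the k nearest other centers (illustrative sketch).
def k_nearest_deviations(centers, k):
    deviations = []
    for i, c in enumerate(centers):
        dists = sorted(math.dist(c, other)
                       for j, other in enumerate(centers) if j != i)
        deviations.append(sum(dists[:k]) / k)
    return deviations

# Isolated centers get larger deviations than tightly packed ones.
print(k_nearest_deviations([(0.0,), (1.0,), (3.0,)], k=1))  # [1.0, 1.0, 2.0]
```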

Train Radial Layer - Labels Tab


Select the Labels tab of the Train Radial Layer dialog to access algorithms to be used in assigning the radial centers. The
options are:

KL Nearest Neighbor. This algorithm labels exemplar vectors by considering the labels of the training cases nearest to the
exemplar. A class label is assigned if at least L of the K nearest neighbors have the same class label. If no class is shared by
at least L of the K nearest neighbors, a blank unit label is applied, indicating "Unknown." See Class Labeling for more
details.

K/L. These boxes contain the control factors for the KL Nearest Neighbor algorithm.
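The KL Nearest Neighbor labeling rule can be sketched as follows (a hypothetical helper, not the package's code):

```python
import math
from collections import Counter

# Label an exemplar from the K nearest training cases; require at least
# L of them to agree, otherwise leave the label blank ("Unknown").
def kl_label(exemplar, cases, labels, k, l):
    order = sorted(range(len(cases)),
                   key=lambda i: math.dist(exemplar, cases[i]))
    nearest = [labels[i] for i in order[:k]]
    label, count = Counter(nearest).most_common(1)[0]
    return label if count >= l else ""

cases = [(0.0,), (0.1,), (0.9,), (1.0,)]
labels = ["A", "A", "B", "B"]
print(kl_label((0.05,), cases, labels, k=3, l=2))  # "A"
print(kl_label((0.05,), cases, labels, k=3, l=3))  # "" (Unknown)
```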

Voronoi Neighbors. This algorithm counts the cases that are assigned to each unit (that is, are nearer to that unit than to any
other unit), and labels the unit with the most common class among these assigned cases, provided that the proportion in that
class exceeds a stated threshold. If less than the given proportion are in the most common class, a blank label is applied,
indicating "Unknown." See Class Labeling for more details.

Minimum proportions. This box contains the control factor for the Voronoi Neighbors algorithm.

Train Radial Layer - Kohonen Tab


Select the Kohonen tab of the Train Radial Layer dialog to access the parameters of the Kohonen algorithm. The algorithm
is typically run in two stages, although you can disable one phase if you want. Training is iterative, taking a number of
epochs, using learning rate and neighborhood factors that are adjusted on each epoch.

Phase one/Phase two. Clear either one of these check boxes to specify a single-phase algorithm.

Epochs. These boxes contain the number of epochs used in each phase.

Learning rate. The Kohonen learning rate is altered linearly from the first to last epochs. You can specify a Start and End
value. Normal practice is to use different rates in the two phases:

In the first phase, use an initially high learning rate (e.g., 0.9 to 0.1), combined with a large neighborhood (e.g., 2 to 1) and
a small number of epochs (e.g., 100).

In the second phase, use a low learning rate throughout (e.g., 0.01) combined with a small neighborhood (e.g., 0) and large
number of epochs (e.g., 10,000).

Neighborhood. These boxes contain the "radius" of a square neighborhood centered on the winning unit. For example, a
neighborhood size of 2 specifies a 5x5 square.

If the winning node is placed near or on the edge of the topological map, the neighborhood is clipped to the edge.

The neighborhood is scaled linearly from the Start value to the End value specified.

The neighborhood size is stored and scaled as a real number. However, for the sake of determining neighbors, the nearest
integral value is taken. Thus, the actual neighborhood used decreases in a number of discrete steps. It is not uncommon to
observe a sudden change in the performance of the algorithm as the neighborhood changes size. The neighborhood is
specified as a real number since this gives you greater flexibility in determining when exactly the changes should occur.

Train Radial Layer - Start(K) Tab


Select the Start(K) tab of the Train Radial Layer dialog to access options that are applied at the start of Kohonen training -
specifically, to randomly assign initial weights.

Reinitialize dot product layers before training. This check box is selected by default. Small random weights are usually
assigned to a neural network before training. You may occasionally want to clear this check box if you are subjecting a
preexisting network to further training.

Initialization method. In this group box, specify exactly how the weights should be initialized at the beginning of training.
The possible values are:

Random uniform. When this option button is selected, the weights are initialized to a uniformly-distributed random value,
within a range whose minimum and maximum values are given by the Minimum/Mean and Maximum/S.D. fields,
respectively (see below).

Random Gaussian. When this option button is selected, the weights are initialized to a normally distributed random value,
with mean and standard deviation given by the Minimum/Mean and Maximum/S.D. fields, respectively (see below).

Control parameters for random initialization. This group box contains the options described here.

Minimum/Mean. This box is used to specify the range control for the Random uniform and Random Gaussian initialization
methods. If the method is Random uniform, this box specifies the range minimum. If the method is Random Gaussian, this
box specifies the mean.

Maximum/S.D. This box is used to specify the range control for Random uniform and Random Gaussian Initialization
methods. If the method is Random uniform, this box specifies the range maximum. If the method is Random Gaussian, this
box specifies the standard deviation.

Train Radial Layer - LVQ Tab


Select the LVQ tab of the Train Radial Layer dialog to run the Learned Vector Quantization (LVQ) algorithm. This algorithm
is unique among the clustering algorithms in STATISTICA Neural Networks, in that it takes account of class labels when
adjusting exemplar vector positions. It can be used to improve radial unit positioning after clustering and labeling have been
performed. Exemplar vectors that misclassify cases, in particular, are moved to try to correct this behavior.

Run. Click this button to run the LVQ algorithm, using the settings specified by the options described below.

Algorithm variant. In this group box, specify the algorithm variant to be used. The variants are briefly described below; see
Learned Vector Quantization for more details.

LVQ 1. LVQ 1 adjusts the nearest exemplar vector to the training case, either moving it toward the training case if they
have the same class label, or away from it if they do not.

LVQ 3. LVQ 3 compares the two nearest exemplars. Like LVQ 2.1 (described below), it moves one toward and one away
from the training case only if one is of the correct class and the distances are approximately equal. However, it also moves
both exemplars toward the training case a smaller distance if they are both correctly labeled. This small movement of both
exemplars is controlled by the Beta parameter, described below.

LVQ 2.1. LVQ 2.1 also compares the two nearest exemplars. It only adjusts them if they are approximately the same
distance from the training case and only one of them is of the correct class. In this case, the correct one is moved toward the
training case and the incorrect one away. The definition of "approximately the same distance" involves the Epsilon
parameter, described below.

Epochs. The figure in this box specifies the number of epochs over which the algorithm will run. On each epoch, the entire
training set is fed through the network and used to adjust the network weights and thresholds.

Learning rate. The Learning rate is altered linearly from the first to last epochs. You can specify a Start and End value. The
usual practice is to decay the learning rate over time so that the algorithm "settles down" to a solution.

Epsilon. In LVQ 2.1 and LVQ 3, the definition of "approximately the same distance" is controlled by this parameter. Epsilon
is typically between 0 and 1, usually less than 0.5.

Beta. In LVQ 3, when both nearest exemplars are of the same class as the training case, they are both moved toward the
training case. This movement is smaller than when mismatched exemplars are found: the usual learning rate is multiplied by
Beta, which is greater than 0 but significantly less than 1.
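The core move shared by all three variants can be sketched as the LVQ 1 update step; LVQ 2.1 and LVQ 3 add the two-nearest comparison and the Epsilon/Beta refinements on top of the same move. This is an illustrative sketch, not the package's implementation:

```python
# LVQ 1 update (illustrative sketch): move the nearest exemplar toward a
# training case of the same class, or away from one of a different class.
def lvq1_update(exemplar, exemplar_label, case, case_label, lr):
    sign = 1.0 if exemplar_label == case_label else -1.0
    return tuple(e + sign * lr * (x - e) for e, x in zip(exemplar, case))

print(lvq1_update((0.0, 0.0), "A", (1.0, 0.0), "A", lr=0.1))  # toward: (0.1, 0.0)
print(lvq1_update((0.0, 0.0), "A", (1.0, 0.0), "B", lr=0.1))  # away: (-0.1, 0.0)
```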

Train Radial Layer - Interactive Tab


Select the Interactive tab of the Train Radial Layer dialog to access options related to interactive training. In interactive
training, a special dialog is displayed during training showing the progress of the error function (see Training in Progress
dialog), which includes options to stop training early and to extend a training run with additional epochs of the same or
different training algorithms.

Interactive training (display graph during training). Select this check box to enable interactive training. When the OK
button is clicked to execute the training algorithm, the Training in Progress dialog will be displayed.

Keep lines from previous training runs on graph (for comparison). If this check box is selected, the progress graph is not
cleared when training starts, allowing you to compare the results of your training run with previous runs.

Automatically close graph window when training finishes. By default, when interactive training finishes, the Training in
Progress dialog stays visible so that you can extend training or choose to transfer the graph to the results workbook. If this
check box is selected, the Training in Progress dialog is closed automatically when training finishes. Select this check box if
you want to see the graph for purposes of instantaneous feedback but do not want to extend training.

Error lines plotted. In this group box, specify which lines are plotted on the progress graph. By default the Training and
selection errors are both plotted; however, you can choose to plot only one - Training error or Selection error - to reduce
clutter.

Sampling interval (in epochs). Specify how often (in epochs) the error should be plotted on the graph. Specify a sampling
interval greater than one if you are running for a large number of epochs, to reduce memory requirements and the drawing
time for the graph.

Train Dot Product Layers


Train Dot Product Layers Dialog

Train Dot Product Layers - Train Tab

Train Dot Product Layers - Start Tab

Train Dot Product Layers - End Tab

Train Dot Product Layers - Decay Tab

Train Dot Product Layers - Interactive Tab

Train Dot Product Layers - BP (1)+(2) Tab

Train Dot Product Layers - QP (1)+(2) Tab

Train Dot Product Layers - DBD (1)+(2) Tab

Train Dot Product Layers Dialog


Click the OK button on the Train Radial Layer dialog. The OK button finishes custom training of the radial layer, and
returns control to the type-specific training dialog that launched it. If that was the Radial Basis Function training dialog, the
effect is to display the Train Dot Product Layers dialog. This dialog is designed to support custom (hybrid) training of any
neural network with dot product layers. All the algorithms in STATISTICA Neural Networks applicable to dot product layers
are available from this dialog. The algorithms accessed from this dialog ignore any radial layers. Tabs on this dialog can
include: Train, Start, End, Decay, Interactive, BP, QP, and DBD.

The algorithms available include:

Principal components (trains the second layer only)

Linear optimization (trains the last layer only)

Iterative algorithms (train all dot product layers).

These families of algorithms are described in more detail in the following topics:

Principal components analysis

Linear network training

Multilayer Perceptron training

The algorithms are all run from buttons on the Train tab. Other tabs give access to detailed control parameters for the
algorithms.

OK. Click this button to complete custom training. This returns control to the type-specific training dialog from which
custom training was started - that dialog then finishes training, applying any post-training steps such as pruning and
confidence limit determination.

Note. Clicking the OK button does not apply any of the training algorithms available on this dialog. They are executed using
the buttons on the Train tab.

Cancel. Click this button to exit this dialog without making any changes.

Train Dot Product Layers - Train Tab


Select the Train tab of the Train Dot Product Layers dialog to access the options described here.

Principal components. Click this button to run the Principal Components Analysis algorithm. The principal components are
written into the second layer of the network, which must contain dot product units.

Generate spreadsheet of eigenvalues. If this check box is selected, when the Principal Components Analysis algorithm is
run, the eigenvalues discovered as part of the algorithm are written to a results spreadsheet.

Linear optimization. Click this button to run the linear optimization algorithm (Singular Value Decomposition or Pseudo-
Inverse). The final layer of the network is trained. This must contain dot product units with the identity activation function.

Iterative. Click this button to run an iterative dot product layer optimization algorithm. The precise algorithm(s) run is
controlled by the options described in the rest of this article.

Phase one/Phase two. Select a two-phase iterative algorithm by selecting both of these check boxes. Clear one or the other
to specify a single-phase algorithm. The algorithm, and all the parameters of the algorithm, can be chosen independently for
each phase.

Algorithm. Select the iterative training algorithm from the drop-down list for each phase. The available algorithms are:

Back propagation. A simple algorithm with a large number of tuning parameters, often slow terminal convergence, but
good initial convergence. STATISTICA Neural Networks implements the on-line version of the algorithm; see Technical
Details.

Conjugate gradient descent. A good generic algorithm with generally fast convergence.

Quasi-Newton (BFGS). A powerful second order training algorithm with very fast convergence but high memory
requirements.

Levenberg-Marquardt. An extremely fast algorithm in the right circumstances - low-noise regression problems with the
standard sum-squared error function.

Quick propagation. An older algorithm with comparable performance to back propagation in most circumstances, although
it seems to perform noticeably better on some problems. Not usually recommended.

Delta bar delta. Another variation on Back propagation, which occasionally seems to have better performance. Not usually
recommended.

Epochs. Specify the number of epochs of iterative training of the network in a given phase.

Learning rate. Specify the main learning rate for Back propagation, Quick propagation, or Delta-bar-delta training.

Reinitialize network before training. This check box is selected by default. Small random weights are usually assigned to
a neural network before training. You may occasionally want to clear this box if you are subjecting a pre-existing network to
further training.

Track and restore best network. This check box is selected by default. When selected, STATISTICA keeps a copy of the
network with the lowest selection error during training, and this network is the final outcome of training, not the network
remaining at the end of the last epoch. You may occasionally want to clear this check box if you are convinced that the
training process for your problem domain is extremely stable so that the final network will be the best, or if you want to
observe training and are not concerned with getting the best possible network. Clearing this box speeds training (since
STATISTICA does not need to make a copy of the network each time an improvement is made).

Train Dot Product Layers - Start Tab


Select the Start tab of the Train Dot Product Layers dialog to access options that are applied at the start of iterative training -
specifically, to randomly assign initial weights.

Reinitialize dot product layers before training. This check box is selected by default. Small random weights are usually
assigned to a neural network before training. You may occasionally want to clear this check box if you are subjecting a
preexisting network to further training.

Initialization method. Specify exactly how the weights should be initialized at the beginning of training. The possible
values are:

Random uniform. The weights are initialized to a uniformly-distributed random value, within a range whose minimum and
maximum values are specified by the Minimum/Mean and Maximum/S.D. fields respectively.

Random Gaussian. The weights are initialized to a normally-distributed random value, with mean and standard deviation
specified by the Minimum/Mean and Maximum/S.D. fields respectively.

Control parameters for random initialization. There are two options in this group box:

Minimum/Mean. Range control for the Random uniform and Random Gaussian initialization methods. If the method is
Random uniform, this field gives the range minimum. If the method is Random Gaussian, this field gives the mean.

Maximum/S.D. Range control for Random uniform and Random Gaussian Initialization methods. If the method is Random
uniform, this field gives the range maximum. If the method is Random Gaussian, this field gives the standard deviation.

Train Dot Product Layers - End Tab


Select the End tab of the Train Dot Product Layers dialog to access options that control the end of training, and that are
applied at the end of training. Specifically, there are stopping conditions to determine if training should be terminated before
the full number of epochs has expired.

Track and restore best network. This check box is selected by default. When selected, STATISTICA Neural Networks
keeps a copy of the network with the lowest selection error during training, and this network is the final outcome of training,
not the network remaining at the end of the last epoch. You may occasionally want to clear this check box if you are
convinced that the training process for your problem domain is extremely stable so that the final network will be the best, or
if you want to observe training and are not concerned with getting the best possible network. Clearing this check box speeds
training (since STATISTICA Neural Networks does not need to make a copy of the network each time an improvement is
made).

Stopping conditions. Specify target error values here.

Target error (Training/Test). If the error on the training or selection subset drops below the given target value, the network
is considered to have trained sufficiently well, and training is terminated. The error never drops to zero or below, so the
default value of zero is equivalent to not having a target error.

Minimum improvement in error. Specify a minimum improvement (drop) in error that must be made.

Training/Test. If the rate of improvement drops below this level, training is terminated. The default value of zero implies
that training will be terminated if the error deteriorates. You can also specify a negative improvement rate, which is
equivalent to giving a maximum rate of deterioration that will be tolerated. The improvement is measured across a number
of epochs, called the "window" (see below).

Window. Specifies the number of epochs across which improvement is measured. Some algorithms, including back
propagation, show noise on the training and selection errors, and all the algorithms may show noise in the selection error. It
is therefore not usually a good idea to halt training on the basis of a failure to achieve the desired improvement in error rate
over a single epoch. The window specifies a number of epochs over which the error rates are monitored for improvement.
Training is only halted if the error fails to improve for that many epochs.
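The window mechanism can be sketched as follows (hypothetical function name; not the package's code):

```python
# Halt only if the error has failed to improve by at least
# min_improvement over the last `window` epochs (illustrative sketch).
def should_stop(errors, window, min_improvement=0.0):
    """errors: per-epoch error history, most recent value last."""
    if len(errors) <= window:
        return False
    improvement = errors[-window - 1] - errors[-1]
    return improvement < min_improvement

history = [1.0, 0.8, 0.7, 0.69, 0.69, 0.69]
print(should_stop(history, window=2, min_improvement=0.01))  # True: improvement stalled
```

Measuring the drop across the whole window, rather than epoch to epoch, is what makes the test robust to the noise described above.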

Train Dot Product Layers - Decay Tab


Select the Decay tab of the Train Dot Product Layers dialog to access options to specify the use of Weigend weight decay
regularization. This option encourages the development of smaller weights, which tends to reduce the problem of over-
fitting, thereby potentially improving generalization performance of the network, and also allowing you to prune the network
(see End tab). Weight decay works by modifying the network's error function to penalize large weights - the result is an error
function that compromises between performance and weight size. Consequently, too large a weight decay term may damage
network performance unacceptably, and experimentation is generally needed to determine an appropriate weight decay
factor for a particular problem domain.

Phase one decay factors/Phase two decay factors. Weight decay can be applied separately to the two phases of a two-
phase algorithm.

Apply weight decay regularization. Select this check box to enable weight decay regularization on a particular phase.

Decay factor. Specify the decay factor; see Weigend Weight Regularization for a mathematical definition of the decay
factor. A study of decay factors and weight pruning shows that the number of inputs and units pruned is approximately
proportional to the logarithm of the decay factor, and this should be borne in mind when altering the decay factor. For
example, if you use a decay factor of 0.001 and this has insufficient effect on the weights, you might try 0.01 next, rather
than 0.002; conversely, if the weights were over-adjusted resulting in too much pruning, try 0.0001.

Scale factor. A secondary factor in weight decay, which is usually left at the default value of 1.0; see Weigend Weight
Regularization for more details.
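
For illustration, the commonly cited Weigend weight-elimination penalty can be sketched as follows, with the decay factor and scale factor playing the roles described above (see Weigend Weight Regularization for the exact definition used by STATISTICA):

```python
import numpy as np

def weigend_penalty(weights, decay=0.001, scale=1.0):
    """Weigend-style weight-elimination penalty added to the network error.

    Weights that are small relative to `scale` contribute roughly
    (w/scale)**2, while large weights saturate toward 1, so the penalty
    approximates (decay times) a count of 'large' weights.
    """
    w2 = (np.asarray(weights, dtype=float) / scale) ** 2
    return decay * float(np.sum(w2 / (1.0 + w2)))

def penalized_error(error, weights, decay=0.001, scale=1.0):
    # Training minimizes this compromise between performance (error)
    # and weight size (penalty); too large a decay factor lets the
    # penalty dominate and damages performance.
    return error + weigend_penalty(weights, decay, scale)
```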

Train Dot Product Layers - Interactive Tab


Select the Interactive tab of the Train Dot Product Layers dialog to access options relating to interactive training. In
interactive training, a special dialog is displayed during training showing the progress of the error function (see: Training in
Progress), which includes options to stop training early, and to extend a training run with additional epochs of the same or
different training algorithms.

Interactive training (display graph during training). Select this check box to enable interactive training. When you click
the OK button to execute the training algorithm, the Training in Progress dialog will be displayed.

Keep lines from previous training runs on graph (for comparison). If this check box is selected, the progress graph is not
cleared when training starts, allowing you to compare the results of your training run with previous runs.

Automatically close graph window when training finishes. By default, when interactive training finishes, the Training in
Progress dialog stays visible, so that you can extend training, or choose to transfer the graph to the results workbook. If this
check box is selected, the progress dialog is closed automatically when training finishes. Select this check box if you want to
see the graph for purposes of instantaneous feedback but do not want to extend training.

Error lines plotted. Specify which lines are plotted on the progress graph. By default the Training and selection errors are
both plotted; however, you can choose to plot only one, to reduce clutter.

Sampling interval (in epochs). Specify how often (in epochs) the error should be plotted on the graph. Specify a sampling
interval greater than one if you are running for a large number of epochs, to reduce memory requirements and the drawing
time for the graph.

Train Dot Product Layers - BP(1)/(2) Tab


Select the BP tab of the Train Dot Product Layers dialog to access additional options for Back Propagation training There
are separate tabs for the first and second phase algorithms (either or both of which might be Back Propagation); the
appropriate tab is available only if the algorithm is in use for that phase.

Adjust learning rate and momentum each epoch. By default, STATISTICA Neural Networks uses a fixed learning rate and
momentum throughout training. Some authors recommend altering these rates on each epoch (specifically, by reducing the
learning rate; this is often counter-balanced by increasing momentum). See Back Propagation for more details.

Learning rate. The learning rate used to adjust the weights. A higher learning rate can converge more quickly, but can also
exhibit greater instability. Values of 0.1 or lower are reasonably conservative; higher rates are tolerable on some problems,
but not on all (especially on regression problems, where a higher rate can cause catastrophic divergence of the weights).

If the learning rate is being adjusted on each epoch, both fields are enabled, and give the Initial and Final learning rates.

If a fixed learning rate is being used, the first field specifies this, and the second is disabled.

Momentum. Momentum is used to compensate for slow convergence if weight adjustments are consistently in one
direction; the adjustment "picks up speed." Momentum usually increases the speed of convergence of Back Propagation
considerably, and a higher rate may allow you to decrease the learning rate to increase stability without sacrificing much in
the way of convergence speed.

Shuffle presentation order of cases each epoch. STATISTICA Neural Networks uses the on-line version of Back
Propagation, which adjusts the weights of the network as each training case is presented (rather than the batch approach,
which calculates an average adjustment across all training cases, and applies a single adjustment at the end of the epoch). If
the Shuffle check box is selected, the order of presentation is adjusted each epoch. This makes the algorithm somewhat less
prone to stick in local minima, and partially accounts for Back Propagation's greater robustness than the more advanced
second-order training algorithms in this respect.

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. The standard deviation of the Gaussian noise added to the target output during training.

file:///C:/Users/CERIn%2047/AppData/Local/Temp/~hhDEAB.htm 09/10/2018
Neural Networks Página 107 de 220

Train Dot Product Layers - QP (1)/(2) Tab


Select the QP tab of the Train Dot Product Layers dialog to access additional options for Quick Propagation training. There are
separate tabs for the first and second phase algorithms (either or both of which might be Quick Propagation); the appropriate
tab is only available if the algorithm is in use for that phase.

Learning Rate. The initial learning rate, applied in the first epoch; subsequently, the quick propagation algorithm
determines weight changes independently for each weight.

Acceleration. This gives the maximum rate of geometric increase in the weight change which is permitted. For example, an
acceleration of two will permit the weight change to no more than double on each epoch. This prevents numerical
difficulties otherwise caused by non-concave error surfaces.
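
The acceleration cap can be illustrated with a per-weight Quick Propagation step (a sketch of the standard algorithm, not STATISTICA's implementation):

```python
def quickprop_step(prev_update, grad, prev_grad,
                   acceleration=2.0, learning_rate=0.01):
    """Quick propagation update for a single weight.

    The error surface is modeled as a parabola from two successive
    gradients; the acceleration factor caps the geometric growth of the
    step, so non-concave error surfaces cannot cause runaway changes.
    """
    if prev_update == 0.0 or prev_grad == grad:
        # First epoch (or degenerate case): fall back to a gradient step
        # using the initial learning rate.
        return -learning_rate * grad
    step = prev_update * grad / (prev_grad - grad)
    # With acceleration=2, the weight change can at most double per epoch.
    limit = acceleration * abs(prev_update)
    return max(-limit, min(limit, step))
```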

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. The standard deviation of the Gaussian noise added to the target output during training.

Train Dot Product Layers - DBD (1)/(2) Tab


Select the DBD tab of the Train Dot Product Layers dialog to access additional options for Delta-bar-Delta training. There
are separate tabs for the first and second phase; the appropriate tab is available only if the algorithm is in use for that phase.

Initial. The initial learning rate used for all weights on the first epoch. Subsequently, each weight develops its own learning
rate.

Increment. The linear increment added to a weight's learning rate if the slope remains in a consistent direction.

Decay. The geometric decay factor used to reduce a weight's learning rate if the slope changes direction.

Smoothing. The smoothing coefficient is used to update the bar-Delta smoothed gradient. It must lie in the range [0,1).

If the smoothing factor is high, the bar-Delta value is updated only slowly to take into account changes in gradient. On a
noisy error surface this allows the algorithm to maintain a high learning rate consistent with the underlying gradient;
however, it may also lead to overshoot of minima, especially on an already-smooth error surface.
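
These parameters interact as sketched below for a single weight (an illustrative Python sketch of the standard Delta-bar-Delta rule, not STATISTICA's implementation):

```python
def delta_bar_delta_step(lr, bar_delta, grad,
                         increment=0.01, decay=0.5, smoothing=0.7):
    """Update one weight's learning rate under Delta-bar-Delta.

    bar_delta is the smoothed gradient from previous epochs. If the
    current gradient agrees in sign with it, the learning rate grows
    linearly; if it disagrees (the slope changed direction), the rate
    decays geometrically.
    """
    if bar_delta * grad > 0:
        lr = lr + increment   # consistent direction: linear increment
    elif bar_delta * grad < 0:
        lr = lr * decay       # direction change: geometric decay
    # Smoothing in [0,1): a high value updates bar_delta only slowly,
    # tracking the underlying gradient on a noisy error surface.
    bar_delta = (1.0 - smoothing) * grad + smoothing * bar_delta
    return lr, bar_delta
```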

Add Gaussian noise. If this check box is selected, Gaussian noise of the given deviation is added to the target output value
on each case. This is another regularization technique, which can reduce the tendency of the network to overfit. The best
level of noise is problem dependent, and must be determined by experimentation.

Deviation. The standard deviation of the Gaussian noise added to the target output during training.

Training in Progress
Training in Progress Dialog

Training in Progress - Graph Tab

Training in Progress - Dynamic Options Tab

Training in Progress - Static Options Tab

Training in Progress Dialog


Select the Interactive training (display graph during training) check box on the Interactive tab of the training dialog to
display the Training in Progress dialog. This dialog contains three tabs: Graph, Dynamic Options, and Static Options.

This dialog displays a graph of the training and/or selection subset error rate, and updates this graph in real-time (see the
Graph tab). You can interrupt training at any point (either finishing training and accepting the network, or canceling the
training run), and can extend the training run with further epochs of training. You can also specify the graph title, legend,
and line labels (see the Dynamic Options tab), and output it as a results graph.

A progress bar and the time elapsed in training are displayed at the lower-right of the dialog. The time field displays the
amount of CPU time taken by the algorithm.

Finish. While the algorithm is running, the Finish button is displayed. Click this button at any point to stop training,
indicating that training has completed successfully (e.g. if you can see from the graph that the algorithm has converged
adequately). There will be a short delay while the training algorithm completes the epoch it is currently running. When the
algorithm finishes, the Finish button is replaced by the OK button (no graph symbol).

Click this OK button to indicate that the algorithm has finished successfully. The Results dialog is then displayed.

OK. The OK button (graph symbol) is enabled when the algorithm finishes. Click this button to indicate that the algorithm
has finished successfully, and to transfer a copy of the graph to a STATISTICA graph.

Extend. Click this button at any point (while the algorithm is running, or after it completes) to specify further epochs, and/or
to change the training algorithm parameters. A special extended training dialog appropriate to the network type is displayed.
When OK is clicked on that dialog control is transferred back to the Training in Progress dialog.

Cancel. Click the Cancel button to terminate training (if still in progress) and return to the Training dialog that launched the
training algorithm. When the Cancel button is clicked, a Cancel Training dialog is displayed asking for confirmation that
you want to cancel training. Click OK on the Cancel Training dialog to confirm the action, or click Cancel to return to the
Training in Progress dialog.

Training in Progress - Graph Tab


Select the Graph tab of the Training in Progress dialog to display a graph of the error function per epoch on the training
and/or selection subsets.

The graph can be displayed in one of two modes - a scrolling graph, which shows a limited number of epochs at any one
time, or a fixed width graph (typically just wide enough to reach the last epoch specified by your training algorithm).

The graph can be configured either to show only details of the current training run (clearing the results of previous runs) or
to retain previous runs so that you can compare results.

While training is in progress, you can scroll back and forth through the real-time graph(s) by using the double arrows shown
under the graph area.

Training in Progress - Dynamic Options Tab


Select the Dynamic Options tab of the Training in Progress dialog to access options that have an immediate effect on the
graph.

X Dimension plotting (Epochs). Specify the style of plotting used. The available styles are:

Fixed width graph. The graph is set to the Minimum and Maximum extents given, and remains at that width until told
otherwise. By default the X dimension extends from 0 to the last epoch in the training run. You can explicitly change the
dimension; for example, if you want to view a section of the graph in greater detail.

Scrolling graph. The graph scrolls automatically in the X dimension (this type of graph is sometimes called a "strip graph").
You can specify how many samples are visible on the graph at any one time. Once the current line reaches the right side of
the graph, the entire graph is scrolled to the left by a fixed amount, which you can also specify. A scrolling graph allows you
to view the development of the error rate in resolvable detail even if conducting an extremely long training run.

Y Dimension plotting (Error). Specify the range of the graph in the Y dimension. By default, this is [0,1] (the range of
STATISTICA Neural Networks' usual error functions). You may want to change this in order to display more detail when a
training run is nearly converged (so that the error is fluctuating by only a small amount) or, on rare occasions, to view the
error when it exceeds the usual [0,1] range (e.g. if using SOFM networks with unnormalized codebook vectors).

Graph labels.

Title. Specify a title for the training graph. This is not actually displayed on the interactive graph, but if you transfer the
graph to a results graph, the title is used in that graph.

Legend. Specify a title for the graph legend.

Training label, Selection label. Specify labels for the current training and/or selection error lines. By default these are
assigned the prefix "T." or "S.", followed by the number of the line.

Training in Progress - Static Options Tab


Select the Static Options tab of the Training in Progress dialog to access options that affect the plotting of future graphs, but
not the current one. They are all replicated on the Interactive tab of the training dialog that launched the interactive training
run.

Keep lines from previous training runs. If this check box is selected, the progress graph is not cleared when training starts,
allowing you to compare the results of your training run with previous runs.

Close window when training finishes. By default, when interactive training finishes, the Training in Progress dialog stays
visible so that you can extend training, or choose to transfer the graph to the results workbook. If this check box is selected,
the progress dialog is closed automatically when training finishes. Select this check box if you want to see the graph for
purposes of instantaneous feedback but do not want to extend training.

Error lines plotted. Specify which lines are to be plotted on the progress graph. By default, the Training and selection error
are both plotted; however, you can choose to plot only one to reduce clutter.

Sampling interval (in epochs). Specify how often (in epochs) the error should be plotted on the graph. Specify a sampling
interval greater than one if you are running for a large number of epochs, to reduce memory requirements and the drawing
time for the graph.

Sampling of Case Subsets for Training


Sampling of Case Subsets for Training Dialog

Sampling of Case Subsets for Training - Quick Tab

Sampling of Case Subsets for Training - Advanced Tab

Sampling of Case Subsets for Training - Cross Validation Tab

Sampling of Case Subsets for Training - Bootstrap Tab

Sampling of Case Subsets for Training - Random Tab

Sampling of Case Subsets for Training Dialog


Click the Sampling button on various Neural Networks training dialogs to display the Sampling of Case Subsets for Training
dialog. This dialog can contain up to three tabs at a time: Quick, Advanced, Cross Validation, Bootstrap, or Random. Use the
options on these tabs to control the sampling of cases for the training, selection, and test subsets. See Resampling for a
detailed overview of sampling issues. In addition, this dialog allows you to specify repeated sampling experiments, training
multiple networks using different samples and optionally combining them into an ensemble.

Subset assignment (to train, select, and test subsets) can be random, or based upon a data set variable or preexisting neural
network. In addition, you can exclude cases containing missing values, and/or explicitly select cases to be considered.

If you decide to perform multiple training runs with resampling, you can specify that the resampling is random (and you can
choose to fix some of the subsets), or you can use the specialized resampling algorithms, cross validation, and bootstrap.

OK. Click the OK button to confirm that the current sampling options should be used in training.

Cancel. Click the Cancel button to close this dialog and return to the previous dialog.

Select Cases. Click this button to display the Select Cases dialog, which is used to specify a range of cases to be distributed
among the training, selection, and test subsets. Any cases not included in the Select Cases dialog are excluded from
consideration, even if you specify an explicit option such as From subset variable or As existing network.

MD Selection. These options specify how cases with missing values should be handled during training. If Casewise is
selected, any cases with missing values are excluded from consideration during training. This is preferable, unless you are
severely restricted in the number of cases available for training. In that case, you can select Mean substitution, which implies
that cases with missing values are used, and the network's missing value substitution procedure will be employed to "patch"
those cases during training. The default missing value procedure substitutes the training set sample mean for a numeric

file:///C:/Users/CERIn%2047/AppData/Local/Temp/~hhDEAB.htm 09/10/2018
Neural Networks Página 110 de 220

variable (the sample frequency for a nominal variable); you can select an alternative substitution procedure on the Network
Editor dialog.

Sampling of Case Subsets for Training - Quick Tab


Select the Quick tab of the Sampling of Case Subsets for Training dialog to access options for a single training run (i.e. no
resampling).

Assignment of subsets. You can specify that the subsets be determined by one of three methods: by using a subset variable
(defined below); using the same distribution as a selected preexisting network; or randomly. The first two options are only
available if there is a suitable subset variable in the data set, or a suitable network.

From subset variable. A subset variable is a variable in the data set that can be used to indicate the case subsets. It must be
a nominal variable with values taken from the set {"Train","Select","Test","Ignore"}. The variable can have any name, but
the STATISTICA Neural Networks convention is to give such variables the prefix NNSET. The easiest way to generate such
a variable is to export the subset assignment of a randomly assigned network to a results spreadsheet (by including Subset in
the Also in spreadsheet options on the Predictions tab of the Results dialog), and then to copy and paste this to the data set.

If there are no subset variables available, the drop-down list is disabled; otherwise, you can select from the list of available
variables.

As existing network. Each neural network in STATISTICA Neural Networks records the case division used to train it. You
can select from the existing networks, indicating that the new network should use the same case division - this is useful if
you want to compare network performance on a "level playing field." If there are no existing networks, or if the networks
were imported from an older version of STATISTICA Neural Networks that does not record case subsets, the As existing
network drop-down list is disabled.

Shuffle train and select subsets. Select this check box if you want to maintain the same test set selection as the given subset
variable or existing network, but want to resample the train and select subsets. This allows you to conduct resampling
experiments while maintaining a single test set to allow comparison of results.

Subset sizes.

Training/Selection/Test

If you choose to derive the case subsets from a subset variable or an existing network, the size of the subsets (or, at least, of
the test subset) is also derived, and the subset size options are disabled. However, if you are randomly determining subsets,
or resampling training and selection subsets, you can select the subset sizes.

You can specify the subset sizes in two ways: by giving the exact numbers of cases, or by giving relative proportions.

If you want to set the numbers exactly, simply enter the numbers you want. To help you in this process, the remainder
available is displayed (in blue, if a positive remainder; in red, if you have specified more cases than are available). You can
assign the remainder to a given subset by double-clicking on the corresponding field.

If you want to set the numbers using proportions, simply enter the proportions into the fields. For example, to split the data
3:1:1 between training, selection, and test, enter 3,1, and 1 into the three fields respectively. When you click OK,
STATISTICA Neural Networks will assign the cases in the given proportions.
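
Proportional assignment can be sketched as follows (illustrative only; STATISTICA's exact rounding behavior may differ):

```python
def split_by_proportions(n_cases, proportions=(3, 1, 1)):
    """Divide n_cases among training/selection/test subsets in the
    given relative proportions, e.g. 3:1:1."""
    total = sum(proportions)
    sizes = [n_cases * p // total for p in proportions]
    # Give any rounding remainder to the training subset.
    sizes[0] += n_cases - sum(sizes)
    return tuple(sizes)
```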

Sampling of Case Subsets for Training - Advanced Tab


Select the Advanced tab of the Sampling of Case Subsets for Training dialog to access the options described here.

Sampling method. Specify how sampling should be applied. The options are:

Create simple network (parameters on "Quick" tab). Only a single network is created, and the subset sampling is
controlled by the options on the Quick tab. If this option is not selected, the Quick tab is overridden and all options on that
tab disabled.

Cross validated resampling. The cross validation technique is used for resampling. Further parameters specific to cross
validation are available on the Cross Validation tab.

Bootstrap resampling. Select this option button to use the bootstrap technique for resampling. Further parameters, specific
to bootstrap resampling, are available on the Bootstrap tab.

Random resampling. When this option button is selected, the subsets are randomly resampled for each network created.
The control parameters are available on the Random tab.

Resampling.

Number of samples. For all methods other than Create simple network, this field specifies the number of networks created.

Form an ensemble. If you are creating multiple networks by any of the resampling techniques, this option forms them into
an ensemble, which generates a single prediction by averaging or voting among the predictions of the members. Ensembles
can have substantially improved generalization performance compared with their members; see Ensembles for further
details.
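
Output averaging and voting can be sketched as follows (an illustration of the general technique; see Ensembles for how STATISTICA combines members):

```python
import numpy as np

def ensemble_predict(member_outputs, task="regression"):
    """Combine member network predictions into one ensemble prediction.

    Regression ensembles average member outputs; classification
    ensembles vote, returning the majority class label.
    """
    if task == "regression":
        return float(np.mean(member_outputs))
    # Voting: the most frequent class label wins.
    labels, counts = np.unique(member_outputs, return_counts=True)
    return labels[np.argmax(counts)]
```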

Sampling of Case Subsets for Training - Cross Validation Tab


The Cross Validation tab of the Sampling of Case Subsets for Training dialog is displayed only when the Cross validated
resampling option button is selected as the Sampling method on the Advanced tab. Select this tab to access options specific
to Cross Validated resampling. The cross validation is N-fold, where N is the number of samples taken. The available data is
divided into N parts, and one part is assigned to the test set on each sample. The remainder is divided among the Training
and Selection subsets, and you can specify how many are put in each. Many authors, when using cross validation, do not use
a selection set at all, on the basis that any bias contributed by a particular network can be compensated for by averaging
predictions across the networks in an ensemble. However, it is probably still advisable to take some steps to alleviate over-
learning, which might include use of a selection set, weight decay, or stopping conditions determined experimentally to be
reasonable for the problem domain.

Number of training and selection cases. Specify the number of the available cases assigned to the Training and Selection
sets respectively. The total is constrained to equal the number of cases available.
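
The N-fold assignment described above can be sketched in Python (illustrative only):

```python
import numpy as np

def cross_validation_splits(n_cases, n_folds, n_train, seed=0):
    """N-fold cross-validation subsets: the data is divided into
    n_folds parts, each fold in turn becomes the test subset, and the
    remainder is split into n_train training cases with the rest
    forming the selection subset."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cases)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        test = folds[i]
        rest = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield rest[:n_train], rest[n_train:], test
```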

Sampling of Case Subsets for Training - Bootstrap Tab


The Bootstrap tab of the Sampling of Case Subsets for Training dialog is displayed only when the Bootstrap resampling
option button is selected as the Sampling method on the Advanced tab. Select this tab to access options specific to Bootstrap
resampling. In bootstrap resampling, the training subset is constructed by sampling with replacement (i.e. a given case can
be sampled multiple times) from the available data. As with cross validation, most authors do not consider the issue of
constructing a selection subset when using the bootstrap. STATISTICA Neural Networks samples the selection subset first
(without using bootstrap; i.e. the selection subset is sampled without replacement); then, the training subset is bootstrapped
from the remaining data. By default the training subset size is set to the number of available data points minus the selection
subset size, but since sampling with replacement is used the training subset can actually be any size you choose.

The test subset is formed from any cases that are left over by the bootstrap.

Size of bootstrapped training set.

Training. Specify the number of cases to be bootstrapped into the training subset. The number of cases that are available
(i.e. not part of the selection subset) is displayed to the right of this field.

Selection. Specify the number of cases in the selection subset.
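
The sampling order described above (selection subset first, without replacement; training subset bootstrapped from the rest; leftovers forming the test subset) can be sketched as:

```python
import numpy as np

def bootstrap_subsets(n_cases, n_train, n_selection, seed=0):
    """Bootstrap sampling: selection drawn without replacement, training
    bootstrapped (with replacement) from the remaining cases, and the
    test subset formed from cases never drawn by the bootstrap."""
    rng = np.random.default_rng(seed)
    all_cases = rng.permutation(n_cases)
    selection = all_cases[:n_selection]
    remaining = all_cases[n_selection:]
    # Sampling with replacement: a case can appear several times, so the
    # training subset can be any size you choose.
    training = rng.choice(remaining, size=n_train, replace=True)
    test = np.setdiff1d(remaining, training)
    return training, selection, test
```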

Sampling of Case Subsets for Training - Random Tab


The Random tab of the Sampling of Case Subsets for Training dialog is displayed only when the Random resampling option
button is selected as the Sampling method on the Advanced tab. Select this tab to access options supporting random (Monte
Carlo) resampling for multiple networks, and options allowing multiple networks to be created using the same subset split.
You can specify that resampling does not happen at all, or that the test set is held fixed while the training and selection
subsets are resampled, or that all three subsets are resampled. You can also specify how the nonresampled subsets are
assigned (randomly, or the same as a subset variable or preexisting network), and how large the resampled subsets should
be.

Subset selection methods. Specify whether resampling should be performed at all (select Fix all subsets), and if it is,
whether the test subset should be held fixed while the training and selection subsets are resampled, or whether all three
subsets should be resampled. Keeping a consistent test subset makes it easier to compare the performance of the networks.

Assignment of fixed subsets. If you have decided to fix (i.e. not resample) either all three subsets or the test subset alone,
you can specify how the fixed subsets should be assigned. The options are:

From subset variable. A subset variable is a variable in the data set that can be used to indicate the case subsets. It must be
a nominal variable with values taken from the set {"Train","Select","Test","Ignore"}. If there are no subset variables
available, the drop-down list is disabled; otherwise, you can select from the list of available variables.

As existing network. Each neural network in STATISTICA Neural Networks records the case division used to train it. You
can select from the existing networks, indicating that the new network should use the same case division. If there are no
existing networks with case divisions, the As existing network drop-down list is disabled.

Random (once, before generating any samples). The subsets are assigned randomly from the available cases, in the
numbers displayed in the fields below. If you've selected to fix the test subset but to resample training and selection, then the
test subset is assigned once at the beginning of training, but fixed thereafter, whereas the training and selection subsets are
resampled from among the balance of cases each time a network is created.

Subset sizes. Specify the size of any subsets that are being randomly assigned. Depending on the resampling options
selected above, the Test field, or all three fields, may be disabled, indicating that the number of cases in the appropriate
subsets are derived and cannot be altered.

You can specify the subset sizes in two ways: by specifying the exact numbers of cases, or by specifying relative
proportions.

If you want to set the numbers exactly, enter the numbers you want. To help you in this process, the remainder available is
displayed (in blue, if a positive remainder; in red, if you have specified more cases than are available). You can assign the
remainder to a given subset by double-clicking on the corresponding field.

If you want to set the numbers using proportions, enter the proportions into the fields. For example, to split the data 3:1:1
between training, selection, and test, enter 3, 1, and 1 into the three fields respectively. When you click OK, STATISTICA
Neural Networks will assign the cases in the given proportions.

Select Cases
Select Cases

Select Cases

Click the Select cases button on the Results dialog to display the Select Cases dialog. This dialog contains one tab: Select
Cases. Use the options on this dialog to specify which cases should be used, either for training or execution. By default, all
the cases in the data set are used. However, you can specify a range that excludes some cases, or specify a nominal variable
in the data set that indicates which cases to use.

Case selection. This group box contains three options:

Specified cases. Select this option button to explicitly select a range of cases. You can enter individual numbers separated
by spaces (e.g. 1 2 5) and/or inclusive ranges of cases using a hyphen (e.g. 1-30), also separated by spaces.
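
The case-selection syntax can be illustrated with a small parser (hypothetical, for illustration only):

```python
def parse_case_selection(spec):
    """Parse a case-selection string such as "1 2 5" or "1-30 40"
    into a sorted list of case numbers (ranges are inclusive)."""
    cases = set()
    for token in spec.split():
        if "-" in token:
            lo, hi = token.split("-")
            cases.update(range(int(lo), int(hi) + 1))
        else:
            cases.add(int(token))
    return sorted(cases)
```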

From data set variable/when value is. If this option button is selected, you can select a nominal variable (from the drop-
down list to the right), and one of the nominals of the variable (from the when value is drop-down list). The case selection is
then set to all cases with that value for that nominal variable. This option allows you to construct arbitrary splits of the data
set and store them permanently with the data set.

OK. Click the OK button to accept the specifications on this dialog and return to the previous dialog.

Cancel. Click the Cancel button to return to the previous dialog. Any changes made on the Select Cases dialog will be
ignored.

Select all. Click this button to select all cases in the data set.

Sampling of Case Subsets for Intelligent Problem Solver


Sampling of Case Subsets for Intelligent Problem Solver Dialog

Sampling of Case Subsets for Intelligent Problem Solver - Quick Tab

Sampling of Case Subsets for Intelligent Problem Solver - Random Tab

Sampling of Case Subsets for Intelligent Problem Solver - Bootstrap Tab

Sampling of Case Subsets for Intelligent Problem Solver


Click the Sampling button on the Intelligent Problem Solver dialog to display the Sampling of Case Subsets for Intelligent
Problem Solver dialog. This dialog can contain up to three tabs: Quick, Random, and Bootstrap. Use the options on these tabs
to control the sampling of cases for the training, selection, and test subsets when using the Intelligent Problem Solver. See
Resampling for a detailed overview of sampling issues.

Each network created and tested by the Intelligent Problem Solver has associated with it subsets for training, selection, and
test; these may be the same for all networks tested, or may differ.

The assignment of the cases to the subsets can be random, or based upon a special subset variable in the data set, or the same
as that used by a preexisting neural network. In addition, you can exclude cases containing missing values, or explicitly
select cases to be considered. You can also use bootstrap sampled training sets.

OK. Click the OK button to confirm that the current sampling options should be used by the Intelligent Problem Solver.

Cancel. Click the Cancel button to close this dialog and return to the Intelligent Problem Solver dialog.

Select cases. Click the Select cases button to display the Select Cases dialog, which is used to specify a range of cases to be
distributed among the training, selection, and test subsets. Any cases not included in the Select Cases dialog are excluded
from consideration, even if you specify an explicit option such as From Subset variable or As existing network on the
Random tab.

MD Deletion. The options in this group box specify how cases with missing values should be handled during training. If
Casewise is selected, any cases with missing values are excluded from consideration during training. This is a preferable
approach, unless you are severely restricted in the number of cases available for training. In that case, you may select Mean
substitution. Mean substitution implies that cases with missing values are used, and the network's missing value substitution
procedure will be employed to "patch" those cases during training. The default missing value procedure substitutes the
training set sample mean for a numeric variable (the sample frequency for a nominal variable).
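The default substitution rule can be sketched as follows. This is an illustrative Python approximation, not SNN code; the nominal case is simplified here to the most frequent value, standing in for the frequency-based substitution mentioned above:

```python
from collections import Counter

def substitute_missing(column, is_nominal, missing=None):
    """Patch missing entries in one variable using training-set statistics.
    Numeric: training-sample mean. Nominal: most frequent value (a
    simplification of the frequency-based substitution described above)."""
    observed = [v for v in column if v is not missing]
    if is_nominal:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = sum(observed) / len(observed)
    return [fill if v is missing else v for v in column]
```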

Sampling of Case Subsets for Intelligent Problem Solver - Quick Tab
Select the Quick tab of the Sampling of Case Subsets for Intelligent Problem Solver dialog to access options for basic
random assignment of the subsets; the subsets can be resampled for each network created, or fixed for the entire Intelligent
Problem Solver (IPS) run. In addition, there are options to enable the more advanced resampling techniques available on the
Random and Bootstrap tab (accessible only if the corresponding option is selected on the Quick tab).

Selection of subsets for each network trained. Select from the following options:

Fix, in numbers given below (randomly assigned at beginning). If you select this option button, the train, select, and test
subsets are randomly assigned at the beginning of the IPS run in the numbers specified in the Training, Selection, and Test
fields at the bottom of the tab. The same case division is then used for every network tested.

Resample, in numbers given below (randomly assign each network). If you select this option button, the subsets are
randomly assigned in the numbers specified in the Training, Selection, and Test fields each time a network is created and
tested.

Advanced random resampling (see Random tab). If you select this option button, the subset division is performed as
specified on the Random tab.

Bootstrap resampling (see Bootstrap tab). If you select this option button, Bootstrap resampling is performed; options are
on the Bootstrap tab.

Subset sizes (Training/Selection/Test). These three fields display the sizes of the three subsets.

You can specify the subset sizes in two ways: by giving the exact numbers of cases, or by giving relative proportions.


If you want to set the numbers exactly, simply enter the numbers you want. To help you in this process, the remainder
available is displayed (in blue, if a positive remainder; in red, if you have specified more cases than are available). You can
assign the remainder to a given subset by double-clicking on the corresponding field.

If you want to set the numbers using proportions, enter the proportions into the fields. For example, to split the data 3:1:1
between Training, Selection, and Test, enter 3, 1, and 1 into the three fields respectively. When you click OK, STATISTICA
Neural Networks will assign the cases in the given proportions.
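The proportional assignment can be illustrated with a small Python sketch (a hypothetical helper, not SNN code); here, any remainder left after integer division is assigned to the training subset, as one plausible convention:

```python
import random

def split_by_proportion(n_cases, props, seed=0):
    """Randomly assign n_cases to subsets in the given proportions,
    e.g. props = (3, 1, 1) for a 3:1:1 train/selection/test split."""
    total = sum(props)
    sizes = [n_cases * p // total for p in props]
    sizes[0] += n_cases - sum(sizes)       # remainder goes to training
    indices = list(range(1, n_cases + 1))  # 1-based case numbers
    random.Random(seed).shuffle(indices)
    subsets, start = [], 0
    for size in sizes:
        subsets.append(sorted(indices[start:start + size]))
        start += size
    return subsets
```

With 100 cases and proportions (3, 1, 1), this produces subsets of 60, 20, and 20 cases.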

Sampling of Case Subsets for Intelligent Problem Solver - Random Tab
Select the Random tab of the Sampling of Case Subsets for Intelligent Problem Solver dialog to access options supporting random resampling for multiple networks, as well as options allowing multiple networks to be created using the same subset split.
You can specify that resampling does not happen at all, or that the test set is held fixed while the training and selection
subsets are resampled, or that all three subsets are resampled. You can also specify how the nonresampled subsets are
assigned (randomly, or the same as a subset variable or preexisting network), and how large the resampled subsets should
be. The options on the Quick tab are a subset of the options available on this tab.

Subset selection methods. Specify that resampling should not be performed (select the Fix all subsets option button), or if it
is, whether the test subset should be held fixed while the training and selection subsets are resampled, or whether all three
subsets should be resampled. Keeping a consistent test subset makes it easier to compare the performance of the networks.

Assignment of fixed subsets. If you have decided to fix (i.e. not resample) either all three subsets, or the test subset alone,
you can specify how the fixed subsets should be assigned. The options are:

From subset variable. A subset variable is a variable in the data set that can be used to indicate the case subsets. It must be
a nominal variable with values taken from the set {"Train","Select","Test","Ignore"}. If there are no subset variables
available, the drop-down list is disabled; otherwise, you can select from the list of available variables.

As existing network. Each neural network in STATISTICA Neural Networks records the case division used to train it. You
can select from the existing networks, indicating that the new network should use the same case division. If there are no
existing networks with case divisions, the existing network drop-down list is disabled.

Random (once, before generating any samples). The subsets are assigned randomly from the available cases, in the
numbers specified in the fields described below. If you've selected to fix the test subset, but to resample training and
selection, then the test subset is assigned once at the beginning of training, but fixed thereafter, whereas the training and
selection subsets are resampled from among the balance of cases each time a network is created.

Subset sizes. Specify the size of any subsets that are being randomly assigned. Depending on the resampling options
selected, the Test field or all three fields may be disabled, indicating that the number of cases in the appropriate subsets are
derived and cannot be altered.

You can specify the subset sizes in two ways: by entering the exact numbers of cases or relative proportions.

If you want to set the numbers exactly, enter the numbers you want. To help you in this process, the remainder available is
displayed (in blue, if a positive remainder; in red, if you have specified more cases than are available). You can assign the
remainder to a given subset by double-clicking on the corresponding field.

If you want to set the numbers using proportions, enter the proportions into the fields. For example, to split the data 3:1:1
between Training, Selection, and Test, enter 3, 1, and 1 into the three fields respectively. When you click OK, STATISTICA
Neural Networks will assign the cases in the given proportions.

Sampling of Case Subsets for Intelligent Problem Solver - Bootstrap Tab
The Bootstrap tab of the Sampling of Case Subsets for Intelligent Problem Solver dialog contains options specific to
bootstrap resampling. In bootstrap resampling, the training subset is constructed by sampling with replacement (i.e. a given
case can be sampled multiple times) from the available data. Most authors who recommend bootstrap sampling do not
discuss how to combine the technique with the use of a selection subset. STATISTICA Neural Networks samples the
selection subset first (without using bootstrap; i.e. the selection subset is sampled without replacement); then, the training
subset is bootstrapped from the remaining data. By default the training subset size is set to the number of available cases
minus the selection subset size, but since sampling with replacement is used, the training subset can actually be any size you
choose.


The test subset is formed from any cases left over after the bootstrap selection of the training subset.
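The sampling order described above can be sketched in Python (an illustrative approximation, not SNN's implementation):

```python
import random

def bootstrap_subsets(n_cases, n_select, n_train, seed=0):
    """Sample subsets as described above: the selection subset is drawn
    first without replacement; the training subset is then bootstrapped
    (sampled with replacement) from the remaining cases; the test subset
    is whatever remains untouched by the bootstrap."""
    rng = random.Random(seed)
    cases = list(range(1, n_cases + 1))
    selection = rng.sample(cases, n_select)                # without replacement
    pool = [c for c in cases if c not in selection]
    training = [rng.choice(pool) for _ in range(n_train)]  # with replacement
    test = [c for c in pool if c not in set(training)]     # leftover cases
    return training, selection, test
```

Because the training subset is drawn with replacement, `n_train` can exceed the size of the remaining pool.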

Size of bootstrapped training set.

Training. Specify the number of cases to be bootstrapped into the training subset. The number of cases that are available
(i.e. not part of the selection subset) is displayed to the right of this field.

Selection. Specify the number of cases in the selection subset.

Sampling of Case Subsets for Feature Selection


Feature (Independent Variable) Selection Dialog

Feature Selection - Quick Tab

Feature Selection - Advanced Tab

Feature Selection - Genetic Algorithm Tab

Feature Selection - Interactive Tab

Feature Selection - Finish Tab

Feature Selection in Progress Dialog

Sampling of Case Subsets for Feature Selection Dialog

Feature (Independent Variable) Selection


Select Feature Selection on the Neural Networks Startup Panel - Advanced tab to display the Feature (Independent Variable)
Selection dialog. This dialog contains five tabs: Quick, Advanced, Genetic algorithm, Interactive, and Finish.

In many problem domains, a range of input variables are available that can be used to train a neural network, but it is not
clear which of them are most useful, or indeed are needed at all. The problem is further complicated when there are
interdependencies or correlations between some of the input variables, which means that any of a number of subsets might
be adequate.

To an extent, some neural network architectures (e.g., Multilayer Perceptrons) can actually learn to ignore useless variables.
However, other architectures (e.g., Radial Basis Functions) are adversely affected, and in all cases a larger number of inputs
implies that a larger number of training cases is required (as a rule of thumb, the number of training cases should be several times larger than the number of weights in the network) to prevent over-learning. As a consequence, the performance of
a network can be improved by reducing the number of inputs, even sometimes at the cost of losing some input information.

There are two possible approaches to this problem. One approach (feature extraction) is to retain all the original variables,
but to process them into a smaller number so as to retain as much information as possible. The technique commonly used for
this is Principal Components Analysis.

The second approach (feature selection) is to explicitly identify input variables that do not contribute significantly to the
performance of networks, and to remove them. The Intelligent Problem Solver does this automatically for you using a
variety of search, regularization, and sensitivity based techniques. If you use the Custom Network Designer, most of the
network-type specific training algorithms in STATISTICA Neural Networks allow you to specify the use of pruning
algorithms to remove unnecessary inputs - sensitivity-based pruning for all network types, and weight-based pruning for
some. Those procedures are very efficient, and are usually sufficient.

This dialog supports alternative feature selection algorithms that may sometimes prove a useful alternative. These algorithms
use a combination of Probabilistic or Generalized Regression neural networks and feature selection algorithms, either
stepwise algorithms that progressively add or remove variables, or genetic algorithms (Goldberg, 1989). These algorithms
can be relatively slow if you have a large number of cases; however, they can sometimes identify subsets of inputs that are
not discovered by other techniques.

Forward and backward stepwise feature selection algorithms work by adding or removing variables one at a time. Forward
selection begins by locating the single input variable that, on its own, best predicts the output variable. It then checks for a
second variable that, added to the first, most improves the model, repeating this process until either all variables have been
selected, or no further improvement is made. Backward stepwise feature selection is the reverse process - it starts with a
model including all variables, and then removes them one at a time, at each stage finding the variable that, when it is removed, least degrades the model.

Forward and backward selection each have their advantages and disadvantages. Forward selection is generally faster.
However, it may miss key variables if they are interdependent (that is, where two or more variables must be added at the
same time in order to improve the model). Backward selection does not suffer from this problem, but as it starts with the
whole set of variables, the initial evaluations are most time consuming. Furthermore, the model can actually suffer purely
from the number of variables, making it difficult for the algorithm to behave sensibly if there are a large number of
variables, especially if there are only a few weakly predictive ones in the set. In contrast, because it selects only a few
variables initially, forward selection can succeed in this situation. Forward selection is also much faster if there are few
relevant variables, as it will locate them at the beginning of its search, whereas backwards selection will not whittle away the
irrelevant ones until the very end of its search.

On the whole, backward selection is to be preferred if there are a small number of variables (say, twenty or less), and
forward selection may be better for larger numbers.
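Forward selection as described can be sketched as a greedy loop. The `evaluate` callback below is a stand-in for training and testing a PNN/GRNN on the candidate subset; the function names and the stopping rule shown are illustrative assumptions:

```python
def forward_selection(candidates, evaluate):
    """Greedy forward selection. `evaluate(subset)` returns an error to
    minimize (in SNN this would be the selection error of a PNN/GRNN);
    here it is an arbitrary caller-supplied function."""
    selected, best_err = [], float("inf")
    improved = True
    while improved and len(selected) < len(candidates):
        improved = False
        for var in candidates:
            if var in selected:
                continue
            err = evaluate(selected + [var])
            if err < best_err:
                best_err, best_var, improved = err, var, True
        if improved:
            selected.append(best_var)
    return selected, best_err
```

Backward selection is the mirror image: start with all candidates and repeatedly remove the variable whose removal least increases the error.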

Another alternative available in STATISTICA Neural Networks is the genetic algorithm. This is an optimization algorithm
that can search efficiently for binary strings. In this case, the binary strings represent masks: the masks determine which
input variables should be used in constructing the neural networks, with a 0 indicating that a variable should not be used, and
a 1 indicating that it should. For example, if there are six original input variables, the masking string 001101 indicates that
the first, second, and fifth variables should be discarded, and the third, fourth, and sixth ones kept.

The genetic algorithm randomly creates a population of such strings, and then uses a process analogous to natural selection
to select superior strings, which are "bred" together to form a new population. Over a period of generations, successively
better strings are produced. Eventually, the best member of the final generation is selected and subjected to bit-by-bit
improvement (in case the genetic algorithm has failed to choose the best settings for some individual bits). Genetic
algorithms are well suited for feature selection as they are very good at recognizing subsets of interrelated bits (in this case,
correlated or mutually required inputs). The time requirements are high, but more-or-less unaffected by the number of
variables, whereas forward and backwards selection have time requirements proportional to the square of the number of
variables. The genetic algorithm is therefore a good alternative where there are large numbers of variables (more than fifty,
say), and also provides a valuable "second or third opinion" for smaller numbers of variables.
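The mask interpretation can be shown in a one-line Python helper (illustrative, not SNN code), using the six-variable example from the text:

```python
def apply_mask(mask, variables):
    """Keep the variables whose mask bit is '1'. With the example from
    the text, mask "001101" over six variables keeps the 3rd, 4th, and
    6th ones and discards the rest."""
    return [v for bit, v in zip(mask, variables) if bit == "1"]
```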

Note: Genetic algorithms are particularly good at spotting interdependencies (called epistasis) between variables located
close together on the masking strings. Consequently, if you have reason to suspect that variables are closely interrelated, you
should place them in adjacent columns in the data set spreadsheet before running the feature selection algorithm.

All the feature selection algorithms evaluate feature selection masks. These are used to select the input variables for a new
training set, and a Probabilistic or Generalized Regression Neural Network (PNN or GRNN) is tested on this training set.
These forms of network are used for several reasons:

• They usually train extremely quickly, making the large number of evaluations required by the feature selection algorithm feasible. However, this fast training time is dependent on the number of cases not being too excessive (the training time is related to the square of the number of cases). You can, therefore, set a sampling rate - if this is less than one, the algorithm uses a sub-sample of the available cases for each evaluation. A sensible approach is to experiment by running the algorithm for a single generation with a low sampling rate, and assess how long it takes before fixing on a sampling figure for the final run.

• They are capable of modeling nonlinear functions quite accurately.

• Like radial basis function networks, they are relatively sensitive to the inclusion of irrelevant input variables. This is actually an advantage when trying to decide whether input variables are required.

STATISTICA Neural Networks automatically chooses whether to use a PNN or GRNN, based on the type of output variables
in the current data set.

Although irrelevant inputs will typically increase the network error, they may do so only by a small amount, making it
difficult for the algorithm to locate all of them. Moreover, as stated earlier, it can be advantageous to use a smaller number
of inputs even if this causes a slight deterioration in the error, as the generalization performance may then be improved. In
STATISTICA Neural Networks, you can specify a Unit Penalty factor. This is multiplied by the number of selected inputs in
a mask and added to the error, thus favoring smaller networks. Note, however, that it is not necessary to specify a unit
penalty - with a penalty of zero, only clearly irrelevant inputs will be ignored.

You can specify that ongoing results from the algorithm are displayed in a special progress window as the algorithm
progresses. This window displays a spreadsheet showing the selection of variables, and the corresponding error value. You
can specify whether to display all combinations tested, just the best of each epoch, or just improvements over previous
epochs.

You can also specify whether the case subsets (training, selection, and test) should be resampled for each model, and if so, how. If resampling is used, there is inevitably some noise in the error rating used for selection, and it is sensible to specify a
small unit penalty.

Dependent/Independent. The input and output variables displayed are those that are selected from the Startup Panel (from
either the Quick or Advanced tab).

OK. Click the OK button to run the feature selection algorithm and display the Feature Selection Progress dialog.

Cancel. Click the Cancel button to close this dialog and return to the Startup Panel.

Options. Click the Options button to display the Options menu.

Sampling. Click this button to display the Sampling of Case Subsets for Feature Selection dialog, which is used to specify
how the case subsets should be sampled.

Results. Click this button to generate a spreadsheet of feature selection training results.

Feature Selection - Quick Tab


Select the Quick tab of the Feature (Independent Variable) Selection dialog to access the options described here.

Method. In this group box, select the algorithm to be used in feature selection.

Forward selection. Select the Forward selection option button for the fastest performance.

Backward selection. Select the Backward selection option button (or Genetic algorithm option button) for more accurate
evaluation.

Genetic algorithm. If you select the Genetic algorithm option button, and the product of the number of generations and the
population is greater than two raised to the power of the number of input variables, then STATISTICA Neural Networks will
exhaustively evaluate all possible combinations (as this is actually quicker than running the genetic algorithm in that case).

See Feature (Independent Variable) Selection for more details on Forward selection, Backward selection, and Genetic
algorithm.

Feature Selection - Advanced Tab


Select the Advanced tab of the Feature (Independent Variable) Selection dialog to access the options described here.

Sampling. In the Sampling box, specify the proportion of the available training and selection cases to be used in feature
selection. STATISTICA Neural Networks randomly selects the given proportion of these cases to use in building and
evaluating the PNN or GRNN. The time taken is proportional to the square of the sampling rate; thus, with a sampling rate of 0.1 the algorithm takes approximately one hundredth of the time taken with a sampling rate of 1.0. It is worth
considering a low sampling rate if you have a large number of cases (e.g., over a thousand). Otherwise, the algorithm may be
extremely slow.

Smoothing. In this box is the Smoothing factor used in training the PNN or GRNN. In general such networks are not too
sensitive to the precise value of this parameter; however, you may find it worthwhile to experiment a little to find a
reasonable value before starting the Feature Selection algorithms.

Unit penalty. The factor in this box is multiplied by the number of selected input variables, and added to the selection error
of the network when it is trained and tested. A non-zero Unit penalty favors smaller networks, and often improves
performance. However, if the factor is too large, the number of variables becomes more important than the quality of the
network; this may even cause the algorithm to mask out all inputs. A small value is recommended to compensate for noise
in the case sampling process, which may lead to arbitrary selection of some irrelevant variables. Typical values are in the
range 0.001-0.005.
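The penalized score can be sketched as follows (a hypothetical helper, using the middle of the quoted penalty range as a default):

```python
def penalized_error(selection_error, mask, unit_penalty=0.002):
    """Score used to compare masks: network selection error plus a unit
    penalty per selected input, favoring smaller input sets (typical
    penalties quoted above are 0.001-0.005)."""
    return selection_error + unit_penalty * sum(1 for bit in mask if bit == "1")
```

For example, a mask selecting three inputs with a selection error of 0.10 and a penalty of 0.002 scores 0.106.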

Feature Selection - Genetic Algorithm Tab


Select the Genetic algorithm tab of the Feature (Independent Variable) Selection dialog to access the options described here.

Population. In this box is the number of masking strings maintained by the genetic algorithm at any one time. A larger
Population makes the algorithm more likely to locate a good masking string, but also increases the time taken by the
algorithm.


Generations. This box displays the number of times that the population is replaced by a newly generated population. As
with Population size, a larger number of Generations increases the time taken by the algorithm but increases the chance of
locating a good masking string.

Mutation rate. On each generation of the genetic algorithm, new masking strings are generated, by a combination of
crossover and mutation. During mutation, some of the bits are randomly flipped. The Mutation rate controls the number of
bits that are flipped. It gives the number of bits, on average, that are flipped per generated string (although some strings may
have more than one bit flipped, and some none). The default rate of 1 is usually recommended. A larger number gives the
algorithm somewhat greater exploratory capability, but may lead to unhelpful disruption of good strings.

Crossover rate. When a new population is generated, some of the strings are copied (with mutation) directly from a single
member of the old population, while others are created by crossing over two members of the old population (and then
applying mutation). During crossover, a random point on the strings is selected, and the two strings "swap ends" to create
two children. Crossover allows two strings with different good features to combine together into a single more powerful
string. It is the primary feature that makes the genetic algorithm powerful.

A Crossover rate as high as 1.0 is quite acceptable; however, with such a high rate, the population may converge to a very
similar set of children very quickly. A lower Crossover rate is recommended if a larger number of generations can be
tolerated, as this tends to increase the range of the search.
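The two operators can be sketched in Python (illustrative, not SNN's implementation; modeling the mutation rate as a per-bit flip probability is one common way to realize an average number of flips per string):

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: pick a random cut point and swap ends,
    producing two children."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(mask, rate, rng):
    """Flip each bit with probability rate/len(mask), so on average
    `rate` bits are flipped per string."""
    p = rate / len(mask)
    return "".join(("1" if b == "0" else "0") if rng.random() < p else b
                   for b in mask)
```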

Feature Selection - Interactive Tab


Select the Interactive tab of the Feature (Independent Variable) Selection dialog to access the options described here.

Results reported. In this group box, specify what results should be generated and reported. In all cases, the results are
copied to a results spreadsheet. If any of the last three options are selected, a progress window is displayed containing the
spreadsheet, and it is updated as new solutions are tested. The options are:

Final (best) feature set only. The spreadsheet contains a single line indicating the final selection of features if this option
button is selected.

Best of each stage (if improvement). If this option button is selected, a single line is added to the spreadsheet on each stage
(iteration of forward or backward selection, generation of the genetic algorithm). The new line indicates the best solution
found at that stage, and is only added if it is better than the best solution previously found.

Best of each stage. As above, except that a row is added at each stage even if this is not an improvement over the results
from previous stages.

All combinations tested. If this option button is selected, a row is generated for every single feature set tested. The
spreadsheet contains the following information:

There is one row for each test reported, and one row for the final selection. There is a column for the network error, and one
column for each candidate input feature. Each row contains: in the row label, an identifier for the stage of the algorithm (e.g.
2.4 means "the fourth test in stage two"), the selection error of the network, an indicator of the variables selected ("Y"
implies the variable was selected, "-" that it was deselected).

Feature Selection - Finish Tab


Select the Finish tab of the Feature (Independent Variable) Selection dialog to access the options described here.

On completion. Select an option in this group box to specify what STATISTICA Neural Networks should do when the
algorithm finishes.

Finish analysis. By default this option button is selected; when the analysis finishes, the generated results spreadsheet is the final output.

Start Intelligent Problem Solver using selected features/Start Custom Network Designer using selected features. You
can alternatively specify that the Intelligent Problem Solver or Custom Network Designer be displayed, with the variables
selected to match the best combination found by the feature selection algorithm.

Keep feature selection dialog. Select this option button to display the Feature Selection dialog when the analysis finishes.

Technical Details. The genetic algorithm uses the following techniques:

The standard Holland genetic algorithm, with elitism (retention of best string from each generation unaltered) and roulette selection. Fitness (selection error plus unit penalty) is normalized linearly before selection so that the ratio of best:worst
fitness is 2:1: this ensures that a constant selective pressure is maintained throughout the duration of the algorithm. All
strings except the elite one are replaced at every generation.

The PNNs and GRNNs are not actually explicitly constructed; they are simulated in situ using the training set to ensure
maximum performance speed.

The genetic algorithm is followed by a single pass of bitwise gradient descent (that is, each bit in turn is flipped, and the flip
accepted if the error decreases). This tends to "tidy up" any simple errors made by the genetic algorithm due to its stochastic
nature.
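The 2:1 fitness normalization and roulette selection can be sketched as follows (an illustrative Python approximation of the scheme described above):

```python
import random

def normalized_roulette(errors, rng):
    """Pick one population index by roulette wheel after linearly
    rescaling fitness so the best:worst ratio is 2:1 (lower error means
    higher fitness), as in the technical details above."""
    worst, best = max(errors), min(errors)
    if worst == best:
        weights = [1.0] * len(errors)
    else:
        # Linear map: best error -> weight 2.0, worst error -> weight 1.0.
        weights = [1.0 + (worst - e) / (worst - best) for e in errors]
    pick = rng.uniform(0, sum(weights))
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if pick <= cumulative:
            return i
    return len(errors) - 1
```

With errors [0.1, 0.2, 0.3] the best string receives twice the selection weight of the worst, giving the constant selective pressure mentioned above.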

Feature Selection in Progress Dialog


This dialog is displayed while the Feature Selection algorithms run, provided that one of the last three options in the Results
reported group box are selected on the Feature Selection - Interactive tab.

The major feature of the dialog is a spreadsheet. Rows are added to the dialog during training, indicating the progress of the
algorithm. Depending on the Results reported option selected, there may be rows for every combination of inputs tested, for
the best combination on each epoch/generation, or for the best combination when it constituted an improvement over
previous epochs/generations. A final row is added when the algorithm terminates, repeating the details of the best
combination discovered overall.

While the algorithm is in progress, the Feature Selection in Progress dialog also displays the percentage of training
accomplished and the time taken to achieve it.

Progress spreadsheet. The spreadsheet contains one column for the network error, and one column for each candidate input
feature. Each row contains: in the row label, an identifier for the stage of the algorithm (e.g. 2.4 means "the fourth test in
stage two"), the selection error of the network, and an indication of the variables selected ("Y" implies the variable was
selected, "-" that it was deselected).

Finish. Click this button at any time to prematurely finish the algorithm. The current test is stopped, and the best selection so
far is taken as the output of the procedure.

Cancel. Click this button at any point to abort feature selection. The algorithm stops as soon as possible, and you will return
to the Feature Selection dialog.

Sampling of Case Subsets for Feature Selection Dialog


Click the Sampling button on the Feature (Independent Variable) Selection dialog to display the Sampling of Case Subsets
for Feature Selection dialog. This dialog contains one tab: Quick. Use the options on this dialog to control the sampling of
cases for the training, selection, and test subsets when running the feature selection algorithms. See Resampling for a
detailed overview of sampling issues.

Subset assignment (to train, select, and test subsets) can be random or based upon a data set variable or preexisting neural
network. In addition, you can exclude cases containing missing values or explicitly select those cases to be considered.

Quick Tab

Assignment of fixed subsets. You can specify that the subsets be determined by one of the three methods contained in this
group box: by using a subset variable (defined below); using the same distribution as a selected preexisting network, or
randomly. The first two options are available only if there is a suitable subset variable in the data set, or a suitable network.

From subset variable. A subset variable is a variable in the data set that can be used to indicate the case subsets. It must be
a nominal variable with values taken from the set ("Train," "Select," "Test," "Ignore"). The variable can have any name, but
the STATISTICA Neural Networks (SNN) convention is to give such variables the prefix NNSET. The easiest way to generate
such a variable is to export the subset assignment of a randomly assigned network to a results spreadsheet (by including
Subset in the Also in Spreadsheet options on the Predictions tab of the Results dialog), and then to copy and paste this to the
data set.

If there are no subset variables available, the drop-down list is disabled; otherwise, you can select from the list of available
variables.

As existing network. Each neural network in SNN records the case division used to train it. You can select from the existing
networks, indicating that feature selection should use the same case division; this is useful if you want to compare network
performance on a "level playing field." If there are no existing networks, or if the networks were imported from an older
version of SNN that does not record case subsets, the As existing network drop-down list is disabled.

Shuffle train and select subsets. Select this check box if you want to maintain the same test set selection as the given subset
variable or existing network, but want to resample the train and select subsets. This can be useful if you want to employ
sampling in feature selection, but retain a single test set to allow comparison of results.

Random (once before generating any samples). Select this option button to specify that the subsets be assigned randomly
from the available cases.

Subset sizes (Training/Selection/Test). If you choose to derive the case subsets from a subset variable or an existing
network, the size of the subsets (or, at least, of the test subset) is also derived, and the subset size fields are disabled.
However, if you are randomly determining subsets, or resampling training and selection subsets, you can select the subset
sizes.

You can specify the subset sizes in two ways: by giving the exact numbers of cases, or by giving relative proportions.

If you want to set the numbers exactly, simply enter the numbers you want. To help you in this process, the remainder
available is displayed (in blue, if a positive remainder; in red, if you have specified more cases than are available). You can
assign the remainder to a given subset by double-clicking on the corresponding field.

If you want to set the numbers using proportions, simply enter the proportions into the fields. For example, to split the data
3:1:1 between training, selection, and test, enter 3, 1, and 1 into the three fields respectively. When you click OK, SNN will
assign the cases in the given proportions.
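The proportional assignment described above amounts to simple integer arithmetic followed by a random shuffle. A minimal sketch, assuming a hypothetical `split_subsets` helper (not part of SNN, which performs the equivalent assignment internally when you click OK):

```python
import random

def split_subsets(n_cases, proportions=(3, 1, 1), seed=0):
    """Assign n_cases at random to Train/Select/Test in the given proportions."""
    total = sum(proportions)
    # Integer subset sizes; any rounding remainder goes to the training subset.
    sizes = [n_cases * p // total for p in proportions]
    sizes[0] += n_cases - sum(sizes)
    labels = ["Train"] * sizes[0] + ["Select"] * sizes[1] + ["Test"] * sizes[2]
    random.Random(seed).shuffle(labels)
    return labels

labels = split_subsets(100, (3, 1, 1))
print(labels.count("Train"), labels.count("Select"), labels.count("Test"))  # 60 20 20
```

With 100 cases and a 3:1:1 split, this yields 60 training, 20 selection, and 20 test cases.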

OK. Click the OK button to confirm that the current sampling options should be used in feature selection.

Select cases. Click this button to display the Select Cases dialog, which is used to specify a range of cases to be distributed
among the training, selection, and test subsets. Any cases not included in the case selection are excluded from consideration,
even if you specify an explicit option such as From subset variable or As existing network.

MD Deletion. Specify how cases with missing values should be handled during training. If Casewise is selected, any cases
with missing values are excluded from consideration during feature selection. This is preferable, unless you are severely
restricted in the number of cases available. In that case, you can select Mean substitution, which implies that cases with
missing values are used, and the network's missing value substitution procedure will be employed to "patch" those cases
during feature selection. This substitutes the training set sample mean for a numeric variable or the sample frequency for a
nominal variable.
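The substitution behavior described above can be sketched as follows. The `patch_missing` helper is hypothetical, and the nominal case is simplified here to substituting the most frequent training class rather than SNN's frequency-based substitution:

```python
def patch_missing(case, train_means, train_modes):
    """Substitute training-set statistics for missing values (None).

    Numeric inputs get the training sample mean; nominal inputs get the
    most frequent training class (a simplification of SNN's behavior).
    """
    patched = {}
    for name, value in case.items():
        if value is not None:
            patched[name] = value
        elif name in train_means:      # numeric variable
            patched[name] = train_means[name]
        else:                          # nominal variable
            patched[name] = train_modes[name]
    return patched

case = {"AGE": None, "SEX": None, "INCOME": 42000.0}
print(patch_missing(case, {"AGE": 37.5, "INCOME": 39000.0}, {"SEX": "F"}))
```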

Training String
The training string is used throughout SNN as a shortcut method to identify the methods used to train the respective
networks. The training string contains a number of codes, which are followed by the number of epochs for which the
algorithm ran (if an iterative algorithm), and an optional terminal code indicating how the final network was selected. For
example, the code CG213b indicates that the Conjugate Gradient Descent algorithm was used, that the best network
discovered during that run was selected (for "best" read "lowest selection error") and that this network was found on the
213th epoch.

The codes are:

BP Back Propagation
CG Conjugate Gradient Descent
QN Quasi-Newton
LM Levenberg-Marquardt
QP Quick Propagation
DD Delta-Bar-Delta
SS (sub)Sample
KM K-Means (Center Assignment)
EX Explicit (Deviation Assignment)
IS Isotropic (Deviation Assignment)
KN K-Nearest Neighbour (Deviation Assignment)
PI Pseudo-Invert (Linear Least Squares Optimization)
KO Kohonen (Center Assignment)
PN Probabilistic Neural Network training
GR Generalized Regression Neural Network training
PC Principal Components Analysis

The terminal codes are:

b Best Network (the network with lowest selection error in the run was restored)
s Stopping Condition (the training run was stopped before the total number of epochs elapsed as a stopping condition
was fulfilled)
c Converged (the algorithm stopped early because it had converged; that is, reached and detected a local or global
minimum. Note that only some algorithms can detect stoppage in a local minimum, and that this is an advantage not
a disadvantage!)
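A training string such as CG213b can be decomposed mechanically. The following sketch is a hypothetical parser (not an SNN function) covering only the iterative-algorithm codes from the table above, and handling a single code rather than a composite training string:

```python
import re

# Iterative-algorithm codes (a subset of the table above) and terminal codes.
ALGORITHMS = {"BP": "Back Propagation", "CG": "Conjugate Gradient Descent",
              "QN": "Quasi-Newton", "LM": "Levenberg-Marquardt",
              "QP": "Quick Propagation", "DD": "Delta-Bar-Delta"}
TERMINAL = {"b": "best network restored", "s": "stopping condition met",
            "c": "converged"}

def parse_training_code(code):
    """Split a single training code such as 'CG213b' into its three parts."""
    m = re.fullmatch(r"([A-Z]{2})(\d*)([bsc]?)", code)
    if not m:
        raise ValueError("unrecognized training code: " + code)
    algo, epochs, term = m.groups()
    return (ALGORITHMS.get(algo, algo),
            int(epochs) if epochs else None,   # non-iterative codes carry no epochs
            TERMINAL.get(term))

print(parse_training_code("CG213b"))
```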

Profile String
The profile string is used throughout SNN as a shortcut method to identify the architecture of the network or ensemble. The
profile consists of a type code, followed by a code giving the number of input and output variables, and number of layers
and units (networks) or members (ensembles). For time series networks, the number of steps and the lookahead factor are
also given. The individual parts of the profile are the type and architecture.

The type is indicated by one of the following codes:

MLP Multilayer Perceptron Network


RBF Radial Basis Function Network
SOFM Kohonen Self-Organizing Feature Map
Linear Linear Network
PNN Probabilistic Neural Network
GRNN Generalized Regression Neural Network
PCA Principal Components Network
Cluster Cluster Network
Output Output Ensemble
Conf Confidence Ensemble

A neural network's architecture is of the form I:N-N-N:O, where I is the number of input variables, O the number of output
variables, and N the number of units in each layer.

Example. 2:4-6-3:1 indicates a network with 2 input variables, 1 output variable, 4 input neurons, 6 hidden neurons, and 3
output neurons.

For a time series network, the steps factor is prepended to the profile, and signified by an "s".

Example. s10 1:10-2-1:1 indicates a time series network with steps factor (lagged input) 10.

An ensemble's architecture is of the form I:[N]:O, where I is the number of input variables, O the number of output variables,
and N the number of members of the ensemble.
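The architecture portion of a profile string can be unpacked as follows; `parse_architecture` is a hypothetical illustration of the I:N-N-N:O format, not an SNN function:

```python
def parse_architecture(profile):
    """Parse an I:N-N-N:O architecture string into its components."""
    inputs, layers, outputs = profile.split(":")
    return {"inputs": int(inputs),
            "units_per_layer": [int(n) for n in layers.split("-")],
            "outputs": int(outputs)}

# The 2:4-6-3:1 example from the text: 2 input variables, 1 output variable,
# and layers of 4, 6, and 3 neurons.
print(parse_architecture("2:4-6-3:1"))
```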

Reviewing Neural Networks Results


Results

Results Dialog

Discard Results Prompt

Results Dialog - Quick Tab

Results Dialog - Advanced Tab

Results Dialog - Predictions Tab

Results Dialog - Residuals Tab

Results Dialog - Sensitivity Tab

Results Dialog - Plot Tab

Results Dialog - Descriptive Statistics Tab

User Defined Case Prediction

User Defined Case Prediction Dialog

User Defined Case Prediction - Quick Tab

User Defined Case Prediction - Advanced Tab

Response Graph

Response Graph Dialog

Response Graph - Quick Tab

Response Graph - Advanced Tab

Response Graph - Fixed Independent Tab

Response Surface

Response Surface Dialog

Response Surface - Quick Tab

Response Surface - Advanced Tab

Response Surface - Fixed Independents Tab

Topological Map

Topological Map Dialog

Topological Map - Topological Map Tab

Topological Map - Custom Case Tab

Topological Map - Advanced Tab

Topological Map - Win Frequencies Tab

Network Illustration

Network Illustration Dialog

Network Illustration - Illustration Tab

Network Illustration - Custom Case Tab

Time Series Projection

Time Series Projection Dialog

Time Series Projection - Quick Tab

Time Series Projection - Advanced Tab

Results
Results Dialog

Discard Results Prompt

Results Dialog - Quick Tab

Results Dialog - Advanced Tab

Results Dialog - Predictions Tab

Results Dialog - Residuals Tab

Results Dialog - Sensitivity Tab

Results Dialog - Plot Tab

Results Dialog - Descriptive Statistics Tab

Results (Run Models) Dialog


Select Run Existing Model from the Startup Panel to display the Results (Run Models) dialog. This dialog is also displayed
as the final stage of many other analyses. The Results dialog is used to generate outputs, including descriptive statistics,
predictions, residuals, response graphs and surfaces, and other forms of output, by executing models.

You can select multiple models (networks and ensembles), in which case, wherever possible, STATISTICA Neural Networks
will display any results generated in a comparative fashion (e.g. by plotting the response curves for several models on a
single graph, or presenting the predictions of several models in a single spreadsheet).

This dialog contains seven tabs: Quick, Advanced, Predictions, Residuals, Sensitivity, Plot, and Descriptive Statistics. The
options described here are available regardless of which tab is selected.

OK. Click the OK button to finish the current analysis. If the Results dialog has been displayed as the last stage in an
Intelligent Problem Solver or Custom Network Designer analysis, clicking OK confirms the addition of the new networks to
the network set.

Cancel. Click the Cancel button to discard the results provided that the Results dialog has been displayed as the last stage in
an Intelligent Problem Solver or Custom Network Designer analysis. In this case, STATISTICA Neural Networks will display
a prompt asking whether you want to proceed with discarding the results.

Options. Click the Options button to display the Options menu.

Models. Click the Models button to display the Select Networks and/or Ensembles dialog. Use this dialog to select the
models for which results are to be generated. You can select multiple models, in which case comparative results are
generated. Summary details of the selected models are displayed in the list box at the top of the dialog.

Select cases. Click the Select cases button to display the Select Cases dialog, which is used to select cases for which results
are to be generated. By default all the cases in the data set are selected.

MD Deletion. Specify how to treat cases with missing values (in the input and output variables of the selected models).
There are two options:

Casewise. Any cases with missing values are omitted when generating results.

Mean substitution. The networks' missing value substitution procedures are used to "patch" missing values before
executing the network. The standard missing value procedure is to substitute the training subset sample mean for numeric
outputs, or the training subset sample class frequencies for nominal outputs. See Network Editor for further options.

Discard Results Prompt


Click the Cancel button on the Results (Run Models) dialog to discard the results provided that the Results dialog has been
displayed as the last stage in an Intelligent Problem Solver or Custom Network Designer analysis. In this case, STATISTICA
Neural Networks will display a prompt asking whether you want to proceed with discarding the results.

OK. Click the OK button to discard the current results.

Cancel. Click the Cancel button to return to the Results dialog. The results will not be discarded.

Do not ask again. If you select this check box, this prompt will not be displayed when you click the Cancel button on the
Results dialog.

Results Dialog - Quick Tab


Select the Quick tab of the Results dialog to access the options described here.

Models summary. Click this button to display a spreadsheet containing summary details of models shown in the list box at
the top of the dialog.

Predictions. Click this button to generate a model prediction spreadsheet. See the Predictions tab for more details.

Residuals. Click this button to generate a model residual spreadsheet. See the Residuals tab for more details. Only available
for regression networks.

Sensitivity Analysis. Click this button to perform a sensitivity analysis on model input variables. See the Sensitivity tab for
more details.

Descriptive Statistics. Click this button to generate a descriptive statistics spreadsheet. See the Descriptive Statistics tab for
more details.

Subsets used to generate results. You can generate results for all selected cases, or for a given subset, or separately for all
subsets. The Overall option button generates results for any case not excluded by MD Deletion or Case Selection. The All
(separately) option button generates results for the training, selection, test and ignored cases separately, and is equivalent to
requesting each of the Training, Selection, Test, and Ignored options separately in turn.

Results Dialog - Advanced Tab


Select the Advanced tab of the Results dialog to access a number of dialogs that perform specialized results generation, and
to generate Receiver Operating Characteristic (ROC) curves. These dialogs are described in the topics listed below:

User Defined Cases

Response Graphs

Response Surfaces

Topological Map

Network Illustration

Time Series Projection

ROC Curve

Results Dialog - Predictions Tab


Select the Predictions tab of the Results dialog to access the options described here.

Predictions. Click this button to generate a spreadsheet of model predictions. The precise details shown are controlled by
the options discussed below.

Prediction type(s) shown. Select the form of prediction required. Some of these options pertain only to certain model types,
and are disabled if not relevant to the current selection of models.

Prediction. Select this check box to generate the standard network prediction; that is, a predicted value for the output
variables of the model.

Confidence levels. Applicable only to classification networks (i.e. those with nominal output variables). Displays the
confidence levels of the network in the various possible classes, which are represented by the activation levels of the output
neurons corresponding to the output variable.

Codebook vector. Applicable only to networks with a Radial hidden layer, and usually meaningful only for SOFM and
Cluster networks. Each hidden neuron in such networks represents an exemplar or "codebook" vector (i.e. a prototypical
case). If this check box is selected, the codebook vector in the "winning" neuron is output.

Winning neuron. Applicable only to SOFMs and Cluster networks. Select this check box to list the index of the winning
radial neuron.

Also in spreadsheet. You can optionally specify additional information to be included in the predictions spreadsheet:

Observed. Selects the observed values of the output variables, drawn from the data set.

Independents. All the independent variables used as inputs to the selected models.

Subset. Generates a subset variable (with name NNSET.nn, where nn is the index number of the model), which indicates to
which subsets (Training, Selection, Test, and Ignored) the cases were assigned during training.

Variables. This option allows you to choose an arbitrary selection of variables from the data set for inclusion in the results
spreadsheet.

Click the variable selection button to display a variable selection dialog.

Results Dialog - Residuals Tab


This Residuals tab of the Results dialog is only available for regression networks.

Residuals. Click this button to generate a spreadsheet of residuals (errors made by regression networks).

Residual type(s) shown. Select one or more types of residual to be included in the spreadsheet. The types available are:

Raw. (Observed - Predicted).

Squared. (Raw^2)

Absolute. (|Observed-Predicted|).

Standard. See Standard Residual.
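The first three residual types reduce to simple arithmetic on observed and predicted values. A sketch with a hypothetical helper; the "standard" residual is shown here under the common definition of raw residual divided by its sample standard deviation, which may differ in detail from SNN's Standard Residual:

```python
def residuals(observed, predicted):
    """Compute the residual types listed above for paired value lists."""
    raw = [o - p for o, p in zip(observed, predicted)]
    mean = sum(raw) / len(raw)
    # Sample standard deviation of the raw residuals (assumed definition).
    sd = (sum((r - mean) ** 2 for r in raw) / (len(raw) - 1)) ** 0.5
    return {"raw": raw,
            "squared": [r ** 2 for r in raw],
            "absolute": [abs(r) for r in raw],
            "standard": [r / sd for r in raw]}

res = residuals([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
print(res["raw"], res["absolute"])
```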

Also in spreadsheet. You can optionally specify additional information to be included in the residuals spreadsheet:

Prediction. Includes model predictions in the residual spreadsheet.

Observed. Selects the observed values of the output variables, drawn from the data set.

Independents. All the independent variables used as inputs to the selected models.

Subset. Generates a subset variable (with name NNSET.nn, where nn is the index number of the model), which indicates to
which subsets (Training, Selection, Test, and Ignored) the cases were assigned during training.

Variables. Select this check box in order to choose an arbitrary selection of variables from the data set for inclusion in the
results spreadsheet.

Click the variable selection button to display a variable selection dialog.

Results Dialog - Sensitivity Tab


Select the Sensitivity tab of the Results dialog to access the options described here.

Sensitivity Analysis. Click this button to conduct a sensitivity analysis on each model and display the results in a
spreadsheet. See Sensitivity Analysis for a description of the technique, which rates the importance of the models' input
variables.

Sensitivity metrics shown. Select the measures of sensitivity to be displayed in the spreadsheet. The ratio is the basic
measure of sensitivity (a ratio of 1.0 or lower indicates an irrelevant or even damaging input variable; progressively higher
values indicate more important variables). The ranking simply indicates the ordering of the ratios.

Note. You can perform sensitivity-based pruning of networks during custom training (using options on the appropriate
training dialog).

Results Dialog - Plot Tab


The Plot tab of the Results dialog is only available for regression networks. Use the options on this tab to produce a variety
of line plots and histograms.

X-axis. From this list box, select the quantity to be plotted on the X axis of the line graph. You can choose from:

Predicted. The Predicted value of the output variable (the model's prediction);

Observed. The Observed value of the output variable (in the data set);

Numeric Variable. Any specified numeric variable from the data set.

Y-axis. Select the quantity to be plotted on the Y axis of the line graph, or the quantity from which the histogram is to be
constructed. The options include the various forms of the residual and the Predicted and the Observed values.

Variable. Identifies the specific numeric variable to be displayed if the corresponding X option is selected.

Graph X versus Y. Click this button to generate a line graph relating the quantities selected for X and Y.

Histogram of Y. Click this button to generate a histogram of the quantity Y selected.

Results Dialog - Descriptive Statistics Tab


Select the Descriptive Statistics tab of the Results dialog to access the options described here.

Descriptive Statistics. Click this button to generate the summary statistics. There are two types of summary output
available, controlled by the check boxes described below.

Summaries Generated.

Summary statistics. Select this check box to generate standard overall statistics. The precise statistics used depend on
whether the output variable is numeric (a regression problem) or nominal (a classification problem). If there are both
regression and classification output variables, both types of summary spreadsheets are generated.

The summary statistics are described in classification statistics and regression statistics respectively.

Confusion matrix. Select this check box to generate a confusion matrix for each nominal output variable. A confusion
matrix gives a detailed breakdown of misclassifications. The observed class is displayed at the top of the matrix, and the
predicted class down the side; each cell contains a number showing how many cases that were actually of the given observed
class were assigned by the model to the given predicted class. In a perfectly performing model, all the cases are counted in
the leading diagonal.
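The matrix layout described above, with observed classes across the top and predicted classes down the side, can be sketched with a hypothetical helper:

```python
def confusion_matrix(observed, predicted, classes):
    """Build a confusion matrix indexed as counts[predicted][observed].

    Correct classifications land on the leading diagonal.
    """
    counts = {p: {o: 0 for o in classes} for p in classes}
    for o, p in zip(observed, predicted):
        counts[p][o] += 1
    return counts

obs  = ["A", "A", "B", "B", "B"]
pred = ["A", "B", "B", "B", "A"]
m = confusion_matrix(obs, pred, ["A", "B"])
print(m["A"]["A"], m["B"]["B"])  # counts on the leading diagonal: 1 2
```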

User Defined Case Prediction


User Defined Case Prediction Dialog

User Defined Case Prediction - Quick Tab

User Defined Case Prediction - Advanced Tab

User Defined Case Prediction Dialog


Click the User defined case button on the Results dialog - Advanced tab to display the User Defined Case Prediction dialog.
Use the options on this dialog to define new cases (that are not drawn from the data set) and execute models using them.
You can also select individual cases from the data set, and either execute them or modify and then execute them, allowing
you to perform ad hoc "what if?" analyses. The predictions of the models are accumulated in an output spreadsheet
that you can transfer to a separate STATISTICA spreadsheet. This dialog contains two tabs: Quick and Advanced.

Cancel. Click the Cancel button to close the current analysis and display the Results dialog.

Options. Click the Options button to display the Options menu.

Models. Click this button to display the Select Networks and/or Ensembles dialog, which is used to select the model or
models to be executed. Summary details of the selected models are shown in the list box at the top of the User Defined Case
Prediction dialog.

User Defined Case Prediction - Quick Tab


Select the Quick tab of the User Defined Case Prediction dialog to access the options described here.

Input spreadsheet. The Input spreadsheet displays the values of the current input variable. For most networks there is a
single column, with each row corresponding to a model input variable. For time series networks, there is a column for each
required lagged value. You can enter an input drawn from the data set using the Input case box. Alternatively, you can run
your own defined inputs using the User defined input button.

Predictions. Click this button to generate a spreadsheet of predicted outputs for the input values you have run by clicking the
Run current input button (these can be inputs drawn from the data set, user-defined inputs, or both). Clicking this button will
also clear the currently stored predictions once the spreadsheet is generated.

No. predicted cases. Displays the number of predictions made so far (i.e. the number of times you have clicked the Run
current input button since the Predictions button was last clicked).

Input case. Use this box to specify a case from the data set. Alter the case number, and the corresponding values are copied
to the Input spreadsheet. You can subsequently customize the values by clicking the User defined input button before
running the case.

Run current input. Click this button to run the current input case and add a row to the output spreadsheet containing the
prediction(s) of the currently selected model(s).

User defined input. Click this button to display a general user entry spreadsheet that you can use to customize the input. By
clicking the OK button on this spreadsheet, you can confirm the changes you have made and the new values will be
displayed in the Input spreadsheet. If you leave a cell blank inside the spreadsheet, it will be interpreted as a missing value.

Clear runs. Click this button to clear all the cells in the Input spreadsheet to blanks (missing values).

User Defined Case Prediction - Advanced Tab


Select the Advanced tab of the User Defined Case Prediction dialog to access the options described here.

Prediction type(s). In this group box, select the form(s) of prediction required. Some of these options pertain only to certain
model types, and are disabled if not relevant to the current selection of models.

Prediction. Select this check box to generate the standard network prediction; that is, a predicted value for the output
variables of the model.

Confidence levels. This option is applicable only to classification networks (i.e. those with nominal output variables). Select
this check box to display the confidence levels of the network in the various possible classes, which are represented by the
activation levels of the output neurons corresponding to the output variable.

Codebook vector. This option is applicable only to networks with a Radial hidden layer, and usually meaningful only for
SOFM and Cluster networks. Each hidden neuron in such networks represents an exemplar or "codebook" vector (i.e. a
prototypical case). If this check box is selected, the codebook vector of the "winning" neuron is output.

Winning neuron. This option is applicable only to SOFMs and Cluster networks. Select this check box to list the index of
the winning radial neuron.

Also in spreadsheet. Specify additional information to be included in the output spreadsheet.

Inputs. Select the Inputs check box, to copy the user-defined input case into the output spreadsheet.

Retain multiple predictions in the output spreadsheet. This check box is selected by default, so that each time you click
the Run current input button an extra row is added to the output spreadsheet, allowing you to accumulate a set of predictions.
Clear this check box if you would like the output spreadsheet to be cleared each time a new prediction is made.

Response Graph
Response Graph Dialog

Response Graph - Quick Tab

Response Graph - Advanced Tab

Response Graph - Fixed Independent Tab

Response Graph Dialog


Click the Response graph button on the Advanced tab of the Results dialog to display the Response Graph dialog. This
dialog contains three tabs: Quick, Advanced, and Fixed Independent.

Use the options on this dialog to generate a Response Graph. A response graph shows the effect on the output variable
prediction of adjusting an input (independent) variable. The input variable must be numeric. The output variable may be
numeric or may be nominal - in the latter case, the default is to plot the response of the confidence level (output neuron
activation), which is continuous and can therefore be conveniently graphed. Optionally, you can instead plot the ordinal
values (1,2,3, etc) corresponding to the classes, in which case the response graph consists of a number of plateaus.

While one specially chosen numeric input variable is altered (and plotted across the X axis of the graph), values must also be
provided for the other input variables of the network. These are given fixed values, and are referred to as "fixed
independents." Thus, the response graph actually represents a one-dimensional slice through an N dimensional response
surface, where N is the number of input variables.
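Conceptually, the response graph evaluates the model over a grid of values for the chosen input while holding the fixed independents constant. A sketch with a hypothetical `response_curve` helper and a toy stand-in model:

```python
def response_curve(model, fixed_inputs, x_name, x_values):
    """Evaluate a one-dimensional slice through a model's response surface.

    `model` is any callable taking a dict of input values; `fixed_inputs`
    holds the fixed-independent values for the other variables.
    """
    curve = []
    for x in x_values:
        point = dict(fixed_inputs, **{x_name: x})  # fixed values plus the varied input
        curve.append(model(point))
    return curve

# Toy stand-in model: prediction depends on two inputs.
toy = lambda p: 2.0 * p["x1"] + p["x2"]
print(response_curve(toy, {"x2": 1.0}, "x1", [0.0, 1.0, 2.0]))  # [1.0, 3.0, 5.0]
```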

Cancel. Click this button to exit the Response Graph dialog and return to the Results dialog.

Options. Click the Options button to display the Options menu.

Select Models. Click this button to display the Select Networks and/or Ensembles dialog where you can select the models to
be used in generating response curves. If you select multiple models with the same output variable, the responses are plotted
on the same graph. Summary details of the selected models are shown in the list box at the top of the window.

Response Graph - Quick Tab


Select the Quick tab of the Response Graph dialog to access the options described here.

Independent. Select the numeric input variable that is to be varied across the X axis. A response graph will be generated for
each output variable of the selected models.

Resp. graph. Click this button to generate the response graph.

Resp. Spreadsheet. Click this button to generate a spreadsheet containing the values corresponding to the response graph(s).

Response Graph - Advanced Tab


Select the Advanced tab of the Response Graph dialog to access the options described here.

Independent. Select the numeric input variable that is to be varied across the X axis.

Minimum. Specify the Minimum value of the independent variable on the X axis. By default this is the smallest value of the
variable within the training subset.

Maximum. Specify the Maximum value of the independent variable on the X axis. By default this is the largest value of the
variable within the training subset.

Number of samples. Specify the number of points evaluated to generate the graph. The samples are evenly spaced between
the minimum and maximum. Note that, if you want to use a simple sampling interval, you need to include an extra point for
the end; e.g. to plot points at intervals of 2 units from 0 to 10, you need 6 points, not 5 (at positions 0,2,4,6,8,10
respectively).
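The fencepost arithmetic in the note above can be sketched with a hypothetical helper that returns evenly spaced points including both endpoints:

```python
def sample_points(minimum, maximum, n_samples):
    """Return n_samples evenly spaced values including both endpoints.

    Covering 0..10 at intervals of 2 units requires 6 points, not 5.
    """
    step = (maximum - minimum) / (n_samples - 1)
    return [minimum + i * step for i in range(n_samples)]

print(sample_points(0, 10, 6))  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```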

Response plotted for classification outputs. Use these options to determine what is plotted if you have a classification
output (i.e. a nominal output variable).

Confidence. This option is selected by default. Confidence levels are plotted for each class (actually the activation levels of
the associated output neurons).

Prediction (plateaus). Select this option to plot the ordinal version of the prediction (1, 2, 3, etc. for each class). In this
case the response graph is a series of plateaus (a level zero plateau indicates a missing output, equivalent to an "Unknown"
prediction).

Plot multiple confidence levels on separate graphs. If this check box is selected, and there are lines to be plotted for
multiple confidence levels (i.e. confidence in different classes), then separate graphs are plotted for each class. This may
make it easier to compare the response curves of multiple models, as a single graph may then become crowded with too
many lines.

Response Graph - Fixed Independent Tab


Select the Fixed Independent tab of the Response Graph dialog to access the options described here. To generate a response
graph, all the input variables except for the variable which is being altered across the X axis must be fixed to some value.
You may specify the values for these fixed independents on this tab.

Input spreadsheet. Specify the values for the fixed independents in this spreadsheet. The X variable cell is disabled, and
shows the symbol "X", to indicate that it is not fixed. The other cells can be left blank, in which case the networks' missing
value substitution procedures are used. By default, all of them are blank, meaning that in some sense "average" values
should be used for variables other than the one of interest (the X variable).

Response graph. Click this button to generate the response graph.

Response Spreadsheet. Click this button to generate a spreadsheet containing the values corresponding to the response
graph(s).

Input case. Enter a case number (or use the microscrolls). The input values are taken from the case and copied into the input
spreadsheet. This allows you to perform a kind of "what if?" analysis (if a given variable from a known case is altered, what
effect does it have on the models' prediction?).

User defined input. Click this button to display a general user entry spreadsheet that you can use to customize the inputs.
Click the OK button on the spreadsheet to accept the changes you have made. If you leave a cell blank inside the
spreadsheet it will be interpreted as a missing value.

Clear runs. Click this button to clear all the cells of the input spreadsheet to blanks (missing values).

Response Surface
Response Surface Dialog

Response Surface - Quick Tab

Response Surface - Advanced Tab

Response Surface - Fixed Independents Tab

Response Surface Dialog


Click the Response surface button on the Advanced tab of the Results dialog to display the Response Surface dialog. This
dialog contains three tabs: Quick, Advanced, and Fixed Independents.

Use the options on this dialog to generate Response Surface graphs. A response surface shows the effect on the output
variable prediction of adjusting two input (independent) variables, and is a generalization of a Response Graph. The input
variables must be numeric. The output variable can be numeric or nominal - in the latter case, the default is to plot the
response of the confidence level (output neuron activation), which is continuous and can therefore be conveniently graphed.
Optionally, you can instead plot the ordinal values (1,2,3, etc) corresponding to the classes, in which case the response graph


consists of a series of plateaus corresponding to the different classes.

While the two specially chosen numeric input variables are altered (and plotted across the X and Y axes of the graph), values
must also be provided for the other input variables of the network. These are given fixed values, and are referred to as "fixed
independents." Thus, the response surface actually represents a two-dimensional slice through an N dimensional response
surface, where N is the number of input variables.
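The notion of a two-dimensional slice can be illustrated with a short sketch (the `model` function below is a hypothetical stand-in for a trained network; the variable names are invented for illustration):

```python
import numpy as np

def model(inputs):
    # Stand-in for a trained network's prediction function (hypothetical).
    x, y, z = inputs
    return np.sin(x) * np.cos(y) + 0.1 * z

# Fix every input except the two of interest ("fixed independents").
fixed_z = 0.5
xs = np.linspace(0.0, 3.0, 25)
ys = np.linspace(0.0, 3.0, 25)

# Evaluating the model over the grid yields a two-dimensional slice
# through the 3-dimensional response surface.
surface = np.array([[model((x, y, fixed_z)) for y in ys] for x in xs])
print(surface.shape)  # (25, 25)
```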

Cancel. Click this button to exit the Response Surface dialog and return to the Results dialog.

Options. Click the Options button to display the Options menu.

Models. Click this button to display the Select Networks and/or Ensembles dialog where you can select the models to be
used in generating response surfaces.

Response Surface - Quick Tab


Select the Quick tab of the Response Surface dialog to access the options described here.

Independent (X Axis, Y Axis). Select the numeric input variables that are to be varied across the X and Y axes. A response
surface will be generated for each output variable of the selected models.

Resp. surface. Click this button to generate response surface graphs.

Resp. Spreadsheet. Click this button to generate a spreadsheet containing the values corresponding to the response surface.

Response Surface - Advanced Tab


Select the Advanced tab of the Response Surface dialog to access the options described here.

Independent (X Axis, Y Axis). Select the numeric input variables that are to be varied across the X and Y axes.

Minimum (X Axis, Y Axis). Specify the minimum value of the independent variable on the X or Y axis. By default this is
the smallest value of the variable within the training subset.

Maximum (X Axis, Y Axis). Specify the maximum value of the independent variable on the X or Y axis. By default this is
the largest value of the variable within the training subset.

Number of samples (X Axis, Y Axis). Specify the number of points evaluated along the appropriate axis to generate the
response surface. The samples are evenly spaced between the minimum and maximum. Note that, if you want to use a
simple sampling interval, you need to include an extra point for the end; e.g. to plot points at intervals of 2 units from 0 to
10, you need 6 points, not 5 (at positions 0,2,4,6,8,10 respectively).
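The fence-post arithmetic in this note is easy to verify (NumPy's linspace, used here purely for illustration, follows the same convention of including both endpoints):

```python
import numpy as np

# Six evenly spaced samples from 0 to 10 give a sampling interval of 2:
points = np.linspace(0, 10, 6)
print(points)  # [ 0.  2.  4.  6.  8. 10.]

# Five samples would give an interval of 2.5, not 2:
print(np.linspace(0, 10, 5))
```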

Response plotted for classification outputs. Use these options to determine what is plotted if you have a classification
output (i.e. a nominal output variable).

Confidence. This option is selected by default. Confidence levels are plotted for each class (actually the activation levels of
the associated output neurons).

Prediction (plateaus). Select this option to plot the ordinal version of the prediction (1,2,3,… etc. for each class). In this
case the response surface is a series of plateaus (a level zero plateau indicates a missing output, equivalent to an "Unknown"
prediction).

Response Surface - Fixed Independents Tab


Select the Fixed Independents tab of the Response Surface dialog to access the options described here. To generate a
response surface, all the input variables except for the variables that are being altered across the X and Y axes must be fixed
to some value. You can specify the values for these fixed independents on this tab.

Input spreadsheet. Specify the values for the fixed independents in this spreadsheet. The X and Y variable cells are
disabled, and show the symbols "X" and "Y" to indicate that they are not fixed. The other cells can be left blank, in which case
the networks' missing value substitution procedures are used. By default, all of them are blank, meaning that in some sense
"average" values should be used for variables other than the ones of interest (the X and Y variable). To alter a cell value,
click the User defined input button.


Response surface. Click this button to generate response surface graphs.

Response Spreadsheet. Click this button to generate a spreadsheet containing the values corresponding to the response
surface(s).

Input case. Enter a case number (or use the microscrolls). The input values are taken from the case and copied into the input
spreadsheet. This allows you to perform a kind of "what if?" analysis (if a given variable from a known case is altered, what
effect does it have on the models' prediction?).

User defined input. Click this button to display a general user entry spreadsheet that you can use to customize the inputs.
Click the OK button on this spreadsheet to accept the changes you have made. If you leave a cell blank inside the
spreadsheet it will be interpreted as a missing value.

Clear runs. Click this button to clear all the cells of the input spreadsheet to blanks (missing values).

Topological Map
Topological Map Dialog

Topological Map - Topological Map Tab

Topological Map - Custom Case Tab

Topological Map - Advanced Tab

Topological Map - Win Frequencies Tab

Topological Map Dialog


Click the Topological map button on the Results dialog - Advanced tab to display the Topological Map dialog. This dialog
displays an editable topological map. It is designed primarily for use with SOFMs. You can use the topological map to:

l Locate self-organized clusters in the topological layer by displaying win frequencies or observing activation levels;

l Label radial neurons with output classes and assign new output classes;

l Visualize the network's performance.

This dialog contains four tabs: Topological Map, Custom Case, Advanced, and Win Frequencies.

Cancel. Click the Cancel button to close the current analysis and display the Results dialog.

Options. Click the Options button to display the Options menu.

Select models. Click this button to display the Select Networks and/or Ensembles dialog, which is used to select the
networks to be displayed on the topological map. The map can display only a single network at one time; however, if you
select multiple networks, their summary details are shown in the list at the top of the Topological Map dialog. You can select
one of these and move through the models using the ARROW keys on your keyboard, and compare the topological maps of the
different networks. Only networks with a radial hidden layer can display a meaningful topological map.

Topological graph. Click this button to generate a STATISTICA graph containing the currently displayed topological map.

Case. Enter a number in this box to specify a case from the data set. The case is fed as input to the network, and the
Topological Map is updated. The "winning" neuron is automatically selected when the case is run.

Subset readout. This readout field, located just below the Case number field, displays the subset to which the current case
was assigned when the current model was trained.

For networks trained in earlier versions of STATISTICA Neural Networks, this field may display "<unknown>" as the
subsets used in training were not recorded with the network in versions before 6.0.

Class drop-down list. Select a class label for the currently selected range of units in the Topological Map.

Edit class list. Click the Edit class list button to display the Nominal definition dialog. Use this dialog to change the


definition of nominal variables (the number and names of the nominal values). In a network output variable, a nominal
(categorical) variable corresponds to a classification problem, and the nominal values to the classes. You can rename
existing nominals, delete existing ones, and add new ones. Nominal values must be distinct, and the dialog will check that
you have given distinct names.

Readout fields (unit, class, activation, win frequency). These fields (at the lower-right of the dialog) display information
about the unit from the topological map underneath the mouse pointer as you move it. The fields show, in turn: the unit's
ordinal number and position in the topological map; the class label of the unit; the activation level of the unit when last
executed, and the win frequency of the unit (the number of cases from the currently-selected subset for which that unit has
the smallest response).

Topological Map - Topological Map Tab


Select the Topological Map tab from the Topological Map dialog to display the topological map. Each unit is represented by
a small outline square containing a smaller solid square. The solid square gives a visual representation of the activation level
of the unit - a zero activation (exact match, minimum distance) gives a maximally sized solid square that fills the outline
square. Smaller solid squares indicate a larger (more distant) activation level.

Each unit can be labeled with various pieces of information; see the Advanced tab.

You can select a range of units on the Topological Map by holding down the left mouse button and dragging the mouse
pointer over them - the selected units are highlighted in color. Hold the CTRL key down while dragging to extend the
range or to deselect previously-selected units. Right-click anywhere on the topological map to deselect all units.

You can export the topological map to the Clipboard by clicking the Copy button in the upper-right corner of the topological
window.

Topological Map - Custom Case Tab


Select the Custom Case tab of the Topological Map dialog to enter a custom (user-defined) input case, then click the Update
button to execute the case and update the topological map.

Input spreadsheet. Specify the values for the case to be executed. You can edit the values displayed in the cell values by
clicking the User defined input button (see below).

Update. Click this button to update the display of the topological map to reflect the input variable values in the Input
spreadsheet. The winning neuron is automatically selected on the Topological Map.

Input case. Enter a number in this box to specify a case from the data set, run the network using it, and update the
Topological Map accordingly. The values from the case are also copied into the Input spreadsheet so that you can modify
them and then run the network, if desired.

User defined input. Click this button to display a general user entry spreadsheet that you can use to customize the inputs.
Click the OK button on the spreadsheet to confirm the changes you have made and the new values will be displayed in the
Input spreadsheet. If you leave a cell blank inside the spreadsheet it will be interpreted as a missing value.

Clear runs. Click this button to clear all the cells of the Input spreadsheet to blanks (missing values).

Topological Map - Advanced Tab


Select the Advanced tab of the Topological Map dialog to access the options described here.

Text used to label units on Topological Map. The units on the topological map can optionally be labeled using one or
more pieces of information. Select the check boxes in this group box to display the following pieces of information:

Unit number. Select this check box to display an ordinal identifier for the neuron, as used in the Network Editor.

Position. Select this check box to display the (x,y) position of the unit in the topological map.

Class name. Select this check box to display the class label currently applied to the unit, and used to form the prediction of
the network should the unit become the winner.

Activation level. Select this check box to display the activation level of the unit when last executed.


Win frequency. Select this check box to display the number of cases for which the given unit was the winner (lowest
activation unit in the topological layer).

Saturation coefficient. The solid square indicator of activation level in the topological map is designed so that a zero-
distance input fills the entire unit square, and a distance 2.0 input (the maximum possible in a unit-normalized space, the
default STATISTICA Neural Networks configuration) produces a one pixel solid square. However, if you choose to use un-
normalized data, you may want to adjust this coefficient so that the significant range of distances is properly covered.

Activations. Click this button to generate a results spreadsheet of the current activations of the topological map units.

Topological Map - Win Frequencies Tab


Select the Win Frequencies tab of the Topological Map dialog to access the options described here.

Win frequencies. Click this button to generate a results spreadsheet of the Win Frequencies (the number of cases for which
the given neuron is the winner).

Show win frequencies on Topological Map. Select this check box to display the number of cases for which the given unit
was the winner (lowest activation unit in the topological layer).

Subsets used to generate win frequencies. The win frequencies are generated by executing the network using a number of
cases. In this group box, specify that one or more of the case subsets be used. You can choose among: Overall, All
separately, Training, Selection, Test, Ignored.

Select cases. Click this button to display the Select Cases dialog, where you can explicitly select the cases to be used in
constructing the win frequencies.

MD Deletion. In this group box, specify how cases with missing values should be treated. The Casewise option implies that
such cases should not be used when calculating the Win Frequencies; if Mean substitution is selected, these cases are used to
calculate win frequencies, and the network's missing value substitution procedure is used to compensate for any missing
values.

Note. The Subset, Case Selection, and MD Deletion options intersect. A case must pass all three of these tests to be accepted
and used to generate win frequencies.
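The win frequency calculation amounts to counting, for each radial unit, the number of cases for which that unit is the winner (smallest distance). A sketch, with exemplar vectors and cases invented for illustration:

```python
import numpy as np
from collections import Counter

# Invented exemplar vectors (one row per radial unit) and input cases.
exemplars = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
cases = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.1, 0.9]])

# The "winner" for each case is the unit at the smallest distance.
winners = [int(np.argmin(np.linalg.norm(exemplars - c, axis=1))) for c in cases]
win_frequencies = Counter(winners)
print(win_frequencies)  # Counter({1: 2, 0: 1, 2: 1})
```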

Network Illustration
Network Illustration Dialog

Network Illustration - Illustration Tab

Network Illustration - Custom Case Tab

Network Illustration Dialog


Click the Network illustration button on the Advanced tab of the Results dialog to display the Network Illustration dialog.
This dialog contains two tabs: Illustration and Custom Case. Use the options on this dialog to generate an illustration of a
neural network. The illustration can be exported for inclusion in documents describing your application. In addition, the
neurons in the illustration can be colored to correspond to their activation levels, giving an informative visual indication of
the network's activity.

Network list. Lists the currently selected networks. Select a network to display it in the illustration on the Illustration tab.

Cancel. Click this button to exit the Network Illustration dialog and return to the Results dialog.

Options. Click this button to display the Options menu.

Networks. Click this button to display the Select Networks and/or Ensembles dialog, which is used to select one or more
networks to be displayed in the Network list.

Network graph. Click this button to generate a STATISTICA graph containing the currently displayed network illustration.


Display unit activation using color. This check box is selected by default. Clear it if you want to remove the color
indication of unit activation level. This option may be useful if you are generating a network illustration for inclusion in
documentation.

Input case. Enter a case number (or use the microscrolls). The illustration is updated to show the effect of executing the
network with that case as input. If you want to modify the case and observe the effect upon the network, see the Custom
Case tab.

Network Illustration - Illustration Tab


Select the Illustration tab of the Network Illustration dialog, where the illustration is displayed. Unit activation levels are (by
default) displayed in color - red for positive activation levels, green for negative. If there is insufficient space to display all
the neurons in a layer, the middle ones are omitted.

Copy. Click this button (at the upper-right of the illustration window) to export the illustration to the Clipboard.

Neurons are represented using one of several shapes:

Triangles. Triangles pointing to the right indicate input neurons. These neurons perform no processing, and simply
introduce the input values to the network.

Squares. Squares indicate Dot Product synaptic function units (e.g. as found in Multilayer Perceptrons).

Circles. Circles indicate Radial synaptic function units.

Small open circles. Input and output variables are illustrated using a small open circle joined to the corresponding input or
output neuron. In some circumstances (nominal variables and time series inputs) a number of neurons are joined to a single
input or output variable.

Network Illustration - Custom Case Tab


Select the Custom Case tab of the Network Illustration dialog to access the options described here.

Update. Click this button to execute the custom case and redisplay the network illustration.

Input case. Enter a case number (or use the microscrolls). This can be modified by clicking the User defined input button.

User defined input. Click this button to display a general user entry spreadsheet that you can use to customize the inputs.
Click the OK button on the spreadsheet to accept the changes you have made. If you leave a cell blank inside the spreadsheet
it will be interpreted as a missing value.

Clear user defined runs. Click this button to clear the user-defined case, setting all values to "missing."

Time Series Projection


Time Series Projection Dialog

Time Series Projection - Quick Tab

Time Series Projection - Advanced Tab

Time Series Projection Dialog


Click the Time Series Projection button on the Results dialog - Advanced tab to display the Time Series Projection dialog.
This dialog contains two tabs: Quick and Advanced.

In time series problems, the objective is to predict (later) values of a variable or variables, from a number of (earlier) values
of the same or different variable or variables. In the most common case, a single variable is involved, and a number of
sequential values are used to predict the next value in the same sequence (Bishop, 1995).

STATISTICA Neural Networks supports a more general model: the input and output variable(s) do not have to be the same,
and the prediction can be more than one step ahead.


If the simple case (one step lookahead) is used, time series projection can be used. In Time Series Projection, the network is
executed on a starting case, and produces a prediction of the next value in the series. The starting case values are then shifted
back one time step, the prediction is added, and the next value is predicted. This process can be repeated an indefinite
number of times to predict further into the future, and a prediction graph plotted.
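The shift-and-predict loop described above can be sketched as follows (the `predict` function is a stand-in for a trained one-step-ahead network with three input steps; this is an illustration, not the program's internal code):

```python
def predict(window):
    # Stand-in for a trained one-step-ahead network; here a moving average.
    return sum(window) / len(window)

def project(start_window, n_steps):
    """Iterated projection: shift the window back one time step and feed the
    prediction back in as the newest input."""
    window = list(start_window)
    projections = []
    for _ in range(n_steps):
        next_value = predict(window)
        projections.append(next_value)
        window = window[1:] + [next_value]
    return projections

print(project([1.0, 2.0, 3.0], 4))
```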

Time Series Projection can be started from either a case in the data set, or from a user defined case. If a case from the data
set is used, the target time series is plotted in addition to the predicted output.

Cancel. Click the Cancel button to close the current analysis and display the Results dialog.

Options. Click the Options button to display the Options menu.

Select models. Click this button to display the Select Networks and/or Ensembles dialog where you can select the models to
be used for time series projection. If you select multiple models, their predictions are plotted on a single graph for
comparison. Time series projection can only be successfully applied to models matching the criteria discussed in the
introduction to this article.

Time Series Projection - Quick Tab


Select the Quick tab of the Time Series Projection dialog to access the options described here.

Length of projection. Specifies for how many time steps the projection should be performed.

Case (starts from). Specify a case from the data set to use to start the projection. For a time series network, the
STATISTICA Neural Networks convention is to specify the output case number. For example, if you want to start the
projection using cases 101 through 112 in a network with 12 steps and a lookahead of 1, you would specify case 113.

Time series graph. Click this button to generate the time series projection graph.

Time series spreadsheet. Click this button to generate a spreadsheet containing the data from the projection.

Time Series Projection - Advanced Tab


Select the Advanced tab of the Time Series Projection dialog to access the options described here.

Include observed values in graph/spreadsheet (where available). If this check box is selected and you start the time series
projection from a case in the data set, the corresponding values from the data set are plotted on the graph, allowing you to
compare the network's prediction with the observed value. This option is ignored if you modify the custom case yourself.

Inputs spreadsheet. This spreadsheet displays the values of the current input variable. For time series networks, there is a
column for each required lagged value. You can enter an input drawn from the data using the Input case. Alternatively, you
may want to run your own defined inputs using the User defined input button (see description below).

Time series graph. Click this button to generate the time series projection graph.

Time series spreadsheet. Click this button to generate a spreadsheet containing the data from the projection.

Input case. Enter a case number here to start the projection from a case in the data set. If you project from a custom case,
corresponding observed values are obviously not available, and the Include observed values… option is ignored. This field
is identical to Case (starts from) on the Quick tab. If you want to specify a custom case based on an existing case from the
data set, select the case using this field first, then modify it in the Inputs spreadsheet.

User defined input. Click this button to display a spreadsheet that you can use to customize the inputs. By clicking the OK
button on this dialog you can confirm the changes you have made and the new values will be displayed in the Inputs
spreadsheet. If you leave a cell blank inside the spreadsheet, it will be interpreted as a missing value.

Clear case. Click this button to clear all the inputs of the custom case (i.e. treat them as missing values).

Neural Networks Technical Notes


Activation Functions

Back Propagation


Classification

Classification by Labeled Exemplars

Classification Statistics

Classification Thresholds

Class Labeling

Conjugate Gradient Descent

Delta-Bar-Delta

Deviation Assignment Algorithms

Ensembles

Error Function

Joining Networks

Intelligent Problem Solver - Replication of Results

K-Means Algorithm

Kohonen Algorithm

Learned Vector Quantization

Levenberg-Marquardt

Loss Matrix

Model Profiles

Model Summary Details

Network Sets

Perceptron

Pseudo-Inverse (Singular Value Decomposition)

Quasi-Newton

Quick Propagation

Radial Sampling

Receiver Operating Characteristic (ROC) Curve

Regression

Regression Statistics

Resampling

Sensitivity Analysis

Stopping Conditions

Synaptic Functions

Time Series


Topological Map

Unit Types

Unsupervised Learning

Weigend Weight Regularization

Activation Functions
STATISTICA Neural Networks supports a wide range of activation functions. Only a few of these are used by default; the
others are available for customization.

Identity. The activation level is passed on directly as the output. Used in a variety of network types, including linear
networks, and the output layer of radial basis function networks.

Logistic. This is an S-shaped (sigmoid) curve, with output in the range (0,1).

Hyperbolic. The hyperbolic tangent function (tanh): a sigmoid curve, like the logistic function, except that output lies in the
range (-1,+1). Often performs better than the logistic function because of its symmetry. Optionally available via the
Intelligent Problem Solver. Ideal for customization of multilayer perceptrons, particularly the hidden layers.

-Exponential. The negative exponential function. Ideal for use with radial units. The combination of radial synaptic function
and negative exponential activation function produces units that model a Gaussian (bell-shaped) function centered at the
weight vector. The standard deviation of the Gaussian is determined by d, the "deviation" of the unit, which is stored in the
unit's threshold.

Softmax. Exponential function, with results normalized so that the sum of activations across the layer is 1.0. Can be used in
the output layer of multilayer perceptrons for classification problems, so that the outputs can be interpreted as probabilities
of class membership (Bishop, 1995; Bridle, 1990). Optionally used by the Intelligent Problem Solver.

Unit sum. Normalizes the outputs to sum to 1.0. Used in PNNs to allow the outputs to be interpreted as probabilities.

Square root. Used to transform the squared distance activation in an SOFM network or Cluster network to the actual
distance as an output.

Sine. Possibly useful when recognizing radially-distributed data; not used by default.

Ramp. A piece-wise linear version of the sigmoid function. Relatively poor training performance, but fast execution.

Step. Outputs either 1.0 or 0.0, depending on whether the Synaptic value is positive or negative. Can be used to model
simple networks such as perceptrons.

The mathematical definitions of the activation functions are given in the table below:

Activation Functions

Function       Definition                                   Range

Identity       x                                            (-inf,+inf)
Logistic       1/(1+e^-x)                                   (0,+1)
Hyperbolic     (e^x - e^-x)/(e^x + e^-x)                    (-1,+1)
-Exponential   e^-x                                         (0,+inf)
Softmax        e^xi / sum over j of e^xj                    (0,+1)
Unit sum       xi / sum over j of xj                        (0,+1)
Square root    sqrt(x)                                      (0,+inf)
Sine           sin(x)                                       [0,+1]
Ramp           -1 if x<=-1; x if -1<x<+1; +1 if x>=+1       [-1,+1]
Step           0 if x<0; +1 if x>=0                         [0,+1]
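A few of these definitions, written out as runnable code (an illustrative sketch with our own function names, not STATISTICA's implementation):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))      # output in (0, +1)

def hyperbolic(x):
    return math.tanh(x)                    # output in (-1, +1)

def neg_exponential(x):
    return math.exp(-x)                    # output in (0, +inf)

def ramp(x):
    return max(-1.0, min(1.0, x))          # piece-wise linear sigmoid

print(logistic(0.0))   # 0.5
print(ramp(3.7))       # 1.0
```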

Back Propagation
Back propagation is the best known training algorithm for neural networks and still one of the most useful. Devised
independently by Rumelhart et al. (1986), Werbos (1974), and Parker (1985), it is thoroughly described in most neural
network textbooks (e.g., Patterson, 1996; Fausett, 1994; Haykin, 1994). It has lower memory requirements than most
algorithms, and usually reaches an acceptable error level quite quickly, although it can then be very slow to converge
properly on an error minimum. It can be used on most types of networks in STATISTICA Neural Networks, although it is
most appropriate for training multilayer perceptrons.

The version of back propagation supported in STATISTICA Neural Networks includes:

l Time-dependent learning rate

l Time-dependent momentum rate

l Random shuffling of order of presentation.

l Additive noise during training

l Independent testing on selection set

l A variety of stopping conditions

l RMS error plotting: graph

l Selectable error function

The last five bulleted items are equally available in other iterative algorithms supported in STATISTICA Neural Networks,
including conjugate gradient descent, Quasi-Newton, Levenberg-Marquardt, quick propagation, Delta-bar-Delta, and
Kohonen training (apart from noise in conjugate gradients, Kohonen and Levenberg-Marquardt, and selectable error
function in Levenberg-Marquardt).

The algorithm is available from the following dialogs:

Train Multilayer Perceptron - Quick tab

Train Dot Product Layers - Train tab

Extended Dot Product Training

Technical Details. STATISTICA Neural Networks implements the on-line version of back propagation; i.e. it calculates the
local gradient of each weight with respect to each case during training. Weights are updated once per training case.

The update formula is:

Δwij(t) = η δj oi + α Δwij(t-1)

where:

η - the learning rate

δj - the local error gradient

α - the momentum coefficient

oi - the output of the i'th unit


Thresholds are treated as weights with oi = -1.

The local error gradient calculation depends on whether the unit into which the weights feed is in the output layer
or the hidden layers.

Local gradients in output layers are the product of the derivatives of the network's error function and the units'
activation functions.

Local gradients in hidden layers are formed from the local gradients of the units to which the unit's outgoing weights
connect: each downstream gradient is multiplied by the corresponding weight, the products are summed, and the sum is
multiplied by the derivative of the unit's own activation function.
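For a single weight, the on-line update rule described above can be sketched as follows (a minimal illustration with arbitrary learning rate and momentum values; not the program's internal code):

```python
def update_weight(weight, prev_delta, eta, alpha, local_gradient, unit_output):
    # delta_w = eta * (local gradient) * (unit output) + alpha * (previous delta_w)
    delta_w = eta * local_gradient * unit_output + alpha * prev_delta
    return weight + delta_w, delta_w

# One update with arbitrary illustrative values:
w, d = update_weight(0.5, 0.0, eta=0.1, alpha=0.3,
                     local_gradient=0.2, unit_output=1.0)
print(round(w, 4))  # 0.52
```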

Classification
In classification, the aim is to assign input cases to one of a number of classes. In STATISTICA Neural Networks,
classification can be performed using multilayer perceptrons, radial basis function networks, SOFM networks, cluster
networks, linear networks, and probabilistic neural networks.

STATISTICA Neural Networks has a number of facilities to support classification. It automatically interprets nominal
network output variables for classification, and can generate statistics on overall classification performance. The user can set
control parameters that determine how classification is performed.

Classification problems fall into two categories in STATISTICA Neural Networks: two-class problems, and many-class
problems. A two-class problem is usually encoded using a single output neuron. Many-class problems use one output neuron
per class. It is also possible to encode two-class problems using this approach (i.e., using two output neurons), and this is in
fact the approach taken by PNN networks.

Single output neuron. In these two-class networks, the target output is either 1.0 (indicating membership of one class) or
0.0 (representing membership of the other).

Multiple output neurons. In many-class problems, the target output is 1.0 in the correct class output and 0.0 in the others.

The output neuron activation levels provide confidence estimates for the output classes. It is desirable to be able to interpret
these confidence levels as probabilities. If the correct choice of network error function is used during optimization,
combined with the correct activation function, such an interpretation can be made. Specifically, a cross entropy error
function is combined with the logistic activation function for a two-class problem encoded by a single output neuron or with
softmax for a three or more class problem.

The entropic approach corresponds to maximum likelihood optimization, assuming that the data is drawn from the
exponential family of distributions. An important feature is that the outputs may be interpreted as posterior estimates of class
membership probability.
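The softmax/cross-entropy pairing described above can be illustrated with a short sketch (the function names here are illustrative, not part of STATISTICA Neural Networks):

```python
import math

def softmax(z):
    # Shift by the max for numerical stability; the outputs sum to 1.0
    # and can be read as posterior class-membership probabilities.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, outputs):
    # Multiple-output form: -sum(t * log(y)). Minimizing this under a
    # softmax output layer is maximum likelihood for one-of-N targets.
    return -sum(t * math.log(y) for t, y in zip(targets, outputs) if t > 0)

probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([1.0, 0.0, 0.0], probs)
```

The key property is that the softmax outputs always sum to 1.0, so they behave like a probability distribution over the classes.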

The alternative approach is to use a sum-squared error function with the logistic output activation function. This has less
statistical justification - the network learns a discriminant function, and although the outputs can be treated as confidence
measures, they are not probability estimates (and indeed may not even sum to 1.0). On the other hand, such networks
sometimes train more quickly, the training process is more stable, and they may achieve higher classification rates.

If a network is trained so that the outputs estimate probabilities, STATISTICA Neural Networks can be adjusted to support a
loss matrix.

Classification by Labeled Exemplars


Classification neural networks must translate the numeric activation level on the output neuron(s) to a nominal output variable. There
are two very different approaches to assigning classifications in STATISTICA Neural Networks.

In one of these, the activation level of the output layer units determines the class, usually by interpreting the activation level
as a confidence measure, and finding the highest confidence class. That approach is used in most neural network types, and
is described in more detail in Classification Thresholds.

This topic discusses the alternative approach, which is used in SOFM and Cluster networks.

These types of networks store labeled exemplar vectors in their radial layer. When a new case is presented to the network,
the network in essence calculates the distance between the (possibly normalized) new case and each exemplar vector; the
activations of the neurons encode these distances. Each of these neurons has a class label. The class label of the
"winning" (smallest distance from input case) neuron is typically used as the output of the network. In STATISTICA Neural
Networks, the standard algorithm is extended slightly using the KL nearest neighbor algorithm; the class assigned by the
network is the most common class among the K winning neurons, provided that at least L of them agree (otherwise, the class
is "Unknown").

The input case might actually be very distant from any of the exemplar vectors, in which case it may be better to assign the
case as "Unknown." You can optionally specify an accept threshold for this eventuality. If the normalized distance is greater
than this threshold, the class is "Unknown."
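The KL nearest neighbor rule, together with the optional accept threshold, can be sketched as follows (the function name and signature are illustrative, not the product's internal code; Euclidean distance is assumed):

```python
import math

def kl_nearest_class(case, exemplars, labels, k=3, l=2, accept=None):
    # Distances from the input case to every labeled exemplar vector.
    dists = sorted((math.dist(case, e), lab) for e, lab in zip(exemplars, labels))
    # Optional accept threshold: reject cases far from all exemplars.
    if accept is not None and dists[0][0] > accept:
        return "Unknown"
    # Most common class among the K winners, requiring at least L votes.
    top = [lab for _, lab in dists[:k]]
    best = max(set(top), key=top.count)
    return best if top.count(best) >= l else "Unknown"

exemplars = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = ["A", "A", "B", "B"]
result = kl_nearest_class((0, 0.4), exemplars, labels)
```

With k=3 and l=2, a case close to the two "A" exemplars is labeled "A"; a case far from every exemplar is rejected as "Unknown" when an accept threshold is supplied.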

Classification Statistics
One of the major uses of neural networks is to perform classification tasks, i.e., to assign cases to one of a number of
possible classes. The class of a case is indicated by the use of a nominal output variable.

You can generate overall statistics on classification performance by clicking the Descriptive Statistics button on the Results
dialog.

The classification statistics include, for each class:

Total. Number of cases of that class.

Correct. Number of cases correctly classified.

Wrong. Number of cases erroneously assigned to another class.

Unknown. Number of cases that could not be positively classified.

Correct (%). Percentage of cases correctly classified.

Wrong (%). Percentage of cases wrongly classified.

Unknown (%). Percentage of cases classified as unknown.

Classification Thresholds
Classification neural networks must translate the numeric activation level of the output neuron(s) to a nominal output
variable. There are two very different approaches to assigning classifications in STATISTICA Neural Networks.

One of these, where the neural network determines the "winning" neuron or neurons in the radial layer of the network, and
then uses the class labels on those neurons, is described in Classification by Labeled Exemplars. This approach is used in
SOFM and Cluster networks.

Here we discuss the alternative approach, where it is the activation level of the output layer units, which are not radial units,
that determines the class. This approach is used in all other network types.

Two cases need to be distinguished: single output neuron versus multiple output neurons.

Single output neurons are typically used for two-class problems, with a high output neuron level indicating one class and a
low activation the other class. This configuration, which uses the Two-state conversion function, is the default chosen for
two-class problems by STATISTICA Neural Networks.

Multiple output neurons are typically used for three or more class problems. One neuron is used for each class, and the
highest activation neuron indicates the class. The neuron activation levels can be interpreted as confidence levels. This
method is implemented by using the One-of-N conversion function for the output variable. You can optionally configure a
multilayer perceptron to use two output neurons for a two-class output variable by selecting the One-of-N Conversion option
on the Neural Network Editor - Variables tab.

There is also an obscure and little-used option to encode three or more classes on a single output neuron - see ordinal classification below.

Single output neuron. In the single output neuron case, two options are available. You can explicitly specify some
classification thresholds, or STATISTICA Neural Networks can determine them automatically for you.

Two thresholds are used: accept and reject. In the single output neuron case, the output is considered to be the first class if
the output neuron's activation is below the reject threshold, and to be the second class if its activation is above the accept
threshold. If the activation is between the two thresholds, the class is regarded as "unknown" (the so-called doubt option). If
the two thresholds are equal, there is no doubt option. There are two common configurations: accept=reject=0.5 implies no
doubt option, with the most likely class assigned; accept=0.95, reject=0.05 implies standard "95% confidence" in assignment
of a class, with doubt expressed otherwise. Both of these cases assume the standard logistic output neuron activation
function, which gives the output neuron a (0,1) range; the thresholds should be adjusted accordingly for different activation
functions [e.g., hyperbolic tangent uses output range (-1,+1)].
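The single-output decision rule can be sketched as follows (the class names and the at-the-boundary convention are illustrative; a logistic output with (0,1) range is assumed):

```python
def classify_single_output(activation, accept=0.95, reject=0.05):
    # Boundary convention: at or above `accept` is class two, at or
    # below `reject` is class one; anything in between is "unknown".
    # Setting accept == reject removes the doubt option entirely.
    if activation >= accept:
        return "class two"
    if activation <= reject:
        return "class one"
    return "unknown"
```

With the defaults, an activation of 0.5 falls in the doubt region; with accept=reject=0.5, every case is assigned to one of the two classes.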

As an alternative to selecting the thresholds yourself, you can specify that STATISTICA Neural Networks determines them
automatically. You specify a loss coefficient that gives the relative "cost" of the two possible misclassifications (false-
positive versus false-negative). A loss coefficient of 1.0 indicates that the two classes are equally important. A loss
coefficient above 1.0 indicates that it is relatively more important to correctly recognize class two cases, even at the expense
of misclassifying more class one cases. STATISTICA Neural Networks determines the thresholds (which are equal; i.e., no doubt option) by calculating a ROC curve and determining the point on the curve where the ratio of false positives to false negatives equals the loss coefficient. This equalizes the weighted loss on each class, independent of the number of cases in each (i.e., with a loss coefficient of 1.0, it is the proportion of misclassifications in each class that is equalized, not the absolute number of misclassifications).

Multiple output neurons. In the multiple output neuron case, two options are available. You can specify that no thresholds
are used, or you can explicitly define the thresholds yourself.

If no thresholds are used, the network uses a "winner takes all" algorithm; the highest activation neuron gives the class. There is no "doubt option."

If you specify thresholds, the class is still assigned to the highest neuron, but there is a doubt option. The highest neuron's activation must be above the accept threshold and all other neurons' activations below the reject threshold in order for the class to be assigned; if this condition is not fulfilled, the class is "unknown."
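The multiple-output rule can be sketched as follows (illustrative function name and class labels; not the product's internal code):

```python
def classify_multi_output(activations, labels, accept=None, reject=None):
    # Winner-takes-all: the highest-activation neuron names the class.
    win = max(range(len(activations)), key=activations.__getitem__)
    if accept is None or reject is None:
        return labels[win]  # no thresholds: no doubt option
    # With thresholds, the winner must clear `accept` and every other
    # neuron must fall below `reject`, otherwise the class is unknown.
    others_below = all(a < reject for i, a in enumerate(activations) if i != win)
    if activations[win] > accept and others_below:
        return labels[win]
    return "unknown"
```

A clear winner with all rivals suppressed yields a class; a second neuron above the reject threshold triggers the doubt option.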

If your multiple output neuron classification network is using the softmax activation function, the output neuron activations
are guaranteed to sum to 1.0 and can be interpreted as probabilities of class membership. However, in other cases, although
the activations can be interpreted as confidence levels in some sense (i.e. a higher number indicates greater confidence), they
are not probabilities, and should be interpreted with caution.

Ordinal classification. If you have a large number of classes, one-of-N encoding can become extremely unwieldy, as the
number of neurons in the network proliferates. An alternative approach then is to use ordinal encoding. The output is
mapped to a single neuron, with the output class represented by the ordinals 1, 2, 3, etc. The problem with this technique is
that it falsely implies an ordering on the classes (i.e. class 1 is more like class 2 than class 3).

However, in some circumstances it may be the only viable approach.

You can specify ordinal encoding in STATISTICA Neural Networks by changing the conversion function of the output
variable to minimax, using the Network Editor. The ordinals are then mapped into the output range of the neuron.

With ordinal encoding, the output is determined as follows. The output neuron's activation is linearly scaled using the factors
determined by minimax, then rounded to the nearest integer. This ordinal value gives the class.

The only classification threshold used is the accept threshold. If the difference between the output and the selected ordinal is greater than the threshold, then the classification is instead "unknown."

Example. The output is 3.8, which is rounded to the nearest ordinal 4. The difference is 0.2. If an accept threshold less than
0.2 has been selected, the classification is rejected, and "unknown" is generated.

An accept threshold of 0.5 or above is equivalent to not using a threshold at all.
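Assuming the output has already been rescaled into the ordinal range by the minimax factors, the decoding step can be sketched as follows (illustrative only):

```python
def classify_ordinal(output, n_classes, accept=0.5):
    # Round the scaled output to the nearest ordinal class 1..n_classes;
    # reject as "unknown" when the rounding error exceeds the accept
    # threshold. A threshold of 0.5 or more never rejects.
    ordinal = min(max(round(output), 1), n_classes)
    if abs(output - ordinal) > accept:
        return "unknown"
    return ordinal
```

This reproduces the worked example above: an output of 3.8 rounds to ordinal 4 with a difference of 0.2, which is rejected under an accept threshold of 0.15.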

Class Labeling
In STATISTICA Neural Networks, a variety of clustering algorithms and networks are supported. All of these are networks whose second layer consists of radial units, and these units contain exemplar vectors. In SOFM and Cluster networks,
these are combined in a two layer neural network with a single nominal output variable and the KNearest output conversion
function to produce a classification based upon the nearest exemplar vector(s) to an input case.

The exemplar vectors can be positioned using a variety of cluster-center and sampling approaches; see Cluster Networks for
more details. However, it is also necessary to apply class labels to the radial units (i.e. label each radial unit as representative
of a particular class). Class labels can be applied using the class labeling algorithms described here, or interactively using the
Topological Map dialog.

These algorithms are available from the following dialogs:

Cluster Network Training

Radial Training

KL Nearest Neighbor Labeling. This algorithm assigns labels to units based upon the labels of the K nearest neighboring
training cases. Provided that at least L of the K neighbors are of the same class, this class is used to label the unit. If not, a
blank label is applied, signifying an "unknown" class.

Note that this is distinct from (although related to) the KL-Nearest algorithm used when executing cluster networks, which
reports the class of at least L of the K nearest units to the input case.

Voronoi Labeling. This algorithm assigns labels to units based upon the labels of the training cases that are "assigned" to
that unit. Assigned cases are those that are nearer to this unit than to any other (i.e. those that would be classified by this unit
if using the 1-NN classification scheme). These are the cases in the Voronoi neighborhood of the unit. The class of the
majority of the training cases is used to label the unit, provided that at least a given minimum proportion of the training cases
belong to this majority. If not, a blank label is applied, signifying an "unknown" class.
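Voronoi labeling can be sketched as follows (illustrative names, Euclidean distance, and an empty string standing in for the blank "unknown" label; this is not the product's internal code):

```python
import math
from collections import Counter

def voronoi_label(units, cases, case_labels, min_proportion=0.6):
    # Collect the training cases falling in each unit's Voronoi
    # neighborhood (nearer to that unit than to any other).
    assigned = [[] for _ in units]
    for case, lab in zip(cases, case_labels):
        nearest = min(range(len(units)), key=lambda i: math.dist(case, units[i]))
        assigned[nearest].append(lab)
    labels = []
    for labs in assigned:
        if not labs:
            labels.append("")  # empty neighborhood: unknown
            continue
        cls, count = Counter(labs).most_common(1)[0]
        # The majority class labels the unit only if it reaches the
        # required minimum proportion of the neighborhood.
        labels.append(cls if count / len(labs) >= min_proportion else "")
    return labels

units = [(0, 0), (10, 10)]
cases = [(0, 1), (1, 0), (9, 10), (10, 9), (9, 9)]
case_labels = ["A", "A", "B", "B", "A"]
out = voronoi_label(units, cases, case_labels)
```

Raising the minimum proportion makes labeling stricter: the second unit's 2/3 "B" majority survives a 0.6 proportion but not a 0.7 proportion.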

Conjugate Gradient Descent


Conjugate gradient descent (Bishop, 1995; Shepherd, 1997) is an advanced method of training multilayer perceptrons. It
usually performs significantly better than back propagation, and can be used wherever back propagation can be. It is the
recommended technique for any network with a large number of weights (more than a few hundred) and/or multiple output
units. For smaller networks, either Quasi-Newton or Levenberg-Marquardt may be better, the latter being preferred for low-
residual regression problems.

Conjugate gradient descent is a batch update algorithm: whereas back propagation adjusts the network weights after each
case, conjugate gradient descent works out the average gradient of the error surface across all cases before updating the
weights once at the end of the epoch.

For this reason, there is no shuffle option available with conjugate gradient descent, since it would clearly serve no useful function. There is also no need to select learning or momentum rates for conjugate gradient descent, so it can be much easier to use than back propagation. Additive noise would destroy the assumptions made by conjugate gradient descent about the shape of the search space, and so is also not available.

Conjugate gradient descent works by constructing a series of line searches across the error surface. It first works out the
direction of steepest descent, just as back propagation would do. However, instead of taking a step proportional to a learning
rate, conjugate gradient descent projects a straight line in that direction and then locates a minimum along this line, a process
that is quite fast as it only involves searching in one dimension. Subsequently, further line searches are conducted (one per
epoch). The directions of the line searches (the conjugate directions) are chosen to try to ensure that the directions that have
already been minimized stay minimized (contrary to intuition, this does not mean following the line of steepest descent each
time).

The conjugate directions are actually calculated on the assumption that the error surface is quadratic, which is not generally
the case. However, it is a fair working assumption, and if the algorithm discovers that the current line search direction isn't
actually downhill, it simply calculates the line of steepest descent and restarts the search in that direction. Once a point close
to a minimum is found, the quadratic assumption holds true and the minimum can be located very quickly.

Note: The line searches on each epoch of conjugate gradient descent actually involve one gradient calculation plus a
variable number (perhaps as high as twenty) of error evaluations. Thus a conjugate gradient descent epoch is substantially
more time-consuming (typically 3-10 times longer) than a back propagation epoch. If you want to compare performance of
the two algorithms, you will need to record the time taken rather than relying on the epochs elapsed on the Progress Graph.

The algorithm is available from the following dialogs:

Multilayer Perceptron Training

Dot Product Training

Extended Dot Product Training

Technical Details. Conjugate gradient descent is batch-based; it calculates the error gradient as the sum of the error
gradients on each training case.

The initial search direction is the line of steepest descent:

d(0) = -g(0)

where g(t) is the gradient of the error surface at epoch t. Subsequently, the search direction is updated using the Polak-Ribiere formula:

β(t) = [g(t+1) - g(t)]·g(t+1) / [g(t)·g(t)]
d(t+1) = -g(t+1) + β(t)d(t)

If the search direction is not downhill, the algorithm restarts using the line of steepest descent. It restarts anyway after W
directions (where W is the number of weights), as at that point the conjugacy has been exhausted.

Line searches are conducted using Brent's iterative line search procedure, which utilizes a parabolic interpolation to locate
the line minima extremely quickly.
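For a generic differentiable error function, the procedure can be sketched as follows (a crude sampled line search stands in for Brent's method, and the Polak-Ribiere beta is clamped at zero - a common restart variant; this is an illustration, not the product's implementation):

```python
def cg_minimize(f, grad, w, epochs=20):
    # First search direction: steepest descent, d = -g.
    g = grad(w)
    d = [-gi for gi in g]
    for _ in range(epochs):
        # One-dimensional line search along d (sampled; Brent's method
        # would use parabolic interpolation instead).
        steps = [s / 100 for s in range(1, 201)]
        alpha = min(steps, key=lambda a: f([wi + a * di for wi, di in zip(w, d)]))
        w = [wi + alpha * di for wi, di in zip(w, d)]
        g_new = grad(w)
        # Polak-Ribiere beta, clamped at zero (restart when negative).
        beta = sum(gn * (gn - go) for gn, go in zip(g_new, g))
        beta = max(beta / max(sum(go * go for go in g), 1e-12), 0.0)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        # If the new direction is not downhill, restart along steepest descent.
        if sum(di * gi for di, gi in zip(d, g_new)) >= 0:
            d = [-gn for gn in g_new]
        g = g_new
    return w

# Quadratic bowl with minimum at (1, -2); CG locates it quickly
# because the quadratic assumption holds exactly here.
f = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
grad = lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)]
w_min = cg_minimize(f, grad, [0.0, 0.0])
```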

Delta-bar-Delta
Delta-bar-Delta (Jacobs, 1988; Patterson, 1996) is an alternative to back propagation, which is sometimes more efficient,
although it can be more inclined to stick in local minima than back propagation. Unlike quick propagation, it tends to be
quite stable.

Like quick propagation, Delta-bar-Delta is a batch algorithm: the average error gradient across all the training cases is
calculated on each epoch, then the weights are updated once at the end of the epoch.

Delta-bar-Delta is inspired by the observation that the error surface may have a different gradient along each weight
direction, and that consequently each weight should have its own learning rate (i.e. step size).

In Delta-bar-Delta, the individual learning rates for each weight are altered on each epoch to satisfy two important heuristics:

- If the derivative has the same sign for several iterations, the learning rate is increased (the error surface has a low curvature, and so is likely to continue sloping the same way for some distance);

- If the sign of the derivative alternates for several iterations, the learning rate is rapidly decreased (otherwise the algorithm may oscillate across points of high curvature).

To satisfy these heuristics, Delta-bar-Delta has an initial learning rate used for all weights on the first epoch, an increment
factor added to learning rates when the derivative does not change sign, and a decay rate multiplied by the learning rates
when the derivative does change sign. Using linear growth and exponential decay of learning rates contributes to stability.

The algorithm described above could still be prone to poor behavior on noisy error surfaces, where the derivative changes
sign rapidly even within an overall downward trend. Consequently, the increase or decrease of learning rate is actually based
on a smoothed version of the derivative.

The algorithm is available from the following dialogs:

Multilayer Perceptron Training

Dot Product Training

Extended Dot Product Training

Technical Details. Weights are updated using the same formula as in back propagation, except that momentum is not used,
and each weight has its own time-dependent learning rate.

All learning rates are initially set to the same starting value; subsequently, they are adapted on each epoch using the
formulae below.

The bar-Delta value is calculated as:

bar-d(t) = (1 - θ)d(t) + θ bar-d(t-1)

where d(t) is the derivative of the error surface, and θ is the smoothing constant.

The learning rate of each weight is updated using:

η(t+1) = η(t) + k,  if bar-d(t-1)d(t) > 0
η(t+1) = η(t)f,     if bar-d(t-1)d(t) < 0
η(t+1) = η(t),      otherwise

where k is the linear increment factor, and f is the exponential decay factor.
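The learning-rate adaptation for a single weight can be traced over a sequence of derivatives with the following sketch (illustrative names; the weight update itself is omitted):

```python
def delta_bar_delta_rates(derivatives, eta0=0.1, k=0.01, f=0.5, theta=0.7):
    # Trace one weight's learning rate over a sequence of error
    # derivatives d(t): grow linearly by k while the smoothed
    # derivative keeps its sign, decay exponentially by f on a flip.
    eta, bar_d = eta0, 0.0
    rates = []
    for d in derivatives:
        if bar_d * d > 0:
            eta = eta + k      # consistent sign: low curvature, speed up
        elif bar_d * d < 0:
            eta = eta * f      # sign flip: oscillation, slow down
        bar_d = (1 - theta) * d + theta * bar_d
        rates.append(eta)
    return rates

rates = delta_bar_delta_rates([1.0, 1.0, -1.0])
```

Two consistent derivatives grow the rate by the linear increment; the sign flip on the third step halves it, which keeps growth gradual and decay rapid.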

Deviation Assignment Algorithms


These algorithms assign deviations to the radial units in certain network types. The deviation is multiplied by the distance
between the unit's exemplar vector and the input vector, to determine the unit's output. In essence, the deviation gives the
size of the cluster represented by a radial unit.

Deviation assignment algorithms are used after radial centers have been set; see Radial Sampling and K Means.

These algorithms are available from the following dialogs:

Radial Basis Function Training

Cluster Network Training

Radial Training

Explicit Deviation Assignment

The deviation is set to an explicit figure provided by the user.

Notes. The deviation assigned by this technique is not the standard deviation of the Gaussians; it is the value stored in the
unit threshold, which is multiplied by the distance of the weight vector from the input vector. It is related to the standard
deviation by:

If you want to explicitly assign differing deviations to individual radial units, use the Network Editor - Weights tab, to set the
Thresholds in the radial layer (the "threshold" holds the deviation in radial units).

Isotropic Deviation Assignment. This algorithm uses the isotropic deviation heuristic (Haykin, 1994) to assign the
deviations to radial units. This heuristic attempts to determine a reasonable deviation (the same for all units), based upon the
number of centers, and how spread out they are.

This isotropic deviation heuristic sets radial deviations to:

σ = d / √(2k)

where d is the distance between the two most distant centers, and k is the number of centers.

The version implemented in STATISTICA Neural Networks multiplies the above formula by the constant given in the
Deviation field.
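The heuristic can be sketched in a few lines (illustrative function name; Euclidean distance and tuple-valued centers assumed):

```python
import math

def isotropic_deviation(centers, multiplier=1.0):
    # sigma = d / sqrt(2k): d is the distance between the two most
    # distant centers, k the number of centers; the multiplier mirrors
    # the constant given in the Deviation field.
    d = max(math.dist(a, b) for a in centers for b in centers)
    return multiplier * d / math.sqrt(2 * len(centers))

sigma = isotropic_deviation([(0, 0), (3, 4)])
```

With two centers at distance 5, the heuristic yields 5/√4 = 2.5.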

K-Nearest Neighbor Deviation. The K-nearest neighbor deviation assignment algorithm (Bishop, 1995) assigns deviations
to radial units by using the RMS (Root Mean Squared) distance from the K units closest to (but not coincident with) each
unit as the standard deviation (assuming the unit models a Gaussian). Each unit hence has its own independently calculated
deviation, based upon the density of points close to itself.

If fewer than K non-coincident neighbors are available, the algorithm uses those that are available.
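A sketch of this rule, assuming Euclidean distance and tuple-valued units (illustrative only):

```python
import math

def knn_deviations(units, k=2):
    # Each unit's deviation is the RMS distance to its K nearest
    # non-coincident neighbors; if fewer than K are available, the
    # available ones are used instead.
    devs = []
    for u in units:
        dists = sorted(d for v in units for d in [math.dist(u, v)] if d > 0)
        nearest = dists[:k] or [0.0]
        devs.append(math.sqrt(sum(d * d for d in nearest) / len(nearest)))
    return devs

devs = knn_deviations([(0, 0), (0, 3), (4, 0)])
```

Each unit gets its own deviation: the unit at the origin sees neighbors at distances 3 and 4, so its RMS deviation is √(25/2).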

Ensembles
Ensembles are collections of neural networks that cooperate in performing a prediction. Two types of ensembles are
supported in STATISTICA Neural Networks: output ensembles and confidence ensembles.

In STATISTICA Neural Networks, ensembles can be created by the Intelligent Problem Solver or Custom Network Designer,
or can be created and edited "by hand" (see the Create/Edit Ensemble dialog).

Output ensembles. Output ensembles are the most general form. Any combination of networks can be combined in an
output ensemble. If the networks have different outputs, the resulting ensemble simply has multiple outputs. Thus, an output
ensemble can be used to form a multiple output model where each output's prediction is formed separately.

If any networks in the ensemble have a shared output, the ensemble estimates a value for that output by combining the
outputs from the individual networks. For classification (nominal outputs), the networks' predictions are combined in a
winner-takes-all vote - the most common class among the combined networks is used. In the event of a tie, the "unknown"
class is returned. For regression (numeric variables), the networks' predictions are averaged. In both cases, the vote or
average is weighted using the networks' membership weights in the ensemble (usually all equal to 1.0).
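The two combination rules can be sketched as follows (illustrative function names; membership weights default to 1.0 as described above):

```python
from collections import Counter

def ensemble_classify(predictions, weights=None):
    # Weighted winner-takes-all vote; a tie returns "Unknown".
    weights = weights or [1.0] * len(predictions)
    votes = Counter()
    for p, w in zip(predictions, weights):
        votes[p] += w
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "Unknown"
    return ranked[0][0]

def ensemble_regress(predictions, weights=None):
    # Weighted average of the member networks' numeric predictions.
    weights = weights or [1.0] * len(predictions)
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)
```

A two-to-one vote yields the majority class, an even split yields "Unknown", and numeric predictions are simply averaged.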

Confidence ensembles. Confidence ensembles are much more restrictive than output ensembles. The network predictions
are combined at the level of the output neurons. To make sense, the encoding of the output variables must therefore be the
same for all the members. Given that restriction, there is no point in forming confidence ensembles for regression problems,
as the effect is to produce the same output as an output ensemble, but with the averaging performed before scaling rather
than after. Confidence ensembles are designed for use with classification problems.

The advantage of using a confidence ensemble for a classification problem is that it can estimate overall confidence levels
for the various classes, rather than simply providing a final choice of class.

Why use ensembles?

There are a number of uses for ensembles:

- Ensembles can conveniently group together networks that provide predictions for related variables without requiring that all those variables be combined into a single network. Multiple output networks often suffer from cross-talk in the hidden neurons, and make ineffective predictions. Using an ensemble, each output can be predicted separately.

- Ensembles provide an important method to combat over-learning and improve generalization. Averaging predictions across models with different structures, and/or trained on different data subsets, can reduce model variance without increasing model bias. This is a relatively simple way to improve generalization. Ensembles are therefore particularly effective when combined with resampling. An important piece of theory shows that the expected performance of an ensemble is greater than or equal to the average performance of the members.

- Ensembles report the average performance and error measures of their member networks. You can perform resampling experiments, and save the results to an ensemble. These average measures then give an unbiased estimate of an individual network's performance, if trained in the same fashion. It is standard practice to use resampling techniques such as cross-validation to estimate network performance in this fashion.

Error Function
The error function is used in training the network and in reporting the error. The error function used can have a profound
effect on the performance of training algorithms (Bishop, 1995). It is specified using the Network Editor.

The following four error functions are available.

Sum-squared. The error is the sum of the squared differences between the target and actual output values on each output
unit. This is the standard error function used in regression problems. It can also be used for classification problems, giving
robust performance in estimating discriminant functions, although arguably entropy functions are more appropriate for
classification, as they correspond to maximum likelihood decision making (on the assumption that the generating
distribution is drawn from the exponential family), and allow outputs to be interpreted as probabilities.

City-block. The error is the sum of the differences between the target and actual output values on each output unit;
differences are always taken to be positive. The city-block error function is less sensitive to outlying points than the sum-
squared error function (where a disproportionate amount of the error can be accounted for by the worst-behaved cases).
Consequently, networks trained with this metric may perform better on regression problems if there are a few wide-flung
outliers (either because the data naturally has such a structure, or because some cases may be mislabeled).

Cross-entropy (single & multiple). This error is the negative sum, over the output units, of the products of the target values and the logarithms of the corresponding actual output values. There are two versions: one for single-output (two-class) networks, the other for multiple-output networks. The cross-entropy error function is specially designed for classification problems, where it is used in combination with the logistic (single output) or softmax (multiple output) activation functions in the output layer of the network. This is equivalent to maximum likelihood estimation of the network weights. An MLP with no hidden layers, a single output unit, and a cross-entropy error function is equivalent to a standard logistic regression model (logit classification).

Kohonen. The Kohonen error assumes that the second layer of the network consists of radial units representing cluster
centers. The error is the distance from the input case to the nearest of these. The Kohonen error function is intended for use
with Kohonen networks and Cluster networks only.
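The four error functions can be written down directly; the sketch below assumes list-valued targets and outputs, Euclidean distance for the Kohonen case, and the conventional -Σ t log(y) form of cross-entropy (illustrative, not the product's internal code):

```python
import math

def sum_squared(targets, outputs):
    # Standard regression error: squared differences summed over units.
    return sum((t - y) ** 2 for t, y in zip(targets, outputs))

def city_block(targets, outputs):
    # Absolute differences: less influenced by outlying cases.
    return sum(abs(t - y) for t, y in zip(targets, outputs))

def cross_entropy_multi(targets, outputs):
    # Multiple-output cross-entropy for one-of-N targets with softmax
    # outputs: -sum(t * log(y)).
    return -sum(t * math.log(y) for t, y in zip(targets, outputs) if t > 0)

def kohonen_error(case, centers):
    # Distance from the input case to the nearest cluster center.
    return min(math.dist(case, c) for c in centers)
```

Note how city-block grows linearly with the size of an error while sum-squared grows quadratically, which is why the former is less dominated by outliers.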

Joining Networks
It is sometimes useful to join two networks together to form a single composite network for a number of reasons:

- You might train one network to pre-process data, and another to further classify the pre-processed data. Once completed, the networks can be joined together to classify raw data.

- You might want to add a loss matrix to a classification network, to make minimum cost decisions.

To join two networks together, take the following steps:

1. Make sure both networks are in the same network file. If they are not, use the Merge option on the Neural Network File
Editor - Advanced tab to merge the networks from one network file into the other.

2. Click the Join networks button on the Neural Network File Editor - Advanced tab.

3. Select the pre-processing network (which provides the input and succeeding layers of the joined network) from the
network list, and click OK.

4. Select the post-processing network (which provides the output and preceding layers) from the network list, and click OK.

Note: Networks can only be joined if the number of input neurons in the second network matches the number of output
neurons in the first network. The input neurons from the second network are discarded, and their fan-out weights are
attached to the output neurons of the first network.

Caution: The post-processing information from the first network and the input preprocessing information from the second
network are also discarded. The composite network is unlikely to make sense unless you have designed the two networks
with this in mind; i.e., with no post-processing performed by the first network and no preprocessing performed by the second
network.

Intelligent Problem Solver - Replication of Results


Sometimes you may want to use the Intelligent Problem Solver to get an initial idea of the right architecture for a neural
network, then to build networks "by hand" yourself.

It is worth noting that, with many of the training algorithms used in STATISTICA Neural Networks, if you reinitialize a
neural network and retrain, you will almost certainly get different results even if you use exactly the same training
parameters. This occurs because the weights are randomized before applying most neural network training algorithms, and
the starting weight values affect the eventual results achieved. The lack of reproducibility implied may seem a cause for
concern if you are accustomed to the more predictable world of linear modeling, but it is the price paid for the greater power
of the nonlinear models represented by neural networks. Experience indicates that, provided you have sufficient training data, most neural networks of the same architecture will achieve very similar performance figures (barring some that get stuck in local minima, which should be discarded).

This section contains instructions on three aspects of reconstructing Intelligent Problem Solver performance: how to copy a
network from the network set and train it as a new network, how to examine the parameters of a network created by the
Intelligent Problem Solver, and information about the algorithms used by the Intelligent Problem Solver.

Making an independent copy of a network. If you want to create a network similar to one generated by the Intelligent
Problem Solver, the easiest way is to first clone that network, then make any modifications that you please.

Select Network Set Editor from the Neural Networks Startup Panel - Advanced tab to display the Neural Network File Editor
dialog.

Select the network that you want to use from the network list. The list presents a variety of summary details, including a
brief description of the training algorithms used; see Summary Details for more details. Full details of the network itself can
be found using the Network Editor.

Click the Clone button. The clone is added to the end of the network set.

Click OK.

Select the cloned network on the Networks/Ensembles tab of the Startup dialog.

If you want to edit the network, use the Neural Network Editor (accessed by selecting Model Editor on the Neural Networks
Startup Panel - Advanced tab); if you want to retrain the network, select the Retrain option on the Analysis pull-down menu.

Reproducing the Intelligent Problem Solver training runs. The Intelligent Problem Solver uses a range of techniques, and some results will not be precisely reproducible. The vast majority of the information needed to reproduce training performance (subject to the caveat that every training run produces somewhat different results) is outlined in the sections above. In addition, the following guidelines should be noted:

The Intelligent Problem Solver uses Weigend regularization extensively in multilayer perceptron training. Oversized
networks are trained with regularization and then pruned to size. Pruning of both inputs and hidden units is performed based
on weight magnitudes; this process can be reproduced using the Weigend decay factors on the Decay tab of the Multilayer
Perceptron Training dialog. In addition, input variables are pruned using Sensitivity Analysis. The networks recorded in the
network set are the consequence of pruning; therefore, it should not be necessary to prune them further, and in reproducing
runs you should not need to know the regularization factor used originally in training the network. However, if your data set
is relatively small for the number of variables, the network may suffer over-learning during training. In this case, you will
need to experimentally determine a suitable level of regularization to prevent over-learning, in order to reproduce the
performance of the Intelligent Problem Solver.

You can select a regularization factor using the Weigend decay factor on the Decay tab of the Multilayer Perceptron
Training dialog; typically, effective figures are in the range 0.05-0.001, although the Intelligent Problem Solver also
experiments with smaller and larger figures than this.

In two-output classification problems, the Intelligent Problem Solver uses the Calculate minimum loss threshold option
(available on most training dialogs) - the loss coefficient is defaulted to 1.0. In multiple-output classification problems, it
uses the Assign to highest confidence (no thresholds) option.

For Multilayer Perceptron training, the Intelligent Problem Solver uses a two-stage training process. First, a short burst of
back propagation training is applied (100 epochs, with learning rate 0.1 and momentum 0.3) - this usually locates the
approximate position of a reasonable minimum. Second, a long period of conjugate gradient descent (several thousand
epochs) is used, with a stopping window of 50, to terminate training once convergence stops or over-learning occurs (in
reality, few problems will actually take thousands of epochs, and the algorithm usually stops after a couple of hundred
epochs). Once the algorithm stops, the best network from the training run is restored. You can reproduce this behavior by
selecting 50 for the Window parameter on the Train Multilayer Perceptron - End tab.

K-Means Algorithm
The K-means algorithm (Moody and Darken, 1989; Bishop, 1995) assigns radial centers to the first hidden layer in the
network if it consists of radial units.

K-means assigns each training case to one of K clusters (where K is the number of radial units), such that each cluster is
represented by the centroid of its cases, and each case is nearer to the centroid of its cluster than to the centroids of any other
cluster. It is the centroids that are copied to the radial units.

The intention is to discover a set of cluster centers which best represent the natural distribution of the training cases.

This algorithm is available from the following dialogs:

Radial Basis Function Training

Cluster Network Training

Radial Training

Technical Details. K-means is an iterative algorithm. The clusters are first formed arbitrarily by choosing the first K cases,
assigning each subsequent case to the nearest of the K, then calculating the centroids of each cluster.

Subsequently, each case is tested to see whether the center of another cluster is closer than the center of its own cluster; if so,
the case is reassigned. If cases are reassigned, the centroids are recalculated and the algorithm repeats.
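The iteration described above can be sketched in a few lines. This is an illustrative Python sketch, not STATISTICA's implementation; the seeding with the first K cases follows the description above.

```python
import numpy as np

def k_means(cases, k, max_iter=100):
    """Assign K cluster centroids: seed with the first K cases, then
    alternately assign each case to its nearest centroid and recompute
    the centroids, until no case is reassigned."""
    cases = np.asarray(cases, dtype=float)
    centers = cases[:k].copy()                      # arbitrary seed: the first K cases
    assign = np.full(len(cases), -1)
    for _ in range(max_iter):
        # distance of every case to every center
        d = np.linalg.norm(cases[:, None, :] - centers[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)
        if np.array_equal(new_assign, assign):      # converged: no reassignment
            break
        assign = new_assign
        for j in range(k):                          # recompute the centroid of each cluster
            members = cases[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```

The returned centroids are what would be copied into the radial units.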

Caution. There is no formal proof of convergence for this algorithm, although in practice it usually converges reasonably
quickly. Consequently, STATISTICA Neural Networks cannot give a reliable estimate of the time needed for execution. If a
pathological case is encountered, be patient: the algorithm will terminate eventually.


Kohonen Algorithm
The Kohonen algorithm (Kohonen, 1982; Patterson, 1996; Fausett, 1994) assigns centers to a radial hidden layer by
attempting to recognize clusters within the training cases. Cluster centers close to one another in pattern-space tend to be
assigned to units that are close to each other in the network (topologically ordered).

The Kohonen training algorithm is the algorithm of choice for Self Organizing Feature Map networks. It can also be used in
STATISTICA Neural Networks to train the radial layer in other network types; specifically, radial basis function, cluster, and
generalized regression neural networks.

SOFM networks are typically arranged with the radial layer laid out in two dimensions. From an initially random set of
centers, the algorithm tests each training case and selects the nearest center. This center and its neighbors are then updated to
be more like the training case.

Over the course of the algorithm, the learning rate (which controls the degree of adaptation of the centers to the training
cases) and the size of the neighborhood are gradually reduced. In the early phases, therefore, the algorithm assigns a rough
topological map, with similar clusters of cases located in certain areas of the radial layer. In later phases the topological map
is fine-tuned, with individual units responding to small clusters of similar cases.

If the neighborhood is set to zero throughout, the algorithm is a simple cluster-assignment technique. It can also be used on a
one-dimensional layer with or without neighborhood definition.

If class labels are available for the training cases, then after Kohonen training, labels can be assigned using class labeling
algorithms and Learned Vector Quantization used to improve the positions of the radial exemplars.

The algorithm is available from the following dialogs:

SOFM Training

Radial Training

Extended Kohonen Training

Technical Details. The Kohonen update rule is:

w' = w + h(t)·(x - w)

where

w is the center vector of the winning unit (and of its neighbors),

x is the training case,

h(t) is the learning rate.
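One epoch of training under this update rule can be sketched as follows. This is illustrative Python, assuming a one-dimensional radial layer; the function and parameter names are not part of STATISTICA.

```python
import numpy as np

def kohonen_epoch(cases, centers, rate, neighborhood):
    """One epoch of the Kohonen update rule on a 1-D radial layer.
    `neighborhood` is the half-width of the set of units updated around
    the winner; 0 reduces the algorithm to plain cluster assignment."""
    for x in cases:
        win = np.linalg.norm(centers - x, axis=1).argmin()   # nearest center
        lo = max(0, win - neighborhood)
        hi = min(len(centers), win + neighborhood + 1)
        centers[lo:hi] += rate * (x - centers[lo:hi])        # w <- w + h(t)(x - w)
    return centers
```

In a full run, both `rate` and `neighborhood` would be gradually reduced between epochs, as described above.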

Learned Vector Quantization


The Learned Vector Quantization algorithm (LVQ) was invented by Teuvo Kohonen (Fausett, 1994; Kohonen, 1990), who
also invented the Self-Organizing Feature Map.

Learned Vector Quantization provides a supervised version of the Kohonen training algorithm. The standard Kohonen
algorithm iteratively adjusts the position of the exemplar vectors stored in the Radial layer of the Kohonen network by
considering only the positions of the existing vectors and of the training data. In essence, the algorithm attempts to move the
exemplar vectors to positions that reflect the centers of clusters in the training data. However, the class labels of the training
data cases are not taken into account. For superior classification performance, it is desirable that the exemplar vectors are
adjusted, to some extent, on a per-class basis - that is, that they reflect natural clusters in each separate class. An exemplar
located on a class boundary, equally close to cases of two classes, is unlikely to be of much use in distinguishing class. On
the other hand, exemplars located just inside class boundaries can be extremely useful.

There are several variants of Learned Vector Quantization, and STATISTICA Neural Networks supports three of these. The
basic version, LVQ1, is very similar to the Kohonen training algorithm. The closest exemplar to a training case is selected
during training and has its position updated. However, whereas the Kohonen algorithm would move this exemplar toward
the training case, LVQ1 checks whether the class label of the exemplar vector is the same as that of the training case. If it is,
the exemplar is moved toward the training case; if it is not, the exemplar is moved away from the case. The more
sophisticated LVQ algorithms, LVQ2.1 and LVQ3, take into account more information. They locate the nearest two
exemplars to the training case. If one of these is of the right class and one the wrong class, they move the right class toward
the training case and the wrong one away from it. LVQ3 also moves both exemplars toward the training case if they are both
of the right class. In both LVQ2.1 and LVQ3, the concept is to move exemplars where there is some danger of
misclassification.


The algorithm is available from the following dialogs:

Cluster Network Training

Radial Training

Technical Details. The basic update rule is:

w' = w + h(t)·(x - w) if the exemplar and training case have the same class,

w' = w - h(t)·(x - w) if they do not.

w is the exemplar vector, x is the training case, h(t) is the learning rate.
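The LVQ1 variant described above can be sketched as follows (illustrative Python; the names are not part of STATISTICA):

```python
import numpy as np

def lvq1_epoch(cases, labels, exemplars, ex_labels, rate):
    """One epoch of LVQ1: move the nearest exemplar toward the training
    case if its class label matches, and away from it if not."""
    for x, c in zip(cases, labels):
        win = np.linalg.norm(exemplars - x, axis=1).argmin()  # nearest exemplar
        sign = 1.0 if ex_labels[win] == c else -1.0           # same class: attract; else repel
        exemplars[win] += sign * rate * (x - exemplars[win])
    return exemplars
```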

In LVQ2.1, the two nearest exemplars are adjusted only if one is of the right class and one is not, and they are both "about
the same" distance from the training case. The definition of "about the same" distance uses a special parameter, e, and the
formula below, where d1 and d2 are the distances from the training case to the two nearest exemplars:

min( d1/d2, d2/d1 ) > (1 - e) / (1 + e)

In LVQ3, an alternative formula is used to ensure that the two nearest are both "about the same distance" from the training
case:

min( d1/d2, d2/d1 ) > (1 - e)(1 + e)

In addition, in LVQ3, if both the two nearest exemplars are of the same class as the training case, they are both moved
toward the case, using a learning rate b times the standard learning rate at that epoch.

Levenberg-Marquardt
Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963; Bishop, 1995; Shepherd, 1997; Press et al., 1992) is an
advanced non-linear optimization algorithm. In STATISTICA Neural Networks, it can be used to train the weights in a
network just as back propagation would be. It is reputably the fastest algorithm available for such training. However, its use
is restricted as follows.

Single output networks. Levenberg-Marquardt can only be used on networks with a single output unit.

Small networks. Levenberg-Marquardt has space requirements proportional to the square of the number of weights in the
network. This effectively precludes its use in networks of any great size (more than a few hundred weights).

Sum-squared error function. Levenberg-Marquardt is only defined for the sum squared error function. If you select a
different error function for your network, it will be ignored during Levenberg-Marquardt training. It is usually therefore only
appropriate for regression networks.

Note: Like other iterative algorithms in STATISTICA Neural Networks, Levenberg-Marquardt does not train radial units.
Therefore, you can use it to optimize the non-radial layers of radial basis function networks even if there are a large number
of weights in the radial layer, as those are ignored by Levenberg-Marquardt. This is significant as it is typically the radial
layer that is very large in such networks.

Levenberg-Marquardt works by making the assumption that the underlying function being modeled by the neural network is
linear. Based on this calculation, the minimum can be determined exactly in a single step. The calculated minimum is tested,
and if the error there is lower, the algorithm moves the weights to the new point. This process is repeated iteratively on each
generation. Since the linear assumption is ill-founded, it can easily lead Levenberg-Marquardt to test a point that is inferior
(perhaps even wildly inferior) to the current one. The clever aspect of Levenberg-Marquardt is that the determination of the
new point is actually a compromise between a step in the direction of steepest descent and the above-mentioned leap.
Successful steps are accepted and lead to a strengthening of the linearity assumption (which is approximately true near to a
minimum). Unsuccessful steps are rejected and lead to a more cautious downhill step. Thus, Levenberg-Marquardt
continuously switches its approach and can make very rapid progress.

Note: STATISTICA Neural Networks plots the error value at each tested point as Levenberg-Marquardt proceeds. It may
therefore appear that the error is oscillating wildly as the algorithm tries out poor points. However, the network with the best
training error is automatically retained on each epoch, so this is no cause for alarm.

The algorithm is available from the following dialogs:

Multilayer Perceptron Training

Dot Product Training

Extended Dot Product Training

Technical Details. The Levenberg-Marquardt algorithm is designed specifically to minimize the sum-of-squares error
function, using a formula that (partly) assumes that the underlying function modeled by the network is linear. Close to a
minimum this assumption is approximately true, and the algorithm can make very rapid progress. Further away it may be a
very poor assumption. Levenberg-Marquardt therefore compromises between the linear model and a gradient-descent
approach. A move is only accepted if it improves the error, and if necessary the gradient-descent model is used with a
sufficiently small step to guarantee downhill movement.

Levenberg-Marquardt uses the update formula:

w' = w - (Z^T Z + λI)^-1 Z^T ε

where ε is the vector of case errors, and Z is the matrix of partial derivatives of these errors with respect to the weights:

Z_ij = ∂ε_i / ∂w_j
The first term in the Levenberg-Marquardt formula represents the linearized assumption; the second a gradient-descent step.
The control parameter governs the relative influence of these two approaches. Each time Levenberg-Marquardt succeeds in
lowering the error, it decreases the control parameter by a factor of 10, thus strengthening the linear assumption and
attempting to jump directly to the minimum. Each time it fails to lower the error, it increases the control parameter by a
factor of 10, giving more influence to the gradient descent step, and also making the step size smaller. This is guaranteed to
make downhill progress at some point.
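The accept/reject loop described above can be sketched as follows. This is an illustrative Python sketch, not STATISTICA's implementation; `errors_fn` and `jac_fn` are assumed callbacks returning the case-error vector and its Jacobian Z, and the factor-of-10 schedule follows the description in this section.

```python
import numpy as np

def lm_step(w, errors_fn, jac_fn, lam=0.1, max_tries=20):
    """One Levenberg-Marquardt move.  lam is the control parameter:
    divided by 10 after a successful step (strengthening the linear
    assumption), multiplied by 10 after a rejected one (leaning toward
    a smaller gradient-descent step)."""
    e = errors_fn(w)
    z = jac_fn(w)
    best = e @ e                                     # current sum-squared error
    for _ in range(max_tries):
        step = np.linalg.solve(z.T @ z + lam * np.eye(len(w)), z.T @ e)
        trial = w - step
        e_trial = errors_fn(trial)
        if e_trial @ e_trial < best:                 # accept the move
            return trial, lam / 10.0
        lam *= 10.0                                  # reject; try a more cautious step
    return w, lam
```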

Loss Matrix
If a network is trained so that the outputs estimate probabilities, STATISTICA Neural Networks can be adjusted to support a
loss matrix (Bishop, 1995).

In simple cases, a probability estimate may be used directly: an unknown case is simply assigned to the most-probable class.
Inevitably, this means that sometimes the network can be wrong (and this is unavoidable if data is noisy).

However, some mistakes can be more costly than others. For example, if diagnosing a potentially fatal illness, prescribing
medication to somebody who isn't actually ill may be considered a less grave error than failing to prescribe to somebody
who is.

A loss matrix is a square matrix of coefficients that reflect the relative costs of various misclassifications. It is multiplied by
the vector of probability estimates, resulting in a vector of cost estimates, and the case is assigned to the class with the
lowest cost estimate.

Since a correct classification has zero cost, the leading diagonal of a loss matrix always contains zeros; in other positions,
the coefficient in the n'th column and m'th row represents the cost of misclassifying a case that is actually in the n'th class as
being in the m'th class.
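For example, the cost calculation for a hypothetical three-class loss matrix (the coefficients and probabilities below are purely illustrative):

```python
# Entry [n][m] is the cost of classifying a case that is actually in
# class n as class m; the leading diagonal is zero.
loss = [[0.0, 1.0, 5.0],
        [1.0, 0.0, 1.0],
        [5.0, 1.0, 0.0]]

probs = [0.5, 0.3, 0.2]          # the network's probability estimates

# Expected cost of each assignment: costs[m] = sum over n of probs[n] * loss[n][m]
costs = [sum(p * loss[n][m] for n, p in enumerate(probs)) for m in range(3)]
decision = costs.index(min(costs))   # assign to the lowest-cost class
```

Here the costs work out to [1.3, 0.7, 2.8]: although class 0 has the highest estimated probability, the heavy cost of confusing classes 0 and 2 shifts the decision to class 1.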

You can construct a loss matrix in STATISTICA Neural Networks by creating a linear network with the same number of
input and output units as the original network has output units. The loss matrix network's input variables should be normal
variables, but there should be a single output nominal output variable corresponding to the output of the network to which
the loss matrix is to be added.

To do this, you first need to create some numeric variables in your data set to correspond to the linear network's inputs.

Then, using the Custom Network Designer, create a linear network, specifying those variables as both the independent
variables and the desired output variable as the dependent variable. Click the Edit button in the Custom Network Designer
rather than the OK button to enter the Network Editor directly.

Using the Neural Network Editor - Weights tab, set the Thresholds of the output layer to 0.0, and place the loss matrix
coefficients in the other weights. The leading diagonal of the loss matrix should consist entirely of 0.0s.

Once the loss matrix has been created, you can join it to the trained probability-estimation network in order to produce a
composite loss-estimation network. In this case, you should also change the joined network's Classification Output Type to
Error (Neural Network Editor - Advanced tab).

Note: A loss matrix can be included in STATISTICA Neural Networks' probabilistic networks when they are created. Once a
loss matrix has been created, you should not train the composite network further as this can disrupt the loss coefficients.

Model Profiles
Model profiles are concise text strings indicating the architecture of networks and ensembles. A profile consists of a type
code followed by a code giving the number of input and output variables and number of layers and units (networks) or
members (ensembles). For time series networks, the number of steps and the lookahead factor are also given. The individual
parts of the profile are:

Model Type. The codes are:

MLP Multilayer Perceptron Network
RBF Radial Basis Function Network
SOFM Kohonen Self-Organizing Feature Map
Linear Linear Network
PNN Probabilistic Neural Network
GRNN Generalized Regression Neural Network
PCA Principal Components Network
Cluster Cluster Network
Output Output Ensemble
Conf Confidence Ensemble

Network architecture. This is of the form I:N-N-N:O, where I is the number of input variables, O the number of output
variables, and N the number of units in each layer.

Example. 2:4-6-3:1 indicates a network with 2 input variables, 1 output variable, 4 input neurons, 6 hidden neurons, and 3
output neurons.

For a time series network, the steps factor is prepended to the profile, and signified by an "s."

Example. s10 1:10-2-1:1 indicates a time series network with steps factor (lagged input) 10.

Ensemble architecture. This is of the form I:[N]:O, where I is the number of input variables, O the number of output
variables, and N the number of members of the ensemble.
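A network profile string in this format can be unpacked programmatically. The sketch below is illustrative Python (parse_profile is not a STATISTICA function, and it handles network profiles only, not the [N] ensemble form):

```python
import re

def parse_profile(profile):
    """Parse a network profile of the form 'I:N-N-...:O', optionally
    prefixed with a time series steps factor 'sK '.  Returns a tuple
    (steps, inputs, units_per_layer, outputs)."""
    steps = 1
    m = re.match(r"s(\d+)\s+(.*)", profile)
    if m:                                   # time series profile: strip the steps prefix
        steps, profile = int(m.group(1)), m.group(2)
    inputs, layers, outputs = profile.split(":")
    return steps, int(inputs), [int(u) for u in layers.split("-")], int(outputs)
```

For instance, "2:4-6-3:1" yields (1, 2, [4, 6, 3], 1), matching the example above.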

Model Summary Details


Summary details of models (networks and ensembles) are displayed throughout STATISTICA Neural Networks:

l On the Neural Networks Startup Panel - Networks/Ensembles tab;

l In a list at the top of the Results dialog and the subsidiary results dialogs;

l In the Code Generator dialog.

The summary gives the most pertinent details you need to understand the type, configuration, performance, and training
history of the model.

Some of the details listed below are displayed only in certain contexts - these are marked with an asterisk (*).

Index - a unique identifier assigned when the model is created and preserved throughout its lifetime.


Lock - indicates whether the model is locked to prevent accidental deletion.

S/A - indicates whether a network is standalone. Standalone networks are models in their own right, which can optionally
also be part of an ensemble. Non-standalone networks are members of ensembles that are not considered to be models in
their own right. Not applicable to ensembles.

Refs - the number of references to the network; that is, the number of ensembles that contain the network. A single network
can be a member of several ensembles. Not applicable to ensembles.

Profile - a summary of the model's structure. See Model Profiles.

Perf (Train/Select/Test) - the performance of a network on the subsets used during training; for an ensemble, the weighted
average performance of the ensemble's member networks. The performance measure depends on the type of network output
variable. For continuous variables (regression networks), the performance measure is the Standard Deviation Ratio (see
Regression Summary Statistics). For nominal variables (classification outputs), the performance measure is the proportion of
cases correctly classified (see Classification Statistics). This takes no account of doubt options, and so a classification
network with conservative confidence limits may have a low apparent performance, as many cases are not correctly
classified.

The vast majority of neural networks have only a single output variable, and the performance measures are reported on this
assumption. If you have a network with multiple output variables, the performance measure is with respect to the first output
variable only.

Error (Train/Select/Test) - the error of the network on the subsets used during training. This is less interpretable than the
performance measure, but is the figure actually optimized by the training algorithm (at least, for the training subset). This is
the RMS of the network errors on the individual cases, where the individual errors are generated by the network error
function, which is usually a function of the observed and expected output neuron activation levels (usually sum-squared or a
cross-entropy measure); see Error Function for more details.

Training - contains a concise description of the training algorithms used to optimize the network. It contains a number of
codes followed by the number of epochs for which the algorithm ran (if an iterative algorithm), and an optional terminal
code indicating how the final network was selected. For example, the code CG213b indicates that the Conjugate Gradient
Descent algorithm was used, that the best network discovered during that run was selected (for "best" read "lowest selection
error") and that this network was found on the 213th epoch.

The codes are:

BP Back Propagation
CG Conjugate Gradient Descent
QN Quasi-Newton
LM Levenberg-Marquardt
QP Quick Propagation
DD Delta-Bar-Delta
SS (sub)Sample
KM K-Means (Center Assignment)
EX Explicit (Deviation Assignment)
IS Isotropic (Deviation Assignment)
KN K-Nearest Neighbor (Deviation Assignment)
PI Pseudo-Invert (Linear Least Squares Optimization)
KO Kohonen (Center Assignment)
PN Probabilistic Neural Network Training
GR Generalized Regression Neural Network Training
PC Principal Components Analysis

The terminal codes are:

b Best Network (the network with lowest selection error in the run was restored)
s Stopping Condition (the training run was stopped before the total number of epochs elapsed as a
stopping condition was fulfilled)
c Converged (the algorithm stopped early because it had converged; that is, reached and detected a local
or global minimum. Note that only some algorithms can detect stoppage in a local minimum, and that
this is an advantage not a disadvantage!)

The field is editable, so you can add other information if you want.


Members - a list of the networks in an ensemble. This is not available for networks.

Note - a brief line of text that you can attach to a network for informational purposes.

Inputs - the number of input variables in the model. This is also reported in the profile.

Hidden (1+2) - the number of hidden units in the first and second layers of a network. This is also reported in the profile.
The number of hidden units, together with the number of input variables, defines the complexity of the network. As a
general rule, it is desirable to use networks with as low a complexity as possible, consistent with good performance. Some
network types have no hidden layers (Linear, SOFM), or only one hidden layer (three layer Multilayer Perceptron, Radial
Basis Function) in which case a dash is displayed in one or both of these fields. Not available for ensembles.

Network Sets
STATISTICA Neural Networks stores neural networks and ensembles (cooperating sets of neural networks) in network sets.
Networks and ensembles are referred to collectively as models. A network set can contain any number of models, each with
its own unique life-long index number. Networks and ensembles can be saved to network files that have the extension .snn.

Summary details on the networks and ensembles in the set are displayed on the Neural Networks Startup Panel -
Networks/Ensembles tab. Networks are displayed on the top half of the tab, and ensembles on the bottom half. Details of
networks include the unique model index number, profile (type, number of input and output variables, neurons in each layer,
and time series parameters if a time series network), error and performance measures, and training algorithms used on the
network. Summary details for ensembles are the same as for networks, except that the indices of the member networks are
listed in place of the training algorithm.

You can select a model or models to be used in some analyses (e.g. results generation), and these are highlighted on the
Neural Networks Startup Panel - Networks/Ensembles tab. To select a single model, click on it. To select a contiguous range
of models, click on the first one, then hold down the SHIFT key and click on the last. To add or remove a discontinuous range of
models from the selection, hold down the CTRL key and click on the various models.

You can delete models from the Neural Network Editor - Advanced tab.

Summary details are also available using the Neural Network Editor dialog (accessible from the Neural Networks Startup
Panel - Advanced tab). The Network Editor also is used to delete models, lock them to prevent replacement, clone them,
specify whether networks that are members of ensembles are also available as standalone models, join networks together,
and specify replacement options.

See also:

Network Set Editor

Intelligent Problem Solver

Custom Network Designer

Perceptrons
Perceptrons are a simple form of neural networks. They have no hidden layers, and can only perform linear classification
tasks. Perceptrons were devised by Rosenblatt (1958), and their limitations were criticized by Minsky and Papert (1969),
leading to a loss of interest in the field. Fausett (1994) gives a good history of these early developments.

A perceptron is modeled in STATISTICA Neural Networks by creating a two-layer MLP network, and changing the
activation function of the output layer to Step (in the Network Editor).

The perceptron learning algorithm is modeled by using back propagation with Momentum 0.0 and Shuffle turned Off.

Pseudo-Inverse (Singular Value Decomposition)


This algorithm uses the singular value decomposition technique to calculate the pseudo-inverse of the matrix needed to set
the weights in a linear (dot product synaptic function + identity activation function) output layer, so as to find the least mean
squared solution. Essentially, it guarantees to find the optimal setting for the weights in a linear layer, to minimize the RMS
training set error (Bishop, 1995; Press et al., 1992; Golub and Kahan, 1965). This is the standard least-squares optimization
technique.


Linear techniques are extremely important in optimization, not least because it is possible to find an optimal solution to a
linear model - something that is not guaranteed with nonlinear models, such as other types of neural networks, even if
training algorithms converge.

The pseudo-inverse procedure, in addition to guaranteeing to find the absolute minimum error, is also relatively quick.

Note. STATISTICA Neural Networks' pseudo-inverse option will optimize the last layer in any neural network, providing the
last layer is linear (i.e., has dot product Synaptic function and identity activation function).

Pseudo-inverse is typically used in a number of circumstances:

l To optimize the linear output layer in a radial basis function network, subsequent to center and deviation assignment
using the unsupervised algorithms.

l To optimize the output layer in a linear network. STATISTICA Neural Networks' Linear networks implement a simple
matrix-multiplication plus bias vector linear model, and are ideal for solving simple problems, and for bench-marking the
performance of the more complex network models.

l To fine-tune the final layer in a multilayer perceptron with a linear output layer, as used in regression problems.

This algorithm is available from the following dialogs:

Linear Training

Radial Basis Function Training

Dot Product Training

Technical Details. The matrix G is calculated, whose i, j'th element is the input of the i'th output unit, when the j'th case is
executed.

The least-squares solution is then given by:

w = G+ d

G+ = (G^T G)^-1 G^T is the pseudo-inverse matrix.

w is the weight vector into an output unit.

d is the desired response vector (training outputs) for that output.

G+ is calculated using the singular value decomposition algorithm.
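Outside STATISTICA, the same least-squares solution can be computed with any SVD-based pseudo-inverse routine. For example, in Python, NumPy's pinv is computed via the singular value decomposition:

```python
import numpy as np

def pinv_weights(G, d):
    """Least-squares weights for a linear output layer: w = G+ d,
    with the pseudo-inverse G+ obtained from the SVD of G."""
    return np.linalg.pinv(G) @ d
```

For instance, with a bias column of ones in G, this recovers the exact intercept and slope of noiseless linear data.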

Caution. The singular value decomposition algorithm is usually numerically stable; however, occasionally a badly behaved
matrix can cause it to generate mathematical errors. If this occurs, follow the steps below:

1. Check that the training cases and (in the case of a radial basis function network) centers and deviations have been
sensibly assigned.

2. In particular, the algorithm performs badly if radial deviations are very high (i.e., the standard deviations of Gaussians are
very small). It may be necessary to increase the number of neighbors if assigning radial deviation using K-nearest
neighbors, or to increase the Deviation multiplier if using Isotropic deviation assignment.

3. If training cases, centers and deviations are all sensible, and the algorithm still fails, use Conjugate Gradient Descent to
set the weights in the linear layer. Although typically slower than pseudo-inverse, this algorithm does not generate
arithmetic errors and is guaranteed to find the minimum, as there are no local minima in this case.

Quasi-Newton
Quasi-Newton (Bishop, 1995; Shepherd, 1997) is an advanced method of training multilayer perceptrons. It usually
performs significantly better than Back Propagation, and can be used wherever back propagation can be. It is the
recommended technique for most networks with a small number of weights (less than a couple of hundred). If the network is
a single output regression network and the problem has low residuals, then Levenberg-Marquardt may perform better.

Quasi-Newton is a batch update algorithm: whereas back propagation adjusts the network weights after each case, Quasi-Newton
works out the average gradient of the error surface across all cases before updating the weights once at the end of
the epoch.

For this reason, there is no shuffle option available with Quasi-Newton, since it would clearly serve no useful function.
There is also no need to select learning or momentum rates for Quasi-Newton, so it can be much easier to use than back
propagation. Additive noise would destroy the assumptions made by Quasi-Newton about the shape of search space, and so
is also not available.

Quasi-Newton works by exploiting the observation that, on a quadratic (i.e. parabolic) error surface, one can step directly to
the minimum using the Newton step - a calculation involving the Hessian matrix (the matrix of second partial derivatives of
the error surface). Any error surface is approximately quadratic "close to" a minimum. Since, unfortunately, the Hessian
matrix is difficult and expensive to calculate, and anyway the Newton step is likely to be wrong on a non-quadratic surface,
Quasi-Newton iteratively builds up an approximation to the inverse Hessian. The approximation at first follows the line of
steepest descent, and later follows the estimated Hessian more closely.

Quasi-Newton is the most popular algorithm in nonlinear optimization, with a reputation for fast convergence. It does,
however, have some drawbacks: it is rather less numerically stable than, say, Conjugate Gradient Descent, it may be inclined
to converge to local minima, and its memory requirements are proportional to the square of the number of weights in the
network.

It is often beneficial to precede Quasi-Newton training with a short burst of Back Propagation (say 100 epochs), to cut down
on problems with local minima.

If the network has many weights, you are advised to use Conjugate Gradient Descent instead. Conjugate Gradient Descent
has memory requirements proportional only to the number of weights, not the square of the number of weights, and the
training time is usually comparable with Quasi-Newton, if somewhat slower.

The algorithm is available from the following analyses:

Multilayer Perceptron Training

Dot Product Training

Extended Dot Product Training

Technical Details. Quasi-Newton is batch-based; it calculates the error gradient as the sum of the error gradients on each
training case.

It maintains an approximation to the inverse Hessian matrix, called H below. The error gradient is called g below, and the weight vector on the ith epoch is referred to as wi below. H is initialized to the identity matrix, so that the first step is taken in the direction of steepest descent, -g (i.e. the same direction as that chosen by Back Propagation). On each epoch, a back tracking line search is performed in the direction:

d = – Hg

Subsequently, the search direction is updated using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) formula. Writing s for the change in the weight vector over the epoch, y for the corresponding change in the gradient, ρ = 1/(y's), and ' for transpose, the inverse Hessian approximation is updated as:

H ← (I – ρsy')H(I – ρys') + ρss'
This is "guaranteed" to maintain a positive-definite approximation (i.e. it will always indicate a descent direction), and to
converge to the true inverse Hessian in W steps, where W is the number of weights, on a quadratic error surface. In practice,
numerical errors may violate these theoretical guarantees and lead to divergence of weights or other modes of failure. In this
case, run the algorithm again, or choose a different training algorithm.
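
The iteration described above can be sketched as follows (a minimal NumPy illustration, not the STATISTICA implementation; the Armijo-style backtracking line search and the convergence tolerance are assumptions):

```python
import numpy as np

def bfgs_minimize(error, grad, w, n_epochs=50, tol=1e-8):
    """Minimal Quasi-Newton (BFGS) sketch: H approximates the inverse
    Hessian and is initialized to the identity, so the first step
    follows the line of steepest descent."""
    H = np.eye(len(w))
    g = grad(w)
    for _ in range(n_epochs):
        d = -H @ g                      # search direction d = -Hg
        step = 1.0                      # back tracking line search
        while error(w + step * d) > error(w) + 1e-4 * step * (g @ d):
            step *= 0.5
        s = step * d                    # change in the weight vector
        w = w + s
        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            return w
        y = g_new - g                   # change in the gradient
        rho = 1.0 / (s @ y)             # BFGS update of H
        I = np.eye(len(w))
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        g = g_new
    return w

# quadratic error surface E(w) = 0.5 w'Aw - b'w, with minimum at A^-1 b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
w_min = bfgs_minimize(lambda w: 0.5 * w @ A @ w - b @ w,
                      lambda w: A @ w - b, np.zeros(2))
```

On the quadratic surface above the loop steps to the minimum within a handful of epochs; on real error surfaces the early iterations behave like steepest descent, as described.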

Quick Propagation
Despite the name, quick propagation (Fahlman, 1988; Patterson, 1996) is not necessarily faster than back propagation,
although it may prove significantly faster for some applications.

Quick propagation also sometimes seems more inclined than back propagation to instability and to getting stuck in local minima; these tendencies may determine whether quick propagation is appropriate for a particular problem.

Quick propagation is a batch update algorithm: whereas back propagation adjusts the network weights after each case, quick propagation works out the average gradient of the error surface across all cases before updating the weights once at the end of the epoch.

For this reason, there is no shuffle option available with quick propagation, since it would clearly serve no useful function.

Quick propagation works by making the (typically ill-founded) assumption that the error surface is locally quadratic, with
the axes of the hyper-ellipsoid error surface aligned with the weights. If this is true, then the minimum of the error surface
can be found after only a couple of epochs. Of course, the assumption is not generally valid, but if it is even close to true, the
algorithm can converge on the minimum very rapidly.

Based on this assumption, quick propagation works as follows:

On the first epoch, the weights are adjusted using the same rule as back propagation, based upon the local gradient and the
learning rate.

On subsequent epochs, the quadratic assumption is used to attempt to move directly to the minimum.

The basic quick propagation formula suffers from a number of numerical problems. First, if the error surface is not concave,
the algorithm can actually go the wrong way. Second, if the gradient changes little or not at all, the calculated weight change
can be extremely large, or even infinite. Finally, if a zero gradient is encountered, a weight will stop changing permanently.

The version of the algorithm supported in STATISTICA Neural Networks has a number of features to deal with these
problems:

l If the algorithm is inclined to go the wrong way or to change the step size too quickly, the rate of change is limited using
an acceleration factor.

l If a previously zero gradient becomes non-zero, the algorithm restarts for that weight.

The algorithm is available from the following analyses:

Multilayer Perceptron Training

Dot Product Training

Extended Dot Product Training

Technical Details. Quick propagation is batch-based; it calculates the error gradient as the sum of the error gradients on
each training case.

On the first epoch, quick propagation updates weights just like back propagation.

Subsequently, weight changes are calculated using the quick propagation formula, in which s(t) is the error gradient for a weight on epoch t and Δw(t) is the weight change:

Δw(t) = Δw(t-1) × s(t) / (s(t-1) – s(t))

This formula is numerically unstable if s(t) is very close to, equal to, or greater than s(t-1). Since s(t) is discovered after a
move along the direction of the gradient, such conditions can only occur if the slope becomes constant, or becomes steeper
(i.e., it is not concave).

In these cases, the weight update formula is:

Δw(t) = a × Δw(t-1)

where a is the acceleration coefficient, which limits the growth of the step size.

If the gradient becomes zero, then the weight delta becomes zero, and by the above formulae remains zero permanently even
if the gradient subsequently changes. A conventional approach to solve this problem is to add a small factor to the weight
changes calculated above. However, this approach can cause numerical instability. STATISTICA Neural Networks instead
monitors the gradient. If a previously zero gradient becomes significantly non-zero, the update for that weight is reset to the
negative gradient.
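
The per-weight update logic, including the safeguards described above, can be sketched as follows (an illustration only, not the STATISTICA implementation; the default coefficient values are assumptions):

```python
import numpy as np

def quickprop_step(g_now, g_prev, delta_prev, learning_rate=0.1, accel=1.75):
    """One quick propagation update per weight. g_now/g_prev are the
    batch error gradients s(t) and s(t-1); delta_prev holds the
    previous weight changes."""
    delta = np.empty_like(g_now)
    for i, (s_t, s_prev, d_prev) in enumerate(zip(g_now, g_prev, delta_prev)):
        if d_prev == 0.0:
            # first epoch, or restart after a previously zero gradient:
            # fall back to the back propagation rule
            delta[i] = -learning_rate * s_t
        else:
            ratio = s_t / (s_prev - s_t) if abs(s_prev - s_t) > 1e-12 else np.inf
            if ratio > accel or ratio < 0.0:
                # wrong way, or step growing too quickly: limit the
                # rate of change with the acceleration factor
                delta[i] = accel * d_prev
            else:
                # quadratic assumption: step toward the parabola minimum
                delta[i] = ratio * d_prev
    return delta

# drive a single weight down the quadratic error surface E(w) = w^2
w = np.array([1.0])
g_prev = np.array([0.0])
d_prev = np.array([0.0])
for _ in range(20):
    g_now = 2.0 * w                      # gradient of E
    d = quickprop_step(g_now, g_prev, d_prev)
    w, g_prev, d_prev = w + d, g_now, d
```

Because the demonstration surface really is quadratic, the weight reaches the minimum after only a few epochs, as the quadratic assumption predicts.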

Radial Sampling


Radial sampling is a simple technique to assign centers to radial units in the first hidden layer of a network by randomly
sampling training cases and copying those to the centers. This is a reasonable approach if the training data are distributed in
a representative manner for the problem (Lowe, 1989).

The number of training cases must at least equal the number of centers to be assigned.
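
In outline, the technique is just a random draw (a sketch; sampling without replacement is an assumption):

```python
import numpy as np

def radial_sample_centers(training_cases, n_centers, seed=None):
    """Assign radial unit centers by randomly sampling training cases
    and copying them to the centers."""
    rng = np.random.default_rng(seed)
    cases = np.asarray(training_cases)
    if len(cases) < n_centers:
        raise ValueError("need at least as many training cases as centers")
    idx = rng.choice(len(cases), size=n_centers, replace=False)
    return cases[idx].copy()

# e.g. 100 two-dimensional training cases, 10 centers
data = np.random.default_rng(0).normal(size=(100, 2))
centers = radial_sample_centers(data, 10, seed=1)
```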

This algorithm is available from the following dialogs:

Radial Basis Function Training

Cluster Network Training

Radial Training

Receiver Operating Characteristic (ROC) Curve


When a neural network is used for classification, confidence levels (the Accept and Reject thresholds, available from the
Pre/Post Processing Editor) determine how the neural network assigns input cases to classes.

In the case of two-class classification problems, by default the output class is indicated by a single output neuron, with high
output corresponding to one class and low output to the other. If the Reject threshold is strictly less than the Accept
threshold, then the network may include a "doubt" option, where it is not sure of the class if the output lies between the
Reject and Accept thresholds.

An alternative approach is to set the Accept and Reject thresholds equal. In this case, as the single decision threshold is
adjusted, so the classification behavior of the network changes. At one extreme all cases will be assigned to one class, and at
the other extreme to the other. In between these extremes, different compromises may be found, leading to different trade-
offs between the rate of erroneous assignment to each class (i.e. false-positives and false-negatives).

A Receiver Operating Characteristic curve (Zweig, 1993) summarizes the performance of a two-class classifier across the
range of possible thresholds. It plots the sensitivity (the class two true-positive rate) against one minus the specificity (the
class one false-positive rate). An ideal classifier hugs the left side and top side of the graph, and the area under the curve is 1.0. A random
classifier should achieve approximately 0.5 (a classifier with an area less than 0.5 can be improved simply by flipping the
class assignment). The ROC curve is recommended for comparing classifiers, as it does not merely summarize performance
at a single arbitrarily selected decision threshold, but across all possible decision thresholds.

The ROC curve can be used to select an optimum decision threshold. This threshold (which equalizes the probability of
misclassification of either class; i.e. the probability of false-positives and false-negatives) can be used to automatically set
confidence thresholds in STATISTICA Neural Networks, in classification networks with a nominal output variable with the
Two-state conversion function. This threshold optimization is controlled from the training dialog for the network type.
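
The construction of the curve can be sketched as follows (an illustration, not the STATISTICA implementation):

```python
import numpy as np

def roc_curve(outputs, targets):
    """Sweep the single decision threshold over the network outputs and
    record (1 - specificity, sensitivity) at each setting. `targets`
    holds 0/1 class labels, `outputs` the network's output levels."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=int)
    thresholds = np.r_[np.inf, np.sort(np.unique(outputs))[::-1]]
    n_pos, n_neg = (targets == 1).sum(), (targets == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        assigned = outputs >= t          # cases assigned to class two
        tpr.append((assigned & (targets == 1)).sum() / n_pos)
        fpr.append((assigned & (targets == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def area_under_curve(fpr, tpr):
    """Trapezoid-rule area: 1.0 for an ideal classifier, about 0.5 for
    a random one."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# a classifier whose outputs separate the classes perfectly
fpr, tpr = roc_curve([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```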

Regression
In regression problems the purpose is to predict the value of a continuous output variable. Regression problems can be
tackled in STATISTICA Neural Networks using multilayer perceptrons, radial basis function networks, generalized
regression (Bayesian) networks, and linear networks.

Output Scaling. Multilayer perceptrons in STATISTICA Neural Networks include Minimax scaling of both input and output
variables. When the network is trained, STATISTICA Neural Networks determines shift and scale coefficients for each
variable, based on the minimum and maximum values in the training set, and transforms the data by multiplying by the scale
factor and adding the shift factor.

The net effect is that a 0.0 output activation level in the network is translated into the minimum value encountered in the
training data, and a 1.0 activation level is translated into the maximum training data value. Consequently, the network is able
to interpolate between the values represented in the training data. However, extrapolation outside the range encountered in
the training set is more circumscribed. Two approaches to encoding the output are available, each of which allows a certain
amount of extrapolation.

l A logistic activation function is used for the output, with scaling factors determined so that the range encountered in the
training set is mapped to a restricted part of the logistic function's (0,1) range (e.g. to [0.05, 0.95]). This allows a small
amount of extrapolation (significant extrapolation from data is usually unjustified anyway). Using the logistic function
makes training stable.

l Uses an identity activation function in the final layer of the network. This supports a substantial amount of extrapolation,
although not unlimited (the hidden units will saturate eventually). As a bonus, the final layer can be "fine-tuned" after
iterative training using the pseudo-inverse technique. However, iterative training tends to be less stable than with a
nonlinear activation function, and the learning rate must be carefully chosen to avoid weight divergence during training
(i.e. less than 0.1), if using an algorithm such as back propagation.
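
The Minimax scaling described above can be sketched as follows (a minimal illustration; the restricted [0.05, 0.95] range for the logistic case is taken from the text):

```python
import numpy as np

def minimax_coefficients(train_values, lo=0.0, hi=1.0):
    """Determine shift and scale so that the training-set minimum maps
    to `lo` and the maximum to `hi` (use lo=0.05, hi=0.95 to restrict
    a logistic output to part of its (0, 1) range)."""
    v_min, v_max = float(np.min(train_values)), float(np.max(train_values))
    scale = (hi - lo) / (v_max - v_min)
    shift = lo - v_min * scale
    return scale, shift

train = np.array([10.0, 15.0, 20.0])
scale, shift = minimax_coefficients(train)
scaled = train * scale + shift        # data -> activation: [0.0, 0.5, 1.0]
recovered = (scaled - shift) / scale  # activation -> data units
```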

Outliers. Regression networks can be particularly prone to problems with outlying data. The use of the sum-squared
network error function means that points lying far from the others have a disproportionate influence on the position of the
hyperplanes used in regression. If these points are actually anomalies (for example, spurious points generated by the failure
of measuring devices) they can substantially degrade the network's performance.

One approach to this problem is to train the network, test it on the training cases, isolate those that have extremely high error
values and remove them, then to retrain the network. You can do this in STATISTICA Neural Networks by excluding the
offending cases from selection using the Select Cases dialog (accessible by clicking the Select cases button on the Results
dialogs).

If you believe the outlier is caused by a suspicious value for one of the variables in that case, you can delete that particular
value, at which point the case is treated as having a missing value (see Missing Values, below).

Another approach is to use the city-block error function. Rather than summing the squared-differences in each variable to
work out an error measure, this simply sums the absolute differences. Removing the square function makes training far less
sensitive to outliers.

Whereas the amount of "pull" a case has on a hyperplane is proportional to the distance of the point from the hyperplane
under the sum-squared error function, with the city-block error function the pull is the same for all points, and the direction
of pull simply depends on which side of the hyperplane the point lies. Effectively, the sum-squared error function attempts
to find the mean, whereas the city-block error function attempts to find the median.
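
This mean-versus-median behavior is easy to demonstrate for the simplest possible model, a constant prediction (an illustrative sketch):

```python
import numpy as np

# For a constant model y = c, the sum-squared error is minimized by the
# mean of the targets, and the city-block (absolute) error by the
# median -- so a single outlier drags the squared-error fit much further.
targets = np.array([1.0, 2.0, 3.0, 100.0])   # 100.0 is an outlier

c = np.linspace(-10.0, 110.0, 120001)
sum_squared = ((targets[:, None] - c) ** 2).sum(axis=0)
city_block = np.abs(targets[:, None] - c).sum(axis=0)

best_sse = c[sum_squared.argmin()]   # near the mean, 26.5
best_cb = c[city_block.argmin()]     # within the median interval [2, 3]
```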

Missing Values. It is not uncommon to come across situations where the data for some cases has some values missing;
perhaps because data was unavailable, or corrupted, when gathered. In such cases, you may still need to execute a network
(to get the best estimate possible given the information available) or (and this is more suspect) use the partially complete
data in training because of an acute shortage of training data.

STATISTICA Neural Networks has special facilities to handle missing values. If a case with missing values is to be used,
some value must be substituted in place of the missing value. STATISTICA Neural Networks can select an appropriate
substitution value using a variety of methods. The substitution values are derived from the training set, and may be (for
example) the mean of the existing values for the variable in the training set. The missing value substitution method can be
selected for each variable in the Network Editor when a network is created, and the substitution values are actually
calculated when the network is trained.

Where possible, it is usually good practice not to use variables containing a great many missing values. Cases with missing
values can be excluded.
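
Mean substitution, one of the methods mentioned above, can be sketched as follows (NaN marks a missing value; the interface is illustrative, not the STATISTICA one):

```python
import numpy as np

def fit_substitution_value(train_column):
    """Derive the substitution value for one variable from the training
    set: here, the mean of the values that are present."""
    return float(np.nanmean(np.asarray(train_column, dtype=float)))

def substitute_missing(column, value):
    """Replace missing entries with the substitution value before the
    case is presented to the network."""
    col = np.asarray(column, dtype=float).copy()
    col[np.isnan(col)] = value
    return col

train = [2.0, np.nan, 4.0, 6.0]
value = fit_substitution_value(train)     # mean of 2, 4, 6 -> 4.0
filled = substitute_missing(train, value)
```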

Regression Summary Statistics


In regression problems, the purpose of the neural network is to learn a mapping from the input variables to a continuous
output variable, or variables.

A network is successful at regression if it makes predictions more accurate than a simple estimate.

The simplest way to construct an estimate, given training data, is to calculate the mean of the training data, and use that
mean as the predicted value for all previously unseen cases.

The average expected error from this procedure is the standard deviation of the training data. The aim in using a regression
network is therefore to produce an estimate that has a lower prediction error standard deviation than the training data
standard deviation.

STATISTICA Neural Networks automatically calculates the mean and standard deviation of the training and selection
subsets, when the entire data set is run. It also calculates the mean and standard deviations of the prediction errors.

The ratio of the prediction to data standard deviations is displayed; if this is 1.0, then the network does no better than a
simple average. A lower ratio indicates a better estimate.

In addition, STATISTICA Neural Networks displays the standard Pearson-R correlation coefficient between the actual and
predicted outputs. A perfect prediction will have a correlation coefficient of 1.0; however, a correlation of 1.0 does not
necessarily indicate a perfect prediction (only a prediction that is perfectly linearly correlated with the actual outputs),
although in practice the correlation coefficient is a good indicator of performance. It also provides a simple and familiar way
to compare the performance of your neural networks with standard least squares linear fitting procedures.


Note. It is standard practice in linear modeling to report the correlation between the independent and dependent variable.
This is actually equal (discarding the possible minus sign in a negative correlation) to the correlation between the prediction
of a linear model using that independent variable as an input, and the dependent variable.

The regression statistics are:

Data Mean. Average value of the target output variable.

Data S.D. Standard deviation of the target output variable.

Error Mean. Average error (residual between target and actual output values) of the output variable.

Abs. E. Mean. Average absolute error (difference between target and actual output values) of the output variable.

Error S.D. Standard deviation of errors for the output variable.

S.D. Ratio. The error:data standard deviation ratio.

Correlation. The standard Pearson-R correlation coefficient between the predicted and observed output values.

The degree of predictive accuracy needed varies from application to application. However, generally an s.d. ratio of 0.1 or
lower indicates very good regression performance.
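
The statistics above can be computed directly from the targets and predictions (a sketch; the use of sample standard deviations is an assumption):

```python
import numpy as np

def regression_summary(targets, predictions):
    """Compute the regression summary statistics listed above for one
    output variable."""
    t = np.asarray(targets, dtype=float)
    p = np.asarray(predictions, dtype=float)
    errors = t - p
    stats = {
        "Data Mean": t.mean(),
        "Data S.D.": t.std(ddof=1),
        "Error Mean": errors.mean(),
        "Abs. E. Mean": np.abs(errors).mean(),
        "Error S.D.": errors.std(ddof=1),
        "Correlation": np.corrcoef(t, p)[0, 1],
    }
    stats["S.D. Ratio"] = stats["Error S.D."] / stats["Data S.D."]
    return stats

stats = regression_summary([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
# an S.D. ratio well below 1.0: better than predicting the mean
```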

Resampling
A major problem with neural networks is the generalization issue (the tendency to overfit the training data), accompanied by
the difficulty in quantifying likely performance on new data.

This difficulty can be disturbing if you are accustomed to the relative security of linear modeling, where a given set of data
generates a single "optimal" linear model. However, this security may be somewhat deceptive: if the underlying function
is not linear, the model may be very far from optimal.

In contrast, in nonlinear modeling some choice must be made about the complexity (curvature, eccentricity) of the model,
and this can lead to a plethora of alternative models. Given this diversity, it is important to have ways to estimate the
performance of the models on new data, and to be able to select among them.

Most work on assessing performance in neural modeling concentrates on approaches to resampling. A neural network is
optimized using a training subset. Often, a separate subset (the selection subset) is used to halt training to mitigate over-
learning, or to select from a number of models trained with different parameters. Then, a third subset (the test subset) is used
to perform an unbiased estimation of the network's likely performance.

Although the use of a test set allows us to generate unbiased performance estimates, these estimates may exhibit high
variance. Ideally, we would like to repeat the training procedure a number of different times, each time using new training,
selection and test cases drawn from the population - then, we could average the performance prediction over the different
test subsets, to get a more reliable indicator of generalization performance.

In reality, we seldom have enough data to perform a number of training runs with entirely separate training, selection and
test subsets. However, intuitively we might think we can do better if we train multiple networks, as when a single network is
trained, only part of the data is actually involved in training. Can we find a way to use all the data in training, selection and
test?

Cross validation is the simplest resampling technique. Suppose that we decide to conduct ten experiments with a given
data set. We divide the data set into ten equal parts. Then, for each experiment we select one part to act as the test set. The
other nine tenths of the data set are used for training and selection. When the ten experiments are finished, we can average
the test set performances of the individual networks.

Cross validation has some obvious advantages. If training a single network, we would probably reserve 25% of the data for
test. By using cross validation, we can reduce the individual test set size. In the most extreme version, leave-one-out cross
validation, we perform a number of experiments equal to the size of the data set. On each experiment a single case is placed
in the test subset, and the rest of the data is used for training. Clearly this may require a substantial number of experiments if
the data set is large, but it can give you a very accurate estimate of generalization performance.
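
The division into folds can be sketched as follows (an illustration of the bookkeeping only):

```python
import numpy as np

def cross_validation_folds(n_cases, n_folds=10, seed=0):
    """Divide case indices into `n_folds` (near-)equal parts; for each
    experiment, one part is the test set and the remainder is used for
    training and selection."""
    order = np.random.default_rng(seed).permutation(n_cases)
    parts = np.array_split(order, n_folds)
    for k in range(n_folds):
        test = parts[k]
        rest = np.concatenate([parts[j] for j in range(n_folds) if j != k])
        yield rest, test

# ten experiments over 25 cases: every case lands in a test set exactly once
tests = [test for _, test in cross_validation_folds(25)]
```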

What precisely does cross validation tell us? In cross validation, each of the set of experiments should be performed with the
same process parameters (same training algorithms, number of epochs, learning rates, etc.). The averaged performance
measure is then an estimate of the performance on new data (drawn from the same distribution as the training data) of a
single network trained using the same procedure (including the networks actually generated in the cross validation procedure).

We could select one of the cross validated networks at random and deploy it, using the estimates generated in cross
validation to characterize its expected performance. However, this seems intuitively wasteful - having generated a number of
networks, why not use them all? We can form the networks into an ensemble, and make predictions by averaging or voting
across the resampled member networks (ensembles can also usefully combine the predictions of networks trained using
different parameters, or of different architectures).

If we form an ensemble from the cross validated networks, is the performance estimate formed by averaging the test set
performance of the individual networks an unbiased estimate of generalization performance?

The answer is: no. The expected performance of an ensemble is not, in general, the same as the average performance of the
members. Actually, the expected performance of the ensemble is at least the average performance of the members, but
usually better. Thus you can use the estimate so-formed, knowing that it is conservatively biased.

There is a way to address the problem of biased ensemble performance estimates - we could use two-fold cross validation,
repeating the entire ensemble-building process multiple times with different samples held back for testing the ensemble.
However, this procedure is excessively computationally intensive, and is not supported in STATISTICA Neural Networks.

Cross validation is one technique for resampling data; there are others. STATISTICA Neural Networks supports two of these:

l Random (Monte Carlo) resampling - the subsets are randomly sampled from the available cases. Each available case is
assigned to one of the three subsets.

l Bootstrapping - this technique (Efron, 1979) samples a data set with replacement (i.e. a single case may be randomly
sampled several times into the bootstrap set). The bootstrap can be applied any number of times, for increased accuracy.
Compared with random sampling, the use of sampling with replacement can help to iron out generalization problems
caused by the finite size of the data set. Breiman (1996) suggested using the bootstrap sampling technique to train
multiple models for ensemble averaging (in his case the models were decision trees, but the conclusions carry over to
other models), a technique he refers to as bagging.
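
Bootstrap sampling itself is a one-line draw (a sketch; reporting the left-out cases is added here for illustration):

```python
import numpy as np

def bootstrap_sample(n_cases, seed=None):
    """Sample case indices with replacement: a single case may be drawn
    several times into the bootstrap set, while other cases are left
    out entirely. As in bagging, each resampled set can train one
    member of an ensemble."""
    rng = np.random.default_rng(seed)
    in_bag = rng.integers(0, n_cases, size=n_cases)
    left_out = np.setdiff1d(np.arange(n_cases), in_bag)
    return in_bag, left_out

in_bag, left_out = bootstrap_sample(100, seed=0)
```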

Sensitivity Analysis
STATISTICA Neural Networks can conduct a sensitivity analysis on the inputs to a neural network. This indicates which
input variables are considered most important by that particular neural network. Sensitivity analysis can be used purely for
informative purposes (see Results dialog), or to perform input pruning (see Network Editor, or any training dialog).

Sensitivity analysis can give important insights into the usefulness of individual variables. It often identifies variables that
can be safely ignored in subsequent analyses, and key variables that must always be retained. However, it must be deployed
with some care, for reasons that are explained below.

Input variables are not, in general, independent - that is, there are interdependencies between variables. Sensitivity analysis
rates variables according to the deterioration in modeling performance that occurs if that variable is no longer available to
the model. In so doing, it assigns a single rating value to each variable. However, the interdependence between variables
means that no scheme of single ratings per variable can ever reflect the subtlety of the true situation.

Consider, for example, the case where two input variables encode the same information (they might even be copies of the
same variable). A particular model might depend wholly on one, wholly on the other, or on some arbitrary combination of
them. Then sensitivity analysis produces an arbitrary relative sensitivity to them. Moreover, if either is eliminated the model
may compensate adequately because the other still provides the key information. It may therefore rate the variables as of low
sensitivity, even though they might encode key information. Similarly, a variable that encodes relatively unimportant
information, but is the only variable to do so, may have higher sensitivity than any number of variables that mutually encode
more important information.

There may be interdependent variables that are useful only if included as a set. If the entire set is included in a model, they
can be accorded significant sensitivity, but this does not reveal the interdependency. Worse, if only part of the
interdependent set is included, their sensitivity will be zero, as they carry no discernible information.

In summary, sensitivity analysis does not rate the "usefulness" of variables in modeling in a reliable or absolute manner. You
must be cautious in the conclusions you draw about the importance of variables. Nonetheless, in practice it is extremely
useful. If a number of models are studied, it is often possible to identify key variables that are always of high sensitivity,
others that are always of low sensitivity, and "ambiguous" variables that change ratings and probably carry mutually
redundant information.

How does sensitivity analysis work? STATISTICA Neural Networks conducts sensitivity analysis by treating each input
variable in turn as if it were "unavailable" (Hunter, 2000). Every model in STATISTICA Neural Networks has a defined
missing value substitution procedure, which is used to allow predictions to be made in the absence of values for one or more
inputs. To define the sensitivity of a particular variable, v, we first run the network on a set of test cases, and accumulate the
network error. We then run the network again using the same cases, but this time replacing the observed values of v with the
value estimated by the missing value procedure, and again accumulate the network error.

Given that we have effectively removed some information that presumably the network uses (i.e. one of its input variables),
we would reasonably expect some deterioration in error to occur. The basic measure of sensitivity is the ratio of the error
with missing value substitution to the original error. The more sensitive the network is to a particular input, the greater the
deterioration we can expect, and therefore the greater the ratio.

If the ratio is one or lower, then making the variable "unavailable" either has no effect on the performance of the network, or
actually enhances it (!).

Once sensitivities have been calculated for all variables, they may be ranked in order. STATISTICA Neural Networks also
provides these rankings, for convenience in interpreting the sensitivities.
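
The ratio-of-errors procedure can be sketched as follows (an illustration with a hypothetical model; mean substitution and RMS error are assumptions):

```python
import numpy as np

def sensitivity_ratios(predict, X, targets, substitutes):
    """For each input variable: replace its observed values with the
    missing-value estimate, re-run the model, and report the ratio of
    the new error to the baseline error (higher = more sensitive)."""
    def rms(data):
        return float(np.sqrt(np.mean((predict(data) - targets) ** 2)))
    baseline = rms(X)
    ratios = []
    for v in range(X.shape[1]):
        X_v = X.copy()
        X_v[:, v] = substitutes[v]   # treat variable v as "unavailable"
        ratios.append(rms(X_v) / baseline)
    return np.array(ratios)

# toy model depending strongly on input 0 and not at all on input 1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
targets = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=50)
model = lambda data: 3.0 * data[:, 0]
ratios = sensitivity_ratios(model, X, targets, substitutes=X.mean(axis=0))
```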

Stopping Conditions
The iterative gradient-descent training algorithms in STATISTICA Neural Networks (back propagation, Quasi-Newton,
conjugate gradient descent, Levenberg-Marquardt, quick propagation, Delta-bar-Delta, and Kohonen) all attempt to reduce
the training error on each epoch.

You specify a maximum number of epochs for these iterative algorithms. However, you can also define stopping conditions
that may cause training to terminate earlier.

Specifically, training may be stopped when:

l the error drops below a given level;

l the error fails to improve by a given amount over a given number of epochs.

If several stopping conditions are specified, training ceases as soon as any one of them is satisfied. In particular, a maximum
number of epochs must always be specified.

The error-based stopping conditions can also be specified independently for the error on the training set and the error on the
selection set (if any).

Note: You can always abort training prematurely by clicking the Finish button on the Training in Progress dialog.

Stopping conditions are available on any of the training dialogs that support iterative algorithms, usually on the End tab.

Target Error. You can specify a target error level, for the training subset, the selection subset, or both. If the RMS falls
below this level, training ceases. This refers to the normalized RMS error, as displayed on the progress graph.

Minimum Improvement. Specifies that the RMS error on the training subset, the selection subset, or both must improve by
at least this amount, or training will cease (if the Window parameter is non-zero).

Sometimes error improvement may slow down for a while or even rise temporarily (particularly if the shuffle option is used
with back propagation, or non-zero noise is specified, as these both introduce an element of noise into the training process).

To prevent this option from aborting the run prematurely, specify a longer Window.

It is particularly recommended to monitor the selection error for minimum improvement, as this helps to prevent over-
learning.

Specify a negative improvement threshold if you want to stop training only when a significant deterioration in the error is
detected. The algorithm will stop when a number of epochs pass during which the error is always the given amount
worse than the best it ever achieved.

Window. The window factor is the number of epochs across which the error must fail to improve by the specified amount,
before the algorithm is deemed to have slowed down too much and is stopped.

By default the window is zero, which means that the minimum improvement stopping condition is not used at all.
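
The interaction of the target error, minimum improvement, and window settings can be sketched as follows (illustrative logic only):

```python
def should_stop(errors, target=None, min_improvement=0.0, window=0):
    """Check the stopping conditions for one error stream (training or
    selection RMS error, one value per epoch). Training stops when the
    target error is reached, or when the error has failed to improve
    by `min_improvement` over the last `window` epochs (a window of
    zero disables the improvement test)."""
    if target is not None and errors[-1] <= target:
        return True
    if window and len(errors) > window:
        best_before = min(errors[:-window])
        best_recent = min(errors[-window:])
        if best_before - best_recent < min_improvement:
            return True   # improvement has slowed down too much
    return False

history = [1.0, 0.6, 0.5, 0.499, 0.4985, 0.498]
```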

Synaptic Functions


STATISTICA Neural Networks supports two main synaptic functions.

Dot product. Dot product units perform a weighted sum of their inputs, minus the threshold value. In vector terminology,
this is the dot product of the weight vector with the input vector, minus the threshold (equivalently, plus a bias term). Dot
product units have equal output values along hyperplanes in pattern space. They attempt to perform classification by
dividing pattern space into sections using intersecting hyperplanes.

Radial. Radial units calculate the square of the distance between the two points in N dimensional space (where N is the
number of inputs) represented by the input pattern vector and the unit's weight vector. Radial units have equal output values
lying on hyperspheres in pattern space. They attempt to perform classification by measuring the distance of normalized cases
from exemplar points in pattern space (the exemplars being stored by the units). The squared distance is multiplied by the
threshold (which is, therefore, actually a deviation value in radial units) to produce the post synaptic value of the unit (which
is then passed to the unit's activation function).

Dot product units are used in multilayer perceptron and linear networks, and in the final layers of radial basis function, PNN,
and GRNN networks.

Radial units are used in the second layer of Kohonen, radial basis function, Clustering, and probabilistic and generalized
regression networks. They are not used in any other layers of any standard network architecture, although STATISTICA
Neural Networks can execute, but not train, such networks.

STATISTICA Neural Networks also supports a third synaptic function:

Division. This is specially designed for use in generalized regression networks, and should not be employed elsewhere. It
expects one incoming weight to equal +1, one to equal -1, and the others to equal zero. The post-synaptic value is the +1
input divided by the -1 input.
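As a rough Python sketch (illustrative only, not STATISTICA's internal implementation; the function names are hypothetical), the three synaptic functions compute:

```python
import numpy as np

def dot_product_psp(inputs, weights, threshold):
    # Weighted sum of the inputs, minus the threshold value.
    return np.dot(weights, inputs) - threshold

def radial_psp(inputs, weights, threshold):
    # Squared Euclidean distance between the input pattern and the
    # unit's stored exemplar, multiplied by the threshold (which acts
    # as a deviation value in radial units).
    return threshold * np.sum((inputs - weights) ** 2)

def division_psp(inputs, weights):
    # Expects one weight of +1, one of -1, and the rest zero: returns
    # the +1 input divided by the -1 input (used in GRNN networks).
    numerator = inputs[weights == 1][0]
    denominator = inputs[weights == -1][0]
    return numerator / denominator
```

Each function produces the unit's post-synaptic value, which is then passed through the unit's activation function.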

Time Series
Many important problems can be classified as time series problems; the objective is to predict the value of some (typically
continuous) variable, given previous values of that and/or other variables (Bishop, 1995).

You can train any neural network architecture in STATISTICA Neural Networks to act on a time series.

To perform time series prediction, you must inform STATISTICA Neural Networks how to assemble a number of cases at
different time steps into patterns for training or execution. To do this set the following when creating the network:

• If a variable is to be used both as input (the past values) and output (the future value), select it for both the dependent and
independent variables. In most time series problems, there is actually only a single variable. However, STATISTICA
Neural Networks will support multiple variables - both inputs and outputs.

• Set the Steps parameter on the Custom Network Designer - Time Series tab to indicate the number of time steps to be
used as input to the network.

• Set the Lookahead parameter to indicate how many time steps ahead of the last input case the output should be predicted.

In the vast majority of cases, the Lookahead is set to 1, the Steps varies according to the requirements of the problem
domain, and the variable(s) used are all input/output. If a single input/output variable is predicted using single step
Lookahead, STATISTICA Neural Networks can feed predictions back into the network to perform indefinite time series
projection.
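The assembly of lagged cases into training patterns can be sketched as follows (an illustrative Python sketch; the function name is hypothetical, and STATISTICA performs this assembly internally):

```python
import numpy as np

def make_time_series_patterns(series, steps, lookahead=1):
    """Assemble (input, target) pairs from a one-dimensional series.

    Each pattern uses `steps` consecutive values as inputs, and the value
    `lookahead` time steps after the last input as the target.
    """
    inputs, targets = [], []
    for t in range(len(series) - steps - lookahead + 1):
        inputs.append(series[t:t + steps])
        targets.append(series[t + steps + lookahead - 1])
    return np.array(inputs), np.array(targets)
```

With Lookahead 1, the prediction for one pattern can be appended to the series and the window advanced, which is how indefinite time series projection proceeds.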

Topological Map
The radial layer of a Kohonen network, with units laid out in two-dimensions, and trained so that inter-related clusters tend
to be situated close together in the layer. Used for Cluster analysis (Kohonen, 1982; Fausett, 1994; Haykin, 1994; Patterson,
1996).

Unit Types
STATISTICA Neural Networks supports a wide range of types of units (also referred to as neurons).

Units in the input layer are extremely simple: they simply hold an output value, which they pass on to units in the second
layer. Input units do no processing. Input units have their synaptic function set to Dot Product and their activation function
set to Identity by default; these functions are actually ignored in input units.


Each hidden or output unit has a number of incoming connections from units in the preceding layer (the fan-in): one for each
unit in the preceding layer. Each unit also has a threshold value.

STATISTICA Neural Networks' units perform processing as follows:

The outputs of the units in the preceding layer, the weights on the associated connections, and the threshold value are fed
through the unit's synaptic function (post synaptic potential function) to produce a single value (the unit's input value).

The input value is passed through the unit's activation function to produce a single output value, also known as the activation
level of the unit.

STATISTICA Neural Networks supports several synaptic functions and a wide range of activation functions. These allow the
program to model a wide range of network types.
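The two-stage processing described above can be sketched as follows (illustrative only; a dot-product synaptic function feeding a logistic activation is assumed):

```python
import numpy as np

def unit_output(prev_outputs, fan_in_weights, threshold):
    # Stage 1: the synaptic (post synaptic potential) function combines
    # the preceding layer's outputs, the fan-in weights, and the
    # threshold into a single input value.
    psp = np.dot(fan_in_weights, prev_outputs) - threshold
    # Stage 2: the activation function maps the input value to the
    # unit's output value (its activation level).
    return 1.0 / (1.0 + np.exp(-psp))
```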

Unsupervised Learning
STATISTICA Neural Networks supports the following unsupervised learning algorithms. All, except principal components
analysis, are concerned with assignment of radial unit centers and deviations.

Unsupervised learning algorithms require a data set that includes typical input variable values. Observed output variable
values are not required. If output variable values are present in the data set, they are simply ignored.

Center Assignment

Kohonen Algorithm

Radial Sampling

K-Means Algorithm

Deviation Assignment

Explicit Deviation Assignment

Isotropic Deviation Assignment

K-Nearest Neighbor

Principal Components Analysis

Weigend Weight Regularization


A common problem in neural network training (particularly of multilayer perceptrons) is over-fitting. A network with a large
number of weights in comparison with the number of training cases available can achieve a low training error by modeling a
function that fits the training data well despite failing to capture the underlying model. An over-fitted model typically has
high curvature, as the function is contorted to pass through the points, modeling any noise in addition to the underlying data.

There are several approaches in neural network training to deal with the over-fitting problem, and STATISTICA Neural
Networks supports most of them (Bishop, 1995). These approaches are listed below.

• Select a neural network with just enough units to model the underlying function. The problem with this approach is
determining the correct number of units, which is problem-dependent.

• Add some noise to the training cases during training (altering the noise on each case each epoch): this "blurs" the position
of the training data, and forces the network to model a smoothed version of the data.

• Stop training (see Stopping Conditions) when the selection error begins to rise, even if the training error continues to fall.
This event is a sure sign that the network is beginning to over-fit the data.

• Use a regularization technique, which explicitly penalizes networks with large curvature, thus encouraging the
development of a smoother model.
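The additive-noise approach can be sketched as follows (an illustrative Python sketch, not STATISTICA's internal implementation; the function name and noise level are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def noisy_epoch_inputs(train_inputs, noise_sd=0.05):
    # Add fresh Gaussian noise to every training case each epoch.
    # This "blurs" the positions of the training points, forcing the
    # network to model a smoothed version of the data.
    return train_inputs + rng.normal(0.0, noise_sd, size=train_inputs.shape)
```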


Note: STATISTICA Neural Networks' Intelligent Problem Solver uses all of these techniques except additive noise
automatically.

The last technique mentioned is regularization, and this section describes Weigend weight regularization (Weigend et al.,
1991).

A multilayer perceptron model with sigmoid (logistic or hyperbolic tangent) activation functions has higher curvature if the
weights are larger. You can see this by considering the shape of the sigmoid curve: if you just look at a small part of the
central section, around the value 0.0, it is "nearly linear," and so a network with very small weights will model a "nearly
linear" function, which has low curvature. As an aside, note that during training the weights are first set to small values
(corresponding to a low curvature function), and then (at least some of them) diverge. One way to promote low curvature
therefore is to encourage smaller weights.

Weigend weight regularization does this by adding an extra term to the error function, which penalizes larger weights.
Hence the network tends to develop the larger weights that it needs to model the problem, and the others are driven toward
zero. The technique can be used with any of STATISTICA Neural Networks' multilayer perceptron training algorithms (back
propagation, conjugate gradient descent, Quasi-Newton, quick propagation, and Delta-bar-Delta) apart from Levenberg-
Marquardt, which makes its own assumptions about the error function.

The technique is commonly referred to as Weigend weight elimination, as it is possible, once weights become very small, to
simply remove them from the network. STATISTICA Neural Networks does not support the removal of individual weights
because the program's networks are fully connected from layer to layer (every unit is connected to every unit in the next
layer). However, STATISTICA Neural Networks can remove entire hidden units and/or input variables, where all the fan-out
weights from the hidden unit, or from all the input units corresponding to the input variable, are below a certain threshold.
This is an extremely useful technique for developing models with a "sensible" number of hidden units, and for selecting
input variables. You can turn on this pruning algorithm on the End tab of the appropriate training dialogs.

Once a model has been trained with Weigend regularization and excess inputs and hidden units removed, it can be further
trained with Weigend regularization turned off, to "sharpen up" the final solution.

Weigend regularization can also be very helpful in that it tends to prevent models from becoming over-fitted.

You can view the fan-out weights of units on the Neural Network Editor - Weights tab - each row gives the fan-out weights
from the units in a particular layer (each column gives the threshold and fan-in weights of units in the subsequent layer).

Note: When using Weigend regularization, the error on the progress graph includes the Weigend penalty factor. If you
compare a network trained with Weigend to one without, you may get a false impression that the Weigend-trained network
is under-performing. To compare such networks, view the error reported in the summary statistics on the model list (this
does not include the Weigend error term).

Weigend regularization is available on the Decay tab of the following dialogs:

Multilayer Perceptron Training

Dot Product Training

Extended Dot Product Training

Technical Details. The Weigend error penalty is given by:

E = λ Σi (wi/w0)² / (1 + (wi/w0)²)

where λ is the Regularization coefficient, wi is each of the weights, and w0 is the Scale coefficient.

The error penalty is added to the error calculated by the network's error function during training, and its derivative is added
to the weight's derivative. However, the penalty is ignored when running a network.

The regularization coefficient is usually manipulated to adjust the selective pressure to prune units. The relationship between
this coefficient and the number of active units is roughly logarithmic, so the coefficient is typically altered over a wide range
(0.01-0.0001, say).

The scale coefficient defines what is a "large" value to the algorithm. The default setting of 1.0 is usually reasonable, and it
is seldom altered.
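The penalty, and the derivative that is added to each weight's error derivative during training, can be sketched as follows (illustrative Python, assuming the weight-elimination penalty of Weigend et al., 1991; the function names are hypothetical):

```python
import numpy as np

def weigend_penalty(weights, reg_coef=0.01, scale=1.0):
    # Penalty: lambda * sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2).
    # It approaches lambda per weight for large weights, so large
    # weights incur a bounded cost while small ones are driven to zero.
    r2 = (weights / scale) ** 2
    return reg_coef * np.sum(r2 / (1.0 + r2))

def weigend_penalty_grad(weights, reg_coef=0.01, scale=1.0):
    # Derivative of the penalty with respect to each weight; this is
    # added to the weight's error derivative during training.
    r2 = (weights / scale) ** 2
    return reg_coef * (2.0 * weights / scale**2) / (1.0 + r2) ** 2
```

Note how the per-weight penalty saturates: a few large weights cost little more than medium-sized ones, which is why the algorithm tolerates an uneven mix of large and small weights.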

A feature of the Weigend error penalty is that it does not just penalize larger weights. It also prefers to tolerate an uneven
mix of some large and some small weights, as opposed to a number of medium-sized weights. It is this property that allows
it to "eliminate" weights.

Neural Networks Examples


STATISTICA Neural Networks Step-by-Step Examples

SNN Example 1: Preliminaries (Data File and Analysis)

SNN Example 2: Creating a Neural Network

SNN Example 3: Testing a Neural Network

SNN Example 4: Using the Network Set Editor

SNN Example 5: The Iris Problem

SNN Example 6: Advanced Use of the IPS

SNN Example 7: Classification Confidence Thresholds

SNN Example 8: Regression Problems

SNN Example 9: Case Subsets

SNN Example 10: PNN and GRNN Networks

SNN Example 11: Creating a Custom Neural Network

SNN Example 12: Training the Network

SNN Example 13: Stopping Conditions

SNN Example 14: Radial Basis Function Networks

SNN Example 15: Linear Models

SNN Example 16: SOFM Networks

SNN Example 17: PNNs and GRNNs Revisited

SNN Example 18: Ensembles and Sampling

SNN Example 19: Input Variable Selection

SNN Example 20: Time Series

STATISTICA Neural Networks Step-by-Step Examples


This section describes how to use STATISTICA Neural Networks (SNN) via a set of step-by-step examples. The first 10
examples concentrate on problem solving using the Intelligent Problem Solver. You can follow these examples even if you
have no previous experience with neural networks.

As you reach the more advanced examples, and in particular those using custom neural networks, you may find it necessary
to read through the Neural Networks Introductory Overviews, which give an outline introduction to the basic concepts of
neural networks. This includes a list of good introductory textbook references.

The examples are designed to be worked through in the given order:

SNN Example 1: Preliminaries (Data File and Analysis)

SNN Example 2: Creating a Neural Network

SNN Example 3: Testing a Neural Network


SNN Example 4: Using the Network Set Editor

SNN Example 5: The Iris Problem

SNN Example 6: Advanced Use of the IPS

SNN Example 7: Classification Confidence Thresholds

SNN Example 8: Regression Problems

SNN Example 9: Case Subsets

SNN Example 10: PNN and GRNN Networks

SNN Example 11: Creating a Custom Neural Network

SNN Example 12: Training the Network

SNN Example 13: Stopping Conditions

SNN Example 14: Radial Basis Function Networks

SNN Example 15: Linear Models

SNN Example 16: SOFM Networks

SNN Example 17: PNNs and GRNNs Revisited

SNN Example 18: Ensembles and Sampling

SNN Example 19: Input Variable Selection

SNN Example 20: Time Series

SNN Example 1: Preliminaries (Data File and Analysis)


Selecting a data set and specifying the nature and role of your variables in the neural network analysis is the first step toward
starting your analysis. In STATISTICA Neural Networks (SNN), variables are designated as independent variables (inputs)
and dependent variables (outputs). Input and output variables are used for training the neural network and for computing
predicted values or classifications from the trained network. In addition, you can include extra variables that are neither
input nor output variables. Such variables can be used in auxiliary results, such as scatterplots of residual values vs. those
variables, etc. Finally, you can (optionally) specify a single variable to assign cases into Train, Select, Test, and Ignore sets;
this variable is referred to as the Subset variable.

In addition, the following distinction is important for variables designated as input or output variables for a neural network:
STATISTICA Neural Networks distinguishes between continuous and categorical input/output variables. A categorical
variable must contain integer or text values that identify the class or group to which an observation belongs; a typical
example of a categorical or class variable is Gender with the groups Male and Female. Note that the (optional) subset
variable mentioned in the previous paragraph is also categorical in nature because it assigns observations to the groups
designated as Train, Selection, Test, or Ignore. Continuous variables are those that measure or indicate some quantity on a
continuous scale; a typical example of a continuous variable would be a person's Height or Weight, which can be measured
on continuous scales.


Note. By default, all the cases in the data set are used. However, you can use a nominal variable in the data set that indicates
which cases to use.

On the SNN Startup Panel, you can select variables using either the Quick or the Advanced tab. On the Quick tab you can
select the inputs/outputs directly. These can be categorical or continuous depending on the problem type (see above). This is
in contrast to the Advanced tab where input/output variables can be specified more flexibly in a two stage process: First
select the continuous and categorical variables for the analysis, then select from among that list those that are the output for
the neural network, and those that are the input variables. Thus by using the Advanced tab options you can specify neural
network problems that include both continuous and categorical variables as outputs (or inputs), and also specify auxiliary
variables that you may want to be available later on the Results dialog to produce various plots.

1. Quick tab. You should first determine the nature of the problem you want to analyze. STATISTICA Neural Networks
supports four types of problems: regression, classification, time series, and cluster analysis. These problem types will
constrain the types of variables you can select as input and output (independent and dependent) variables to conform to
the respective common type of analysis. For example, regression problems require continuous output variables,
classification problems require categorical output variables, time series problems allow overlapping lists of continuous
input and output variables, and cluster analysis problems require a categorical output variable (that can later be used to
label clusters).

2. Advanced tab. You can also specify variables, and hence the general types of analysis problems, with the options
available on the Advanced tab. Those options will not constrain the variable selections to conform to one of the
(common) analysis types listed above, so for example, you could specify simultaneously continuous and categorical
output variables (a mixture of a regression and classification problem), in time-series networks.

Example 1. Specifying a regression problem via the Quick tab options.

Suppose you want to find a good model for predicting continuous measurements of job satisfaction from continuous
measurements of satisfaction in other domains. Open example data file Factor.sta, which contains in the first three columns
measurements of satisfaction at the work place.

Select Neural Networks from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup
Panel.

Select the Regression option button in the Problem type group box on the Quick tab. Click the Variables button, and from
the variable selection dialog, select WORK_1-WORK_3 as continuous outputs and the rest of the variables as inputs. The
task is then to predict the values of WORK_1-WORK_3 based on the given values of the inputs. You can use the Intelligent
Problem Solver or Custom Network Designer (and specific network) to find a network architecture that accurately predicts
job satisfaction.

Example 2. Specifying a classification problem via Quick tab options.

Use the data file IrisSNN.sta to classify FLOWER type from four continuous variables SLENGTH to PWIDTH. Select
Classification in the Problem type group box on the Quick tab, and click the Variables button to display the variable
selection dialog. Select FLOWER as the Categorical Outputs variable; select variables SLENGTH through PWIDTH as
continuous input variables. The goal of the analysis is to accurately classify flowers as Setosa, Virginic, or Versicol based
on the values of the continuous input variables. You can use the Intelligent Problem Solver or Custom Network Designer
(and specific network) to find a network architecture that accurately classifies the different types of iris.

Example 3. Specifying a time series problem via Quick tab options.


Using the data file Series_G.sta, you can predict future values of variable SERIES_G (monthly airline passenger totals; see
Box & Jenkins, 1976) based on previous ("lagged") observations. Select Time Series in the Problem type group box on the
Quick tab. Click the Variables button and select SERIES_G both as the input and output variable. You can use the Intelligent
Problem Solver or Custom Network Designer (and specific network) to find a network architecture that accurately predicts
future airline passenger loads.

Example 4. Specifying a classification problem via Advanced tab options

Open the data file Boston2.sta. Click the Variable types button on the Neural Networks Startup Panel - Advanced tab and
select variables ORD1 through ORD12 as the Continuous variables for the analyses, variables PRICE and CAT1 as the
Categorical variables, and variable SAMPLE as the Subset variable.

Next, click the Input-Output variables button to specify how the variables selected in the previous step (excluding the subset
variable) should be used in the neural network analysis, i.e., whether they should be treated as input, output, or auxiliary
variables. Select variable PRICE as the Output variable, and variables CAT1 and ORD1 through ORD12 as Input variables
(note that the first variable is defined as categorical). Note that the subset variable SAMPLE does not appear in this variable
selection dialog since a subset variable cannot be an input or output as well as a subset.

Note that all variables that were first selected as categorical or continuous variables, but are not included as input or output
variables in the neural network analyses, will automatically be regarded as auxiliary variables. Those variables will be
available on the Results dialog where they can be selected for various graphs, etc. (e.g., for plots of residuals against specific
auxiliary variables).

To complete the data selection process, you will need to define the subset variable codes. This will determine which cases of
the data set will be used for training, selection, and testing of neural networks.

On the Startup Panel Advanced tab, first select the Train check box, and double click in the adjacent box. A dialog will be
displayed in which you can select the subset variable code; select LEARNING and click OK. Then, select the Selection check
box, double click in the adjacent box, and from the resulting dialog select TEST.


This selection will divide the data set such that cases 1-506 (identified with the code LEARNING in variable SAMPLE) will
be used for training, and cases 507-1012 (TEST) for selection.

Note that the variable specification options on the Advanced tab will not only allow you to specify classification or
regression-type problems (i.e., categorical or continuous output variables), but also both (i.e., "mixed" regression and
classification problems). STATISTICA Neural Networks will automatically make available the appropriate analysis and
results options for the different types of output variables, i.e., continuous, categorical, or both.

SNN Example 2: Creating a Neural Network


The Intelligent Problem Solver (IPS) is designed to lead you step by step through the process of building a neural network,
making the process as easy as possible for you. This is actually computationally very demanding, as unlike traditional linear
approaches in statistics, it is not possible to uniquely identify a "best solution." Neural network training algorithms are
iterative, training over a period of time, and need to be repeated a number of times until a satisfactory solution is found.
What is more, some difficult decisions have to be made about the type of network to use (there are many variants), their
complexity (the correct answer to which depends on the unknown complexity of the underlying function), and the input
variables to use (similarly unknown), and the only guaranteed way to find out the answers to these questions is to try out
different networks and compare them.

The Intelligent Problem Solver hides all this complexity from you, allowing you to build neural networks right away even if
you have no previous experience of neural networks. However, if you have large data sets, you may need to be patient, as
training networks can take quite a long time.

When you first start using the Intelligent Problem Solver, you can limit yourself to the options available on the Quick tab,
but as you gain experience, you can select more advanced options from the other tabs, giving you finer-grained control over
the design, and as you become more familiar with STATISTICA Neural Networks, allowing you to train more powerful
neural networks. You can also control design even more closely by using the Custom Network Designer.

In this example, we will use the Intelligent Problem Solver to build a neural network. Open the data file Barotrop.sta via the
File - Open menu; it is in the /Examples/Datasets directory of STATISTICA. The file contains data on two classes of storms -
Barometric and Tropical. This is a modified version of the data reported in Elsner, Lehmiller and Kimberlain (1996). The
data recorded includes the type of storm and its longitude and latitude. We will attempt to predict the class of storm from its
longitude and latitude. This is a simple nonlinear problem, and so is a good way to illustrate STATISTICA Neural Networks
capabilities.

Select Neural Networks from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup
Panel.


Using the Intelligent Problem Solver. On the Neural Networks Startup Panel - Quick tab, select the Classification option
button in the Problem type group box, select Intelligent Problem Solver in the Select analysis list, and click the OK button. A
standard variable selection dialog will be displayed.

Select the dependent (output) variable: in this case, the output variable is CLASS, which gives the type of each storm. Then
select the independent (input) variables: both the LONGITUD and LATITUDE variables. Click the OK button.

Click the OK button on the Startup Panel to display the Intelligent Problem Solver dialog. The Quick tab should be selected
by default.

As there are only two independent variables in this problem, clear the Select a subset of independent variables check box.
This ensures that both independent variables are used as inputs to all the networks tested.

The Optimization time group box contains options to specify, in broad terms, how long the Intelligent Problem Solver should
spend trying to design an effective neural network for this problem. In general, the longer the Intelligent Problem Solver is
allowed to run, the better the solution it is likely to find. Actually, it can be configured to tell you every time it finds an
improved solution, so you can run it for a long period and simply stop it (by clicking the Finish button on the IPS Training
in Progress dialog) if you observe that it is failing to make any progress. You can specify the time taken either in terms of
how many networks should be created and tested (select the Networks tested option button), or specify how many hours and
minutes it should run (select the Hours/minutes option button). In either case, you can be confident in the fact that
STATISTICA Neural Networks uses some of the most advanced neural network training algorithms known, running up to
two orders of magnitude faster than the back propagation algorithm employed by most neural network packages. It is also
worth noting that, although neural network training may take a long time, neural network execution (use) is extremely fast.


For the purposes of this example, select the Networks tested option button, and increase the number of iterations to 25
network tests by entering 25 in the adjacent input box or by changing the value using the microscrolls.

The Intelligent Problem Solver can test a very large number of neural networks, and it is usually a good idea to retain more
than just the very best of these for a number of reasons - you may want to compare the performance of different types of
networks, and some networks may have slightly lower performance than others but have other desirable features such as
small size (and therefore fast execution) or a smaller number of input variables. By default, STATISTICA Neural Networks
deliberately retains networks of different types, and with different trade-offs between the performance (predictive accuracy)
of the network and its complexity.

STATISTICA Neural Networks also maintains a network set where the retained networks are kept for subsequent analysis.
The network set contains a number of networks related to the same data set, and you can subsequently choose which
network to use. The Intelligent Problem Solver will insert all the networks it retains into the network set. As a consequence,
it is possible to rapidly create very large numbers of networks (even into the hundreds) by repeatedly using the Intelligent
Problem Solver, and most of these will be of little use as they are superseded by better networks. To prevent this problem,
you can specify a maximum number of networks in the network file, and ensure that if it is full, STATISTICA Neural
Networks replaces inferior networks from the file rather than simply making it ever larger.

Set the Networks retained box to 10, so that only the best 10 networks created by the Intelligent Problem Solver are saved.
The Intelligent Problem Solver - Quick tab should then look like the illustration below.

Now click the OK button, and the Intelligent Problem Solver will begin to design networks.

The IPS Training in Progress dialog is displayed. Each time an improved network is discovered, a new row is added to the
spreadsheet on this dialog, displaying summary details about the network (the summary details are explained in Example 3).
In addition, the time elapsed is displayed at the bottom of the window, as well as the percentage of the process that is
complete. If you are conducting a long run and there has not been any progress for a considerable time, remember that you
can click the Finish button on the Progress dialog to terminate the network testing early.

For this example, however, you should not need to wait very long, and with a Pentium II or better computer, this design
should be completed in less than a minute.

When the search finishes, the Results (Run Models) dialog is displayed, which includes summary details on the retained
networks, and allows you to subject them to further testing.

Generating and Interpreting Descriptive Statistics. Click the Descriptive statistics button on the Quick tab of the Results
dialog, and two results spreadsheets are displayed: a Classification spreadsheet and a Confusion Matrix spreadsheet.

The Classification spreadsheet presents overall summary details on the classification performance. There are columns for
each output class as predicted by each model. For example, the column labeled CLASS.BARO.1 corresponds to the
predictions of model 1 on the BARO class of the CLASS variable.

The first row says how many cases of each storm type there were in the data set. The second row records, for each class,
how many are actually predicted correctly by the network, and the third how many are wrongly predicted. You are likely to
see all, perhaps barring one or two, correctly predicted in this case. The fourth row records "unknown" cases, a point to
which we will return in a later example using the Intelligent Problem Solver - there should be no such cases now.


The Confusion Matrix spreadsheet gives a more detailed breakdown of misclassifications, and is particularly useful for
problems with more than two output classes.
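The structure of a confusion matrix can be sketched as follows (an illustrative Python sketch; STATISTICA computes this spreadsheet for you, and the function name is hypothetical):

```python
import numpy as np

def confusion_matrix(actual, predicted, classes):
    # Rows correspond to actual classes, columns to predicted classes,
    # so off-diagonal entries are misclassifications.
    m = np.zeros((len(classes), len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for a, p in zip(actual, predicted):
        m[index[a], index[p]] += 1
    return m
```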

The summary details so far displayed are for all the cases in the data set. However, the cases are actually divided into a
number of subsets, and it is important to consider the performance on particular subsets. The significance of these subsets
is discussed below.

An important limitation of neural networks (and related non-linear techniques) is that of over-fitting, or over-learning. Our
objective when designing a neural network is to find a function that accurately models the unknown underlying function that
relates the input variables to the output variables, and we estimate this function by fitting a function to the available data
points (cases). The problem with fitting a curve to the data points is that if we choose a sufficiently eccentric function (that
is, a function of high curvature) we can end up modeling the noise in the data rather than the underlying function. The figure
below illustrates this problem when fitting a regression line in one dimension - a much smoother line would be better here,
even if it did not pass through the data points, as it will predict results more accurately when new data is tested.

The ability to perform well on new data is called generalization, and is the most desirable property of a neural network. How
then can we ensure that a network will generalize well? An important technique is to hold back some of the data, and not to
use it for training the network. This data can be used to check the network's performance. This selection data is used in two
ways. First, as network training progresses, the curvature of the function increases, and so we can stop training if
performance on the selection set starts to deteriorate. This stops over-learning. Second, if we design a number of neural
networks and want to select the best one, we cannot safely compare the performance on the training sets, as one network
may have over-learned and have a deceptively good performance figure. We can, however, select the network with the best
selection subset performance.
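The first use of the selection set, stopping training when its error deteriorates, can be sketched in a few lines. The per-epoch error histories and the patience value below are made up for illustration; they are not produced by STATISTICA:

```python
# Hypothetical per-epoch errors: training error keeps falling, while
# selection error falls and then rises as over-learning sets in.
train_err  = [0.90, 0.60, 0.40, 0.28, 0.20, 0.15, 0.11, 0.08]
select_err = [0.92, 0.65, 0.45, 0.33, 0.30, 0.31, 0.35, 0.42]

PATIENCE = 2  # stop after this many epochs without selection improvement
best_err, best_epoch, bad_epochs = float("inf"), -1, 0
for epoch, err in enumerate(select_err):
    if err < best_err:
        best_err, best_epoch, bad_epochs = err, epoch, 0  # keep best-so-far
    else:
        bad_epochs += 1
        if bad_epochs >= PATIENCE:
            break  # selection error deteriorating: stop training

print(best_epoch, best_err)  # epoch 4, selection error 0.30
```

Note that the network retained is the one from the best epoch, not the last epoch trained, even though the training error was still falling.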

When the selection subset performance is used in this way, another problem may be introduced. If a large number of
networks are tested, and the one with the best selection performance is selected, we are effectively conducting a sampling
experiment, and we may end up with a network with a deceptively good selection subset performance that does not
truthfully reflect its generalization ability, being biased by the sampling process. We may therefore reserve a further subset
of data, the test set, which is used purely at the end of the design process to check that the selection error is not an artifact.
Provided that the selection and test errors are close together, we may be reasonably confident that the network will
generalize successfully.

By default, the Intelligent Problem Solver automatically assigns cases randomly to training, selection, and test subsets, and
summary statistics can be reported independently for the three subsets. All the experiments conducted in a particular run of
the Intelligent Problem Solver use the same division of cases into subsets, and so you can directly compare the performance
of the networks created.

To display the summary statistics broken down by subset, take the following steps:

Resume the analysis by clicking the Results button on the Analysis bar at the bottom of the screen, or by selecting Resume
from the Statistics pull-down menu, or by pressing CTRL+R.

In the Subsets used to generate results group box, select the All (separately) option button. Then, click the Descriptive
statistics button.

The resulting Classification spreadsheet is split into four sections. Columns now have titles with the prefixes T, S, X, and I
corresponding to the training, selection, test, and ignored subsets respectively (we do not have any ignored cases, so the
entries in the last section of the spreadsheet should be zeroes).

By default, the cases are divided in the proportions 2:1:1 between the three subsets, and in this case you should have 50
training cases, 25 selection cases, and 25 test cases. You should also find that performance on the subsets is comparable (and
perfect or near perfect), indicating that the neural network can indeed generalize well.
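The random 2:1:1 assignment can be sketched as follows. This is plain Python for illustration; STATISTICA performs the division internally, and the seed and shuffling method here are assumptions, not the actual implementation:

```python
import random

def split_cases(n_cases, seed=0):
    """Randomly divide case indices into training, selection and test
    subsets in the proportions 2:1:1."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_train = n_cases // 2    # half the cases for training
    n_select = n_cases // 4   # a quarter each for selection and test
    train = indices[:n_train]
    select = indices[n_train:n_train + n_select]
    test = indices[n_train + n_select:]
    return train, select, test

train, select, test = split_cases(100)
print(len(train), len(select), len(test))  # 50 25 25
```

Because the same division is reused for every experiment in a run, networks created in that run can be compared on identical subsets.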

Finishing the Analysis. Once you are satisfied with the results generated by the Intelligent Problem Solver, you can
complete the analysis.

Click the OK button on the Results dialog. This confirms that the analysis has been successfully completed, commits the
networks generated by the Intelligent Problem Solver to the network set, and returns you to the Startup Panel. If instead you
stop the analysis by clicking the Cancel button, those networks are discarded. However, to prevent accidental loss of results,
you will be prompted to confirm that you want to cancel the current analysis.

Since neural networks take a considerable amount of time to train, and experimentation is generally needed to find the best
performance, it is normal practice to keep copies of successful networks rather than recreating them each time the data is
analyzed (which is the procedure typically used for conventional statistical methods).

You can save the network set by selecting the Networks/Ensembles tab and clicking the Save Network File As... button to
display a standard Save File dialog, which is used to save the networks on the current workspace to a new network file (file
name extension .snn). If you are going directly to Example 3, there is no need to save the file at this time.

SNN Example 3: Testing a Neural Network


Once you have used the Intelligent Problem Solver (IPS) to design some neural networks and saved them to the network file,
you can later open these files for further analysis by clicking the Open Network File button on the Neural Networks Startup
Panel - Networks/Ensembles tab to display a standard open file dialog.

Select the Networks/Ensembles tab on the Startup Panel. The networks that you created in Example 2 should still be selected.
This tab includes summary statistics on each network, which are also displayed in the Intelligent Problem Solver Progress
dialog and the Results dialog. Now is a good time to describe these summary statistics.

Model Summary Statistics. STATISTICA Neural Networks supports two types of models - neural networks and network
ensembles. We will discuss ensembles in a later example. Currently, all the models we have created are neural networks.
Summary details for neural networks are presented in the Networks list on the Neural Networks Startup Panel -
Networks/Ensembles tab.

The summary statistics consist of:

Index. This is a unique, lifelong number assigned to each neural network when it is created. The indices are assigned in
chronological order.

Lock. This column indicates whether the model is locked to prevent accidental deletion.

S/A. This column indicates whether a network is standalone. Standalone networks are models in their own right, which can
optionally also be part of an ensemble. Non-standalone networks are members of ensembles that are not considered to be
models in their own right.

Refs. This column indicates the number of references to the network; that is, the number of ensembles that contain the
network. A single network can be a member of several ensembles.

Profile. This is the most useful summary statistic, packing a great deal of information into a short piece of text. It tells you
the network type, the number of input and output variables, the number of layers, and the number of neurons in each layer.
The format is <type> <inputs>:<layer1>-<layer2>-<layer3>:<outputs>, where the number of layers may vary. For example,
the profile MLP 2:2-3-1:1 signifies a Multilayer Perceptron with two input variables and one output variable, and three
layers of 2, 3, and 1 units respectively. For simple networks the number of input variables and output variables may match
the number of neurons in the input and output layers, but this is not always so, as we shall see in due course.
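The profile format is easy to unpack programmatically. The small parser below is an illustrative sketch of the format as described above, not a STATISTICA facility:

```python
def parse_profile(profile):
    """Parse a profile such as 'MLP 2:2-3-1:1' into its components."""
    net_type, spec = profile.split()
    inputs, layers, outputs = spec.split(":")
    return {
        "type": net_type,                          # network type, e.g. MLP
        "inputs": int(inputs),                     # number of input variables
        "layers": [int(n) for n in layers.split("-")],  # units per layer
        "outputs": int(outputs),                   # number of output variables
    }

info = parse_profile("MLP 2:2-3-1:1")
print(info)  # {'type': 'MLP', 'inputs': 2, 'layers': [2, 3, 1], 'outputs': 1}
```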

Train Perf/Select Perf/Test Perf. These columns give the performance of the networks on the training, selection, and test
subsets respectively. You should not give too much credence to the performance reported on the training set, which is
often deceptively good (indicating over-learning). Also, you should avoid using the test set performance to select models, as
that defeats its purpose (which is to maintain some data not used for training or model selection, so that a
dispassionate final assessment of performance can be made).

Use the performance measure on the selection subset to discriminate between, and choose between, networks.

The meaning of the performance measure depends on the network type. For a classification network, it is the proportion of
cases in the subset correctly classified. For a regression network, it is the ratio of the prediction error standard deviation to
the standard deviation of the observations (so lower values indicate better performance).
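Both measures can be computed directly. The sketch below uses made-up data, and the S.D.-ratio formula is one common reading of the regression measure (prediction-error standard deviation divided by the standard deviation of the observations), not a transcription of STATISTICA's internal code:

```python
def classification_performance(observed, predicted):
    """Proportion of cases classified correctly."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)

def sd_ratio(observed, predicted):
    """Ratio of the prediction-error standard deviation to the standard
    deviation of the observations (lower is better; 0 is a perfect fit)."""
    def sd(xs):
        m = sum(xs) / len(xs)
        return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    errors = [o - p for o, p in zip(observed, predicted)]
    return sd(errors) / sd(observed)

print(classification_performance(["TROP", "BARO", "TROP", "TROP"],
                                 ["TROP", "BARO", "BARO", "TROP"]))  # 0.75
```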

Train Error/Select Error/Test Error. Neural network training algorithms optimize an error function (e.g., the root mean
square error, or the cross entropy, between observed and predicted outputs). These columns report the errors on the subsets.
The error is less directly interpretable than the performance measure, but is of more significance to the training algorithms themselves.
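The two error functions just mentioned can be written down directly; the formulas below are standard definitions for illustration, applied to a tiny made-up example:

```python
import math

def rms_error(observed, predicted):
    """Root mean square difference between observed and predicted outputs."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))

def cross_entropy(observed, predicted):
    """Mean cross entropy for a single binary output; each observed value
    is 0 or 1, each predicted value is the network's confidence in class 1."""
    return -sum(o * math.log(p) + (1 - o) * math.log(1 - p)
                for o, p in zip(observed, predicted)) / len(observed)

print(rms_error([1.0, 0.0], [0.8, 0.4]))  # sqrt(0.1), approximately 0.316
```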

Training. This is a brief description of the training algorithm used to train the network. A typical code might read:
BP200CG102b, which signifies "two hundred epochs of back propagation, followed by one hundred and two epochs of
conjugate gradient descent, at which point training was terminated due to over-learning and the best network in the training
run retrieved."

Note. A short text description that you can enter using the Network Editor - Quick tab.

Inputs. The number of input variables to the model (also displayed as part of the profile).

Hidden(1)/Hidden(2). The number of hidden units in the network.

Making Predictions Using the Neural Networks. Having trained the neural networks, we can use them to make
predictions. Predictions can be made based on cases in the data set or new cases entered by the user.

On the Networks/Ensembles tab, select a single network with good performance. Select the Advanced tab of the Startup
Panel. Select Run Existing Model from the Select analysis list, and click the OK button to display the Results dialog.

On the Quick tab of the Results dialog, select the Overall option button in the Subsets used to generate results group box.
Click the Predictions button.

STATISTICA Neural Networks generates a spreadsheet giving the predictions of the networks on the data set. Other pieces of
information may be included in the spreadsheet too; by default, the observed values from the data set are also included.

The spreadsheet contains the observed value (CLASS) in the first column, then the corresponding predictions of the neural
network (CLASS.n, where n is the index of the network chosen - e.g. CLASS.4 for the network with index 4) in the next
column. If you select multiple models, there will be one column for the observed value but multiple columns for the
different network predictions.

With the default selection, the datasheet shows the prediction made by the neural network - either BARO or TROP. The
neural network itself works entirely with numeric information, and the variable prediction is produced by the built-in post-processing
functions that accompany every neural network in STATISTICA Neural Networks. The prediction is actually
made by treating the activation level of the output neuron as a confidence level, and assigning the class by comparing this
activation level with some confidence thresholds (assigned automatically by the Intelligent Problem Solver).
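The thresholding step can be sketched as follows. The accept/reject threshold values here are illustrative placeholders, not the values the Intelligent Problem Solver actually assigns:

```python
def classify(activation, accept=0.95, reject=0.05):
    """Map a single output neuron's activation to a class label.

    High activations (above the accept threshold) are classed TROP, low
    activations (below the reject threshold) BARO, and anything in
    between is left Unknown.
    """
    if activation >= accept:
        return "TROP"
    if activation <= reject:
        return "BARO"
    return "Unknown"

print([classify(a) for a in (0.99, 0.02, 0.50)])  # ['TROP', 'BARO', 'Unknown']
```

Cases falling between the thresholds are exactly the "unknown" cases counted in the fourth row of the Classification spreadsheet.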

Next, select the Predictions tab of the Results dialog. In the Prediction type(s) shown group box, clear the Prediction check
box, and select the Confidence levels check box. Click the Predictions button. You can then view the confidence levels of
the neural network on each case (these are all between 0 and 1). Low figures indicate a BARO storm, and high figures a
TROP storm. You should find that the network is very certain indeed of most storm types. Confidence levels of neural
networks should be treated with caution - for some network types, the confidence levels may be treated as class membership
probability estimates; for others, they may not be. If your network is a multilayer perceptron, trained with the default
configuration assigned by the IPS, the confidence levels may indeed be treated as probability estimates, on the assumption
that the generating distribution is drawn from the exponential family.

Plotting Response Surfaces. The neural network predicts a class for each combination of latitude and longitude. We can
display this prediction as a response surface in three dimensions, where the height of the surface indicates the predicted
class.

Select the Advanced tab of the Results dialog, and click the Response surface button to display the Response Surface dialog.

Ensure that the Independent (X axis) and Independent (Y axis) selections are LONGITUD and LATITUDE respectively
(which they should be by default). Click the Resp. surface button on the Quick tab to display the response surface. Note that
you can also display the response graph using the Response surface button on the Fixed Independents tab.

As you would expect, the response surface for TROP shows a plateau in the middle longitudes, corresponding to prediction
of the TROP storm class. The lower areas correspond to the prediction of the BARO storm class.

The response surface for BARO is the mirror image of that for TROP.

You can interactively rotate the response surface to generate different views by clicking the Rotate button on the Graph
Tools toolbar and using the options on the Point of View Settings and Exploratory Spin dialog.

You can also customize the graph's appearance in a variety of ways (including colors, fonts, graph style, legend, and
presentation of axes) by using the options on the various tabs of the All Options dialog, accessible by right-clicking on the
background of the graph and selecting Graph Properties (All Options) from the shortcut menu or by selecting All Options
from the Format pull-down menu.

By default, the response surface shows the activation of the output neuron, which corresponds to the network's confidence
level. It is also possible to display the network's class prediction as plateaus at various levels.

Select the Options tab of the Response Surface dialog, and select the Prediction (plateaus) option button in the Response
plotted for classification outputs group box; then click the Resp. surface button on the Quick tab. This will display the
network's class prediction as plateaus at various levels.

Making predictions on new data. Having trained a neural network, you will at some point want to use it to make
predictions on new data. There are three ways to do this in STATISTICA Neural Networks:

1. Place the new data in a data file and open it using the Open Data button on the Startup Panel. Then, open the network file
using the Open network file button on the Networks/Ensembles tab of the same dialog. Once the network file is loaded,
you will see the network details in the Network summary list on the Networks/Ensembles tab. After selecting the
networks you want to execute, select Run Existing Model on the Advanced tab of the Startup Panel and click OK. This
will display the Results dialog. Note: if you click OK without first selecting a network from the Summary of networks list
on the Networks/Ensembles tab, a model selection dialog will be displayed first. You can use this dialog to select the
model(s) you want to use for making predictions.

2. Load and run the neural network programmatically using the SNNK API (Application Programming Interface) or code
generated by the Code Generator from your own custom-designed programs; of course you can also write (or record via
the macro recording facilities) STATISTICA Visual Basic programs to load and then execute existing (trained) networks.
The Code Generator options are useful if you want to wrap the neural network up into a "black box" program for use by
staff who are not trained in the use of neural networks or STATISTICA Neural Networks. Also, STATISTICA Data Miner
will automatically deploy trained networks, i.e., allow you to "connect" new data to a trained network for prediction, by
simply attaching arrows to the respective symbolic representations of analysis objects (e.g., trained neural networks) in
the Data Miner project.

3. Run a single, user specified case directly in STATISTICA Neural Networks. This is useful for testing occasional new data
when the neural network designer is also responsible for making predictions.

We will demonstrate the last option.

Click the User defined case button on the Advanced tab of the Results dialog to display the User Defined Case Prediction
dialog. The Quick tab on this dialog contains a spreadsheet that shows the current input. To run an existing input, click the
Run current input button. To modify its value, click User defined input. This will display a spreadsheet. Enter the longitude
and latitude you want in the appropriate cells of this spreadsheet and click OK. This will display the new value(s) (user
defined) in the input list on the Quick tab. To run the new value(s), click the Run current input button. Note that a blank cell
indicates a missing value, and STATISTICA Neural Networks automatically substitutes an appropriate value when running
the network.

Enter the longitude and latitude you want to test in the input datasheet, and click the Run button. A row is added to the
output data sheet containing the networks' predictions.

You can also display the Advanced tab of the User Defined Case Prediction dialog, and select the Confidence levels check
box in the Prediction type(s) group box and the Inputs check box in the Also in spreadsheet group box.

Display the Quick tab, and run a few cases. This time the output spreadsheet will contain the class confidence level, and the
inputs you specified for the case.

Performing "What if?" Analysis. A frequent requirement is a "What-if?" analysis, where you want to know what would
happen if some changes were made to a particular case. For example, we might want to know what would happen if the
storm recorded in case 9 had been at a slightly higher latitude. STATISTICA Neural Networks has several facilities to
perform what-if analyses.

The User Defined Case Prediction dialog can be used for a simple what-if analysis.

Click the microscroll button next to the Input case field. As the case number is altered, the corresponding values are copied
into the input data set. You can then click the Run current input button to execute the networks, optionally modifying the
input values first.

Select Input case 9. The longitude and latitude of case 9 (62 longitude, 14 latitude) are displayed in the input datasheet.
Click the Run current input button to recalculate the prediction at the current position.

You can now experiment with adjusting the input in the User Defined Case Prediction dialog (accessible by clicking the
User defined input button). Alter the longitude to 64, and click the Predictions button. The prediction should change from
BARO to TROP. If you display the Advanced tab, change the Prediction type to Confidence levels, then return to the Quick
tab and repeat the experiment. You will be able to see how the confidence level changes as you cross the boundary between
the two storm types.

You can also plot the variation in the prediction as the longitude changes with the latitude held constant, using a Response
Graph.

Cancel the User Defined Case Prediction dialog, and click the Response graph button on the Advanced tab of the Results
dialog.

Ensure that the Independent (X Axis) is LONGITUD, then click the Resp. Graph button. The graph displays how altering
the longitude affects the prediction.

You may have noticed an important issue here; in order to run the network, we need values for both the longitude and
latitude, yet this graph plots the prediction against just one of those variables: longitude. What has happened to the latitude?

The answer is that the latitude must be fixed to some value before generating the response graph. In effect, the response
graph is a one-dimensional slice through an actually two-dimensional response surface (which we plotted earlier). In
general, there may be more than two inputs to a neural network, and the response graph is a one dimensional slice through
an N dimensional surface (the response surface is a two dimensional slice through this same N dimensional surface).
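The slicing idea can be shown directly: given a prediction function of several inputs, a response graph fixes all but one of them. The two-input function below is a hypothetical stand-in for the trained network, chosen only to mimic the plateau behavior described earlier:

```python
def network_output(longitude, latitude):
    """Stand-in for the trained network: high activation (TROP) in the
    middle longitudes, low activation (BARO) elsewhere."""
    return 1.0 if 40 <= longitude <= 80 else 0.0

def response_graph(fn, fixed_latitude, longitudes):
    """One-dimensional slice through the response surface: sweep the
    longitude while the latitude is held at a fixed value."""
    return [fn(lon, fixed_latitude) for lon in longitudes]

print(response_graph(network_output, 15, [20, 50, 70, 90]))  # [0.0, 1.0, 1.0, 0.0]
```

With more inputs, the same pattern applies: every input except the one on the X axis must be fixed (or left missing, in which case an "average" value is substituted).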

Select the Fixed Independents tab of the Response Graph dialog. This tab contains an input spreadsheet just like the one
used in the User Defined Case Prediction dialog, except that one of the cells is grayed out and marked X. This is the variable
being tested to provide the X axis of the response graph. You can provide values for any remaining variables. If you don't
provide a value, then STATISTICA Neural Networks treats them as missing values, and substitutes an appropriate "average"
value.

Enter a value for the latitude (or select a case, which will retrieve its latitude), then click the Response graph button again
to plot another response graph if you want to see how changes in longitude affect the prediction at different latitudes.

If you want to view the activation levels in numeric form rather than as a graph, click the Response spreadsheet button.

In problems with more than two input variables, you can similarly perform a what-if analysis using the response surface, in
which case the effect of adjusting two variables while all others are held constant can be viewed.

Having completed our tour of network testing windows, cancel the Response Graph dialog if it is still showing, and click the
OK button on the Results dialog to complete the analysis.

SNN Example 4: Using the Network File Editor


In STATISTICA Neural Networks, networks can be saved and loaded from a network file that must have the extension .snn.

See Example 2: Creating a Neural Network to set up this example.

You can view the networks in the current set on the Networks/Ensembles tab of the Neural Networks (SNN) Startup Panel,
and select them for execution, as previously discussed in Example 3: Testing a Neural Network. You can also delete
unwanted networks by clicking the Delete models button on the Advanced tab of the Neural Network Editor dialog.

Further options for maintaining the network set are available using the Neural Network File Editor. Select Network Set
Editor from the Select analysis box on the Advanced tab of the Neural Networks (SNN) Startup Panel and click OK to
display the Neural Network File Editor dialog. This dialog allows you to view, select between and organize the various
neural networks and ensembles associated with a particular data set.

Click on the Networks tab to view the network list. The details shown are similar to those shown on the
Networks/Ensembles tab, the Results dialog, and elsewhere.

For the storm data, you should observe that the performance of the Linear networks is extremely poor. This is not surprising,
as we saw graphically that the two classes are clearly not linearly separable. Some of the networks are likely to have quite
poor performance; they have been included in the network set because they have very few units, and demonstrate what happens
if a very low complexity model is used. For example, you may find that an MLP network with a single input and a single
hidden unit has been selected. These networks are included because we asked the Intelligent Problem Solver to retain a
diverse range of networks with different complexity versus performance trade-offs. However, there is likely to be at least one
network with a performance of 1 (i.e., 100% correct classification). In this particular problem, RBF networks are particularly
effective (they are good at modeling strongly local clusters, which this problem certainly has), and there may be several of
these.

Once you have studied the individual networks, you may want to delete some. For example, having established that the
Linear networks are very poor in this problem domain, and that some of the smaller MLPs are also highly inferior, we may
decide to remove them and not to experiment with those types again.

To delete networks from the network set, click on the Advanced tab of the Neural Network File Editor dialog, then click the
Delete model button. Select the networks to be deleted from the Select Networks and/or Ensembles dialog, and click OK.
The networks are permanently removed from the network set.

It is easy to generate many hundreds of neural networks using STATISTICA Neural Networks, and the network file can
rapidly become unmanageable. One alternative is to periodically purge unwanted networks.

Another alternative is to specify a maximum number of models (networks and ensembles) to be stored in the set.

Click on the File Details tab of the Neural Network File Editor dialog, select the Limit number of models (standalone
networks and ensembles) allowed in the file check box, and enter the number in the Maximum box.

When your network file is full, STATISTICA Neural Networks has to decide what to do if you try to create more networks.
You can specify how this is done using the options on the Replacement Options tab. Click on this tab now.

If the Inform user in advance of creating model check box is selected, and you try to create new networks when the file is
already full, a warning dialog will be displayed to let you know that this might require discarding some models.

If you go ahead and create new models, STATISTICA Neural Networks enters a two-stage process. First, it identifies existing
models that are candidates to be replaced by the newly created models. You can specify the criteria used to identify these
candidates. Second, it determines whether the newly created models are actually better than the candidates. If they are, all is
well and replacement goes ahead. If not, you can set options to specify either that the replacement takes place anyway, or
that the new network is discarded.

The default under Criterion to select candidate model for possible replacement is Try to maintain diversity. This attempts to
maintain a diverse range of network types and performance versus complexity trade-offs (the best performing network of
each type is always kept, irrespective of its complexity). The default under Action if the new network is inferior to the
candidate for replacement is Replace anyway.

With these settings, you should find an interesting range of different models is retained. There may be some "churning"
among the less effective models, as new ones displace the worst performing, but at least newly-created models will be
retained for testing.

For the moment, leave these settings as the defaults, and click on the Networks tab.

You may want to ensure that certain networks do not get replaced, even if a network does have to be removed to make way
for new ones. To ensure that a network is not inadvertently lost, you may lock the network by selecting it from the network
list on the Networks tab, then selecting the Lock (prevents deletion/replacement until unlocked) check box. Locked networks
are never selected for removal, irrespective of performance.

SNN Example 5: The Iris Problem


For the next set of examples, we will use the classic Iris data set. This contains information about three different types of Iris
flowers - Iris Versicolor, Iris Virginica, and Iris Setosa. The data set contains measurements of four variables (sepal length
and width, and petal length and width). The Iris data set has a number of interesting features:

1. One of the classes (Iris Setosa) is linearly separable from the other two. However, the other two classes are not linearly
separable.

2. There is some overlap between the Versicolor and Virginica classes, so it is impossible to achieve a perfect classification
rate.

3. There is some redundancy in the four input variables, so it is possible to achieve a good solution with only three of them,
or even (with difficulty) from two, but the precise choice of best variables is not obvious.

Open the Irisdat.sta data set, then launch STATISTICA Neural Networks. On the Quick tab of the Startup Panel, select
Intelligent Problem Solver in the Select analysis list, and select the Classification button in the Problem type group box.

Click the Variables button, select IRISTYPE as the dependent variable, and select SEPALLEN, SEPALWID, PETALLEN, and
PETLWID as independents. Click the OK button on the variable selection dialog and on the Startup Panel to display the
Intelligent Problem Solver dialog.

On the Quick tab, select the Select a subset of independent variables check box, set the Optimization time to a timed search
of 5 minutes (0 hours), and set the Networks retained to 10.

Select the Feedback tab, and select the All networks tested (real time) option button in the Summary details of networks
group box. Select the Generate spreadsheet of summary details when finished check box, and click the OK button.

The Intelligent Problem Solver will display the Progress dialog, display the percentage of training accomplished and the
elapsed time, and add entries to the summary spreadsheet each time a new network is tested. If you check the configuration
code for these networks, you are likely to see that some begin with a 3, indicating that the Intelligent Problem Solver has
discovered effective networks with only three input variables.

You can either wait until the five minutes have elapsed, or, if the results do not seem to be improving with time, or you want
to continue with the examples right away, click the Finish button. The Intelligent Problem Solver will then end, adding the
networks it has elected to retain to the network set, and displaying summary statistics on these models in the list box at the
top of the Results dialog.

The Intelligent Problem Solver will almost certainly display a warning message (provided the Display summary message...
option is selected) when it completes, stating that the training and selection errors of the best of these networks are
significantly different from each other. A large discrepancy between these figures usually indicates that the network has
suffered over-learning, and so the training performance should be treated with extreme skepticism, and even the selection
error with some caution. However, if the selection and test errors are reasonably close together, you can have some
confidence that they do reflect the expected generalization performance.

This warning is indicative of a general problem with neural networks, which is that to be used with confidence you should
ideally have a large number of cases - typically at least in the high hundreds, and preferably in the high thousands. The
precise number required depends on the number of input variables, the complexity of the underlying function (and hence of
the correct network), and the amount of variance in the data. As a rule of thumb, you should aim to have at least five times
as many cases as connections in the network (if you don't know how many connections there will be, try squaring the
number of inputs if there are five or more, or use twenty-five otherwise), and preferably ten times as many. If you have fewer
cases than this, a neural network may still be effective (especially if the problem is highly nonlinear), but you should pay
careful attention to any disparities between the selection and test errors, which indicate unreliability when generalizing to new
cases. It is also worth considering resampling and constructing ensembles, which are discussed in a later example.
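The rule of thumb is easy to check for a given architecture. The arithmetic below assumes a fully connected one-hidden-layer MLP with bias weights; STATISTICA's exact connection counts may differ by network type:

```python
def mlp_connections(inputs, hidden, outputs):
    """Weights in a fully connected one-hidden-layer MLP, counting a bias
    weight on each hidden and output unit."""
    return (inputs + 1) * hidden + (hidden + 1) * outputs

conns = mlp_connections(4, 3, 3)     # e.g. a 4:4-3-3:3 Iris network
print(conns, 5 * conns, 10 * conns)  # 27 weights -> 135 cases minimum, 270 preferred
```

At 150 cases, the Iris data set only just clears the minimum for this architecture, which is why the warning above is worth taking seriously.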

If you display the Classification spreadsheet by clicking the Descriptive statistics button, you are likely to see perfect or
near-perfect classification on the training set, with a small number of errors in the selection and/or test sets. In this case, the
Confusion Matrix spreadsheet comes into its own, as it shows exactly which classes the misclassified cases were assigned to. In this
problem, the only likely source of misclassification is confusion between Versicolors and Virginicas.

Sensitivity Analysis. Sensitivity Analysis gives you some information about the relative importance of the variables used in
a neural network. In sensitivity analysis, STATISTICA Neural Networks tests how the neural network would cope if each of
its input variables were unavailable. STATISTICA Neural Networks has facilities to automatically compensate for missing
values (typically, it substitutes the sample mean for a missing value). In sensitivity analysis, the data set is submitted to the
network repeatedly, with each variable in turn treated as missing, and the resulting network error is recorded. If an important
variable is removed in this fashion, the error will increase a great deal; if an unimportant variable is removed, the error will
not increase very much.

Click the Sensitivity analysis button on the Quick tab of the Results dialog to conduct a sensitivity analysis.

The spreadsheet shows, for each selected model, the ratio of the network error with a given input omitted to the network
error with the input available. It also shows the rank order of these ratios for each input, which puts the input variables into
order of importance. If the ratio is 1 or less, the network actually performs better if the variable is omitted entirely - a sure
sign that it should be pruned from the network (the Intelligent Problem Solver will have already done this for you in most
cases).
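
The mean-substitution procedure behind these ratios can be sketched as follows. Here `predict` and `error` stand in for any fitted model and error measure; the names are illustrative and not part of the SNN interface:

```python
from statistics import mean

# Sketch of mean-substitution sensitivity analysis: resubmit the data
# with each variable in turn replaced by its sample mean (treated as
# "missing"), and report the ratio of the resulting error to the
# baseline error with all variables available.
def sensitivity_ratios(rows, targets, predict, error):
    n_vars = len(rows[0])
    base = error(targets, [predict(r) for r in rows])
    ratios = []
    for j in range(n_vars):
        mj = mean(r[j] for r in rows)
        substituted = [r[:j] + [mj] + r[j + 1:] for r in rows]
        ratios.append(error(targets, [predict(r) for r in substituted]) / base)
    return ratios  # ratio > 1: variable matters; <= 1: candidate for pruning
```

Ranking the ratios in descending order reproduces the order of importance reported in the spreadsheet.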

It is likely that the best network discovered by the Intelligent Problem Solver has either three or four input variables. The
sensitivity analysis, in most circumstances, will rank the variables in the order: PETALLEN, PETLWID, SEPALWID,
SEPALLEN. If your network has only three inputs, it is likely that SEPALLEN has been omitted. However, occasionally a
network is discovered where SEPALWID and SEPALLEN are reversed in sensitivity.

This is indicative of a limitation of sensitivity analysis. We tend to interpret the sensitivities as indicating the relative
importance of variables. However, they actually measure only the importance of variables in the context of a particular
neural model. Variables usually exhibit various forms of interdependency and redundancy. If several variables are
correlated, then the training algorithm may arbitrarily choose some combination of them and the sensitivities may reflect
this, giving inconsistent results between different networks. It is usually best to run sensitivity analysis on a number of
networks, and to draw conclusions only from consistent results. Nonetheless, sensitivity analysis is extremely useful in
helping you to understand how important variables are.

In the Iris problem, you are likely to find that Multilayer Perceptron (MLP) networks perform significantly better than either
Radial Basis Function (RBF) or Linear networks. In fact, this is typical of most problem domains - Multilayer Perceptrons
usually have the best performance.

SNN Example 6: Advanced Use of the IPS


Studying the networks created by the Intelligent Problem Solver can give you a very good picture of just what type of
network performs well, and what input variables should be used. Running the Intelligent Problem Solver with the Irisdat.sta
data file reveals that Multilayer Perceptrons perform best, that they tend to have a small number of hidden units (five or
fewer), and that they have three or four inputs (with SEPALLEN optional) (see Example 5: The Iris Problem).

We could simply run the Intelligent Problem Solver in "brute force mode" to attempt to find an optimal network - say
running it for several hours. In that time, it will try out thousands of networks on this relatively simple problem, and no
doubt select some extremely good ones. However, the Intelligent Problem Solver can work much more efficiently if its
design task is more constrained. Using the advanced features of the Intelligent Problem Solver, you can specify exactly the
input variables, number of hidden units and network type, among other features. This can give you a much more accurate
picture of what can be achieved. In problems with larger numbers of cases and variables, it can also help you to locate a very
good solution with acceptable execution time.

As an example, we will train a Multilayer Perceptron network with all four available inputs, and five hidden units. As we are
looking for a very specific type of network, we will retain only the best network created by the Intelligent Problem Solver.

Open the data file Irisdat.sta via the File - Open menu; it is installed in the /Examples/Datasets directory of STATISTICA.
Select Neural Networks from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup
Panel. Select the Classification option button in the Problem type group box, select Intelligent Problem Solver on the Neural
Networks Startup Panel - Quick tab, and click the OK button. A standard variable selection dialog will be displayed.

Select IRISTYPE as the Categorical Output variable and SEPALLEN, SEPALWID, PETALLEN, and PETLWID as the
Continuous Input variables. Click the OK button to display the Intelligent Problem Solver dialog.

On the Quick tab, clear the Select a subset of independent variables check box to ensure that all the networks tested use all
four input variables.

Click on the Types tab and select the Three layer perceptron check box (four layer networks are only occasionally needed,
for some rare classes of difficult problem). Clear all the other check boxes.

Click on the Complexity tab. Here you can specify how many hidden units should be used for each network type. You can
specify a minimum and maximum, and STATISTICA Neural Networks will experiment with figures between these two
limits. Enter the value 5 for both the Minimum and Maximum for Three layer MLP, layer 2. Specifying the same value for
the minimum and maximum ensures that the size is fixed to a given number.

Click on the Quick tab again, and select an Optimization time of 20 Networks tested. Set Networks retained to 1, as we only
want to keep the best network found.

Click the OK button.

If you run the Intelligent Problem Solver for a sufficient time period, you may reduce the selection error to 0.12 or even
lower. However, the classification performance is unlikely to improve - there will always be a couple of cases that are misclassified, as
there is some overlap between classes and so no technique can ever achieve perfect classification on this data set. The
existence of an (unknown) upper bound on performance (the so-called Bayes error) should always be borne in mind when
designing neural networks.

A more valuable line of enquiry may be to try reducing the number of input variables and hidden units, in an attempt to
produce a more compact, and therefore more reliable, network with equivalent performance.

Run the Intelligent Problem Solver again, this time without the SEPALLEN input variable. You are likely to discover that
networks with equal, and possibly even slightly better, performance than the four input case can be found.

Even with just the PETALLEN and PETLWID variables, it is possible (although more difficult) to find networks with
extremely good performance, perhaps even with performance to match the three and four input networks. However, the
relative difficulty in finding these networks is an indication that the class boundaries in two dimensions are more complex
than in three (you can verify the complexity visually using the response surface window - a good solution with two inputs
requires a sharp peninsula-shaped response surface), and therefore retaining at least one extra input variable is justified.

SNN Example 7: Classification Confidence Thresholds


A neural network processes numeric information. In a classification problem, the desired output of the neural network is a
nominal (i.e. text valued, categorical) variable. To convert from numeric to nominal form, STATISTICA Neural Networks
includes post-processing functions that compare the activation levels in the output neurons to classification confidence
thresholds, and thus determine the class.

In a two-state classification problem, the Intelligent Problem Solver automatically assigns a single output unit, the activation
of which is in the range (0,1). Two thresholds can optionally be set: the accept and reject thresholds. If the activation is
above the accept threshold, the case is deemed to belong to the second class; if it is below the reject threshold, the case is
deemed to belong to the first class, and if it is in between, the prediction is deemed to be "unknown" (i.e. the classification is
dubious, perhaps reflecting a point in areas of overlap between the two classes). If the accept threshold is equal to the reject
threshold, then the network always assigns one class or the other; unknown classifications never occur. A network that can
produce an unknown classification because its accept and reject thresholds are not equal is said to have a "doubt option."
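
The two-class decision rule above can be sketched as follows. This is an illustration only (SNN applies this logic internally), and the behavior exactly at the threshold values is an assumption:

```python
# Sketch of the two-class doubt option: activation above the accept
# threshold -> second class; below the reject threshold -> first class;
# in between -> "unknown". With accept == reject (and >= at the
# boundary), every case is assigned one class or the other.
def classify(activation, accept=0.5, reject=0.5,
             classes=("class one", "class two")):
    if activation >= accept:
        return classes[1]
    if activation <= reject:
        return classes[0]
    return "unknown"
```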

In a two-class problem, by default the Intelligent Problem Solver automatically assigns a single confidence threshold (i.e. the
accept threshold is set equal to the reject threshold) such that the misclassification rates on the two classes are equalized.
Alternatively, you can set the classification thresholds explicitly on the Thresholds tab of the IPS.

In a classification problem with more than two states, such as the Iris problem, STATISTICA Neural Networks assigns one
output unit for each class. By default, the class is simply assigned to the highest activation output unit (the so-called
"winner-takes-all" algorithm). Alternatively, you can again specify that classification thresholds be used on the Thresholds
tab. The post-processing function checks that one of these units has an activation above the accept threshold and the others
below the reject threshold, in which case the predicted class is given by the winning unit. If no unit is a clear winner, then
the prediction is "unknown."
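
The multi-class rule can be sketched similarly (again purely illustrative; `classify_multi` is not an SNN function):

```python
# Sketch of the multi-class rule: plain winner-takes-all when no
# thresholds are given; otherwise the winning unit must exceed the
# accept threshold while every other unit falls below the reject
# threshold, or the prediction is "unknown".
def classify_multi(activations, labels, accept=None, reject=None):
    winner = max(range(len(activations)), key=lambda i: activations[i])
    if accept is None or reject is None:      # winner-takes-all
        return labels[winner]
    others_low = all(a < reject for i, a in enumerate(activations)
                     if i != winner)
    if activations[winner] > accept and others_low:
        return labels[winner]
    return "unknown"
```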

Open the Irisdat.sta data set, and run the Intelligent Problem Solver. Use the same selections as in the previous example,
except for the following:

Select the Thresholds tab of the IPS, select the Use the thresholds specified below option button, and then set the Accept
threshold to 0.9 and the Reject threshold to 0.1.

Click the OK button. When the Results dialog is displayed, click the Descriptive statistics button.

In the Classification spreadsheet, you are likely to see that some entries have now appeared in the "unknown" row (all of
them VERSICOL and VIRGINIC, which are the two types with overlapping distributions). You are also likely to see that
there are fewer (and quite probably no) entries in the "wrong" row; that is, no misclassifications. This is probably a more
desirable situation - if the classification is dubious, it is often better to be told so, rather than to make an essentially arbitrary
decision.

Resume the analysis, and click the Predictions button on the Results dialog.

You should find that some cases now have a blank output cell. The blank cell, indicating a missing value, shows that the
prediction was classified as unknown by the network's post-processing function.

It is, of course, always possible to examine the confidence levels of the network by selecting the Predictions tab of the
Results dialog and selecting the Confidence levels option button in the Prediction type(s) shown group box.

Do this, and then click the Predictions button.

The confidence levels give you some more detailed feeling for the classification of individual cases, especially those over
which there is some doubt, and for some network types may be interpreted as probabilities.

Open the Credit.sta data set. This file contains data concerned with granting loans to applicants, a classic neural network
application. The data is derived from a real database; the original variables have been disguised by replacing the names and
nominal values with code letters.

Among other interesting features, many of the input variables in the Credit applications data set are nominal valued
(STATISTICA Neural Networks has preprocessing functions to handle this), and there are a number of missing values
(STATISTICA Neural Networks attaches a "missing value substitution" function to each input variable). As a consequence,
you do not need to concern yourself with these features of the data set; they are handled automatically.

Run the Intelligent Problem Solver, specifying the following options:

Variables: select the RISK dependent variable; select V1-V9 as independent variables. Set the Optimization time to 10
iterations. Set the Networks retained to 1 network. Clear the Select a subset of independent variables check box.

Select the Thresholds tab, and select the Calculate minimum loss threshold option button. Select the Types tab, and select the
Three layer perceptron check box; clear the other network types.

Click the OK button.

The Intelligent Problem Solver will automatically determine a single classification threshold (accept threshold equals reject
threshold) to produce the optimum misclassification rate. This threshold is chosen to equalize sensitivity and
specificity (i.e., the misclassification rates in the two classes will be made as equal as possible).

By varying the classification confidence threshold, you can produce networks with different trade-offs between false
positive and false negative errors. With a classification threshold of 0, all cases are classified as class two, and so there are
maximum false positives and no false negatives; with a threshold of 1, the opposite is true. The ROC curve (Receiver
Operating Characteristic curve) summarizes the entire range of classification performance of a two-class classifier with no
doubt option. The Y axis displays the sensitivity of the network (proportion of positive cases correctly classified), and the X
axis one minus the specificity (the proportion of negative cases incorrectly classified). As the decision threshold is raised, the network's
performance travels along the ROC curve from left to right.
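
The way an ROC curve is traced can be sketched by sweeping the decision threshold over the network's output scores (a minimal illustration, not SNN's implementation):

```python
# Sketch of ROC curve construction: at each candidate threshold, record
# (1 - specificity, sensitivity), where scores are output activations
# for cases with known positive/negative class membership.
def roc_points(scores, is_positive):
    pos = sum(is_positive)
    neg = len(is_positive) - pos
    points = []
    for t in sorted(set(scores)) + [max(scores) + 1]:
        tp = sum(1 for s, p in zip(scores, is_positive) if p and s >= t)
        fp = sum(1 for s, p in zip(scores, is_positive) if not p and s >= t)
        points.append((fp / neg, tp / pos))  # (1 - specificity, sensitivity)
    return points
```

A perfectly separable score set yields points that hug the left and top of the unit square, as described below for an ideal classifier.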

To display the ROC curve, select the Advanced tab of the Results dialog, and click the ROC curve button.

A perfect classifier would hug the left and top axes, and the area underneath it would be 1.0 - this indicates that all cases are
correctly classified with any threshold. A random neural network should approximately follow the leading diagonal, and
have an area underneath of 0.5. In practice, an area underneath the ROC curve as close as possible to 1.0 indicates a good
neural network, and further shows whether the network performs well across the range of possible trade-offs between the
two types of error (classifiers that have high sensitivity but low specificity, or vice versa, may not be very useful).

In the Credit example, it is possible to get an area under the ROC curve of 0.93 or more, indicating good performance.

Sometimes, it may be more important to avoid errors in one class than in the other (this is frequently the case when
predicting disease in medical applications, for example). You can specify a loss coefficient, which states, in essence, how
much more important it is to classify class two cases correctly than to classify class one cases correctly. As the loss
coefficient is raised, fewer errors will be made on the class two cases at the expense of more on the class one cases.
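
A hedged sketch of a minimum-loss threshold search along these lines follows; SNN's exact procedure may differ, and the weighting scheme here is an assumption:

```python
# Sketch of a minimum-loss threshold: weight errors on class-two cases
# by the loss coefficient and pick the cutoff with the smallest total
# weighted loss. Raising the coefficient lowers the chosen threshold,
# trading class-one errors for fewer class-two errors.
def min_loss_threshold(scores, is_class_two, loss_coefficient=1.0):
    best_t, best_loss = None, float("inf")
    for t in sorted(set(scores)):
        e1 = sum(1 for s, c2 in zip(scores, is_class_two)
                 if not c2 and s >= t)        # class one misclassified
        e2 = sum(1 for s, c2 in zip(scores, is_class_two)
                 if c2 and s < t)             # class two misclassified
        loss = e1 + loss_coefficient * e2
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t
```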

Click the Descriptive statistics button, and observe the misclassification rates of Bad and Good credit risks on the
Classification spreadsheet.

Click OK to finish the analysis, then run the Intelligent Problem Solver again. Select the Thresholds tab.

Alter the Loss coefficient at the bottom of the tab to 2, run the Intelligent Problem Solver, click the Descriptive statistics
button, and observe the difference in the misclassification rates. The misclassification rate on the good credit risks drops
substantially, but this is balanced against an increase in the error rate on the bad risks.

SNN Example 8: Regression Problems


The examples we have looked at so far are all classification problems - that is, the output is a nominal variable, indicating
that the case belongs to one of a small number of discrete classes.

Neural networks can also be used for regression problems, where the output is a continuous numeric variable, in which
context they act as a non-linear regression technique, where the complexity of the non-linear regression curve is controlled
"semi-parametrically" - the number of units in the neural network governs the complexity of the solution, but of course the
designer does not explicitly state a functional form for the regression curve.

As an example, we will exploit the redundancy in the Irisdat.sta data set by predicting the value of the PETALWID variable
from the SEPALLEN, SEPALWID, and PETALLEN variables.

First, open Irisdat.sta via the File - Open menu; it is in the /Examples/Datasets directory of STATISTICA. Select Neural
Networks from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup Panel. Select
the Regression option button in the Problem type group box and select Intelligent Problem Solver on the Neural Networks
Startup Panel - Quick tab.

Click the OK button. A standard variable selection dialog will be displayed. Select the PETALWID variable as the dependent
variable, and the SEPALLEN, SEPALWID, and PETALLEN variables as independent variables.

Click OK to display the Intelligent Problem Solver dialog. Then click OK again to run the Intelligent Problem Solver.

The figure reported for performance in the network set, when the Intelligent Problem Solver completes, has a different
meaning than for classification. In regression problems, the figure given is the regression ratio, also reported in the model
summary spreadsheet, displayed when you click the Models summary button on the Results dialog.

This is the ratio between the standard deviations of the residual and the target data. If only the observed output data were
available, with no input variables, the best estimate we could make of the target for a new case would be the mean of the
observed values in the training set. If that were done, the residuals would be the observed values minus the mean, and the
average error would therefore be the standard deviation of the target variable. When we use a neural network, we obviously
expect the residuals to be smaller than this, and the ratio reported measures this improvement. A ratio of 1.0 implies that our
network is doing no better than the most naive estimate available, and consequently that either there is no useful information
in the input variables, or that the network is failing to use the information successfully. As the network's performance
improves, the ratio becomes closer to zero. This ratio of standard deviations is closely related to the "explained variance" of
the model (it is equal to one minus the explained standard deviation).
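
The ratio just described can be computed directly, as a sketch:

```python
from statistics import pstdev

# The standard deviation ratio: the s.d. of the residuals divided by
# the s.d. of the observed target values. A mean-only "model" leaves
# residuals equal to the deviations from the mean, so its ratio is
# exactly 1.0; a better model pushes the ratio toward 0.
def sd_ratio(targets, predictions):
    residuals = [t - p for t, p in zip(targets, predictions)]
    return pstdev(residuals) / pstdev(targets)
```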

Comparison with linear models. Click the Descriptive statistics button to produce a regression spreadsheet which displays
the standard Pearson-R correlation coefficient between the predicted and target values (in addition to other descriptive
statistics). In standard linear modeling, it is common practice to measure the goodness-of-fit of the model by the correlation
between the input and output variables, which is exactly equal to the correlation between predicted and observed values, if
the variables are positively correlated, or to the negative of this correlation if they are negatively correlated. You can thus
use the correlation coefficient to assess quite directly the comparative performance of standard linear modeling and neural
networks.

This comparison is made easier as STATISTICA Neural Networks supports least squares linear models, in the guise of Linear
neural networks. If you have run the Intelligent Problem Solver with a diverse range of network types (and with more
than one network retained), you should find that a linear model has been created. The performance should be close to, but
probably slightly inferior to, the best non-linear neural network model.

It is always worth retaining at least one linear model as a standard of comparison. If the performance of the neural networks
is inferior to a linear model, then you should always choose the latter. This follows the principle of parsimony and Occam's
razor - always choose the simplest explanation that fits the facts. If you decide to use a linear model, you can assign all the
cases to the training set and train a linear network on that for best accuracy (linear models cannot over-learn, and so there is
no need to maintain selection and test subsets).

Although many problems are inherently non-linear, this is by no means always the case, and trying to fit a flexible curve to a
straight line relationship is clearly inappropriate. It is also the case that if you have a small number of cases, or a moderate
number of cases with a large amount of variance, you may be unable to fit a non-linear model with any accuracy even if the
underlying relationship is non-linear. If your neural networks have only a marginal improvement in accuracy over the linear
models, and you have a restricted number of cases, you should consider this possibility. In particular, if you observe that
your neural networks have slightly better training error and performance than the linear model, but worse selection
performance, then the improvement is almost certainly artifactual, and it is better to choose the "safer" option of a linear
model.

In the author's experiment with the Irisdat.sta regression problem, the best neural model achieved a standard deviation ratio
of 0.2304 on the selection set, compared with 0.2365 using a linear model, an improvement of a few percent (you are likely
to get slightly different results, as the random division of the data set will be different). Since the winning network was itself
not very complex (with four hidden units), this improvement is sufficient to justify keeping the neural network. If the
performance dropped to a one percent improvement or less, with such a small number of cases, it would be better to stick
with the linear model.

Now click the Residuals button. This generates a spreadsheet of raw residual values (with predicted values for a standard of
comparison). You may also specify other pieces of information to be stored with the residual, and different forms of
residual, on the Residuals tab of the Results dialog.

SNN Example 9: Case Subsets


We have already mentioned that the data set is divided into three subsets: the training, selection, and test cases.

To reiterate, the neural networks are trained using the training subset only. The selection subset is used to keep an
independent check on the performance of the networks during training, with deterioration in the selection error indicating
over-learning. If over-learning occurs, the Intelligent Problem Solver (IPS) stops training the network and restores it to the
state with minimum selection error.

The selection error is also used by the IPS to select between the available networks. However, if a large number of networks
is tested, a random sampling effect can kick in, and you may get a network with a good selection error that is not actually
indicative of good generalization capabilities. Therefore, a third subset (the test subset) is maintained, and you can visually
inspect performance after training. Providing that selection and test errors are reasonably close together, the network is likely
to generalize well.

By default, the Intelligent Problem Solver randomly assigns the available cases in the proportions 2:1:1 between the training,
selection, and test subsets. Each time the IPS is run, a different random assignment of cases is made, and then used for all the
networks created by that run of the IPS.
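
The default 2:1:1 assignment can be sketched as follows (an illustration of the idea only; SNN performs this division internally):

```python
import random

# Sketch of the default random 2:1:1 split: shuffle the case indices,
# then take roughly half for training, a quarter for selection, and
# the remainder for testing.
def split_cases(n_cases, seed=None):
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    n_train = n_cases // 2
    n_select = n_cases // 4
    return (idx[:n_train],
            idx[n_train:n_train + n_select],
            idx[n_train + n_select:])
```

For the 150-case Iris data set this yields roughly 75 training, 37 selection, and 38 test cases.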

You can specify that the cases be reassigned randomly for each network created, or that the same case division be used as
was used to train a preexisting network. You can also specify a subset variable (on the Advanced tab of the Startup Panel)
that explicitly lays out which cases should be assigned to which subset. You may want to exercise these options for a
number of reasons:

1. Reassigning cases randomly for each network can reduce sampling bias if you choose to form the networks created into
an ensemble and average across their predictions.

2. Conversely, ensuring that several runs use the same subset division makes it easier to compare results. This is particularly
helpful if you are trying to experimentally determine the best setting for some design parameter, such as the number of
hidden units, rather than actually seeking a final solution.

3. If over a number of tests you find that the test and selection results are reasonably consistent, you can be fairly confident
that the network will generalize effectively. In this case, you may decide to assign all the cases to training and selection,
so as to produce a more accurate solution. You might also (although with a greater sense of caution) decide to reduce the
number of selection cases, reassigning to the training subset, if the training and selection errors are consistent.

If you reassign the subsets, you should be careful in comparing the performance of networks in the network set trained with
different subsets, as the performance figures are not directly comparable.

Run the Intelligent Problem Solver on the Irisdat.sta data set again.

Click the Sampling button in the Intelligent Problem Solver to display the Sampling of Case Subsets for Intelligent Problem
Solver dialog.

Change the number of Training cases to 100, and the number of Selection cases to 50. Then, double click on Test to reset
that field to zero and to balance the remainder. There are sufficiently few dubious cases in the Iris problem that the
maintenance of a test set is not necessary. Click OK to confirm the new case division.

Having established a good idea of the expected error rate, it is safe to run the IPS for a limited period without using a test set.
However, as the networks habitually overfit the Iris data, you should not remove the selection subset.

SNN Example 10: PNN and GRNN Networks


The standard neural network architectures (multilayer perceptrons and radial basis functions) infer a parameterized model
(the weights forming the parameters) from available training data. The parameterized model (the network) is usually much
smaller than the training data, and can be executed quite quickly, although the time taken to train the model may be long.

An alternative approach is to model the function more-or-less directly from the training data. This has the advantage that
there is no need for training (or, at least, the "training" required is very simple, consisting of little more than changing the
form in which the training data is held). The disadvantage is that the resulting model is large, and so consumes much
memory and is relatively slow to execute.

Probabilistic Neural Networks (PNNs) and Generalized Regression Neural Networks (GRNNs) are such methods,
"disguised" as neural networks, and used for classification and regression respectively. The first layer of these networks
contains radial units, which in the PNN actually store every training case, and in the GRNN store a large number of cluster
centers (usually not vastly smaller in number than the training set). These radial units have an output activation that is a
Gaussian function centered at the stored point (and hence acts, as it were, as "evidence" that the modeled function has some
probability density distributed around that point). Subsequent layers combine these outputs into estimates of class
probabilities (in the case of PNNs) or of the regression value (in the case of GRNNs).
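
A minimal PNN along these lines can be sketched as follows. The single isotropic Gaussian width (`smoothing`) is a simplifying assumption; in practice the smoothing factor is a control parameter that must be chosen with care:

```python
from math import exp

# Sketch of a PNN classifier: one Gaussian kernel per training case,
# summed per class; the normalized per-class sums serve as class
# probability estimates, and the largest wins.
def pnn_classify(train_rows, train_labels, x, smoothing=0.1):
    sums = {}
    for row, label in zip(train_rows, train_labels):
        d2 = sum((a - b) ** 2 for a, b in zip(row, x))
        sums[label] = sums.get(label, 0.0) + exp(-d2 / (2 * smoothing ** 2))
    total = sum(sums.values())
    probs = {c: s / total for c, s in sums.items()}
    return max(probs, key=probs.get), probs
```

With a sharp enough `smoothing`, every training case is dominated by its own kernel, which is why a PNN can reach 100% accuracy on the training set, as noted later in this example.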

If you have a reasonably small number of cases (say, 500 or fewer), then it is well worth considering the use of a PNN or
GRNN. We will demonstrate a PNN using the IRIS data set.

Open the Irisdat.sta data set, and start the Intelligent Problem Solver using the following settings:

Select the Classification option button in the Problem type group box on the Neural Networks Startup Panel - Quick tab and
Intelligent Problem Solver in the Select analysis list.

Variables: select IRISTYPE as the dependent variable, and SEPALLEN, SEPALWID, PETALLEN, and PETLWID as
independents. Click the OK button on the variable selection dialog, and on the Startup Panel to display the Intelligent
Problem Solver.

On the Intelligent Problem Solver - Quick tab, set Optimization time - Networks tested to 10.

Set Networks retained to 3.

Select the Select a subset of independent variables check box.

Display the Types tab, select the PNN or GRNN option button, and clear the other network types option buttons.

Click the OK button.

The search should be extremely fast, and the best network is likely to have three inputs (possibly four, depending on the
division of your training set).

The very fast training time makes PNNs and GRNNs extremely useful for initial experiments, and performance is usually
broadly comparable with the other network types. In most problem domains Multilayer Perceptrons seem to achieve
somewhat better performance, and are also far more compact. However, this is not always the case.

The Sonar data set contains data from a problem (described by Gorman and Sejnowski), where sonar measurements have
been taken from objects on the sea bed, either spherical rocks or mines. The objective is to tell them apart. From experience,
it appears extremely difficult to achieve classification rates of better than 83% on this data using Multilayer Perceptrons (you
may want to verify this using the Intelligent Problem Solver). However, PNN networks solve this problem extremely
effectively.

Open the Sonar data file, and run the Intelligent Problem Solver. Select the following options:

On the Startup Panel, select Classification as the Problem type.

Variables: select dependent variable TARGET; select independent variables VAR1-VAR60.

On the IPS dialog:

Set Optimization time - Networks tested to 20.

Set Networks retained to 1.

Select the Select a subset of independent variables check box.

Display the Thresholds tab, and select the Calculate minimum loss threshold option button.

Display the Types tab, select the PNN or GRNN option button, and clear the other network types.

Click OK.

The Intelligent Problem Solver should discover a network with approximately fifty input variables, and a performance rating
of 0.85 or above. This performance is quite sensitive to the data set division; however if you repeat the experiment,
reassigning the subsets each time, you may get quite different figures. This illustrates the dangers of building models with a
large number of input variables and a small number of cases.

Note: A feature of PNNs is that, with the right control parameters, they almost invariably achieve a 100% correct
classification rate on the training set. This should not be too surprising, as the technique models a probability density
function by locating a Gaussian peak at each training case, and if these peaks are sharp enough (and there are not two cases
of different classes exactly coincident) then this 100% performance can be guaranteed. Unfortunately, this tells us nothing
about the generalization performance of the networks. Even more so than with other types of network, you should not allow
performance on the training set to influence you. Assess performance by considering the selection and test sets only.
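The mechanism behind this note is easy to sketch (an illustrative NumPy reconstruction, not the STATISTICA implementation): a PNN scores each class by summing a Gaussian kernel centered on every training case of that class, so with a small enough smoothing factor the nearest case dominates and each training case is classified as its own class.

```python
import numpy as np

def pnn_classify(train_X, train_y, x, sigma=0.1):
    """Classify x by summing a Gaussian kernel centered on each training
    case, separately per class (a Parzen-window density estimate)."""
    classes = np.unique(train_y)
    scores = []
    for c in classes:
        d2 = np.sum((train_X[train_y == c] - x) ** 2, axis=1)
        scores.append(np.exp(-d2 / (2.0 * sigma ** 2)).sum())
    return classes[int(np.argmax(scores))]
```

With a sharp kernel, every training case falls under its own Gaussian peak, which is why 100% training-set performance is almost guaranteed and tells you nothing about generalization.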

SNN Example 11: Creating a Custom Neural Network


The Custom Network Designer can be used to design and train neural networks at a much lower level than the Intelligent
Problem Solver, specifying the precise choice of network architecture and training algorithm(s), while still maintaining
considerable ease of use. Finer control over the design process can allow you to achieve better results, albeit with greater
design effort.

Open Irisdat.sta via the File - Open menu; it is in the /Examples/Datasets directory of STATISTICA. Select Neural Networks
from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup Panel. Select the
Classification option button in the Problem type group box and select Custom Network Designer on the Neural Networks
Startup Panel - Quick tab. Click the Variables button and select IRISTYPE as the Categorical Output and SEPALLEN,
SEPALWID, PETALLEN, and PETLWID as the Continuous Inputs.

Click OK to display the Custom Network Designer dialog.

On the Quick tab, select the type of network you would like to create. For the purposes of this example, select the Multilayer
Perceptron network type.

Click on the Units tab. Here, you can specify the number of hidden layers, and the number of units in each of those layers.
Leave the Number of hidden layers at 1 (which is the best selection for the vast majority of neural network applications), but
alter the Number of units per layer - Hidden layer 1 to 5.

Click OK to display the Train Multilayer Perceptron dialog.

See Example 12: Training the Network and Example 13: Stopping Conditions.

Note. Neural networks in STATISTICA Neural Networks are quite complex, including a range of user-selectable features
such as activation functions and input/output pre- and post-processing functions. Clicking OK on the Custom Network
Designer dialog assigns defaults to all these features. Alternatively, you can click the Edit button on the Custom Network
Designer dialog. This displays the Neural Network Editor, allowing you to further customize the network before
commencing training.

SNN Example 12: Training the Network


Having created an appropriate network, the next stage is to train the network.

STATISTICA Neural Networks supports a number of popular and advanced training algorithms for multilayer perceptrons,
including back propagation, conjugate gradient descent, Quasi-Newton, Levenberg-Marquardt, quick propagation, and
Delta-bar-Delta learning. For the purposes of this example, we will concentrate on the most popular learning algorithm: back
propagation (see Patterson, 1996; Haykin, 1994; Fausett, 1994).

The back propagation algorithm works by iteratively training the network using the training data available. On each iteration
(known as an Epoch), the entire training set is presented to the network, one case at a time. The inputs are presented to the
network, which is executed to produce output values.

The output values are compared with the desired outputs present in the data set, and the error between the desired and actual
outcome is used to adjust the weights in the network so that the error is likely to be lower.

The algorithm must compromise between the various cases, attempting to alter the weights so that the overall error across
the whole training set is reduced. As the algorithm works on a case-by-case basis, the overall error does not always decrease.
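The epoch loop just described can be sketched as follows (an illustrative on-line back propagation pass for a one-hidden-layer network with sigmoid units and a sum-squared error; biases are omitted for brevity, and this is only a textbook reconstruction, not the STATISTICA code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W1, W2, lr=0.1):
    """One epoch of on-line back propagation: each case is presented
    in turn and the weights are nudged downhill on its error."""
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x)             # hidden-layer activations
        y = sigmoid(W2 @ h)             # network outputs
        dy = (y - t) * y * (1 - y)      # output error signal (sum-squared error)
        dh = (W2.T @ dy) * h * (1 - h)  # error propagated back to the hidden layer
        W2 -= lr * np.outer(dy, h)      # adjust output-layer weights
        W1 -= lr * np.outer(dh, x)      # adjust hidden-layer weights
    return W1, W2
```

Because each case pulls the weights in its own direction, the overall error measured across the whole training set can rise on some epochs even though each individual update is downhill.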

STATISTICA Neural Networks can track the error performance of the network on a graph, allowing you to see how training
progresses.

To train the Irisdat.sta network, observing the performance of the algorithm, follow the steps below.

Training Using Back Propagation. The Train Multilayer Perceptron dialog implements a two-phase training algorithm, for
reasons that we will discuss later. For the moment, we will use a single phase algorithm. Clear the Phase two check box on
the Quick tab to disable the second phase of training (the phase two options will be grayed out).

Select the Interactive tab, and select the Interactive training check box. This ensures that a graph will be displayed during
training.

Click the OK button to start the training algorithm. The Training in Progress dialog will be displayed, and two lines will be
plotted on the graph. The first gives the error on the training subset (data used to optimize the network's weights), while the
second gives the error on the selection subset (which gives a useful cross-check on training performance).

When the training algorithm stops, click the Extend button, and then click OK on the Extended Dot Product Training dialog
to plot more epochs of training. You can repeat this process until you are happy with the results achieved. Then click the
first OK button on the Training in Progress dialog (the one without a graph symbol).

The Results dialog is now displayed, just as it was after running the Intelligent Problem Solver. Use the Descriptive statistics
and Predictions buttons to test the network's performance, as before, and compare performance with the earlier networks by
checking the summary statistics in the list box at the top of the Results dialog.

When you have finished testing the network, click the OK button to commit it to the network set.

Note. Had we decided that the network was not good enough, we could click the Cancel button on the Results dialog to
redisplay the Train Multilayer Perceptron dialog, and then try training the network again.

Retraining an Existing Network. Rather than continually creating new networks, which will rapidly fill the network file,
we will retrain the network we have just created (this is only appropriate, of course, if you are maintaining the same network
architecture - type, input and output variables, number of hidden layers and units, etc.).

Select the network you would like to retrain on the Startup Panel Networks/Ensembles tab, and then select Retrain network
from the Select analysis list on the Advanced tab.

Optimizing Training Performance. The performance of the back propagation algorithm is affected by a number of
parameters, which are available on the Train Multilayer Perceptron dialog. The default values are a reasonable starting point
for most real-world applications, but can always be fine-tuned; and can be improved somewhat for the IRIS problem.

Here is a brief description of the major parameters, with suggested settings for Iris:

On the Train Multilayer Perceptron - Quick tab:

Epochs. This box specifies how many training epochs should be undertaken on each click of the OK button. The default of
100 is rather low; increase it to 500 by highlighting the number and typing in 500.

Learning rate. A higher learning rate tends to speed the algorithm up, but may introduce instabilities in performance for
some problems (especially if data is noisy). Iris benefits from a somewhat lower learning rate with this network
configuration; e.g., 0.01.

With the settings given above, STATISTICA Neural Networks will normally solve Iris in less than a thousand epochs.

Performing Multiple Runs. If you want to compare the performance of the algorithm using different settings, click the
Cancel button on the Training in Progress dialog at the end of run, then click OK on the Train Multilayer Perceptron dialog
to start the next run.

To prevent the graph from becoming too cluttered, display the Static Options tab of the Training in Progress dialog, and
clear the Keep lines from previous training runs check box. On the next training run, the graph will be cleared.

Entropy versus Sum-Squared Error Function for Classification. So far, we have trained networks using the cross-
entropy error function, which is the default for classification neural networks in STATISTICA Neural Networks. Training a
neural network with a cross entropy error function (and associated output layer activation functions) is equivalent to
maximum likelihood optimization of the network, on the assumption that the distribution from which the data is drawn is
one of the exponential family of distributions (which includes the normal distribution). The confidence levels generated by such
a network can be interpreted as probabilities, which is of course extremely useful.

However, STATISTICA Neural Networks does support an alternative approach, which is to use logistic activation functions
for the output layer, together with the sum-squared error function. The confidence levels cannot in this case be interpreted as
probabilities (indeed, they may well not sum up to 1.0). However, there are advantages too; training is often faster and more
stable, and the final performance levels are sometimes better (possibly because the underlying distribution is not, in fact, a
member of the exponential family).
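The practical difference between the two output-layer pairings is easy to demonstrate (a minimal sketch with made-up activation values):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])             # raw output-unit activations (illustrative)

softmax = np.exp(z) / np.exp(z).sum()      # cross-entropy / softmax pairing
logistic = 1.0 / (1.0 + np.exp(-z))        # sum-squared / logistic pairing

# softmax outputs sum to 1 (up to rounding) and are interpretable as probabilities;
# independent logistic outputs generally do not sum to 1.
```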

To train a classification neural network using the sum-squared approach, first terminate any analysis that is currently
running, run the Custom Network Designer, select the Multilayer Perceptron network type, and display the Units tab. Select
the Sum-squared option button in the Classification error function group box, and click OK.

With the sum-squared error function, a much higher learning rate can be tolerated. In the case of Iris, you can set the
Learning rate as high as 0.6, and reduce the number of Epochs of back propagation to 100.

There are further parameters for back propagation that can also be set. These are located on the BP(1) tab (i.e. training
parameters for back propagation on phase one).

Momentum. Generally improves performance by speeding up training where there is little change in the error and by
introducing some extra stability. Should always be in the range [0.0, 1.0) (i.e., greater than or equal to 0.0, strictly less than
1.0). It is often recommended to complement a high learning rate with a low momentum, and vice versa. In this case, a high
momentum is acceptable. Set it to 0.6.
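The standard momentum update, which the description above refers to, can be written as follows (a generic textbook formulation, not necessarily the exact update STATISTICA uses):

```python
def momentum_step(w, grad, velocity, lr=0.01, momentum=0.6):
    """One weight update with momentum: a fraction of the previous step is
    re-applied, smoothing the trajectory and speeding up flat regions."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```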

Shuffle. When this check box is selected, the order of presentation of the cases is changed within each epoch of back
propagation. This adds some noise into the training process, which means that the error may oscillate slightly. However, the
algorithm is less likely to get stuck, and overall performance is usually improved.

Note: STATISTICA Neural Networks can alter the Learning rate and/or Momentum on each epoch, progressively altering
them from the starting values given in the left boxes on the BP (1) tab to the finishing values given in the right boxes. For
example, you can lower the learning rate as training progresses. To enable this option, you need to select the Adjust learning
rate and momentum each epoch check box. If this is not selected, STATISTICA Neural Networks just uses the fixed rates
given in the left boxes.

Using Two-Phase Training with Conjugate Gradient Descent. Return to the Quick tab, and select the Phase two check
box. This specifies a second phase of training, which by default uses the conjugate gradient descent training algorithm.

Conjugate gradient descent (see Bishop, 1995) is an alternative training algorithm that is usually substantially quicker than
back propagation. A secondary advantage is that there are no learning rate or momentum parameters to choose, so it is also
simpler to use.

Click the OK button to start the algorithm. The expected behavior is as follows:

When training switches to conjugate gradient descent, there is substantially less noise in the training process. This is because
we have been using the shuffle option with back propagation, which does make for a substantially more noisy optimization
process. However, it also makes it less likely that the back propagation algorithm will get stuck in a local minimum - a
perennial problem with neural networks.

Individual epochs of conjugate gradient descent actually take longer than back propagation epochs, so the algorithm may at
first appear to be going more slowly. However, the error rate can also drop substantially faster with conjugate gradient
descent, so the time elapsed in training will likely be significantly smaller.

On some runs, you may see radical progress made by the algorithm, to the extent that the training error drops, for all intents
and purposes, to zero, and this may happen as quickly as ten epochs or so. Alternatively, the training error may flatline or
decrease very slowly.

In contrast, you are likely to see the selection error noticeably increasing, something which is not usually apparent in this
problem domain with the noisy back propagation training.

Note. With back propagation, you may sometimes observe that the selection error is slightly lower than the training error in
the earlier epochs - something that you may find surprising. The reason is that the error on the training set is being
accumulated as each case is used in training, whereas the selection set error is calculated at the end of the epoch. The
selection error therefore has, as it were, on average a half-epoch lead on the training error. If you retest the network at the
end of training, you may find that the training error drops slightly from its last value, whereas the selection error will stay the
same.

You can see how good the network is in terms of classification performance by producing the Classification statistics
spreadsheet: click the Descriptive statistics button on the Results dialog after training. By default, these are composite
statistics for all cases, irrespective of the subset they belonged to. It is possible to get the misclassification down to two or
three cases. Sum-squared networks seem to learn rather more quickly on this problem, and also to achieve lower
misclassification rates a little more easily than entropy-trained networks. Of course, that does not give any good indication
of generalization performance - the sum-squared networks may be over-fitting, and there are too few dubious cases to make
any certain assessment. In this case, it is probably better to stick with the entropy based networks.

To display the statistics separately for the training and selection subsets, select the All (separately) option button in the
Subsets used to generate results group box on the Quick tab of the Results dialog before clicking the Descriptive statistics
button. The statistics are now replicated across a number of columns, with prefixes T, S, X, and I representing the results of
the training, selection, test, and ignored subsets respectively.

Resume the analysis and click the Predictions button. STATISTICA Neural Networks now generates four separate
spreadsheets for the different subsets. The last two are empty, as there are no test or ignored cases. The first two contain just
the cases for the training and selection subsets respectively.

SNN Example 13: Stopping Conditions


So far we have trained networks for a set number of epochs (see Example 12: Training the Network). This is a reasonable
approach, especially if you are sitting and watching the network train, in which case you can always click the Finish button
or the Cancel button to abort training if you decide it is not doing well.

However, for longer training runs there may be better ways to specify when training should stop. You can do this in
STATISTICA Neural Networks on the End tab of the Train Multilayer Perceptron dialog.

Set up the analysis as you did in Example 11: Creating a Custom Neural Network. Then click OK on the Custom Network
Designer dialog to display the Train Multilayer Perceptron dialog, and click on the End tab.

The stopping conditions are actually used by all the iterative training algorithms in STATISTICA Neural Networks, including
back propagation and conjugate gradient descent. Besides the maximum number of epochs, you can also specify a target
error level at which training should stop, and/or a minimum level of improvement over a given number of epochs.

The most useful option is perhaps the Minimum improvement in error. This states that if the training and selection errors do
not improve by at least the amount given over a set number of epochs (the Window), then training should stop.

For example, set the Window to one epoch and click OK to train the network. If you have left the Minimum improvement in
error Training and Selection thresholds both as 0, STATISTICA Neural Networks will stop training if either the training or
selection errors deteriorate. (Click Cancel on the Results dialog to return to the Train Multilayer Perceptron dialog.)

You can check for deterioration in the selection error only by setting the Minimum improvement in error threshold for
Training to -1, and leaving the Selection threshold at 0. This means that even a wild deterioration in training error is
acceptable, but no deterioration in selection error is acceptable - effectively checking selection error only. When using back
propagation, the training error may sometimes deteriorate. When using conjugate gradient descent the training error will not
deteriorate, but the selection error may do so.

One problem with this approach is that the error often fluctuates during training (especially if using back propagation with a
high learning rate and shuffle option, but also to a lesser extent for other algorithms), sometimes rising only to fall again
soon afterwards. You can adjust the Window to take account of this, so that training only ceases if performance is
unsatisfactory for a number of epochs. With a window of ten epochs, for example, training only stops if the error
deteriorates, and then fails to improve over the last best value for ten epochs. This allows you to stop when a clear trend of
deterioration has set in (probably indicating the onset of over-learning), without stopping prematurely because of random
noise in the training process.
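The window logic can be sketched as a simple predicate over the history of epoch errors, applied to the training and selection error series separately (an illustrative reconstruction; names are hypothetical):

```python
def should_stop(errors, window, min_improvement=0.0):
    """Stop when the error has failed to improve on its previous best value
    by at least min_improvement for `window` consecutive epochs."""
    if len(errors) <= window:
        return False
    best_before = min(errors[:-window])   # best value seen before the window
    recent_best = min(errors[-window:])   # best value within the window
    return recent_best > best_before - min_improvement
```

With a larger window, a brief random rise in the error does not trigger a stop; only a sustained failure to improve does.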

The Intelligent Problem Solver employs stopping conditions on both training and selection, with a relatively long window
(typically 50 epochs) to ensure that learning is not terminated prematurely. For most problem domains, you can afford to be
a little less conservative than that, and a window as low as 20 epochs is acceptable (provided that the back propagation
learning rate, which can cause fluctuation, is kept low).

Tracking the best network. There remains a problem. When training is stopped due to deterioration in the error rate, the
final network is (by definition) not the best one discovered during the training run. The network we should probably treat as
the output of the training process is the one with the lowest selection error discovered during the training run - not the final
network.

By default, STATISTICA Neural Networks keeps a copy of the best network found during training (i.e. the network with the
lowest selection error), and when training finishes, it is this network that is preserved. The problem with this approach is that
it imposes some computational expense - a copy of the network must be made at the end of every epoch on which an
improvement in selection error is made. Consequently, if you do not want to keep the best network (e.g. if you are
conducting experiments with network architectures, sampling or learning parameters, but do not at this stage intend to keep
resulting networks) then you can turn off the feature. Simply clear the Track and restore best network check box on the End
tab of the Train Multilayer Perceptron dialog.
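Conceptually, tracking the best network amounts to keeping a copy whenever the selection error improves, which is exactly where the computational expense comes from (a toy sketch; `selection_error` is a hypothetical callback and the "networks" here are placeholders):

```python
import copy

def track_best(nets_by_epoch, selection_error):
    """Return the network with the lowest selection error seen during
    training, copying it only on epochs where the error improves."""
    best_err, best_net = float("inf"), None
    for net in nets_by_epoch:
        err = selection_error(net)
        if err < best_err:
            best_err, best_net = err, copy.deepcopy(net)  # the copy is the cost
    return best_net, best_err
```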

SNN Example 14: Radial Basis Function Networks


STATISTICA Neural Networks supports a number of network architectures, of which Multilayer Perceptrons (MLPs) are
perhaps the best known.

The second most commonly used neural network architecture is the Radial Basis Function (RBF) network (see Haykin,
1994; Bishop, 1995).

The units in a multilayer perceptron each perform a linear transformation on the input vector (set of values entering the unit);
specifically, they perform a weighted sum of the inputs (i.e. they form the dot product of the input vector with the weight
vector), and subtract from this sum the threshold. In STATISTICA Neural Networks, this is referred to as the dot product
Synaptic (Post Synaptic Potential) function. The result is then passed through the non-linear activation function.

The Dot Product synaptic function means that multilayer perceptrons essentially work by dividing up pattern space using
hyperplanes (in two-dimensional space, a hyperplane is just a line) - a linear operation. Nonlinearity is added by the typically
sigmoidal activation function.

In contrast to the multilayer perceptron's linear approach, a radial basis function network uses a Radial Synaptic function.
Each unit measures the square of the distance of the input vector from the weight vector. This distance is then multiplied by
the threshold (actually, therefore, a deviation) before being passed through the activation function. Thus, radial basis
function networks work by dividing up pattern space using hyperspheres (in two-dimensional space, a hypersphere is just a
circle).

The two approaches have contrasting advantages and disadvantages. The radial approach is very localized, whereas the
linear approach is active over the entire pattern space. Consequently, RBF networks tend to need more units than MLPs, but
MLPs may make unjustified extrapolations when presented with data unlike any of the training data, whereas an RBF will
always have a near-zero (or near sample mean) response in such cases.

Theory also indicates that an MLP may need two hidden layers to solve some problems, and even more hidden layers may
sometimes be more efficient. In contrast, an RBF with one hidden layer is always adequate.

An RBF always has exactly three layers: the input layer, the hidden layer that contains radial units, and (in the standard
formulation) a linear output layer. In STATISTICA Neural Networks, the linear output layer is represented by having a Dot
Product synaptic function, and the Identity activation function. The nonlinearity of the radial units allows this output layer to
be linear without losing the ability to handle nonlinear functions. STATISTICA Neural Networks includes standard linear
optimization techniques that allow the linear output layer to be optimized, provided that the earlier layers of the network are
first fixed.

The approach to RBF training is therefore quite different to that used in an MLP.

First, the radial centers and their deviations (spreads) are set using unsupervised techniques (i.e., techniques that consider
only the input variables in the training data). Essentially, the idea is to pick centers that lie at the heart of clusters of training
data, with deviations selected to reflect the density of the data.

Second, the linear output layer is optimized using the pseudo-inverse technique.
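The pseudo-inverse step solves the linear output layer in one shot rather than iteratively (a generic sketch: H holds the fixed hidden-unit activations for each training case, T the target outputs):

```python
import numpy as np

def fit_linear_output(H, T):
    """Least-squares weights for a linear output layer, computed directly
    via the Moore-Penrose pseudo-inverse: W = pinv(H) @ T."""
    return np.linalg.pinv(H) @ T
```

This is why the earlier layers must be fixed first: the pseudo-inverse solves a purely linear problem, which only exists once the hidden activations H are frozen.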

Although well-suited to regression problems, a standard RBF with a linear output layer is less suited to classification. The
output neuron activation levels are not constrained to sum to 1.0, and so cannot be interpreted as probabilities. However, as
with MLPs, by modifying the networks to use an entropy-based error function, and an appropriate output layer activation
function (either softmax, or logistic, depending on the number of output neurons), the network can be interpreted as
estimating probabilities, on the assumption that the underlying distribution is from the exponential family.

STATISTICA Neural Networks supports both standard sum-squared and entropy based RBF networks. Select the desired
error function on the Units tab of the Custom Network Designer.

One disadvantage of entropy-based RBFs is that linear optimization can no longer be deployed on the output layer. Instead,
STATISTICA Neural Networks uses Conjugate Gradient Descent. However, although slower than linear optimization, the
algorithm does converge in a reasonable amount of time and, unlike the case of Multilayer Perceptron networks, the error
function is unimodal, so that convergence is guaranteed.

Alternatively, you can select an iterative training algorithm of your choice via the Custom button on the RBF Training
dialog.

Using an RBF Network for the Iris Problem. To create an RBF network, use the Custom Network Designer as usual, but
change the Network type to Radial Basis Function. As with Multilayer Perceptrons, you can specify the number of hidden
units on the Units tab (you cannot change the number of hidden layers in an RBF, however).

The number of hidden units is typically much larger in an RBF than in an MLP. However, in this case quite a small number
of hidden units (ten, or even less) is adequate.

Click OK on the Custom Network Designer to display the Train Radial Basis Function dialog.

Training an RBF Network. On the Quick tab, select the K-Means option button in the Radial assignment group box; this
assigns cluster centers to the radial units.

Select the Isotropic, scale by option button in the Radial spread group box; this assigns appropriate deviations based on the
number of units and the spread of the training data.

Click the OK button.

As you will see, another advantage of radial basis function networks is that they train quite rapidly, and in this case you will
probably find the performance comparable to that of the Multilayer Perceptron.

A disadvantage is that the pseudo-inverse algorithm is prone to error if the radial deviations are too small, so that the
Isotropic and K-Means deviation algorithms must be used with care. If you do run into numerical difficulties, use the custom
training facility to optimize the output layer using an iterative algorithm such as conjugate gradient descent.

SNN Example 15: Linear Models


The major reason for the popularity of neural networks is their ability to model nonlinear problems; i.e., classification
problems that cannot be solved simply by drawing a single hyperplane between classes, and regression problems that cannot
be solved simply by drawing a hyperplane through the data.

Despite this, linear models should not be neglected. It is not too uncommon to find that a problem that was perceived to be
difficult and nonlinear can actually be solved adequately by linear techniques, and a linear model always provides a good
benchmark against which to judge more complex techniques.

STATISTICA Neural Networks builds linear models using a special linear network type. A linear network has only two
layers: an input layer, and an output layer with linear Synaptic (Post Synaptic Potential) and activation functions. See also,
Synaptic Functions and Activation Functions.

Linear networks, like the output layer of an RBF network, can be optimized directly using the pseudo-inverse technique.
They are designed principally for regression problems, but are equally able to perform a simple discriminant analysis on
classification problems. See also, Linear Networks and Radial Basis Function Networks.

To create a linear model of the Iris regression problem, use the Custom Network Designer dialog as usual, setting the
Network type option to Linear. Linear networks do not have hidden units, so there are no further settings to make. Simply
click OK.

Training a Linear network is equally straightforward - just press the OK button on the Train Linear Network dialog.

Linear models are actually ill-suited to the Iris problem, as the VERSICOL and VIRGINIC species are not linearly separable,
and if you generate Descriptive statistics you will see a significant amount of misclassification between these two classes.

SNN Example 16: SOFM Networks


Self Organizing Feature Map (SOFM, or Kohonen) networks perform unsupervised learning: they learn to recognize clusters
within a set of unlabelled training data; that is, training data that includes input variables only.

SOFM networks also place related clusters close together in the output layer, forming a Topological Map (see Haykin, 1994;
Fausett, 1994; Patterson, 1996).

SOFM networks are used in a quite different fashion to other network types, and STATISTICA Neural Networks has special
facilities to support their use, including the Topological Map dialog, which illustrates visually which cases have been
assigned to each cluster, and aids you in labeling units and cases appropriately. See also, Topological Map.

For this exercise we will use the Iris data set. Note that, although SOFM networks do not need output variables for training,
they can use them if present. The output of an SOFM network in STATISTICA Neural Networks is always a nominal variable
(i.e., they always perform classification).

On the Startup Panel Quick tab, select Custom Network Designer, click the Variables button, and select IRISTYPE as the
dependent variable, and SEPALLEN, SEPALWID, PETALLEN, and PETLWID as independents.

In real applications of SOFMs, there is not usually a dependent variable in the data set (i.e. the data set doesn't contain
observations of the desired output). In that case, you can simply click OK on the variable selection dialog without selecting a
dependent, and STATISTICA Neural Networks will automatically create an output variable for the network. Of course, you
must still provide independent variables.

On the Custom Network Designer dialog, select the Self Organizing Feature Map option button.

SOFM networks always have two layers: an input layer and the topological map output layer. The output layer of the SOFM
network is distinguished by being laid out in two dimensions.

Display the Units tab of the Custom Network Designer, and set the Width to 4 and the Height to 5.

Click OK.

Training an SOFM Network. SOFMs are trained using the Kohonen algorithm. The Kohonen training algorithm adjusts
the centers in the topological map layer to move them closer to cluster centers in the training data. During training, the
algorithm selects the unit whose center is closest to the training case. That unit and its neighbors are then adjusted to be more
like the training case.

The neighborhood plays a crucial role in Kohonen training. By updating surrounding units in addition to the winning unit,
Kohonen training assigns related data to contiguous areas of the topological map. During training the neighborhood size is
progressively reduced, together with the learning rate, so that early on a coarse mapping is produced (with large clusters of
units responding to similar cases). Finer detail is produced later (as individual units within a cluster respond to finer levels of
discrimination among related cases).
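A single Kohonen update can be sketched as follows (an illustrative reconstruction using a simplified one-dimensional neighborhood; the real topological map is two-dimensional):

```python
import numpy as np

def kohonen_step(centers, x, lr, neighborhood):
    """One Kohonen update: find the unit whose center is closest to the
    training case x, then move it and its neighbors towards x."""
    dists = np.linalg.norm(centers - x, axis=1)
    win = int(np.argmin(dists))                      # winning unit
    lo = max(0, win - neighborhood)
    hi = min(len(centers), win + neighborhood + 1)
    centers[lo:hi] += lr * (x - centers[lo:hi])      # winner and neighbors move
    return win
```

Shrinking `neighborhood` and `lr` over successive epochs gives exactly the coarse-then-fine behavior described above.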

The Train Self Organizing Feature Map - Quick tab includes start and end values for both the Learning rate and the Neighborhood size. Kohonen training is also usually split explicitly into a coarse (ordering) phase and a fine-tuning phase, and for this reason the training dialog allows you to specify two phases.

In this case, two runs of 50 epochs each are reasonably effective. On the first run (Phase one), reduce the Learning rate from
an initial level of 0.5 to a final level of 0.1, and keep the Neighborhood size at 1 throughout.

On the second run (Phase two), keep a steady Learning rate of 0.1 and a steady Neighborhood size of 0.

This is actually a very atypical training run for an SOFM - typically, a much larger number of epochs is used.

On the Interactive tab, select the Interactive training check box.

Click OK.

Note. As with Multilayer Perceptron training, the RMS error is reported on the Training in Progress dialog. However, this is
an entirely different error to that reported during Multilayer Perceptron training.

For Multilayer Perceptrons, the standard error function is either cross entropy or sum-squared, both of which measure the
distance of the predicted output neuron vector from the target output neuron vector in some sense. The error reported on the
training error graph window is the RMS of this output error across the training set.

In Kohonen training, the error function is the distance of the winning radial unit's weight vector from the input vector. The error reported on the progress graph window is the RMS of this input error across the entire training set.
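As a sketch (using our own naming, not SNN's internals), the reported figure amounts to:

```python
import numpy as np

def kohonen_rms_error(weights, data):
    # Per case: distance from the input vector to the winning
    # (closest) unit's weight vector; then RMS across all cases.
    d = np.array([np.min(np.linalg.norm(weights - x, axis=1))
                  for x in data])
    return float(np.sqrt(np.mean(d ** 2)))
```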

Click the second OK button on the Training in Progress dialog to transfer the graph to the results workbook, then resume the
analysis. The Topological Map dialog is displayed.

Once the SOFM network has been trained, you can examine it to see where clusters have formed, and what they correspond
to, using the Topological Map dialog. The clusters can then be labeled. In this case the task is simple because we actually
have class labels for the cases in the data set (i.e. the IRISTYPE), which is not typically the case when using SOFM
networks.

Win Frequencies. The Topological Map dialog presents various pieces of information to help you make sense of the
SOFM. Each square in the illustration represents a neuron in the topological map layer. As you move the mouse pointer over
the Topological Map tab, the readout fields at the lower-right of the dialog display information about the neuron under the
pointer. The bottom field is the win frequency.

You can use the win frequency to observe where on the topological map clusters have formed. The network is run on all
cases in the training set, and a count is made of how many times each unit wins (i.e., is closest to the tested case).

High win frequencies indicate the centers of clusters on the topological map.
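The win-frequency count itself is straightforward; a minimal sketch (our own naming, not SNN's code):

```python
import numpy as np

def win_frequencies(weights, data):
    # Count, for each unit, how many cases it "wins"
    # (i.e., is closest to).
    counts = np.zeros(len(weights), dtype=int)
    for x in data:
        counts[np.argmin(np.linalg.norm(weights - x, axis=1))] += 1
    return counts
```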

Units with zero frequencies aren't being used at all, and are generally regarded as an indication that learning was not very
successful (as the network isn't using all the resources available to it). However, in this case there are so few training cases
that some unused units are inevitable.

For convenience, you can specify that the win frequencies be displayed on the map itself.

Select the Win Frequencies tab, and select the Show win frequency on Topological Map check box.

Then, select the Topological Map tab.

You can also differentiate the training subsets in calculating the win frequencies displayed on the topological map by setting
the appropriate options on the Win Frequencies tab, and you can transfer the win frequencies to a results spreadsheet by
clicking the Win frequencies button on that page.

Interpreting the Topological Map. Once the overall distribution of cluster centers has been noted, you can use the
topological map to test the network in order to identify the meaning of the clusters.

The topological map displays the output layer graphically in two dimensions. When a case is run, each unit displays its proximity to the case using a solid black square (a larger square indicating greater proximity), and the winning unit is highlighted.

The proximity is actually the activation level of the neuron. As you move the cursor over the Topological Map tab, the
activation level of the neuron under the mouse pointer is shown in the second bottom-most readout field on the lower-right
of the dialog. You can also specify that the activation levels of the neurons be displayed numerically on the topological map
itself.

Display the Advanced tab, and select the Activation level check box. Then select the Topological Map tab.

If you test a number of cases (the most convenient way is by repeatedly clicking the microscroll up arrow button next to the
Case field), you should see from the pattern of activations how related units are grouped together, and respond strongly to
similar cases.

At this stage, we can begin to label the units to indicate their meanings as clusters. In this case, the first 50 cases are all
representatives of SETOSA.

Run the first case by clicking the microscroll up button next to the case field until the field displays the value 1, then label
the winning unit as follows:

Select SETOSA from the drop-down list.

The topological map is updated to display the new unit name.

In a real SOFM network application, you would run the remaining training cases, labeling the winning units as appropriate.
You should only label units using the training cases, which are usually scattered throughout the data set. The subset of the
current case is displayed in the read-out field under the Case field to help you with this task.

However, in this experiment we actually know the class labels of the training cases, as these are included in the data set.
That being the case, we can use algorithms in STATISTICA Neural Networks to automatically label the data.

Cancel the Topological Map dialog, then Cancel the Results dialog, to return to the SOFM Training dialog.

Click the Custom button to display the Train Radial Layer dialog, which contains all the algorithms for training radial layers
available in STATISTICA Neural Networks. STATISTICA Neural Networks supports hybridization of network types; for
example, it is possible to use Kohonen training on the radial layer of a Radial Basis Function network. We are going to use a
labeling algorithm usually associated with Cluster networks to apply class labels to our SOFM. Cluster networks are
normally used with labeled data, such as we have here.

Display the Kohonen tab, and click the Run button to rerun the Kohonen algorithm. Then click the first OK button on the
Training in Progress dialog. This will train the network using the Kohonen algorithm, and return you to the Radial Training
dialog.

Select the Labels tab, enter the value 0.9 into the Minimum proportions field, and click the Voronoi Neighbors button. This runs all the training cases through the network, and counts how many cases are assigned to each neuron. Provided that at least the given proportion (0.9) of those cases are of the same class, that class is used to label the neuron. If there is some ambiguity in the neuron (that is, cases of different classes activate the same neuron), the neuron is left unlabeled. The neuron may also be left unlabeled if no cases are assigned to it. By altering the minimum proportion to 0.0 and rerunning, you can tell which neurons are entirely unused.
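The labeling rule can be sketched as follows (an illustration of the rule described above, not STATISTICA's implementation):

```python
import numpy as np
from collections import Counter

def label_units(weights, data, classes, min_proportion=0.9):
    # Assign each case to its nearest unit, then label a unit with a
    # class only if at least min_proportion of its cases share it.
    assigned = {}
    for x, c in zip(data, classes):
        u = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        assigned.setdefault(u, []).append(c)
    labels = {}
    for u in range(len(weights)):
        cases = assigned.get(u, [])
        if not cases:
            continue  # unused unit: left unlabeled
        cls, n = Counter(cases).most_common(1)[0]
        if n / len(cases) >= min_proportion:
            labels[u] = cls  # otherwise ambiguous: left unlabeled
    return labels
```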

After the Voronoi labeling algorithm is complete, click the OK button to finish training the SOFM.

The Topological Map dialog is displayed, but this time some units on the topological map (those that respond to specific
classes) have been labeled with the corresponding class name.

Now page through the cases, seeing how well the SOFM classifies the selection and test cases.

If you were using a SOFM network in a real problem, you typically would not have observations for the output variable. The
procedure is then to simply label clusters symbolically (using the win frequencies datasheet to identify them). Subsequently,
you can examine the data to try to determine what meaning can be assigned to the clusters.

To aid you in this process, STATISTICA Neural Networks also allows you to label the cases on the Topological Map dialog.
In such a circumstance, you would follow the procedure below (don't do this now - there is no need as the Iris data is already
labeled).

- Train the SOFM network.

- Use the win frequencies to identify clusters in the Topological Map.

- Label clusters with symbolic names (e.g., C1, C2, etc.). You can add new class names by clicking the Edit Class List... button, then clicking the Add... button on the Class List dialog.

- Test the cases on the Topological Map window, and assign cluster names to the cases.

- Examine the data (probably referring back to the application) to determine what the clusters represent.

- Replace the symbolic cluster names with meaningful labels.

- Retest the cases, and label them according to the winning unit.

In addition, you do not need to wait for a neuron in the topological map to win before you can label it. Simply click on a unit
to select it, then choose the class from the drop-down list. You can also click and drag to select a range of units, or hold
down the CTRL key and click or drag to select noncontiguous units or to deselect units, and then label the entire set of selected
units at once.

SNN Example 17: PNNs and GRNNs Revisited


Probabilistic Neural Network. A probabilistic neural network (PNN) is used for classification problems, and therefore
requires a data set with a single nominal output variable (see Patterson, 1996; Bishop, 1995).

Open the Irisdat.sta data set via the File - Open menu; it is in the /Examples/Datasets directory of STATISTICA. Select
Neural Networks from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup Panel.
Select the Classification option button in the Problem type group box and select Custom Network Designer on the Neural Networks Startup Panel - Quick tab. Click the Variables button and select IRISTYPE as the Categorical Output and SEPALLEN, SEPALWID, PETALLEN, and PETLWID as the Continuous Inputs.

Click OK to display the Custom Network Designer dialog, and select Probabilistic Neural Network as the Network type. You do
not need to specify the number of hidden units in a PNN - this is automatically set to equal the number of training cases.

Click OK and the Train Probabilistic Neural Networks dialog is displayed. There are few parameters to specify for PNN
training: a Smoothing factor, optional Prior probabilities and (for four layer PNNs) an optional Loss Matrix.

Click on the Quick tab. The Smoothing factor determines the widths of the Gaussian functions, centered at each training case
and stored in the first hidden layer radial units. A small smoothing factor gives a sharp, spiky approximation to the
underlying function, which may perform well on the training set but generalize poorly (and therefore perform badly on the
selection set). A large smoothing factor gives a "blurred", smooth approximation, which may perform relatively poorly on
the training set, but generalizes well to the selection set.
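The role of the smoothing factor can be seen in a minimal sketch of PNN-style classification (our own simplified formulation: one Gaussian kernel per training case, summed per class, then normalized; not STATISTICA's code):

```python
import numpy as np

def pnn_class_probs(x, train_X, train_y, smoothing):
    # Sum a Gaussian kernel of width `smoothing`, centered at each
    # training case, separately per class; normalize to probabilities.
    probs = {}
    for cls in set(train_y):
        X = train_X[train_y == cls]
        d2 = np.sum((X - x) ** 2, axis=1)
        probs[cls] = np.sum(np.exp(-d2 / (2 * smoothing ** 2))) / len(X)
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}
```

A tiny smoothing factor makes each kernel a spike around its own training case (low training error, poor generalization); a large one blurs the kernels together.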


Usually, the algorithm is not too sensitive to the precise choice of smoothing factor. Let us try a few values between 0.1
(usually a sensible lower limit) and 10. Enter .1 into the Smoothing box and click OK to train the network. Observe the Train
Error and Select Error in the summary statistics at the top of the Results dialog. Then click Cancel to return to the Train
Probabilistic Neural Networks dialog. Now, enter 10 into the Smoothing box and click OK to train the network. Again,
observe the Train Error and Select Error in the summary statistics at the top of the Results dialog. You can repeat this
process with more values if you would like.

The figure to watch here is the selection error (you can always reduce the training error by reducing the smoothing factor, at
least until you exceed the precision of the machine when the Gaussian curves are calculated). If you try out a smoothing
factor of 0.01, you will likely get an extremely low training error and a high selection error.

Click Cancel on the Results dialog to return to the Train Probabilistic Neural Networks dialog and click on the Priors tab.
The Prior probabilities are used in situations where you know that the training set is biased; for example, if in a two-class
problem the population at large has only 1% of cases positive, with the rest negative, but your training set has 50% positive
and 50% negative. As the PNN estimates class probabilities by adding "density estimates" centered at each case, the
resulting probabilities will be highly biased unless adjusted to reflect the imbalance between population and data set
representation. However, a common practice is to draw cases randomly, in which case there is no need for adjustment.
Similarly, if you don't know the prior probabilities, the working assumption must be that the data set fairly represents the
population distribution. In either case, prior probabilities do not need to be assigned. If you do want to change the prior
probabilities, you would select the Use specific prior probabilities check box and then click the Edit button and enter the
values in the edit spreadsheet.
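The adjustment itself is a simple reweighting. The sketch below (our own naming, not SNN's code) rescales each estimated class probability by the ratio of its population prior to its training-set prior, then renormalizes:

```python
def adjust_for_priors(est_probs, train_priors, pop_priors):
    # Reweight biased estimates by (population prior / training prior)
    # and renormalize so the probabilities sum to one.
    scaled = {c: est_probs[c] * pop_priors[c] / train_priors[c]
              for c in est_probs}
    total = sum(scaled.values())
    return {c: p / total for c, p in scaled.items()}
```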

Having experimented with a range of smoothing factors, and created a PNN with an optimal selection error, you can execute
it in the usual fashion. PNNs confidence levels are interpretable as class probability estimates.

Assuming that there is an element of noise in the data, the network (however good) will inevitably assign the wrong class
sometimes. The basic PNN is designed to make as few mistakes as possible. However, in reality some misclassifications are
often more expensive mistakes than others. You can compensate for this by adding a loss matrix.

Click Cancel on the Train Probabilistic Neural Networks dialog to return to the Custom Network Designer dialog. Click on the PNN tab to display the PNN page, and select the Include loss matrix in PNN check box.


Now click OK and STATISTICA Neural Networks will create a four layer PNN. The fourth layer has the same number of
units as the third, and contains the loss matrix.

On the Train Probabilistic Neural Networks dialog, click on the Loss matrix tab.

Click the Edit loss matrix button to display an edit dialog. The entries here represent the relative costs of various types of
misclassification, with the columns representing the predicted class of the case, and the rows representing the actual class.
The leading diagonal of the Loss Matrix always contains zeroes, as there is no cost involved in getting the right answer. You
can alter the Loss Matrix to indicate misclassification costs. For example, setting the third column, second row to 10 (and
leaving all other off-diagonals as 1) means that mistakenly classifying a Versicol as a Virginic is considered ten times as bad
as any other type of mistake. Consequently, the number of correct classifications of Versicols should rise, at the cost of
some misclassifications of Virginics. For now, leave the default settings.
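The minimum-cost decision described here can be sketched directly. With class probabilities p and a loss matrix L (rows = actual class, columns = predicted class, zero diagonal), the expected cost of predicting class j is the probability-weighted column sum, and the network picks the cheapest column. The sketch below uses our own naming; with classes ordered Setosa, Versicol, Virginic, it reproduces the example above:

```python
import numpy as np

def min_cost_class(class_probs, loss_matrix):
    # Expected cost of predicting class j:
    #   sum over actual classes i of P(i) * loss[i, j].
    # Choose the column (predicted class) with minimum expected cost.
    costs = class_probs @ loss_matrix
    return int(np.argmin(costs))
```

With the default matrix (all off-diagonals 1), this reduces to picking the most probable class; raising the Versicol-predicted-as-Virginic entry to 10 pushes borderline cases toward Versicol.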


When using a loss matrix, you are usually interested in making the minimum cost decision, and eschewing any "unknown"
or "undecided" states. Click on the Thresholds tab and make sure that the Assign to highest confidence (no threshold) option
is selected.

Generalized Regression Neural Networks. A generalized regression neural network (GRNN) is used for regression
problems.

For the purposes of this example, we will train a GRNN to predict the PETLWID variable of the Irisdat.sta data set from the SEPALLEN, SEPALWID, and PETALLEN variables.

Either click the cancel button until you return to the STATISTICA Neural Networks (SNN) Startup Panel or begin a new
analysis by selecting Neural Networks from the Statistics - Data Mining menu.

Select the Regression option button in the Problem type group box and select Custom Network Designer on the Neural
Networks Startup Panel - Quick tab. Click the Variables button, and select PETLWID as the Continuous Output, and
SEPALLEN, SEPALWID, and PETALLEN as the Continuous Input variables. Click OK to display the Custom Network
Designer dialog. On the Quick tab select General Regression Neural Network.

By default, STATISTICA Neural Networks creates a GRNN with the same number of hidden units as training cases. However, unlike a PNN, a GRNN can optionally be specified with fewer units than training cases, with a clustering algorithm used to assign the centers; you can alter the number of hidden units on the Units tab. Usually, though, the number of centers cannot be vastly smaller than the number of training cases without sacrificing performance (it might be half or a third, rather than an order of magnitude less). The simplest approach is to use the full set of training cases.


Click OK to display the Train Generalized Regression Network dialog.

Make sure that the Set number of hidden units to equal number of training cases check box is selected (this is the default). If
this is selected, the Assignment of radial centers options may be ignored.

As with a PNN, the single training parameter to be selected is the Smoothing factor, which has the same meaning as in PNNs: it controls the deviation of the Gaussian kernel functions located at the radial centers. In this case, a smoothing factor of around 0.1 is adequate.

Click the OK button to run the GRNN training algorithm. This configures the network to perform kernel-based estimation.

When the network is trained, click the Descriptive Statistics button on the Results dialog to generate the regression summary
statistics spreadsheet. An S.D. Ratio of around .3 should be achievable on the selection set. As with PNNs, an almost
arbitrarily good (and quite deceptive) S.D. Ratio can be achieved on the training set simply by lowering the Smoothing factor
sufficiently.
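The kernel-based estimation a GRNN performs is essentially a smoothed weighted average of the training outputs. A minimal sketch (our own simplified formulation in the style of Nadaraya-Watson kernel regression, not STATISTICA's code):

```python
import numpy as np

def grnn_predict(x, train_X, train_y, smoothing):
    # Prediction = average of training outputs, weighted by a Gaussian
    # kernel of width `smoothing` centered at each training case.
    d2 = np.sum((train_X - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * smoothing ** 2))
    return float(np.sum(w * train_y) / np.sum(w))
```

A very small smoothing factor makes the prediction snap to the nearest training case's output (near-zero training error, poor generalization), which is why the deceptively good training-set S.D. Ratio appears.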

SNN Example 18: Ensembles and Sampling


One of the most prevalent problems with neural networks is the tendency to overfit the data. A range of strategies can be
used to combat overfitting, and many of these are available in STATISTICA Neural Networks, including early stopping and
weight decay regularization.

Another important approach is to deploy ensembles of networks. Rather than making predictions using a single neural
network, we can average across the predictions of a number of networks. This can improve the reliability of the prediction,
especially if we resample the training set on each network.

Ensembles are effective at improving generalization performance because the networks are optimized against a finite data
set. The error made by a neural network can be divided into three parts. First, some error is due to intrinsic noise on the
function being modeled - this error is unavoidable. Second, the network may implement a function that is, on average across
possible data sets, systematically different from the underlying function. This is known as bias, and is a particular problem if your neural network is insufficiently complex to capture the underlying function. Third, the neural network may be particularly sensitive to the particular choice of data set, and the error may fluctuate accordingly - this is known as variance. More complex models tend to exhibit higher variance, as they can fit the noise in the data set. Selecting
the neural network's complexity (i.e. number of hidden units) is actually a choice about the trade-off between bias and
variance errors.

Ensembles provide a very simple way of improving performance, because averaging across different neural networks lowers
the expected variance. Arguably, you can even afford to deploy more complex models (i.e. more weights) with lower bias,
on the basis that the resulting variance can be removed via the ensemble. An important piece of theory (Bishop, 1995) shows
that the expected performance of an ensemble is at least as good as the average performance of the members.
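The variance-reduction effect is easy to demonstrate numerically. In this toy illustration (our own, not from the manual), each "network" predicts the true value 1.0 plus independent noise standing in for its variance error; averaging five such predictions leaves the bias unchanged but cuts the variance roughly five-fold:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0

# One prediction per trial from a single noisy model...
single = true_value + rng.normal(0.0, 0.5, size=100000)
# ...versus the mean of five independent noisy models per trial.
ensemble = true_value + rng.normal(0.0, 0.5, size=(100000, 5)).mean(axis=1)

print("single-network variance:   %.4f" % single.var())
print("5-member ensemble variance: %.4f" % ensemble.var())
```

In practice the members' errors are correlated (they see overlapping data), so the reduction is smaller than 1/n, which is why resampling the training set for each member helps.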

Creating an ensemble using the Intelligent Problem Solver. Open the Irisdat.sta data set via the File - Open menu; it is in
the /Examples/Datasets directory of STATISTICA. Select Neural Networks from the Statistics - Data Mining menu to display
the STATISTICA Neural Networks (SNN) Startup Panel. Select the Classification option button in the Problem type group
box and select Intelligent Problem Solver on the Neural Networks Startup Panel - Quick tab. Click the Variables button and
select IRISTYPE as the Categorical Outputs and SEPALLEN, SEPALWID, PETALLEN, and PETLWID as the Continuous
Inputs. Click OK to display the Intelligent Problem Solver dialog.

On the Quick tab, select the Form an ensemble from retained networks check box.


Click the OK button to train the network.

When the Results dialog is displayed, you will see that the summary list contains a single model, with a Profile of Output 4:
[5]:1 (or something similar). This signifies an output ensemble, with four inputs, one output, and five member networks.

You can execute the ensemble just as you would an individual network, with some minor exceptions. Specifically, and as
this is a classification problem, the ensemble forms its prediction using a voting strategy. Each network individually predicts
the class of the IRISTYPE output. The ensemble predicts the most common class among the members. If there is a tie (disagreement among the members), the ensemble outputs a missing value (i.e., the class is "unknown"). Due to this voting process, the ensemble cannot output confidence levels (there is a special form of ensemble, a Confidence ensemble, that can,
but it requires fairly homogeneous member networks).
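The voting rule just described can be sketched in a few lines (an illustration of the rule, not STATISTICA's implementation):

```python
from collections import Counter

def ensemble_vote(predictions):
    # Majority vote across member networks; a tie for first place
    # yields None (the class is "unknown").
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```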

The summary statistics displayed for the ensemble at the top of the Results dialog are actually the average statistics for the
member networks. Therefore, an ensemble is also a convenient way to group a set of networks, and generate overall results.

When networks are placed in an ensemble, you can still execute the individual networks. On the Results dialog, click the
Select models button to display the Select Networks and/or Ensembles dialog.


Click on the Options tab and select the Select the ensemble and its networks option under Selecting an ensemble.

Click OK. The list box in the Results dialog now contains both the ensemble, and its member networks.

Click the Predictions button. The resulting spreadsheet contains the predictions of the member networks and the ensemble together for comparison. A portion of that spreadsheet is shown below. Remember that your results may vary slightly.

Click Cancel to return to the Intelligent Problem Solver dialog, and click the Sampling button at the bottom right of the
dialog. This displays the Sampling of Case Subsets for Intelligent Problem Solver dialog.

By default, the Selection of subsets for each network trained option is set to Fix in numbers given below (randomly assigned
at beginning). This implies that the case subsets are sampled once when the Intelligent Problem Solver starts, and are not
altered thereafter. You can change this so that the cases are resampled for each network. For example, select the Bootstrap
resampling (see Bootstrap tab) option, click OK, then click OK again on the Intelligent Problem Solver dialog to train the
network.

Bootstrap is an effective resampling technique for variance reduction. Each time a network is trained, the training set is
produced by sampling with replacement from the original data (i.e. some cases may not be selected, and some may be
selected more than once). This process acts to reduce the variance caused by the particular data set available. The ensemble
averages across the resulting networks to produce the final prediction.
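The sampling-with-replacement step is simple to sketch (our own naming, not SNN's code):

```python
import random

def bootstrap_sample(n_cases, rng):
    # Draw n_cases indices with replacement: some cases appear more
    # than once, others (about 37% on average) are left out entirely.
    return [rng.randrange(n_cases) for _ in range(n_cases)]
```

Each network in the ensemble is then trained on the cases selected by one such sample.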

Creating an ensemble using the Custom Network Designer. You may also create ensembles using the Custom Network
Designer. Besides the usefulness of ensembles for predictions, you can also use the facility to generate estimates of
generalization performance, by averaging across repeated sampling experiments.

Keep the Irisdat.sta data set open and return to the STATISTICA Neural Networks (SNN) Startup Panel (either by clicking
Cancel or beginning a new analysis). Select the Classification option button in the Problem type group box and select
Custom Network Designer on the Neural Networks Startup Panel - Quick tab. Click the Variables button and select
IRISTYPE as the Categorical Outputs and SEPALLEN, SEPALWID, PETALLEN, and PETLWID as the Continuous Inputs.
Click OK to display the Custom Network Designer dialog.


Select Multilayer Perceptron as the Network type on the Quick tab, then click OK to display the Train Multilayer Perceptron
dialog.

On the Train Multilayer Perceptron dialog click the Sampling button to display the Sampling of Case Subsets for Training
dialog.

Click on the Advanced tab, and select Cross validated resampling as the Sampling method. Set the Number of samples under Resampling to 5.


Click OK to return to the Train Multilayer Perceptron dialog and then click OK again to train the network.

STATISTICA Neural Networks will now run the training algorithm five times, each time resampling. When training is
finished, the five networks are formed into an ensemble, and their average performance statistics are displayed in the
summary line in the list box.

In this particular experiment, we have used the cross validation sampling approach. The data set is divided into equal parts
(in this case, five parts), and a number of experiments run. On each run, one part is held back for use as a test set, and the
other parts are used for training and selection. With the five-fold cross validation we used here, 80% of the data set is used
for training each network, and 20% for testing. Higher fold testing (even up to the limit of leave-one-out cross validation)
can be used, on the basis that the reliability of the individual models will be higher if each uses as much data as possible.
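The case-division scheme can be sketched as follows (our own naming; STATISTICA manages the subsets internally):

```python
def cross_validation_folds(n_cases, n_folds=5):
    # Split case indices into n_folds roughly equal parts. Each run
    # holds one part back as the test set and trains on the rest.
    indices = list(range(n_cases))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for test in folds:
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        splits.append((train, test))
    return splits
```

With five folds, each network trains on 80% of the cases and is tested on the remaining 20%, and every case is held out exactly once across the five runs.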

Compared with sampling and ensemble creation in the Intelligent Problem Solver, the Custom Network Designer is both more flexible and more controlled. You can specify an exact network architecture, then build multiple networks using different case divisions.

SNN Example 19: Input Variable Selection


Among the decisions that the neural network designer must make, which of the available variables to use as inputs to the
neural network (independent variables) is one of the most difficult. The difficulty of this decision arises for a number of
reasons:

In real problem domains, a neural network is typically employed where you do not have a strong idea of the relationship
between the available variables and the desired prediction. Therefore, you are likely to accumulate a variety of data, with
some that you suspect is important and some that is of dubious (but unknown) value.

In nonlinear problems, there may be interdependencies and redundancies between parameters; for example, a pair of
parameters may be of no value individually, but extremely useful in conjunction, or any one of a set of parameters may be
useful. These features mean that it is not possible, in general, to simply rank parameters in order of importance.

The "curse of dimensionality" means that it is sometimes actually better to discard some variables that do have a genuine
information content simply to reduce the total number of input variables, and therefore the complexity of the problem and
the size of the network. Counter-intuitively, this can actually improve the network's generalization capabilities (see Bishop,
1995).

The only method that is guaranteed to select the best input set is to train networks with all possible input sets and all possible
architectures, and to select the best. In practice, this is impossible for any significant number of candidate inputs.

You can also experiment heuristically in STATISTICA Neural Networks by building a variety of networks with different
inputs, and gradually building up a picture of which inputs are useful. You can also employ Weigend weight regularization,
and eliminate any input units that have very small fan-out weights.


In addition, the Intelligent Problem Solver contains some highly sophisticated algorithms to select input variables, and for
most users will provide a reasonable analysis.

If you want to examine the selection of variables more closely yourself, there are two further approaches available in
STATISTICA Neural Networks: Sensitivity Analysis and Feature Selection algorithms.

Sensitivity Analysis. To perform sensitivity analysis, run a previously trained neural network, display the Sensitivity tab of
the Results dialog, and click the Sensitivity Analysis button.

The spreadsheet indicates the sensitivity of the model to each variable. The sensitivity is reported in two rows - the Ratio and
the Rank.

The basic sensitivity figure is the Ratio. For each variable, the network is executed as if that variable were "unavailable." Unavailability of a variable used by the model will presumably cause some deterioration in its performance. The Ratio reported is the ratio of the error with the variable unavailable to the error with it available. Important variables have a high ratio, indicating that the network performance deteriorates badly if they are not present. If the Ratio is one or lower, then making the variable "unavailable" either has no effect on the performance of the network, or actually enhances it.

The Rank lists the variables in order of importance (i.e. order of descending Ratio), and is provided for convenience in
interpreting the sensitivities.
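The ratio computation can be sketched generically. The sketch below is our own: it treats "unavailable" as substituting the variable's training-set mean, which is a common convention but is an assumption about, not a statement of, how STATISTICA implements it; `model_error` stands in for "run the network and measure its error":

```python
import numpy as np

def sensitivity_ratios(model_error, X, y, column_means):
    # For each input, replace its column with that input's mean
    # ("unavailable") and compare the resulting error with the
    # baseline. A ratio above 1 means the variable matters.
    baseline = model_error(X, y)
    ratios = []
    for j in range(X.shape[1]):
        Xj = X.copy()
        Xj[:, j] = column_means[j]
        ratios.append(model_error(Xj, y) / baseline)
    rank = np.argsort(ratios)[::-1]  # most important first
    return ratios, rank
```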

The possible redundancies and interdependencies between variables imply that the sensitivity figures must be interpreted
with caution - they may be quite different on another network applied to the same data set. However, if you find that certain
variables consistently have high or low sensitivity, you can begin to identify key and unnecessary variables.

In the Iris data, you will likely find the petal width and length the most sensitive variables, with the sepal length and width
not very important, and the sepal length in particular often identified as of no value.


As with all results generated by STATISTICA Neural Networks, you can choose which subsets to include in the sensitivity
analysis. It is a good idea to select (via the Results dialog - Quick tab) Subsets used to generate results - All (separately), and
then to run sensitivity analysis. You can then do a consistency check between the sensitivities on the train, select and test
subsets, which gives an indication of the meaningfulness of the rankings. It is also, unsurprisingly, common for sensitivities on the selection and test sets to be lower than on the training set; in particular, you may find some low-sensitivity inputs with a ratio below 1.0 on these subsets, indicating that they are actually worthless.

When you train networks with any of the training algorithms, you can specify sensitivity-based pruning on the Pruning tab
of the training dialog. Specify a threshold (1.0, or slightly larger - perhaps 1.001 or 1.01), and after training, any inputs with
sensitivity below the given threshold are automatically pruned.
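The pruning rule itself is simple, as this sketch shows (the variable names in the usage below are illustrative, not values SNN will report):

```python
def prune_by_sensitivity(ratios, threshold=1.001):
    """Split inputs into retained and pruned lists by sensitivity Ratio.

    Inputs whose ratio falls below the threshold contribute little (or
    are actively harmful) and are removed after training.
    """
    kept = sorted(v for v, r in ratios.items() if r >= threshold)
    pruned = sorted(v for v, r in ratios.items() if r < threshold)
    return kept, pruned
```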

Feature Selection Algorithms. The Feature Selection Algorithms conduct a large number of experiments with different
combinations of inputs, building probabilistic or generalized regression networks for each combination, evaluating the
performance, and using this to further guide the search. This is a "brute force" technique that may sometimes find results that
the much faster Intelligent Problem Solver misses. Feature selection approaches include forward stepwise selection,
backward stepwise selection, and the genetic algorithm.

Genetic algorithms are a particularly effective search technique for combinatorial problems of this type (where a set of
interrelated yes/no decisions needs to be made). The method is time-consuming (it typically requires building and testing
many thousands of networks), but STATISTICA Neural Networks' combination of the technique with fast-training
PNN/GRNN networks means that it runs as fast as possible. For reasonably sized problem domains (perhaps 50-100 possible
input variables, and cases numbering in the low thousands), the algorithm can be employed effectively overnight or on the
weekend on a fast PC. With sub-sampling, it can be applied in minutes or hours, although at the cost of reduced reliability
for very large numbers of variables.
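A genetic search over input subsets can be sketched as follows. This is illustrative only and does not reproduce SNN's operators: the population size, truncation selection, elitism, and mutation rate here are arbitrary choices, and `fitness` is a hypothetical callback that would build and evaluate a PNN/GRNN for the given input combination.

```python
import random

def ga_feature_select(fitness, n_vars, pop_size=30, generations=30, seed=0):
    """Minimal genetic algorithm over input-selection bit strings.

    fitness(mask) returns an error to be minimized for a tuple of 0/1
    flags (1 = input included).
    """
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_vars))
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness)
        parents = scored[:pop_size // 2]        # truncation selection
        children = [scored[0]]                  # elitism: keep the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_vars)      # one-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.2:              # occasional point mutation
                i = rng.randrange(n_vars)
                child[i] = 1 - child[i]
            children.append(tuple(child))
        pop = children
    return min(pop, key=fitness)
```

Each evaluation corresponds to training and testing one network, which is why pairing the search with fast-training PNN/GRNN networks matters.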

Open the Credit data file. Open STATISTICA Neural Networks, and on the Quick tab, click the Variables button; select RISK
as the dependent variable and V1-V9 as the independents, and click the OK button.

Select Feature Selection on the Startup Panel Advanced tab, and click the OK button to display the Feature (Independent
Variable) Selection dialog.

Display the Interactive tab, and in the Results reported group box, select the All combinations tested option button.

Click the OK button.

The Feature Selection in Progress dialog is displayed, which shows the progress at the bottom of the dialog (elapsed time
and progress bar). In the main part of the dialog a spreadsheet is updated showing the progress to date.

You can terminate the feature selection algorithm at any time by clicking the Finish button (there may be a slight delay
while the current test terminates).

When the algorithm finishes, the spreadsheet is transferred to the results workbook.

Each row represents a particular test (of a combination of inputs). The row label indicates the stage; e.g. 2.3 indicates the
third test in stage 2. The final row replicates the best result found, for convenience. The first column is the selection error of
the PNN or GRNN. Subsequent columns indicate which inputs were selected for that particular combination.

It is sometimes a good idea to reduce the number of inputs to a network even at the cost of a little performance, as this
improves generalization capability and decreases the network size and execution time. You can apply some extra pressure to
eliminate unwanted variables by assigning a Unit penalty (on the Advanced tab of the Feature Selection dialog). This is
multiplied by the number of units in the network and added to the error level in assessing how good a network is, and thus
penalizes larger networks. Typical values (if used) are in the range 0.01-0.001. However, it is not necessary to choose a Unit
penalty; usually the algorithm is run without it, or with a very low value to eliminate only very marginal input variables.
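The scoring rule described above amounts to the following sketch (the exact numbers in the usage note are illustrative):

```python
def penalized_score(error, n_units, unit_penalty=0.01):
    """Score a candidate network: error level plus a size penalty.

    The penalty term (unit_penalty * number of units) favors smaller
    networks, so a slightly worse but more compact network can win.
    """
    return error + unit_penalty * n_units
```

For example, at a Unit penalty of 0.01 a 10-unit network with error 0.35 (score 0.45) beats a 30-unit network with error 0.20 (score 0.50).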

If there are a large number of cases, the evaluations performed by the feature selection algorithms can be very time-
consuming (the time taken is proportional to the number of cases). For this reason, you can specify a sub-sampling rate.
However, in this case we have very few cases and a Sampling rate of 1.0 (the default) is fine (displayed on the Feature
Selection dialog Advanced tab).

Display the Feature Selection dialog, select the Genetic algorithm tab, and click the OK button. If the genetic algorithm
were run with the default settings, it would perform 10,000 evaluations (100 population times 100 generations). However,
since our problem has only nine candidate inputs, the total number of possible combinations is only 512 (2 raised to the 9th
power). STATISTICA Neural Networks automatically detects this fact, and performs the faster exhaustive evaluation. If the
exhaustive evaluation is taking an unreasonable amount of time on your computer, click the Finish button on the progress
dialog.
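The decision to switch to exhaustive evaluation can be sketched as follows; the 10,000-evaluation budget mirrors the default 100 population times 100 generations, and the rest is an illustrative sketch rather than SNN's internal logic.

```python
from itertools import product

def choose_search(n_vars, planned_evaluations=10_000):
    """Fall back to exhaustive search when trying every combination is
    no more work than the planned genetic-algorithm run."""
    return "exhaustive" if 2 ** n_vars <= planned_evaluations else "genetic"

def exhaustive_select(fitness, n_vars):
    """Evaluate every combination of included/excluded inputs."""
    return min(product((0, 1), repeat=n_vars), key=fitness)
```

With nine candidate inputs, 2**9 = 512 combinations, so exhaustive evaluation is chosen.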

Feature selection is usually performed as a preamble to training a network or networks using the selected features as inputs.
You can instruct STATISTICA Neural Networks to display either the Intelligent Problem Solver or Custom Network
Designer when the feature selection algorithm finishes, with the feature set it chooses transferred to the independent variable
selection.

Display the Feature Selection dialog, select the Quick tab, and in the Method group box, select the Forward selection option
button.

Display the Finish tab, and in the On completion group box, select the Start Custom Network Designer using selected
features option button.

Display the Advanced tab, set the Unit penalty to 0.01, and click the OK button.

This time, there is significant pressure to reduce the number of inputs, even at the cost of some loss of error performance.
The precise result may depend on the cases that are assigned to your training and selection sets. However, a common result
is to find that V6 and V7 are selected, possibly also with V4.

The forward stepwise and backward stepwise algorithms are usually quicker than the genetic algorithm if you have a
reasonably small number of variables (say, 80 or less), and are equally effective if there are not too many complex
interdependencies between variables. One disadvantage in this case is that these algorithms give a single, consistent result,
whereas the stochastic nature of the genetic algorithm means that you can run it a number of times and perhaps find
alternative solutions. On the other hand, with the forward and backward stepwise selection algorithms, you can see which
variables are added or removed in which order. In common with sensitivity analysis, this can give you a feeling for the
relative importance of the variables.
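Forward stepwise selection can be sketched as follows. Again this is illustrative: `error_fn` is a hypothetical callback standing in for building and evaluating a network on the given input set, and the unit penalty is applied per selected input rather than per network unit for simplicity.

```python
def forward_stepwise(error_fn, n_vars, unit_penalty=0.0):
    """Greedy forward selection over input variables.

    error_fn(selected) returns the model error for a frozenset of input
    indices. Returns the final set and the order inputs were added in,
    which gives a feeling for their relative importance.
    """
    selected, order = frozenset(), []
    best = error_fn(selected) + unit_penalty * len(selected)
    improved = True
    while improved:
        improved = False
        for v in range(n_vars):
            if v in selected:
                continue
            trial = selected | {v}
            score = error_fn(trial) + unit_penalty * len(trial)
            if score < best:                # best single addition this sweep
                best, choice, improved = score, trial, True
        if improved:
            order.append(next(iter(choice - selected)))
            selected = choice
    return selected, order
```

Backward stepwise is the mirror image: start with all inputs and greedily remove the one whose loss hurts least.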

Display the Feature Selection dialog, and on the Quick tab, select the Backward selection option button in the Method group
box.

Display the Interactive tab, and select the Best of each stage option button.

Display the Advanced tab, set the Unit penalty to 0.1, and click OK.

This Unit penalty is so high that the algorithm will actually suggest removing all input variables. However, it is useful to see the order in which the algorithm suggests removing the variables.

Run the Feature Selection algorithm again, this time selecting Method - Forward selection, and setting the Unit penalty to 0.
With no unit penalty, the algorithm suggests adding all the variables, roughly in the reverse order to the Backward selection
algorithm.

These experiments suggest a fairly consistent ordering of the variables, which can be verified by comparing with the rankings generated by sensitivity analysis.

SNN Example 20: Time Series


So far we have looked at a variety of classification and regression problems, in all of which the available data consists of a
set of cases, each containing values for a number of variables, where the objective is to predict the value of one or more
output variables from the input variables, and there is an inherent assumption that different cases are independent (see
Bishop, 1995).

There are many applications for time series problems, where variables are measured over a period of time, and there are (at
least expected to be) relationships between variables at successive times. In this case, our objective may be to predict the
value of a variable at a given time from the same and/or other variables at earlier times. In most cases, a single numeric
variable is observed, and the objective is to predict the next value of that variable from a number of preceding ones (ARIMA
models are frequently used in such circumstances).

Such a problem is actually just a specialized form of regression, and as such can be tackled by any form of neural network
suitable for use in regression, providing that the data set is suitably pre-processed into the correct form. This example
concentrates on one such typical time series problem, using the Series_G.sta file (this records the growth in airline
passengers over a period of time). However, it is worth noting that STATISTICA Neural Networks is not limited to time
series prediction using a single numeric variable: it is quite possible to predict multiple output variables, and nominal in
addition to numeric variables. STATISTICA Neural Networks can also predict a single step ahead (which is the most
common form of prediction) or multiple steps ahead.
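The pre-processing that turns a time series into a regression problem can be sketched as follows (an illustrative sketch; SNN performs this conversion internally):

```python
def make_lagged_dataset(series, steps=12, lookahead=1):
    """Recast a time series as regression cases.

    Each case takes `steps` consecutive observations as inputs and the
    value `lookahead` steps past the window as the target; the first
    `steps` points cannot themselves be predicted.
    """
    cases = []
    for t in range(len(series) - steps - lookahead + 1):
        cases.append((series[t:t + steps],
                      series[t + steps + lookahead - 1]))
    return cases
```

With 144 monthly observations and 12 input steps, this yields 132 usable one-step-ahead cases.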

MLPs for Time Series Prediction. Open the Series_G.sta data file via the File - Open menu; it is in the /Examples/Datasets
directory of STATISTICA. The first thing you will notice is that the data set includes only a single variable. We will be using
this both as the input and output of the neural network (but, of course, at different time steps).

Select Neural Networks from the Statistics - Data Mining menu to display the STATISTICA Neural Networks (SNN) Startup
Panel. Select the Time Series option button in the Problem type group box and select Intelligent Problem Solver on the
Neural Networks Startup Panel - Quick tab. Click the Variables button and select SERIES_G as both the Continuous Outputs
and the Continuous Inputs.

Click OK to display the Intelligent Problem Solver dialog.

For Time Series problems, the Intelligent Problem Solver has to make an additional design decision - the number of time
series steps to use as input to the network.

Determining the correct number of input steps is a difficult problem for the Intelligent Problem Solver. Frequently, it is
possible for you to determine the correct number, which simplifies the search task for the Intelligent Problem Solver. For
example, if the problem contains a natural cycle period (the Series_G data set contains monthly airline passenger figures, so there is a definite period of length 12 in the data), then you should try specifying that cycle, or an integral multiple of it.

If the time series has no natural period, or you are not aware of one, then specify a range of possible values, and
STATISTICA Neural Networks will determine a period for you. In this case, it is recommended that you run the Intelligent Problem Solver several times with a large number of iterations and, once it has determined a reasonable period, rerun it with a fixed period.

In this case, we will specify a period of 12, as this is the natural cycle of the time series.

Click on the Time Series tab, and set both the Minimum and Maximum steps under Range for steps (number of time steps
used as inputs) to 12.

Click the OK button to train the network.

When the Intelligent Problem Solver finishes, click on the Advanced tab of the Results dialog, and click the Time series projection button to display the Time Series Projection dialog.

Neural networks designed by the Intelligent Problem Solver for time series problems perform one-step-ahead prediction -
they predict the next time step from a series of previous time steps. By dropping the oldest of the original input points,
adding the newly predicted value, and rerunning the network, a prediction can be made a further step ahead. This process
can be repeated to generate an entire time series of predictions.
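The feedback loop just described can be sketched as follows (a sketch; `predict` is a hypothetical stand-in for executing the trained one-step-ahead network):

```python
def project_ahead(predict, window, n_steps):
    """Project a time series n_steps into the future by feeding each
    one-step-ahead forecast back in as the newest input.

    predict(window) returns the next value given the most recent
    observations held in `window`.
    """
    window = list(window)
    projections = []
    for _ in range(n_steps):
        nxt = predict(window)
        projections.append(nxt)
        window = window[1:] + [nxt]   # drop oldest, append the forecast
    return projections
```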

As any prediction errors will rapidly accumulate, such multiple-step predictions can only be trusted if the network has very
high predictive accuracy. In the case of the Series_G data, it is possible to get a standard deviation ratio of 0.13, or better,
which is sufficient to project ahead quite a few time steps.

The Time Series Projection dialog allows you to generate a graph showing the results of projecting ahead a given number of
time steps.

STATISTICA Neural Networks can start the time series projection either from a pattern extracted from the current data set, or
from a user-specified pattern. We will use the default, which is to start at the first available pattern in the data set (so we will be able to compare the prediction with the entire data set).

The only control parameter we need to select is the Length of the projection. Our time series has only 144 cases, of which
twelve are effectively removed by pre-processing (predictions cannot be made for the first twelve points, as they do not have
sufficient preceding points to make the prediction), so the maximum available data to compare with is 132 steps. However, it
is quite permissible to project beyond the end of the available data: we will then see the predicted value projected beyond the
end of the data set, although we will not have a standard of comparison beyond that point.

Set the Length of the projection to 250, and click the Time series graph button.

The results may seem somewhat disappointing - the prediction saturates after a time. This is due to the influence of the
sigmoid functions in the hidden layer of the network, which saturate as the input grows.
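The saturation effect is easy to demonstrate with the logistic function itself:

```python
import math

def logistic(x):
    """Logistic sigmoid, the activation typically used in MLP hidden units."""
    return 1.0 / (1.0 + math.exp(-x))

# As its input grows, the unit's output flattens toward 1.0, so a network
# asked to extrapolate an ever-growing trend eventually stops rising.
```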

In any event, it might actually be considered irrelevant to predict too far ahead - figures for the next year or two may be quite
adequate, or the reliability of more distant predictions may be anyway questionable. Nonetheless, the above example does
demonstrate a restriction of neural networks if they are blindly applied in circumstances where they will be required to
extrapolate beyond known data.

One solution to this is to pre-process the data into a more appropriate form. In this example, we may speculate that the
underlying trend (if we ignore the seasonal component) is linear. We can remove the trend by building a linear model, and
then use the more sophisticated neural model to estimate the residual.
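The hybrid scheme can be sketched as follows. This is an illustrative sketch: `residual_model` is a hypothetical stand-in for the trained neural network's residual estimate, and the linear fit is plain ordinary least squares on the time index.

```python
def fit_line(series):
    """Ordinary least-squares trend y = a + b*t over time index t."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    b = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series)) \
        / sum((t - mean_t) ** 2 for t in range(n))
    return mean_y - b * mean_t, b    # intercept a, slope b

def hybrid_forecast(series, residual_model, t):
    """Linear trend prediction corrected by the estimated residual."""
    a, b = fit_line(series)
    return a + b * t + residual_model(t)
```

Because the linear part handles the trend, the neural model only has to learn the bounded residual, which sidesteps the saturation problem when extrapolating.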

To do this, use a Linear network. The Intelligent Problem Solver has probably already produced one (if not, run it again,
selecting 12 steps and the Linear network type. Since there is only one possible network with this specification, the training
process will be extremely quick.)

Click Cancel on the Time Series Projection dialog to return to the Results dialog. Click the Select models button to display
the Select Networks and/or Ensembles dialog. Select a Linear model and click OK. The top of the Results dialog will now
show one Linear network.

On the Quick tab of the Results dialog, click the Residuals button. The second column of the output spreadsheet gives the
residual (observed minus predicted). Select this column and select Copy with Headers from the Edit menu to copy the
entire column to the Clipboard. Now, click on the Series_G.sta data set and add a variable (from the Vars - Add menu), and
click Paste on the Edit menu to add the residuals to the data set as a new variable.

The data set should now contain the original variable, SERIES_G, and the residual from the linear model, R.SERIES_G.2 (or
similar, depending on the index number of the model, which supplies the postfix). (A portion of the spreadsheet is shown
below. Remember that yours will differ slightly.)

The first twelve cases in the data set are automatically treated specially by the Intelligent Problem Solver, as they cannot be
used in time series predictions (not having sufficient preceding cases). We are going to build a neural network that estimates
the residuals from the linear model, which we have just transferred into the data set. The first twelve residuals are
meaningless, and we will now need 12 preceding residuals to train our new network. Therefore, we need to ensure that all
the cases up to the 24th are not used in training.

Begin a new analysis by selecting Neural Networks from the Statistics - Data Mining menu to display the STATISTICA
Neural Networks (SNN) Startup Panel. Select the Time Series option button in the Problem type group box and select
Intelligent Problem Solver on the Neural Networks Startup Panel - Quick tab. Click the Variables button and select
R.SERIES_G.2 as both the Continuous Outputs and the Continuous Inputs. Click OK to display the Intelligent Problem
Solver dialog.

Click the Sampling button to display the Sampling of Case Subset for Intelligent Problem Solver dialog, then click the Select
Cases button to display the Select Cases dialog.

Specify the range of cases 25-144, and click OK to return to the Sampling of Case Subset for Intelligent Problem Solver
dialog, then OK again to return to the Intelligent Problem Solver dialog.

On the Quick tab, clear the Select a subset of independent variables check box.

Set Networks retained to 1.

Click on the Types tab, and clear all types except for Three layer perceptron.

Click OK to train the network.

When the Results dialog is displayed, select the Advanced tab and click the Time series projection button to display the Time
Series Projection dialog.

It is much more difficult to interpret the results shown on the Time Series Projection dialog in this case. Click the Time series graph button to produce a time series graph.

Click Cancel to return to the Results dialog and click the Descriptive statistics button. This can give us an idea of the
explained variance. In our test run, the S.D. Ratio of the best network was 0.6, indicating that the multilayer perceptron was
able to predict about a third of the remaining residual value once the linear component of the problem had been removed.

As the data set is very small, we also tried using the entire set for training, and achieved an S.D. Ratio of 0.26 on this problem,
accounting for three quarters of the remaining residual value. This demonstrates that the original data series has a great deal
of non-linear structure. It is thus possible to use a hybrid model, where you first form a linear prediction, then adjust that
using the residual estimation of the multilayer perceptron. However, a greater data volume is required for increased
accuracy.
