
A Comparative Analysis of Neural Networks and Statistical Methods for Predicting Consumer Choice

Patricia M. West * Patrick L. Brockett * Linda L. Golden
Department of Marketing Administration, University of Texas at Austin, Graduate School of Business, CBA 7.202, Austin, Texas 78712

Abstract
This paper presents a definitive description of neural network methodology and provides an evaluation of its advantages and disadvantages relative to statistical procedures. The development of this rich class of models was inspired by the neural architecture of the human brain. These models mathematically emulate the neurophysical structure and decision making of the human brain, and, from a statistical perspective, are closely related to generalized linear models. Artificial neural networks are, however, nonlinear and use a different estimation procedure (feed forward and back propagation) than is used in traditional statistical models (least squares or maximum likelihood). Additionally, neural network models do not require the same restrictive assumptions about the relationship between the independent variables and dependent variable(s). Consequently, these models have already been very successfully applied in many diverse disciplines, including biology, psychology, statistics, mathematics, business, insurance, and computer science. We propose that neural networks will prove to be a valuable tool for marketers concerned with predicting consumer choice. We will demonstrate that neural networks provide superior predictions regarding consumer decision processes. In the context of modeling consumer judgment and decision making, for example, neural network models can offer significant improvement over traditional statistical methods because of their ability to capture nonlinear relationships associated with the use of noncompensatory decision rules. Our analysis reveals that neural networks have great potential for improving model predictions in nonlinear decision contexts without sacrificing performance in linear decision contexts. This paper provides a detailed introduction to neural networks that is understandable to both the academic researcher and the practitioner. This exposition is intended to provide both the intuition and the rigorous mathematical models needed for successful applications. In particular, a step-by-step outline of how to use the models is provided along with a discussion of the strengths and weaknesses of the model. We also address the robustness of the neural network models and discuss how far wrong you might go using neural network models versus traditional statistical methods.

Herein we report the results of two studies. The first is a numerical simulation comparing the ability of neural networks with discriminant analysis and logistic regression at predicting choices made by decision rules that vary in complexity. This includes simulations involving two noncompensatory decision rules and one compensatory decision rule that involves attribute thresholds. In particular, we test a variant of the satisficing rule used by Johnson et al. (1989) that sets a lower bound threshold on all attribute values and a "latitude of acceptance" model that sets both a lower threshold and an upper threshold on attribute values, mimicking an "ideal point" model (Coombs and Avrunin 1977). We also test a compensatory rule that equally weights attributes and judges the acceptability of an alternative based on the sum of its attribute values. Thus, the simulations include both a linear environment, in which traditional statistical models might be deemed appropriate, as well as a nonlinear environment where statistical models might not be appropriate. The complexity of the decision rules was varied to test for any potential degradation in model performance. For these simulated data it is shown that, in general, the neural network model outperforms the commonly used statistical procedures in terms of explained variance and out-of-sample predictive accuracy.

An empirical study bridging the behavioral and statistical lines of research was also conducted. Here we examine the predictive relationship between retail store image variables and consumer patronage behavior. A direct comparison between a neural network model and the more commonly encountered techniques of discriminant analysis and factor analysis followed by logistic regression is presented. Again the results reveal that the neural network model outperformed the statistical procedures in terms of explained variance and out-of-sample predictive accuracy. We conclude that neural network models offer superior predictive capabilities over traditional statistical methods in predicting consumer choice in nonlinear and linear settings.

(Consumer Decision Making; Neural Networks; Statistical Techniques)
Marketing Science, Vol. 16, No. 4, 1997, pp. 370-391
Copyright © 1997, Institute for Operations Research and the Management Sciences


1. Introduction
The study of artificial neural networks (ANN) has drawn considerable attention in many disciplines, including biology, psychology, statistics, mathematics, business, and computer science. The development of this rich class of nonlinear models was inspired by the neural architecture of the human brain, which consists of multiple levels of neurons and synaptic connections with information transfer between neurons across synaptic arcs (Rosenblatt 1961). Due to the empirical success of artificial neural networks at predicting the outcome of complex nonlinear processes, the methodology has become recognized for its superior forecasting ability and is receiving considerable attention from statisticians. We propose that artificial neural networks will prove to be a valuable tool for marketers concerned with predicting the outcome of consumer decision processes, especially those that involve the use of noncompensatory choice heuristics. Figure 1 (to be discussed in more detail in §3.2) graphically illustrates three commonly used decision rules for consumer choice. The "Satisficing" and the "Latitude of Acceptance" rules are both nonlinear and noncompensatory, and pose severe prediction problems for standard linear statistical models. We show that neural network models have great potential for improving model predictions in these nonlinear decision contexts without sacrificing performance in linear decision contexts. Research in the area of artificial neural networks has moved along two distinct lines (Hart 1992). Cognitive scientists have built "connectionist" models of brain behavior to further understand human cognition (see Rumelhart, McClelland, and the PDP Research Group 1986 for a review). Some of the specific skills examined have been speech perception (McClelland and Elman 1986) and decoding (Elman and Zipser 1988), reading (McClelland 1986), learning and memory (McClelland and Rumelhart 1986), recognizing handwritten characters (Fukushima and Miyake 1984), inference generation (Shastri and Ajjanagadde 1993), and recognition and cued-recall (Chappell and Humphreys 1994). The purpose of these studies has been to better understand and model the associative nature of human cognition.

Figure 1. Threshold-Based Decision Rules

[Each panel plots Attribute 1 (0 to 1000) against Attribute 2 (0 to 1000) and marks the "accept" and "reject" regions implied by the rule.]

(a) Satisficing Rule: A threshold is set for each attribute. Only alternatives that exceed each attribute threshold are deemed "acceptable."

(b) Latitude of Acceptance Rule: An acceptance range is set for each attribute. Only alternatives that fall within the shared acceptance region are deemed "acceptable."

(c) Weighted-Additive Rule: A threshold is set for the combined value of the attributes. Only alternatives whose attribute sum exceeds the threshold are considered "acceptable."

On the other hand, information scientists, statisticians, and outcome-based researchers examine the mathematical properties of the neural network models, including their performance in such data-analysis tasks as classification (Archer and Wang 1993), multiple criteria decision making (Malakooti and Zhou 1994, Wang 1994, Wang and Malakooti 1992, Hart 1992), regression (Dutta and Shekhar 1988, Surkan and Singleton 1990, Treigueiros and Berry 1991),

discriminant analysis (Curram and Mingers 1994, Yoon, Swales, and Margavio 1993), and even the financial problems of portfolio choice and financial insolvency monitoring (Brockett et al. 1994). In each of these applications the role of the neural network is to enhance mathematical approaches to decision making. Neural networks are shown in these studies to outperform frequently used multivariate statistical techniques for these classification and decision-making tasks. Our intent is to demonstrate that neural networks offer significant improvement over traditional statistical methods generally used to model consumer judgment and decision making because of their ability to capture nonlinear relationships associated with the use of noncompensatory decision rules (Ganzach 1995, Payne, Bettman, and Johnson 1993). This paper reports the results of two studies. The first is a numerical simulation comparing the ability of neural networks with discriminant analysis and logistic regression for predicting choices made by noncompensatory decision rules that vary in complexity. The second is an empirical study bridging the behavioral and statistical lines of research: We examine the predictive relationship between retail store image variables and consumer patronage behavior. A direct comparison between a feed-forward backward propagation neural network model and the more commonly encountered techniques of discriminant analysis and logistic regression will be presented. Both the simulation results and the empirical analysis reveal that the neural network model performed as well or better than commonly used statistical procedures in terms of forecasting ability (out-of-sample predictive accuracy). In the context of consumer behavior, a strength of the neural network approach lies in its ability to mimic the functioning of the human brain and to estimate nonlinear and noncompensatory processes without first presupposing a parametric relationship between product attributes, perceptions, and behavior. We propose that neural networks hold great promise for enabling the prediction of behavior based on product attributes and/or image, particularly when the underlying choice process is noncompensatory in nature. From a cognitive perspective, the architecture of these models is consistent with the widely accepted

spreading activation model of human memory (Anderson 1976, Collins and Loftus 1975, Quillian 1968) and pattern recognition models of categorization and learning (Simon 1996). We propose that these models are also well suited for representing judgment and decision making, which often entails nonlinear and noncompensatory processing of information (Payne et al. 1993). From a statistical perspective, these models are closely related to generalized linear model techniques, but provide superior predictions regarding consumer attitudes and choice over traditional statistical methods because the neural network does not require the same restrictive assumptions about the relationship between the independent variables and dependent variable(s).1

1.1. Challenging the Robustness of Simple Linear Models

Studies using compensatory linear models have abounded since the seminal writing of Martin Fishbein (1967) on attitudes. These models have been used to examine consumer preference (see Green and Srinivasan 1978, 1990 for reviews), attitudes (Wilkie and Pessemier 1973), judgment and decision making (Dawes and Corrigan 1974), and have been touted as being normatively appropriate for multiattribute decision making (Keeney and Raiffa 1976). From the perspective of mathematical convenience, one of the strengths of compensatory linear models has been the ease with which the models could be estimated using ordinary least squares regression, analysis of variance procedures, logistic regression, and discriminant analysis. These models are frequently used because of the common belief in their ability to mimic consumer judgment and choice processes. Challenges to the basic linear-additive structure of these models, however, have been levied almost since their inception. Anderson (1970, 1971, 1991) argued that in addition to the simple additive model, both
1Statistical models require the imposition of a parametric formulation for the relationship under study (e.g., a linear relationship, logit linear relationship, or a multivariate normal distribution within groups with equal covariance matrices, etc.). If the assumed relationship is too restrictive, then the predictive ability of the model is compromised relative to the flexible (indeed, universal approximator) form of the neural network.

multiplicative and multilinear models are also legitimate combination rules for describing the information integration process (see Lynch 1985 for a discussion of testing alternative integration rules). In fact, Coombs and Avrunin (1977) demonstrated that preference is often not linear, but rather a single-peaked function. For example, most coffee drinkers who prefer cream and sugar would attest to the fact that there is an optimal level of each that is preferable. Moreover, unlike the compensatory structure of linear-additive models, Payne et al. (1993) demonstrated that much of consumer judgment and choice is the result of simple decision rules that are noncompensatory (see Figure 1(a) and 1(b)) in nature rather than compensatory (see Figure 1(c)). These noncompensatory rules often involve the use of attribute thresholds that allow for easy elimination or acceptance of an alternative. Johnson, Meyer, and Ghose (1989) demonstrated that compensatory statistical models, such as the linear-additive structure, fail to capture noncompensatory decision rules, particularly when attributes are negatively correlated. In each of the above cases, the deviations from the basic linear-additive structure have been tested, and methods developed for modeling nonlinear rules by incorporating the appropriate exponential terms and multiplicative interactions into the regression models. Noncompensatory decision processes can be modeled when the nature of the rule is known a priori by estimating separate slopes for the regions above and below the attribute thresholds. Essentially, to model a nonlinear process or noncompensatory decision rule a specific relationship is postulated and then examined for reduction in error variance relative to a more parsimonious linear model. We shall show in this paper that neural network modeling may offer significant advantages over the commonly used estimation procedures described above for assessing consumers' attitudes, preferences, judgment, and choice behavior. Neural network models provide superior predictive capabilities when the nature of the decision rule being applied is unknown. This conclusion is consistent with that of White (1989) who states that neural network models provide a "rich and interesting class of new methods

applicable to a regression problem requiring some sort of flexible functional form." In the following section we describe the ANN model in some detail. Subsequently we examine the results of two studies. First, we present a simulation that tests the performance of the ANN against traditional linear models for capturing three commonly observed consumer choice rules. Second, we test how the ANN performs in practice using an empirical data set examining consumer patronage behavior. Finally, we provide a set of guidelines for using ANN models and discuss some practical implications for managers.

2. Background on Neural Network Methods


The neural network model can be represented as a massive parallel interconnection of many simple processing units connected structurally in much the same manner as individual neurons in the brain. Just as the individual neurons in the brain provide "intelligent learning" through their constantly evolving network of interconnections and reconnections, artificial neural networks function by constantly adjusting the values of the interconnections between neural units (see Figure 2). In building a mathematical model of this biological process, we create a "node" and "arc" structure analogous to their biological counterparts. The nodes or "neurons" in the model are connected to each other via interconnection weights representing arcs. The interconnection weights are analogous to the lack of resistance of the electrical gap or "synapse" between neurons in the brain. In the brain, neural pathways become strengthened if they are useful, and the resistance at the synapse changes (reduces) to reinforce the ease of transmission of energy along successful neural pathways. In artificial neural networks, pathways also become strengthened if they are useful; however, it is the weight between neurons that changes (increases) to reinforce a successful neural pathway. The process by which the network judges success, learns to improve its performance, recognizes patterns, and develops generalizations is called the "training rule." The learning law proposed by Hebb (1949) served as the starting point for developing the training algorithms of neural networks. The subsequent development of the "back-propagation" training rule resolved

Figure 2. Neural Network Topology

(a) A single neural processing unit (neuron) j. The inputs to the neural unit are weighted (w_i), aggregated (Σ_i w_i x_i), and passed through an activation function F to produce the "output" of the neuron, F(Σ_i w_i x_i).

(b) Multiple neural processing units arranged in an input layer, a hidden layer, and an output layer; each circle in the hidden layer represents a single neural processing unit. The "output" of the neural net is F(Σ_{j=1}^{J} w_j^(2) H_j), where each hidden unit computes H_j = F_j(Σ_{i=1}^{I} w_ij^(1) x_i).

certain computational problems outstanding for two decades and significantly enhanced the performance of neural networks in practice (Smith 1996). ANNs are based on a "feed-forward" system whereby the flow of the network is from input toward output (as, for example, occurs in path models and structural equation or maximum likelihood factor analysis "causal" models). Via "back propagation" the network updates the interconnection weights by starting at the derived output value, determining the error produced with the current configuration, and then propagating the error backward through the network to determine, at an aggregate level, how to best update the interconnection weights between individual neurons to improve the

overall predictive accuracy. The method of steepest gradient descent is used for updating the weights to minimize total aggregate error.
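In generic form, the steepest-descent update of a weight w can be written as below; this is the standard statement of the rule, given here for orientation (the detailed derivation used for the networks in this paper appears in Appendix 1):

    \[ w_{\mathrm{new}} \;=\; w_{\mathrm{old}} \;-\; \alpha \, \frac{\partial E}{\partial w}, \qquad E \;=\; \tfrac{1}{2} \sum_{k} \bigl( O_k - \hat{O}_k \bigr)^2 , \]

where α > 0 is a learning-rate constant, E is the squared prediction error (accumulated over a batch or evaluated record by record), O_k is the observed output for pattern k, and Ô_k is the output computed by the network.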

2.1. The General Neural Network Model

All neural networks possess certain fundamental features. For example, the basic building block of an ANN is the "neural processing unit" or "neuron," which takes a multitude of individual inputs x = (x_1, . . . , x_I), determines (through the learning algorithm) the optimal connection weights w = (w_1, . . . , w_I) that are appropriate to apply to each input, then aggregates these weighted values to concatenate the multiple inputs

into a single value Σ_{i=1}^{I} w_i x_i for the neuron. An activation function, F, is then applied to the aggregated weighted value to produce an individual output F(Σ_{i=1}^{I} w_i x_i) for the specific neural unit. The logistic activation function F(z) = 1/(1 + exp(-z)) is commonly used, although other sigmoid functions such as the hyperbolic tangent function are possible.2 The logistic function has its steepest slope near the threshold (intercept) point η, indicating that the relative impact of inputs on the corresponding output values is most pronounced near the threshold, while extreme or outlier values have a decreasingly dramatic effect upon predictions. Figure 2(a) graphically displays the configuration of the single neuron as described above. Until now, we have discussed the working of a single neuron; however, the topology of most neural networks generally involves a collection of neurons that are configured in two layers (i.e., an input layer and an output layer) or three layers (i.e., an input layer, hidden layer, and output layer).3 For each layer of the neural network the same construction process is used to create an array of neural processing units. The ultimate topology for the network is obtained by connecting the input layer units (via connection weights w^(1)) to the hidden layer units, and connecting the hidden layer units (via connection weights w^(2)) to the output unit (see Figure 2(b)).
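For illustration, the feed-forward computation for a single hidden layer network with logistic activations can be sketched in a few lines of code. The sketch below is a minimal illustration of the structure just described, not the implementation used in the studies; the weight and threshold values are arbitrary placeholders.

    import numpy as np

    def logistic(z):
        # Logistic activation F(z) = 1 / (1 + exp(-z)).
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, eta1, w2, eta2):
        # x: inputs (I,); W1: input-to-hidden weights (I, J); eta1: hidden thresholds (J,)
        # w2: hidden-to-output weights (J,); eta2: output threshold (scalar).
        H = logistic(x @ W1 - eta1)      # hidden-unit values H_1, ..., H_J
        return logistic(H @ w2 - eta2)   # network output in (0, 1)

    rng = np.random.default_rng(0)
    x = np.array([0.4, 0.7])                               # I = 2 inputs
    W1, w2 = rng.normal(size=(2, 3)), rng.normal(size=3)   # J = 3 hidden units
    print(forward(x, W1, np.zeros(3), w2, eta2=0.0))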

2The output range for the logistic sigmoid function is (0, 1), whereas the output range for the hyperbolic tangent sigmoid function is (-1, 1). The logistic sigmoid function has desirable mathematical properties (e.g., F'(z) = F(z)[1 - F(z)]), which can be used to simplify the computation of gradients and elasticities. These mathematical advantages, together with the insensitivity of the ultimate results to the precise choice of sigmoid function, have contributed to the dominance of the logistic function in practical applications. A discussion of the attributes of various sigmoid formulae is given in Menon et al. (1996). Note also that 2(1/(1 + exp(-x))) = 1 + tanh(x/2), so that the logistic and hyperbolic tangent functions are essentially mathematically equivalent. 3There is a lack of consistency in the literature with regard to the counting of the number of layers in a network (cf., Abbott 1996, Hart 1992, Rumelhart et al. 1996). These differences in terminology can be resolved by referring to the structure of the network based on the number of "hidden layers" it contains. Thus, the model presented in Figure 2(b) would be referred to as a "single hidden layer" neural network.

2.2. Network Neural Processing Units and Layers

As indicated above, artificial neural networks are formed by modeling the interconnections between the individual neural processing units. It follows from the neural network topology given in Figure 2(b) that, when a logistic activation function is used, the mathematical structure of the neural network without any hidden layer is isomorphic to the standard logistic regression model. Thus, the neural network models considered here generalize the logistic regression models commonly used in marketing and social science research.4 Rosenblatt (1959) developed the simplest neural network without a hidden layer, provided a convergence algorithm for adjusting weights, and demonstrated that if the inputs presented to the network come from two linearly separable classes then the algorithm yields a hyperplane that distinguishes the two classes (see Figure 1(c)). However, networks that possess a hidden or intermediate processing layer between the input layer and output layer are able to represent more complex nonlinear rules, such as noncompensatory decision rules commonly used in consumer choice situations (see Figure 1(a) and 1(b)). Further substantiation for using a single hidden layer topology for modeling complex behavioral problems is derived from the following mathematical results. Prior research proves that a single hidden layer neural network allows for universal approximations of any continuous functional relationship between inputs and outputs (Funahashi 1989, Hornik, Stinchcombe, and White 1990). Consequently, any discontinuous functional relationship that can be approximated by a continuous function can also be universally approximated by a single hidden layer network. The hidden layer of the network captures nonlinearities and interactions between variables (Lippmann 1987). Essentially, these results show that the class of neural network models having a single hidden layer is "dense" in the space of all continuous functions of n variables, so that no matter what is the "true" (but unknown)
4Other standard statistical models also arise as subsidiary models of the simplest neural networks. For example, the Probit model arises in the two-layer network with a summative aggregation function and an activation function that is the standard normal distribution function.

functional relationship between the input variables and the output variable, it can be well approximated by a single hidden layer neural network model. Indeed, such single hidden layer neural networks are able to predict the failure of savings and loan companies (cf., Salchenberger, Cinar, and Lash 1992) and the failure of property and casualty insurance companies (cf., Brockett et al. 1994) better than traditional statistical methods.

2.3. The Back-Propagation Algorithm

One of the differentiating characteristics between neural network models and traditional statistical procedures is the back-propagation estimation algorithm used by neural networks. Much like the techniques used for maximum likelihood estimation, the back-propagation algorithm can be viewed as a gradient search technique wherein the objective function is to minimize the squared error between an observed output for a given configuration of input values and the computed output of the network using the same input values. A primary difference is that the back-propagation algorithm can sequentially consider data records, readjusting the parameters after each observation in a gradient search manner, whereas estimation procedures such as maximum likelihood and least squares use an aggregated error across the entire sample in the estimation.5 For example, the neural network is trained by presenting an input pattern vector X to the network and computing forward through the network until an output vector Ô is obtained. The output error is computed by comparing the computed output Ô with the actual output O for the input X. The network learns by adjusting the weights of each neural processing unit in such a fashion as to reduce this observed prediction error. The prediction errors are swept backward through the network, layer by layer, to associate a "square error derivative" (delta) with each processing unit, to compute a gradient from each delta, and finally to update the weights of each processing unit based upon the corresponding gradient (see Appendix 1).
5The back-propagation method can be specified to update the weights after processing small or large "batches" as well as after each individual observation by using the aggregate error for the batch in computing the error gradient.

This process is then repeated beginning with another input/output pattern in the training set. After all the patterns in the training set have been used, the algorithm examines the training set again one by one and readjusts the weights throughout the entire network structure until either the objective function (sum of squared prediction errors on the training sample) is sufficiently close to zero or the default number of iterations is reached. Eberhart and Dobbins (1990) present a computer algorithm implementing the backpropagation technique described above.
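The record-by-record updating described above can be sketched as follows for a single hidden layer network with logistic activations and a squared-error objective (thresholds absorbed as bias terms). This is a simplified, generic sketch rather than the algorithm of Appendix 1 or of Eberhart and Dobbins (1990); the learning rate and network size are arbitrary.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_backprop(X, y, n_hidden=3, lr=0.1, n_passes=100, seed=0):
        # X: (n, I) input patterns; y: (n,) observed outputs in {0, 1}.
        rng = np.random.default_rng(seed)
        n, I = X.shape
        W1 = rng.normal(scale=0.1, size=(I, n_hidden))   # input -> hidden weights
        b1 = np.zeros(n_hidden)                          # hidden thresholds
        w2 = rng.normal(scale=0.1, size=n_hidden)        # hidden -> output weights
        b2 = 0.0                                         # output threshold
        for _ in range(n_passes):                        # repeated sweeps over the training set
            for x, target in zip(X, y):
                H = logistic(x @ W1 + b1)                # feed forward through the hidden layer
                out = logistic(H @ w2 + b2)              # computed output
                err = out - target                       # prediction error for this record
                # Back-propagate: squared-error deltas for the output and hidden units,
                # using F'(z) = F(z)(1 - F(z)) for the logistic activation.
                delta_out = err * out * (1.0 - out)
                delta_hid = delta_out * w2 * H * (1.0 - H)
                # Gradient-descent weight updates for this record.
                w2 -= lr * delta_out * H
                b2 -= lr * delta_out
                W1 -= lr * np.outer(x, delta_hid)
                b1 -= lr * delta_hid
        return W1, b1, w2, b2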

3. Implementation Procedures
Depending on the sophistication of the software being used to estimate the neural network model, a researcher may be called on to make a number of decisions regarding the topology of the network, the activation function used for training, and the stopping rule used for terminating training. We examine these issues in detail and provide a set of guidelines for model implementation.

3.1. Variable Selection

The choice of which variables to include in the model is an important consideration for any practical statistical model building procedure, neural networks included. Various statistical devices can be used to select pertinent variables or reduce the number of input variables. For example, logistic regression, which represents a simplified version of a neural network, can be used to create a "super list" of potentially significant variables by examining the t statistics of the parameters. Once the neural network has been best trained on this "super list," examination of the variables' neural connection weights can be used to identify prospective variables for elimination (see Appendix 2 for sensitivity calculations). Variables with small connection weights (i.e., low sensitivities) are good candidates for elimination. A new network is then created without these variables and the performance assessed. Alternatively, an information theoretic technique can be used to order variables with respect to the information they provide on the outcome variable. Networks can then be built by adding variables one at a time and examining if there is improved performance. Once the

selection of variables is made, the model building can proceed.

3.2. Choosing an Activation Function

The activation function used in neural network modeling is overwhelmingly the logistic activation function. The choice of the activation function depends upon the desired network output range. Within the class of sigmoid functions having the same output range, it matters little which particular sigmoid function is used. However, the logistic sigmoid function has certain desirable mathematical and computational properties (see Footnote 2). These mathematical advantages, together with the insensitivity of the ultimate results to the precise choice of sigmoid function, have contributed to the dominance of the logistic function in practical applications.

3.3. Cross-Validation

If your objective is to build a model that is generalizable to new data (i.e., you are interested in forecasting as opposed to merely fitting an existing data set), then it is important to use a validation sample to prevent overfitting the model. Cross-validation procedures are used to determine how well a network captures the nature of a functional relationship. This entails splitting the data into subsamples in order to validate the network's performance on additional examples that were not used in training the model. The data set should be partitioned into a training sample (T1), a validation sample (T2), and a testing sample (T3). While there is not a precise figure for the size of these three samples, the "rule of thumb" we have used is 60 percent for T1 and 20 percent each for T2 and T3. Subsample T1 is used to determine the network configuration and connection weights (i.e., for model building and parameter estimation). However, we cannot determine when to stop training simply by examining the error on the training sample T1, because the error continues to decline as the network learns the idiosyncrasies in the training sample (see Figure 3). Therefore, we spot overfitting by simultaneously observing the error on the validation sample, T2. The set T3 is used to test the network's out-of-sample predictive accuracy (generalizability).

Figure 3. Stopping Rule for Training the Neural Network

This figure was adapted from Smith (1996, p. 127). The training sample (T1) is used to determine the network configuration and connection weights. Because the error continues to decline as the network learns the idiosyncrasies in the training sample, the validation sample (T2) is used to spot overfitting by simultaneously observing the error in the validation sample. Training is stopped when the error in the validation sample begins to increase.

[The figure plots error (roughly .02 to .08) against the number of iterations (0 to 100): the training-sample curve declines steadily, while the validation-sample curve declines and then turns upward; "stop training" marks the point where the validation error begins to rise.]

In summary, at each iteration the interconnection weights of the network are estimated using T1. These parameters are used to validate the developing model, using the data from T2, in order to determine when training should cease. The final weights are applied to T3 to determine the network's out-of-sample predictive accuracy.

3.4. Stopping Rule

An important part of training the neural network is knowing when it is most appropriate to stop. Extending the training process too long can result in "overfitting" the idiosyncrasies of the training data set at the expense of the network's ability to generalize to other data configurations. This is similar to the statistical issue of parsimonious model selection: better-fitting models can be obtained for any particular data set by merely increasing the predictor variable set, but at the expense of out-of-sample predictive accuracy. Conversely, in neural network training, too few iterations of the data will prevent the network from adequately learning to discern the pertinent relationships between the input patterns and the output variable(s). The validation sample T2 is used to determine the appropriate number of iterations through the data to be used for adjusting the weights in the network, as is illustrated in Figure 3. When the error on the validation sample begins to increase we know that overfitting has occurred. At this point we stop training and go back to the earlier weights that produced less error

on T2. The inflection point in the curve for the validation sample, which is illustrated in Figure 3, represents the "peak of generalizability" to a holdout sample. Another criterion for deciding when to stop training is to continue iterative changes in the weights until the weight matrix reaches stationarity (Malakooti and Zhou 1994), or to continue training until the percentage misclassified in the training sample no longer continues to improve (Curram and Mingers 1994). If primary interest lies in out-of-sample predictive accuracy, then both of these methods run the risk of training the network to recognize idiosyncrasies of the training data set at the expense of generalizability. While these latter methods are common in statistical estimation (such as maximum likelihood) done in a "batch mode," they are not recommended for creating predictive neural network models.

3.5. Network Configuration

It has been proven that any continuous functional form can be well approximated by a single hidden layer neural network. Existing theory does not, however, provide a rule of thumb for determining the number of neurons (hidden units) to use in the hidden layer. Consequently, the appropriate number of hidden units to use will depend on the nature of the data and the problem. Each additional hidden unit adds capacity to the network because it increases the degrees of freedom with which the model can fit the data. Nevertheless, to prevent overfitting we also need to limit the number of hidden units. In practice, we advise testing simple models first and successively increasing the number of hidden units until a peak of generalizability to the validation sample occurs. Thus, we start with a "zero hidden unit" model and see if generalizability can be improved by adding a single hidden unit. We then add a second hidden unit, retrain, and see if we improve performance. When performance in the validation sample T2 begins to deteriorate by adding a unit, we drop back to the previous number of hidden units and use that model structure. In essence, the error in the model estimated on T1 will continue to decrease as additional hidden units are added; however, the error in the validation sample T2 will decline up to a point and then increase. This inflection point is used to determine the

appropriate number of hidden units. The pictorial representation of Figure 3, with the horizontal axis being the number of hidden units, is the pertinent conceptual framework.

3.6. Sample Reuse Procedure

The steepest gradient descent algorithm used ensures that the neural network has reached a local extremum; however, due to the nonconvex nonlinear nature of the neural network model, we may not be at a global extremum.6 A sample reuse procedure is recommended to minimize this risk. This procedure is analogous to using multiple starting points in a global maximum likelihood parameter search in statistical estimation, except that the unit of repetition in neural networks is the data set, so multiple data sets should be tested. In the interest of using the existing data to the maximum extent possible, we recommend a sample reuse procedure (similar to statistical bootstrapping techniques). This procedure entails creating multiple data partitions.7 For each partition, the data set should be randomly divided into sets T1 (training = 60 percent), T2 (validation = 20 percent), and T3 (testing = 20 percent), as described earlier. Multiple partitioning provides information about the distribution of outcomes (i.e., the mean, range, and variance) and the expected accuracy and reliability of the modeling procedure. The range of performance for the neural network also provides useful information about the potential for error in prediction (worst and best case performance). A researcher interested in developing a predictive model would only be interested in finding the "best" training session. This can be accomplished by identifying the data partitioning that produced the
6This problem of local extrema is not unique to neural networks. Maximum likelihood estimation suffers from the same problem unless the density function is convex in the parameters. In maximum likelihood estimation, multiple starting points are used to address the search for maxima in a nonconvex setting. Analogously, for neural network models wherein learning and convergence are affected by data ordering, a sample reuse procedure with multiple data partitionings serves the same purpose for finding optima in nonconvex neural network models. 7The appropriate number of data partitions needs to be determined. The more partitions used the better (as in bootstrap estimation), but practically speaking we have found that performing 10 separate random partitionings of the data into subsets T1, T2, and T3 is adequate.

largest percentage of correctly classified data when applied to its own testing subsample T3 (i.e., the best out-of-sample predictive accuracy). In essence, by taking the best performing model one achieves a better result than that obtained using a single sample partitioning.
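Pulling together the recommendations of §§3.3-3.6, one possible organization of the partitioning, validation-based stopping, and sample-reuse steps is sketched below. The `net` object and its methods (`train_one_pass`, `error`, `accuracy`, `get_state`, `set_state`) are hypothetical names introduced only for illustration; any trainer with equivalent operations could be substituted.

    import numpy as np

    def partition(n, seed, frac=(0.6, 0.2, 0.2)):
        # Randomly split n record indices into T1 (train), T2 (validate), T3 (test).
        idx = np.random.default_rng(seed).permutation(n)
        n1, n2 = int(frac[0] * n), int((frac[0] + frac[1]) * n)
        return idx[:n1], idx[n1:n2], idx[n2:]

    def train_with_early_stopping(net, X, y, t1, t2, max_iter=100):
        # Stop training when error on the validation sample T2 begins to increase.
        best_err, best_state = np.inf, net.get_state()
        for _ in range(max_iter):
            net.train_one_pass(X[t1], y[t1])        # one sweep through the training sample
            val_err = net.error(X[t2], y[t2])       # monitor the validation sample
            if val_err < best_err:
                best_err, best_state = val_err, net.get_state()
            else:
                break                               # overfitting has begun; stop
        net.set_state(best_state)                   # restore the weights at the peak of generalizability
        return net

    def sample_reuse(make_net, X, y, n_partitions=10):
        # Repeat the procedure over several random partitions and keep the
        # network with the best out-of-sample (T3) accuracy.
        best_acc, best_net = -np.inf, None
        for seed in range(n_partitions):
            t1, t2, t3 = partition(len(X), seed)
            net = train_with_early_stopping(make_net(), X, y, t1, t2)
            acc = net.accuracy(X[t3], y[t3])        # out-of-sample predictive accuracy
            if acc > best_acc:
                best_acc, best_net = acc, net
        return best_net, best_acc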

4. Study 1: A Numerical Simulation


4.1. Overview

Prior research suggests that linear models are robust and capable of capturing noncompensatory processes in judgment (Dawes and Corrigan 1974, Einhorn 1970, Ganzach 1995) and choice (Johnson and Meyer 1984). This belief rests upon the past performance of simple linear models at explaining reliable variance in consumer evaluations and predictive accuracy in consumer choice among alternatives where the actual choice rule is inferred based on protocol analysis or process-tracing methods. As a result, the evidence for the sufficiency of linear models in predicting noncompensatory choice rules has been indirect. However, using numerical simulations, Johnson et al. (1989) demonstrate that noncompensatory choice models involving attribute thresholds such as "satisficing" (i.e., a conjunctive rule) are poorly fit using a linear-additive model. They examine the performance of linear models in contexts where the data are known to be generated by specific choice rules and find that even in an orthogonal attribute environment compensatory models fail to capture noncompensatory choice rules involving attribute thresholds. We examine the comparative ability of neural networks and traditional linear-additive models at capturing three commonly used consumer decision rules. These include two noncompensatory decision rules and one compensatory decision rule that involve attribute thresholds. In particular, we test a variant of the satisficing rule (see Figure 1(a)) used by Johnson et al. (1989) that sets a lower bound threshold on all attribute values, and we test a "latitude of acceptance" rule (see Figure 1(b)) that sets both a lower threshold and an upper threshold on attribute values, mimicking the "ideal point" model described previously (Coombs and Avrunin 1977). Given the flexible nature of the

neural network models we expect to see better within-sample predictive accuracy and out-of-sample predictive accuracy from the neural network model than discriminant analysis or logistic regression for both of the noncompensatory decision rules. We also test a compensatory rule that equally weights product attributes and judges the acceptability of an alternative based on the sum of its attribute values. In this case we expect to see little improvement in performance between neural networks and the traditional methods, with all models predicting well, due to the compensatory nature of the rule. The complexity of the decision rule was also varied to test for degradation in model performance. Each of the three decision rules was tested using either two or four product attributes. Unlike Johnson et al. (1989), we did not model choice among a set of alternatives. Rather, the rule was used to determine the "acceptability" of individual alternatives.

4.2. Method

We test the comparative ability of a neural network model, discriminant analysis, and logistic regression at capturing three riskless decision rules: satisficing (SAT), latitude of acceptance (LOA), and weighted-additive (WADD). The first two rules are selected because of prior evidence indicating the insufficiency of linear-additive models at representing these noncompensatory decision rules (Johnson et al. 1989). The weighted-additive rule implemented here also involves a threshold for classifying an alternative as either acceptable or unacceptable. However, unlike the satisficing and latitude of acceptance rules, it is compensatory in nature. This choice of decision rules allows for a clear demonstration of when the neural network model's performance will be comparable to traditional statistical methods, as well as when it will surpass them in performance (cf., Rumelhart et al. 1996). The three decision rules are discussed as follows. Each of these rules is examined separately at two levels of decision complexity: two attributes and four attributes.

1. Satisficing (SAT). For an alternative to be classified as an acceptable choice based on the satisficing rule it must surpass the established threshold (250) on all

attributes. In the two-attribute environment, approximately 56 percent of the alternatives were deemed acceptable, and in the four-attribute environment, approximately 32 percent of the alternatives were acceptable, according to the satisficing rule.8

2. Latitude of Acceptance (LOA). For an alternative to be classified as an acceptable choice using a latitude of acceptance rule, the value of each of its attributes must lie within the specified range (251 to 750). In the two-attribute environment, approximately 25 percent of the alternatives fit this criterion, and in the four-attribute environment, approximately 6 percent of the alternatives were deemed acceptable.

3. Weighted Additive (WADD). For an alternative to be classified as an acceptable choice using a weighted-additive rule, the sum of its attribute values had to exceed a specified threshold (1,000 for two-attribute alternatives; 2,000 for four-attribute alternatives). Approximately 50 percent of the alternatives fit this criterion in both the two- and four-attribute environments.

Two data sets, each consisting of 500 choice alternatives, were generated using the random number generator in SAS. Separate data sets were generated for the two-attribute and four-attribute choice environments. Attribute values were drawn from a uniform distribution ranging from 0 to 1,000. For each of the 500 choice alternatives the three decision rules were applied to determine the "acceptability" according to the specific rule. This generated a binary output variable associated with each of the three decision rules for each of the 500 choice alternatives.
8This noncontext-dependent operationalization of a satisficing rule is similar to an elimination-by-aspects rule which sets thresholds on individual attributes and sequentially eliminates alternatives that do not meet one or more of the attribute cut-offs (cf., Tversky 1972).
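A minimal sketch of the simulated choice environments just described is given below. The study generated its data with the SAS random number generator; the version here is an illustrative re-creation under the same assumptions (500 alternatives per environment, attribute values uniform on 0-1,000, and a binary "acceptable" flag for each rule), not the original code.

    import numpy as np

    def generate_environment(n_alternatives=500, n_attributes=2, seed=0):
        # Generate choice alternatives and apply the three decision rules.
        rng = np.random.default_rng(seed)
        X = rng.uniform(0, 1000, size=(n_alternatives, n_attributes))
        sat = (X > 250).all(axis=1)                  # satisficing: every attribute above 250
        loa = ((X >= 251) & (X <= 750)).all(axis=1)  # latitude of acceptance: every attribute in 251-750
        wadd = X.sum(axis=1) > 500 * n_attributes    # weighted additive: sum above 1,000 (2 attrs) or 2,000 (4 attrs)
        return X, {"SAT": sat.astype(int), "LOA": loa.astype(int), "WADD": wadd.astype(int)}

    X2, rules2 = generate_environment(n_attributes=2)   # two-attribute environment
    X4, rules4 = generate_environment(n_attributes=4)   # four-attribute environment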

4.3. The Neural Network Model

Two separate simulations of the decision rules were performed varying in decision complexity (two or four attributes). Six separate neural networks were trained (one for each of the three decision rules and for each level of rule complexity). Symbolically, these inputs are denoted by x = (x_1, . . . , x_I)', where I = 2 or 4. An output variable, O, that the various models attempted to predict was associated with each decision rule. The functional relationship between the input units and the neural output of the decision rule outcome was based on a logistic activation function

    F(z) = 1/(1 + exp(-z)),                                        (1)

where z = x'w_j^(1) and w_j^(1) = (w_1j^(1), . . . , w_Ij^(1))' is the vector of weights (to be subsequently adjusted by learning) bridging from the original set of I input variables to the jth hidden unit. Thus, the calculated value of the jth hidden neural unit was

    H_j = 1/(1 + exp(η_j^(1) - x'w_j^(1))),   j = 1, . . . , J,    (2)

where η_j^(1) is the threshold for the jth hidden unit (the intercept term in the logistic regression formulation for H_j), and H_j is the derived numerical value for the jth hidden neural unit that results from the vector of input variables x = (x_1, . . . , x_I)' and the current best estimate of the weight vector w_j^(1). The final output was obtained by again applying the logistic activation to the weighted summation of the hidden layer neural values H = (H_1, . . . , H_J)', with the number of hidden neural units denoted as J. Thus,

    Ô = 1/(1 + exp(η^(2) - H'w^(2))),                              (3)

where the weight vector is now given by w^(2) = (w_1^(2), . . . , w_J^(2))', which is applied to the hidden factors H = (H_1, . . . , H_J)'. By the iterative feed-forward and back-propagation algorithm (see Appendix 1), the neural network learned to simultaneously select the weights w_j^(1) and w^(2) and the threshold values η_1^(1), . . . , η_J^(1) and η^(2) in such a manner as to minimize the prediction error in the training sample. Of course, a change in any single weight w_ij^(1) changes the value of H_j, so that the weights between the hidden layer and the output also change.

4.4. Comparative Analysis and Results

For each of the three decision rules and the two levels of model complexity the performance of the neural network was compared to the results obtained using discriminant analysis and logistic regression. The procedure described in the previous section was used to test

both the within- and out-of-sample predictive accuracy. For the neural networks, out-of-sample predictive accuracy was measured using the weights estimated from samples T1 (training) and T2 (validation) on T3 (testing). For logistic regression and discriminant analysis, out-of-sample predictive accuracy was measured using the weights estimated from the combined set T1 and T2 and applied to T3. Following the sample-reuse procedure described, 10 random partitionings of the set of choice alternatives were generated. All models were estimated using the first 400 observations and cross-validated on the last 100 observations. The same 10 partitionings were used for all six neural networks, as well as the logistic regressions and discriminant analyses. The simulation results are summarized in Tables 1 and 2. The mean and range of predictive accuracy, as well as Type I and Type II error rates across all 10 random samples of the data, are reported. The performance of all noncompensatory models, with the exception of the four-attribute LOA model, is seen to be statistically different (p < .0001) and inferior to the

neural network model, both in terms of explained variance (within-sample predictive accuracy) and, more importantly, in terms of forecasting (out-of-sample predictive accuracy). The results for the four-attribute LOA rule were slightly different from expected. While the neural network exhibited far superior within-sample and out-of-sample accuracy to discriminant analysis (p < .0001), the performance of logistic regression was comparable to the neural network (p > .45). A closer examination of the logistic regression's performance in predicting a four-attribute LOA rule, however, indicates that the model's strong predictive accuracy is due to the low base rate of acceptable alternatives (theoretically, within a given sample, only 6 percent of the alternatives would be deemed acceptable) rather than its ability to differentiate between acceptable and unacceptable alternatives. An examination of the out-of-sample Type I error rate (100 percent) indicates that the logistic regressions were unable to predict the few acceptable alternatives in the data set and categorized all of the alternatives as unacceptable.

Table 1. Comparative Results from Numerical Simulation

Within-Sample Accuracy

Analysis Method         Attributes   Satisficing        Latitude of Acceptance   Weighted Additive
Discriminant analysis   2            87.8%** (87-89%)   55.2%** (53-57%)         99.2%* (99-100%)
                        4            77.9** (77-79)     55.0** (52-57)           97.4** (97-100)
Logistic regression     2            87.7** (87-89)     79** (78-81)             100 (100)
                        4            82.0** (81-83)     92.4** (92-93)           100* (100)
Neural network          2            99.7 (98-100)      97.2 (86-100)            99.6 (98-100)
                        4            94.8 (92-100)      94.3 (80-100)            99.3 (99-100)

Out-of-Sample Accuracy

Analysis Method         Attributes   Satisficing        Latitude of Acceptance   Weighted Additive
Discriminant analysis   2            87.7%** (83-90%)   51.8%** (47-59%)         97.4%* (93-100%)
                        4            78.7** (76-81)     51.6** (45-56)           98.2* (96-100)
Logistic regression     2            87.8** (84-91)     77.0** (71-82)           100 (100)
                        4            79.0** (74-85)     91.6 (90-94)             99.9 (99-100)
Neural network          2            99.5 (98-100)      94.8 (86-99)             99.5 (97-100)
                        4            91.4 (86-97)       90.7 (87-96)             99.3 (98-100)

Note. For each of the three decision rules and methods of analysis the predictive accuracy was calculated across 10 random samples of the data. Cell values represent the mean and range of predictive accuracy for the 10 random samples. These averages were stable across the 10 samples. The standard deviations for the discriminant analysis models ranged from 0.004 to 0.036, with an average of 0.016. For the logistic regression models the standard deviation ranged from 0 to 0.034, with an average of 0.079. Similarly, the standard deviations for the neural network models ranged from 0.01 to 0.05, with an average of 0.02. A series of paired t tests with df = 9 compared the performance of the neural network models across the 10 random samples with the performance of the traditional statistical methods. Significance of the t tests is represented as follows: *p < .01, **p < .0001.
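The paired t tests mentioned in the table note compare, partition by partition, the accuracy of the neural network with that of a competing method. A sketch of such a comparison is shown below; the accuracy arrays are placeholder values for illustration only, not the study's results.

    import numpy as np
    from scipy.stats import ttest_rel

    # Out-of-sample accuracy from the same 10 random partitions for two methods
    # (placeholder numbers for illustration only).
    acc_neural = np.array([.99, .98, 1.0, .99, .98, 1.0, .99, .98, .99, 1.0])
    acc_logit = np.array([.88, .87, .90, .86, .88, .91, .87, .84, .89, .88])

    t_stat, p_value = ttest_rel(acc_neural, acc_logit)   # paired t test, df = n - 1 = 9
    print(t_stat, p_value)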

Table 2. Comparison of Error Rates from Numerical Simulation

Out-of-Sample Error Rates

                                     Satisficing                LOA                        Weighted Additive
Analysis Method         Attributes   Type I (α)   Type II (β)   Type I (α)    Type II (β)   Type I (α)   Type II (β)
Discriminant analysis   2            10% (4-18%)  14% (9-19%)   44% (31-50%)  50% (45-56%)  3% (0-9%)    2% (0-8%)
                        4            20 (11-24)   22 (17-25)    55 (33-67)    48 (42-54)    3 (2-5)      1 (0-4)
Logistic regression     2            11 (4-18)    14 (7-18)     100           0             0            0
                        4            50 (36-65)   10 (4-15)     100           0             0            0
Neural network          2            1 (0-4)      0 (0-2)       10 (0-24)     4 (0-18)      0            1 (0-7)
                        4            10 (0-21)    8 (4-14)      28 (0-44)     8 (2-21)      1 (0-4)      1 (0-4)

Note. For each of the three decision rules and methods of analysis the out-of-sample error rate was calculated across the 10 random samples. Cell values represent the mean and range of out-of-sample error. Type I error represents the percentage of acceptable alternatives categorized as unacceptable by the model, and Type II error represents the percentage of unacceptable alternatives categorized as acceptable by the model.

On the other hand, the neural networks were able to achieve the same aggregate level of out-of-sample predictive accuracy while balancing the Type I (28 percent) and Type II (8 percent) error rates.9 From a managerial perspective, the logistic regression's solution of rejecting all alternatives may not be feasible when attempting to find acceptable alternatives. Moreover, when assessing the performance of these models one needs to consider the cost associated with the two types of errors. For example, if the cost of a Type I error is $100 versus $1 for a Type II error, then the expected cost of using the logistic regression would be
9Because the logistic regression attempts to separate the space into two linearly separable regions, and it optimizes its parameters based on an examination of the aggregate sample, in situations such as that presented by the four-attribute LOA decision rule where one group is much larger than the other, the model will achieve its highest accuracy by classifying all observations into the larger of the two groups (i.e., the logistic regression "gives up" trying to predict the acceptable alternatives and goes with the odds). By contrast, the neural network is able to learn the nonlinear structural relationships even with disparate group sizes.

($100 × .06 × 100 × 1.00) + ($1 × .94 × 100 × 0) = $600 per 100 alternatives evaluated, whereas the expected cost of using the neural network would be ($100 × .06 × 100 × .28) + ($1 × .94 × 100 × .08) = $175.52. As this result indicates, the high predictive accuracy rate of the logistic model in the four-attribute LOA decision rule context may be deceiving when both types of error are important. Practically speaking, on the average, the "best trained" neural network always outperformed both discriminant analysis and logistic regression in terms of both within- and out-of-sample predictive accuracy for the noncompensatory decision rules (see Table 1). As expected, there was little difference between the neural networks' performance and that of discriminant analysis or logistic regression when predicting data generated by a two-attribute or four-attribute weighted-additive decision rule. All three modeling procedures performed exceptionally well in capturing this compensatory decision rule. Thus, you cannot go wrong by using neural networks in linear settings and can gain substantially in nonlinear (or unknown) settings.
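The dollar comparison above is an instance of a simple expected-cost calculation, written out here in general form for clarity (the notation is introduced for this illustration; it is implied by, rather than stated in, the discussion above):

    \[ \mathrm{E}[\mathrm{cost}] \;=\; N \left[ c_{\mathrm{I}} \, p \, \alpha \;+\; c_{\mathrm{II}} \, (1 - p) \, \beta \right], \]

where N is the number of alternatives evaluated, p the base rate of acceptable alternatives, α and β the Type I and Type II error rates, and c_I and c_II the costs per Type I and Type II error. With N = 100, p = .06, c_I = $100, and c_II = $1, the logistic regression (α = 1.00, β = 0) yields $600 and the neural network (α = .28, β = .08) yields $175.52, matching the figures above.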


5. Study 2: An Empirical Examination of Consumer Patronage Behavior


An empirical study of the neural networks' ability to predict consumer choice (patronage behavior) was conducted to further substantiate the simulation results. Consumer perceptions and patronage behavior toward three nationwide mass-merchandise retailers were examined (Kmart, Montgomery Ward, and Sears Roebuck). A sample of 800 members from a nationwide consumer mail panel was analyzed. Since panel members were predominantly female, half of the cover letters asked the panel member to fill out the questionnaire and the other half instructed the panel member to ask their spouse to respond, in order to balance for gender.

5.1. Survey Instrument
The "image" dimensions of products and services are of interest to consumer behavior researchers because of their important role in influencing patronage behavior and consumer choice. In this study, image perceptions for retail chain stores were assessed by having respondents rate each of three mass-merchandise retailers on 19 store characteristics, including price (low to high), cleanliness (dirty to clean), employee friendliness (unfriendly to friendly), etc. (see Table 3). These 19 store characteristics were selected based on a review of the literature on retail store image (Zimmer and Golden 1988). Each of the 19 store characteristics was assessed using a seven-point bipolar adjective scale (see Golden et al. 1992 for details on the questionnaire format). Besides assessing consumers' attitudes and perceptions of the three retailers, a behavioral response measure of consumer patronage was elicited: for each of the three stores, respondents were asked how frequently they shopped at Kmart, Montgomery Ward, and Sears Roebuck. Response options ranged from 1 = never to 6 = at least once a week. The neural network technique described in this paper is capable of handling observations with missing data; however, the statistical methods against which the neural networks are being compared require complete data. The goal was to compare the neural network performance with

that of other common statistical methods; therefore, only surveys having complete data were used in the analysis. After imposing this restriction, 371 of the 800 data records remained for Sears Roebuck, 294 of the 800 records for Kmart, and 235 of the 800 records for Montgomery Ward. These three data sets were used to build separate neural networks, and each was also analyzed using the traditional statistical methods.

5.2. Comparative Analysis and Results
Each of the three ANN models used the chain-specific responses to the 19 store characteristic variables as inputs and patronage frequency as the output variable. For consistency with the simulation results, the scale of the behavioral response variable was converted from a 1-to-6 scale to a binary variable, with 0 representing an "infrequent shopper" and 1 representing a "frequent shopper," based on a median split of the data. Respondents who reported shopping frequencies of two or less on the scale were classified as "infrequent shoppers" and those who reported shopping frequencies greater than two were classified as "frequent shoppers."10 The same training and cross-validation procedures used in the numerical simulation were applied to the questionnaire data; that is, 60 percent of the sample was used to train the network, 20 percent was used to determine a stopping point for training, and 20 percent was used for out-of-sample testing of the predictive accuracy (a brief illustrative sketch of this data preparation appears below). These sample sizes were 223, 74, and 74 for Sears Roebuck; 194, 50, and 50 for Kmart; and 141, 47, and 47 for Montgomery Ward. In addition to building neural network models to predict consumer patronage behavior, both discriminant analysis and logistic regression were tested as alternative statistical methods. Yet another statistical alternative was also tested: due to the large number of input variables, one might conceptualize consumer patronage as being based on a set of factors or latent variables that represent combinations of the 19 store characteristics.
10 Neural networks are not limited to these types of binary classification problems. Continued efforts at testing the comparative performance of neural networks on classification tasks involving multilevel output variables are considered a fruitful area for future research.
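A minimal sketch of the data preparation described in §5.2 is given below. It is illustrative rather than the authors' implementation; the column name shopping_frequency is an assumed label for the patronage response.

```python
# Illustrative sketch (assumed column names, not the authors' code) of the
# data preparation described above: binarize patronage frequency by a median
# split and form 60/20/20 training, validation (stopping), and test sets.
import numpy as np
import pandas as pd

def prepare(df, freq_col="shopping_frequency", seed=0):
    df = df.copy()
    # Frequencies of 2 or less -> "infrequent" (0); greater than 2 -> "frequent" (1).
    df["frequent_shopper"] = (df[freq_col] > 2).astype(int)

    idx = np.random.default_rng(seed).permutation(len(df))
    n_train = int(0.6 * len(df))
    n_valid = int(0.2 * len(df))
    train = df.iloc[idx[:n_train]]                   # used to fit each model
    valid = df.iloc[idx[n_train:n_train + n_valid]]  # used to stop network training
    test = df.iloc[idx[n_train + n_valid:]]          # out-of-sample accuracy
    return train, valid, test
```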


Table 3    Retail Store Characteristics

                                               Kmart (n = 294)      Montgomery Ward (n = 235)    Sears (n = 371)
Store Characteristic                           Mean    Std. Dev.    Mean    Std. Dev.            Mean    Std. Dev.
Unfriendly/Friendly                            4.51    3.36         4.33    3.29                 4.59    3.43
Unhelpful/Helpful                              4.33    3.66         4.29    3.34                 4.63    3.49
Bad/Good Reputation                            4.90    3.08         4.72    3.23                 5.54    2.37
Low/High Caliber                               3.86    2.78         4.18    2.56                 4.73    2.28
Dislike/Like                                   4.76    3.49         3.95    3.96                 4.76    3.37
Uncongested/Congested                          4.45    3.25         3.77    3.37                 4.04    3.10
Not Enjoyable/Enjoyable                        4.47    3.28         4.05    3.28                 4.61    2.89
Hard/Easy to Exchange                          5.31    3.18         4.87    3.34                 5.43    2.94
Bad/Good Value                                 4.86    2.88         4.52    2.87                 5.25    2.25
Inconvenient/Convenient Location               5.25    4.18         3.81    5.42                 4.90    4.03
Low/High Price                                 2.90    2.90         4.23    2.53                 4.50    2.38
Dirty/Clean                                    4.98    3.53         5.32    2.97                 5.95    1.77
Hard/Easy Credit                               4.77    4.32         5.31    3.42                 5.48    3.89
Cluttered/Spacious                             4.13    3.92         4.45    3.60                 4.91    3.22
Unpleasant/Pleasant Atmosphere                 4.68    3.50         4.70    3.39                 5.23    2.80
Low/High Quality Merchandise                   3.89    3.03         4.46    3.08                 5.34    2.02
Dull/Exciting                                  3.88    2.74         3.60    2.81                 4.08    2.46
Bad/Good Selection                             4.69    2.80         4.24    3.29                 5.03    2.53
Unsophisticated/Sophisticated Customers        3.29    2.79         4.00    2.59                 4.47    2.33

Each of the 19 store characteristics was assessed using a 7-point bipolar adjective scale. The values represent the mean and standard deviation of responses to a questionnaire administered to a nationwide consumer panel.

In this new alternative statistical method, four latent variables were uncovered via factor analysis. These results were then used as inputs for the analysis. The results for all three retail chains are presented in Table 4. The same sample reuse procedure described earlier was used for all three models to reduce the effects of outliers and sample particularities. The results in Table 4 demonstrate that once again the neural network model exhibits a superior ability to learn the patterns relating consumer choice (in this case, patronage frequency) to product attributes (image dimensions). Consistent with the simulation results, the neural networks demonstrate significantly better (p < .05) out-of-sample predictive accuracy than the two traditional statistical methods for all three stores. Across all three store chains, the "best trained" neural network outperformed the best of all alternative statistical models tested (see Table 4).

It is well known that within-sample predictive accuracy represents a measure of explained variance and yields an upwardly biased estimate of future model performance. Hence, out-of-sample predictive accuracy is a more relevant measure of future performance. Nevertheless, in all but two instances (logistic regression for Kmart and Montgomery Ward), the average within-sample predictive accuracy of the neural networks exceeded that of the other models. Taken together, our results suggest that the neural network produced less within-sample "overfitting" than the logistic regression or discriminant analysis. An informative byproduct of discriminant analysis and logistic regression is the β weights, which represent a quantification of the expected change in output (patronage frequency) that results from changes in the input variables (store characteristics).


Table 4    Comparative Results from Consumer Patronage Survey

                          Within-Sample Accuracy                                            Out-of-Sample Accuracy
Analysis Method           Kmart               Montgomery Ward     Sears Roebuck            Kmart                Montgomery Ward     Sears Roebuck
Discriminant analysis     78.04%** (74-81%)   80.72%** (76-83%)   71.26%** (67-75%)        72.98%*** (69-78%)   75.21%** (70-79%)   64.17%** (58-68%)
Logistic regression       80.78* (80-83)      80.56* (76-85)      72.13 (71-75)            77.45* (74-83)       68.45*** (65-75)    64.39* (60-68)
Neural network            82.08 (79-88)       82.73 (80-86)       73.74 (71-77)            83.51 (77-95)        84.58 (77-94)       69.72 (58-75)

For each retail chain and method of analysis the predictive accuracy was calculated for 10 random samples of the data. Cell values represent the mean and range of predictive accuracy for the 10 random samples. These averages were stable across the 10 samples. The standard deviations for the discriminant analysis models ranged from 1.88 to 3.05, with an average of 2.53. For the logistic regression models the standard deviation ranged from 2.45 to 3.54, with an average of 2.88. Similarly, the neural network models ranged from 1.95 to 5.80, with an average of 3.87. A series of paired t-tests with df = 9 compared the performance of the neural network models across the 10 random samples with the performance of the traditional statistical methods. Significance of the t-tests is represented as follows: *p < .05, **p < .01, ***p < .0001.

For nonlinear techniques, such as neural networks, the change in output due to changes in input is not a constant, as the relative weight assigned to each input can vary in both magnitude and sign across the range of an input variable. Changes in the output variable that result from changes in inputs can be estimated via the "sensitivity" or "elasticity" of output to input defined by ∂O/∂x_i, the partial derivative of the network output with respect to input i. For linear models, this sensitivity measure, β, is constant (but it is not so for the neural network models). A closed-form solution for computing the sensitivity of each input i for a neural network is presented in Appendix 2. In our application, each input potentially has integer values that range from 1 to 7. Accordingly, the sensitivity of each input i was measured by evaluating the partial derivative, ∂O/∂x_i, at each level of the 19 input variables while holding all other input variables fixed. The sensitivity for x_i at response level k is obtained by calculating the sensitivity for each respondent according to the mathematical formula given in Appendix 2, then averaging across respondents.

A sampling of the results of the sensitivity calculations is presented in Table 5 for Sears Roebuck, Kmart, and Montgomery Ward, respectively. As can be seen from this table, there is consistency in algebraic sign for the sensitivity along each of the bipolar adjective scales for a given retailer, indicating that the networks learned the directionality of input to output that corresponds to the image variables. Moreover, the fact that some of the average sensitivity measures for the different variables were relatively large, while others were relatively small, indicates that patronage frequency has differing sensitivity to changes in the perceived image variables under investigation. For example, patronage frequency is relatively sensitive to the different levels of some variables, such as "Dislike/Like" for all three stores (ranging from 0.03 to 15.16), but patronage frequency is relatively insensitive to changes in other variables, such as "Hard/Easy Credit" (ranging from 0.06 to 1.32). These findings have obvious managerial implications since the consumers' perceptions of the various store characteristics can be influenced by managerial actions. For some variables (e.g., Hard/Easy Credit), the financial costs needed to change consumers' perceptions might be large, but these results show that doing so may have little effect on consumer patronage and should therefore not be a priority item for managerial action. Conversely, the ultimate shopping behavior is more strongly influenced by other variables, such as "Bad/Good Selection."


Table 5    Sensitivity of Output with Respect to a Selective Group of Image Dimensions

Image Dimension / Store        Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7
Low/High Price
  Sears                        0.64      -8.09     -4.57     -3.82     -4.57     -3.40     -1.77
  Kmart                        -0.61     -0.76     -1.31     -0.78     -2.46     -1.14     -1.52
  Montgomery Ward              -0.15     -0.17     -0.11     -0.17     -0.14     -0.05     -0.02
Hard/Easy Credit
  Sears                        0.16      0.70      0.53      0.80      0.98      0.56      0.54
  Kmart                        0.44      0.66      0.68      0.55      0.35      0.49      0.25
  Montgomery Ward              1.11      0.06      0.32      1.28      0.65      1.32      0.85
Cluttered/Spacious
  Sears                        -1.69     -0.98     -2.13     -3.29     -2.75     -2.97     -2.50
  Kmart                        -1.77     -1.23     -1.21     -1.46     -0.91     -1.52     -0.30
  Montgomery Ward              0.31      0.21      0.39      0.46      0.34      0.26      0.35
Low/High Quality Merchandise
  Sears                        0.00      3.29      2.45      2.72      2.52      3.46      2.13
  Kmart                        0.08      0.83      0.89      0.71      0.59      0.28      0.37
  Montgomery Ward              0.83      0.81      0.68      1.12      1.03      1.06      1.14
Bad/Good Selection
  Sears                        0.03      1.15      1.89      1.97      2.68      3.28      1.32
  Kmart                        1.60      3.48      2.96      1.39      2.26      1.24      1.05
  Montgomery Ward              2.96      1.18      1.62      2.66      2.30      2.04      4.41
Low/High Caliber
  Sears                        0.27      0.22      1.89      1.45      1.74      1.19      1.30
  Kmart                        1.24      1.26      1.77      1.04      0.71      0.33      0.32
  Montgomery Ward              4.89      0.21      2.72      4.60      3.23      3.88      6.34
Hard/Easy to Exchange
  Sears                        0.23      0.32      0.31      0.61      0.48      0.61      0.47
  Kmart                        11.99     2.88      10.48     11.89     8.19      5.13      2.65
  Montgomery Ward              -0.17     -0.16     -0.18     -0.23     -0.10     -0.16     -0.15

The values above represent a sensitivity measure for a selection of store image characteristics across the seven-level bipolar scales. This measure is analogous to a parameter estimate in a statistical model; however, the measure is not averaged across the range of the input variable, and thus the relative weight can vary across the range of the input variable.

Managerial actions affecting the consumer's perception of the store's merchandise selection can therefore better influence store patronage. The nonconstancy of the sensitivity measure across individual attribute levels, together with knowledge of current consumer perceptions, allows the marketing manager to better estimate the impact of strategic decisions. For example, as indicated in Table 3, the average perception of the Kmart consumer on the Low/High Price variable is 2.9, and if the manager could influence this perception value down to 1.9, then the patronage frequency could be expected to increase by 0.76.

Other information is also available from Table 5. For example, we observed what appears to be a threshold model at work for Sears with respect to the variables "Low/High Quality Merchandise" and "Bad/Good Selection," with the threshold value being at level 2. Once above the threshold, the sensitivity is fairly constant and positive; therefore, little is to be gained by wooing consumers whose perceptions are below the threshold. This information is not available with logistic regression or discriminant analysis because the constant sensitivity measure (β) is averaged across attribute levels.
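To make this interpretation concrete, the sketch below reproduces the Kmart price calculation discussed above as a first-order approximation, using the Level 1-7 sensitivities for Kmart's "Low/High Price" scale from Table 5. The helper function and the convention of evaluating the derivative at the level being crossed (the floor of the starting perception) are illustrative assumptions, not part of the paper.

```python
# Sketch of the what-if calculation discussed above: approximate the change in
# predicted patronage from a one-point shift in a perception, using the
# level-specific sensitivities in Table 5. Values are the Kmart "Low/High
# Price" sensitivities; the helper is illustrative, not from the paper.
import math

kmart_price_sensitivity = {1: -0.61, 2: -0.76, 3: -1.31, 4: -0.78,
                           5: -2.46, 6: -1.14, 7: -1.52}

def approx_output_change(sensitivity_by_level, from_level, to_level):
    # First-order approximation: dO ~ (dO/dx at the level being crossed) * dx.
    # Evaluating at floor(from_level) reproduces the 2.9 -> 1.9 example (+0.76).
    level = math.floor(from_level)
    return sensitivity_by_level[level] * (to_level - from_level)

print(approx_output_change(kmart_price_sensitivity, 2.9, 1.9))  # approximately +0.76
```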


The sensitivity information from Table 5 can also have differential managerial value. For example, while from Table 3 we see that the average consumer perceptions of Sears and Kmart shoppers on the "Hard/Easy to Exchange" variable are roughly equal (5.43 versus 5.31, respectively), from Table 5 we see that Sears has little to gain by changing consumer perceptions on exchange, while Kmart stands to gain significantly by focusing on changing this perception. In addition, from Table 5, on the variable "Low/High Quality Merchandise," there is less to be gained for Kmart by increasing the perception of quality merchandise than by increasing consumer perceptions of how easy it is to exchange merchandise. Instead of advertising its "new" high-quality merchandise, Kmart should attempt to convince consumers that its "new" exchange policy is an improvement. This will potentially pay off more in terms of patronage frequency than a quality-of-merchandise campaign. This same strategy is not optimal for Sears or Montgomery Ward, as witnessed by their sensitivity values on these variables.

6. Conclusions and Implications


Artificial neural network research has progressed along two distinct lines: cognitive scientists use such models to further understand human cognition, while information scientists, statisticians, and other quantitative researchers examine the mathematical properties of the neural network models to improve classification and prediction. This paper examines the mathematical properties of these models that offer tremendous potential for predicting consumer decision processes when noncompensatory decision heuristics are used, or product information is integrated in a nonlinear fashion. As outlined in this paper, the proper use of neural networks requires the researcher to make a series of important decisions in order to ensure accuracy in prediction. We present a set of implementation procedures for a researcher to follow when using neural networks to develop predictive models. This paper demonstrates that neural network models consistently outperform statistical methods such as discriminant analysis and logistic regression when predicting the outcome of a known noncompensatory choice rule. Even when the underlying choice rule is unknown, the neural network exhibits better out-of-sample predictive accuracy than traditional statistical

methods, due to the flexible nature of the model. This improvement in performance is accomplished through an iterative process whereby the model "learns" complex relationships between product attributes (image variables) and consumer choice (patronage frequency). Neural network models differ from other statistical procedures (e.g., logistic regression, maximum likelihood factor analysis, discriminant analysis) in that the model does not presuppose any particular linear or causal relationship between the input variables and the output variable(s). Traditional statistical models are able to deal with nonlinear functional relationships by incorporating the appropriate exponential terms or multiplicative interactions, but the form of nonlinearity must be known a priori. Similarly, noncompensatory choice rules, such as satisficing and latitude of acceptance, can be modeled using traditional methods by estimating separate slopes for the regions above and below the attribute thresholds, but again the exact form of the nonlinearity must be known precisely. The benefit of the neural network is its robustness to model misspecification.

Tables 1 and 4 give guidance on two additional aspects of model robustness for the neural network models versus the statistical models: model overfitting and worst-case model performance. Model "overfitting" can be quantified by contrasting the within-sample accuracy (where overfitting to the training sample may be present) with the out-of-sample accuracy (which is an unbiased assessment of the model's performance). Model overfitting is illustrated graphically in Figure 4. In the absence of overfitting the within- and out-of-sample accuracy would coincide (a point representing model performance in a two-dimensional space would fall on the 45° line). Distance below the 45° line is indicative of model overfitting. As illustrated in Figure 4, neural network models are on average more accurate and exhibit less overfitting than discriminant analysis and logistic regression. Overfitting is not a problem for the neural networks as implemented in this paper because of our use of the validation sample T2 to determine when to stop training (cf. Figure 3).

A second aspect of robustness concerns the question of "how far wrong can you go using a particular modeling procedure?"


Figure 4    Illustration of Model Overfitting

[Figure: six scatterplots of out-of-sample accuracy against within-sample accuracy (both from 0 to 100 percent), one panel each for the Latitude of Acceptance Rule, the Satisficing Rule, the Weighted Additive Rule, Kmart, Montgomery Ward, and Sears Roebuck. Plotted symbols distinguish discriminant analysis, logistic regression, and the neural network.]

The graphs represent an illustration of model fit as well as overfitting. The closer to the upper right-hand corner of the graph, the better fitting the model. Distance below the 45° line represents model overfitting.

Once again, Tables 1 and 4 provide information that addresses this question for neural networks as compared with discriminant analysis and logistic regression. In this case it is the range of out-of-sample predictive accuracy that provides information regarding the "worst-case accuracy." In summarizing the results from Tables 1 and 4, there are nine comparisons that can be made between discriminant analysis and neural networks. In all nine cases neural networks showed a less extreme "worst case" performance, on average having 14.8 percent better "worst case" accuracy. Similarly, examining logistic regression versus neural networks on the nine comparisons, neural networks showed less extreme "worst case" performance in two-thirds of the cases, with an average 10.8 percent better "worst case" accuracy. In the three instances where logistic regression showed less extreme "worst case" performance, it only averaged a 2.3 percent improvement. Thus, neural networks appear more robust, in general, with respect to "how far wrong you can go" using the procedures examined.

In addition, the neural network model has the potential for uncovering managerially relevant information. An examination of the input variable sensitivities, as presented in Appendix 2, may allow the practitioner to determine the relative influence of various input variables, as well as thresholds that may be used for determining consumer choice. However, interpretation of the interconnection weights in a neural network is not as simple as examining the parameters produced by a regression, which, under appropriate distributional assumptions, can be subject to commonly used statistical significance tests. Because of the two-stage compositional character of the neural network, the interconnection weights are linked between the different


layers, creating interdependencies. While the sensitivity measures developed in this paper were designed to overcome this issue, if the focus of a marketing analysis requires the selection of "statistically significant" prediction variables and if a parametric statistical model can be presupposed with confidence at the outset of modeling, then traditional statistical measures may be preferred. Otherwise, neural network models offer improved predictive accuracy when the nature of the consumer's decision rule is unknown. For neural network models, variable selection is discussed in §3. A potential drawback of the neural network approach is that its intrinsically nonlinear and nonconvex structure allows for the possibility that a given model has achieved a local rather than global minimum in error rate. The sample reuse procedure described earlier was designed to address this shortcoming; however, while it increases the likelihood of finding a good model, it does not guarantee the discovery of the "best" possible model.11 Assuming the main goal of the marketing researcher is to predict behavior, the results presented here are very promising. When neural network performance is compared to the predictive results that would be obtained using traditional marketing models, without exception the neural network model exhibits equivalent or better out-of-sample predictive accuracy than any of the comparative statistical methods examined. This portends great usefulness of artificial neural network modeling for the prediction of consumer choice based on product attributes, and suggests that the neural network methodology can be extended to numerous other marketing applications.12
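One simple way to operationalize the guard against local minima discussed above is to train several networks from different random initializations and retain the one that predicts best on the validation sample. The sketch below is illustrative; the fit argument stands in for any training routine (such as the back-propagation procedure in Appendix 1) and is not part of the paper.

```python
# Illustrative guard against local minima, in the spirit of the sample-reuse
# procedure discussed above: train several networks from different random
# starts and keep the one that performs best on the validation sample.
# `fit` is a hypothetical training routine that returns a callable model
# producing P(choice = 1); it is an assumption, not the authors' code.
import numpy as np

def best_of_restarts(fit, X_train, y_train, X_valid, y_valid, n_restarts=10):
    best_model, best_score = None, -np.inf
    for seed in range(n_restarts):
        model = fit(X_train, y_train, seed=seed)      # one random initialization
        preds = (model(X_valid) > 0.5).astype(int)    # classify validation cases
        score = np.mean(preds == y_valid)             # validation accuracy
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```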
"1Anotherpossible drawback concerning neural networks that is often heard is that the model is a "black box" methodology. As per the discussion in ? 2, however, it is a well-defined adaptive gradient search procedure for parameter fitting in a complex nonlinear model, and not a "black box" at all.
12 The authors express sincere appreciation to the University of Texas at Austin, Graduate School of Business, and to the University of Texas Research Institute, both of which provided financial support. We thank Maureen Carter, John Hansen, Jaeho Jang, Kishore Krishna, Utai Pitaktong, Scott Swan, Yuying Wang, Xiaohua Xia, and Li Zhou for their assistance in data analysis. Suggestions and feedback should be directed to Patricia M. West, Assistant Professor of Marketing, the University of Texas at Austin, Graduate School of Business, CBA 7.202, Austin, TX 78712.

Appendix 1. The Training Procedure and Algorithm

A summary of the feed-forward back-propagation algorithm for a neural network with a single output $\hat{O}$ is presented below. Let $w^{(1)}_{ij}$ be the interconnection weight between unit $j$ of the hidden layer and the input variable $x_i$, and let $w^{(2)}_j$ denote the interconnection weight between $H_j = F(\eta^{(1)}_j + x' w^{(1)}_j)$, the $j$th neural unit in the hidden layer, and the output variable $O$. The threshold values $\eta^{(1)}_j$, where $j = 1, \ldots, J$, and $\eta^{(2)}$ are defined for each layer. We designate the logistic function $1/(1 + \exp(-z))$ by $F(z)$.

The back-propagation algorithm proceeds by taking the response pattern $(x_u, O_u)$ for choice alternative, or respondent, $u$ and incrementally updating the interconnection weights to reflect this pattern. This process is formalized as follows:

Step 1. Initialization: Set all weights $w^{(2)}_j$, $w^{(1)}_{ij}$ and thresholds $\eta^{(1)}_j$, $\eta^{(2)}$ equal to small random values.

Step 2. Feed forward: For each unit $j$ of the hidden layer, compute $H_j = F(\eta^{(1)}_j + x_u' w^{(1)}_j)$, where $x_u$ is the vector of inputs corresponding to choice alternative, or respondent, $u$ from the input layer. Subsequently, compute the predicted output $\hat{O}_u = F(\eta^{(2)} + H' w^{(2)})$ at the output layer.

Step 3. Propagate backward: Using $O_u$ as the "target" output for the input pattern $x_u$, compute $\delta^{(2)} = F'(\eta^{(2)} + H' w^{(2)})(O_u - \hat{O}_u)$. At the hidden layer, calculate $\delta^{(1)}_j = F'(\eta^{(1)}_j + x_u' w^{(1)}_j)\, w^{(2)}_j\, \delta^{(2)}$ for all $j$. (These computations are facilitated by the formula $F' = F(1 - F)$, which holds for the logistic activation function.)

Step 4. Update weights: First compute $\Delta w^{(2)}_j = \alpha\, \delta^{(2)} H_j$. Then, to update the connection weights between hidden unit $j$ and the output, use the formula $w^{(2)\,new}_j = w^{(2)\,old}_j + \Delta w^{(2)}_j$. Similarly, to update all the connection weights between the input layer and the hidden layer, compute $\Delta w^{(1)}_{ij} = \alpha\, \delta^{(1)}_j x_i$ and update the weights via $w^{(1)\,new}_{ij} = w^{(1)\,old}_{ij} + \Delta w^{(1)}_{ij}$. The parameter $\alpha$ in the preceding expressions represents a learning rate parameter.

Step 5. Repeat: Return to Step 2 and repeat the previously described process for the next respondent's data pattern. When all respondents have been analyzed in the above manner, start over again with the first respondent, again updating the weights as necessary to reduce the average disparity between $O_u$ and $\hat{O}_u$. The iterations cease according to the stopping rules outlined subsequently.
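For readers who prefer code to pseudocode, the following NumPy sketch implements the five steps above for a single-hidden-layer network with logistic activations and per-respondent (online) updates. It is a minimal illustration of the algorithm as summarized in Appendix 1, not the authors' original implementation; the threshold updates follow the standard convention of treating thresholds as bias weights.

```python
# Minimal sketch of the feed-forward/back-propagation steps in Appendix 1
# (single hidden layer, logistic activations, single output, online updates).
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, n_hidden=3, alpha=0.1, n_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    # Step 1: initialize weights and thresholds to small random values.
    W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))   # w(1)_ij
    eta1 = rng.normal(scale=0.1, size=n_hidden)             # eta(1)_j
    W2 = rng.normal(scale=0.1, size=n_hidden)                # w(2)_j
    eta2 = rng.normal(scale=0.1)                              # eta(2)

    for _ in range(n_epochs):                                 # Step 5: repeat
        for x_u, O_u in zip(X, y):
            # Step 2: feed forward.
            H = logistic(eta1 + x_u @ W1)        # hidden-layer outputs H_j
            O_hat = logistic(eta2 + H @ W2)       # predicted output

            # Step 3: propagate backward, using F' = F(1 - F).
            delta2 = O_hat * (1.0 - O_hat) * (O_u - O_hat)
            delta1 = H * (1.0 - H) * W2 * delta2

            # Step 4: update weights (thresholds treated as bias weights).
            W2 += alpha * delta2 * H
            eta2 += alpha * delta2
            W1 += alpha * np.outer(x_u, delta1)
            eta1 += alpha * delta1
    return W1, eta1, W2, eta2
```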

Appendix 2. The Sensitivity of a Single Network Output with Respect to an Input Variable

To calculate the sensitivity of the output $O$ with respect to a particular input variable $x_k$, we must determine the partial derivative $\partial O / \partial x_k$. This derivative is then evaluated at the observed value of $x_k$, holding the other variables $x_i$, $i \neq k$, fixed at their observed values. The results are averaged over the respondents in each response category for each variable to obtain the entries in Table 5.


According to Equation (2), we may represent the output $O$ in terms of the hidden-layer neurons $H_j$ as $O = F(\eta^{(2)} + H' w^{(2)})$, where $H = (H_1, \ldots, H_J)$ with $H_j = F(\eta^{(1)}_j + x' w^{(1)}_j)$ and the activation function given by $F(z) = 1/(1 + \exp(-z))$. Using the total derivative rule to take the partial derivatives and then applying the chain rule, we have

$$\frac{\partial O}{\partial x_i} = \sum_{j=1}^{J} \frac{\partial O}{\partial H_j}\,\frac{\partial H_j}{\partial x_i},$$

where

$$\frac{\partial O}{\partial H_j} = F'(\eta^{(2)} + H' w^{(2)})\, w^{(2)}_j \qquad \text{and} \qquad \frac{\partial H_j}{\partial x_i} = F'(\eta^{(1)}_j + x' w^{(1)}_j)\, w^{(1)}_{ij}.$$

Thus,

$$\frac{\partial O}{\partial x_i} = \sum_{j=1}^{J} F'(\eta^{(2)} + H' w^{(2)})\, w^{(2)}_j\, F'(\eta^{(1)}_j + x' w^{(1)}_j)\, w^{(1)}_{ij}.$$

Now, using $F'(z) = F(z)[1 - F(z)]$, the sensitivity can be succinctly written as

$$\frac{\partial O}{\partial x_i} = \sum_{j=1}^{J} w^{(2)}_j w^{(1)}_{ij}\, F(\eta^{(2)} + H' w^{(2)})\bigl[1 - F(\eta^{(2)} + H' w^{(2)})\bigr]\, F(\eta^{(1)}_j + x' w^{(1)}_j)\bigl[1 - F(\eta^{(1)}_j + x' w^{(1)}_j)\bigr].$$

After the training is terminated and the weights $w^{(2)}_j$, $w^{(1)}_{ij}$, $\eta^{(1)}_j$, and $\eta^{(2)}$ are obtained, the average sensitivity of the output $O$ with respect to input $x_j$, evaluated at input response category level $x_j = k$, is calculated via the formula

$$\frac{1}{N(k)} \sum_{i=1}^{N(k)} \frac{\partial O}{\partial x_j}\bigg|_{x = x_i},$$

where the index of summation is over the $N(k)$ respondents who answered response category $x_j = k$. These are the values appearing in Table 5.
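The closed-form sensitivity above translates directly into code. The sketch below is illustrative (weight and threshold names follow the Appendix 1 sketch, and are assumptions rather than the authors' implementation); it computes the derivative of the output with respect to each input for one respondent and then averages it over the respondents who answered a given response level, as done for Table 5.

```python
# Sketch of the Appendix 2 sensitivity formula: dO/dx_i for one respondent,
# then averaged within a response level. W1 (n_inputs x n_hidden), eta1,
# W2 (n_hidden), and eta2 are the trained weights and thresholds.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sensitivities(x, W1, eta1, W2, eta2):
    """Return dO/dx_i for one respondent (vector of length n_inputs)."""
    H = logistic(eta1 + x @ W1)                 # hidden-layer outputs
    O = logistic(eta2 + H @ W2)                 # network output
    # dO/dx_i = F'(net2) * sum_j w2_j * F'(net1_j) * w1_ij, with F' = F(1 - F).
    return W1 @ (W2 * H * (1.0 - H)) * O * (1.0 - O)

def average_sensitivity_by_level(X, W1, eta1, W2, eta2, input_index, level):
    """Average dO/dx_i over respondents whose response on input i equals `level`."""
    rows = X[X[:, input_index] == level]
    grads = np.array([sensitivities(x, W1, eta1, W2, eta2) for x in rows])
    return grads[:, input_index].mean()
```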

References
Abbott, L. F. (1996), "Statistical Analysis of Neural Networks," in Paul Smolensky, Michael C. Mozer, and David Rumelhart (Eds.), Mathematical Perspectives on Neural Networks, Mahwah, NJ: Erlbaum.
Anderson, John R. (1976), Language, Memory, and Thought, Hillsdale, NJ: Erlbaum.
Anderson, Norman H. (1970), "Functional Measurement and Psychophysical Judgment," Psychological Review, 77, 153-170.
Anderson, Norman H. (1971), "Integration Theory and Attitude Change," Psychological Review, 78, 177-206.
Anderson, Norman H. (1991), Contributions to Information Integration Theory, Volume I: Cognition, Mahwah, NJ: Erlbaum.
Archer, Norman P. and Shouhong Wang (1993), "Application of the Back Propagation Neural Network Algorithm with Monotonicity Constraints for Two-Group Classification," Decision Sciences, 24, 60-75.

Brockett, Patrick L., William W. Cooper, Linda L. Golden, and Utai Pitaktong (1994), "A Neural Network Method for Obtaining an Early Warning of Insurer Insolvency," Journal of Risk and Insurance, 61, September, 402-424.
Chappell, Mark and Michael S. Humphreys (1994), "An Auto-Associative Neural Network for Sparse Representations: Analysis and Application of Models of Recognition and Cued Recall," Psychological Review, 101, 103-128.
Collins, A. M. and E. F. Loftus (1975), "A Spreading Activation Theory of Semantic Memory," Psychological Review, 82, 407-428.
Coombs, Clyde H. and George S. Avrunin (1977), "Single Peaked Functions and the Theory of Preference," Psychological Review, 84, 216-230.
Curram, Stephen P. and John Mingers (1994), "Neural Networks, Decision Tree Induction and Discriminant Analysis: An Empirical Comparison," Journal of the Operational Research Society, 45, 440-450.
Dawes, Robin and Bernard Corrigan (1974), "Linear Models in Decision-Making," Psychological Bulletin, 81, 95-106.
Dutta, S. and S. Shekhar (1988), "Bond Rating: A Non-Conservative Application of Neural Networks," Proceedings of the IEEE International Conference on Neural Networks, II443-II450.
Eberhart, Russell C. and Roy W. Dobbins (1990), Neural Network PC Tools: A Practical Guide, New York: Academic Press.
Einhorn, Hillel J. (1970), "The Use of Nonlinear, Noncompensatory Models of Decision Making," Psychological Bulletin, 73, 3, 221-230.
Elman, J. and D. Zipser (1988), "Learning the Hidden Structure of Speech," Journal of the Acoustical Society of America, 83, 1615-1626.
Fishbein, Martin (1967), "A Behavior Theory Approach to the Relations Between Beliefs About an Object and the Attitude Toward That Object," in M. Fishbein (Ed.), Readings in Attitude Theory and Measurement, New York: Wiley, 389-399.
Fukushima, K. and S. Miyake (1984), "Neocognitron: A New Algorithm for Pattern Recognition Tolerant of Deformations and Shifts in Position," Pattern Recognition, 15, 455-469.
Funahashi, K. (1989), "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks, 2, 183-192.
Ganzach, Yoav (1995), "Nonlinear Models of Clinical Judgment: Meehl's Data Revisited," Psychological Bulletin, 118, 3, 422-429.
Golden, Linda L., Patrick L. Brockett, Gerald Albaum, and Juan Zatarain (1992), "The Golden Numerical Comparative Scale Format for Economical Multiobject/Multiattribute Comparison Questionnaires," Journal of Official Statistics, 8, 77-86.
Green, Paul and V. Srinivasan (1978), "Conjoint Measurement in Consumer Research: Issues and Outlook," Journal of Consumer Research, 5, September, 103-123.
Green, Paul and V. Srinivasan (1990), "Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice," Journal of Marketing, 54, 3-19.
Hart, Anna (1992), "Using Neural Networks for Classification Tasks: Some Experiments on Datasets and Practical Advice," Journal of the Operational Research Society, 43, 215-226.


Hebb, D. O. (1949), The Organization of Behavior, New York: Wiley.
Hornik, K., M. Stinchcombe, and H. White (1990), "Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks," Neural Networks, 3, 551-560.
Johnson, Eric J. and Robert J. Meyer (1984), "Compensatory Choice Models of Noncompensatory Processes: The Effect of Varying Context," Journal of Consumer Research, 11, June, 528-541.
Johnson, Eric J., Robert J. Meyer, and Sanjoy Ghose (1989), "When Choice Models Fail: Compensatory Models in Negatively Correlated Environments," Journal of Marketing Research, 26, August, 255-270.
Keeney, Ralph L. and Howard Raiffa (1976), Decisions with Multiple Objectives: Preferences and Value Trade-Offs, New York: Wiley.
Lippmann, R. P. (1987), "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, 1, April, 4-22.
Lynch, John G. (1985), "Uniqueness Issues in the Decompositional Modeling of Multiattribute Overall Evaluations: An Information Integration Perspective," Journal of Marketing Research, 22, 1-19.
Malakooti, Behnam and Ying Q. Zhou (1994), "Feedforward Artificial Neural Networks for Solving Discrete Multiple Criteria Decision Making Problems," Management Science, 40, November, 1542-1561.
Menon, A., K. Mehrotra, C. K. Mohan, and S. Ranka (1996), "Characterization of a Class of Sigmoid Functions with Applications to Neural Networks," Neural Networks, 9, 5, 819-835.
McClelland, James L. (1986), "A Programmable Blackboard Model of Reading," in David Rumelhart and James McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.
McClelland, James L. and J. L. Elman (1986), "Interactive Processes in Speech Perception: The TRACE Model," in David Rumelhart and James McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.
McClelland, James L. and David E. Rumelhart (1986), "A Distributed Model of Human Learning and Memory," in David Rumelhart and James McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.
Payne, John W., James R. Bettman, and Eric J. Johnson (1993), The Adaptive Decision Maker, New York: Cambridge University Press.
Quillian, M. R. (1968), "Semantic Memory," in M. Minsky (Ed.), Semantic Information Processing, Cambridge, MA: MIT Press.
Rosenblatt, Frank (1959), "Two Theorems of Statistical Separability in the Perceptron," Proceedings of a Symposium on the Mechanization of Thought Processes, London: Her Majesty's Stationery Office, 421-456.

Rosenblatt, Frank (1961), Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Washington, DC: Spartan Books.
Rumelhart, David E., Richard Durbin, Richard Golden, and Yves Chauvin (1996), "Backpropagation: The Basic Theory," in Paul Smolensky, Michael C. Mozer, and David Rumelhart (Eds.), Mathematical Perspectives on Neural Networks, Mahwah, NJ: Erlbaum.
Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.
Salchenberger, Linda M., E. Mine Cinar, and Nicholas A. Lash (1992), "Neural Networks: A New Tool for Predicting Thrift Failures," Decision Sciences, 23, July-August, 899-917.
Shastri, Lokendra and Venkat Ajjanagadde (1993), "From Simple Associations to Systematic Reasoning: A Connectionist Representation of Rules, Variables and Dynamic Bindings Using Temporal Synchrony," Behavioral and Brain Sciences, 16, 417-494.
Simon, Herbert A. (1996), The Sciences of the Artificial, 3rd ed., Cambridge, MA: MIT Press.
Smith, Murray (1996), Neural Networks for Statistical Modeling, Boston, MA: International Thomson Computer Press.
Surkan, A. J. and J. C. Singleton (1990), "Neural Networks for Bond Rating Improved by Multiple Hidden Layers," Proceedings of the IEEE International Conference on Neural Networks, II157-II162.
Treigueiros, D. and R. Berry (1991), "The Application of Neural Network Based Methods to the Extraction of Knowledge from Accounting Reports," Proceedings of the 24th Annual Hawaii International Conference on System Sciences, IV, 137-146.
Tversky, Amos (1972), "Elimination by Aspects: A Theory of Choice," Psychological Review, 79, July, 281-299.
Wang, Jun (1994), "A Neural Network Approach to Modeling Fuzzy Preference Relations for Multiple Criteria Decision Making," Computers and Operations Research, 21, September, 991-1000.
Wang, Jun and Behnam Malakooti (1992), "A Feedforward Neural Network for Multiple Criteria Decision Making," Computers and Operations Research, 19, February, 151-167.
White, Halbert (1989), "Some Asymptotic Results for Learning in Single Hidden-Layer Feedforward Network Models," Journal of the American Statistical Association, 84, 1003-1013.
Wilkie, William L. and Edgar A. Pessemier (1973), "Issues in Marketing's Use of Multi-Attribute Attitude Models," Journal of Marketing Research, 10, 428-441.
Yoon, Youngohc, George Swales, and Thomas Margavio (1993), "A Comparison of Discriminant Analysis Versus Artificial Neural Networks," Journal of the Operational Research Society, 44, 51-60.
Zimmer, Mary R. and Linda L. Golden (1988), "Impressions of Retail Stores: A Content Analysis of Consumer Images," Journal of Retailing, 64, 3, 235-293.

This paper was received August 16, 1995, and has been with the authors 15 months for 2 revisions; processed by Gary L. Lilien.
