

An Integrated Data Preparation Scheme for Neural Network Data Analysis

Lean Yu, Shouyang Wang, and K.K. Lai

Abstract—Data preparation is an important and critical step in neural network modeling for complex data analysis, and it has a huge
impact on the success of a wide variety of complex data analysis tasks, such as data mining and knowledge discovery. Although data
preparation in neural network data analysis is important, the existing literature on neural network data preparation is scattered, and
there has been no systematic study of data preparation for neural network data analysis. In this study, we first propose an integrated
data preparation scheme as a systematic study for neural network data analysis. Within the integrated scheme, a survey of data
preparation, focusing on problems with the data and the corresponding processing techniques, is provided. Meanwhile, intelligent
solutions to some important issues and dilemmas within the integrated scheme are discussed in detail. Subsequently, a cost-benefit
analysis framework for the integrated scheme is presented to analyze the effect of data preparation on complex data analysis. Finally,
a typical example of complex data analysis from the financial domain is provided to show the application of data preparation
techniques and to demonstrate the impact of data preparation on complex data analysis.

Index Terms—Data preparation, neural networks, complex data analysis, cost-benefit analysis.

1 INTRODUCTION

PREPARING data is an important and critical step in neural network modeling for complex data analysis, and it has an immense impact on the success of a wide variety of complex data analysis tasks, such as data mining and knowledge discovery [1]. The main reason is that the quality of the input data into neural network models may strongly influence the results of the data analysis [2]. As Lou [3] stated, the effect on the neural network's performance can be significant if important input data are missing or distorted. In general, properly prepared data are easy to handle, which makes the data analysis task simple. On the other hand, improperly prepared data may make data analysis difficult, if not impossible. Furthermore, data from different sources and the growing amounts of data produced by modern data acquisition techniques have made data preparation a time-consuming task. It has been claimed that 50-70 percent of the time and effort in data analysis projects is required for data preparation [2], [4]. Therefore, data preparation involves enhancing the data in an attempt to improve complex data analysis.

In past decades, artificial neural networks (ANNs), as a class of typical intelligent data analysis tools, have been studied extensively in many fields of knowledge, from science (e.g., [5]) to engineering (e.g., [6]) and from management (e.g., [7]) to control (e.g., [8]), and many software products, such as NeuroShell (http://www.neuroshell.com), BrainMaker (http://www.calsci.com), and the Neural Network Toolbox of Matlab (http://www.mathworks.com), have been applied successfully in many practical projects. However, most studies and commercial systems focus almost exclusively on the design and implementation of neural models. Data preparation in neural network modeling has received scant recognition.

In almost all theoretical and practical research on neural networks [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], neural network data preparation concentrates on data normalization for transformation and data division for training. Some studies even utilize neural networks for modeling without any data preparation procedure. In these studies, there is an implicit assumption that all the data are prepared in advance and can be used directly in modeling. In practice, data are not always prepared beforehand for specific data analysis tasks. Even when some data exist for a specific project, their quality and completeness are limited. As a result, the complex data analysis process cannot succeed without a serious effort to prepare the data. Strong evidence [18], [19], [20] reveals that data quality has a significant effect on neural network models. In addition, various interpretations have been given to the role of and the need for data preparation. Zhang et al. [21] revealed that data preparation could generate smaller-magnitude and higher-quality data, which can significantly improve the efficiency of complex data analysis. In the case of neural network learning, data preparation enables users to decide how to represent the data, which concepts to learn, and how to present the results of data analysis so that it is easier to explain them in the real world [22]. Data preparation is therefore crucial in neural network data analysis for guaranteeing data quality and completeness.

. L. Yu is with the Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100080, China. E-mail: yulean@amss.ac.cn.
. S. Wang is with the Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100080, China, and the College of Business Administration, Hunan University, China. E-mail: sywang@amss.ac.cn.
. K.K. Lai is with the Department of Management Sciences, City University of Hong Kong, Hong Kong, and the College of Business Administration, Hunan University, China. E-mail: mskklai@city.edu.hk.

Manuscript received 20 Nov. 2004; revised 30 Mar. 2005; accepted 14 July 2005; published online 19 Dec. 2005.

Although data preparation is useful for any kind of analysis, neural networks, as a class of important intelligent data analysis tools, have some special requirements for data preparation. First of all, a neural network is a novel learning paradigm, but it is a time-consuming data analysis tool, unlike any other. To speed up the analysis process, data preparation is very important and necessary for complex data analysis tasks. Second, data preparation can largely reduce the model complexity of neural network modeling, which matters more for complex data analysis tasks than it does for other data analysis tools. Third, effective data preparation can increase the generalization ability of data analysis, especially for neural networks. Basically, the main objective of complex data analysis is to discover knowledge that will be used to solve problems or make decisions, but problems with the data may prevent this. In most cases, imperfections in the data are not noticed until the data analysis starts. This is especially the case for neural networks. Therefore, data preparation for neural networks is more important than for other data analysis tools.
However, in many neural network data preparation studies [23], [24], [25], [26], [27], [28], [29], [30], [31], data preparation is restricted to data cleaning, normalization, and division. In these studies, most authors considered data preprocessing to be equivalent to data preparation. Actually, the two terms are different. First of all, they differ in meaning. According to the American Heritage Dictionary [32], "preparation" is defined as "a preliminary measure that serves to make ready for something," while "preprocessing" is "to perform conversion, formatting, or other functions on (data) before further processing." Second, they differ in scope: data preparation covers more than data preprocessing. Generally, data preprocessing includes only data transformation and data formatting, while data preparation also contains data collection, data selection, data integration, and data validation in addition to data preprocessing. In our study, data preparation is expanded into an integrated scheme with three phases from a systematic perspective. This integrated scheme comprises a data preanalysis phase, including data collection, data selection, and data integration, and a data postanalysis phase, including data validation and data readjustment, as well as a data preprocessing phase. In this sense, our study goes well beyond previous studies [23], [24], [25], [26], [27], [28], [29], [30], [31].
In the overview of this topic, we found that there are three main problems in neural network data preparation. The first is that there is no universal scheme or methodology for neural network data analysis (Problem I). That is, there is no complete and systematic data preparation framework or architecture for neural network data analysis. Although some related studies have been presented, a systematic treatment of neural network data preparation has not been formulated so far. Most existing studies focus on data preprocessing, and these researchers often confused data preparation with data preprocessing. Therefore, it is necessary to construct a universal data preparation scheme for neural network data analysis. Second, in preparing data for neural network data analysis, some important issues and dilemmas are often faced and are hard to handle (Problem II). For example, skilled experts may find a good solution, but may also find it difficult to judge whether the preparation chosen is appropriate. Third, data preparation requires extra effort, raising the question of cost versus benefits (Problem III). If they see data preparation as having little impact on final neural network data analysis results, decision-makers may be unwilling to invest in data preparation.

In light of the three problems outlined above, the main motivations of this study are four-fold:

1. to propose an integrated data preparation framework for neural network data analysis,
2. to provide some intelligent solutions to some important issues and dilemmas in the data preparation framework,
3. to analyze and confirm the effects of the proposed data preparation scheme on neural network data analysis, and
4. to survey the literature on data preparation for neural network data analysis.

The study first proposes an integrated data preparation scheme for neural network modeling to contribute to the solution of the first main problem and then presents, in detail, alternative solutions to the dilemmas of the data preparation framework. For the third problem, a cost-benefit analysis framework for neural network data preparation is proposed. However, empirical evidence of the impact of data preparation on complex data analysis is critical. Without loss of generality, we explored the effects of data preparation on data analysis within a specific problem domain—the business financial risk classification area. This is a vital research field with a vast number and variety of important data sets.

The remainder of the study is organized as follows: A brief description of neural network models for complex data analysis is presented in Section 2. In view of the neural network data analysis framework, an integrated data preparation scheme for neural networks is proposed to fill the gap in the literature. Meanwhile, the steps in every phase, as well as important issues of the integrated scheme, are described and discussed in detail. Accordingly, some intelligent solutions to some important issues and dilemmas are provided in Section 3. A comprehensive cost-benefit analysis framework for analyzing the effect of the proposed integrated data preparation scheme on neural network data analysis is proposed in Section 4. To verify the effects of data preparation on neural network data analysis, an empirical example is presented in Section 5. Finally, the paper ends with concluding remarks and future directions in Section 6.

2 NEURAL NETWORKS FOR COMPLEX DATA ANALYSIS

The foundation of the artificial neural networks (ANNs) paradigm was laid in the 1950s. Since then, ANNs have earned significant attention because of the development of more powerful hardware and neural algorithms [9]. ANNs have been studied and explored by many researchers and applied in almost every field; examples include system identification and modeling [10] and prediction and classification [11], [12], [13], [14], [15], [16], [17]. Generally, ANNs can be used as an effective intelligent data analysis tool because of their unique learning capability. In this section, the entire process of neural network data analysis is presented, as shown in Fig. 1.

Fig. 1. The process of neural network data analysis.
As can be seen from Fig. 1, neural network modeling for complex data analysis has four main processes: problem identification, data preparation, neural network modeling, and data analysis. In the first process, we can identify a problem by analyzing its expected results and consulting the relevant domain experts. Problem definitions and expected results are formulated to guide the subsequent tasks. The aim of the second process, data preparation, which will be described later, is to prepare high-quality data for data analysis so as to obtain satisfactory results. In the third process, after initialization, neural network models are trained iteratively. If the results of the data validation are rational, the generalized results obtained from the trained networks can be used for data analysis. Finally, depending on the generalized results, the goal of the complex data analysis, such as data mining and decision support, can be realized.

3 THE PROPOSED INTEGRATED DATA PREPARATION SCHEME

In this section, we first propose an integrated data preparation scheme based on the entire process of the neural network data analysis framework. We then present details of the scheme and review some related literature. Subsequently, some intelligent solutions to important issues and dilemmas in the integrated scheme are presented.

3.1 The Integrated Data Preparation Scheme for Neural Network Data Analysis

As noted earlier, neural network data preparation for complex data analysis is very important. However, no standard data preparation framework for neural network modeling has so far been suggested (Problem I). In view of the importance of data preparation, we propose an integrated data preparation scheme for neural network modeling, as illustrated in Fig. 2.

Fig. 2. The integrated data preparation scheme for neural network data analysis.

As shown in Fig. 2, the integrated data preparation scheme consists of three phases: data preanalysis, in which data of interest are identified and collected; data preprocessing, in which data are examined and analyzed and in which some data may be restructured or transformed to make them more useful; and data postanalysis, in which some data are validated and readjusted. In the integrated data preparation scheme, every phase comprises different processing steps. For example, the data preanalysis phase includes data requirement analysis, data collection, data selection, and data integration. Data preprocessing comprises data inspection and data processing. Data postanalysis contains data division and redivision, data validation, and data readjustment. This phase is called "postanalysis" because the data preparation tasks may be adjusted in terms of feedback information from the process of neural network training, learning, and validation (i.e., the modeling process). Because the postanalysis adjusts the data for modeling purposes, it is still considered part of data preparation. In almost all existing studies, data preparation includes only the second phase, data preprocessing. Therefore, our proposed data preparation scheme is broader than others, which distinguishes our study from them. This is an important contribution that can fill the gap in the literature.

In the following sections, the three phases of the integrated data preparation scheme are described in detail. First, the detailed steps of every phase and some data problems that are normally encountered are discussed, and some existing methods and techniques to overcome these problems are overviewed step-by-step. Some important issues and dilemmas (Problem II) are then described, and some rational intelligent solutions are presented.

3.2 Data Preanalysis of the Integrated Data Preparation Scheme

As seen in Fig. 2, this phase consists of four steps: data requirement analysis, data collection, data variable selection, and data integration.

3.2.1 Data Requirement Analysis

For a specific data analysis project, the first step is to understand the data requirements of the project in conjunction with the problem definitions and expected objectives. If the problem is outside one's field of expertise, interviewing specialists or domain experts may provide insight into the underlying process so that some potential problems may be avoided [23]. Questions may include the following:

1. What information would we like to have?
2. What data are required for a specific task?
3. Where can the data be found?
4. What format are the data in?
5. What external sources of data are available?

Once the data requirements are understood, data will be collected from various sources.
intelligent solution—genetic algorithm (GA)—to this
3.2.2 Data Collection important issue for neural network data analysis. To
This is an important step because the outcome of the step date, genetic algorithms (GAs) have become a
will restrict subsequent phases or steps. Based on the data popular optimization method as they often succeed
in finding the best optimum in contrast to most
requirements, all kinds of approaches, such as information
common optimization algorithms. Genetic algo-
retrieval [33] and text mining [34], will be used to collect
rithms imitate the natural selection process in
various data from various sources. In some situations, some biological evolution with selection, mating repro-
important data may be hard to collect. Thus, surrogate data duction and mutation, and the sequence of the
is useful and necessary. different operations of a genetic algorithm is shown
in the left part of Fig. 3. The parameters to be
3.2.3 Data Variable Selection

Once data are collected, determining variables for modeling becomes possible. The goal of any model should be parsimony, i.e., to find the simplest explanation of the facts using the fewest variables. Therefore, it is best to identify the variables that will save modeling time and reduce the problem space [23]. There are many means of variable selection [35], [36], [37], [38], [39]. For example, Lemke and Muller [35] used a modular approach and self-organizing variable selection to realize variable reduction, while Tuv and Runger [36] presented a nonhierarchical metric clustering method to deal with high-dimensional classification. Some other methods, such as correlation analysis with the Granger causality method [37], principal component analysis (PCA) [38], and stepwise multiple regression [39], are also mentioned in the literature.

3.2.4 Data Integration

If data are collected from many different sources by several different groups, the data are still disordered and scattered, and data integration becomes vital [22]. This is especially true when data contain text and symbolic attributes and have to be combined for further analysis. In general, data sources can be divided into internal and external sources [40]. Similarly, data representations can be divided roughly into structural and nonstructural representations. Therefore, there are different data integration methods for different data representations from multiple sources. Regarding structural data from different sources, we can utilize mature database techniques, such as virtual views and data warehouse techniques [41], to integrate data relations from different sources via join and/or union operations. Semantic and descriptive conflicts can be resolved by renaming operations and conversions. Some metalevel conflicts and instance-level conflicts can be resolved by advanced schema transformation (e.g., transposition of relations), reconciliation functions, and user-defined aggregates. More information can be found in [41], [42]. With regard to the integration of nonstructural data from different sources, [43] and [44] present some solutions.

3.2.5 Important Issues and Dilemmas of This Phase

In this phase, two important issues are described and discussed in detail.

1. Important Issue I: Data variable selection with genetic algorithms. From the previous analysis, we find that data variable selection is an extremely important issue, and many related studies [35], [36], [37], [38], [39] have been presented. Here, we present an intelligent solution—the genetic algorithm (GA)—to this important issue for neural network data analysis; a code sketch is given at the end of this subsection. To date, genetic algorithms (GAs) have become a popular optimization method, as they often succeed in finding the best optimum, in contrast to most common optimization algorithms. Genetic algorithms imitate the natural selection process in biological evolution, with selection, mating reproduction, and mutation; the sequence of the different operations of a genetic algorithm is shown in the left part of Fig. 3. The parameters to be optimized are represented by a chromosome, whereby each parameter is encoded in a binary string called a gene. Thus, a chromosome consists of as many genes as there are parameters to be optimized. Interested readers are referred to [45], [46] for more details. In the following, the GA for data variable selection is discussed.

Fig. 3. Data variable selection with the genetic algorithm.

First of all, a population, which consists of a given number of chromosomes, is initially created by randomly assigning "1" and "0" to all genes. In the case of variable selection, a gene contains only a single bit string for the presence or absence of a variable. The top right part of Fig. 3 shows a population of four chromosomes for a three-variable selection problem. In this study, the initial population of the GA is randomly generated, except for one chromosome, which was set to use all variables. The binary string of a chromosome has the same size as the number of variables to select from, whereby the presence of a variable is coded as "1" and the absence of a variable as "0." Consequently, the binary string of a gene consists of only one single bit. The subsequent work is to evaluate the chromosomes generated by the previous operation with a so-called fitness function; the design of the fitness function is a crucial point in using a GA, as it determines what the GA should optimize. In the case of variable selection for neural network data analysis, the goal is to find a small subset of variables that are most significant for complex data analysis. In this study, the complex data analysis is based on neural networks for modeling the relationship between the input variables and the responses. Thus, the evaluation of the fitness starts with the encoding of the chromosomes into neural networks, whereby "1" indicates that a specific variable is used and "0" that a variable is not used by the network. Then, the networks are trained with a training data set and, after that, a testing data set is predicted. Finally, the fitness is calculated by a so-called fitness function f. For a prediction/classification problem, for example, our fitness function for GA variable selection can take the following form:

$$f = \left(0.3\,\mathrm{RMSE}_{training} + 0.7\,\mathrm{RMSE}_{testing}\right)\cdot\left(1 + a\,n_v/n_{tot}\right), \qquad (1)$$

where $n_v$ is the number of variables used by the neural networks, $n_{tot}$ is the total number of variables, and RMSE is the root mean square error, which is defined in (2) with $N$ as the total number of samples predicted, $y_t$ as the actual value, and $\hat{y}_t$ as the predicted value:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\hat{y}_t - y_t\right)^2}. \qquad (2)$$

From (1), we find that the fitness function can be broken up into three parts. The first two parts correspond to the accuracy of the neural networks. Thereby, $\mathrm{RMSE}_{training}$ is based on the prediction of the training data used to build the neural nets, whereas $\mathrm{RMSE}_{testing}$ is based on the prediction of separate test data not used for training the neural networks. It was demonstrated in [47] that using the same data for the variable selection and for the model calibration introduces a bias; variables are then selected based on data poorly representing the true relationship. On the other hand, it was also shown that a variable selection based on a small data set is unlikely to find an optimal subset of variables [47]. Therefore, a ratio of 3:7 between the influence of the training and testing data was chosen. Although partly arbitrary, this ratio should give the training data too little influence to bias the feature selection, yet take the samples of the larger training set partly into account. The third part of the fitness function rewards small networks using only a few variables, by an amount proportional to the parameter $a$. The choice of $a$ influences the number of variables used by the evolved neural nets: a high value of $a$ results in only a few variables being selected for each GA, whereas a small value results in more variables being selected. In sum, the advantage of this fitness function is that it takes into account not only the testing error on the test data, but also, partially, the training error and, primarily, the number of variables used to build the corresponding neural nets.

After evaluating the fitness of the population, the chromosomes with the best fitness values are selected by means of the roulette wheel. The chromosomes are thereby allocated space on a roulette wheel proportional to their fitness, and, thus, the fittest chromosomes are more likely to be selected. In the following mating step, offspring chromosomes are created by a crossover technique. A so-called one-point crossover technique is employed, which randomly selects a crossover point within the chromosome. The two parent chromosomes are then interchanged at this point to produce two new offspring. After that, the chromosomes are mutated with a probability of 0.005 per gene by randomly changing genes from "0" to "1" and vice versa. The mutation prevents the GA from converging too quickly in a small area of the search space. Finally, the last generation is judged: if the stopping criterion is met, the optimized subsets are selected; if not, the evaluation and reproduction steps are repeated until a certain number of generations, a defined fitness, or a convergence criterion of the population is reached. In the ideal case, all chromosomes of the last generation have the same genes, representing the optimal solution.

2. Important Issue II: The integration of nonstructural data. Another important issue is the integration of nonstructural data. Nonstructural data consist of unstructured data and semistructured data. As noted earlier, integrating nonstructural data from different sources is difficult. Here, a three-phase approach for this task is proposed.

The first phase is to extract related features from text data or documents by semantic analysis and to formulate an event-specific summary. This extraction makes nonstructural data more readable and representative. Some string matching algorithms, such as [48], [49], can be used. The second phase is to transform the summary into corresponding nominal or numerical variables by classification algorithms, such as [49]. This transformation makes modeling easier because the transformed variables can be treated as dummy variables in models. The last phase is to tabulate the transformed variables, making them more easily used by models. Nonstructural data can thereby be formulated into a single data set using the aforementioned data integration techniques.
3.3 Data Preprocessing Phase of the Integrated Data Preparation Scheme

After the data are identified and collected, they must be examined to identify any characteristics that may be unusual or indicative of more complex relationships. This is because the data from the previous phase may be impure, divergent, untrustworthy, or even fraudulent. Therefore, data preprocessing is required. In this study, data preprocessing is a transformation, or conditioning, of data designed to make modeling more robust; it includes data inspection and data processing.

3.3.1 Data Inspection

The first step of data preprocessing is data inspection. The goal of data inspection is to find problems with the data. Data inspection includes data quantity and data quality inspection. The former checks the size of the data sets; the latter checks for unusual data patterns. The data quantity inspection can be performed by observation. Generally, there are two main problems here: a too-large data size or a too-small data size. The data quality inspection can be performed by statistical methods, including checks for data noise, missing data, data scale, data trending, and data nonstationarity. There are four approaches to data quality inspection: a line graph for checking missing data and data trending, a control plot [50] for data noise detection, the unit root test [51] for checking data nonstationarity, and SVM-OD [52] for outlier detection.

3.3.2 Data Processing

The above step (i.e., data inspection) can identify seven main problems: too many data, too few data, noisy data (including outliers and errors), missing data, multiscale data, trending (or seasonal) data, and nonstationary data. Accordingly, several processing techniques—data sampling [53], [54], data regathering, data denoising [52], [55], [56], [57], data repairing [4], [58], [59], [60], [61], [62], [63], data normalization [3], [64], and data differencing [64], [65], [66]—are used to deal with them.

3.3.3 Important Issues and Dilemmas of This Phase

In this section, six important issues and two dilemmas concerning data preprocessing are presented.

1. Important Issue I: Too many data and data sampling. In many domains, such as space (e.g., image data) and finance (e.g., stock price data every five minutes), the volume of data and the rate at which data are produced may be a limiting factor in performing on-time data analysis. Furthermore, the amount of data is sometimes beyond the capability of the hardware and software available for data analysis. Therefore, sample space reduction is important. Here, clustering and data discretization are used to treat the problem, as the sketch after this item illustrates.

If there are a large number of observations, i.e., a large sample size, a useful approach is to obtain a representative subset of the data by data sampling. An effective way of doing this is to divide the sample by forming clusters of sample observations. Every cluster can then be represented by one observation. This can be 1) one specific observation, 2) the mean value of all observations in the cluster, or 3) the observation that has the lowest distance from all the others, and so on. In addition, data sampling techniques [53], [54], which select a representative subset from a large population of data, can also be used.

If data clustering is difficult, discretization can be used. Discretization is aimed at reducing the number of distinct values for a given attribute, particularly for analysis methods requiring discrete attribute values. Possible methods of discretization are 1) histogram-based discretization, 2) discretization based on concept hierarchies [53], and 3) entropy-based discretization [54]. Here, we focus on histogram-based discretization due to its simplicity.
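As a concrete illustration of the two reduction ideas named in Important Issue I, the sketch below draws one representative per cluster (the member closest to its cluster center) and discretizes a continuous attribute into equal-width histogram bins. The cluster count and bin count are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of cluster-based sampling and histogram-based
# discretization (Section 3.3.3, Issue I); parameter values are assumed.
import numpy as np
from sklearn.cluster import KMeans

def cluster_representatives(X, n_clusters=100, seed=0):
    """Replace a large sample by the observation closest to each cluster center."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    reps = []
    for k in range(n_clusters):
        members = np.where(km.labels_ == k)[0]
        d = np.linalg.norm(X[members] - km.cluster_centers_[k], axis=1)
        reps.append(members[np.argmin(d)])   # lowest distance to the center
    return X[np.array(reps)]

def histogram_discretize(x, n_bins=10):
    """Map a continuous attribute to integer bin labels with equal-width bins."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
```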
2. Important Issue II: Too few data and data regathering. Conversely, if too few data are collected, data regathering will be necessary for complex data analysis. Nguyen and Chan [31] found that neural networks perform worse when few data are available or the data are insufficient. As data regathering may be difficult, all kinds of information channels and gathering tools should be used for this task.

3. Important Issue III: Noisy data and data denoising. As has been stated [22], noise in the data weakens the predictive capability of the features. Therefore, noise reduction, or data denoising, is very important for neural network data analysis. In the existing literature, noise elimination has been extensively studied [52], [55], [56], [57]. Here, we propose a regression-based data denoising approach to eliminate the effect of noise; a sketch follows this item.

In our approach, the first step in noise reduction is noise detection. We use a control plot to detect the noise, as previously mentioned. The second step is noise filtering to eliminate outliers. Here, we use a regression technique. In linear regression, also known as the least squares method, the goal is to find a straight line modeling a two-dimensional data set. This line $y = \beta x + \alpha$ is specified by the parameters $\beta$ and $\alpha$, which are calculated from the known values of the attributes $x$ and $y$. Let

$$\bar{x} = \frac{1}{n}\sum_i x_i \quad \text{and} \quad \bar{y} = \frac{1}{n}\sum_i y_i; \qquad (3)$$

then

$$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \quad \text{and} \quad \alpha = \bar{y} - \beta\bar{x}. \qquad (4)$$

The parameters $\alpha$ and $\beta$ can now be used to remove data items lying well away from the regression line. For example, this can be decided simply on the basis of the absolute distance, or by removing the n percent of items with the largest distance as noise or outliers.
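The regression-based filtering just described reduces to a few lines of code: fit the line of (3)-(4) and drop the points farthest from it. The 5 percent cut-off below is an illustrative assumption; the paper leaves the threshold open.

```python
# Hedged sketch of the regression-based denoising step (Issue III):
# Eqs. (3)-(4) by least squares, then drop the farthest points as outliers.
import numpy as np

def denoise(x, y, drop_fraction=0.05):
    xbar, ybar = x.mean(), y.mean()                           # Eq. (3)
    beta = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    alpha = ybar - beta * xbar                                # Eq. (4)
    residual = np.abs(y - (beta * x + alpha))                 # distance from the line
    keep = residual.argsort()[: int(len(x) * (1 - drop_fraction))]
    return x[keep], y[keep]
```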

4. Important Issue IV: Missing data and data repairing. Roughly, missing data can be divided into two types: missing attributes and missing attribute values. Missing or insufficient attributes are examples of data problems that may complicate data analysis tasks, such as learning, and hinder the accurate performance of most data analysis systems [22]. For example, in the case of learning, these data insufficiencies limit the performance of a learning algorithm or statistical tool applied to the collected data, no matter how complex the algorithm is or how many data are used. Furthermore, missing attributes are a source of too few data, as was previously noted. Therefore, the related attribute data should be regathered.

However, in most practical applications, an important problem is the handling of missing attribute values in a data set. Several studies have dealt with missing values using numerous methods (see [4], [58], [59], [60], [61], [62], [63]). The aim of these methods is to recover missing values that are as close as possible to the original values. The methods can be categorized into two types: imputation-based and data mining-based. The former is primarily for handling missing values of numerical data, while the latter is for categorical data. The principle of imputation methods is to estimate the missing values by using the existing values as an auxiliary base. The underlying assumption is that there are certain correlations between different data tuples over all attributes. Existing methods include mean imputation [58], hot-deck or cold-deck imputation [59], regression, and composite imputation [60]. For the data mining-based methods, techniques such as association rules [61], clustering [62], and regression [63] are used to discover similar patterns between data tuples so as to predict the missing values; two imputation variants are sketched below.
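As a hedged illustration of the two imputation families just described, the sketch below implements mean imputation [58] and a simple regression-based fill in the spirit of [63]. It assumes a NumPy matrix with NaN as the missing-value marker and, for the regression variant, complete remaining columns; both are assumptions for the example, not conditions stated in the paper.

```python
# Hedged sketch of two imputation strategies (Issue IV).
import numpy as np

def mean_impute(X):
    """Fill every NaN with its column mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X[idx] = np.take(col_means, idx[1])
    return X

def regression_impute(X, col):
    """Fill NaNs in one column by least squares on the other (complete) columns."""
    X = X.copy()
    others = np.delete(np.arange(X.shape[1]), col)
    miss = np.isnan(X[:, col])
    A = np.column_stack([X[~miss][:, others], np.ones((~miss).sum())])
    w, *_ = np.linalg.lstsq(A, X[~miss, col], rcond=None)
    B = np.column_stack([X[miss][:, others], np.ones(miss.sum())])
    X[miss, col] = B @ w
    return X
```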
5. Important Issue V: Multiscale data and data normalization. In neural network learning, data with different scales often lead to instability of neural networks [64]. At the very least, data must be scaled into the range used by the input neurons in the neural network. This is typically -1 to 1 or 0 to 1 [3]. Many commercially available generic neural network development programs, such as BrainMaker, automatically scale each input. Moreover, neural networks always require that the range of the data be neither too small nor too large, so that the precision limits of the computer are not exceeded. Otherwise, the data should be scaled. Furthermore, data normalization helps to improve the performance of neural networks [3]. Therefore, data normalization is necessary for treating multiscale data. The main reason is that neural network models often rely on Euclidean measures, and unscaled data could bias or interfere with the training process. Linear scaling and sigmoidal function normalization are the commonly used methods; both are sketched below.

The linear scaling method is a simple and effective approach. Let the maximal and minimal values of the input range be $I_{max}$ and $I_{min}$. Then, the formula for transforming each datum $D$ to an input value $I$ is:

$$I = I_{min} + \frac{\left(I_{max} - I_{min}\right)\left(D - D_{min}\right)}{D_{max} - D_{min}}, \qquad (5)$$

where $D_{max}$ and $D_{min}$ are the maximal and minimal values of the given data. This method of normalization will scale the input data into the appropriate range.

In addition, a logistic function can be used as a data normalization method, depending on the characteristics of the data. Here, a sigmoidal function is utilized, as follows:

$$I(x_i) = \frac{r_i}{1 + \exp\left[-p_i\left(x_i - q_i\right)\right]}, \quad i = 1, 2, \cdots, n, \qquad (6)$$

where $r_i$ is used to constrain the range of the $i$th transformed element of the $n$-element data set, $q_i$ can be selected as the smallest value of the $i$th element of the data set, and $p_i$ decides the sharpness of the transfer function. Also, (6) can compress abnormal data into a specific range. Note that other continuous and differentiable transformation functions can also be selected.
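Both normalization formulas translate directly into code. The sketch below mirrors (5) and (6); the target range and the r, p, q settings are illustrative defaults rather than values prescribed by the paper.

```python
# Hedged sketch of the two normalization methods of Issue V.
import numpy as np

def linear_scale(d, i_min=-1.0, i_max=1.0):
    """Eq. (5): map data linearly from [d.min(), d.max()] to [i_min, i_max]."""
    return i_min + (i_max - i_min) * (d - d.min()) / (d.max() - d.min())

def sigmoidal_scale(x, r=1.0, p=1.0, q=None):
    """Eq. (6): I(x) = r / (1 + exp(-p * (x - q))); q defaults to the minimum."""
    q = x.min() if q is None else q
    return r / (1.0 + np.exp(-p * (x - q)))
```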
6. Important Issue VI: Trending data, seasonal data, and nonstationary data. For a neural predictor, the presence of a trend may have undesired effects on the prediction performance [64]. Similarly, researchers [65], [66] have demonstrated that seasonal data have a significant impact on neural network prediction. As to univariate time series analysis with neural networks, nonstationarity is a problem [66]. Therefore, data detrending, deseasonalization, and data stationarity are also important issues in complex data analysis. For trending, seasonal, and nonstationary data, differencing or log-differencing [64], [65], [66] is a simple and effective treatment method that is widely used; see the sketch below.
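The differencing treatments cited above are essentially one-liners; the sketch below shows plain and log differencing, with a seasonal lag of 12 as an illustrative monthly-data assumption.

```python
# Hedged sketch of differencing and log-differencing (Issue VI).
import numpy as np

def difference(y, lag=1):
    return y[lag:] - y[:-lag]                    # e.g., lag=12 for monthly seasonality

def log_difference(y, lag=1):
    return np.log(y[lag:]) - np.log(y[:-lag])    # requires strictly positive y
```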

7. Dilemma I: Data sampling and sample representativeness trade-off. In this phase, the first dilemma is a trade-off between data sampling size and sample representativeness. Generally, as a larger data sample is taken, the variability of the sample tends to fluctuate less between the smaller and larger samples. To resolve the trade-off problem, this study presents a novel convergence approach.

The convergence approach has two types: incremental and decremental. In the incremental type, a random sample is first selected and its distribution properties (such as mean, standard deviation, skewness, and kurtosis) are calculated. Then, the sample distribution is tested repeatedly as additional instances are added. If the sample distribution is recalculated as each additional instance is added while the number of instances in the sample is still low, each addition will make a large impact on the shape of the curve. However, when the number of instances in the sample is modest, the overall shape of the curve will settle down and will change little as new instance values are added. This settling down of the overall curve is the key to deciding the "convergence" between two different data sets; a sketch of the incremental variant follows. The decremental method is the opposite of the incremental method.
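A minimal sketch of the incremental convergence approach follows, under illustrative choices of starting size, batch size, and tolerance: the sample grows until its first four moments settle down.

```python
# Hedged sketch of the incremental convergence approach (Dilemma I).
import numpy as np
from scipy import stats

def moments(x):
    return np.array([x.mean(), x.std(), stats.skew(x), stats.kurtosis(x)])

def incremental_sample(x, start=100, step=50, tol=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n, prev = start, moments(x[order[:start]])
    while n + step <= len(x):
        n += step                            # add further instances
        cur = moments(x[order[:n]])
        if np.max(np.abs(cur - prev)) < tol:
            break                            # distribution has "converged"
        prev = cur
    return x[order[:n]]
```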
8. Dilemma II: Noise and nonstationarity trade-off. The second dilemma of this phase is the so-called "noise-nonstationarity trade-off" [66] for neural network univariate time-series models. That is, when there are noise and nonstationarity in the time series at the same time, neural network training on older data sets (a longer training window) can induce biases in predictions because of nonstationarity, whereas using a shorter training window can increase the estimation error (too much model variance) because of the noise in the limited data set. Moody [66] suggested plotting the testing error against the training window length when choosing the optimal training window. We followed this suggestion.
3.4 Data Postanalysis Phase of the Integrated Data Preparation Scheme

3.4.1 Data Division and Redivision

Following data preprocessing, the data obtained from the previous phase are used for network training and generalization. The first main data preparation task in this phase is to split the data into subsets for neural network learning. Usually, a data set is divided into training data and testing data (sometimes there is a third data set—a validation set). So far, there is no universal rule to determine the size of either a training data set or a testing data set. The BrainMaker software randomly selects 10 percent of the facts from the data set and uses them for testing. Yao and Tan [67] suggested that historical data be divided into three sets: training, validation, and testing. The training set contains 70 percent of the collected data, while the validation and testing sets contain 20 percent and 10 percent, respectively; a sketch of such a split follows. Sometimes, based on the feedback of the modeling results, data redivision is required.
the data analysis model (e.g., a prediction model) is
either a training data set or a testing data set. Brainmaker
not complex enough to capture all the interferences
software randomly selects 10 percent of the facts from the of the relationship. The estimation error is caused by
data set and uses them for testing. Yao and Tan [67] modeling measured random noise of various kinds.
suggested that historical data be divided into three sets: The optimal prediction is obtained when the
training, validation, and testing. The training set contains remaining interference error and the estimation
70 percent of the collected data, while the validation and the error balance each other, as shown in Fig. 4. The
testing sets contain 20 percent and 10 percent, respectively. effect of the prediction error increasing due to a too
Sometimes, through feedback of modeling results, data simple model is called underfitting, whereas the
redivision is required. effect of the increased prediction error due to a too
complex model is called overfitting.
3.4.2 Data Validation In the right side of Fig. 4, it is shown that the
Generally, the training error always decreases with an optimal complexity of the model highly depends on
increase in the number of cycles or epochs. In contrast, the the size and quality of the data set. For data sets
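The validation-based stopping implied by Sections 3.4.2 and 3.4.3 can be sketched as follows; train_one_epoch and validation_error are hypothetical callables standing in for the network-specific routines, and the patience value is an illustrative choice rather than anything prescribed by the paper.

```python
# Hedged sketch of validation-based stopping (Section 3.4.2):
# train while the validation error still improves; stop past its minimum.
def validate_training(train_one_epoch, validation_error,
                      max_epochs=10000, patience=50):
    best_err, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                 # hypothetical training routine
        err = validation_error()          # hypothetical validation routine
        if err < best_err:                # validation curve still descending
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:        # past the minimum: likely overfitting
                break
    return best_epoch, best_err
```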

Fig. 4. Overfitting, underfitting, and model complexity.

3.4.4 Important Issues and Dilemmas of This Phase

In this phase, an important issue and two dilemmas are described in the following.

1. Important Issue I: Overfitting, underfitting, and model complexity. Neural networks are often referred to as universal function approximators since, theoretically, any continuous function can be approximated to a prescribed degree of accuracy by increasing the number of neurons in the hidden layer of a feedforward network [68]. Yet, in reality, the objective of a data analysis (e.g., prediction) is not to approximate a data set with ultimate accuracy, but to find a suitable model with the best possible generalizing ability [69]. The gap between the approximation of a data set and the model's generalization ability becomes the more problematic the higher the number of variables and the smaller the data set, as explained below.

For a prediction problem, the best measure of generalizing ability is the prediction error on as much independent, separate validation data as possible. According to the left side of Fig. 4, the prediction error is composed of two main contributions, the remaining interference error and the estimation error [70]. The interference error is the systematic error (bias) due to unmodeled interference in the data, arising when the data analysis model (e.g., a prediction model) is not complex enough to capture all the interferences of the relationship. The estimation error is caused by modeling measured random noise of various kinds. The optimal prediction is obtained when the remaining interference error and the estimation error balance each other, as shown in Fig. 4. The effect of the prediction error increasing because the model is too simple is called underfitting, whereas the effect of the prediction error increasing because the model is too complex is called overfitting.

The right side of Fig. 4 shows that the optimal complexity of the model depends highly on the size and quality of the data set. For data sets which are noisy and limited in size, a simple model is needed to prevent overfitting. Neural networks which are too complex (too big) are in danger of learning such data by heart and, consequently, of modeling the noise in the data. For big data sets which contain only a little noise, the best model is more complex, resulting in an overall smaller prediction error for the same functional relationship. Consequently, for each data set, an optimal model complexity has to be found, whereby the complexity of the models is directly related to the number of data variables utilized by the model.

2. Dilemma I: Training set size and model fitting. In this phase, the first dilemma is the training set size and model fitting dilemma, that is, how to determine a rational training data size. Generally, the training data size should be neither too large nor too small. A too-large training set may lead to a long training time and slower training speed, particularly when the entire training set is presented to the network between weight updates, and may even lead to overfitting. When a training set is too small, the network cannot learn effectively; this leads to underfitting and weak generalization. To solve this problem, cross-validation, such as k-fold and leave-one-out, can be used.

3. Dilemma II: Training epochs and network generalization. Another dilemma concerns the number of training epochs or cycles and network generalization. Usually, if the number of training cycles increases, the training error rate should decrease. The error rate on test cases should begin to decrease and then eventually turn upward. This corresponds to the dynamic of underfitting and then overfitting. So far, there is no universal rule to determine the number of training epochs, other than trial and error with incremental algorithms.

A key issue of our study is, however, whether the integrated data preparation scheme is of value in neural network data analysis, given that data preparation is time-consuming. In the next section, we discuss this issue from a general viewpoint.

4 COST-BENEFIT ANALYSIS OF THE INTEGRATED DATA PREPARATION SCHEME

Data preparation requires extra time and effort; hence, the question of costs versus benefits arises (Problem III). Decision makers may not be willing to invest in data preparation if it has little impact on the final data analysis results of neural network models. This problem is analyzed from three aspects:

1. Total time saving for neural network modeling. Although additional time is needed to prepare data, the learning time of neural network data analysis may decrease sharply. As has been stated [71], every hour invested in data preparation may save many days in training a network. Therefore, our scheme will result in an overall time saving.

2. Model complexity reduction for neural network modeling. Usually, the model complexity of neural networks is expressed as the number of parameters, namely, the number of weights and the number of biases; that is, the complexity of a neural network model reduces to the number of adjustable parameters. Generally, the complexity of neural network models can be calculated as:

$$C(n) = n_h\left(n_i + 1\right) + n_o\left(n_h + 1\right), \qquad (7)$$

where $C(n)$ is the model complexity, $n_i$ is the number of input nodes, $n_h$ is the number of hidden neurons, and $n_o$ is the number of output nodes. From (7), we can see that the model complexity can be reduced by appropriate data preparation work, e.g., variable selection.

3. Performance improvement for complex data analysis. In practice, many applications [12], [13], [14], [15], [16], [17], [26], [27], [28], [29], [30], [31] have also revealed that data preparation minimizes error. To explain this further, a bias-variance-noise decomposition, achieved by extending the bias-variance decomposition originally proposed by [72], is used to analyze the performance improvement.

Considering a classification or prediction problem, the mean squared error is defined as the loss function of the neural network model. Assume that there is a true function, $y = f(x) + \varepsilon$, where $\varepsilon$ is normally distributed with zero mean and standard deviation $\sigma$. Given a set of training sets $D: \{(x_i, y_i)\}$, we fit the function $h(x) = w \cdot x$ to the data by minimizing the squared error $\sum_i \left[y_i - h(x_i)\right]^2$. Given a new data point $x^*$ with the observed value

$y^* = f(x^*) + \varepsilon$, the expected error $E[(y^* - h(x^*))^2]$ can be decomposed into bias, variance, and noise as follows:

$$\begin{aligned}
E\left[\left(h(x^*) - y^*\right)^2\right] &= E\left[h(x^*)^2 - 2\,h(x^*)\,y^* + (y^*)^2\right]\\
&= E\left[h(x^*)^2\right] - 2\,E\left[h(x^*)\right]E\left[y^*\right] + E\left[(y^*)^2\right]\\
&\qquad \left(\text{using } E\left[(Z - \bar{Z})^2\right] = E\left[Z^2\right] - \bar{Z}^2\right)\\
&= E\left[\left(h(x^*) - \bar{h}(x^*)\right)^2\right] + \bar{h}(x^*)^2 - 2\,\bar{h}(x^*)f(x^*) + E\left[\left(y^* - f(x^*)\right)^2\right] + f(x^*)^2\\
&= E\left[\left(h(x^*) - \bar{h}(x^*)\right)^2\right] + E\left[\left(y^* - f(x^*)\right)^2\right] + \left(\bar{h}(x^*) - f(x^*)\right)^2\\
&= \mathrm{Var}\left(h(x^*)\right) + E\left[\varepsilon^2\right] + \mathrm{Bias}^2\left(h(x^*)\right)\\
&= \mathrm{Var}\left(h(x^*)\right) + \sigma^2 + \mathrm{Bias}^2\left(h(x^*)\right), \qquad (8)
\end{aligned}$$

where $\bar{h}(x^*) = E[h(x^*)]$ denotes the average prediction over training sets.

As revealed in (8), we can improve the data analysis performance through three-fold data preparation work. First, noise reduction and filtering can alleviate the effects of noise, because the noise does not always follow the normal distribution. Next, data division, data validation, and data regrouping can effectively eliminate the effects of bias. Finally, every data preprocessing technique can reduce the effects of variance. However, this is only a theoretical discussion. The impact of data preparation on neural network data analysis is verified in the following.
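As a hedged numerical check of (8), the simulation below repeatedly refits a linear model on fresh training sets drawn from y = f(x) + ε and compares the expected squared error at a test point with the sum of variance, noise, and squared bias. The quadratic f and all sample sizes are illustrative assumptions, not part of the paper's setup.

```python
# Hedged numerical check of the decomposition in Eq. (8) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                         # assumed true function
sigma, n_train, n_rep, x_star = 0.3, 30, 2000, 0.8

preds = np.empty(n_rep)
for r in range(n_rep):
    x = rng.uniform(-1, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    w = np.sum(x * y) / np.sum(x * x)        # least-squares slope, no intercept
    preds[r] = w * x_star                    # h(x*) for this training set

y_star = f(x_star) + rng.normal(0, sigma, n_rep)
expected_error = np.mean((preds - y_star) ** 2)
decomposed = preds.var() + sigma ** 2 + (preds.mean() - f(x_star)) ** 2
print(expected_error, decomposed)            # the two quantities agree closely
```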
5 EMPIRICAL STUDY

In this section, we provide a typical example, "business financial risk classification," to show the application of the data preparation techniques and to explore the impact of data preparation on neural network data analysis. In this study, a back propagation neural network (BPNN), a widely used network type, is selected as an agent to test the impact of data preparation.

5.1 Experimental Design—Basics of Data Preparation and Experiment Settings

This application is related to the problem of evaluating corporate financial risk. Generally, corporate financial risk is roughly divided into four types: security, light-warning, heavy-warning, and crisis. The objective of financial risk classification is to evaluate a given corporation's financial condition. The data to be analyzed are a large amount of financial data from financial statements. Through objective analysis, we can use financial ratios to realize financial risk evaluation. A large number of firms from the Shanghai and Shenzhen Stock Exchanges in China from 1991 to 2003 were collected using open financial statements. From this large set, 100 firms meeting the criteria of 1) having been in business for more than 10 years and 2) having data available were selected. If the data collected in this phase are used directly in neural network learning and data analysis, we call the resulting model the "neural network data analysis model with simple data preparation." In this phase, we have 11 variables (total assets, net income, gross profit, net worth, long-term debt, current liabilities, inventories, current assets, net fixed assets, quick assets, working capital) for analysis. Through many experiments, a BPNN with an 11-23-4 architecture was chosen to classify the corporate financial risk. Because there is no systematic and normalized data set (i.e., database) for this specific classification task, almost all the data are scattered. Furthermore, many of the financial statements collected are paper manuscripts. Therefore, these data need to be integrated. As we collected 100 firms with 10 years of data each, even such simple data preparation is a time-consuming task.

At this stage of our proposed integrated data preparation scheme, the data collected are raw and cannot be used for direct learning. Therefore, data combination and integration are first used. Using the financial statements of the firms (i.e., balance sheets and income statements), 27 financial ratios (variables) were calculated: net income/gross profit, gross profit/total assets, net income/total assets, net income/net worth, net income/(long-term debt + current liabilities), inventories/total assets, inventories/current assets, current liabilities/(long-term debt + current liabilities), net fixed assets/total assets, current assets/current liabilities, quick assets/current liabilities, working capital/total assets, working capital/current assets, (long-term debt + current liabilities)/net worth, (long-term debt + current liabilities)/net fixed assets, net worth/(long-term debt + net worth), net income/working capital, current liabilities/inventories, current liabilities/net worth, net worth/net fixed assets, inventories/working capital, (long-term debt + current liabilities)/working capital, net worth/total assets, current liabilities/total assets, quick assets/total assets, working capital/net worth, and current assets/total assets. In this phase, these data can be modeled by neural networks. We call this the "neural network data analysis model with ordinary data preparation." By trial and error, a BPNN with a 27-51-4 structure is used for classification.

In terms of our proposed data preparation scheme, some other data preparation tasks are required at this stage. The first is variable selection. Here, the GA is used, resulting in the retention of 12 financial ratios from the 27 available ratios. The 12 are: net income/gross profit, gross profit/total assets, net income/total assets, net income/net worth, current assets/current liabilities, quick assets/current liabilities, (long-term debt + current liabilities)/total assets, net worth/(net worth + long-term debt), net worth/net fixed assets, inventories/working capital, current liabilities/total assets, and working capital/net worth. Similarly, with regard to missing data and the nonavailability of sales volumes, corresponding processing techniques are used to eliminate the effects of data anomalies. In addition, all samples are divided into three parts: training sets (60 firms) for learning, validation sets (25 firms) for reducing the fitting problem of network learning, and testing sets (15 firms) for testing the generalization of the network. According to the feedback of the neural network, we can adjust the data division so as to make learning effective. Here, we use the term "neural network data analysis model with integrated data preparation scheme." In this phase, we use a BPNN with a 12-25-4 architecture to classify the corporate financial risk, based upon the results of many experiments.

Specifically, the experimental platform is Windows, and the BPNN model is based on the Matlab neural network toolbox. Accordingly, a learning rate of 0.50, a momentum rate of 0.15, and random initial weights are chosen. The maximum number of learning epochs (cycles) is set at 10,000. A learning epoch means that the network goes through all the years of training data once. A logistic activation function for the hidden layer and a linear function for the output layer are selected. The stopping rule is that the MSE be less than 0.0002. In addition, the outputs (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), and (0, 0, 0, 1) represent the four financial conditions, i.e., security, light warning, heavy warning, and crisis, as mentioned earlier.
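Since the paper's experiments used the Matlab Neural Network Toolbox, the following scikit-learn configuration is only a rough, hedged equivalent of the stated settings (12-25-4 topology, learning rate 0.50, momentum 0.15, logistic hidden units, at most 10,000 epochs); MLPClassifier's softmax output and its own stopping rule differ from the linear output layer and the MSE < 0.0002 criterion described above.

```python
# Hedged approximation of the Section 5.1 training setup; it mirrors the
# spirit of the Matlab configuration, not the authors' exact implementation.
from sklearn.neural_network import MLPClassifier

bpnn = MLPClassifier(
    hidden_layer_sizes=(25,),      # 12 inputs -> 25 hidden -> 4 classes
    activation="logistic",         # sigmoid hidden layer, as in the paper
    solver="sgd",
    learning_rate_init=0.50,
    momentum=0.15,
    max_iter=10000,
    random_state=0,
)
# bpnn.fit(X_train, y_train) with the 12 selected ratios and 4 risk classes
```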

TABLE 1
Comparison of Preparation Time and Learning Time

TABLE 2
Comparison of Model Complexity

5.2 Experimental Results—The Impact of Data Preparation on Neural Network Data Analysis

In this section, we focus on analyzing the impact of data preparation on neural network data analysis from three perspectives: total time saving, model complexity reduction, and performance improvement. Based on the previous descriptions, neural network data analysis can be performed with three different data preparations. A comprehensive analysis is undertaken to evaluate the effects of data preparation on neural network data analysis. First, we compare the data preparation time and network learning time under different degrees of data preparation (see Table 1). It is worth noting that the calculation of data preparation time starts with data requirement analysis and ends with data readjustment.

From Table 1, we can see: 1) Although the integrated data preparation takes a little longer than simple and ordinary data preparation (cost invested), the overall learning time with our integrated data preparation scheme is much shorter (benefit obtained). 2) Usually, prepared data can speed up network learning, as improvements II and III show; if data are processed incompletely, the network learning time may be longer due to the increased complexity of the network, as improvement I indicates. 3) In general, the total time saved is about 10 hours for this classification task. We can save more time in extremely complex data analysis, such as analysis of complex satellite image data. This implies that the integrated data preparation scheme has a significant impact on neural network learning time.

Subsequently, we compare the model complexity among the three different data preparation schemes, as shown in Table 2. From Table 2, we find that the integrated data preparation scheme can effectively reduce the model complexity, as improvement II shows, relative to ordinary data preparation. The main reason for this is that ordinary data preparation cannot process the data completely, resulting in an increase of model complexity.

Table 3 shows the performance improvement via classification accuracy, with some meaningful results:

1. The classification accuracy of the neural network model with the integrated data preparation scheme is much greater than those of the neural network models with simple and ordinary data preparation, in terms of the training set, validation set, and testing set. The main reason is that redundant information is reduced or eliminated under the proposed scheme.

2. The classification accuracy on the testing set is not as high as on the training and validation sets, as shown in lines 1-3 of Table 3. A possible reason is that classifying the unknown mode is difficult because of the uncertainty of future events.
228 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 2, FEBRUARY 2006

TABLE 3
Comparison of Classification Accuracy

3. Unlike in Table 1, the neural network with ordinary data preparation obtains better classification results than the neural network with simple data preparation, although the former increases the learning time.
4. As expected, the performance improvement with the integrated data preparation scheme is large relative to simple data preparation.

For example, the improvement on the testing set is over 40 percent (86.67 percent - 40.00 percent = 46.67 percent). That is, although it takes additional effort to process the data, the benefit obtained is very large. This implies that the additional effort of performing data preparation in practical data analysis applications is worthwhile.
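As a worked illustration of how such accuracy figures are obtained, the snippet below (reusing the sigmoid, train_bpnn, and data-split sketches given earlier; the helper names are ours, not the paper's) encodes the four conditions as one-hot targets and scores each sample set.

import numpy as np

def one_hot(y, k=4):
    """Encode labels as (1,0,0,0)-style target vectors."""
    Y = np.zeros((y.size, k))
    Y[np.arange(y.size), y] = 1.0
    return Y

def accuracy(W1, W2, X, y):
    """Fraction of firms whose largest network output matches the label."""
    pred = np.argmax(sigmoid(X @ W1) @ W2, axis=1)
    return float((pred == y).mean())

W1, W2 = train_bpnn(X_train, one_hot(y_train))
for name, (Xs, ys) in {"training": (X_train, y_train),
                       "validation": (X_valid, y_valid),
                       "testing": (X_test, y_test)}.items():
    print(name, f"{100.0 * accuracy(W1, W2, Xs, ys):.2f}%")

On the synthetic stand-in data these figures are meaningless; with properly prepared real inputs they are the analogues of the percentages reported in Table 3.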
To summarize, the benefits of data preparation are threefold: 1) decreased running time of neural network modeling, 2) reduced complexity of neural network models, and 3) improved performance of neural network data analysis. These findings imply that data preparation has a significant effect on neural network data analysis. In addition, the empirical results demonstrate the effectiveness and efficiency of the proposed scheme. Therefore, the proposed data preparation scheme is worth generalizing.

6 CONCLUSIONS AND FUTURE DIRECTIONS

In this study, a comprehensive data preparation scheme with three-phase data processing is proposed to obtain better performance for specific neural network data analysis tasks. The novel integrated data preparation scheme proposed in this paper enhances the neural network learning process, reduces the complexity of neural network models, and is of immense help for complex data analysis. Through empirical investigation, several important conclusions were obtained: 1) The integrated data preparation scheme can significantly speed up data analysis, reduce model complexity, and improve the performance of data analysis tasks. 2) The scheme is necessary and beneficial in data preparation for neural network data analysis; as it has proven very effective in improving the performance of neural network data analysis, this leads to the final conclusion. 3) The proposed integrated data preparation scheme can be used as a promising solution to improve the performance of neural network data analysis and is worth generalizing in the future.

To summarize, the main contributions of this study are the following four. The first contribution is to propose an integrated data preparation scheme with three-phase data processing for neural network data analysis; although some related studies about data preparation appear in the literature, a systematic study of data preparation has not been formulated so far. The second contribution is to present an overview of data preparation: for neural network data analysis, we have discussed a number of data preparation techniques in three phases and, accordingly, provided some practical solutions to several dilemmas. Although many of these techniques are shown in the previous literature, such as the work of Fayyad et al. [73], there are some distinct differences between our work and the previous work. First of all, the previous work, such as [21], [73], focused only on data preprocessing, while our work is broader and covers all data processing, including data preanalysis and data postanalysis in addition to data preprocessing. Second, their work does not present a systematic study of data preparation; their treatments of data preparation are scattered and nonsystematic. Comparatively speaking, our proposed integrated scheme presents a more comprehensive study of data preparation. The third contribution of our integrated data preparation scheme is to present a comprehensive survey of neural network data preparation, which differs from others; in addition, some new data preparation techniques, such as GA for variable selection, are suggested for neural network data preparation (a sketch follows the list below). The final contribution is to provide a full cost-benefit analysis framework for the integrated data preparation scheme. These contributions fill the gap left by previous studies. However, there are still some important issues to be considered in future research on data preparation for complex data analysis:

1. The scope of this study is limited to neural network data analysis. Future research should extend it to more data analysis models, such as data mining and knowledge discovery models.
2. The study only presents some limited data preparation techniques in the integrated data preparation
scheme. More new data preparation techniques in all steps of the integrated data preparation scheme are worth exploring.
3. To perform meaningful data preparation, either the domain expert should be a member of the data analysis team or the domain should be studied extensively before the data are preprocessed. The involvement of the domain expert would provide useful feedback for verifying and validating the use of particular data preparation techniques. Thus, the integration of expert opinions into a neural network data preparation framework is an important issue.
4. A module for analyzing the effects of data preparation should be added to neural network software packages so that users working in other domains can more easily understand the impact of data preparation techniques on their work.
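To make the GA-based variable selection mentioned among the contributions concrete, here is a toy Python sketch. The bit-mask encoding, the fitness proxy, and all rates are illustrative assumptions; in the paper's setting, the fitness of a candidate variable subset would more naturally be the validation-set accuracy of a network trained on that subset.

import numpy as np

rng = np.random.default_rng(1)

def fitness(mask, X, y):
    """Stand-in score that rewards between-class separation per selected
    variable; a real study would score a trained network instead."""
    if not mask.any():
        return -np.inf
    Xs = X[:, mask]
    centroids = np.array([Xs[y == c].mean(axis=0) for c in np.unique(y)])
    return centroids.std(axis=0).sum() / mask.sum()

def ga_select(X, y, pop=20, gens=50, p_mut=0.05):
    """Evolve boolean masks over the variables; return the best mask."""
    d = X.shape[1]
    population = rng.random((pop, d)) < 0.5             # random initial masks
    for _ in range(gens):
        scores = np.array([fitness(m, X, y) for m in population])
        parents = population[np.argsort(scores)[::-1][: pop // 2]]
        cut = rng.integers(1, d, size=pop // 2)         # one-point crossover
        children = np.array([np.concatenate((parents[i][:cut[i]],
                                             parents[(i + 1) % len(parents)][cut[i]:]))
                             for i in range(len(parents))])
        children ^= rng.random(children.shape) < p_mut  # bit-flip mutation
        population = np.vstack((parents, children))
    scores = np.array([fitness(m, X, y) for m in population])
    return population[np.argmax(scores)]

mask = ga_select(X_train, y_train)  # reuses the split sketched earlier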
ACKNOWLEDGMENTS

The authors would like to thank the guest editors and the three anonymous reviewers for their valuable comments and suggestions, which helped to improve the quality of the paper immensely. This work is partially supported by the NSFC, the CAS, and the SRG of the City University of Hong Kong.
REFERENCES

[1] X. Hu, “DB-H Reduction: A Data Preprocessing Algorithm for Data Mining Applications,” Applied Math. Letters, vol. 16, pp. 889-895, 2003.
[2] K.U. Sattler and E. Schallehn, “A Data Preparation Framework Based on a Multidatabase Language,” Proc. Int’l Symp. Database Eng. & Applications, pp. 219-228, 2001.
[3] M. Lou, “Preprocessing Data for Neural Networks,” Technical Analysis of Stocks & Commodities Magazine, Oct. 1993.
[4] D. Pyle, Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[5] M.W. Gardner and S.R. Dorling, “Artificial Neural Networks (the Multilayer Perceptron)—A Review of Applications in the Atmospheric Sciences,” Atmospheric Environment, vol. 32, pp. 2627-2636, 1998.
[6] M.Y. Rafiq, G. Bugmann, and D.J. Easterbrook, “Neural Network Design for Engineering Applications,” Computers & Structures, vol. 79, pp. 1541-1552, 2001.
[7] K.A. Krycha and U. Wagner, “Applications of Artificial Neural Networks in Management Science: A Survey,” J. Retailing and Consumer Services, vol. 6, pp. 185-203, 1999.
[8] K.J. Hunt, D. Sbarbaro, R. Zbikowski, and P.J. Gawthrop, “Neural Networks for Control Systems—A Survey,” Automatica, vol. 28, pp. 1083-1112, 1992.
[9] D.E. Rumelhart, “The Basic Ideas in Neural Networks,” Comm. ACM, vol. 37, pp. 87-92, 1994.
[10] K.S. Narendra and K. Parthasarathy, “Identification and Control of Dynamic Systems Using Neural Networks,” IEEE Trans. Neural Networks, vol. 1, pp. 4-27, 1990.
[11] M.R. Azimi-Sadjadi and S.A. Stricker, “Detection and Classification of Buried Dielectric Anomalies Using Neural Networks—Further Results,” IEEE Trans. Instrumentation and Measurement, vol. 43, pp. 34-39, 1994.
[12] A. Beltratti, S. Margarita, and P. Terna, Neural Networks for Economic and Financial Modeling. London: Int’l Thomson Publishing Inc., 1996.
[13] Y. Senol and M.P. Gouch, “The Application of Transputers to a Sounding Rocket Instrumentation: On-Board Autocorrelators with Neural Network Data Analysis,” Parallel Computing and Transputer Applications, pp. 798-806, 1992.
[14] E.J. Gately, Neural Networks for Financial Forecasting. New York: John Wiley & Sons, Inc., 1996.
[15] A.N. Refenes, Y. Abu-Mostafa, J. Moody, and A. Weigend, Neural Networks in Financial Engineering. World Scientific Publishing Company, 1996.
[16] K.A. Smith and J.N.D. Gupta, Neural Networks in Business: Techniques and Applications. Hershey, Pa.: Idea Group Publishing, 2002.
[17] G.P. Zhang, Neural Networks in Business Forecasting. IRM Press, 2004.
[18] B.D. Klein and D.F. Rossin, “Data Quality in Neural Network Models: Effect of Error Rate and Magnitude of Error on Predictive Accuracy,” OMEGA, The Int’l J. Management Science, vol. 27, pp. 569-582, 1999.
[19] T.C. Redman, Data Quality: Management and Technology. New York: Bantam Books, 1992.
[20] T.C. Redman, Data Quality for the Information Age. Norwood, Mass.: Artech House, Inc., 1996.
[21] S. Zhang, C. Zhang, and Q. Yang, “Data Preparation for Data Mining,” Applied Artificial Intelligence, vol. 17, pp. 375-381, 2003.
[22] A. Famili, W. Shen, R. Weber, and E. Simoudis, “Data Preprocessing and Intelligent Data Analysis,” Intelligent Data Analysis, vol. 1, pp. 3-23, 1997.
[23] R. Stein, “Selecting Data for Neural Networks,” AI Expert, vol. 8, no. 2, pp. 42-47, 1993.
[24] R. Stein, “Preprocessing Data for Neural Networks,” AI Expert, vol. 8, no. 3, pp. 32-37, 1993.
[25] A.D. McAulay and J. Li, “Wavelet Data Compression for Neural Network Preprocessing,” Signal Processing, Sensor Fusion, and Target Recognition, vol. 1699, pp. 356-365, SPIE, 1992.
[26] V. Nedeljkovic and M. Milosavljevic, “On the Influence of the Training Set Data Preprocessing on Neural Networks Training,” Proc. 11th IAPR Int’l Conf. Pattern Recognition, pp. 1041-1045, 1992.
[27] J. Sjoberg, “Regularization as a Substitute for Preprocessing of Data in Neural Network Training,” Artificial Intelligence in Real-Time Control, pp. 31-35, 1992.
[28] O.E. De Noord, “The Influence of Data Preprocessing on the Robustness and Parsimony of Multivariate Calibration Models,” Chemometrics and Intelligent Laboratory Systems, vol. 23, pp. 65-70, 1994.
[29] J. DeWitt, “Adaptive Filtering Network for Associative Memory Data Preprocessing,” Proc. World Congress Neural Networks, vol. IV, pp. 34-38, 1994.
[30] D. Joo, D. Choi, and H. Park, “The Effects of Data Preprocessing in the Determination of Coagulant Dosing Rate,” Water Research, vol. 34, pp. 3295-3302, 2000.
[31] H.H. Nguyen and C.W. Chan, “A Comparison of Data Preprocessing Strategies for Neural Network Modeling of Oil Production Prediction,” Proc. Third IEEE Int’l Conf. Cognitive Informatics, 2004.
[32] J. Pickett, The American Heritage Dictionary, fourth ed. Boston: Houghton Mifflin, 2000.
[33] P. Ingwersen, Information Retrieval Interaction. London: Taylor Graham, 1992.
[34] U.Y. Nahm, “Text Mining with Information Extraction: Mining Prediction Rules from Unstructured Text,” PhD thesis, 2001.
[35] F. Lemke and J.A. Muller, “Self-Organizing Data Mining,” Systems Analysis Modelling Simulation, vol. 43, pp. 231-240, 2003.
[36] E. Tuv and G. Runger, “Preprocessing of High-Dimensional Categorical Predictors in Classification Setting,” Applied Artificial Intelligence, vol. 17, pp. 419-429, 2003.
[37] C.W.J. Granger, “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods,” Econometrica, vol. 37, pp. 424-438, 1969.
[38] K.I. Diamantaras and S.Y. Kung, Principal Component Neural Networks: Theory and Applications. John Wiley and Sons, Inc., 1996.
[39] D.W. Ashley and A. Allegrucci, “A Spreadsheet Method for Interactive Stepwise Multiple Regression,” Proceedings, pp. 594-596, Western Decision Sciences Inst., 1999.
[40] X. Yan, C. Zhang, and S. Zhang, “Toward Databases Mining: Preprocessing Collected Data,” Applied Artificial Intelligence, vol. 17, pp. 545-561, 2003.
[41] S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology,” SIGMOD Record, vol. 26, pp. 65-74, 1997.
[42] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Simeon, and S. Zohar, “Tools for Translation and Integration,” IEEE Data Eng. Bull., vol. 22, pp. 3-8, 1999.
[43] A. Baumgarten, “Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval,” Proc. SIGIR ’99, pp. 246-253, 1999.
[44] Y. Li, C. Zhang, and S. Zhang, “Cooperative Strategy for Web Data Mining and Cleaning,” Applied Artificial Intelligence, vol. 17, pp. 443-460, 2003.
[45] J.H. Holland, “Genetic Algorithms,” Scientific Am., vol. 267, pp. 66-72, 1992.
[46] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, Mass.: Addison-Wesley, 1989.
[47] M.A. Kupinski and M.L. Giger, “Feature Selection with Limited Datasets,” Medical Physics, vol. 26, pp. 2176-2182, 1999.
[48] I. Mani and E. Bloedorn, “Multidocument Summarization by Graph Search and Matching,” Proc. 14th Nat’l Conf. Artificial Intelligence, pp. 622-628, 1997.
[49] M. Saravanan, P.C. Reghu Raj, and S. Raman, “Summarization and Categorization of Text Data in High-Level Data Cleaning for Information Retrieval,” Applied Artificial Intelligence, vol. 17, pp. 461-474, 2003.
[50] W.A. Shewhart, Economic Control of Quality of Manufactured Product. New York: D. Van Nostrand, 1931.
[51] D.A. Dickey and W.A. Fuller, “Distribution of the Estimators for Autoregressive Time Series with a Unit Root,” J. Am. Statistical Assoc., vol. 74, pp. 427-431, 1979.
[52] J. Wang, C. Zhang, X. Wu, H. Qi, and J. Wang, “SVM-OD: A New SVM Algorithm for Outlier Detection,” Proc. ICDM ’03 Workshop Foundations and New Directions of Data Mining, pp. 203-209, 2003.
[53] J. Han and Y. Fu, “Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases,” Proc. AAAI ’94 Workshop Knowledge Discovery in Databases, pp. 157-168, 1994.
[54] U. Fayyad and K. Irani, “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning,” Proc. 13th Int’l Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.
[55] A. Srinivasan, S. Muggleton, and M. Bain, “Distinguishing Exceptions from Noise in Nonmonotonic Learning,” Proc. Second Int’l Workshop Inductive Logic Programming, 1992.
[56] G.H. John, “Robust Decision Trees: Removing Outliers from Data,” Proc. First Int’l Conf. Knowledge Discovery and Data Mining, pp. 174-179, 1995.
[57] D. Gamberger, N. Lavrac, and S. Dzeroski, “Noise Detection and Elimination in Data Preprocessing: Experiments in Medical Domains,” Applied Artificial Intelligence, vol. 14, pp. 205-223, 2000.
[58] G.E. Batista and M.C. Monard, “Experimental Comparison of K-Nearest Neighbor and Mean or Mode Imputation Methods with the Internal Strategies Used by C4.5 and CN2 to Treat Missing Data,” Technical Report 186, ICMC USP, 2003.
[59] G.E. Batista and M.C. Monard, “An Analysis of Four Missing Data Treatment Methods for Supervised Learning,” Applied Artificial Intelligence, vol. 17, pp. 519-533, 2003.
[60] R.J.A. Little and D.B. Rubin, Statistical Analysis with Missing Data. New York: John Wiley and Sons, 1987.
[61] A. Ragel and B. Cremilleux, “Treatment of Missing Values for Association Rules,” Proc. Second Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 258-270, 1998.
[62] R.C.T. Lee, J.R. Slagle, and C.T. Mong, “Application of Clustering to Estimate Missing Data and Improve Data Integrity,” Proc. Int’l Conf. Software Eng., pp. 539-544, 1976.
[63] S.M. Tseng, K.H. Wang, and C.I. Lee, “A Preprocessing Method to Deal with Missing Values by Integrating Clustering and Regression Techniques,” Applied Artificial Intelligence, vol. 17, pp. 535-544, 2003.
[64] A.S. Weigend and N.A. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1994.
[65] F.M. Tseng, H.C. Yu, and G.H. Tzeng, “Combining Neural Network Model with Seasonal Time Series ARIMA Model,” Technological Forecasting and Social Change, vol. 69, pp. 71-87, 2002.
[66] J. Moody, “Economic Forecasting: Challenges and Neural Network Solutions,” Proc. Int’l Symp. Artificial Neural Networks, 1995.
[67] J.T. Yao and C.L. Tan, “A Case Study on Using Neural Networks to Perform Technical Forecasting of Forex,” Neurocomputing, vol. 34, pp. 79-98, 2000.
[68] K. Hornik, M. Stinchcombe, and H. White, “Multilayer Feedforward Networks Are Universal Approximators,” Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[69] A. Esposito, M. Marinaro, D. Oricchio, and S. Scarpetta, “Approximation of Continuous and Discontinuous Mappings by a Growing Neural RBF-Based Algorithm,” Neural Networks, vol. 13, pp. 651-665, 2000.
[70] H. Martens and T. Naes, Multivariate Calibration. New York: John Wiley & Sons Inc., 1989.
[71] R. Rojas, Neural Networks: A Systematic Introduction. Berlin: Springer-Verlag, 1996.
[72] S. Geman, E. Bienenstock, and R. Doursat, “Neural Networks and the Bias/Variance Dilemma,” Neural Computation, vol. 4, pp. 1-58, 1992.
[73] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. Menlo Park, Calif.: AAAI Press, 1996.

Lean Yu received the PhD degree in management sciences and engineering from the Institute of Systems Science, Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences. He is currently a research fellow in the Department of Management Sciences at the City University of Hong Kong. His research interests include artificial neural networks, computer simulation, decision support systems, and financial forecasting.

Shouyang Wang received the PhD degree in operations research from the Institute of Systems Science, Chinese Academy of Sciences (CAS), Beijing, in 1986. He is currently a Bairen Distinguished Professor of Management Science in the Academy of Mathematics and Systems Sciences at CAS and a Lotus Chair Professor at Hunan University, Changsha. He is the editor-in-chief or a coeditor of 12 journals. He has published 18 books and more than 120 journal papers. His current research interests include financial engineering, e-auctions, and decision support systems.

K.K. Lai received the PhD degree from Michigan State University. He is the Chair Professor of Management Science at City University of Hong Kong and is also the associate dean of the Faculty of Business. Currently, he is also acting as the dean of the College of Business Administration at Hunan University, China. Prior to his current post, he was a senior operational research analyst at Cathay Pacific Airways and the area manager of marketing information systems at Union Carbide Eastern. Professor Lai’s main research interests include logistics and operations management, computer simulation, AI, and business decision modeling.