
Expert Systems with Applications 38 (2011) 5384–5387

Contents lists available at ScienceDirect

Expert Systems with Applications


journal homepage: www.elsevier.com/locate/eswa

Exploring the risk factors of preterm birth using data mining


Hsiang-Yang Chen a,1, Chao-Hua Chuang b,*,1, Yao-Jung Yang a, Tung-Pi Wu c

a Department of Applied Information, Hsing Kuo University of Management, Tainan, Taiwan
b Department of Nursing, Chang Jung Christian University, Tainan County, Taiwan
c Department of Obstetrics and Gynecology, SinLau Hospital, Tainan, Taiwan

Article info

Keywords:
Preterm birth
Data mining
Neural network
Decision tree

Abstract
Preterm birth is the leading cause of perinatal morbidity and mortality, but its precise mechanism is still unknown. Hence, the goal of this study is to explore the risk factors of preterm birth using data mining with a neural network and the C5.0 decision tree. The original medical data were collected from a prospective pregnancy cohort by a professional research group at National Taiwan University. Using a nested case-control study design, a total of 910 mother-child dyads were selected from the 14,551 in the original data. Thousands of variables were examined, including basic characteristics, medical history, environmental and occupational factors of the parents, and variables related to the infants. The results indicate that multiple birth, hemorrhage during pregnancy, age, disease, previous preterm history, body weight before pregnancy and height of the pregnant woman, and paternal lifestyle risk factors related to drinking and smoking are the important risk factors of preterm birth. The findings of our study will be useful for parents, medical staff, and public health workers in attempting to detect high-risk pregnant women and provide early intervention to reduce and prevent preterm birth.
© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Preterm birth, the birth of an infant prior to 37 completed weeks of gestation, is the leading cause of perinatal morbidity and mortality (Goldenberg, Culhane, Dlams, & Romero, 2008; McCormick, 1985). The prevalence of such births is about 12–13% in the USA, and 5–9% in Europe, other developed countries and Taiwan (Chuang, Chang, Hsieh, et al., 2007; MacDorman, Martin, Mathews, Hoyert, & Ventura, 2005; Slattery & Morrison, 2002). The reasons for preterm birth remain unclear, although data mining is a promising approach for exploring potential factors in large amounts of data (Chang, 2007; Chen, Hou, Chuang, & TBPS Research Group, 2009; Courtney, Stewart, Popescu, & Goodwin, 2008; Liao, Hsieh, & Huang, 2008). Hence, the purpose of this work, based on a nested case-control study design, is to explore the risk factors of preterm birth using a neural network and a decision tree, and so to uncover additional potential information.

2. Literature review

2.1. Preterm birth


Preterm birth is the birth of an infant before 37 completed weeks of gestation; it accounts for 75% of perinatal mortality and half of long-term morbidity (McCormick, 1985). Studies show that maternal race, age, weight, income, previous preterm history, weight gain, infection, stress during pregnancy and other immunologically mediated processes are risk factors for such births (Goldenberg et al., 2008; Moore, 2003; Romero et al., 2006). However, despite the identification of such factors, a precise mechanism cannot be established in most cases, and only about half of women who experience preterm birth have an identifiable risk factor (Moore, 2003). Consequently, other factors that may be associated with preterm birth are currently being explored by data mining (Courtney et al., 2008).

* Corresponding author.
E-mail addresses: i14248@mail.hku.edu.tw (H.Y. Chen), chchuang@mail.cjcu.edu.tw (C.H. Chuang).
1 These authors contributed equally to the work.
0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.10.017

2.2. Data mining

Data mining is the process of extracting hidden patterns from a huge amount of data (Kantardzic, 2003). It is commonly used in a wide range of profiling practices, such as marketing, fraud detection, performance, scientific discovery and medicine (Chen et al., 2009; Huang, Chang, & Wu, 2009; Lin, Shiue, Chen, & Cheng, 2009; Çakır, Çalış, & Küçüksille, 2009). Preterm birth is now thought to be a syndrome initiated by multiple mechanisms, most of which still cannot be established. Thus, this work uses data mining to uncover potentially related factors; our methodology is shown in Fig. 1. First, we use a neural network to find the top 15 impact factors among the thousands of variables in our database. Then, we use the C5.0 decision tree to classify these factors by weight. The results of our study can provide information that helps medical staff and pregnant women to reduce the incidence of preterm birth.
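
As a rough sketch of the first stage of this methodology, the snippet below ranks variables by an importance score and keeps the top k. The helper `top_k_factors`, the variable names and the scores are hypothetical and chosen only for illustration; the study obtained its importance values from its neural network software, not from this code.

```python
def top_k_factors(importance, k):
    """Return the k variable names with the largest importance scores."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores (not taken from the study's data).
toy_importance = {
    "var_a": 0.31,
    "var_b": 0.02,
    "var_c": 0.12,
    "var_d": 0.07,
}

selected = top_k_factors(toy_importance, k=2)
print(selected)  # ['var_a', 'var_c']
```

Only the selected variables would then be passed on to the second stage, the decision tree.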
2.2.1. Neural network
A neural network model is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information (Liu, Yuan, & Liao, 2009; Rajan, Ramalingam, Ganesan, Palanivel, & Palaniappan, 2009). A neural network is composed of artificial neurons, or nodes. A single neuron may be connected to many other neurons, and the total number of neurons and connections in a network may be extensive. Such a neural network can be used to make predictions. For example, a credit card company may use a neural network to quickly identify transactions that have a high probability of being fraudulent. In this work, we use a neural network to find the top 15 impact factors among the thousands of variables in our database. These factors are then analyzed by our second strategy, a decision tree, as explained below.
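
To make the notion of an artificial neuron concrete, the following minimal sketch computes the output of a single neuron as a weighted sum of its inputs passed through a sigmoid activation. The paper does not state the network architecture or activation function that was used, so the sigmoid, the function name `neuron_output`, and all inputs and weights here are assumptions for illustration only.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs and weights, chosen only for illustration.
inputs = [1.0, 0.0, 1.0]
weights = [0.8, -0.4, 0.3]
bias = -0.5

p = neuron_output(inputs, weights, bias)
print(round(p, 3))  # 0.646
```

A full network chains many such units in layers; training adjusts the weights so that the final output matches the known outcomes as closely as possible.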

2.2.2. Decision tree

A decision tree is a predictive model used in classification, clustering, and prediction (Duman, Erdamar, Erogul, Telatar, & Yetkin, 2009; Wu, Lee, Huang, Liu, & Horng, 2009). It is a diagram that uses a tree-like graph or model of decisions and their possible consequences as a visual and analytical decision support tool. The expected values of competing alternatives are calculated at each node, so a decision tree is a mapping from observations of an item to conclusions about its target value. Commonly used decision tree algorithms include ID3 (Iterative Dichotomiser 3), C4.5, C5.0, CART (Classification and Regression Tree), CHAID (Chi-squared Automatic Interaction Detection) and QUEST (Quick, Unbiased and Efficient Statistical Tree).
ID3 and its successors were developed by Ross Quinlan, who devised ID3 in the 1970s (Quinlan, 1986). ID3 is a heuristic method for building a decision tree, which it generates by a top-down, greedy search through the training data set at each of its tree nodes, seeking the attribute that best separates the instances. ID3 later evolved into C4.5 (Quinlan, 1993), an important advance with regard to the splitting rule and the calculation method. C5.0 is a commercial version of C4.5, and is available in closed-source products such as Clementine and RuleQuest (Han & Kamber, 2007). C5.0 improves the rule generation of C4.5, and can obtain similar results with considerably smaller decision trees (Quinlan, 1997). Other decision tree methods include CART (Breiman, Friedman, Olshen, & Stone, 1984) and CHAID (Loh & Shih, 1997), which provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating two-way splits, while CHAID creates multi-way splits. CART typically requires less data preparation than CHAID. QUEST, another type of decision tree, is similar to the CART algorithm, but is designed to reduce the processing time required for large CART analyses (Agrawal, Mehta, Shafer, & Srikant, 1996).
A decision tree is exploratory in nature, identifying clusters or segments of interest. We thus use a decision tree to analyze the 15 most important impact factors for preterm birth. We used the C5.0 algorithm, which obtained considerably more results than the other decision tree methods.

3. Empirical study

3.1. Research structure


Clementine 10.0 is a commercial data mining tool (SPSS, 2005) that supports various analyses, such as neural networks, decision trees, regression, logistic regression, and so on. To analyze the nominal variables, we used the neural network and C5.0 components of Clementine 10.0 to mine the data.
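
C5.0 is closed source, so its internals cannot be reproduced here. As a hedged illustration of the kind of split selection used by the ID3 family described in Section 2.2.2, the sketch below computes the information gain of each attribute, the criterion used by ID3, on a tiny made-up data set. The attribute names, records and labels are hypothetical, and C4.5 and C5.0 use refinements of this criterion that are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records (not the study's data).
rows = [
    {"multiple_birth": "yes", "paternal_smoking": "yes"},
    {"multiple_birth": "yes", "paternal_smoking": "no"},
    {"multiple_birth": "no", "paternal_smoking": "yes"},
    {"multiple_birth": "no", "paternal_smoking": "no"},
]
labels = ["preterm", "preterm", "full_term", "full_term"]

gains = {a: information_gain(rows, labels, a) for a in rows[0]}
best = max(gains, key=gains.get)
print(best)  # multiple_birth
```

A greedy tree builder would split on the attribute with the highest gain and then repeat the same calculation within each resulting subset.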
The research procedures were as follows:
1. Problem definition: Preterm birth is one of the leading causes of disease and death among newborns. In addition, preterm infants often suffer long-term health problems, including lung diseases, vision and hearing impairments, and learning disabilities. However, the mechanism that causes preterm birth remains unclear. Hence, it is very important to investigate potential risk factors and offer early intervention when necessary.
2. Data collection: The medical data were collected from a prospective pregnancy cohort, which was established between 1984 and 1987. Pregnant women at 26 or more weeks of gestation who came to one hospital in Taiwan for prenatal care were enrolled in the study and interviewed using a structured questionnaire. Many factors were measured in the questionnaire, including the characteristics of the pregnant women and their husbands (age, education, occupation, body height and weight), lifestyle, family income, past and present medical information, and living environment. After the participating women gave birth, the birth weight, gestational duration and characteristics of the live-born infants were gathered from the Taiwan National Birth Register. A total of 14,551 pairs were recruited into the database (Chuang et al., 2006).
3. Data organization: Incomplete data were deleted and, following the nested case-control study method, 910 mother-infant pairs were selected from the 14,551 in the original set.
4. Analysis: A neural network was used to mine the 15 most important factors related to preterm birth. A C5.0 decision tree was then used to classify the risk factors, so that high-risk groups for preterm birth could be detected.

Fig. 1. The process of data mining to explore risk factors of preterm birth. [Figure not reproduced; its panels are labeled Strategy 1 and Strategy 2.]

Table 1
Relative importance of inputs found by neural network.

Number | Factor | Coefficient
1 | Number of births | 0.3597
2 | Paternal smoking | 0.1086
3 | Hemorrhage during pregnancy | 0.1077
4 | Parity | 0.0899
5 | Maternal age | 0.0659
6 | Paternal occupation | 0.0636
7 | Maternal hypertension | 0.0620
8 | Medicines taken during pregnancy | 0.0600
9 | Maternal gynecological diseases | 0.0599
10 | Maternal body height | 0.0567
11 | Maternal body weight before pregnancy | 0.0500
12 | Paternal age | 0.0499
13 | Paternal drinking | 0.0492
14 | Previous preterm birth | 0.0415
15 | Vitamins taken during pregnancy | 0.0350

Table 2
Decision tree on predictors of preterm birth.

Number | Rule | Accuracy (%)
1 | Multiple birth | 100.0
2 | Single birth, and hemorrhage during pregnancy | 100.0
3 | Single birth, no hemorrhage during pregnancy, previous preterm birth, and paternal drinking | 100.0
4 | Single birth, no hemorrhage during pregnancy, previous preterm birth, no paternal drinking, no maternal gynecological diseases, paternal smoking, maternal body weight before pregnancy is 46–50 kg, and medicines used during pregnancy | 100.0
5 | Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height > 160 cm, and paternal age <= 19 | 100.0
6 | Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height > 160 cm, paternal age > 19, and maternal gynecological diseases | 100.0
7 | Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height is 156–160 cm, medicines used by pregnant women, and paternal occupation is neither manual nor non-manual | 100.0
8 | Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age >= 35, paternal smoking | 92.4
9 | Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height is 156–160 cm, medicines used by pregnant women, and paternal occupation is non-manual | 83.3
10 | Single birth, no hemorrhage during pregnancy, previous preterm birth, paternal drinking, maternal gynecological diseases, paternal smoking, and maternal body weight before pregnancy over 55 kg | 80.0
3.2. Data preparation
There were a total of 455 preterm births in this medical data set, and thus a substantial imbalance between preterm (455) and full-term infants (14,096) in the original data. Therefore, following the 1:1 principle for cases (preterm birth) and controls (full-term birth), we randomly sampled 455 full-term babies from the original medical data. A total of 910 pairs (mothers and their infants) were finally used in our analysis.
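
The 1:1 sampling step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' procedure: the records are synthetic stand-ins that only reproduce the counts given above (455 cases among 14,551 pairs), and the function name `sample_case_control`, the field names and the fixed random seed are hypothetical.

```python
import random


def sample_case_control(records, is_case, seed=0):
    """Keep every case and draw an equal number of controls at random."""
    cases = [r for r in records if is_case(r)]
    controls = [r for r in records if not is_case(r)]
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sampled_controls = rng.sample(controls, len(cases))
    return cases + sampled_controls


# Synthetic stand-in records: 455 preterm and 14,096 full-term births.
records = [{"id": i, "preterm": i < 455} for i in range(14551)]

subset = sample_case_control(records, lambda r: r["preterm"])
print(len(subset))  # 910
```

Sampling without replacement, as `random.Random.sample` does, guarantees that no control appears twice in the analysis set.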
4. Research results
The relative importance of inputs, as derived by the neural network, is shown in Table 1. Because a large number of variables were measured in the current medical data, we decided to explore them in two stages. First, a neural network was used to investigate the risk factors of preterm birth, and the top 15 factors, with coefficients larger than 0.0300, were then used in the next stage of our study. These factors were as follows: number of births, paternal smoking, hemorrhage during pregnancy, parity, maternal age, paternal occupation, maternal hypertension, medicines taken during pregnancy, maternal gynecological diseases, maternal body height, maternal body weight before pregnancy, paternal age, paternal drinking, previous preterm birth, and vitamins taken during pregnancy.
We then used these 15 factors for the decision tree analysis. Through the construction of a decision tree, 17 rules were derived to predict preterm birth. Ten of these rules, with an accuracy of 80% or more, are listed in Table 2: rules 1 to 7 reach 100.0%, and rules 8, 9 and 10 reach 92.4%, 83.3% and 80.0%, respectively. A multiple birth was the highest risk factor, with hemorrhage during pregnancy the second most important. This is consistent with medical evidence. The specific risk factors related to pregnant women were previous preterm birth, diseases, body height, and weight before pregnancy, while the paternal risk factors were smoking, drinking, age, and occupation.
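
To show how such rules might be applied to a new record, the sketch below encodes the first three rules of Table 2 as a chain of checks. The function name and the field names are hypothetical, the remaining rules are omitted for brevity, and the code only illustrates the rule structure; it is not the model produced by Clementine.

```python
def first_matching_rule(record):
    """Return the number of the first matching rule (1-3 of Table 2), or None."""
    if record["multiple_birth"]:
        return 1
    # From here on the record is a single birth.
    if record["hemorrhage_during_pregnancy"]:
        return 2
    # Single birth with no hemorrhage during pregnancy.
    if record["previous_preterm_birth"] and record["paternal_drinking"]:
        return 3
    return None


# Hypothetical record, for illustration only.
example = {
    "multiple_birth": False,
    "hemorrhage_during_pregnancy": False,
    "previous_preterm_birth": True,
    "paternal_drinking": True,
}
print(first_matching_rule(example))  # 3
```

Because the checks run in order, each later rule implicitly carries the negated conditions of the earlier ones, which mirrors how a path from the root of the tree to a leaf is read.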
5. Conclusions
Preterm birth is the leading cause of perinatal morbidity and mortality, but to date the precise mechanism is still unknown, and few prospective studies have explored its risk factors using data mining methods. Hence, we conducted a study based on a nested case-control design and used data mining to explore the risk factors of preterm birth. Our results show that multiple birth and hemorrhage during pregnancy are the top two risk factors. In addition, several maternal factors, such as age, disease, previous preterm history, body weight before pregnancy and height, are also related to preterm birth. These risk factors are already commonly known and accepted. Furthermore, paternal risk factors related to drinking, smoking, and occupation also appear in our results. This strongly suggests that men also contribute to the risk of preterm birth. Thus, it is essential for prospective fathers to form good lifestyle habits in order to help prevent their babies from being born preterm. The findings of our study will be useful for parents, medical staff, and public health workers in attempting to detect high-risk pregnant women and provide early intervention to reduce and prevent preterm birth.
Acknowledgements
We thank Professor Jung-Der Wang, Pau-Chung Chen and Hui-I Hsieh for providing the medical health data and helpful comments.
References
Agrawal, A., Mehta, M., Shafer, J., & Srikant, R. (1996). The quest data mining system. In Proceedings of 2nd international conference on knowledge discovery and data mining.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. California: Wadsworth.
Çakır, A., Çalış, H., & Küçüksille, E. U. (2009). Data mining approach for supply unbalance detection in induction motor. Expert Systems with Applications, 36(9), 11808–11813.
Chang, C. L. (2007). A study of applying data mining to early intervention for developmentally-delayed children. Expert Systems with Applications, 33(2), 407–412.
Chen, H. Y., Hou, T. W., Chuang, C. H., & TBPS Research Group (2009). Applying data mining to explore the risk factors of parenting stress. Expert Systems with Applications. doi:10.1016/j.eswa.2009.05.028.
Chuang, C. H., Chang, P. J., Hsieh, W. S., et al. (2007). The combined effect of employment status and transcultural marriage on breast feeding: a population-based survey in Taiwan. Paediatric and Perinatal Epidemiology, 21(4), 319–329.
Chuang, C. H., Doyle, P., Wang, J. D., Chang, P. J., Lai, J. N., & Chen, P. C. (2006). Herbal medicines used during the first trimester and major congenital malformations: An analysis of data from a pregnancy cohort study. Drug Safety, 29(6), 537–548.
Courtney, K. L., Stewart, S., Popescu, M., & Goodwin, L. K. (2008). Predictors of preterm birth in birth certificate data. Studies in Health Technology and Informatics, 136, 555–560.
Duman, F., Erdamar, A., Erogul, O., Telatar, Z., & Yetkin, S. (2009). Efficient sleep spindle detection algorithm with decision tree. Expert Systems with Applications, 36(6), 9980–9985.
Goldenberg, R., Culhane, J. F., Dlams, J., & Romero, R. (2008). Epidemiology and causes of preterm birth. Lancet, 371(9606), 75–84.
Han, J., & Kamber, M. (2007). Data mining concepts and techniques. Amsterdam: Elsevier. pp. 291–310.
Huang, S. C., Chang, E. C., & Wu, H. H. (2009). A case study of applying data mining techniques in an outfitter's customer value analysis. Expert Systems with Applications, 36(3), 5909–5915.
Kantardzic, M. (2003). Data mining: Concepts, models, methods, and algorithms. ISBN 0471228524. New Jersey: John Wiley & Sons.
Liao, S. H., Hsieh, C. L., & Huang, S. P. (2008). Mining product maps for new product development. Expert Systems with Applications, 34(1), 50–62.
Lin, S. W., Shiue, Y. R., Chen, S. C., & Cheng, H. M. (2009). Applying enhanced data mining approaches in predicting bank performance: A case of Taiwanese commercial banks. Expert Systems with Applications, 36(9), 11543–11551.
Liu, D., Yuan, Y., & Liao, S. (2009). Artificial neural networks for optimization of gold-bearing slime smelting. Expert Systems with Applications, 36(9), 11671–11674.
Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815–840.
MacDorman, M. F., Martin, J. A., Mathews, T. J., Hoyert, D. L., & Ventura, S. J. (2005). Explaining the 2001–02 infant mortality increase: Data from the linked birth/infant death data set. National Vital Statistics Reports, 53(12), 1–22.
McCormick, M. C. (1985). The contribution of low birth weight to infant mortality and childhood morbidity. New England Journal of Medicine, 312(2), 82–90.
Moore, M. L. (2003). Preterm labor and birth: what have we learned in the past two decades? Journal of Obstetric Gynecologic & Neonatal Nursing, 32(5), 638–649.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.
Quinlan, J. R. (1997). C5.0 and See5: Illustrative examples. RuleQuest Research. http://www.rulequest.com.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Rajan, K., Ramalingam, V., Ganesan, M., Palanivel, S., & Palaniappan, B. (2009). Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Systems with Applications, 36(8), 10914–10918.
Romero, R., Espinoza, J., Kusanovic, J. P., Gotsch, F., Hassan, S., Erez, O., et al. (2006). The preterm parturition syndrome. British Journal of Obstetrics and Gynaecology, 113(Suppl 3), 17–42.
Slattery, M. M., & Morrison, J. J. (2002). Preterm delivery. Lancet, 360(9344), 1489–1497.
SPSS (2005). Introduction to clementine. USA: SPSS Inc.
Wu, L. C., Lee, J. X., Huang, H. D., Liu, B. J., & Horng, J. T. (2009). An expert system to predict protein thermostability using decision tree. Expert Systems with Applications, 36(5), 9007–9014.
