You are on page 1of 16

Knowledge-Based Systems 150 (2018) 95–110

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Novel binary encoding water cycle algorithm for solving Bayesian


network structures learning problem
Jingyun Wang, Sanyang Liu∗
School of Mathematics and Statistics, Xidian University, Xi’an 710126, China

a r t i c l e i n f o a b s t r a c t

Article history: Constructing Bayesian network structures from data is a computationally hard task. One important
Received 30 October 2017 method to learn Bayesian network structures uses the meta-heuristic algorithms. In this paper, a novel bi-
Revised 28 February 2018
nary encoding water cycle algorithm is proposed for the first time to address the Bayesian network struc-
Accepted 2 March 2018
tures learning problem. In this study, the sea, rivers and streams correspond to the candidate Bayesian
Available online 7 March 2018
network structures. Since it is a discrete problem to find an optimal structure, the logic operators have
Keywords: been used to calculate the positions of the individuals. Meanwhile, to balance the exploitation and ex-
Water cycle algorithm ploration abilities of the algorithm, the ways how rivers and streams flow to the sea and the evaporation
Binary encoding process have been designed with the new strategies. Experiments on well-known benchmark networks
Heuristic algorithm demonstrate that the proposed algorithm is capable of identifying the optimal or near-optimal structures.
Bayesian network In the comparison to the use of the other algorithms, our method performs well and turns out to have
Structure learning
the better solution quality.
© 2018 Published by Elsevier B.V.

1. Introduction hybrid learning [34] algorithms. The constraint-based algorithms


construct BNs by analyzing conditional independent relations be-
Bayesian networks (BNs) combine graph and probability theo- tween nodes in network. The score-based algorithms learn BNs by
ries to obtain a comprehensible representation of the joint proba- maximizing the scores of candidate structures with some heuristic
bility distribution. Bayesian networks have been seen as one of the search algorithms. The hybrid algorithms combine aspects of both
best way to represent causal knowledge and used in reasoning and constraint-based and score-based algorithms, they use conditional
decision making tasks in uncertain domains. Due to their powerful independence tests and network scores at the same time. The fo-
representation, inference and learning abilities, Bayesian networks cus of this paper is on the score-based algorithms rather than the
have become increasingly popular in large number of research ar- constraint-based methods. Learning the Bayesian network struc-
eas, such as risk analysis [21,32,41], bioinformatics research [38,39], tures from data is an NP-hard problem [8], the number of possible
medical problem [29,35] and image processing [23], etc. structures grows super-exponentially by the number of nodes [25].
The learning task in Bayesian networks can be grouped into two To efficiently search the optimum or near-optimum in the space of
subtasks: structure learning and parameter estimation. The first possible solutions, the meta-heuristic algorithms have been used
subtask is to identify the best topology for a network, and the sec- for finding out optimal structures, such as genetic algorithm (GA)
ond subtask is to learn the parameters that define the conditional [6,18], evolutionary programming (EP) [36], ant colony optimiza-
probability distribution for a given network topology. Throughout tion (ACO) [10,24], artificial bee colony (ABC) [17], bacterial forag-
the last decade, there has been a growing interesting in the area of ing optimization (BFO) [40] and particle swarm optimization (PSO)
learning the structure of Bayesian networks from data. The struc- [2,14,37].
ture learning can be considered the problem of selecting a prob- Several heuristic algorithms have been shown that they are
abilistic model that explains the given data. A wealth of literature promising candidates for solving BN structures learning prob-
are available to provide methods of learning BN structures. Roughly lem. For instance, an artificial bee colony algorithm for learning
speaking, there are three methods to learning the structures of Bayesian networks (ABC-B) [17]; structural learning of Bayesian
Bayesian network: constraint-based [22,33] , score-based [1,15] and networks by bacterial foraging optimization (BFO-B) [40]; BNC-
PSO: structure learning of Bayesian networks by particle swarm
∗ optimization (BNC-PSO) [14]. However, some drawbacks still exist
Corresponding author.
E-mail addresses: ygbai@stu.xidian.edu.cn (J. Wang), liusanyang@126.com (S.
in these algorithms. ABC-B and BFO-B algorithms generate candi-
Liu). date solutions based on neighbor searching. When ABC-B is imple-

https://doi.org/10.1016/j.knosys.2018.03.007
0950-7051/© 2018 Published by Elsevier B.V.
96 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

mented, the inductive pheromone transmission mechanism guides learn the parameters that define the conditional probability dis-
the search for global solution, the risk of the algorithm be trapped tribution for a fixed network structure. In this work, we focus on
in the local optimum is increased. Although BFO-B has the certain the problem of structure learning, and learn the structure from ob-
probability of jumping out the local optimum, it generally spends served data based on score-based method.
much time in finding out the optimal solution. BNC-PSO generates
a new solution based on mutation and crossover operators, the al-
gorithm converges fast and obtains the optimum in a small num- 2.2. k2 scoring metric of the Bayesian network
ber of generations, it works well for small databases but may be
trapped in the local optimum for large databases. Score-based approach is one of the most popular method of
Water cycle algorithm (WCA) is a novel meta-heuristic algo- inducing BNs from data. Generally, given a scoring metric, ap-
rithm developed by Eskandar et al. in 2012 [12], which is inspired proaches to BN learning concentrate on finding one or more struc-
from nature and based on the observation of water cycle process tures that fit the observed data well according to the scoring met-
and the ways how rivers and streams flow to the sea in the real ric. Assuming a structure G, the score as shown in Eq. (2) is the
world. The basic WCA and its modified versions have been applied posterior probability of G given the data set D = {d1 , d2 , · · · , dN }
to constrained optimization and engineering design problems, the with N cases. Since P (D ) does not depend on the structure, the
algorithms have good performance in convergence rate and qual- joint distribution P (G, D ) can be used as scoring metric.
ity of optimized designs [26–28]. However, to our knowledge, the P ( D |G )P ( G ) P ( G, D )
WCA algorithm has not been applied to learning BN structures. P (G|D ) = = . (2)
P (D ) P (D )
In this paper, we propose a novel binary encoding water cycle
algorithm for structures learning of Bayesian networks (BEWCA- The k2 metric is one of the widely used Bayesian scoring meth-
BN). Since the solution space of the problem is binary-structured, ods. Since the metric is first used in k2 algorithm [9], so it adopts
xor, and and or operators are used to generate solutions in the the name of the algorithm and refers to as the k2 metric. The k2
new optimization technology. Unlike classical water cycle algo- metric is used to measure the joint probability of a BN G and a
rithm where the individuals are updated only based on the best data D, it can be expressed as Eq. (3):
(sea) and better (rivers) solutions, each individual in the proposed
method also learns from randomly selected individuals in the cur- 
n 
qi
( r i − 1 )! 
ri
P ( G, D ) = P ( G ) Ni jk !, (3)
rent population. In addition, we define a method to calculate the
i=1 j=1
(Ni j + ri − 1 )!
k=1
distance of two individuals, and the evaporation process is per-
formed by mutation operator. The new strategies are introduced where ri is the number of different states of the variable xi , qi is
in the proposed algorithm to enhance the search performance the number of possible configurations for Pa(xi ), Nijk is the number
and improve the quality of the solutions. The theorem of Markov of samples in D, in which xi is in its kth state and Pa(xi ) is in its
r
chain is used to prove the convergence of the proposed algorithm. jth configuration, and Ni j = ki=1 Ni jk .
BEWCA-BN algorithm is then implemented on several well-known The score decomposability is an important property for a scor-
benchmark networks and compared with other state of the art al- ing metric. If the score of the whole network can be written as
gorithms. the sum of scores that depend only on one node and its parents,
The rest of this paper is organized as follows: we begin in we say that the score metric is decomposable. The major benefit
Section 2 with the concepts of BNs and the scoring metric related of this property is that a local change in a BN does not alter the
to BNs. In Section 3 we discuss the WCA algorithm. In Section 4 we scores of other parts. When assuming a uniform prior for P (G ) and
develop the novel binary encoding algorithm for structures learn- using log(P (G, D )) instead of P (G, D ), the k2 metric can be writ-
ing of BNs. In Section 5 we present the experimental evaluation of ten as Eq. (4), the score of each node xi is defined as Eq. (5):
the proposed algorithm. Finally, Section 6 contains the conclusions
and possible future directions. 
n
f (G, D ) = log(P (G, D )) ≈ f (xi , Pa(xi )), (4)
i=1
2. Bayesian networks and scoring metric
   
2.1. Bayesian networks 
qi
( r i − 1 )! 
ri
f (xi , Pa(xi )) = log + log(Ni jk ! ) . (5)
(Ni j + ri − 1 )
k=1
Let G = (X, E ) be a directed acyclic graph (DAG), where X = j=1

(x1 , x2 , · · · , xn ) is the set of nodes representing the system ran-


dom variables, E = {ei j } is the set of edges representing the di- 3. Water cycle algorithm
rect dependence relationships between the variables. If there is a
directed edge from node xj to node xi , we say xj is a parent of 3.1. Representation of initial population
xi . Pa(xi ) is defined as the set containing the parents of xi in the
graph. Let P be a joint probability distribution of random variables Water cycle algorithm (WCA), a new population-based algo-
in set V. If (G, P ) satisfies the Markov condition, then (G, P ) is rithm, is inspired by the observation of water cycle process and
called a Bayesian network (BN). Together with the graph structure, the ways how rivers and streams flow to the sea in the real world.
the joint probability distribution of the domain can be decomposed Similar to other meta-heuristic methods, WCA algorithm requires
into a product of local conditional probability distributions accord- an initial population of individuals so called raindrops. An initial
ing to Eq. (1), and each conditional probability distribution involves population is randomly generated by first assuming the rain or
a node and its parents only. precipitation phenomena, then the quality of each raindrop is eval-

n uated to determine the sea, the rivers and the streams. The rain-
P ( x1 , x2 , · · · , xn ) = P (xi |Pa(xi )). (1) drop with the best quality is chosen as a sea, a number of good
i=1 raindrops are considered as rivers and the rest of raindrops are
The learning task in BNs consists of two subtasks: structure chosen as streams that flow to the rivers and sea.
learning and parameter estimation. The first subtask aims at iden- In D dimensional optimization problem, each raindrop is repre-
tifying the best topology for a network. The second subtask is to sented as an array of 1 × D, the total population is represented as
J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110 97

follows: distribution) between 0 and c × d. The new positions of streams


⎛ ⎞ and rivers are updated according to Eqs. (12)–(14):
Xsea
⎜ Xriver ⎟ t+1
Xstream t
= Xstream + rand × c × (X t t
− Xstream ), (12)
⎜ 1
⎟ river
⎜ Xriver ⎟
⎜ 2

⎜ .. ⎟ t+1
Xstream t
= Xstream + rand × c × (Xsea
t t
− Xstream ), (13)
⎜ . ⎟
Population = ⎜
⎜ Xriver


⎜ Nsr −1
⎟ X t+1 = X t + rand × c × (Xsea
t
− Xt ),
⎜ Xstream1 ⎟ river river river
(14)
⎜ ⎟
⎜ .. ⎟
⎝ . ⎠ where rand is a random value on the interval [0,1], t is the number
XstreamN of iterations. If the new stream obtained by Eq. (12) (or Eq. (13) is
pop −Nsr better than its connecting river (or sea), then the positions of the
⎛ ⎞ river (or sea) and the new stream are exchanged, i.e., the river (or
x11 x12 ··· x1D
⎜ x21 x22 ··· x2D ⎟ sea) becomes a stream and the new stream becomes a river (or
= ⎜ .. .. .. .. ⎟, (6) sea). If the new river obtained by Eq. (14) is better than the sea,
⎝ . . . . ⎠ then exchange the positions of the sea and the new river, i.e. the
xNpop 1 xNpop 2 ··· xNpop D sea becomes a river, and the new river becomes the sea.

in which, Npop is taken as the number of population size, Nsr is 3.3. Evaporation condition
the summation of the sea and the number of rivers as shown in
Eq. (7): In WCA, each stream updates its position according to previ-
Nsr = number of rivers + 
1 , (7) ous position and connecting river (i.e. the better solution), each
river updates its position according to previous position and the
sea
sea (i.e. the best solution). With the increasing of iterations, the
and the number of streams is calculated using Eq. (8): differences X t t
− Xstream t
and Xsea − Xt become very small,
river river
Nstreams = Npop − Nsr . (8) which means that the positions of individuals are very close to
each other. Thus the new positions may not be able to explore
The cost of each individual is evaluated by the cost function given any better region in the searching space and the population con-
as Eq. (9): verges to a small domain. In this case, if the best position is not
the global optimum, the population may be trapped in local op-
Costi = f (xi1 , xi2 , · · · , xiD ), i = 1, 2, 3, · · · , Npop . (9)
timum. Taking into consideration these facts, evaporation process
To start the optimization problem, Npop individuals are gener- is defined to encourage the exploration phase of the search and
ated and evaluated. A number of Nsr from the best individuals are prevent the population from trapping in local optimum. As can be
chosen as a sea (best solution) and rivers (better solutions), and seen in nature, the evaporation process occurs when rivers and
the rest of Nstreams individuals are considered as streams. Rivers streams are close enough to the sea. If Xsea t − Xt  < dmax or
river
flow to the sea, and streams flow to the rivers or may directly flow rand < 0.1 (i = 1, 2, . . . , Nsr − 1), then the raining process is em-
to sea. Sea and each river absorb streams based on their magnitude ployed by Eq. (15):
of flow. The designated streams that flow to the sea and rivers are
calculated by Eqs. (10) and (11):
new
Xstream = LB + rand × (UB − LB ), (15)
in which UB and LB are defined as the upper and lower bounds
Cn = Costn − CostNsr+1 , n = 1, 2, 3, · · · , Nsr , (10)
of the given problem, rand is a random value in the interval [0,1],
dmax is a small value close to zero.
 
Cn 4. Binary encoding water cycle algorithm for BN structures
NSn = round | N | × Nstreams , n = 1, 2, 3, · · · , Nsr , learning
sr C
n=1 n

(11) In this section, a novel binary encoding water cycle algorithm is


proposed for solving the problem of BNs learning. In our proposed
where NSn is the number of streams that flow to the sea or specific
algorithm, each candidate solution is represented as a connectivity
river. According to Eqs. (10) and (11), we observe that the sea as
matrix, the individuals update their positions based on the logic
the best solution absorbs more streams than rivers do. If it is a
operators, the bit − f lip mutation operator is adopted to enable
maximization problem, more streams flow to the sea which has
the algorithm to avoid premature convergence. The details of this
the highest cost, and other streams flow to the rivers which have
algorithm are described in the next subsections.
higher costs.
4.1. Solution representation and initialization
3.2. Updating rules
As discussed in Section 2, the structure learning task can be
Similar to the phenomena in nature, streams are created from viewed as an optimization problem, that is to find the network
waters and join each other to form new rivers. Some streams may with the highest score in the search space. Each candidate solu-
flow directly to the sea. All rivers and streams end up in the sea tion can be represented by a D × D connectivity matrix, where D is
that corresponds to the current best individual. the number of nodes. In detail, let a BN be defined as follows:
A stream flow to a specific river along their connecting line us- ⎛ ⎞
ing a randomly generated distance x (x ∈ (0, c × d), c > 1), where x11 x12 ··· x1D
1 < c < 2, and the best value for c may be selected as 2. d is the ⎜ x21 x22 ··· x2D ⎟
X =⎜ .. .. .. ⎟, (16)
distance between the current stream and its specific river. x is dis- ⎝ . . .

tributed random number (uniformly or may be any appropriate
xD1 xD2 ··· xDD
98 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

Table 1 where, , , and + operates denote the and, xor and or logic op-
Logic operators.
erators, respectively. Xr1 and Xr2 are random individuals, r1 and r2
X Y xor(X, Y ) and(X, Y ) or(X, Y ) are random integers between 1 and Npop . Rand is a D × D random
0 0 0 0 0 matrix, each element in Rand is 0 or 1. I is a D × D unit matrix.
0 1 1 0 1 The candidate solution is generated through the instructions given
1 0 1 0 1 in Algorithm 1.
1 1 0 1 1

Algorithm 1 Update-WCA.
in which xij is defined as: Input:
 1: Input:ct , Xbetter , Xr , Xr1 , Xr2
1, if i is a parent of j Output:
xi j = . (17)
0, otherwise 2: Output:Vr
3: Generate a matrix Rand, where each element in Rand is zero
Since, WCA is a population-based algorithm, in order to solve
4: Vr1 = xor(Xbetter , Xr )
the structure learning problem, the first step is to generate the ini-
5: Vr2 = xor(Xr1 , Xr2 )
tial population as the initial solutions. The procedure to establish
6: for i = 1 : D do
the initial networks is given similar to that proposed in [40]. To
7: for j = 1 : D do
generate an initial network, the procedure is performed by starting
8: if rand < ct then
with an empty graph (i.e. a graph with all the nodes in the net-
9: Rand (i, j ) = 1
work, without any edge between them), and then adding an edge
10: end if
to current graph if and only if the new graph is a directed acyclic
11: end for
graph and the score of the new graph is larger than that of the
12: end for
old one. The process of edge added is repeated until the number
13: Vr1 = and (Vr1 , Rand )
of edges satisfies the pre-defined number.
14: Vr2 = and (Vr2 , I − Rand )
15: Vr = or (Vr1 , Vr2 )
4.2. Algorithm description
16: Vr = xor (Vr , Xr )
17: if !acyclic (Vr ) then
4.2.1. Update position based on logic operator
18: Remove cycles and obtain a directed acyclic graph Vr
The basic WCA works on continuous space to optimize the
19: end if
problem with real-valued parameters. However, the problem of
BNs learning is set in the binary space. Hence, the strategies de-
signed for updating streams and rivers should be modified for solv- In Algorithm 1, Xr represents the rth individual (a river or a
ing binary optimization problems. Inspired by the ideas of the bi- stream), Xbetter represents a good solution (the sea or a river), Xr1
nary encoding differential evolution algorithm [3,11], the elements and Xr2 are random individuals in the population excluding Xr and
of binary space can be regarded as logic number. Thus, we use the Xbetter . The matrix Rand lies on the parameters rand and ct , where
logic operators (xor, or and and) to generate the candidate solu- rand indicates a uniformly distribution random number in (0,1) and
tions in our proposed algorithm. The truth table of logic operators ct is scale factor parameter. If ct is set to a large value, the num-
is show in Table 1. ber of elements that are equal to 1 in Rand is larger than the
In basic WCA, the algorithm generates the candidate solutions number of elements that are equal to 1 in I − Rand. From step
based on Eqs. (12)–(14). We notice that the individuals are updated 4 we know, if Xbetter (i, j) = Xr (i, j), then Vr1 (i, j ) = 1. At step 13,
only depend on the information including the better and the best if Rand (i, j ) = 1, then the new Vr1 (i, j) is equal to 1. Since step
solutions. Once these good solutions are stagnant, individuals in 15 employs or operator, then Vr (i, j ) = 1. If Xr (i, j ) = 1, it must
the population will quickly converge to these solutions. If the cur- be Xbetter (i, j ) = 0, Xr (i, j ) = Vr (i, j ), hence we have Vr (i, j ) = 0 at
rent convergence is not the global optimum, the population will step 16, which means that the new individual learns from Xbetter (i,
hardly escape from the local optimum. To tackle this problem, the j). If Xr (i, j ) = 0, it must be Xbetter (i, j ) = 1, Xr (i, j) = Vr (i, j), thus
individuals in our proposed method not only update their positions we have Vr (i, j ) = 1 at step 16. Similarly, if the number of ele-
according to the good solutions but also learn from other individ- ments that are equal to 1 in I − Rand is larger than the number
uals in the current population. Learning from the other individuals of elements that are equal to 1 in Rand, it is more likely that the
has the advantage of ensuring the diversity of the population and new individual learns from random individuals. Thus, if the value
prohibiting the population from getting stuck in local optimum [7]. of ct is large, the algorithm has a priority in local searching, if it is
Thus, in this paper, we update the streams and rivers according to small, it will have a priority in global searching. Generally, at the
Eqs. (18)–(20): initial generations of the search, the exploration is typically pre-

ferred, but the exploitation of promising solution is required as the
t+1
Xstream t
= Xstream  Rand  (X t t
 Xstream )
river search progresses. Therefore, ct increases linearly from cmin to cmax
 according to Eq. (21), we set cmin = 0.7 and cmax = 0.9 in our work,
+(I − Rand )  (Xr1
t t
 Xr2 ), (18)
t is the current iteration number and T is the maximum number
of iterations.

t+1
Xstream t
= Xstream  Rand  (Xseat t
 Xstream ) t
 ct = cmin + (cmax − cmin ) , (21)
+(I − Rand )  (Xr1  Xr2 ) ,
t t
(19) T

In the process of building a BN, an invalid solution (i.e., directed


 cycles in the graph) may be generated. In order to detect and re-
Xt+1
=X  Rand  (Xsea  X
t
) t t
river river river move the cycles, we use the search procedure proposed in [14]. Af-
 ter detecting all black edges, each black edge is deleted or reversed
+(I − Rand )  (Xr1
t t
 Xr2 ) , (20)
randomly, and then a directed acyclic graph is generated.
J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110 99

4.2.2. Evaporation process 4.3. Convergence analysis


For preventing the algorithm from rapid convergence and pro-
moting diversity, the resulting streams and rivers may undergo
Definition 4.1. Let Y (Y = si , i = 1, 2, · · · ) be the stochastic process
some kind of randomized change. For the binary problem, we use
of discrete parameters defined in probability space (, F, P) over a
the bit − f lip mutation (i.e., each element is flipped from 0 to 1
finite state space S = {si } (i = 1, 2, . . . , T ), where T is the total num-
or 1 to 0). In other words, in solving problem of BNs learning, the
ber of states. X is a finite Markov chain when,
evaporation process is employed by randomly adding or removing
a particular edge to a BN. Algorithm 2 illustrates the process of the P (Yi+1 = si+1 |Y1 = s1 , Y2 = s2 , · · · , Yi = si )
= P (Yi+1 = si+1 |Yi = si ),
Algorithm 2 Evaporation-WCA. P (Y1 = s1 , Y2 = s2 , · · · , Yi = si ) > 0. (24)
Input: .
1: Input: Stream Xstream
Output: Let probability pij (t) be the transition probability that is the
2: Output:New stream Xstream new conditional probability of process from state si to sj at time t. The
new
3: Xstream = Xstream T × T matrix P = ( pi j (t )) is the transition matrix, in which pij ∈ [0,

4: Randomly select an edge between node i and node j 1] for i, j ∈ [1, T], and Tj=1 pi j = 1 for all i. Because the elements
5: if Xstream (i, j ) = 1 then in each row sum to 1, P matrix is called a stochastic matrix. If the
6: new (i, j ) = 0
Xstream transition probability pij (t) does not depend on the time t, then the
7: else Markov chain is said to be homogeneous. Given an initial distribu-
8: new (i, j ) = 1
Xstream tion of states π (0) as a row vector, the probability distribution of
9: end if the Markov chain after t steps is given π (t ) = π (0 )Pt . Therefore,
new ) then
10: if !acyclic (Xstream a homogenous finite Markov chain is completely specified by its
11: new
Remove cycles and obtain a directed acyclic graph Xstream initial distribution π (0) and transition probability P. For homoge-
12: end if neous finite Markov chain, we have the following theorems [16].

Theorem 4.1. Let P be a primitive stochastic matrix of order T, that


mutation operator. is, all of the elements of Pt are positive for some integer t. Then Pt
The evaporation (mutation) process occurs when rivers and converges as t → ∞ to a stochastic matrix which has all nonzero en-
streams are close enough to the sea. To measure the difference be- tries. That is, for all i, j ∈ [1, T],
⎛ ⎞
tween a river (or a stream) and the sea, the distance is defined as π
follows:
P∞ = lim ( pi j )(t ) = ⎝ ... ⎠ (25)
m01 t→∞
Dis(Xriver , Xsea ) = , (22) π T ×T
m01 + m11
, where π = (π1 , π2 , . . . , πT ), and π j = 0 for 1 ≤ j ≤ T.
where, m01 is the number of cases when Xriver (i, j) = Xsea (i, j), m11
is the number of cases when Xriver (i, j ) = Xsea (i, j ) = 1, in which Theorem4.2. Let Pbe a stochastic matrix of order T with the struc-
i, j = 1, 2, · · · , D. C 0
ture P= , where C is a primitive stochastic matrix of or-
In the proposed algorithm, on one hand, if the distance be- R Q
tween a river and the sea is less than a value d or a random value der m and R, Q = 0. Then Pt converges as t → ∞ to a stochastic ma-
in the interval [0,1] is less then a pre-defined value 0.1, the corre- trix which has all nonnegative entries. That is, for all i, j ∈ [1, T],
sponding streams that flow to the river are updated according to ⎛ ⎞
π
Algorithm 2. On the other hand, if the distance between the sea
and a stream which directly flows to the sea is less than a value d,
∞ (t )
P = lim ( pi j ) = ⎝ .. ⎠ (26)
t→∞ .
we employ Algorithm 2 and generate the new stream around the π T ×T
sea. Actually, this approach improves the exploitation to perform a
, where π = (π1 , π2 , . . . , πm , 0, . . . , 0 ), and π j = 0 for 1 ≤ j ≤ m < T.
better search in smaller region near the sea.
The small value of d limits the number of searching, while In BEWCA-BN algorithm, each possible solution is represented
by a D × D matrix Xk (k = 1, 2, . . . , n ), where n = 2(D −D ) is the car-
2
the large value encourages search near the sea. Thus, the value
of d controls the search intensity near the sea. Eq. (23) shows a dinality of the search space. Let  be the set of all the possible
solutions, we have || = n = 2D −D . We use N to denote the popu-
2
quadratic time-dependent rule to calculate d [5]:
  t 2  lation size, then the number of possible population distributions
2t
dt = dmax − (dmax − dmin ) − , (23) is T = ||N . Suppose that a population distribution represents a
T T Markov state. In the search process of BEWCA-BN, the individuals
are obtained though updating their positions by logic and muta-
where, t is the current iteration number and T is the maximum
tion operators, the new population depends on the current popu-
number of iterations. In this paper, we set dmax = 0.2 and dmin =
lation. The conditional probability that the population transitions
0.01.
from one population distribution to another satisfies Eq. (24). The
total number of possible population distributions is finite. There-
4.2.3. Description of the proposed algorithm fore, BEWCA-BN algorithm can be described as a homogeneous and
A novel binary encoding water cycle algorithm is proposed for finite Markov chain.
learning Bayesian network structures (BEWCA-BN). The proposed To derive BEWCA-BN convergence properties, the state transi-
algorithm not only provides a new method to solve Bayesian net- tion mechanism is analyzed for two updating operators. Let L =
work structures learning problem but also enriches the applica- (Lls )N×n and M = (Mks )n×n be intermediate transition matrixes cor-
tions of the water cycle algorithm. The framework of the BEWCA- responding to only logic operator and only mutation operator, re-
BN algorithm for learning Bayesian network structures is shown in spectively. Lls is the probability that the lth individual in the pop-
Algorithm 3. ulation transitions to the sth individual in the search space. Mks is
100 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

Algorithm 3 Binary encoding water cycle algorithm for structures learning of Bayesian networks (BEWCA-BN).

(continued on next page)


J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110 101

Algorithm 3 (continued)

the probability that kth individual in the search space transitions matrix P = ( pi j )T ×T is positive, i.e. all of the elements of P are pos-
to sth individual in the search space. We use v = (v1 , v2 , . . . , vn ) itive.
to denote population vector, where vk is the number of candi-
n
date solution Xk in a population, we have k=1 vk = N. We use
P r (u|v ) to denote the transition probability from population vector Theorem 4.3. The transition matrix of BEWCA-BN with logic and mu-
v = (v1 , v2 , . . . , vn ) to population vector u = (u1 , u2 , . . . , un ) in one tation operators is positive.
generation . The state transition matrix P = ( pi j )T ×T is obtained by
calculating the transition probability pi j = P r (u|v ) for ith popula-
tion vector v to jth population vector u, where i, j ∈ [1, T]. The Proof. When only the logic operator is applied, we calculate the
transition probability P r (u|v ) can be expressed by a multinomial probability that the lth individual in current population transitions
distribution as [30]: to the sth individual in the search space. Let φ ls be the number
of cases Xl (i, j ) = Xs (i, j )(i, j = 1, 2, . . . , D ). From Algorithm 1, we

N 
n
p i j = P r ( u| v ) = (Qlk (v ))Jlk , observe that the final solution Vr is obtained by employing xor
J∈Y l=1 k=1
operator at step 16. If Xl (i, j) = Xs (i, j), Xl (i, j) transitions to Xs (i, j)
  after implementing Algorithm 1. Based on Table 1, if Vr (i, j) ob-

n 
N
tained from step 15 is equal to 1, then Xl (i, j) is able to transition to
Y ≡ J∈R N×n
: Jlk ∈ {0, 1}, Jlk = 1, ∀l, Jlk = uk , ∀k , (27)
1 − Xl (i, j ). Similarly, if Xl (i, j ) = Xs (i, j ), we expect that the value
k=1 l=1
Xl (i, j) remains unchanged after employing the algorithm. Based on
where, Q = (Qlk )N×n = LMT , l ∈ [1, N] and k ∈ [1, n]. Qlk is the prob- Table 1, if Vr (i, j) obtained from step 15 is equal to 0, then Xl (i, j)
ability that lth individual in current population state v transitions remains unchanged. Without loss of generality we assume that Vr
to kth individual in the search space through logic and mutation generated at step 16 is a directed acyclic graph. Now, there are two
operators. The following theorem shows that the state transition cases should be considered. 
102 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

 
Case 1: Xl (i, j) = Xs (i, j). Based on Algorithm 1, we should calcu- Num0r1 · Num0r2 + Num1r1 · Num1r2
late the probability P (Vr (i, j ) = 1 ) at step 15, · 1− . (36)
D4
P1 = P (Vr (i, j ) = 1 ) = P (Vr1 (i, j ) = 1 ∪ Vr2 (i, j ) = 1 )
If f(Vr ) is better than f(Xl ), then Xl is replaced by Vr . Thus, the
= 1 − P (Vr1 (i, j ) = 0 ∩ Vr2 (i, j ) = 0 ) probability that Xl transitions to Xs is given as Eq. (37):
= 1 − P (Vr1 (i, j ) = 0 ) · P (Vr2 (i, j ) = 0 ). (28) 
φ D2 −φls
Since the algorithm employs and operator at steps 13 and 14, the Lls = P1 ls · P0 , if f (Xs ) > f (Xl ) , (37)
probabilities P (Vr1 (i, j ) = 0 ) and P (Vr2 (i, j ) = 0 ) are directly calcu-
0, otherwise
lated according to the xor operator at steps 4 and 5.
where, P1 and P0 are calculated according to Eqs. (31) and (36).
P (Vr1 (i, j ) = 0 ) = P (Xbetter (i, j ) = Xl (i, j )) Therefore, L is a nonnegative stochastic matrix and each row sums
= P (Xbetter (i, j ) = 0 ) · P (Xl (i, j ) = 0 ) to 1.
When only the mutation operator is applied, the probability
+ P (Xbetter (i, j ) = 1 ) · P (Xl (i, j ) = 1 ) that Xs transitions to Xk is given as Eq. (38):
Num0b Num0l Num1b Num1l
= · + ·  1
φsk  1
D2 −φsk
D2 D2 D2 D2 Msk = · 1− 2 . (38)
Numb · Numl + Numb · Num1l
0 0 1 D2 −D D −D
= . (29)
D4 We can see that M is a positive matrix, and each of row sums to 1.
Since, M is a positive and stochastic matrix, every entry of
P (Vr2 (i, j ) = 0 ) = P (Xr1 (i, j ) = Xr2 (i, j )) M is positive. L is a nonnegative matrix, at least one element in
each row of L is larger than 0. Therefore, we have Qlk = (LMT )lk =
= P (Xr1 (i, j ) = 0 ∩ Xr2 (i, j ) = 0 ) n  N n
s=1 Lls Mks > 0. Then pi j = P r (u|v ) = k=1 (Qlk (v ))
Jlk >
J∈Y l=1
+ P (Xr1 (i, j ) = 1 ∩ Xr2 (i, j ) = 1 ) 0 for i, j ∈ [1, n] , in which Y is given in Eq. (27). Hence, the transi-
Num0r1 · Num0r2 + Num1r1 · Num1r2 tion matrix of BEWCA-BN P is positive.
= . (30) Before we analyze the convergence properties of the BEWCA-BN
D4
algorithm, some definitions are required [20].
Thus, we have

(Num0b · Num0l + Num1b · Num1l ) · (Num0r1 · Num0r2 + Num1r1 · Num1r2 ) Definition 4.2. Let X (t ) = {xi (t )|i ∈ [1, N], xi (t ) ∈ } be the popu-
P1 = 1 − , lation at generation t. We use ∗ = {x∗ |x∗ = arg max{ f (x )|x ∈ }}
D8
(31) to denote the set of the global solutions, each member in which
has global maximum fitness. We use ∗ (t ) = {x∗j (t )} ⊂ X (t ) to de-
where, Num1r is the number of elements that are equal to 1 in Xr ,
note the best individuals at the generation t, in which f (x∗j (t )) ≥
Num0r is the number of elements that are equal to 0 in Xr .
f (xi (t )) for all x∗j (t ) ∈ ∗ (t ) and for all i ∈ [1, N]. The BEWCA-BN
Case 2: Xl (i, j ) = Xs (i, j ). Based on Algorithm 1, we should cal-
culate the probability P (Vr (i, j ) = 0 ) at step 15, algorithm is said to converge if

P0 = P (Vr (i, j ) = 0 ) = P (Vr1 (i, j ) = 0 ∪ Vr2 (i, j ) = 0 ) P r ( lim x∗ (t ) ∈ ∗ ) = 1 ⇔ P r (x∗ ∈ lim X (t )) = 1. (39)
t→∞ t→∞
= 1 − P (Vr1 (i, j ) = 1 ∩ Vr2 (i, j ) = 1 )
From Definition 4.2 it may be concluded that the BEWCA-BN
= 1 − P (Vr1 (i, j ) = 1 ) · P (Vr2 (i, j ) = 1 ). (32)
algorithm is convergent if and only if Eq. (39) holds. Obviously,
Since the algorithm employs and operator at steps 13 and 14, the process that the evaluation of x∗ (t) is a finite and homoge-
neous Markov chain, which we call an x∗ (t )−chain. Now, we sort
P (Vr1 (i, j ) = 1 ) = P (Vr1 (i, j ) = 1 ∩ Rand (i, j ) = 1 ) = ct
all the states xj j ∈ [1, n] in order of descending fitness, that is  =
·P (Vr1 (i, j ) = 1 ), {x1 , x2 , · · · , xn } and f(x1 ) ≥ f(x2 ) ≥  ≥ f(xn ). Let S = {1, 2, · · · , n} be
P (Vr2 (i, j ) = 1 ) = P (Vr2 (i, j ) = 1 ∩ (I − Rand )(i, j ) = 1 ) the set of indices of , S∗ = { j|x j ∈ ∗ }. The following definition
= (1 − ct ) · P (Vr2 (i, j ) = 1 ), (33) will be useful.

where P (Vr1 (i, j ) = 1 ) and P (Vr2 (i, j ) = 1 ) on the right side of the Definition 4.3. Let P ˆ = ( pˆ i j ) be the transition matrix of an
equations are the probabilities that Vr1 (i, j ) = 1 and Vr1 (i, j ) = 1 at x∗ (t )−chain, where pˆ i j i, j ∈ [1, n] is the probability that x∗ (t ) = xi
steps 4 and 5, respectively. Since, the algorithm employs xor oper- at generation t transitions to x∗ (t + 1 ) = x j at generation t + 1. The
ator at steps 4 and 5, we have, BEWCA-BN algorithm is said to converge to a global optimum if
and only if x∗ (t) transitions from any state xi to x∗ as t → ∞ with
P (Vr1 (i, j ) = 1 ) = P (Xbetter (i, j ) = Xl (i, j ))
probability one, that is, if
= 1 − P (Xbetter (i, j ) = Xl (i, j ))
 t
Num0b · Num0l + Num1b · Num1l lim (Pˆ )i j = 1, ∀i ∈ S. (40)
= 1− (34) t→∞
j∈S∗
D4

Theorem 4.4. If the transition probability P ˆ = ( pˆ i j ) of an


P (Vr2 (i, j ) = 1 ) = P (Xr1 (i, j ) = Xr2 (i, j )) x∗ (t )−chain is positive and stochastic matrix, then the BEWCA-
= 1 − P (Xr1 (i, j ) = Xr2 (i, j )) BN algorithm with logic and mutation operators  does not  converge
Num0r1 · Num0r2 + Num1r1 · Num1r2 ˆ= C 0
= 1− . (35) to any of the global optima. However, if P , where C
D4 R Q
is a positive stochastic matrix of order |S∗ |, and R, Q = 0, then the
Therefore,
algorithm converges to one or more of the global optimal solutions.
P0 = 1 − ct · (1 − ct ) · P (Vr1 (i, j ) = 1 ) · P (Vr2 (i, j ) = 1 )
  Proof. We first prove that the algorithm does not converge to any
Num0b · Num0l + Num1b · Num1l
= 1 − ct · (1 − ct ) · 1− of the global optima. Since the positive and stochastic matrix is
D4
J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110 103

also a primitive matrix. Based on Theorem 4.1, experiments, it includes the databases, the original networks, the
 t  t number of cases of each network, the number of nodes of each
limt→∞ j∈S∗ (Pˆ )i j = 1 − limt→∞ j∈S−S∗ (Pˆ )i j network, the number of edges of each network, and the k2 scores.
 t
= 1 − j∈S−S∗ limt→∞ (P ˆ )i j (41) To evaluate the quality of each algorithm, several metrics used
| S |
= 1 − j=|S∗ |+1 π j < 1, for the analysis and comparison of the proposed method are listed
as follows:
where, S − S∗ contains the elements of S that do not belong to S∗ .
• Correct Edges (CEs): the edges are correctly detected (with the
Obviously, Eq. (40) is not satisfied, the algorithm does not converge
right orientation) in comparison to the original network.
to any of the global optima. Then, we prove that the  algorithm • Deleting Edges (DEs): the edges are not detected by the algo-
ˆ= C 0
converges to one or more of the global optima if P . rithm but present in the original network.
R Q
• Adding Edges (AEs): the edges are detected by the algorithm
From Theorem 4.2, we know that,
but not present in the original network.
⎛ ⎞
π • Wrong Orientation Edges (WOEs): the edges are detected but

ˆ
P = lim ( pˆ i j )(t ) = ⎝ ... ⎠ (42) having the opposite orientations in comparison to the original
t→∞ network.
π |S|×|S| • Structural Difference (SD): the difference between the learned
, where π = (π1 , π2 , . . . , π|S∗ | , 0, . . . , 0 ), and π j = 0 for 1 ≤ j ≤ |S∗ | network and the original network, it is the summation of the
|S∗ | three types mentioned above (DEs+AEs+WOEs). The lower SD
and j=1 j
π = 1. For any i ∈ S, we have, indicates the better network.
|S |
∗ • k2 Score: the k2 score of the discovered network. The higher
 t  t 
lim (Pˆ )i j = lim (P
ˆ )i j = π j = 1. (43) score, the network is better.
t→∞
j∈S∗ j∈S∗
t→∞
j=1
• Mean Result (Mean): the average results over all trials.
• Standard Deviation (Std): the standard deviation resulting
Eq. (40) is satisfied, that is, the proposed algorithm converges from all trials.
to the global optima ˆ
 if P is  a positive stochastic matrix with • Best Result (Best): the best result over all trials.
ˆ= C 0 • Worst Result (Worst): the worst result over all trials.
the structure P . The proposed algorithm preserves
R Q • Execution Time (ET): it is the time (in seconds) needed to learn
the best individual at each generation, it is one of the way to a network.
implement elitism in many evolutionary algorithms. In [20], the • Iterations (Its): it is the number of iterations needed to learn a
Biogeography-Based optimization has been shown that the algo- network.
rithm with elitism will converge to global optima for any binary
problem. In spite of differences between Biogeography-Based opti- In addition, to make a study among different algorithms with
mization and the proposed algorithm, the convergence of the pro- a certain level of confidence, a statistical analysis of the experi-
posed algorithm is similar to that of the Biogeography-Based opti- mental results has been performed [40]. Because, the BEWCA-BN
mization algorithm.  algorithm and each comparison method are implemented indepen-
dently, there is no correlation. Hence, the Kruskal–Wallis tests are
5. Experiments used to perform non-parametric analysis of the experimental re-
sults. To illustrate the differences between the proposed algorithm
In this section, to evaluate the performance of the proposed and the comparison approaches, we apply the Kruskal–Wallis test
method, we first test BEWCA-BN algorithm on several benchmark to perform paired tests with the confidence level 95%, i.e., the dif-
networks, and then compare the proposed method with other al- ferences are unlikely to have occurred by chance with a probabil-
gorithms. The algorithms are coded with Matlab 2010b and all ex- ity 95%. If the p-value obtained from the test is less than 5%, we
periments are implemented on a Pentium(R) 3.20 GHz with 2GB consider that a significant difference exists in the corresponding
RAM. experimental results.

5.2. Algorithms and parameters


5.1. Experimental databases and evaluation metrics

We compare the proposed method BEWCA-BN with other


To test the behavior of the proposed method, we select five
Bayesian network structures learning algorithms: particle swarm
well-known benchmark networks (Alarm, Asia, Child, Credit and
optimization (BNC-PSO) [14], artificial bee colony (ABC-B) [17],
Tank networks). The Alarm network [4] is a medical diagnostic sys-
the sparse candidate algorithm (SCA) [13] and the max-min hill-
tem for patient monitoring, it consists of 37 nodes and 46 edges
climbing algorithm (MMHC) [34].
connecting them. The Asia network [19] is used for a fictitious
The parameters for BEWCA-BN algorithm are chosen as: the
medical example, it consists of 8 nodes and 8 edges connecting
number of population size Npop = 50, the summation of a sea
them. The Child network [31] is a preliminary diagnostic model
and the number of rivers Nsr = 8. The scale factor parameter c in-
for newborn babies with congenital heart disease, it consists of 20
creases linearly from cmin to cmax according to Eq. (21), in which
nodes and 25 edges connecting them. The Credit network is used
cmin = 0.7, cmax = 0.9. The search intensity parameter d is calcu-
for assessing credit worthiness of an individual, it is available in
lated according to Eq. (23), in which dmin = 0.01, dmax = 0.2. The
GeNie software,1 the network consists of 12 nodes and 12 connec-
parameters for BNC-PSO algorithm are list as (according to Gheis-
tions between them. The Tank network is a simple network for di-
ari and Meybodi [14]): the population size is 50, the inertia weight
agnosing possible explosion in a tank, it consists of 14 nodes and
20 edges connecting and is also available in GeNie software. We
ω decreases linearly from 0.95 to 0.4, the acceleration coefficient
c1 decreases linearly from 0.82 to 0.5, the acceleration coefficient
test the algorithms on databases drawn from these networks by
c2 increases linearly from 0.4 to 0.83. The parameters for ABC-B
probabilistic logic sampling. Table 2 shows the databases used in
algorithm are chosen as (according to Ji et al. [17]): the popula-
tion size is 50, the parameters α and β that determine the rel-
1
http://dslpitt.org/genie/. ative importance of the pheromone with respect to the heuristic
104 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

Table 2
Databases used in experiments.

Database Original network Number of cases Number of nodes Number of arcs Score

Asia-10 0 0 Asia 10 0 0 8 8 −1101.11


Asia-30 0 0 Asia 30 0 0 8 8 −3326.34
Credit-10 0 0 Credit 10 0 0 12 12 −5360.58
Credit-30 0 0 Credit 30 0 0 12 12 −15822.36
Credit-50 0 0 Credit 50 0 0 12 12 −26374.30
Tank-10 0 0 Tank 10 0 0 14 20 −1632.52
Tank-30 0 0 Tank 30 0 0 14 20 −4848.08
Tank-50 0 0 Tank 50 0 0 14 20 −8138.93
Child-10 0 0 Child 10 0 0 20 25 −6406.11
Child-30 0 0 Child 30 0 0 20 25 −18731.68
Child-50 0 0 Child 50 0 0 20 25 −31157.76
Alarm-10 0 0 Alarm 10 0 0 37 46 −5044.07
Alarm-20 0 0 Alarm 20 0 0 37 46 −9739.44
Alarm-30 0 0 Alarm 30 0 0 37 46 −14512.89
Alarm-40 0 0 Alarm 40 0 0 37 46 −19160.06
Alarm-50 0 0 Alarm 50 0 0 37 46 −23780.46

information are equal to 1 and 2 respectively, the parameter of the algorithm detects 20.6 edges out of 25 edges and the standard
the pheromone evaporation ρ = 0.1, the parameter q0 that used deviation of k2 score is greater than 1.0. For Child-30 0 0, on the av-
to determine the relative importance of the exploitation versus ex- erage, our algorithm can identify 23.5 edges with one incorrectly
ploration is equal to 0.8, the maximum number of solution stag- deleted edges, the algorithm also detects 0.5 edges which have the
nation limit = 3. SCA and MMHC algorithms are implemented in opposite orientations in comparison to the original network. In the
the Causal Explorer system2 and we use the default values in the best case, our algorithm is capable of identifying the perfect struc-
software implementations for the parameters. We define the max- ture on database Child-50 0 0. As can be observed from the table,
imum number of iterations MaxIter = 500. the number of correctly identified edges increases with the sample
size increases for Child network. Alarm is a larger network com-
5.3. Experimental results and analysis pared with other four networks. The standard deviation values of
k2 score for Alarm with different cases are greater than 1. In ad-
5.3.1. Learning BNs using BEWCA-BN dition, for Alarm-10 0 0, our algorithm discovers the structure with
To verify the feasibility and effectiveness of the BEWCA-BN al- 35 edges correctly detected in the worst case, the reason may be
gorithm, the algorithm is implemented ten times independently on that Alarm-10 0 0 does not contain enough cases to correctly learn
each database listed in Table 2. Table 3 summarizes the results in- a BN structure. As also can be observed that the number of cor-
cluding best, worst, mean and standard deviation of different eval- rectly identified edges increases with the sample size increases.
uation metrics. Although, on the average, our algorithm identifies 42.3 edges on
For Asia network, the best and worst results of the k2 score are Alarm-40 0 0, which is greater than the result obtained on Alarm-
the same for Asia-10 0 0 and Asia-30 0 0, which means that the pro- 50 0 0, the standard deviation is relatively small for Alarm-50 0 0.
posed algorithm is capable of identifying the same scores over ten From observations above, we conclude that our algorithm can
executions. However, for Asia-10 0 0, on the average, the algorithm identify Bayesian network structures with the large k2 scores and
can find about 6.4 edges with one incorrectly deleted edge and 0.6 small structure differences. In the best case, the algorithm is capa-
wrongly orientated edges. For Asia-30 0 0, on the average, the algo- ble of obtaining the perfect structures. In addition, we also observe
rithm can detect 7 edges without incorrectly added and orientated that the larger databases are more conducive to correctly identify-
edges, but one edge is wrongly deleted. For Credit-10 0 0, on the ing the network structures in the most cases. The proposed algo-
average, our algorithm can correctly identify 10.5 edges, one edge rithm is a feasible and effective method to learn Bayesian network
is wrongly deleted and 0.5 edges are detected with the opposite structures from databases.
orientations in comparison to the original network. The algorithm
obtains relatively good results on Credit-30 0 0, on the average, only
5.3.2. Comparing BEWCA-BN with other algorithms
one edge is incorrectly orientated, the algorithm can detect 11 cor-
We compare the proposed method with three score-based al-
rect edges without wrongly added and deleted edges. For Credit-
gorithms and a hybrid algorithm. Each algorithm is employed ten
50 0 0, although the standard deviation of k2 score is 0.00, the al-
times independently on databases Asia-30 0 0, Credit-30 0 0, Tank-
gorithm, on the average, adds one more incorrect edges in com-
30 0 0, Child-30 0 0, Alarm-30 0 0 and Alarm-50 0 0. Tables 4–6 present
parison to the structural difference obtained by the algorithm em-
the experimental results based on k2 scores and structure differ-
ployed on Credit-30 0 0. For Tank network, the standard deviation
ences, and the best results are marked in bold. To demonstrate
of k2 score resulting from our algorithm is greater than 0.0 and
the differences of the proposed algorithm and the comparison
less than 1.0 on each database. The algorithm detects 16.9 edges
approaches, we perform pair-wise Kruskal–Wallis tests on each
out of 20 edges on database Tank-10 0 0, this is a relatively poor re-
database. Fig. 1(a) and (b) present the test results for k2 score and
sult for small network. However, as can be observed from the ta-
structural difference respectively, the line is the benchmark (0.05).
ble, the structural difference obtained by our algorithm decreases
When test results returned from two algorithms are exactly the
as simple size increases. In the best case, our algorithm is capable
same, the test value is infinite, NA in Fig. 1 denotes this value.
of identifying the perfect structures on databases Tank-30 0 0 and
(A) k2score Table 4 shows the k2 scores of each algorithm on
Tank-50 0 0. It is worth notice that, on the average, only 0.5 edges
different databases. The standard deviation values of k2 score for
are detected incorrectly by our algorithm on database Tank-50 0 0.
Asia-30 0 0 obtained by BEWCA-BN, BNC-PSO, SCA and MMHC al-
Similar results can be observed for Child network. For Child-10 0 0,
gorithms are equal to 0.00, which indicates that each algorithm
except ABC-B obtains the same k2 score over ten runs. The best
2
http://www.dsl-lab.org/causal_explorer/. value (-3325.87) is obtained by SCA algorithm, but it is only 0.01
J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110 105

Table 3
Experimental results of the BEWCA-BN algorithm.

Database k2 Score AEs DEs WOEs SD CEs Its

Asia-10 0 0 Best −1100.78 0 1 0 1 7 7


Worst −1100.78 0 1 1 2 6 4
Mean −1100.78 0 1 0.6 1.6 6.4 5.7
Std 0.00 0.00 0.00 0.52 0.52 0.52 1.06
Asia-30 0 0 Best −3325.88 0 1 0 1 7 3
Worst −3325.88 0 1 0 1 7 13
Mean −3325.88 0 1 0 1 7 5.6
Std 0.00 0.00 0.00 0.00 0.00 0.00 3.17
Credit-10 0 0 Best −5330.90 0 1 0 1 11 5
Worst −5337.75 0 1 1 2 10 13
Mean −5335.53 0 1 0.5 1.5 10.5 8.3
Std 2.51 0.00 0.00 0.53 0.53 0.53 2.36
Credit-30 0 0 Best −15806.22 0 0 1 1 11 8
Worst −15806.22 0 0 1 1 11 37
Mean −15806.22 0 0 1 1 11 10
Std 0.00 0.00 0.00 0.00 0.00 0.00 10.63
Credit-50 0 0 Best −26362.73 1 0 1 2 10 10
Worst −26362.73 1 0 1 2 10 37
Mean −26362.73 1 0 1 2 10 21.4
Std 0.00 0.00 0.00 0.00 0.00 0.00 8.71
Tank-10 0 0 Best −1630.60 2 0 0 2 18 48
Worst −1631.24 4 1 1 6 14 48
Mean −1630.78 2.6 0.3 0.2 3.1 16.9 34.2
Std 0.30 0.97 0.48 0.42 1.60 1.60 7.43
Tank-30 0 0 Best −4848.08 0 0 0 0 20 22
Worst −4848.63 2 0 1 3 17 92
Mean −4848.33 1 0 0.5 1.5 18.5 38.2
Std 0.27 1.05 0.00 0.53 1.58 1.58 20.42
Tank-50 0 0 Best −8138.93 0 0 0 0 20 24
Worst −8140.16 2 1 0 3 17 99
Mean −8139.11 0.4 0.1 0 0.5 19.5 46.2
Std 0.41 0.84 0.32 0.00 1.08 1.08 22.15
Child-10 0 0 Best −6379.39 0 4 0 4 21 11
Worst −6382.67 0 4 1 6 19 61
Mean −6379.74 0 4 0.3 4.4 20.6 36.4
Std 1.02 0.00 0.00 0.48 0.70 0.70 17.68
Child-30 0 0 Best −18727.63 0 1 0 0 24 28
Worst −18727.63 0 1 1 2 23 81
Mean −18727.63 0 1 0.5 1.5 23.5 53.6
Std 0.00 0.00 0.00 0.53 0.53 0.53 19.79
Child-50 0 0 Best −31157.76 0 0 0 0 25 32
Worst −31157.76 0 0 2 2 23 83
Mean −31157.76 0 0 0.7 0.7 24.3 48.9
Std 0.00 0.00 0.00 0.67 0.67 0.67 15.25
Alarm-10 0 0 Best −5032.21 3 2 0 5 41 94
Worst −5039.99 6 3 3 11 35 166
Mean −5035.71 4.7 2.3 1.4 8.4 37.6 129.8
Std 2.51 1.83 0.48 0.97 2.01 2.01 25.66
Alarm-20 0 0 Best −9721.76 1 2 0 3 43 103
Worst −9726.61 5 2 2 8 38 215
Mean −9723.21 2.60 2 0.8 5.4 40.6 150.5
Std 1.49 1.08 0.00 0.63 1.43 1.43 36.28
Alarm-30 0 0 Best −14510.68 1 1 0 2 44 119
Worst −14515.59 5 1 2 8 38 193
Mean −14512.86 3.3 1 1.1 5.4 40.6 152.2
Std 1.61 1.16 0.00 0.74 1.84 1.84 24.19
Alarm-40 0 0 Best −19148.45 1 1 0 2 44 120
Worst −19158.73 2 2 2 5 41 263
Mean −19151.71 1.7 1.1 0.9 3.7 42.3 181
Std 2.73 0.48 0.32 0.87 1.16 1.16 46.77
Alarm-50 0 0 Best −23769.12 1 1 0 3 43 105
Worst −23775.54 3 1 2 5 41 209
Mean −23769.99 2.1 1 0.9 4.0 42 165.4
Std 2.03 0.57 0.00 0.74 0.82 0.82 38.45

greater than the average value of the k2 score (−3325.88) result- BEWCA-BN, BNC-PSO and ABC-B algorithms can obtain the highest
ing from BEWCA-BN algorithm. For Credit-30 0 0, the best result k2 scores (−4848.08) on Tank-30 0 0. On the average, the BNC-PSO
(−15806.22) is returned by BEWCA-BN and MMHC algorithms re- algorithm gets the relatively larger k2 score (−4848.31) in com-
spectively. Although, BNC-PSO gets the highest k2 score in the best parison to the other algorithms, but the mean values of BEWCA-
case, the worst value of the BNC-PSO algorithm is smaller than that BN and ABC-B are very close to −4848.31. For Child-30 0 0, the
of the BEWCA-BN algorithm. It is obvious that the SCA algorithm best, worst and mean values are the same for BEWCA-BN, ABC-
has the worst performance among five algorithms. In the best case, B and MMHC algorithms, which indicates that these three algo-
106 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

Table 4
k2 score comparisons among five algorithms.

Algorithm Asia-30 0 0 Credit-30 0 0 Tank-30 0 0 Child-30 0 0 Alarm-30 0 0 Alarm-50 0 0


(−3326.34) (−15822.36) (−4848.08) (−18731.68) (−14512.89) (−23780.46)

Best BEWCA-BN −3325.88 −15806.22 −4848.08 −18727.63 −14511.54 −23769.12


BNC-PSO −3325.88 −15806.22 −4848.08 −18727.63 −14511.59 −23769.12
ABC-B −3325.88 −15815.69 −4848.08 −18727.63 −14511.28 −23769.12
SCA −3325.87 −16473.54 −7325.20 −20034.56 −24686.66 −38389.35
MMHC −3325.88 −15806.22 −4901.68 −18727.63 −14681.68 −23977.14
Worst BEWCA-BN −3325.88 −15806.22 −4848.63 −18727.63 −14515.59 −23775.54
BNC-PSO −3325.88 −15822.33 −4849.74 −18753.35 −14525.96 −23787.67
ABC-B −3327.43 −15815.69 −4850.02 −18727.63 −14531.39 −23786.25
SCA −3325.87 −16473.54 −7325.20 −20034.56 −24686.66 −38389.35
MMHC −3325.88 −15806.22 −4901.68 −18727.63 −14681.68 −23977.14
Mean(Std) BEWCA-BN −3325.88 ± 0.00 −15806.22 ± 0.00 −4848.33 ± 0.27 −18727.63 ± 0.00 −14512.83 ± 1.61 −23769.99 ± 2.03
BNC-PSO −3325.88 ± 0.00 −15813.00 ± 5.77 −4848.31 ± 0.52 −18730.20 ± 8.13 −14516.23 ± 5.26 −23775.80 ± 6.52
ABC-B −3326.04 ± 0.49 −15815.69 ± 0.00 −4848.44 ± 0.61 −18727.63 ± 0.00 −14516.54 ± 7.08 −23771.22 ± 5.31
SCA −3325.87 ± 0.00 −16473.54 ± 0.00 −7325.20 ± 0.00 −20034.56 ± 0.00 −24686.66 ± 0.00 −38389.35 ± 0.00
MMHC −3325.88 ± 0.00 −15806.22 ± 0.00 −4901.68 ± 0.00 −18727.63 ± 0.00 −14681.61 ± 0.00 −23977.14 ± 0.00

Table 5
Structure comparisons among five algorithms on Asia, Credit and Tank networks.

Asia-30 0 0 Credit-30 0 0 Tank-30 0 0

Best Worst Mean( ± Std) Best Worst Mean( ± Std) Best Worst Mean( ± Std)

AEs BEWCA-BN 0 0 0 ± 0.00 0 0 0 ± 0.00 0 2 1 ± 1.05


BNC-PSO 0 0 0 ± 0.00 0 0 0 ± 0.00 0 2 0.4 ± 0.70
ABC-B 0 2 0.2 ± 0.63 0 0 0 ± 0.00 0 2 0.7 ± 0.95
SCA 0 0 0 ± 0.00 7 7 7.0 ± 0.00 20 20 20 ± 0.00
MMHC 0 0 0 ± 0.00 0 0 0 ± 0.00 11 11 11 ± 0.00
DEs BEWCA-BN 1 1 1 ± 0.00 0 0 0 ± 0.00 0 0 0 ± 0.00
BNC-PSO 1 1 1 ± 0.00 0 1 0.5 ± 0.53 0 1 0.1 ± 0.32
ABC-B 1 2 1.1 ± 0.32 1 1 1 ± 0.00 0 0 0 ± 0.00
SCA 1 1 1 ± 0.00 3 3 3 ± 0.00 13 13 13 ± 0.00
MMHC 1 1 1 ± 0.00 0 0 0 ± 0.00 3 3 3 ± 0.00
WOEs BEWCA-BN 0 0 0 ± 0.00 1 1 1 ± 0.00 0 1 0.5 ± 0.53
BNC-PSO 0 1 0.2 ± 0.42 0 2 0.6 ± 0.70 0 1 0.3 ± 0.48
ABC-B 0 0 0 ± 0.00 0 0 0 ± 0.00 0 2 1.1 ± 0.74
SCA 1 1 1 ± 0.00 5 5 5 ± 0.00 3 3 3 ± 0.00
MMHC 1 1 1 ± 0.00 1 1 1 ± 0.00 5 5 5 ± 0.00
SD BEWCA-BN 1 1 1 ± 0.00 1 1 1 ± 0.00 0 3 1.5 ± 1.58
BNC-PSO 1 2 1.2 ± 0.42 0 2 1.1 ± 0.57 0 3 0.8 ± 1.32
ABC-B 1 4 1.3 ± 0.95 1 1 1 ± 0.00 0 4 1.9 ± 1.67
SCA 2 2 2 ± 0.00 15 15 15 ± 0.00 36 36 36 ± 0.00
MMHC 2 2 2 ± 0.00 1 1 1 ± 0.00 19 19 19 ± 0.00

Table 6
Structure comparisons among five algorithms on Child and Alarm networks.

Child-30 0 0 Alarm-30 0 0 Alarm-50 0 0

Best Worst Mean( ± Std) Best Worst Mean( ± Std) Best Worst Mean( ± Std)

AEs BEWCA-BN 0 0 0 ± 0.00 1 5 3.3 ± 1.16 1 2 2.1 ± 0.57


BNC-PSO 0 0 0 ± 0.00 3 6 4.5 ± 1.08 1 5 2.3 ± 1.18
ABC-B 0 0 0 ± 0.00 1 3 2.0 ± 0.82 0 2 1.4 ± 1.17
SCA 5 5 5 ± 0.00 17 17 17 ± 0.00 24 24 24 ± 0.00
MMHC 0 0 0 ± 0.00 2 2 2 ± 0.00 1 1 1 ± 0.00
DEs BEWCA-BN 1 1 1 ± 0.00 1 1 1 ± 0.00 1 1 1 ± 0.00
BNC-PSO 1 1 1 ± 0.00 1 1 1 ± 0.00 1 1 1 ± 0.00
ABC-B 1 2 1.1 ± 0.32 1 1 1 ± 0.00 1 1 1 ± 0.00
SCA 9 9 9 ± 0.00 28 28 28 ± 0.00 27 27 27 ± 0.00
MMHC 1 1 1 ± 0.00 1 1 1 ± 0.00 2 2 2 ± 0.00
WOEs BEWCA-BN 0 1 0.5 ± 0.53 0 2 1.1 ± 0.74 0 2 0.9 ± 0.74
BNC-PSO 0 3 0.8 ± 1.23 1 3 2.7 ± 1.06 0 6 1.2 ± 1.67
ABC-B 0 4 2.0 ± 1.41 2 5 3.0 ± 1.25 0 4 1.3 ± 1.64
SCA 3 3 3 ± 0.00 4 4 4 ± 0.00 6 6 6 ± 0.00
MMHC 3 3 3 ± 0.00 9 9 9 ± 0.00 6 6 6 ± 0.00
SD BEWCA-BN 1 2 1.5 ± 0.53 2 8 5.4 ± 1.84 3 5 4.0 ± 0.82
BNC-PSO 1 4 1.8 ± 1.23 6 12 8.2 ± 1.93 3 12 4.5 ± 2.67
ABC-B 1 4 3.1 ± 1.37 4 9 6 ± 1.89 1 9 3.7 ± 2.50
SCA 17 17 17 ± 0.00 49 49 49 ± 0.00 57 57 57 ± 0.00
MMHC 4 4 4 ± 0.00 12 12 12 ± 0.00 9 9 9 ± 0.00
J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110 107

Fig. 1. Pair-wise Kruskal–Wallis tests: (a) k2 scores, (b) structural differences.

rithms return the same results over ten runs. For Alarm-30 0 0, in
some cases, ABC-B algorithm obtains the highest k2 score. On the
average, the k2 scores returned by five algorithms except BEWCA-
BN are smaller than the score (-14512.89) of the original network.
For Alarm-50 0 0, in the best case, BEWCA-BN, BNC-PSO and ABC-B
algorithms are capable of obtaining the highest score (−23769.12).
On the average, BEWCA-BN preforms well in terms of k2 score.
Fig. 1(a) shows that BEWCA-BN algorithm is significantly better
than SCA method on all databases, with the higher k2 scores and
p-value < 0.05. In comparison to the use of MMHC algorithm, the
proposed algorithm outperforms on databases Tank-30 0 0, Alarm-
30 0 0 and Alarm-50 0 0. BEWCA-BN performs better than ABC-B and
BNC-PSO on Credit-30 0 0, it is also significantly superior to BNC-
PSO on Alarm-50 0 0. However, there are no significant differences
among three heuristic algorithms (BEWCA-BN, BNC-PSO and ABC-
B) on other databases.
From the view of the k2 score, we observe that three heuris-
tic algorithms (BEWCA-BN, BNC-PSO and ABC-B) return relatively
larger k2 scores on all databases. Although, the MMHC algorithm Fig. 2. The execution time of three algorithms on different databases.
performs well on databases Credit-30 0 0 and Child-30 0 0, it returns
the smaller values for databases Alarm-30 0 0 and Alarm-50 0 0 com-
pared with three heuristic algorithms. In the comparison to the
use of BNC-PSO and ABC-B algorithms, BEWCA-BN is stable on all
databases and capable of finding the better quality networks with number of edges identified by our algorithm is larger than that of
higher k2 scores. the ABC-B algorithm, the standard deviation is significantly smaller
(B) Structural difference Tables 5 and 6 present the structural than ABC-B algorithm. From Tables 5 and 6, we notice that ABC-B
differences of each algorithm on different databases. The Asia net- algorithm is easy to detect the edges having the opposite orien-
work is a simple structure, five algorithms can obtain the struc- tations in comparison to the original networks. The factor may be
tures with small structure differences. For Credit-30 0 0, BNC-PSO that ABC-B has been designed based on neighbor searching (ad-
can identify the perfect structure in the best case, but the mean dition, deletion, reversion and move operators) and the reversion
value is larger than that obtained by BEWCA-BN, ABC-PSO and operator may reverse the edges that have the same orientations in
MMHC algorithms. For Tank-30 0 0, BEWCA-BN, BNC-PSO and ABC-B comparison to the original networks in some cases. Actually, some
algorithms are capable of finding the true network in the best case, of the wrong orientations can be explained by Markov equivalent
and BNC-PSO performs well on average. As can be observed from structures.
the table, our algorithm outperforms other algorithms on database Fig. 1(b) presents the test results in terms of structural differ-
Child-30 0 0. For Alarm-30 0 0, the best result obtained by BEWCA- ences. It is obvious that the proposed algorithm is significantly bet-
BN algorithm have 40.6 edges out of the 46 edges with 3.3 in- ter than SCA algorithm on all databases with the smaller structural
correctly added edges, one deleted edge and 1.1 wrongly orien- differences and p-value < 0.05. In addition, there is no significantly
tated edges on average. For Alarm-50 0 0, ABC-B performs better difference between BEWCA-BN and MMHC algorithms on Credit-
than other algorithms, on the average, it can find about 42.3 edges. 30 0 0. BEWCA-BN performs better than BNC-PSO on Alarm-30 0 0
In the comparison to the use of ABC-B algorithm, although the and ABC-B on Child-30 0 0, and there are no significantly differences
between three heuristic algorithms on other databases.
108 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

As can be observed from the table, the mean and standard de-
viation values obtained by our algorithm are relatively small for
databases generated from five benchmark networks, which indi-
cates that the proposed algorithm is an effective method to cor-
rectly learn Bayesian network structures from databases.
(C) Execution time To study the time performance, we first
test three heuristic algorithms on databases Asia-30 0 0, Child-30 0 0,
Tank-30 0 0, Credit-30 0 0, Alarm-30 0 0 and Alarm-50 0 0. Then, we
compare three algorithms on five databases sampled from Alarm
network. Fig. 2 and Fig. 3 show the experimental results, in which
the execution time of each algorithm is the average of ten indepen-
dent runs. As shown in Fig. 2, BEWCA-BN spends less time than
ABC-B algorithm on all databases. BNC-PSO performs better than
BEWCA-BN algorithm on databases Child-30 0 0, the reason may be
that the particle swarm optimization has advantage in its ability
of quick discovery of optimal solutions. However, the problem of
early convergence in the particle swarm optimization may lead to
the search process to be trapped in a local optimum. ABC-B al-
gorithm is the worst among three algorithms on databases sam-
pled from different networks. This is because the ABC-B algorithm
Fig. 3. Time performance of three algorithms on Alarm network.
spends a relatively high proportion of its time on neighbor search-
ing. Fig. 3 shows the average execution time of BEWCA-BN in com-
parison to BNC-PSO and ABC-B algorithms on five databases gener-
The results in terms of the structure differences are similar to ated from Alarm network. It can be seen that the execution time of
those in terms of k2 scores. Three heuristic algorithms (BEWCA- three algorithms generally increase with the increase of the sam-
BN, BNC-PSO and ABC-B) are capable of finding near-optimal struc- ple size. Obviously, the running time of the BEWCA-BN algorithm
tures. SCA algorithm fails to identify large network structures, and increases slowly, whereas ABC-B algorithm is sensitive to the in-
it is also difficult for MMHC algorithm to learn large networks. crease of the sample size. We notice that ABC-B takes less time

Fig. 4. The score convergence of three algorithms on: (a) Credit-30 0 0, (b) Tank-30 0 0, (c) Child-30 0 0, (d) Alarm-50 0 0.
J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110 109

on learning Alarm networks compared with BNC-PSO algorithm [7] R. Cheng, Y. Jin, A social learning particle swarm optimization algorithm for
when the sample size equals 30 0 0. BNC-PSO performs better than scalable optimization, Inf. Sci. 291 (2015) 43–60.
[8] D.M. Chickering, D. Heckerman, C. Meek, Large-sample learning of bayesian
BEWCA-BN algorithm on database Alarm-40 0 0. However, the over- networks is np-hard, J. Mach. Learn. Res. 5 (10) (2004) 1287–1330.
all optimization results indicate that BEWCA-BN algorithm gener- [9] G.F. Cooper, E. Herskovits, A bayesian method for the induction of probabilistic
ally takes less time in finding optima in most cases. Therefore, we networks from data, Mach. Learn. 9 (4) (1992) 309–347.
[10] L.M. De Campos, J.M. Fernandez-Luna, J.A. Gámez, J.M. Puerta, Ant colony opti-
can conclude that the proposed algorithm is superior to BNC-PSO mization for learning bayesian networks, Int. J. Approximate Reasoning 31 (3)
and ABC-B algorithms in terms of execution time. (2002) 291–311.
(D) Convergence Fig. 4 illustrates the k2 scores with respect to [11] C. Deng, B. Zhao, Y. Yang, H. Peng, Q. Wei, Novel binary encoding differential
evolution algorithm, Adv. Swarm Intell. (2011) 416–423.
the number of iterations for solving BN structures learning prob-
[12] H. Eskandar, A. Sadollah, A. Bahreininejad, M. Hamdi, Water cycle algorithm–a
lems. The convergence graphs present the average scores obtained novel metaheuristic optimization method for solving constrained engineering
by three heuristic algorithms using databases Credit-30 0 0, Tank- optimization problems, Comput. Struct. 110 (2012) 151–166.
[13] N. Friedman, I. Nachman, D. Peér, Learning bayesian network structure from
30 0 0, Child-30 0 0 and Alarm-50 0 0. As can be observed, three al-
massive datasets: the sparse candidate algorithm, in: Proceedings of the Fif-
gorithms improve the quality of solutions at the beginning of the teenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann
search but the convergence rate for BEWCA-BN algorithm is faster Publishers Inc., 1999, pp. 206–215.
than BNC-PSO in earlier iterations. Although, ABC-B algorithm ob- [14] S. Gheisari, M.R. Meybodi, Bnc-pso: structure learning of bayesian networks by
particle swarm optimization, Inf. Sci. 348 (2016) 272–289.
tains the optimal solutions faster than BEWCA-BN algorithm in [15] D. Heckerman, D. Geiger, D.M. Chickering, Learning bayesian networks: The
Fig. 4(a) and (b), the quality of the optimal solution returned by combination of knowledge and statistical data, Mach. Learn. 20 (3) (1995)
BEWCA-BN algorithm is better than that of ABC-B algorithm. As 197–243.
[16] M. Iosifescu, Finite Markov Processes and Their Applications, Courier Corpora-
shown in Fig. 4, the convergence rates for three algorithms become tion, 2014.
nearly the same as the increase of the number of iterations. How- [17] J. Ji, H. Wei, C. Liu, An artificial bee colony algorithm for learning bayesian
ever, the proposed algorithm obtains the better average solutions networks, Soft comput. 17 (6) (2013) 983–994.
[18] P. Larrañaga, M. Poza, Y. Yurramendi, R.H. Murga, C.M.H. Kuijpers, Structure
compared with ABC-B and BNC-PSO algorithms. The overall results learning of bayesian networks by genetic algorithms: A performance analy-
indicate that the BEWCA-BN algorithm converges to the global op- sis of control parameters, IEEE Trans. Pattern Anal. Mach. Intell. 18 (9) (1996)
tima faster and more accurate than other considered algorithms. 912–926.
[19] S.L. Lauritzen, D.J. Spiegelhalter, Local computations with probabilities on
graphical structures and their application to expert systems, J. R. Stat. Soc. Ser.
6. Conclusion B (Methodological) (1988) 157–224.
[20] H. Ma, D. Simon, M. Fei, On the convergence of biogeography-based optimiza-
This paper has proposed a novel binary encoding water cycle tion for binary problems, Math. Problems Eng. 2014 (2014).
[21] I. Maglogiannis, E. Zafiropoulos, A. Platis, C. Lambrinoudakis, Risk analysis of
algorithm for learning Bayesian network structures. The logic op- a patient monitoring system using bayesian network modeling, J. Biomed. In-
erators have been introduced into water cycle algorithm to solve form. 39 (6) (2006) 637–647.
the discrete problem. In addition, the new update and evaporation [22] D. Margaritis, Learning Bayesian Network Model Structure from Data, Technical
Report, Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science, 2003.
processes have been proposed to improve the performance of the
[23] S. Nikolopoulos, G.T. Papadopoulos, I. Kompatsiaris, I. Patras, Evidence-driven
BEWCA-BN algorithm. Extensive experiments have been performed image interpretation by combining implicit and explicit knowledge in a
to evaluate the performance of the proposed algorithm. The re- bayesian network, IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 41 (5)
(2011) 1366–1381.
sults illustrate that the BEWCA-BN algorithm is capable of identify-
[24] P.C. Pinto, A. Nagele, M. Dejori, T.A. Runkler, J.M. Sousa, Using a local discovery
ing the optimal or near-optimal networks with high k2 scores and ant algorithm for bayesian network structure learning, IEEE Trans. Evol. Com-
small structural differences. In comparison to the use of other two put. 13 (4) (2009) 767–779.
heuristic algorithms, the proposed algorithm works well in terms [25] R.W. Robinson, Counting unlabeled acyclic digraphs, Combinatorial Math. V
622 (1977) (1977) 28–43.
of the accuracy and convergence speed. However, it is still a chal- [26] A. Sadollah, H. Eskandar, A. Bahreininejad, J.H. Kim, Water cycle algorithm with
lenge to find the optimal structures in large scale networks. For fu- evaporation rate for solving constrained and unconstrained optimization prob-
ture research, we would like to study the optimization mechanisms lems, Appl. Soft Comput. 30 (2015) 58–71.
[27] A. Sadollah, H. Eskandar, A. Bahreininejad, J.H. Kim, Water cycle, mine blast
of the water cycle algorithm, and extend our study for BN learn- and improved mine blast algorithms for discrete sizing optimization of truss
ing problems with large scale nodes or even dynamic BN learning structures, Comput. Struct. 149 (2015) 1–16.
problems. [28] A. Sadollah, H. Eskandar, J.H. Kim, Water cycle algorithm for solving con-
strained multi-objective optimization problems, Appl. Soft Comput. 27 (2015)
279–298.
Acknowledgment [29] F.L. Seixas, B. Zadrozny, J. Laks, A. Conci, D.C.M. Saade, A bayesian network
decision model for supporting the diagnosis of dementia, alzheimer’s disease
The research is supported by the National Natural Science Foun- and mild cognitive impairment, Comput. Biol. Med. 51 (2014) 140–158.
[30] D. Simon, M. Ergezer, D. Du, R. Rarick, Markov models for biogeography-based
dation of China (Grant no. 61373174), (Grant no. 61772391) and optimization, IEEE Trans. Syst. Man, Cybern. Part B (Cybernetics) 41 (1) (2011)
(Grant no. 11401454). 299–306.
[31] D.J. Spiegelhalter, A.P. Dawid, S.L. Lauritzen, R.G. Cowell, Bayesian analysis in
References expert systems, Stat. Sci. (1993) 219–247.
[32] M. Sykora, J. Markova, D. Diamantidis, Bayesian network application for the
[1] J.R. Alcobé, Incremental hill-climbing search applied to Bayesian network risk assessment of existing energy production units, in: Stochastic Models
structure learning, in: Proceedings of the 15th European Conference on Ma- in Reliability Engineering, Life Science and Operations Management (SMRLO),
chine Learning, Pisa, Italy, 2004. 2016 Second International Symposium on, IEEE, 2016, pp. 656–664.
[2] S. Aouay, S. Jamoussi, Y.B. Ayed, Particle swarm optimization based method [33] I. Tsamardinos, C.F. Aliferis, A.R. Statnikov, E. Statnikov, Algorithms for large
for Bayesian network structure learning, in: Modeling, Simulation and Ap- scale markov blanket discovery., in: FLAIRS Conference, 2, 2003, pp. 376–380.
plied Optimization (ICMSAO), 2013 5th International Conference on, IEEE, 2013, [34] I. Tsamardinos, L.E. Brown, C.F. Aliferis, The max-min hill-climbing bayesian
pp. 1–6. network structure learning algorithm, Mach. Learn. 65 (1) (2006) 31–78.
[3] A. Banitalebi, M.I.A. Aziz, Z.A. Aziz, A self-adaptive binary differential evolu- [35] M. Velikova, J.T. van Scheltinga, P.J. Lucas, M. Spaanderman, Exploiting
tion algorithm for large scale binary optimization problems, Inf. Sci. 367 (2016) causal functional relationships in bayesian network modelling for personalised
487–511. healthcare, Int. J. Approximate Reasoning 55 (1) (2014) 59–73.
[4] I.A. Beinlich, H.J. Suermondt, R.M. Chavez, G.F. Cooper, The alarm monitoring [36] M.L. Wong, K.S. Leung, An efficient data mining method for learning bayesian
system: A case study with two probabilistic inference techniques for belief networks using an evolutionary algorithm-based hybrid approach, IEEE Trans.
networks, in: AIME 89, Springer, 1989, pp. 247–256. Evol. Comput. 8 (4) (2004) 378–404.
[5] H. Cao, X. Qian, Z. Chen, H. Zhu, Enhanced particle swarm optimization for size [37] H. Xing-Chen, Q. Zheng, T. Lei, S. Li-Ping, Learning bayesian network struc-
and shape optimization of truss structures, Eng. Optim. (2017) 1–18. tures with discrete particle swarm optimization algorithm, in: Foundations of
[6] A. Carvalho, A cooperative coevolutionary genetic algorithm for learning Computational Intelligence, 2007. FOCI 2007. IEEE Symposium on, IEEE, 2007,
bayesian network structures, in: Proceedings of the 13th Annual Conference pp. 47–52.
on Genetic and Evolutionary Computation, ACM, 2011, pp. 1131–1138.
110 J. Wang, S. Liu / Knowledge-Based Systems 150 (2018) 95–110

[38] J. Xuan, J. Lu, G. Zhang, R.Y. Da Xu, X. Luo, A bayesian nonparametric model [40] C. Yang, J. Ji, J. Liu, J. Liu, B. Yin, Structural learning of bayesian networks
for multi-label learning, Mach. Learn. 106 (11) (2017) 1787–1815. by bacterial foraging optimization, Int. J. Approximate Reasoning 69 (2016)
[39] J. Xuan, J. Lu, G. Zhang, R.Y. Da Xu, X. Luo, Bayesian nonparametric relational 147–167.
topic model through dependent gamma processes, IEEE Trans. Knowl. Data [41] B. Yet, A. Constantinou, N. Fenton, M. Neil, E. Luedeling, K. Shepherd, A
Eng. 29 (7) (2017) 1357–1369. bayesian network framework for project cost, benefit and risk analysis with
an agricultural development case study, Expert Syst. Appl. 60 (2016) 141–155.

You might also like