
Liu, Smith 1

6.825 – Project 2
11/4/2004

1. Variable Elimination Functionality

After executing our variable elimination procedure, we obtained the following results for
each of the queries below.

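To simplify analysis of the PropCost probability distributions obtained
throughout this project from the insurance network, we define the function f to
be a weighted average across the discrete domain, yielding a single scalar value
representative of the overall cost. More specifically, each cost level is
weighted by its probability (this weighting reproduces the values of f reported
below):

f = \sum_{c} v(c) \, P(\mathrm{PropCost} = c),
with v(Thousand) = 1000, v(TenThou) = 10000, v(HundredThou) = 100000,
v(Million) = 1000000.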

<[Burglary] = [false]> = 0.7158281646356072

<[Burglary] = [true]> = 0.284171835364393

<[Earthquake] = [false]> = 0.8239331615949207

<[Earthquake] = [true]> = 0.17606683840507917

<[PropCost] = [HundredThou]> = 0.1729786918964137

<[PropCost] = [Million]> = 0.02709352198178344
<[PropCost] = [TenThou]> = 0.3427002442093675
<[PropCost] = [Thousand]> = 0.45722754191243536

(f = 48275.62)

These results are consistent with those obtained by executing the given enumeration
procedure, and those given in Table 1 of the project hand-out.
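The first two results above can be cross-checked with a minimal variable elimination sketch. It assumes, since the report does not list the queries, that they are P(Burglary | JohnCalls = true, MaryCalls = true) and P(Earthquake | JohnCalls = true, MaryCalls = true) on the textbook burglary/alarm network with its standard CPTs. The helper names (`cpt`, `restrict`, `multiply`, `sum_out`, `eliminate`, `alarm_network`) are illustrative and are not the project's actual code.

```python
from itertools import product

def cpt(child, parents, p_true):
    """Factor for P(child | parents); p_true maps parent values to P(child=True)."""
    table = {}
    for pa, p in p_true.items():
        table[(True,) + pa] = p
        table[(False,) + pa] = 1.0 - p
    return ((child,) + tuple(parents), table)

def restrict(factor, var, value):
    """Fix var to its observed value and drop it from the factor's scope."""
    scope, table = factor
    if var not in scope:
        return factor
    i = scope.index(var)
    return (scope[:i] + scope[i + 1:],
            {k[:i] + k[i + 1:]: v for k, v in table.items() if k[i] == value})

def multiply(f, g):
    """Pointwise product over the union of the two scopes."""
    (fs, ft), (gs, gt) = f, g
    scope = fs + tuple(v for v in gs if v not in fs)
    table = {}
    for values in product([True, False], repeat=len(scope)):
        env = dict(zip(scope, values))
        table[values] = ft[tuple(env[v] for v in fs)] * gt[tuple(env[v] for v in gs)]
    return (scope, table)

def sum_out(factor, var):
    """Marginalize var out of the factor."""
    scope, table = factor
    i = scope.index(var)
    out = {}
    for k, v in table.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + v
    return (scope[:i] + scope[i + 1:], out)

def eliminate(factors, evidence, order):
    """Posterior over the one variable left after eliminating `order`."""
    for var, value in evidence.items():
        factors = [restrict(f, var, value) for f in factors]
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        factors = [f for f in factors if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    total = sum(result[1].values())
    return {k[0]: v / total for k, v in result[1].items()}

def alarm_network():
    """Textbook burglary/alarm CPTs (assumed, not taken from the report)."""
    T, F = True, False
    return [
        cpt("Burglary", (), {(): 0.001}),
        cpt("Earthquake", (), {(): 0.002}),
        cpt("Alarm", ("Burglary", "Earthquake"),
            {(T, T): 0.95, (T, F): 0.94, (F, T): 0.29, (F, F): 0.001}),
        cpt("JohnCalls", ("Alarm",), {(T,): 0.90, (F,): 0.05}),
        cpt("MaryCalls", ("Alarm",), {(T,): 0.70, (F,): 0.01}),
    ]

evidence = {"JohnCalls": True, "MaryCalls": True}
burglary = eliminate(alarm_network(), evidence, ["Earthquake", "Alarm"])
earthquake = eliminate(alarm_network(), evidence, ["Burglary", "Alarm"])
print(burglary[True], earthquake[True])
```

Under these assumptions the sketch gives about 0.2842 for Burglary = true and about 0.1761 for Earthquake = true, in agreement with the values reported above.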

2. More Variable Elimination Exercise

A. Insurance Network Queries

1. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
MakeModel = SportsCar)

If the MakeModel of the car in question is a sports car, then, based on the
network illustrated in Figure 1 of the handout, we expect that the driver is
less risk averse, the driver has more money, and the car is of higher value.
All of these should cause the cost of insurance to “go up” relative to our
previous query, which did not involve any evidence about the MakeModel of the
car. For the discrete PropCost domain, an increase means that the probability
distribution shifts towards the higher-cost elements of the domain (e.g.
Million gains probability and Thousand loses it).

Indeed, this is what happens. As can be seen below, f is about four thousand
dollars greater in this case than in Section 1.3.

<[PropCost] = [HundredThou]> = 0.17179333672003955

<[PropCost] = [Million]> = 0.03093877334365239
<[PropCost] = [TenThou]> = 0.34593039737969233
<[PropCost] = [Thousand]> = 0.45133749255661565

(f = 52028.74)

2. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
GoodStudent = True)

In this case, counter-intuitive as it may seem, if the driver is a
GoodStudent, then the overall cost of insurance goes up. This follows from the
network shown in Figure 1 of the project handout: GoodStudent is connected to
the rest of the network only through its two parents, Age and SocioEcon. Since
Age is an evidence variable, SocioEcon is the only node directly affected by
adding GoodStudent to the evidence. More specifically, if the adolescent
driver is a good student, they are likely to have more money, and thus to
drive fancier cars, be less risk averse, et cetera.

Variable elimination with this evidence bears the result out: f is a little
less than four thousand dollars greater in this case than in Section 1.3.

<[PropCost] = [HundredThou]> = 0.1837467917616061

<[PropCost] = [Million]> = 0.029748793596801583
<[PropCost] = [TenThou]> = 0.32771416728772235
<[PropCost] = [Thousand]> = 0.4587902473538701

(f = 51859.40)
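The f values quoted for the three PropCost queries can be reproduced from the reported distributions. The sketch below assumes that each level name stands for its face dollar value (Thousand = 1000, and so on); the report does not state the weights, but this assumption matches the quoted f values to within about a cent. The names `COST` and `expected_cost` are illustrative.

```python
# Dollar value assumed for each PropCost level (inferred from the level names).
COST = {"Thousand": 1_000, "TenThou": 10_000,
        "HundredThou": 100_000, "Million": 1_000_000}

def expected_cost(dist):
    """Probability-weighted average cost of a PropCost distribution."""
    return sum(COST[level] * p for level, p in dist.items())

# Distributions copied from the variable elimination results in this report.
BASELINE = {"HundredThou": 0.1729786918964137, "Million": 0.02709352198178344,
            "TenThou": 0.3427002442093675, "Thousand": 0.45722754191243536}
SPORTS_CAR = {"HundredThou": 0.17179333672003955, "Million": 0.03093877334365239,
              "TenThou": 0.34593039737969233, "Thousand": 0.45133749255661565}
GOOD_STUDENT = {"HundredThou": 0.1837467917616061, "Million": 0.029748793596801583,
                "TenThou": 0.32771416728772235, "Thousand": 0.4587902473538701}

for name, dist in [("baseline", BASELINE), ("sports car", SPORTS_CAR),
                   ("good student", GOOD_STUDENT)]:
    delta = expected_cost(dist) - expected_cost(BASELINE)
    print(f"{name}: f = {expected_cost(dist):.2f} (change vs baseline: {delta:+.2f})")
```

Because the Million level carries a weight a thousand times that of Thousand, small shifts of probability into the top level dominate the change in f.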

<[N112] = > = 0.9880400004226929

<[N112] = > = 0.01195999957730707

<[N143] = > = 0.899999996961172

<[N143] = > = 0.10000000303882783

A. Histograms

[Plot: Histogram of Computation Time under Random Elimination Ordering:
Problem 1. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]

Figure 1. Histogram of Computation Time for P(PropCost | Age = Adolescent,
Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar).

[Plot: Histogram of Computation Time under Random Elimination Ordering:
Problem 2. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]

Figure 2. Histogram of Computation Time for P(PropCost | Age = Adolescent,
Antilock = False, Mileage = FiftyThou, GoodStudent = True).

[Plot: Histogram of Computation Time under Random Elimination Ordering:
Problem 3. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]

Figure 3. Histogram of Computation Time for P(N112 | N64 = "3", N113 = "1",
N116 = "0").

[Plot: Histogram of Computation Time under Random Elimination Ordering:
Problem 4. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]

Figure 4. Histogram of Computation Time for P(N143 | N146 = "1", N116 = "0",
N121 = "1").

B. Discussion
Figures 1 through 4 show the computation time of the random-ordering variable
elimination algorithm for each of the problems in Task 2 of the project
handout. We ran the algorithm ten times for each problem. A blue bar with a
purple bar stacked on top of it marks an execution in which the heap ran out
of memory. In this case, we know only that the execution would have taken at
least the time shown by the blue bar, i.e. the time it ran before running out
of memory. We suppose that each execution that ran out of memory would have
taken at least 5000 seconds to complete.

It is worth noting that the time taken by the successful runs (the samples
without a purple bar) is much lower than the time the unsuccessful runs spent
executing before they crashed; i.e. the successful blue bars tend to be
shorter than the unsuccessful blue bars. This indicates that random ordering
tends to get it either very right or very wrong.

A. Histograms

[Plot: bar chart. x-axis: Problem Number (1 to 4); y-axis range: 0 to 1.4.
Only the end of the caption survives: "... problems."]

Problem          Average Time (seconds)
Insurance – 1    0.629
Insurance – 2    1.086
Carpo – 1        0.088
Carpo – 2        0.087

Table 1. Average time of execution for variable elimination for the problems
from Task 2. Averages are computed across ten independent runs each, which
are illustrated in Figure 5.

B. Discussion
As can be seen from Table 1, variable elimination takes far less time under a
greedy elimination ordering than under a random ordering. This makes sense: a
random ordering may happen to eliminate a parent of many children early,
creating a huge factor that slows down the algorithm and eats up memory.
Greedy ordering, by contrast, works very well. Even in the cases from Section
3 in which we did not run out of memory, the greedy algorithm tends to be
about 100-200 times faster.
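The effect of ordering on factor size can be illustrated with a small sketch that only tracks factor scopes, not probabilities. It assumes a "fewest neighbours first" greedy rule, which is one common heuristic; the report does not say which greedy criterion the project used. The star-shaped graph and the names `eliminate_node`, `largest_factor`, and `greedy_order` are hypothetical.

```python
import random

def eliminate_node(adj, v):
    """Remove v, connect its neighbours, and return the created factor's scope size."""
    nbrs = adj.pop(v)
    for a in nbrs:
        adj[a].discard(v)
        adj[a] |= nbrs - {a}
    return len(nbrs) + 1  # v itself plus every variable it shares a factor with

def largest_factor(graph, order):
    """Largest scope (number of variables) created while eliminating in `order`."""
    adj = {v: set(ns) for v, ns in graph.items()}
    return max(eliminate_node(adj, v) for v in order)

def greedy_order(graph):
    """Repeatedly eliminate the variable with the fewest current neighbours."""
    adj = {v: set(ns) for v, ns in graph.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))
        eliminate_node(adj, v)
        order.append(v)
    return order

# Hypothetical interaction graph: one hub "H" that shares a factor with 8 leaves.
leaves = [f"L{i}" for i in range(8)]
star = {"H": set(leaves), **{leaf: {"H"} for leaf in leaves}}

hub_first = ["H"] + leaves
shuffled = random.Random(0).sample(list(star), len(star))
print("hub first:", largest_factor(star, hub_first))
print("random   :", largest_factor(star, shuffled))
print("greedy   :", largest_factor(star, greedy_order(star)))
```

Eliminating the hub first creates a single factor over all nine variables, whereas the greedy order never creates a factor over more than two. With k-valued variables the table size grows as k to the power of the scope size, which is why an unlucky random order can exhaust memory.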

5. Likelihood Weighting and Gibbs Sampling

Functionality
All of our results below appear to be in the right neighborhood. We give more
explicit quality results in the problems that follow this one.

<[Burglary] = [false]> = 0.5448387970739699

<[Burglary] = [true]> = 0.4551612029260302

<[Earthquake] = [false]> = 0.9997158283603297

<[Earthquake] = [true]> = 2.8417163967036946E-4

<[PropCost] = [HundredThou]> = 0.17105091038203132

<[PropCost] = [Million]> = 0.021563876240368398
<[PropCost] = [TenThou]> = 0.35877461270610517
<[PropCost] = [Thousand]> = 0.44861060067149516

4. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
MakeModel = SportsCar)

<[PropCost] = [HundredThou]> = 0.16339257873401916

<[PropCost] = [Million]> = 0.030620517617711222
<[PropCost] = [TenThou]> = 0.35048331774243846
<[PropCost] = [Thousand]> = 0.4555035859058312

5. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
GoodStudent = True)

<[PropCost] = [HundredThou]> = 0.20177159162635994

<[PropCost] = [Million]> = 0.032866049889275516
<[PropCost] = [TenThou]> = 0.30414914618811645
<[PropCost] = [Thousand]> = 0.46121321229624807

<[N112] = > = 0.9910128302117664

<[N112] = > = 0.00898716978823346

<[N143] = > = 0.9172494563262301

<[N143] = > = 0.08275054367376986

<[Burglary] = [false]> = 0.71

<[Burglary] = [true]> = 0.29

<[Earthquake] = [false]> = 0.842

<[Earthquake] = [true]> = 0.158

<[PropCost] = [HundredThou]> = 0.06

<[PropCost] = [Million]> = 0.01
<[PropCost] = [TenThou]> = 0.355
<[PropCost] = [Thousand]> = 0.5750000000000001

4. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
MakeModel = SportsCar)

<[PropCost] = [HundredThou]> = 0.09

<[PropCost] = [Million]> = 0.011
<[PropCost] = [TenThou]> = 0.34
<[PropCost] = [Thousand]> = 0.559

5. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
GoodStudent = True)

<[PropCost] = [HundredThou]> = 0.213

<[PropCost] = [Million]> = 0.038
<[PropCost] = [TenThou]> = 0.372
<[PropCost] = [Thousand]> = 0.377

<[N112] = > = 0.97

<[N112] = > = 0.03

<[N143] = > = 0.922

<[N143] = > = 0.078

A. Results

[Plot: Prefix Throwaway in Gibbs Sampling. x-axis: Size of Prefix Thrown Away
(0 to 1000); y-axis: KL Divergence (0 to 4.00E-03).]

Figure 6. Quality (KL divergence) of estimates produced by the Gibbs sampler.
Each run used 2000 samples and threw away the first x samples, where x is the
independent variable shown on the x-axis.

[Plot: Prefix Throwaway in Gibbs Sampling. x-axis: Size of Prefix Thrown Away
(0 to 1000); y-axis: Average KL Divergence (0 to 1.20E-03).]

Figure 7. Averages for different prefix throwaway sizes from Figure 6.

B. Discussion
In this analysis, we ran the Gibbs sampler with 2000 samples on the same
problem (Carpo – 1). For each run, we threw away a variable number of the
first samples. The idea is that since Gibbs sampling is a Markov chain
algorithm, each sample depends strongly on the samples before it. Since we
choose a random initial value for each variable, it can take some “burn in”
time before the algorithm begins to settle into the right global solution.

The results of our experiments are shown in Figure 6 and Figure 7. The
averages form a fairly nice characteristic curve, with the only exception
being when we threw away the first 600 samples. Looking at the individual
runs, however, at x = 600 there was a single outlier with an extremely high
KL divergence; given the many runs that we did, we can ignore it. It seems
that the ideal “burn in” time, a trade-off between good initialization and
diversity of counted samples, is 800 samples.
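The burn-in idea can be sketched on a toy problem. The three-node network below (A and B are independent parents of an observed C) and all of its probabilities are hypothetical and chosen only so that the exact answer is easy to compute; they are not the Carpo network. The sampler discards the first `burn_in` samples, as in the experiment described above.

```python
import random

# Hypothetical toy network: A -> C <- B, with evidence C = True.
P_A = 0.3  # P(A = True)
P_B = 0.6  # P(B = True)
P_C = {(True, True): 0.9, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.1}  # P(C = True | A, B)

def joint(a, b):
    """P(A = a, B = b, C = True)."""
    return (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B) * P_C[(a, b)]

def gibbs_estimate(n_samples, burn_in, rng):
    """Estimate P(A = True | C = True), ignoring the first `burn_in` samples."""
    a, b = rng.random() < 0.5, rng.random() < 0.5  # random initial state
    kept = hits = 0
    for i in range(n_samples):
        # Resample A from P(A | B = b, C = True).
        w_true, w_false = joint(True, b), joint(False, b)
        a = rng.random() < w_true / (w_true + w_false)
        # Resample B from P(B | A = a, C = True).
        w_true, w_false = joint(a, True), joint(a, False)
        b = rng.random() < w_true / (w_true + w_false)
        if i >= burn_in:
            kept += 1
            hits += a
    return hits / kept

def exact():
    """P(A = True | C = True) by direct enumeration."""
    numerator = joint(True, True) + joint(True, False)
    return numerator / (numerator + joint(False, True) + joint(False, False))

print(exact(), gibbs_estimate(20000, 800, random.Random(0)))
```

A larger `burn_in` removes more of the samples that still reflect the random initial state, but it also leaves fewer samples to count, which is the trade-off described in the discussion above.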

7. Detailed Analysis – KL Divergences

A. Results
We present results indexed first by the algorithm (Likelihood Weighting,
then Gibbs Sampling) and then by the problem. Within each problem we display
two graphs: the first shows the results from ten runs, and the second shows
the KL divergence averaged across those runs.
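For reference, the quality measure used in these graphs can be sketched as follows. The direction KL(exact || estimate) and the natural logarithm are assumptions, since the report does not state either. The two distributions are copied from this report: the exact variable elimination result for the MakeModel = SportsCar query and one of the sampled estimates for the same query.

```python
import math

def kl_divergence(exact, estimate):
    """KL(exact || estimate) in nats; infinite if the estimate misses a possible value."""
    total = 0.0
    for value, p in exact.items():
        if p == 0.0:
            continue  # terms with p = 0 contribute nothing
        q = estimate.get(value, 0.0)
        if q == 0.0:
            return math.inf
        total += p * math.log(p / q)
    return total

# Exact posterior for the SportsCar query (variable elimination, Section 2).
EXACT = {"HundredThou": 0.17179333672003955, "Million": 0.03093877334365239,
         "TenThou": 0.34593039737969233, "Thousand": 0.45133749255661565}
# One sampled estimate for the same query (Section 5).
SAMPLED = {"HundredThou": 0.16339257873401916, "Million": 0.030620517617711222,
           "TenThou": 0.35048331774243846, "Thousand": 0.4555035859058312}

print(kl_divergence(EXACT, SAMPLED))
```

The divergence is zero only when the estimate matches the exact distribution, so lower curves in the figures mean better estimates.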

1. Likelihood Weighting
[Plot: Likelihood Weighting - Problem Insurance1. x-axis: Number of Samples
(x1000), ticks 1 to 20; y-axis: KL Divergence (0 to 7.00E-02).]

Figure 3. KL Divergences when applying Likelihood Weighting to P(PropCost |
Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar).

[Plot: Likelihood Weighting: Average KL Divergence - Problem Insurance1.
x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence (0 to 0.035).]

Figure 4. Average KL Divergence when applying Likelihood Weighting to
P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
MakeModel = SportsCar) for sample sizes between 100 and 2000.

[Plot: x-axis: Sample Size (100 to 2000); y-axis: KL Divergence
(0 to 7.00E-02).]

Figure 8. KL Divergences when applying Likelihood Weighting to P(PropCost |
Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True).

[Plot: Likelihood Weighting: Average KL Divergence - Problem Insurance2.
x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence (0 to 0.02).]

Figure 9. Average KL Divergence when applying Likelihood Weighting to
P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
GoodStudent = True).

[Plot: Likelihood Weighting - Problem 3. x-axis: Number of Samples
(100 to 2000); y-axis range: 0 to 8.00E-02.]

Figure 10. KL Divergences when applying Likelihood Weighting to P(N112 |
N64 = "3", N113 = "1", N116 = "0").

[Plot: Likelihood Weighting: Average KL Divergence - Problem Carpo1.
x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence (0 to 0.05).]

Figure 11. Average KL Divergence when applying Likelihood Weighting to
P(N112 | N64 = "3", N113 = "1", N116 = "0").

[Plot: Likelihood Weighting - Problem 4. x-axis: Number of Samples
(100 to 2000); y-axis range: 0 to 2.50E-02.]

Figure 12. KL Divergences when applying Likelihood Weighting to P(N143 |
N146 = "1", N116 = "0", N121 = "1").

[Plot: Likelihood Weighting: Average KL Divergence - Problem Carpo2.
x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence (0 to 0.007).]

Figure 13. Average KL Divergence when applying Likelihood Weighting to
P(N143 | N146 = "1", N116 = "0", N121 = "1").

2. Gibbs Sampling

[Plot: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 1.
x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to 1.2.]

Figure 14. Divergences resulting from Gibbs Sampling applied to P(PropCost | Age
= Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar) for
sample sizes between 1000 and 25000.

[Plot: Gibbs Sampling: Average KL Divergence vs Number of Samples for
Problem 1. x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to 0.4.]

Figure 15. Average divergence resulting from Gibbs Sampling applied to
P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel
= SportsCar) for sample sizes between 1000 and 25000.

[Plot: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 2.
x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to 1.2.]

Figure 16. Divergences resulting from Gibbs Sampling applied to P(PropCost | Age
= Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True) for
sample sizes between 1000 and 25000.

[Plot: Gibbs Sampling: Average KL Divergence vs Number of Samples for
Problem 2. x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to 0.25.]

Figure 17. Average divergence resulting from Gibbs Sampling applied to
P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou,
GoodStudent = True) for sample sizes between 1000 and 25000.

[Plot: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 3.
x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to 4.00E-03.]

Figure 18. Divergences resulting from Gibbs Sampling applied to P(N112 | N64 =
"3", N113 = "1", N116 = "0") for sample sizes between 1000 and 25000.

[Plot: Gibbs Sampling: Average KL Divergence vs Number of Samples for
Problem 3. x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to
1.20E-03.]

Figure 19. Average Divergence resulting from Gibbs Sampling applied to P(N112 |
N64 = "3", N113 = "1", N116 = "0") for sample sizes between 1000 and 25000.

[Plot: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 4.
x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to 0.035.]

Figure 20. Divergences resulting from Gibbs Sampling applied to P(N143 | N146 =
"1", N116 = "0", N121 = "1") for sample sizes between 1000 and 25000.

[Plot: Gibbs Sampling: Average KL Divergence vs Number of Samples for
Problem 4. x-axis: Number of Samples (1000 to 25000); y-axis range: 0 to 0.014.]

Figure 21. Average divergence resulting from Gibbs Sampling applied to P(N143 |
N146 = "1", N116 = "0", N121 = "1") for sample sizes between 1000 and 25000.

B. Discussion of Results
We note four interesting things:

1. Number of samples in Gibbs versus Likelihood Weighting

As seen from the figures in Section 7.A.1, Likelihood Weighting tends to
converge after about 500 samples, and always by 1000 samples in our problems
and analyses.

We originally assumed that Gibbs sampling would converge in about the same
time, if not faster. It turns out that Gibbs takes much longer; it typically
converges by 5000 samples, a full order of magnitude higher, as can be seen
from the figures in Section 7.A.2. This is likely because of the Markov chain
approach used: since each sample depends on the ones before it, it can take
many iterations before the algorithm settles into the global optimum, whereas
likelihood weighting draws each sample independently and weights it by the
likelihood of the evidence.
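For contrast with the chain-based Gibbs sampler, likelihood weighting can be sketched in a few lines: every sample is drawn independently from the priors of the non-evidence variables, and the evidence enters only through a weight. The toy network (A and B are parents of an observed C) and its probabilities are hypothetical and are not one of the project's networks.

```python
import random

# Hypothetical toy network: A -> C <- B, with evidence C = True.
P_A = 0.3  # P(A = True)
P_B = 0.6  # P(B = True)
P_C = {(True, True): 0.9, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.1}  # P(C = True | A, B)

def likelihood_weighting(n_samples, rng):
    """Estimate P(A = True | C = True) from independent weighted samples."""
    weighted_hits = total_weight = 0.0
    for _ in range(n_samples):
        a = rng.random() < P_A  # sample non-evidence variables from their priors
        b = rng.random() < P_B
        weight = P_C[(a, b)]    # C is not sampled; its likelihood is the weight
        total_weight += weight
        weighted_hits += weight * a
    return weighted_hits / total_weight

def exact():
    """P(A = True | C = True) by direct enumeration."""
    def joint(a, b):
        return (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B) * P_C[(a, b)]
    numerator = joint(True, True) + joint(True, False)
    return numerator / (numerator + joint(False, True) + joint(False, False))

print(exact(), likelihood_weighting(20000, random.Random(1)))
```

Because no sample depends on the previous one, there is no initial state to forget, which is consistent with the faster convergence observed above.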

2. Variance of time to converge can be high

The convergence of Likelihood Weighting in Problem 3, as illustrated in
Figure 10 and Figure 11, exhibits very interesting properties. In the other
problems, likelihood weighting runs tended to exhibit relatively low variance
in time to convergence. Here, however, we see some runs that converged very
quickly and others that took abnormally long. This high variance occurred
consistently in this problem, and thus is likely induced by some
characteristic of the problem; one likely explanation is that our query
variable is a leaf node in a very poly-tree-like network.

3. Convergence is logarithmic
This is an evident feature of all of the graphs, and it has enormous
implications for the choice of algorithm.

The criterion for “completeness” of an algorithm is that it arrives at the
right answer. The sampling methods that we surveyed unfortunately take
infinite time to arrive at the right answer, whereas variable elimination
always arrives at the exact answer. Thus, if a user needs completeness (i.e.
the right answer), they should probably use variable elimination.

If they only need a certain level of completeness, i.e. they want to be x%
right, they still cannot rely on a sampling method to deliver it on every
run. This gives rise to the “x% correct, y% of the time” metric. We certainly
see this in our graphs.
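One way to make the "x% correct, y% of the time" idea concrete is to count the fraction of independent runs whose divergence falls within a chosen tolerance. The sketch below is an illustration only: the tolerance and the list of per-run divergences are made-up values, not measurements from this project, and `fraction_within` is a hypothetical helper.

```python
def fraction_within(divergences, tolerance):
    """Fraction of runs whose KL divergence is at most `tolerance`."""
    return sum(d <= tolerance for d in divergences) / len(divergences)

# Made-up divergences for four independent runs at a fixed sample size.
runs = [0.001, 0.004, 0.0005, 0.02]
print(fraction_within(runs, tolerance=0.005))  # three of the four runs qualify
```

Repeating this at each sample size gives a curve of "y" against the number of samples for a fixed tolerance, which is a more honest summary of a sampler than a single run.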

4. Local optima in Gibbs sampling, but not in Likelihood Weighting

This is a very interesting point. In both problems 3 and 4 from Task 2, one
of the Gibbs sampling runs does not converge to zero. Instead, it seems to
converge to a local optimum (which is not the global optimum). This can be
seen in the pink line in Figure 18 and the jungle green line in Figure 20.

This is probably more likely in some networks than in others. We could
probably construct a very simple network that would not provoke this
behavior.
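The report does not diagnose why those runs stalled. One standard cause, shown here purely as an assumption and not as an analysis of the Carpo network, is a deterministic (or nearly deterministic) relationship between variables, which can trap a Gibbs chain in the region where it started. In the hypothetical two-node network below, B is an exact copy of A, so the chain can never leave its initial state even though the true marginal P(A = True) is 0.5.

```python
import random

def gibbs_copy_network(start, n_samples, rng):
    """Gibbs estimate of P(A = True) when A ~ Bernoulli(0.5) and B copies A exactly."""
    a = b = start
    hits = 0
    for _ in range(n_samples):
        # P(A = True | B = b) is 1 if b is True and 0 otherwise.
        a = rng.random() < (1.0 if b else 0.0)
        # P(B = True | A = a) is 1 if a is True and 0 otherwise.
        b = rng.random() < (1.0 if a else 0.0)
        hits += a
    return hits / n_samples

print(gibbs_copy_network(True, 1000, random.Random(0)))
print(gibbs_copy_network(False, 1000, random.Random(0)))
```

Each run reports an estimate determined entirely by its starting state, so a plot of its divergence would level off at a nonzero value, much like the stalled runs described above. Likelihood weighting does not have this failure mode because its samples are independent.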

C. Computational Considerations – Sampling versus Variable Elimination

In comparing the computation time of sampling methods to variable
elimination, we limit ourselves to greedy ordering variable elimination,
since random ordering is very sub-optimal (see Section 4).

It turns out that for the networks and queries that we considered, variable
elimination wins on both accuracy and speed. As can be seen from Table 2,
variable elimination took about a second or less on each problem, while Gibbs
sampling took roughly 13 to 19 seconds and Likelihood Weighting took around
5 seconds.

This is with 1000 samples for the sampling algorithms, and effectively
infinite samples for variable elimination.

Our results might have been different if the networks involved were much
denser (i.e. more connected) or much larger.

                       Insurance 1   Insurance 2   Carpo 1   Carpo 2
Variable Elimination         0.741         1.142     0.120     0.090
Gibbs Sampling              12.778        13.530    19.228    18.045
Likelihood Weighting         4.377         4.687     5.608     5.317

Table 2. Execution time (seconds) of various algorithms on the four problems
from Task 2.