
Andrew Emrazian – u0349111

CS 4300 - AI
Nov. 30, 2016
Assignment A8: Policy Iteration

1. Introduction

This assignment implements the policy iteration algorithm. The agent learns a policy in the
following Wumpus World scenario:

[4x4 grid figure showing the Wumpus (W) and pit (P) locations; the rest of the layout was lost in extraction]

Using the following actions: A = {UP, LEFT, DOWN, RIGHT}

The agent moves in the intended direction with probability 0.8, or to the left or right of the
intended direction with probability 0.1 each.

I hope to answer the same questions as in A7:

• Will the agent be able to pick policies that avoid death whenever possible?
• How many iterations will it take to converge to the best policies?
• Are the policies the same as in A7?

2. Method

The implementation follows the pseudo code given in the book. This assignment consists of the following noteworthy functions:

1. CS4300_MDP_policy_iteration.m

This is an implementation of Figure 17.7. The Markov Decision Process involves the states S,
the actions A, a transition model P, and the rewards R. The algorithm alternates between policy
evaluation and policy improvement, and this process repeats until the utilities change by less
than eta (a termination threshold defined by the user).

S = 1:16;                          % states of the 4x4 grid
A = [UP, LEFT, DOWN, RIGHT];       % action set
P = CS4300_transition_model();     % transition probabilities
R = -ones(1,16);                   % reward of -1 for every ordinary state
R(1,7) = -1000;                    % death states (Wumpus and pits)
R(1,11) = -1000;
R(1,12) = -1000;
R(1,16) = 1000;                    % goal state

[p,U,Ut] = CS4300_MDP_policy_iteration(S,A,P,R,100,.9999);
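
For reference, a minimal sketch of what the loop inside CS4300_MDP_policy_iteration might look like, following Figure 17.7. The variable names, the assumption that P is indexed as P(s_next, s, a) with actions numbered 1..4, and the termination test are all illustrative, not the exact implementation:

function [policy,U,Ut] = policy_iteration_sketch(S,A,P,R,k,gamma)
% Sketch of Figure 17.7: alternate policy evaluation and policy improvement
% until the policy no longer changes.  Assumes actions are indexed 1..4.
n = length(S);
U = zeros(1,n);                 % initial utilities
policy = ones(1,n);             % arbitrary initial policy
Ut = [];                        % trace of utilities across iterations
unchanged = false;
while ~unchanged
    % policy evaluation: k simplified Bellman updates under the fixed policy
    U = CS4300_policy_evaluation(policy,U,S,A,P,R,k,gamma);
    Ut = [Ut; U];
    unchanged = true;
    % policy improvement: one-step look-ahead at every state
    for s = S
        [best_val,best_a] = max(squeeze(P(:,s,:))' * U');
        if best_val > P(:,s,policy(s))' * U'
            policy(s) = best_a;
            unchanged = false;
        end
    end
end
end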

2. CS4300_policy_evaluation.m

Given the utilities and a set of policies (one action per state), this function calculates a new set of utilities.

U = CS4300_policy_evaluation(policy,U,S,A,P,R,k,gamma);
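
A sketch of the simplified Bellman update this evaluation performs, under the same P(s_next, s, a) indexing assumption; the handling of terminal states in the real function may differ:

function U = policy_evaluation_sketch(policy,U,S,A,P,R,k,gamma)
% Repeat k simplified Bellman updates with the policy held fixed:
%   U(s) <- R(s) + gamma * sum over s' of P(s'|s,policy(s)) * U(s')
for i = 1:k
    Unew = U;
    for s = S
        Unew(s) = R(s) + gamma * (P(:,s,policy(s))' * U');
    end
    U = Unew;
end
end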

3. CS4300_A8_driver.m

This function runs the policy iteration with a gamma of 0.99999. The utilities, the utilities
trace, and the policies of the last run are returned.

[U,Ut,p] = CS4300_A8_driver();

4. CS4300_transition_model.m

This function returns a state transition matrix with probabilities based on the Wumpus World
scenario described in the Introduction.

P = CS4300_transition_model();
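
As noted in the Critique, this matrix was hard coded. A sketch of how it could be generated dynamically for the 4x4 grid is shown below; the state numbering, the action encoding, whether UP decreases or increases the row index, and the absorbing treatment of the terminal cells are assumptions rather than the actual implementation:

function P = transition_model_sketch()
% Sketch: build P(s_next, s, a) for a 4x4 grid with 0.8/0.1/0.1 motion noise.
% Assumed conventions: states 1..16 numbered via sub2ind on a 4x4 grid,
% actions 1..4 = UP, LEFT, DOWN, RIGHT, bumping a wall leaves the agent in
% place, and the death/goal states (7, 11, 12, 16) are absorbing.
nS = 16;
terminals = [7 11 12 16];
offsets = [-1 0; 0 -1; 1 0; 0 1];               % [drow dcol] for UP, LEFT, DOWN, RIGHT
slips = {[1 2 4], [2 3 1], [3 4 2], [4 1 3]};   % intended, slip-left, slip-right
probs = [0.8 0.1 0.1];
P = zeros(nS, nS, 4);
for s = 1:nS
    if any(s == terminals)
        P(s, s, :) = 1;                         % absorbing terminal state
        continue
    end
    [r, c] = ind2sub([4 4], s);
    for a = 1:4
        dirs = slips{a};
        for j = 1:3
            rr = r + offsets(dirs(j), 1);
            cc = c + offsets(dirs(j), 2);
            if rr < 1 || rr > 4 || cc < 1 || cc > 4
                sp = s;                         % bumped a wall: stay in place
            else
                sp = sub2ind([4 4], rr, cc);
            end
            P(sp, s, a) = P(sp, s, a) + probs(j);
        end
    end
end
end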

5. CS4300_transition_model_example.m

This function returns a state transition matrix with probabilities based on the 4x3 example
from the book.

P = CS4300_transition_model_example();

3. Verification of Program

To verify the policy iteration, I used the example from the book. It lists the expected utilities for
the 4x3 grid as:
0.812    0.868    0.918    1
0.762    (wall)   0.660    -1
0.705    0.655    0.611    0.388

The utilities that my A7 function output were:

0.8116   0.8678   0.9178   1
0.7616   (wall)   0.6603   -1
0.7053   0.6553   0.6114   0.3878

The utilities that my A8 function output were:

0.8112   0.8676   0.9177   1
0.7611   (wall)   0.6601   -1
0.7047   0.6547   0.6107   0.3878

These numbers agree to within about 0.001.
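
The 4x3 check can be reproduced with a call along these lines. The -0.04 step reward matches the book's example, the trailing arguments mirror the call shown in the Method section, and goal_state / pit_state are placeholder indices that depend on the numbering used inside CS4300_transition_model_example:

S = 1:12;
A = [UP, LEFT, DOWN, RIGHT];
P = CS4300_transition_model_example();
R = -0.04*ones(1,12);          % step reward used in the book's 4x3 example
R(goal_state) = 1;             % +1 terminal (placeholder index)
R(pit_state) = -1;             % -1 terminal (placeholder index)
[p,U,Ut] = CS4300_MDP_policy_iteration(S,A,P,R,100,.9999);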

To verify the policies, I used the utilities from above to produce the best policy for each state
via a one-step look-ahead over the actions. My output, which matches the book's, is:

[Policy grid omitted: the arrow characters did not survive extraction; only the +1 and -1 terminal cells remain.]
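
Extracting the best policy from a fixed set of utilities is just this one-step look-ahead over the actions. A minimal sketch, under the same P(s_next, s, a) indexing assumption used above:

% greedy policy extraction from utilities U (indexing is an assumption)
policy = zeros(1, length(S));
for s = S
    [~, policy(s)] = max(squeeze(P(:, s, :))' * U');   % argmax over actions of sum_s' P(s'|s,a)*U(s')
end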

4. Data and Analysis

The utilities for the Wumpus World scenario are:

0.9837   0.9852   0.9871   1
0.9822   0.9811   -1       -1
0.9806   0.9793   -1       0.657
0.9791   0.9779   0.956    0.954
The best policies are:
[Policy grid omitted: the arrow characters did not survive extraction; only the +1 goal cell and the -1 death cells remain.]
When choosing R(s) = -100, the policies are:

[Policy grid omitted: the arrow characters did not survive extraction; only the -1 death cells remain.]

5. Interpretation

This algorithm did finish much faster than value iteration (A7). It seems more efficient to
iterate on the actions using the one-step look-ahead.

The agent did in fact avoid choosing actions that lead to death whenever possible. The only
exception was (4,2), where even the best action still leaves a 10% chance of death.

With the Wumpus example, it took only 5 iterations of policy evaluation to reach the best
policies.

6. Critique

If I had more time, I would have generated the transition model dynamically to verify its
correctness. I used Excel to hard-code the transition probabilities, but that approach seems prone to error.

7. Log

I spent about 5 hours coding the assignment.


I spent about 1 hour on this write-up.
12 hours in total
