CS 4300 - AI
Nov. 30, 2016
Assignment A8: Policy Iteration
1. Introduction
This assignment implements the policy iteration algorithm. The agent learns a policy in the
following Wumpus World scenario:
[Figure: 4x4 Wumpus World grid containing the Wumpus (W) and pits (P); the cells correspond to the terminal rewards assigned in Section 2.]
The agent moves in the intended direction with probability 0.8, or 90 degrees to the left or
right of the intended direction with probability 0.1 each.
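As an illustration of this motion model, the distribution over actual headings given an intended heading can be written as follows (a Python sketch with my own names, not part of the assignment code):

```python
# Headings indexed to match A = [UP, LEFT, DOWN, RIGHT]: 0=up, 1=left,
# 2=down, 3=right, so +1 (mod 4) turns 90 degrees counterclockwise.

def motion_distribution(intended):
    """Return {actual_heading: probability} under the 0.8/0.1/0.1 model."""
    ccw = (intended + 1) % 4  # 90 degrees to the left of intended
    cw = (intended - 1) % 4   # 90 degrees to the right of intended
    return {intended: 0.8, ccw: 0.1, cw: 0.1}
```

For example, `motion_distribution(0)` (intended UP) gives probability 0.8 to UP and 0.1 each to LEFT and RIGHT.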
• Will the agent be able to pick a policy that avoids death whenever possible?
• How many iterations will it take to converge to the best policy?
• Are the policies the same as A7?
2. Method
This assignment consists of the following noteworthy functions:
1. CS4300_MDP_policy_iteration.m
This is an implementation of Figure 17.7 from the textbook. The Markov Decision Process involves the states S,
actions A, and a transition model P. The algorithm alternates between policy evaluation and
policy improvement, repeating until the utilities change by less than eta (a termination
threshold defined by the user).
S = 1:16;                       % states of the 4x4 grid
A = [UP, LEFT, DOWN, RIGHT];    % actions
P = CS4300_transition_model();  % transition model
R = -ones(1,16);                % step cost of -1 per state
R(1,7) = -1000;                 % terminal death cell (Wumpus or pit)
R(1,11) = -1000;                % terminal death cell (Wumpus or pit)
R(1,12) = -1000;                % terminal death cell (Wumpus or pit)
R(1,16) = 1000;                 % gold
[p,U,Ut] = CS4300_MDP_policy_iteration(S,A,P,R,100,.9999);
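The loop can be sketched as follows (a Python sketch under my own naming, not the assignment's MATLAB; `P[s][a]` is assumed to map successor states to probabilities):

```python
def policy_iteration(S, A, P, R, gamma, k=20):
    """Alternate k sweeps of policy evaluation with greedy policy
    improvement until the policy stops changing (Fig. 17.7 style)."""
    U = {s: 0.0 for s in S}
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, U, S, P, R, gamma, k)
        unchanged = True
        for s in S:
            # One-step look-ahead: expected successor utility per action.
            def expected(a):
                return sum(p * U[s2] for s2, p in P[s][a].items())
            best = max(A, key=expected)
            if expected(best) > expected(pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U

def policy_evaluation(pi, U, S, P, R, gamma, k):
    """k sweeps of the simplified Bellman update for the fixed policy pi."""
    for _ in range(k):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
             for s in S}
    return U
```

On a toy two-state MDP with one rewarding absorbing state, this converges to the policy that heads for the reward.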
2. CS4300_policy_evaluation.m
Given the current utilities and a fixed policy, compute an updated set of utilities by applying the simplified Bellman update k times.
U = CS4300_policy_evaluation(policy,U,S,A,P,R,k,gamma);
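A minimal sketch of this step (Python, my own names): the simplified Bellman update U(s) <- R(s) + gamma * sum over s' of P(s' | s, pi(s)) * U(s'), applied k times for a fixed policy:

```python
def policy_evaluation(pi, U, S, P, R, gamma, k):
    """Apply k sweeps of the simplified Bellman update for the fixed
    policy pi, where P[s][a] maps successor states to probabilities."""
    for _ in range(k):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
             for s in S}
    return U

# One absorbing state with reward 1 and gamma = 0.5: the utility
# approaches the geometric sum 1 / (1 - 0.5) = 2.
U = policy_evaluation({0: 'stay'}, {0: 0.0}, [0],
                      {0: {'stay': {0: 1.0}}}, {0: 1.0}, 0.5, 50)
```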
3. CS4300_A8_driver.m
This function runs policy iteration with a gamma of 0.99999. The utilities, utility trace, and
policy of the last run are returned.
[U,Ut,p] = CS4300_A8_driver();
4. CS4300_transition_model.m
This function returns a transition state matrix with probabilities based on the Wumpus World
scenario as described in the Introduction.
P = CS4300_transition_model();
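Since I hand-coded this matrix (see the Critique), here is a sketch of how it could be generated programmatically instead. This is Python with assumed conventions, not the assignment's actual model: I assume row-major numbering of the 4x4 grid starting at 1, that a blocked move leaves the agent in place, and I do not treat the terminal cells specially.

```python
N = 4  # grid is N x N
# Headings match A = [UP, LEFT, DOWN, RIGHT] as (dx, dy) offsets.
MOVES = {0: (0, 1), 1: (-1, 0), 2: (0, -1), 3: (1, 0)}

def build_transition_model():
    """P[s][a][s2] = probability of landing in s2 after action a in s."""
    P = {}
    for s in range(1, N * N + 1):
        x, y = (s - 1) % N, (s - 1) // N  # assumed row-major numbering
        P[s] = {}
        for a in MOVES:
            dist = {}
            # 0.8 intended heading, 0.1 for each perpendicular heading.
            for heading, prob in ((a, 0.8), ((a + 1) % 4, 0.1),
                                  ((a - 1) % 4, 0.1)):
                dx, dy = MOVES[heading]
                nx, ny = x + dx, y + dy
                if not (0 <= nx < N and 0 <= ny < N):
                    nx, ny = x, y  # blocked by a wall: stay put
                s2 = ny * N + nx + 1
                dist[s2] = dist.get(s2, 0.0) + prob
            P[s][a] = dist
    return P
```

Generating the matrix this way makes it easy to sanity-check that every row sums to 1, which is exactly the verification the hand-coded spreadsheet approach lacks.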
5. CS4300_transition_model_example.m
This function returns a transition state matrix with probabilities based on the 4x3 example
from the book.
P = CS4300_transition_model_example();
3. Verification of Program
To verify the policy iteration, I used the example from the book, which lists the expected
utilities for the 4x3 grid as:
0.812   0.868   0.918   +1
0.762    ##     0.660   -1
0.705   0.655   0.611   0.388
To verify the policies, I used the utilities from above to produce the best state policies. My
output, which matches the book's, is:
 →    →    →   +1
 ↑    ##   ↑   -1
 ↑    ←    ←    ←
When choosing R(s) = -100, the policies are:
[Policy grid lost in extraction; only the -1 terminal entries survived.]
5. Interpretation
This algorithm finished much faster than value iteration. It seems more efficient to iterate
over the actions using the one-step look-ahead.
The agent did in fact choose actions that avoid death whenever possible. The only exception
was (4,2), where even the best action still carries a 10% chance of death.
With the Wumpus example, it took only 5 iterations of policy evaluation to reach the best
policy.
6. Critique
If I had more time, I would have dynamically generated the transition model to verify its
correctness. I used Excel to hard-code the transition probabilities, but that approach seems prone to error.
7. Log