
Department of Computer Science

Technical University of Cluj-Napoca

Intelligent Systems
Laboratory activity 2018-2019

Project title: Actor-Critic method


Tool: Pytorch

Name: Chirodea Mihai Cristian — Condrea Stefan


Group: 30434
Email: m.chirodea@gmail.com — stefan.condrea.7@gmail.com

Contents

1 Introduction 3
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Main functionalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Installing the tool and running it . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Algorithm details and Examples 4


2.1 Algorithm details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Example description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Example analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 Project Details 7
3.1 What will the system do . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Narrative description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.4 Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.5 Top level design of the scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.6 Knowledge acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Chapter 1

Introduction

1.1 Overview
1. PyTorch is an open-source machine learning library for Python, based on Torch,
used for applications such as natural language processing. It is primarily developed
by Facebook's artificial-intelligence research group, and Uber's "Pyro" probabilistic
programming language is built on it.

1.2 Main functionalities


PyTorch provides two high-level features:

1. Tensor computation (like NumPy) with strong GPU acceleration.

2. Deep neural networks built on a tape-based autodiff system.

3. Our example outputs, at the end of each episode, the average reward for that episode;
additionally, every episode is animated for a better understanding of the algorithm.
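To illustrate the two features listed above, here is a minimal sketch (the tensor shapes and values are only illustrative):

    import torch

    # 1. Tensor computation (like NumPy), optionally on the GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(3, 3, device=device)
    y = torch.randn(3, 3, device=device)
    z = x @ y + 2.0                     # matrix multiply and broadcasting, as in NumPy

    # 2. Tape-based autodiff: operations on tensors created with
    #    requires_grad=True are recorded and differentiated by backward()
    w = torch.randn(3, 3, device=device, requires_grad=True)
    loss = ((x @ w) ** 2).mean()
    loss.backward()                     # fills w.grad with d(loss)/dw
    print(w.grad.shape)                 # torch.Size([3, 3])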

1.3 Installing the tool and running it


1. First install Anaconda.

2. Using Anaconda, run the commands conda install pytorch -c soumith, pip install
gym[all] and pip install numpy.

3. The commands needed are the following:

(a) Create an environment in Anaconda (conda create --name <env-name>).


(b) Run main.py using python main.py.
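To check that everything was installed correctly, a small sanity-check script can be run first (Pendulum-v0 is the environment used in Chapter 2):

    import torch
    import gym
    import numpy as np

    print(torch.__version__)            # PyTorch is importable
    print(torch.cuda.is_available())    # True if GPU acceleration can be used
    env = gym.make("Pendulum-v0")       # the pendulum environment from gym
    print(env.observation_space)        # Box(3,)
    print(env.action_space)             # Box(1,): the torque applied to the pendulum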

Chapter 2

Algorithm details and Examples

2.1 Algorithm details


1. Steps repeated by the program for an episode:

(a) First, the actor picks an action based on its policy and gets feedback from
the critic; the chosen action influences the environment and thus the reward
for that action. An episode consists of 10000 actions, and at the end of each
action the policies of the actor and the critic are updated, unlike standard
reinforcement learning methods where the update occurs only at the end of the episode.
(b) The longer the agent keeps the pendulum up, the more it is rewarded. As the
episode results in Section 2.3 show, at the beginning the actor does not perform
well, as expected, but after 500 episodes it manages to keep the pendulum up
for quite some time.
(c) Pseudo Code:
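Because the original pseudocode was embedded as an image, a simplified Python sketch of the per-action update described in (a) is given below. The network sizes, learning rates and the Gaussian policy are illustrative assumptions and not the exact implementation used in the example code.

    import gym
    import torch
    import torch.nn as nn

    env = gym.make("Pendulum-v0")
    obs_dim = env.observation_space.shape[0]     # 3 observations for the pendulum
    act_dim = env.action_space.shape[0]          # 1 torque value

    # small illustrative actor and critic networks
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
    critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    log_std = torch.zeros(act_dim, requires_grad=True)      # exploration noise of the policy
    actor_opt = torch.optim.Adam(list(actor.parameters()) + [log_std], lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    gamma = 0.99

    state = torch.as_tensor(env.reset(), dtype=torch.float32)
    for step in range(10000):                    # an episode consists of 10000 actions
        # the actor picks an action based on its policy
        dist = torch.distributions.Normal(actor(state), log_std.exp())
        action = dist.sample()
        next_obs, reward, done, _ = env.step(action.numpy())
        next_state = torch.as_tensor(next_obs, dtype=torch.float32)

        # the critic's feedback: one-step temporal-difference error
        value = critic(state).squeeze()
        next_value = critic(next_state).squeeze().detach()
        td_error = reward + gamma * (1.0 - float(done)) * next_value - value

        # both policies are updated at the end of every action,
        # not at the end of the episode
        critic_loss = td_error.pow(2)
        actor_loss = -dist.log_prob(action).sum() * td_error.detach()
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        state = next_state
        if done:
            break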

2.2 Example description


1. After many tries to find a properly working example, we managed to find one that
was almost working; we just had to understand it and correct the deprecated methods
and old incompatible code. This example uses the pendulum environment from gym,
in which the AI is required to keep a pendulum in the upright position by applying
torque to the left or to the right.

2. As for structure, the program is split into 5 files or modules: main, buffer, train,
models, utils.

2.3 Example analysis
1. As stated above, the output of the program is the reward for each episode and
an animation describing the decisions taken by the actor. For the analysis, we will
concentrate only on the reward and not on the animation.

2. Episodes up to 10:

(a) EPISODE :- 1 Episode Completed, AVG reward: -0.14063631507713914


(b) EPISODE :- 2 Episode Completed, AVG reward: -0.15245825319563192
(c) EPISODE :- 3 Episode Completed, AVG reward: -0.1553249504442359
(d) EPISODE :- 4 Episode Completed, AVG reward: -0.11799205773615105
(e) EPISODE :- 5 Episode Completed, AVG reward: -0.14789929131569005
(f) EPISODE :- 6 Episode Completed, AVG reward: -0.13701401380974024
(g) EPISODE :- 7 Episode Completed, AVG reward: -0.1439521036382022
(h) EPISODE :- 8 Episode Completed, AVG reward: -0.14397906303281907
(i) EPISODE :- 9 Episode Completed, AVG reward: -0.1279225645576127
(j) EPISODE :- 10 Episode Completed, AVG reward: -0.1543431574260769

3. Episodes from 501 to 511:

(a) EPISODE :- 501 Episode Completed, AVG reward: -0.012322675092690518


(b) EPISODE :- 502 Episode Completed, AVG reward: -0.03488088185067334
(c) EPISODE :- 503 Episode Completed, AVG reward: -0.023373503903452462
(d) EPISODE :- 504 Episode Completed, AVG reward: -0.0123346942695461
(e) EPISODE :- 505 Episode Completed, AVG reward: -0.013230738989921456
(f) EPISODE :- 506 Episode Completed, AVG reward: -0.025706292150818193
(g) EPISODE :- 507 Episode Completed, AVG reward: -0.02513481264172472
(h) EPISODE :- 508 Episode Completed, AVG reward: -0.03586770043686828
(i) EPISODE :- 509 Episode Completed, AVG reward: -0.03565572025182023
(j) EPISODE :- 510 Episode Completed, AVG reward: -0.025022857883360693
(k) EPISODE :- 511 Episode Completed, AVG reward: -0.00026405064733336984

Chapter 3

Project Details

3.1 What will the system do


The current system works only on the pendulum environment, and we intend to modify it by
changing its policies and models in order to make it race on a track.

3.2 Narrative description


The racetrack environment uses 3 main actions in order to move the car: brake, accelerate and
steer. The track is randomised on each run of the program, and the car has to follow the road
and go as fast as possible; if it goes off the track it gets -1000 points and dies. We have to
be careful not to press accelerate and steer at the same time, as the car is a powerful RWD
machine and will start losing traction.
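Assuming the racetrack environment referred to above is gym's CarRacing-v0 (the description given here matches it), a quick way to inspect it is:

    import gym

    # assumption: the racetrack environment is gym's CarRacing-v0
    env = gym.make("CarRacing-v0")
    print(env.action_space)        # Box(3,): [steering, accelerate, brake]
    print(env.observation_space)   # Box(96, 96, 3): an RGB image of the track

    obs = env.reset()              # a new randomised track is generated here
    # accelerate without steering, following the note above about traction
    obs, reward, done, info = env.step([0.0, 1.0, 0.0])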

3.3 Facts
The facts of the scenario are that we need to implement new policies for the actor and the critic
and make the connection with the environment, as in the current state the program crashes on
start when we give it the racetrack.

3.4 Specifications
When starting the program, the user will first see the animated window in which the car tries
to move along the track. For the first few episodes the agent will not perform very well but, as
in the pendulum environment, given time it will learn to control the car and eventually move
really fast along the track.

3.5 Top level design of the scenario


The main program resides in the main module; there the other modules are instantiated, and
there the actor takes the action and gets feedback from the critic. Other important modules
are the train module (which is responsible for updating the policies and loading the models
from file) and the models module (where the policies reside).
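A hedged sketch of how the top-level loop in main could tie these modules together is given below; the class and function names (MemoryBuffer, Trainer, get_action, optimize) are assumptions made for illustration and are not necessarily the identifiers used in the actual example code.

    # main.py -- illustrative structure only; the imported names are assumptions
    import gym
    from buffer import MemoryBuffer    # hypothetical replay-buffer class from the buffer module
    from train import Trainer          # hypothetical class that updates the policies and loads the models

    env = gym.make("Pendulum-v0")
    buffer = MemoryBuffer(size=1000000)
    trainer = Trainer(env.observation_space.shape[0],
                      env.action_space.shape[0],
                      buffer)           # builds the actor and critic defined in the models module

    for episode in range(1000):
        state = env.reset()
        total_reward, steps, done = 0.0, 0, False
        while not done:
            env.render()                               # the animated window
            action = trainer.get_action(state)         # the actor takes the action
            next_state, reward, done, _ = env.step(action)
            buffer.add(state, action, reward, next_state)
            trainer.optimize()                         # feedback from the critic, policies updated
            state = next_state
            total_reward += reward
            steps += 1
        # the average reward printed at the end of each episode, as in Section 2.3
        print("EPISODE :-", episode, "Episode Completed, AVG reward:", total_reward / steps)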

3.6 Knowledge acquisition
The project we chose requires us to take inspiration from other applications based on PyTorch,
and to take into account how those programs shape their policies. It also requires us to
understand how to connect with the tool and to understand each type of action it uses, so that
the reward is increased.

3.7 Related work


• Actor-Critic methods: https://towardsdatascience.com/understanding-actor-critic-method

• PyTorch examples: https://github.com/pytorch/examples/tree/master/reinforcement_learning

