
Department of Computer Science & Engineering

SOUTHEAST UNIVERSITY

CSE4000: Research Methodology


Gamebot Designing

A dissertation submitted to Southeast University in partial fulfillment of the requirements for the degree of B.Sc. in Computer Science & Engineering

Submitted by
Syed Intiser Ahsan Kazi Fazle Azim Rabi Sakib Shahriyar Pathan
ID: 2013100000001 ID: 2013200000034 ID: 2013200000042

Supervised by
Monirul Hasan
Coordinator, Department of CSE
Southeast University

Copyright © 2017
September, 2017
Letter of Transmittal
December 9, 2017

The Chairman,
Department of Computer Science & Engineering,
Southeast University,
Banani, Dhaka.

Through: Supervisor, Monirul Hasan

Subject: Submission of proposal for CSE4000: Research Methodology

Dear Sir,
With due respect, we have planned to carry out research on Gamebot Designing under the course Research Methodology, aiming to build a gamebot which can play games on its own based on game image pixels.

We shall try our level best to complete our project according to our plan. Please feel free to ask for any query or clarification that you would like us to explain. We hope you will appreciate our hard work and excuse any minor errors, and we thank you for your cooperation.

Thank you.

Sincerely yours, Supervisor:

Syed Intiser Ahsan Monirul Hasan


ID: 2013100000001 Coordinator, Department of CSE
Batch: 34 Program: CSE Southeast University

Kazi Fazle Azim Rabi


ID: 2013200000034
Batch: 35 Program: CSE

Sakib Shahriyar Pathan


ID: 2013200000042
Batch: 35 Program: CSE
Certificate
This is to certify that the research titled Gamebot Designing has been submitted to the respected members of the board of examiners of the Faculty of Science and Engineering in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science & Engineering by the following students and has been accepted. This report has been carried out under my guidance.

Author: Approved by the Supervisor:

Syed Intiser Ahsan Monirul Hasan


ID: 2013100000001 Coordinator, Department of CSE
Batch: 34 Program: CSE Southeast University

Kazi Fazle Azim Rabi


ID: 2013200000034
Batch: 35 Program: CSE

Sakib Shahriyar Pathan


ID: 2013200000042
Batch: 35 Program: CSE
Abstract

Reinforcement Learning (RL) is the type of learning in which an agent learns the behaviour of a non-stationary environment through trial and error. Unlike supervised learning, RL does not require any prior knowledge of the environment, and there is no explicit cost or error signal to make corrections. This makes RL a particularly appropriate technique for designing gamebots. In this study we review a few fundamental RL algorithms that are useful for building gamebots, together with the fundamental machine learning algorithms that are prerequisites for RL. Along with the review, we also implement these algorithms and present our performance analysis using benchmark tools such as OpenAI Gym.

Acknowledgements

First of all, all praise to the almighty for everything.

We thank our University for allowing us to carry out our research.

We place on record our sincere thanks to our supervisor Monirul Hasan sir, Coordinator of the Department, for providing us with all the necessary facilities for the research and supporting us throughout the whole time. Without his valuable guidance and encouragement, it would have been very difficult for us to finish the research.

We would also like to thank the Chairman of the CSE Department, Shahriar Manzoor sir, for guiding us.

We take this opportunity to express our gratitude to all of the Department fac-
ulty members for their help, support, and encouragement throughout the venture.

And finally, our sense of gratitude goes to one and all who have directly or indirectly helped us to continue our journey.

Contents

Abstract

Acknowledgements

List of Figures

1 Introduction
1.1 Motivation
1.2 Goal
1.3 Overview

2 Literature Review
2.1 Linear Regression
2.2 Logistic Regression
2.3 Artificial Neural Network
2.4 Monte Carlo
2.5 Markov Decision Process (MDP)
2.5.1 Markov Property
2.5.2 Transition Matrix
2.5.3 Markov Reward Process
2.6 Policy Evaluation
2.7 Policy Iteration
2.8 Value Iteration
2.9 Temporal Difference
2.10 SARSA
2.11 Q Learning
2.12 Convolutional Neural Network
2.12.1 Convolution Layer
2.12.2 Pooling Layer

3 Problem
3.1 Frozen Lake
3.2 Cart Pole

4 Implementation
4.1 Linear Regression
4.2 Logistic Regression
4.3 Artificial Neural Network
4.4 Policy Evaluation
4.5 Policy Iteration
4.6 Value Iteration
4.7 Q-Learning
4.7.1 FrozenLake
4.7.2 CartPole
4.8 Upload To Gym

5 Result and Evaluation
5.1 Results
5.1.1 FrozenLake
5.1.2 CartPole
5.2 Analysis
5.2.1 FrozenLake
5.2.2 CartPole

6 Limitations

7 Conclusion

Bibliography

List of Figures

1.1 Our agent playing FrozenLake-v0, a toy text game
2.1 Top-down Hierarchy of Reinforcement Learning
2.2 Pooling layer of a CNN
3.1 FrozenLake game
3.2 CartPole game
4.1 Linear Regression and result
4.2 Logistic Regression data and result
4.3 ANN on Handwriting recognition
4.4 Grading system
4.5 CGPA Classification
4.6 Policy Evaluation
4.7 Policy Iteration Optimal Policy
4.8 Policy Iteration Optimal Value
4.9 Value Iteration Optimal actions
5.1 Our agent's performance on FrozenLake-v0, a toy text game
5.2 Our agent's performance on CartPole, a classic game
Chapter 1

Introduction

Machine learning is an application of Artificial Intelligence in which a system accesses data and uses it to learn automatically, making better decisions after observing the data. It also saves us from writing complex and excessive code. This research explores the field of machine learning and implements machine learning algorithms to design gamebots that can play games on their own, based on the game environments provided by OpenAI Gym for the agent to observe.
OpenAI Gym is a toolkit and a collection of benchmark problems (games) that shares a platform and a website where people can share their agents' results and compare the performance of algorithms. The agent's experience is divided into a series of episodes, and the agent's initial state is randomly sampled in each episode [1]. This platform gives us the necessary game environments, which are observed by the agent.


Observing OpenAI Gym's environments, we trained our agent to take actions based on the rewards and states of the environment, implementing Q-Learning to build the agent. Q-learning is a model-free, off-policy reinforcement learning algorithm. For any finite Markov Decision Process, Q-learning can be used to find an optimal action-selection policy [2].

Figure 1.1: Our agent playing FrozenLake-v0, a toy text game

As stated before, to explore the field of machine learning we also went through some other machine learning algorithms and implemented them on different problems, which are discussed here.


1.1 Motivation

In the current world, much of what we see is associated with Artificial Intelligence, and Machine Learning is a prominent and renowned part of it. Scientists are putting in additional effort to make this field bigger and more beneficial and are attempting to tackle real problems with it. Computer hardware organizations are likewise adopting these methods to make their hardware perform even faster; Google, Nvidia and AMD are some of them [3][4]. Additionally, many robots are now being built which can act on their own. We know about Google's and Tesla's well-known self-driving car projects, and Apple has joined this venture too. Tesla's self-driving car is out on the market and people are using it [5]. These are all products and real-life implementations of machine learning, and these facts encouraged our enthusiasm for this field. Google introduced us to TensorFlow [6], a machine learning library which has had a huge impact on the field of Machine Learning. It is mainly an open-source software library for machine learning using data-flow graphs. An interesting part is that it can access the GPU for faster computation, without which training would take years to complete and be much more expensive. As Elon Musk opened OpenAI to all, it is a decent platform to test the performance of our machine learning algorithms, and having a bot which can play a game on its own is something worth trying for. These things motivated us and pushed us to go further in this field.


1.2 Goal

The title says Gamebot Designing; however, the objective was considerably more than that. Our primary objective was to introduce ourselves to Machine Learning. The objective of this research was to examine machine learning algorithms, analyze their performance, and after that build a generic agent which can play as many games as possible in the OpenAI Gym environment. To construct that agent, our plan was to try different learning algorithms and finally implement a Deep Reinforcement Learning (DRL) algorithm to do that.


1.3 Overview

We will discuss the machine learning algorithms, from the basics up to Reinforcement Learning, that are needed to complete our goal of designing a gamebot, including Markov chains, Q-learning, Artificial Neural Networks and Convolutional Neural Networks, along with some problems related to these algorithms that we solved in this paper. We will cover the game environments for which we built our agent, FrozenLake and CartPole, describe how we attempted to build the agent, and then discuss and analyze the results of our agent on these games. Last but not least, we will conclude with where and how to improve our research, along with possible future work.

Chapter 2

Literature Review

Before diving into Reinforcement Learning, one needs to have a solid foundation of
fundamental machine learning with some other important topics as shown in the
Top-Down RL Hierarchy. In this section, we cover those basic machine learning
approaches along with a brief introduction to reinforcement learning techniques.

Figure 2.1: Top-down Hierarchy of Reinforcement Learning


2.1 Linear Regression

Linear Regression is a machine learning technique that is very popular in supervised learning. In RL it is also used in combination with some core RL algorithms, such as SARSA(λ) and TD(λ), when the states are continuous or simply too many to represent in memory. Consider an independent variable on the X-axis and a dependent variable on the Y-axis. Linear regression is the formation of a relationship between the two variables through a line, which helps to predict the dependent variable for any future value of the independent variable [7].
A Linear Regression line has an equation of the form y = a + bX, where X is the explanatory variable and y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when X = 0).
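As a small illustration of this idea (not the implementation used in Chapter 4; the data points here are made up), the intercept a and slope b can be fitted by ordinary least squares with NumPy:

    import numpy as np

    # Hypothetical data: independent variable X and dependent variable y
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Fit y = a + b*X by ordinary least squares.
    # np.polyfit returns coefficients from highest degree to lowest: [b, a]
    b, a = np.polyfit(X, y, deg=1)

    print("intercept a = %.3f, slope b = %.3f" % (a, b))
    print("prediction for X = 6:", a + b * 6.0)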

2.2 Logistic Regression

Logistic Regression is another powerful machine learning technique that is also very popular in supervised learning. In Reinforcement Learning (RL) it probably has no direct contribution, but it sets a good platform for learning Neural Networks, which are a good tool to have under one's belt for Reinforcement Learning.


Logistic Regression is the formation of a model that estimates the probability of an event occurring or not occurring, depending on the values of categorical or numerical independent variables. Although it has regression in its name, it is a classification algorithm. It is very similar to Linear Regression in its learning approach, but the cost and gradient functions are formulated differently. Logistic regression uses a sigmoid (logit) activation function instead of the continuous output of linear regression, hence the name [7].

logit(p) = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + ... + b_k X_k    (2.1)

where p is the probability of the presence of elements. The logit activation function is defined as the logged odds:

odds = p / (1 - p) = (probability of having elements) / (probability of not having elements)    (2.2)

so the logit function is logit(p) = ln( p / (1 - p) )
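To make the relationship between the sigmoid and the logit concrete, here is a minimal sketch (the variable names and coefficient values are ours, purely for illustration):

    import numpy as np

    def sigmoid(z):
        # Inverse of the logit: maps log-odds z to a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def logit(p):
        # Log-odds of a probability p
        return np.log(p / (1.0 - p))

    # A linear combination b0 + b1*X1 + b2*X2 with made-up coefficients
    b = np.array([-1.0, 0.5, 0.25])   # [b0, b1, b2]
    x = np.array([1.0, 2.0, 4.0])     # [1, X1, X2]; the leading 1 is for the intercept

    z = b.dot(x)                      # the log-odds, logit(p)
    p = sigmoid(z)                    # probability of the event occurring
    assert np.isclose(logit(p), z)
    print("p =", p)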


2.3 Artificial Neural Network

Artificial Neural Networks (ANN) are widely used in supervised learning. In theory, there is research showing that a plain ANN is not the most appropriate technique to use with core Reinforcement Learning (RL) algorithms, but in practice there are some success stories with ANN. Although experts mostly discourage using the plain vanilla version of ANN with RL, recent research has shown ways to achieve great success using Deep Neural Networks, an advanced version of the basic one. Despite the fact that the basic version has had almost no success in the RL world, learning the basic ANN is a giant step towards using RL and Deep Neural Networks together.
An Artificial Neural Network (ANN) is a structure consisting of several components interconnected in layers, mainly used to classify items. The components are called artificial neurons. Each artificial neuron is itself a simple classifier with limited ability; we form an Artificial Neural Network by interconnecting a number of artificial neurons. The first layer of an ANN is called the input layer, the last layer is the output layer, and all the middle layers are called hidden layers. Like logistic regression, it outputs the probability of selecting or not selecting something. We can get continuous output by preprocessing our inputs [7].


2.4 Monte Carlo

We implemented a Monte Carlo algorithm for an exceptionally basic problem. Monte Carlo is a technique which takes random samples and calculates a value for a particular function [8]. We calculated the value of π. As we know, the area of a quarter circle with radius 1 depends on π. We took random coordinates (x, y), where the values of x and y are between 0 and 1, and checked with a function whether the distance of (x, y) from (0, 0) is at most 1. If the distance is less than or equal to 1, then the coordinate is inside the circle. We calculated the distance with the Euclidean distance formula,

D = sqrt((x - x_1)^2 + (y - y_1)^2)    (2.3)

After running many trials, the approximate value of π becomes more accurate. (The error percentage varies because we take new random points for every simulation.)

Running Time : 1000 gave a value with 1.451259% error

Running Time : 10000 gave a value with 0.916498% error

Running Time : 100000 gave a value with 0.038431% error

Running Time : 1000000 gave a value with 0.002185% error
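A minimal sketch of this estimator (the trial counts mirror the runs listed above; the exact errors will differ from run to run since the points are random):

    import math
    import random

    def estimate_pi(num_samples):
        # Count random points in the unit square that fall inside the quarter circle
        inside = 0
        for _ in range(num_samples):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:          # distance from (0, 0) is at most 1
                inside += 1
        # (area of quarter circle) / (area of unit square) = pi / 4
        return 4.0 * inside / num_samples

    for n in (1000, 10000, 100000, 1000000):
        estimate = estimate_pi(n)
        error = abs(estimate - math.pi) / math.pi * 100.0
        print("%8d samples: pi is approximately %.6f (%.6f%% error)" % (n, estimate, error))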


2.5 Markov Decision Process (MDP)

Before jumping into Reinforcement Learning we need to have a clear idea of environments, and this introduces the Markov Decision Process (MDP). A Markov Decision Process is a random process that formally describes a fully observable environment. To understand the Markov Decision Process we first need to go through the Markov property [9].

2.5.1 Markov Property

The Markov property states that the future is independent of the past given the present. It means that the probability of moving to the next state does not depend on all the previously visited states, but only on the current state. In short, the next state depends only on the current state. A state S_t is Markov if and only if

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, S_2, S_3, ..., S_t]    (2.4)


2.5.2 Transition Matrix

The Markov Decision Process deals with transition matrices. The transition matrix is a probability matrix that gives the probability of going from one state to another, where each row of the matrix sums to 1. A Markov process, also known as a Markov chain, is a tuple (S, P) where,

S is the set of states and

P is the state transition probability matrix, which can be defined as

P_{ss'} = P[S_{t+1} = s' | S_t = s]    (2.5)
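For intuition, a toy transition matrix for a three-state chain could look like the following (the probabilities are made up; the only requirement is that every row sums to 1):

    import numpy as np

    # Rows = current state s, columns = next state s'; entry = P[S_{t+1} = s' | S_t = s]
    P = np.array([
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.0, 0.3, 0.7],
    ])
    assert np.allclose(P.sum(axis=1), 1.0)    # each row is a probability distribution

    # Distribution over states after two steps, starting deterministically in state 0
    start = np.array([1.0, 0.0, 0.0])
    print(start.dot(P).dot(P))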

2.5.3 Markov Reward Process

But to take an action in a state we need to consider rewards, and this introduces us to the Markov Reward Process. A Markov reward process is, in short, a Markov chain with a value judgment attached to each state. The return is the total discounted reward, i.e. the cumulative sum of rewards obtained from all states from time step t onwards. The discount factor γ ∈ [0, 1] determines whether to focus more on future rewards or on the current rewards.

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (2.6)

The return is discounted so that an infinite return in a cyclic Markov Reward Process can be avoided. So the value of a state in a Markov Reward Process can be represented as

v(s) = E[G_t | S_t = s] = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]    (2.7)


So, a Markov Decision Process is basically a Markov reward process with decisions, or we can say with actions. In each state it gives a decision about which action to choose in order to end up in a new state. A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ) where,

S is a finite set of states

A is a finite set of actions available in each state

P is the transition probability matrix, which can be described as

P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]    (2.8)

R is a reward function

R^a_s = E[R_{t+1} | S_t = s, A_t = a]    (2.9)

γ is the discount factor

Once the Markov Decision Process (MDP) is defined, the policy is learned by doing policy iteration or value iteration to calculate the expected reward for each state, and the optimal policy gives us the best action to take [10].


2.6 Policy Evaluation

Policy Evaluation is one of the most fundamental parts of Reinforcement Learning; there is hardly an RL algorithm that does not use policy evaluation. Even for unknown MDPs, policy evaluation is used with Monte Carlo learning and Temporal Difference learning.
Policy evaluation is the computation of the value function v for all the states S under an arbitrary policy. In other words, it is just an estimation or prediction of the values of all the states for a known policy. As the policy is known for a known MDP, we can simulate the action that will be chosen under this policy using the Bellman equation. In policy evaluation we take all the actions that are possible from a state and return an expected reward for that particular state [11][12].

2.7 Policy Iteration

Just like Policy Evaluation, Policy Iteration is another fundamental Reinforcement Learning technique that is used in almost every RL algorithm. For all the known and unknown environments we have solved, we have used Policy Iteration directly with the control methods.
In the simplest terms, Policy Iteration is the repetition of Policy Evaluation with a new and better policy in every iteration. In the prediction problem, we evaluate the given policy just once and return the value function for that policy. In Policy Iteration we start with an arbitrary policy and, once we get the value function after evaluating this policy, we try to improve our policy based on the returned value function and send the new policy for evaluation, until the policy converges. Our agent can then use the optimal policy to navigate the environment and squeeze out as much reward as possible [11][12].


2.8 Value Iteration

Many algorithms in RL are built upon the ideas of Value Iteration. Although one is probably not going to use the Dynamic Programming (DP) version of Value Iteration in real-life or complex problems, it does help in understanding a few advanced RL techniques that utilize Value Iteration.
Policy Iteration works pretty well, but it has a drawback that can be avoided using other techniques. In Policy Iteration we evaluate the policy, and in every evaluation we wait for convergence, which requires quite a heavy amount of computational time. Instead, one can make use of just one sweep of Policy Evaluation and use that information to simultaneously improve the existing policy. This way of predicting values and immediately improving the policy is known as Value Iteration [11][12].


2.9 Temporal Difference

We also studied the Temporal Difference learning algorithm. To understand Temporal Difference we first needed a basic understanding of value functions. A value function is a function of a state (and possibly an action) that estimates how good it is to be in that state, or how good an action will be in a given state [11].
V^π(s) - the value of a state s given policy π, beginning in s and following π from there on.
Q^π(s, a) - the value of taking action a in state s under a policy π, starting from s, taking action a and from that point following policy π.
The Temporal Difference learning technique is used to estimate these value functions. Without estimating the value function, one would need to wait until the final reward is received before updating any state-action value, so it would take much time to update a value.

V(s_t) ← V(s_t) + α [R_t − V(s_t)]    (2.10)

where s_t is the state visited at time t, R_t is the reward received after time t and α is the learning rate.
In Temporal Difference (TD) learning, an estimate of the final reward is instead calculated at each state, and the state-action pair value is updated at every step.

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]    (2.11)

where r_{t+1} is the observed reward at time t + 1 and γ is the discount factor.


Temporal Difference learning approaches fall into two classes: on-policy and off-policy. In on-policy learning, the agent learns the value of the policy it actually follows to choose actions in each state. In off-policy learning, the agent learns the value of taking actions under a policy different from the one it follows.

2.10 SARSA

SARSA (state-action-reward-state-action) is an on-policy Temporal Difference (TD) learning algorithm. It updates its Q values based on the new action a', also known as the next action, and therefore a new reward r, chosen using the same policy that determined the original action. Basically, updates are done using the quintuple (s, a, r, s', a').

Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]    (2.12)

where s, a are the original (current) state and action, r is the reward observed in the following state and s', a' are the new (next) state-action pair [11].


2.11 Q Learning

Q-learning is a model-free reinforcement learning algorithm which is off-policy. For any finite Markov Decision Process, Q-learning can be used to find an optimal action-selection policy. Q-learning gives an expected (utility) value for each of the finite number of actions that can be taken in the states. In every state it takes the action with the maximum utility value, and after taking an action it updates the utility value for that action in that state [13]. The tabular Q-learning update rule is

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]    (2.13)

The update of the utility value depends on the reward the agent gets for taking an action in each state. Rewards are positive numbers for a good decision, and for a bad decision the reward may be a negative value; the reward is usually provided by the environment. Although Q-learning uses utility values just like the value iteration algorithm, it does not need to know the model of the environment in which it is playing the game [14][15].
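To make the difference between Eq. (2.12) and Eq. (2.13) concrete, here is a hedged sketch of both tabular updates on a NumPy Q table (the table size and hyperparameters are illustrative, not the ones from our experiments):

    import numpy as np

    n_states, n_actions = 16, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 1.0              # learning rate and discount factor

    def sarsa_update(Q, s, a, r, s_next, a_next):
        # On-policy: uses the action a_next actually chosen by the behaviour policy
        td_target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (td_target - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next):
        # Off-policy: uses the greedy (maximum) value over actions in the next state
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])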


2.12 Convolutional Neural Network

A Convolutional Neural Network (CNN) is an Artificial Neural Network which uses feed-forward computation and backpropagation with multiple perceptron layers. CNNs are very successful in visual recognition and are inspired by the connectivity pattern of biological neurons. There are many popular CNN implementations, but all of them use similar layers in different orders and different manners. Common layers are the convolutional layer, fully connected layers and pooling layers. The hinge loss of an SVM or the cross-entropy loss of Softmax is used as the loss function, and ReLU is used as the activation function; other activation functions have drawbacks for large amounts of data. The performance varies with the order of the layers and how each layer is configured [16].
The score function looks like the following:

f(X_i, W, B) = W X_i + B    (2.14)

where,

X is the input data

W is a matrix of weights which is multiplied with the input data and yields, for each example, a vector of scores over the classes, which helps us determine which class it belongs to (we choose the maximum one)

B is the bias vector


The fully connected layer is the most common and regularly used layer in artificial neural networks; the neurons in adjacent layers are fully connected. Matrix operations are used very widely in the neural network. The output layer is the one where we get the scores for each class, and a commonly used activation on the way to it is f(x) = max(0, x).
There are several methods of calculating the loss. The loss here measures whether the score of the correct class is sufficiently greater than the scores of the other classes: the score of the correct class should maintain at least a minimum margin over all the other scores. There are two popular loss functions: the Multiclass Support Vector Machine (SVM) loss and the Softmax loss. The idea behind Softmax is different from SVM; it calculates the log probability of each class, and the probability of the correct class should be higher than that of the other classes. The loss function of the SVM is given below:

L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + Δ)    (2.15)

Here,

s_j, s_{y_i} denote the scores given by the score function f(X_i, W, B)

Δ (delta) is the minimum margin by which the correct class's score should exceed the other scores.

The softmax function is given below:

f_j(z) = log( e^{z_j} / Σ_k e^{z_k} )    (2.16)

where z is the score vector.


The above is the data loss. A problem with the weights W is that there can be multiple values of W that predict the scores correctly: if W predicts the scores correctly, then every multiple of W can do so as well. For that reason we also calculate a regularization loss. With W the weight matrix, the regularization loss function is given below:

R(W) = Σ_k Σ_l W_{k,l}^2    (2.17)

So the loss function L is

L = (1/N) Σ_i L_i + R(W)    (2.18)

where the first term is the data loss and the second term is the regularization loss.
The loss can be calculated with the function above. Minimizing the loss can be done by gradient descent: we evaluate the gradients of the loss function with respect to the parameters so that we can update the parameters in the direction that decreases the loss. The gradient is the vector of partial derivatives, and we can easily backpropagate and calculate the gradients with the chain rule.
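A hedged sketch of the multiclass SVM data loss with the squared (L2) regularization term described above; the scores, labels, margin Δ and regularization strength are made-up values for illustration:

    import numpy as np

    def svm_loss(scores, y, W, delta=1.0, reg=0.001):
        # scores: (N, C) class scores f(X_i, W, B); y: (N,) correct class indices;
        # W is used only for the regularization term R(W).
        N = scores.shape[0]
        correct = scores[np.arange(N), y][:, None]             # score of the correct class
        margins = np.maximum(0.0, scores - correct + delta)    # hinge margins, Eq. (2.15)
        margins[np.arange(N), y] = 0.0                         # do not count j == y_i
        data_loss = margins.sum() / N
        reg_loss = reg * np.sum(W * W)                         # R(W): sum of squared weights
        return data_loss + reg_loss

    # Tiny made-up example: 2 examples, 3 classes
    scores = np.array([[3.2, 5.1, -1.7],
                       [1.3, 4.9, 2.0]])
    y = np.array([0, 1])
    W = 0.01 * np.random.randn(4, 3)
    print(svm_loss(scores, y, W))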


2.12.1 Convolution Layer

The convolution layer is the vital layer of the CNN. The CNN layers are 3-dimensional volumes which map the image in different ways. What a filter does is take a small part of the image and produce an activation map in the convolutional layer. Filters of different sizes can be taken, and different strides are used to slide the filters in different manners.
Also, zero padding is used to resize the input image. The height and width of the convolutional layer depend on the stride, the padding and the dimensions of the filter, and can be calculated with the following equation:

Height or Width = (W − F + 2P)/S + 1    (2.19)

here,

W is the size of the image, F is the size of the receptive field of the filter, P is the amount of zero padding, and S is the stride that is applied

If the remainder of the division is zero, then that padding and stride value can filter the image properly. The depth of the output depends on the number of filters that are used.
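A small helper expressing Eq. (2.19) along with the divisibility check just described (the function name and example sizes are ours):

    def conv_output_size(W, F, P, S):
        # Eq. (2.19): spatial output size of a convolution layer
        if (W - F + 2 * P) % S != 0:
            raise ValueError("this filter, padding and stride do not tile the image evenly")
        return (W - F + 2 * P) // S + 1

    # 32x32 input, 5x5 filter, padding 2, stride 1 -> 32x32 activation map
    print(conv_output_size(32, 5, 2, 1))
    # 7x7 input, 3x3 filter, no padding, stride 2 -> 3x3 activation map
    print(conv_output_size(7, 3, 0, 2))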


2.12.2 Pooling Layer

The pooling layer is used in a convolutional neural network to reduce the number of parameters in the computation. Generally, a pooling layer is used after a convolutional layer. The pooling layer operates on every depth slice of the convolution output, with a given filter size and stride, resizing it using a MAX operation. A very common pooling layer applies a 2 × 2 filter. The picture below shows how it works:

Figure 2.2: Pooling layer of a CNN

Gamebot designing is not a new topic; different approaches have been used over the past years, and reinforcement learning has also been studied for years. Even within RL there are many techniques, as discussed above, so it is useful to have a review along with problems and a performance analysis of various RL algorithms in one place.

Chapter 3

Problem

To properly review and analyze the basic Machine Learning and Reinforcement Learning techniques, we tried to solve a few problems. Among them, the two problems we focused on most, using the more advanced RL algorithms, are OpenAI Gym's FrozenLake-v0 from the Toy Text section and CartPole-v0 from the Classic Control section, as these problems represent well the discrete and continuous cases that RL algorithms solve. In this part of the paper, we describe the environments of these two problems.

3.1 Frozen Lake

Frozen Lake is a simple grid world where a starting state and a goal state are given [17]. On the paths between the starting state and the goal state there are some holes, which need to be avoided. The other cells are frozen areas (states), which should be picked to walk on, but the frozen states are also slippery, so there is a probability of slipping in another direction when the agent takes an action.


For example, the agent may take an action to go up in the grid world but end up going right, left or down.
In the FrozenLake environment given by OpenAI Gym,

Figure 3.1: FrozenLake game

'S' denotes the starting state

'G' denotes the goal state

'F' denotes a frozen state

'H' denotes a hole

The actions in each state are up, down, left and right.


3.2 Cart Pole

On the other hand, the game Cart Pole is played with only two actions, and the states are not discrete. The objective of the game is to balance the pole by controlling the cart for as long as possible.

Figure 3.2: CartPole Game

To control the cart we can take two actions, Left and Right. As there is no discrete number of states, the OpenAI Gym environment provides 4 pieces of information about the current situation of the cart we are controlling and the pole that we are balancing [18]:

Cart's position

Pole's angle

Cart's velocity

Angle changing rate

For implementing Q-learning we divided the ranges of these values and made a discrete number of states out of them, so that we could run the algorithm.

Chapter 4

Implementation

As OpenAI Gym is a toolkit built with Python, we used Python 2.7 as our language along with related libraries like pandas, NumPy, SciPy and Matplotlib. We used Jupyter Notebook and Spyder as our editors.

4.1 Linear Regression

We applied Linear Regression to predict the profit of a business organization given the population size of a city. The dataset had 97 records, each containing the population size and the profit for an individual city. To solve this problem we implemented Linear Regression using the numpy, pandas and matplotlib libraries. We try to create a linear model of the data X, using some number of parameters theta, that describes the variance of the data, such that given a new data point that is not in X, we could accurately predict what the outcome y would be without actually knowing what y is.


The first thing we needed was a cost function. The cost function evaluates the quality of our model by calculating the error between the model's prediction for a data point, using the model parameters, and the actual data point. For example, if the true value for a given city is 4 and we predicted that it was 7, our error is (7 − 4)^2 = 3^2 = 9 (assuming an L2, or least-squares, loss function). We do this for each data point in X and sum the results to get the total cost. In this implementation we use an optimization technique called gradient descent to find the parameters theta.
The idea of gradient descent is that in each iteration we compute the gradient of the error term in order to figure out the appropriate direction in which to move the parameter vector. In other words, we calculate the changes to make to our parameters in order to reduce the error, thus bringing our solution closer to the optimal one. Here is the linear fit of the dataset.

(a) Population vs Profit

(b) Linear fit

Figure 4.1: Linear Regression and result
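For reference, a hedged sketch of the cost function and gradient-descent loop described above (the column-of-ones trick for the intercept, the data values and the hyperparameters are our own illustrative choices):

    import numpy as np

    def compute_cost(X, y, theta):
        # Least-squares cost J(theta) = 1/(2m) * sum((X.theta - y)^2)
        m = len(y)
        errors = X.dot(theta) - y
        return errors.dot(errors) / (2.0 * m)

    def gradient_descent(X, y, theta, alpha=0.01, iters=1000):
        m = len(y)
        for _ in range(iters):
            gradient = X.T.dot(X.dot(theta) - y) / m    # partial derivatives of the cost
            theta = theta - alpha * gradient            # step against the gradient
        return theta

    # Made-up data: city population (in 10,000s) and profit
    population = np.array([6.1, 5.5, 8.5, 7.0, 5.8])
    profit = np.array([17.6, 9.1, 13.7, 11.9, 6.8])

    X = np.column_stack([np.ones_like(population), population])   # add intercept column
    theta = gradient_descent(X, profit, np.zeros(2))
    print("theta:", theta, "cost:", compute_cost(X, profit, theta))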


4.2 Logistic Regression

Logistic Regression was applied to predict whether a student will get admission to a particular university based on two exam scores. The dataset had 100 records, each containing two exam scores and a boolean value indicating the student's admission status. For this we used the numpy, pandas and matplotlib libraries, along with optimize from SciPy, to implement logistic regression for the stated problem.
The first step is implementing the sigmoid function. The sigmoid function is the activation function for the output of logistic regression; it converts a continuous input into a value between zero and one. This value can be interpreted as the class probability, or the likelihood that the input example should be classified positively. Using this probability along with a threshold value, we can obtain a discrete label prediction.
The next thing we need is a cost function. The cost function evaluates the quality of our model by calculating the error between the model's prediction for a data point, using the model parameters, and the actual data point.
The next thing we write is a function that computes the gradient of the model parameters, to figure out how to change the parameters to improve the outcome of the model on the training data. At each training iteration we update the parameters in a way that is guaranteed to move them in a direction that reduces the training error. We can do this because the cost function is differentiable.
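A hedged sketch of the three pieces just described: the sigmoid, the logistic cost and a single gradient step (the array shapes are assumed, with an intercept column already prepended to X):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # Cross-entropy (negative log-likelihood) averaged over the m examples
        m = len(y)
        h = sigmoid(X.dot(theta))
        return -(y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h))) / m

    def gradient(theta, X, y):
        # Direction of a single gradient step; no descent loop is performed here
        m = len(y)
        h = sigmoid(X.dot(theta))
        return X.T.dot(h - y) / m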


We do not actually perform gradient descent in this function; we just compute a single gradient step. We use SciPy's optimization API (fmin_tnc), implemented by numerical-methods experts, to get output similar to gradient descent.
Once we have the optimal model parameters for our dataset, we need to write a function that will output predictions for a dataset X using our learned parameters theta. We can then use this function to score the training accuracy of our classifier.
We had 89.0% accuracy on the training data.

(a) Admission status Data

(b) Admission status scatter plot

Figure 4.2: Logistic Regression data and result


4.3 Artificial Neural Network

1. The famous handwritten digit recognition problem.

2. Predicting a letter grade from a real-valued GPA.

The above two problems were solved using an Artificial Neural Network. For the first problem we trained our model using 4000 examples, each containing the 400 pixels (20 × 20) of a digit in one row. For the second one we built our model using 175 randomly generated GPAs.
The Artificial Neural Network was implemented in Python using the numpy, pandas and matplotlib libraries, along with optimize and loadmat from SciPy. For the first problem we have 400 units in the input layer, and for the second problem we use 3 units in the input layer. For both problems we have 25 units in the hidden layer and 10 units in the output layer. The input and hidden layers also each have one extra bias unit.
To use an ANN as a classifier, the first thing one needs to do is write a routine that creates a vector of size n for each training example, where n is the number of classes. All the indices of this vector are zero, except the index that matches the label of that particular example.


Then we implemented the sigmoid function as our ANN's activation function, along with the sigmoid gradient function to calculate the gradient. The next few lines of code implement the forward-propagation function, which computes the hypothesis for each training sample given the current parameters. The shape of the hypothesis vector, which contains the prediction probabilities for each class, should match our previously created label vector.
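A hedged sketch of that forward pass for the 400-25-10 architecture described above (the weight matrices here are random placeholders; the real ones come from training):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_propagate(X, theta1, theta2):
        # X: (m, 400) pixel rows; theta1: (25, 401); theta2: (10, 26)
        m = X.shape[0]
        a1 = np.hstack([np.ones((m, 1)), X])      # add the bias unit to the input layer
        a2 = sigmoid(a1.dot(theta1.T))            # hidden layer activations, shape (m, 25)
        a2 = np.hstack([np.ones((m, 1)), a2])     # add the bias unit to the hidden layer
        h = sigmoid(a2.dot(theta2.T))             # output probabilities for the 10 classes
        return h

    # Placeholder parameters, only to show the shapes involved
    theta1 = 0.1 * np.random.randn(25, 401)
    theta2 = 0.1 * np.random.randn(10, 26)
    X = np.random.rand(5, 400)                    # five fake 20x20 digit images
    print(forward_propagate(X, theta1, theta2).shape)   # (5, 10)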
The next thing we write is the backpropagation algorithm, which computes the parameter updates that will reduce the error of the network on the training data. We start by calculating the cost using the currently available parameters. Then we do the main part, where we adjust the current parameters to reduce the error for the next iteration by computing the contribution of each layer to the total cost and adjusting appropriately.
In the final part, we again use SciPy's optimization API (minimize) to get the optimized thetas.
Once we have the optimal model parameters for our dataset, we write a predict function using our dataset X and the learned parameters theta. We can then use this function to score the training accuracy of our classifier. We got quite satisfying results for these two problems.


For the Handwritten Digit Recognition:

4967 accurate out of 5000

99.34% accuracy on training data

(a) Data we feed (b) Result we got

Figure 4.3: ANN on Handwriting recognition


And for CGPA TO Grade Prediction:

188 accurate out of 200

94.0% accuracy on cross-validation dataset

Figure 4.4: Grading system

Figure 4.5: CGPA Classification


4.4 Policy Evaluation

As previously mentioned, Policy Evaluation is used in every Reinforcement Learning algorithm, but its most basic use can be found in Dynamic Programming. As the Markov Decision Process is known, we can easily get a sense of how Policy Evaluation actually works. To understand this fundamental concept we solved a pretty simple grid world, where the target of the agent was to reach one of the two terminal states as early as possible, since there is a negative living reward.
To solve the policy evaluation problem on this grid-world environment, we first need to create a policy that we would like to evaluate. For simplicity, and to see the effectiveness of Policy Evaluation, we use a random policy. Then we write a function that takes a policy, a discount factor and a terminating threshold difference, and evaluates the given policy.


To evaluate the policy we first take a NumPy array and assign zero to all the states. Then we repeat until the value function V converges to a stable one. On each iteration we visit all the states; for each state we take each possible action, receive the expected discounted reward from all the states that action might lead us to, and assign the state a new expected value calculated as the likelihood of choosing each action multiplied by its received discounted reward, summed over all the actions. If the difference between the new value and the old value for every state is less than the terminating threshold, we consider our policy evaluation to have converged, so we stop and return the predicted values for the given policy.
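A hedged sketch of this iterative policy evaluation, assuming an environment object that exposes nS (number of states), nA (number of actions) and a transition model env.P[s][a] given as a list of (probability, next_state, reward, done) tuples, in the style of OpenAI Gym's discrete environments:

    import numpy as np

    def policy_evaluation(env, policy, gamma=1.0, theta=1e-6):
        # policy: (nS, nA) array, policy[s][a] = probability of taking action a in state s
        V = np.zeros(env.nS)
        while True:
            delta = 0.0
            for s in range(env.nS):
                v_new = 0.0
                for a, action_prob in enumerate(policy[s]):
                    # Expected discounted return of taking action a in state s
                    for prob, next_s, reward, done in env.P[s][a]:
                        v_new += action_prob * prob * (reward + gamma * V[next_s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:          # the value of every state has stopped changing
                return V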
It took 216 steps before convergence:

Figure 4.6: Policy Evaluation


4.5 Policy Iteration

Policy Iteration is used in almost every RL control algorithm, but its most basic use can be found in Dynamic Programming. As the MDP is known, we can easily get a sense of how Policy Iteration actually works. To understand this fundamental concept we solved the same pretty simple grid world, where the target of the agent was to reach one of the two terminal states as early as possible, since there is a negative living reward.
To solve this problem with Policy Iteration, we first need to create a policy that we would like to evaluate and gradually improve. To begin with, we use a random policy.
Then we continue to improve the policy until there is no change between the policy being evaluated and the newly found policy. To improve a policy, we first evaluate it and get the value function for this policy. Then, for each state, we consider all the actions; for each action we average over everything from all the new states this action might lead us to and assign that average to the action.


Once all the actions for a state have been considered, we adjust the probability of each action based on the outcome it produced: the better the outcome, the higher the probability in the improved version of the policy. If there is no change in the policy for any state, we stop, as it is the optimal policy.
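A hedged sketch of this improvement loop, reusing the policy_evaluation function sketched in Section 4.4 and the same assumed env.P interface:

    import numpy as np

    def policy_iteration(env, gamma=1.0):
        # Start from a uniformly random policy
        policy = np.ones((env.nS, env.nA)) / env.nA
        while True:
            V = policy_evaluation(env, policy, gamma)        # from Section 4.4
            stable = True
            for s in range(env.nS):
                # One-step lookahead: expected return of each action under V
                q = np.zeros(env.nA)
                for a in range(env.nA):
                    for prob, next_s, reward, done in env.P[s][a]:
                        q[a] += prob * (reward + gamma * V[next_s])
                best_a = np.argmax(q)
                if np.argmax(policy[s]) != best_a:
                    stable = False
                policy[s] = np.eye(env.nA)[best_a]           # greedy, deterministic update
            if stable:                                       # no state changed its action
                return policy, V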
It took only 3 steps before convergence. The final optimal policy and value function:

Figure 4.7: Policy Iteration Optimal Policy

Figure 4.8: Policy Iteration Optimal Value


4.6 Value Iteration

To get a good grip on Value Iteration we again use the simple grid world with two terminal states and a living penalty. Unlike Policy Evaluation and Policy Iteration, Value Iteration does not require one to start with any kind of policy. Although we gradually move towards a better policy, we do not need to declare any explicit policy to run this algorithm; we can always derive the optimal policy from the value function generated by Value Iteration.
So, to get the optimal value function, we repeatedly visit all the states in each iteration, and for each state we try all the actions. But instead of using the expected reward under a policy, we assign to each state the maximum reward over all the actions, and continue until there is no significant difference between the newly found value function and the old value function.
In this way, on each iteration we perform one sweep of policy evaluation, and because we take the maximum reward returned over all the actions, we immediately improve our implicit policy towards the optimal one.
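A hedged sketch of value iteration under the same assumed env.P interface; the greedy optimal policy can be read off the returned values afterwards:

    import numpy as np

    def value_iteration(env, gamma=1.0, theta=1e-6):
        V = np.zeros(env.nS)
        while True:
            delta = 0.0
            for s in range(env.nS):
                # One-step lookahead, then keep the best action's value (the max)
                q = np.zeros(env.nA)
                for a in range(env.nA):
                    for prob, next_s, reward, done in env.P[s][a]:
                        q[a] += prob * (reward + gamma * V[next_s])
                best = q.max()
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                return V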
Number of steps before convergence: 5 (without evaluating any policy for an infinite amount of time). Final optimal value function:

Figure 4.9: Value Iteration Optimal actions


4.7 Q-Learning

4.7.1 FrozenLake

OpenAI Gym provides the environments used to train the agents. For the FrozenLake environment we called the function gym.make() and passed the string FrozenLake-v0 as a parameter, as this is how it has to be called if we want to play Frozen Lake (the 4 × 4 grid world). From the environment of the game, we got the number of states and the number of actions.
We initialized the Q table, a 2D array where we store the utility values for the actions in each state. There are many ways of initializing a Q table; one way is to initialize it with random values. We did it another way, which is to make all the values zero initially.
We start by taking mostly random actions, and the chance of taking random actions decreases rapidly. To take random actions we use a variable epsilon which determines, in each episode, how many random actions we take. We initialized the epsilon value to 1.00 and the epsilon decay to 0.99. After each episode, the epsilon value is updated by multiplying it by the epsilon decay. In each time step of an episode, we take a random value between 0 and 1 and check whether it is greater than the epsilon value. If the random number is greater than the epsilon value, we take the action with the maximum utility value; otherwise, we take a random action.


To play in episodes, we restart the game with the function environment.reset(), where environment is the variable we set up with the FrozenLake environment by calling gym.make() as mentioned above. For taking each step we have to pass the action value, which is an integer (0-3) for this environment. The step function returns a tuple with 4 values:

1. Current state

2. Reward

3. Game-over status

4. Additional info

We used the Q-learning rule to update our utility values here,

Q(S_i, A_i) = Q(S_i, A_i) + α [R_i + γ max(Q(S_{i+1}, A_{i+1})) − Q(S_i, A_i)]

Here,

S_i = state at the i-th time step

A_i = action at the i-th time step

α = learning rate

γ = discount factor

R_i = the reward received at the i-th time step for taking action A_i in state S_i
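Putting the pieces above together, a hedged sketch of the training loop (the hyperparameters mirror the ones reported in Chapter 5, but the exact script is ours, and the older Gym API this report was written against is assumed):

    import gym
    import numpy as np

    env = gym.make("FrozenLake-v0")
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    alpha, gamma = 0.1, 1.0               # learning rate and discount factor (Section 5.2.1)
    epsilon, epsilon_decay = 1.0, 0.99

    for episode in range(2000):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if np.random.rand() > epsilon:
                action = np.argmax(Q[state])
            else:
                action = env.action_space.sample()
            next_state, reward, done, info = env.step(action)
            # Tabular Q-learning update (the rule shown above)
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
        epsilon *= epsilon_decay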


4.7.2 CartPole

Like the implementation of Frozen Lake, we initiated the CartPole environment with the function gym.make(). We passed the string CartPole-v0, as determined by the OpenAI library, for playing CartPole. The environment has no discrete number of states and has only 2 actions. In every step, the environment returns an observation of the 4 values we mentioned earlier: the cart's position, the pole's angle, the cart's velocity and the angle changing rate. The ranges of these 4 properties are also not discrete, and they can be obtained from env.observation_space.low and env.observation_space.high, where env is the environment object.
To make a discrete number of states, we divided the range of each property into 10 divisions, which we call bins. For example, 0 to 5.0 is divided into 10 bins, so 0 to 0.5 is bin number 0, 0.5 to 1.0 is bin number 1, and so on. As there are 10 divisions for each of the 4 properties, the total number of different permutations of the properties is 10^4 = 10,000, which is also the number of states.
As we now know the number of states and the number of actions, we can initialize the Q table. We did it the same way as in the Frozen Lake implementation (initial value zero). We got optimal performance (accepted by OpenAI) with a learning rate of 0.2. We took an initial epsilon value of 1.0 and an epsilon decay of 0.99. The env.reset() function was used to restart the game in each episode.


While playing the game, the environment returns those 4 property values, so to calculate the state after an action we wrote a function which returns the next state. First, we find the bin number in which each of the values lies, then multiply the four bin numbers by 10^0, 10^1, 10^2 and 10^3 respectively and sum them up, which gives us the state number.
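A hedged sketch of that discretization (the bin boundaries are built with numpy.linspace, and the clipping of the effectively unbounded velocity ranges is an illustrative choice, in the spirit of the range shrinking discussed in Section 5.2.2):

    import numpy as np

    N_BINS = 10

    def build_bins(env, clip=3.5):
        # Clip each observation range to [-clip, clip] (the velocity components are
        # effectively unbounded), then cut every range into N_BINS equal intervals.
        low = np.maximum(env.observation_space.low, -clip)
        high = np.minimum(env.observation_space.high, clip)
        return [np.linspace(l, h, N_BINS + 1)[1:-1] for l, h in zip(low, high)]

    def observation_to_state(observation, bins):
        # Bin each of the 4 observed properties, then combine the 4 digits into one
        # state index by weighting them with 10^0, 10^1, 10^2 and 10^3.
        digits = [int(np.digitize(x, b)) for x, b in zip(observation, bins)]
        return sum(d * 10 ** i for i, d in enumerate(digits))    # an integer in 0..9999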
Taking an action was similar to the previous game we played: we take a random value (between 0 and 1) at every time step of every episode and compare it with the epsilon value to choose between a random action and the action with maximum utility value. If the random value is greater than the epsilon value, the agent takes the optimal action; otherwise, it takes a random action. We used the same Q-learning update for the utility values as in the previous game:

Q(S_i, A_i) = Q(S_i, A_i) + α [R_i + γ max(Q(S_{i+1}, A_{i+1})) − Q(S_i, A_i)]

4.8 Upload To Gym

OpenAI gives us the game environments; our agent observes data and then learns to play the game. To know whether the agent is playing well or not, OpenAI has an option to submit the agent's recorded performance for evaluation. We need to call their wrappers.Monitor() function, which needs two parameters: the environment and a path to local storage where the agent's performance is saved as JSON files. To upload it to Gym we have to call their gym.upload() function, which takes that local storage path to the JSON files along with an API key that can be found after logging in to OpenAI.
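A hedged sketch of that submission flow, assuming the older Gym API this report was written against (the gym.upload scoreboard endpoint was removed in later Gym releases, and the path and key below are placeholders):

    import gym
    from gym import wrappers

    env = gym.make("FrozenLake-v0")
    # Record episode statistics (JSON files) and videos to a local folder
    env = wrappers.Monitor(env, "/tmp/frozenlake-experiment-1")

    # ... train and run the agent on env, as in Section 4.7.1 ...

    env.close()
    # Upload the recorded results to the OpenAI Gym scoreboard (older API)
    gym.upload("/tmp/frozenlake-experiment-1", api_key="YOUR_OPENAI_GYM_API_KEY")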

Chapter 5

Result and Evaluation

We measured our agents' performance based on OpenAI's evaluation: OpenAI analyzes our agents' performance and gives us a performance chart.

5.1 Results

After uploading our solution to OpenAI Gym, it gives a verdict on our solution and generates a performance graph. For different games, there are different goals that have to be achieved by an agent to get accepted.

5.1.1 FrozenLake

If an agent can score 0.78 or above on average over any 100 consecutive episodes, then the solution is accepted. The score reflects how close the agent got to the goal: if the agent reached the goal in an episode, its score for that episode is the highest possible, 1.0; if the agent is still stuck at the initial position after completing an episode, the score is the lowest one, which is 0.


Here is the performance graph of our agent, generated by OpenAI,

Figure 5.1: Our agent's performance on FrozenLake-v0, a toy text game

The picture above shows that our solution took 350 episodes to solve the task and scored 0.81 ± 0.04 on average over 100 consecutive episodes. It took 1 second to solve.


5.1.2 CartPole

The acceptance criterion for the game Cart Pole is that the agent has to score at least 195.0 on average over any 100 consecutive episodes. The agent gets a reward of 1.0 for surviving each timestep while playing the game, which means that for this game the score will be higher the longer we play.
The screenshot of the performance graph of our agent, generated by OpenAI:

Figure 5.2: Our agent's performance on CartPole, a classic game

Our agent took 3691 episodes to fulfill the requirement of OpenAI. The reward averaged 200.0 between episodes 3592 and 3691. The agent took 42 seconds to solve.


5.2 Analysis

5.2.1 FrozenLake

We mentioned in earlier sections that we implemented the Q-learning algorithm for the environment, or game, called Frozen Lake, and we submitted our solution to OpenAI. The solution we submitted reached the goal at most 756 times per thousand episodes (as we measured by running it). For this solution, we initialized the Q table values to zero, the learning rate was 0.1 and the discount factor was 1.0. Our agent reaches the goal about 653 times per thousand episodes (the value actually varies between 600 and 750). With a learning rate of 1.0 it reached the goal on average only 31 times (20 to 50 times) per thousand episodes. A lower gamma (γ) value did not give any better results either, although when the learning rate was 1.0, a lower gamma value did give better results. A learning rate of 0.5 also gave good results, but we stuck with a gamma value of 1.0. We also tried initializing the Q table differently. The all-zero initialization has one problem: although it gave us better results on average, it sometimes takes too many timesteps and scores much more poorly than the average. On the other hand, initializing the Q table with random values works better, and the score rarely falls.


5.2.2 CartPole

In OpenAI Gym, if an agent can score 195.0 or above on average for the game Cart Pole over 100 consecutive episodes, the solution is accepted. In one of our accepted solutions, we initialized the Q table randomly, took a learning rate of 0.2 and took a discount factor gamma (γ) of 1.0. It took 3691 episodes to solve the problem. We changed the learning rate for this problem to see the change in performance. Although learning rates of 0.1 and 0.05 worked similarly well for the game Frozen Lake, values higher than 0.2 but lower than 0.5 did not work better here. A gamma (γ) value of 1.0 was the better choice and gave better results. The results may vary with the other parts of the initialization for this problem.


As we mentioned earlier, there is no discrete number of states, so we had to create and define the states ourselves. We also mentioned that we get 4 property values from each observation and created bins between the lowest and highest values of each property, which define the range of each property's status. It turns out that we can initialize the bins differently. One of the properties is the pole angle, which gives the angle the pole currently makes with the cart. The pole angle's lowest and highest values are −3.5 to +3.5. The value 0 means the pole is in the middle, at a 90° angle with the cart, and a smaller or larger value indicates how far the pole has tilted towards one side. If the angle goes too far towards either the positive or the negative side, there is little possibility of balancing it; once it goes more than halfway towards either side, it is nearly impossible to balance any more. So we can make the range −1.75 for the low end and +1.75 for the high end and create the bins from that. This also improves the learning within that range, as the bins are smaller. The same is applicable to another property: as we want to balance the cart-pole, we do not want to move the cart at high velocity, since at higher velocity there is a greater chance of reaching an angle from which we can no longer balance. So we can also shrink the range of the velocity. Shrinking both properties gave much better results. For the same learning rate and gamma value, without shrinking the ranges the agent passed 195 timesteps about 1500 to 1600 times on average out of ten thousand episodes; after shrinking the ranges, it passed more than eight thousand times.

Chapter 6

Limitations

In the beginning we wanted to build a few gamebots, and we successfully designed quite a few. After working with plenty of discrete and a few continuous environments, we wanted to solve a good number of games with continuous state spaces, as those were the most famous and eye-catching ones. But it is totally impractical to store each continuous state in memory; the only way to handle these continuous state spaces is to approximate them as closely as possible using the features of a state and a trained parameter vector. One popular method for such representation is the Artificial Neural Network. But the problem with using a plain ANN with RL is that it fails badly, because RL environments are unstable and dynamic [11].
To overcome this problem efficiently, one must use Deep Neural Networks with the established RL algorithms. DNNs are usually studied by Masters or PhD students over a full semester. After exhaustively studying, implementing and analyzing RL, self-studying DNNs within this short period became very challenging for us, so we could not reach the goal that was set in the middle of our research.

Chapter 7

Conclusion

The uncertainty in RL makes it a really interesting, powerful and tough approach, all at the same time, to use in gamebot designing compared to the traditional learning methods. Most of the fundamental RL techniques, like Monte Carlo methods, TD, SARSA and Q-Learning, work just fine without any state transition probabilities, while methods such as DP require state transition probabilities. Almost all the algorithms that work for discrete state spaces scale well to continuous state spaces when used with DNNs, with some exceptions. There are still huge scopes yet to be discovered, and many interesting unanswered questions, especially about state approximation in the continuous case, remain open. We plan to devote our time to these in the future.

Bibliography

[1] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.

[2] F. S. Melo, "Convergence of Q-learning: A simple proof," Institute of Systems and Robotics, Tech. Rep., 2001.

[3] "What's new in deep learning and artificial intelligence from NVIDIA," 2017. [Online]. Available: https://www.nvidia.com/en-us/deep-learning-ai/

[4] Advanced Micro Devices, "AMD launches the world's fastest graphics card for machine learning development and advanced visualization workloads," 2017. [Online]. Available: https://goo.gl/VqN6zK

[5] "Autopilot," 2017. [Online]. Available: https://www.tesla.com/autopilot

[6] C. Zeng and J. Tunney, "Build your own machine learning visualizations with the new TensorBoard API," 2017. [Online]. Available: https://research.googleblog.com/search/label/TensorFlow

[7] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[8] S. M. Ross, Introduction to Probability Models, 6th ed. San Diego, CA, USA: Academic Press, 1997.

[9] J. Heinrich and D. Silver, "Deep reinforcement learning from self-play in imperfect-information games," CoRR, 2016. [Online]. Available: http://arxiv.org/abs/1603.01121

[10] "Reinforcement learning - intro to MDP by David Silver," 2015. [Online]. Available: https://www.youtube.com/watch?v=lfHX2hHRMVQ

[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[12] C. Weber, M. Elshaw, and N. M. Mayer (Eds.), Reinforcement Learning: Theory and Applications, 1st ed. I-TECH Education and Publishing, 2008. [Online]. Available: http://gen.lib.rus.ec/book/index.php?md5=87c7ad29029399c6cff6886a81855f8d

[13] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson, 2016.

[14] N. Heess, G. Wayne, D. Silver, T. P. Lillicrap, Y. Tassa, and T. Erez, "Learning continuous control policies by stochastic value gradients," CoRR, 2015. [Online]. Available: http://arxiv.org/abs/1510.09142

[15] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, 1992.

[16] A. Karpathy and J. Johnson, "CS231n: Convolutional neural networks for visual recognition," 2017. [Online]. Available: http://cs231n.github.io/

[17] OpenAI, "FrozenLake-v0 environment." [Online]. Available: https://gym.openai.com/envs/FrozenLake-v0/

[18] "CartPole-v0," Aug 2016. [Online]. Available: https://github.com/openai/gym/wiki/CartPole-v0

