
Reinforcement Learning


Reinforcement learning is a field of machine learning in which an agent learns on its own through actions and their rewards. There is no need to supervise what the agent does; instead, the agent performs an action in the environment, the state of the agent changes as a result, a new state occurs, and there is a reward associated with each state.


By taking different actions in the environment and collecting the resulting rewards, the agent learns what it is expected to do and what it should not do. For the behavior we want the agent to perform we give it a positive reward, and a negative reward for the behavior we want it to avoid.
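To make this loop concrete, here is a minimal sketch of the interaction in Python; the tiny one-dimensional environment, its six states, and the +1 reward at the goal are purely illustrative assumptions, not something defined in this article.

```python
# Minimal sketch of the agent-environment loop described above.
# The Environment class, its six states, and the +1 goal reward are
# illustrative assumptions for this example.
import random

class Environment:
    def reset(self):
        return 0                                        # start in state 0

    def step(self, state, action):
        next_state = max(0, min(5, state + action))     # toy world with states 0..5
        reward = 1.0 if next_state == 5 else 0.0        # positive reward only at the goal
        done = next_state == 5
        return next_state, reward, done

env = Environment()
state = env.reset()
done = False
while not done:
    action = random.choice([-1, +1])                    # the agent chooses an action
    state, reward, done = env.step(state, action)       # the environment returns a new state and a reward
```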

How Reinforcement Learning Works:

There are some basic concepts:

· S stands for state

· a stands for action

· R stands for reward

· ɣ stands for the discount factor

Suppose we have a maze into which an agent is introduced. In that maze there is one state with a positive reward and one state with a negative reward. The agent takes different actions and eventually reaches the positive-reward or negative-reward state. Now there should be some way for the agent to remember which path brought it to those reward states, so that next time it can again reach the state that gave it a positive reward and avoid the path that led it to the negative reward.

For this purpose the Bellman equation is used:

[Image: the Bellman equation]
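In its standard deterministic form, writing s′ for the state reached by taking action a in state s, the equation reads:

$$ V(s) = \max_{a} \bigl( R(s, a) + \gamma \, V(s') \bigr) $$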

With the help of this equation we assign a specific value to each state of the maze, and each value shows how important choosing that state is for reaching the positive-reward state.

[Image: maze with a green box (positive reward) and a red box (negative reward)]

In the maze above, the green box gives us a positive reward while the red box gives a negative reward. The agent wants to reach the green box and avoid the red box. Using the Bellman equation, we first set the value of the box closest to the green box, which is 1. The value of the second box is less than the value of the first box, which shows that this state is less important for reaching the reward state than the first one. Now suppose the agent is at the state with value 0.66. The agent will prefer to move toward the state whose value is greater than the value of its own state. So, with the help of these values, the agent learns how to move. The values of all the boxes are given below.

[Image: maze with a value assigned to every state]
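As a rough illustration of how such values can be computed, here is a minimal value-iteration sketch for a small deterministic gridworld. The 3x4 layout, the positions of the reward cells and the wall, and ɣ = 0.9 are assumptions chosen to mimic the maze, not values taken from the figure.

```python
# Minimal value-iteration sketch for a small deterministic gridworld.
# The grid layout, reward placement, and gamma = 0.9 are illustrative assumptions.
import numpy as np

ROWS, COLS = 3, 4
GAMMA = 0.9                                   # discount factor
REWARDS = {(0, 3): +1.0, (1, 3): -1.0}        # green (positive) and red (negative) boxes
WALLS = {(1, 1)}                              # one blocked cell, as in the classic 3x4 maze
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic move: go to the target cell if it is valid, otherwise stay put."""
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in WALLS:
        return (r, c)
    return state

V = np.zeros((ROWS, COLS))
for _ in range(100):                          # sweep until the values settle
    for r in range(ROWS):
        for c in range(COLS):
            if (r, c) in REWARDS or (r, c) in WALLS:
                continue                      # terminal and blocked cells are not updated
            # Deterministic Bellman backup: V(s) = max_a [ R(s') + gamma * V(s') ]
            V[r, c] = max(REWARDS.get(s2, 0.0) + GAMMA * V[s2]
                          for s2 in (step((r, c), a) for a in ACTIONS))

print(np.round(V, 2))
```

With these assumptions the cell next to the green box ends up at 1.0, and the values decay by a factor of 0.9 per step away from it (0.9, 0.81, 0.73, 0.66, ...), which is consistent with the 0.66 mentioned above.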

Deterministic and Non-Deterministic Environments

There are two kinds of environment. One is deterministic, in which the direction of the agent's movement is 100% certain: if we tell the agent to move up, it will move in the upward direction. There is also a non-deterministic environment, in which some randomness is involved in the agent's actions. For example, if I tell it to move up, there is an 80% chance it will move up, a 10% chance it will move right, and a 10% chance it will move left.

[Image: movement probabilities in a non-deterministic environment]

When a non-deterministic environment is involved, the plain Bellman equation fails to work. To solve this, the Markov Decision Process (MDP) was introduced. It is a form of the Bellman equation with a factor of probability added. In this method the previous states of the agent do not matter; only the current state and the probabilities of all the upcoming states matter. So we change the Bellman equation in this way:

[Image: modifying the Bellman equation for randomness]

We replace the value of the next state with the sum of the values of all possible next states, each multiplied by its probability.

So the new equation looks like this:

[Image: the Bellman equation for a non-deterministic environment (MDP form)]
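In standard notation, with P(s′ | s, a) denoting the probability of landing in state s′ after taking action a in state s, this form is usually written as:

$$ V(s) = \max_{a} \Bigl( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \Bigr) $$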

Now, after applying this formula to each state of the maze, we get the following values:

[Image: maze values computed with the MDP equation]
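Continuing the earlier sketch, the randomness can be modeled by replacing the deterministic backup with an expectation over the three possible outcomes of each action (80% intended direction, 10% each perpendicular slip). The layout and ɣ are the same assumptions as before.

```python
# Stochastic variant of the earlier sketch: the chosen action succeeds 80% of the
# time and slips to each perpendicular direction 10% of the time. Layout assumed.
import numpy as np

ROWS, COLS = 3, 4
GAMMA = 0.9
REWARDS = {(0, 3): +1.0, (1, 3): -1.0}
WALLS = {(1, 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def step(state, action):
    r, c = state[0] + ACTIONS[action][0], state[1] + ACTIONS[action][1]
    if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in WALLS:
        return (r, c)
    return state                                  # bumping into a wall means staying put

def backup(state, action, V):
    """Expected value of an action: 80% intended move, 10% each perpendicular slip."""
    outcomes = [(0.8, step(state, action))] + \
               [(0.1, step(state, slip)) for slip in SLIPS[action]]
    return sum(p * (REWARDS.get(s2, 0.0) + GAMMA * V[s2]) for p, s2 in outcomes)

V = np.zeros((ROWS, COLS))
for _ in range(100):
    for r in range(ROWS):
        for c in range(COLS):
            if (r, c) in REWARDS or (r, c) in WALLS:
                continue
            V[r, c] = max(backup((r, c), a, V) for a in ACTIONS)

print(np.round(V, 2))   # values near the red box come out lower than in the deterministic case
```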

Now you can see how these values differ from the values we received through the plain Bellman equation. Look at the bottom-right corner state: its value of 0.22 is smaller than the value 0.39 of the (2, 3) state. This shows that from the state with value 0.22 there is a large chance of getting the negative reward, in two ways.

First, from the state with V = 0.22: if the agent moves upward, it can end up in the negative-reward state.

Second, from the state with V = 0.22: if it moves left and then up, there is again a chance of slipping right into the negative-reward state.

Now I want to show both paths that the equations provide to the agent.

Path provided by the Bellman equation (also called a plan) for the deterministic environment:

[Image: plan produced by the Bellman equation in the deterministic maze]

Path provided by the MDP equation (also called a policy) for the non-deterministic environment:

[Image: policy produced by the MDP equation in the non-deterministic maze]

After seeing these results you may be surprised, because they show that the agent learns to stay away from the fire (the negative-reward state) even in the random situation. Take the bottom-right state: there the agent chooses to move downward, so with 80% probability it hits the wall and stays put, with 10% probability it moves right, and with 10% probability it moves left. That leaves a 0% chance of moving in the upward direction.

A well-known reinforcement learning algorithm built on these ideas is called Q-learning. Now let's move toward it.

What are the Q-Values?

Basically, Q-values show the quality of an action in each state. In each state there are four Q-values, one per action, each quantifying how good it is to perform that action, while each state has only one value.

[Image: the four Q-values of each state]

So there is a relation between the Q-values and the value of a state, which is given by the following equation:

[Image: relation between Q-values and state values]
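In the usual notation, the value of a state is the best Q-value available in it, V(s) = max_a Q(s, a), and the relation between Q-values and next-state values is typically written as:

$$ Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a') $$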

This formula is very close to the formula for the value of a state. Basically, to find the Q-value of moving in a specific direction (say, upward), take the reward of that action and add to it the discount factor multiplied by the sum, over all corresponding next states, of each state's probability times the maximum Q-value in that state.

Temporal Difference:

With the help of the temporal difference, the agent is able to compute the Q-values and the value of a state. Look at the equation:

[Image: the temporal-difference equation]

Basically, the above equation computes the difference between the previous Q-value and the new Q-value that arises due to the randomness. Because the Q-values change slowly with the passage of time, we call this the temporal difference.

The second equation gives the Q-value at time t; here alpha is the learning rate of the agent.

Now the final equation, which combines the above two equations, is given below.

[Image: the combined Q-learning update]
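Written out in the usual notation, the temporal difference and the resulting update are:

$$ \mathrm{TD}_t(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) $$

$$ Q_t(s, a) = Q_{t-1}(s, a) + \alpha \, \mathrm{TD}_t(s, a) $$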

With the passage of time the temporal-difference value decreases, so eventually the algorithm converges. But if the environment keeps changing, the policy will also change, and in order to keep learning it, the temporal difference will retain some value.
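Putting everything together, here is a minimal tabular Q-learning sketch for the same assumed gridworld. The start state, the epsilon-greedy exploration, the learning rate, and the episode count are illustrative choices, not taken from the article.

```python
# Minimal tabular Q-learning sketch using the temporal-difference update above.
# Grid layout, hyperparameters, and epsilon-greedy exploration are assumptions.
import random

ROWS, COLS = 3, 4
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2
REWARDS = {(0, 3): +1.0, (1, 3): -1.0}
WALLS = {(1, 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# One Q-value per (state, action) pair; terminal and blocked cells are excluded.
Q = {(r, c): {a: 0.0 for a in ACTIONS}
     for r in range(ROWS) for c in range(COLS)
     if (r, c) not in WALLS and (r, c) not in REWARDS}

def step(state, action):
    """One stochastic move: 80% intended direction, 10% each perpendicular slip."""
    intended = ACTIONS[action]
    slips = [(-intended[1], intended[0]), (intended[1], -intended[0])]
    move = random.choices([intended] + slips, weights=[0.8, 0.1, 0.1])[0]
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in WALLS:
        return (r, c)
    return state

for _ in range(5000):                                     # episodes
    state = (2, 0)                                        # assumed start in the bottom-left corner
    while state not in REWARDS:
        if random.random() < EPSILON:                     # explore sometimes
            action = random.choice(list(ACTIONS))
        else:                                             # otherwise act greedily
            action = max(Q[state], key=Q[state].get)
        nxt = step(state, action)
        reward = REWARDS.get(nxt, 0.0)
        nxt_value = 0.0 if nxt in REWARDS else max(Q[nxt].values())
        td = reward + GAMMA * nxt_value - Q[state][action]   # temporal difference
        Q[state][action] += ALPHA * td                       # Q-learning update
        state = nxt

print({s: max(q, key=q.get) for s, q in Q.items()})      # learned action per state
```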

(This lecture uses material from the Udemy course Artificial Intelligence A-Z: Learn How to Build an AI.)
