Reinforcement Learning using OpenAI Gym
Muhammad Aleem Siddiqui
What is REINFORCEMENT LEARNING?
• In reinforcement learning, an agent takes actions in an environment and receives rewards. The ultimate goal is to maximize reward over time.
• A reinforcement learning algorithm, or agent, learns by interacting with its environment. The agent receives rewards for performing correctly and penalties for performing incorrectly. The agent learns without human intervention by maximizing its reward and minimizing its penalty.
Exploration and Exploitation
The only way to uncover the correct signal is to assume nothing, try out different things (explore), and learn to act optimally (exploit) based on environmental feedback. Balancing exploration and exploitation is what reinforcement learning is all about.
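A common recipe for balancing the two is an epsilon-greedy rule: with a small probability the agent explores a random action, otherwise it exploits the best action found so far. A minimal illustrative sketch (not from the slides; the action values and epsilon are hypothetical):
CODE:
import random

epsilon = 0.1          # fraction of the time we explore
n_actions = 2          # hypothetical two-action environment
value = [0.0, 0.0]     # estimated reward of each action so far

def choose_action():
    if random.random() < epsilon:
        # Explore: try something random
        return random.randrange(n_actions)
    # Exploit: pick the best-known action
    return max(range(n_actions), key=lambda a: value[a])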
What is an Agent, Environment, Action, Policy and Reward?
Agent: An algorithm / a robot / a game-playing program
Environment: What the agent interacts with
Action: What the agent can do
Policy: The strategy for choosing an action
Reward: What the agent receives for performing correctly
Agent and Environment
Agent:
An agent can be a program or a robot that receives inputs from the environment and performs some action based on that input.
Environment:
An environment is the actual setting the agent interacts with. An environment needs to be representable in a way the agent can understand. In examples it is often a game, but it can be any real-world or artificial environment.
Action, Policy and Reward
Action:
The actual interaction an agent performs on the environment: moving around in an environment, choosing the next move in a game, etc.
Policy:
The policy is the strategy of choosing an action given a state, in expectation of better outcomes.
Reward:
The metric that lets an agent understand whether its previous set of actions helped or hurt its overall goal.
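In code, a policy is simply a function from state to action. A toy illustrative sketch (hypothetical one-number state, not from the slides):
CODE:
def policy(state):
    # A policy maps a state to an action in expectation of a better outcome.
    # Hypothetical rule for a one-dimensional state.
    return 1 if state > 0 else 0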
Reinforcement Learning Process
Reinforcement Learning is the science of making optimal decisions from experience. Breaking it down, the process of Reinforcement Learning involves these simple steps (mapped onto code in the sketch after this list):
1. Observing the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the experience and refining the strategy
6. Iterating until an optimal strategy is found
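A minimal sketch of those six steps as a loop, using a hypothetical do-nothing RandomAgent as a placeholder (this foreshadows the Gym API used in the rest of the slides):
CODE:
import gym

env = gym.make('CartPole-v0')

class RandomAgent:
    # Hypothetical placeholder: acts randomly and learns nothing.
    def act(self, observation):
        return env.action_space.sample()       # 2.-3. decide and act (here: randomly)
    def learn(self, observation, reward):
        pass                                   # 5. a real agent refines its strategy here

agent = RandomAgent()
observation = env.reset()                      # 1. observe the environment
for _ in range(1000):
    action = agent.act(observation)
    observation, reward, done, info = env.step(action)   # 4. receive a reward or penalty
    agent.learn(observation, reward)
    if done:                                   # 6. iterate until the strategy is optimal
        observation = env.reset()
env.close()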
OpenAI Gym
OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence.
Gym is a toolkit for developing and comparing reinforcement learning algorithms.
The OpenAI Gym library is a Python library with a collection of environments that can be used with reinforcement learning algorithms.
Link: gym.openai.com
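To follow along, the library is typically installed from PyPI:
CODE:
pip install gym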
OpenAI Gym’s Environment
Here is an example of getting something running. This will run an instance of the “CartPole-v0” environment for 1000 time-steps, rendering the environment at each step.
CODE:
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
Environment (Contd.)
Gym’s “CartPole-v0” observation is a NumPy array with 4 floating-point values:
1. Horizontal position
2. Horizontal velocity
3. Angle of pole
4. Angular velocity
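A minimal sketch that prints one observation with its four fields named (labels follow the list above):
CODE:
import gym

env = gym.make('CartPole-v0')
obs = env.reset()
cart_pos, cart_vel, pole_ang, ang_vel = obs
print("horizontal position:", cart_pos)
print("horizontal velocity:", cart_vel)
print("pole angle:", pole_ang)
print("angular velocity:", ang_vel)
env.close()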
Functions of Gym
make(): Creates an environment.
reset(): Resets the environment to its default starting state and returns an initial observation.
render(): Opens a popup window displaying a simulation of the agent interacting with the environment.
step(): Performs the agent's action in the environment and returns a 4-tuple <observation, reward, done, info>.
sample(): Samples a random action for the agent from the action space.
close(): Closes the environment after actions are performed.
CODE:
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()
OpenAI Gym’s Observations
Observations are the environment-specific information variables:
Observation (object): An environment-specific object representing your observation of
the environment. For example, joint angles and joint velocities of a robot, or the
board state in a board game.
Reward (float): Amount of reward achieved by the previous action. The scale varies
between environments, but the goal is always to increase your total reward.
Done (boolean): Whether it’s time to reset the environment again. Most tasks are
divided into well-defined episodes, and done being True indicates the episode has
terminated. For example, perhaps the pole tipped too far, or you lost your last life.
Info (dict): Diagnostic information useful for debugging. It can sometimes be useful
for learning. For example, it might contain the raw probabilities behind the
environment’s last state change. However, official evaluations of your agent are not
allowed to use this for learning.
Observations (Contd.)
The process gets started by calling reset(), which returns an initial observation. So here is a more proper way of writing the previous code, with respect to episodes and the done flag:
CODE:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} time steps".format(t+1))
            break
env.close()
OUTPUT:
[-0.03327757 0.5649743 -0.0374682 -0.87239967]
[-0.02197809 0.7605852 -0.05491619 -1.17662316]
[-0.00676638 0.95637585 -0.07844866 -1.48600365]
[ 0.01236114 1.15236136 -0.10816873 -1.80211802]
[ 0.03540836 1.3485132 -0.14421109 -2.12636579]
[ 0.06237863 1.15508926 -0.18673841 -1.88149287]
Episode finished after 11 time steps
Making a Hard-Coded Policy for the Agent
CODE:
import gym
env = gym.make('CartPole-v0')
observation = env.reset()
for t in range(1000):
    env.render()
    # Defining a Hard-Coded Policy
    cart_pos, cart_vel, pole_ang, ang_vel = observation
    # Move Cart Right if Pole is Falling to the Right
    # Angle is measured off straight vertical line
    if pole_ang > 0:
        # Move Right
        action = 1
    else:
        # Move Left
        action = 0
    # Perform Action
    observation, reward, done, info = env.step(action)
env.close()
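As written, the loop above keeps stepping even after an episode ends. A hedged variant (an illustrative sketch of my own, not from the slides) that also weights the angular velocity and resets the environment when the pole falls:
CODE:
import gym

env = gym.make('CartPole-v0')
observation = env.reset()
for t in range(1000):
    env.render()
    cart_pos, cart_vel, pole_ang, ang_vel = observation
    # Push toward the side the pole is falling, accounting for how fast it swings
    action = 1 if (pole_ang + 0.5 * ang_vel) > 0 else 0
    observation, reward, done, info = env.step(action)
    if done:
        # Start a fresh episode instead of stepping a finished one
        observation = env.reset()
env.close()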
Using Neural Networks in Reinforcement Learning
[Figure: the ReLU activation function, f(x) = max(0, x)]
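In code, ReLU is just a clamp at zero; a one-line sketch:
CODE:
import numpy as np

def relu(x):
    # ReLU activation: f(x) = max(0, x), zeroes out negative inputs
    return np.maximum(0, x)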
Using Neural Networks in TensorFlow for Reinforcement Learning (Contd.)
Let's design a simple neural network that takes in the observation array, passes it through hidden layers, and outputs the probability of going left (the probability of going right is 1 - left).
CODE:
import tensorflow as tf
import gym
import numpy as np

# PART ONE: NETWORK VARIABLES #
# Observation Space has 4 inputs
num_inputs = 4
num_hidden = 4
# Outputs the probability it should go left
num_outputs = 1
initializer = tf.contrib.layers.variance_scaling_initializer()

# PART TWO: NETWORK LAYERS #
X = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden_layer_one = tf.layers.dense(X, num_hidden, activation=tf.nn.relu, kernel_initializer=initializer)
hidden_layer_two = tf.layers.dense(hidden_layer_one, num_hidden, activation=tf.nn.relu, kernel_initializer=initializer)
# Probability to go left, fed from the second hidden layer
output_layer = tf.layers.dense(hidden_layer_two, num_outputs, activation=tf.nn.sigmoid, kernel_initializer=initializer)
# [ Prob to go left , Prob to go right ]
probabilities = tf.concat(axis=1, values=[output_layer, 1 - output_layer])
# Sample 1 action randomly; tf.multinomial expects logits, hence the log
action = tf.multinomial(tf.log(probabilities), num_samples=1)
init = tf.global_variables_initializer()

# PART THREE: SESSION #
saver = tf.train.Saver()  # not used below; kept for checkpointing
epi = 50
step_limit = 500
avg_steps = []
env = gym.make("CartPole-v1")
with tf.Session() as sess:
    init.run()
    for i_episode in range(epi):
        obs = env.reset()
        for step in range(step_limit):
            env.render()
            action_val = action.eval(feed_dict={X: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(action_val[0][0])
            if done:
                avg_steps.append(step)
                print('Done after {} steps'.format(step))
                break
print("After {} episodes the average cart steps before done was {}".format(epi, np.mean(avg_steps)))
env.close()
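A portability side note: the code above targets TensorFlow 1.x APIs (tf.placeholder, tf.layers, tf.contrib, tf.Session), which were removed or relocated in TensorFlow 2.x. A minimal sketch, assuming a TF 2.x install, of the substitutions needed to keep the graph-style code running:
CODE:
# Hedged sketch: running the TF 1.x graph code above on TensorFlow 2.x.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()  # restore graph/session semantics

# tf.contrib is gone in 2.x; an equivalent He-style initializer:
initializer = tf.keras.initializers.VarianceScaling()

# tf.multinomial was renamed; the 2.x-native equivalent call is:
#     action = tf.random.categorical(tf.math.log(probabilities), num_samples=1)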
