Dexterous In-hand Manipulation by OpenAI
Presentation by Anand D Joshi
Outline
▪ Introduction
▪ Goal
the humanoid hand, ShadowHand
▪ Reinforcement Learning
Actor-Critic Approach
Proximal Policy Optimization
Generalized Advantage Estimator
▪ Methodology
▪ Results
▪ Conclusions
Introduction
▪ Research in the control of robotic devices has broad applications
across a number of sectors
▪ Prior methods have trained and tested either entirely in simulation
or entirely on physical robots
▪ However, simulations do not transfer with sufficient accuracy to the
real world, while training on physical robots requires years of
experience to perform satisfactorily
▪ In this study, training is carried out on simulated robots, and the
policies learned in the process are deployed on a physical robot
▪ Since the robot receives no explicit instructions on how to perform an
action, the problem of completing pre-defined tasks is well suited to
Reinforcement Learning (RL)
Goal
▪ To train a robotic hand, ShadowHand, in dexterous manipulation of
an object, like a block
▪ 24 joints involving 20 actuated degrees of
freedom and 4 under-actuated movements
▪ PhaseSpace sensors capture fingertip motion
▪ Sensors record relative angles between joints
▪ RGB cameras used for pose estimation
▪ Touch sensors in the hand not used
▪ Simulation of the Hand done with MuJoCo physics engine
▪ Model of Hand is based on the robotic environment OpenAI Gym,
a toolkit for developing Reinforcement Learning (RL) algorithms
▪ Rendering of simulations carried out with Unity
(Figures: ShadowHand holding a bulb; all the joints of the ShadowHand)
Reinforcement Learning
▪ RL trains an agent in an environment to take an action in a given
state, resulting in a new state and a reward from the environment,
with the aim of maximizing the cumulative reward
For the ShadowHand robot:
▪ State is a 60D space describing angles and velocities of all Hand
joints and position, orientation and velocities of object in hand.
▪ Goal is to achieve the desired orientation with an accuracy of 23°
▪ Action is a 20D space corresponding to desired angles of Hand
joints. Each coordinate is discretized and specified relative to
current joint angle, and rescaled to the range [-1,1]
▪ Reward at time-step t is r_t = d_t − d_{t+1}, where d_t is the rotation
angle between the desired and current orientation before the transition
and d_{t+1} is the angle after the transition (sketched below)
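As a rough illustration (not the authors' code), the sketch below computes this reward from unit-quaternion orientations; the helper names and quaternion convention are assumptions made for the example.

```python
# Minimal sketch (not the authors' implementation): reward as the decrease in
# rotation angle between the object's current and desired orientation,
# assuming orientations are given as unit quaternions.
import numpy as np

def rotation_angle(q_a, q_b):
    """Angle in radians of the rotation taking orientation q_a to q_b."""
    dot = abs(np.dot(q_a, q_b))                      # |<q_a, q_b>| handles the quaternion double cover
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def reward(q_before, q_after, q_goal):
    d_t = rotation_angle(q_before, q_goal)           # angle to goal before the transition
    d_t1 = rotation_angle(q_after, q_goal)           # angle to goal after the transition
    return d_t - d_t1                                # positive if the object moved closer to the goal

# Example: rotating the object from 45 deg to 30 deg away from the goal earns ~0.26 rad of reward.
q_goal = np.array([1.0, 0.0, 0.0, 0.0])
q_before = np.array([np.cos(np.pi / 8), 0.0, 0.0, np.sin(np.pi / 8)])
q_after = np.array([np.cos(np.pi / 12), 0.0, 0.0, np.sin(np.pi / 12)])
print(reward(q_before, q_after, q_goal))             # ≈ 0.262
```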
▪ Policy is a function that maps the current state to an action
▪ Value Function describes how good the agent’s state or action is,
and is used to predict future rewards
▪ Model is the agent’s representation of the environment
▪ To choose the actions that give the most possible reward, RL agents
are typically categorized as value-based (dynamic programming),
which follow a value function without an explicit policy, or policy-
based (policy optimization), which follow a policy without an
explicit value function
▪ The Actor-Critic approach combines the two and tries to get the
best of both
Actor-Critic Approach
▪ The Hand is trained in simulation, where there is full access to the
Hand’s state and the environment
▪ Ideally, for the physical robot to do as well as in simulation, it would
need the same full access to the Hand’s state and environment, which
is infeasible in a real-world setup
▪ Thus the deployed policy cannot rely on the full state that is
available in simulation
▪ Hence the Actor-Critic approach, where
▪ in simulation, the Critic (value network) takes the full state as
input, which speeds up learning
▪ in the real world, the Actor (policy network) acts from partial
observations only
▪ To generalize the policy and vision to reality, Domain
Randomization makes use of a large variety of randomized
experiences rather than an accurate model of the real world
▪ Randomizations over mass, dimensions, friction, noise, colour,
motor backlash, vision, etc., are carried out (a rough sketch follows)
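A rough sketch of what such randomization can look like, assuming a dictionary of nominal simulator parameters; the specific parameter names and ranges here are illustrative, not the paper's actual values.

```python
# Illustrative domain randomization: resample physical, actuation and visual
# parameters at the start of every training episode (ranges are assumptions).
import numpy as np

def randomize(nominal, rng):
    p = dict(nominal)
    p["object_mass"] = nominal["object_mass"] * rng.uniform(0.5, 1.5)     # mass
    p["object_size"] = nominal["object_size"] * rng.uniform(0.95, 1.05)   # dimensions
    p["friction"] = nominal["friction"] * rng.uniform(0.7, 1.3)           # surface friction
    p["motor_backlash"] = rng.uniform(0.0, 0.01)                          # actuation dead zone (rad)
    p["obs_noise_std"] = rng.uniform(0.0, 0.02)                           # observation noise
    p["light_rgb"] = rng.uniform(0.0, 1.0, size=3)                        # vision/colour randomization
    return p

rng = np.random.default_rng(seed=0)
nominal = {"object_mass": 0.08, "object_size": 0.055, "friction": 1.0}
episode_params = randomize(nominal, rng)   # applied to the simulator before each episode
```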
Generalized Advantage Estimator (GAE)
▪ In policy gradient (PG) methods, the aim is to maximize the expected
return; the policy is improved by following the gradient
E[ ∇ log π(a_t | s_t) f(x) ], where f(x) is a value function and E
denotes the expectation operator
▪ To simplify the calculation of future rewards under policy π, we use a
discount factor γ (0 < γ < 1) and define the value functions as
  state-value function:   V^π(s) = E[ Σ_{i=0}^∞ γ^i r_i | s_0 = s ]
  action-value function:  Q^π(s, a) = E[ Σ_{i=0}^∞ γ^i r_i | s_0 = s, a_0 = a ]
▪ The advantage function A^π(s, a) = Q^π(s, a) − V^π(s) then tells us how
much better an action is than the one prescribed by the policy alone
▪ Often, the value function at time t needs to be estimated as
  V̂_t = Σ_{i=t}^∞ γ^(i−t) r_i = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯
  which can be written recursively as V̂_t = r_t + γ V̂_{t+1}, or
  V̂_t = r_t + γ r_{t+1} + γ² V̂_{t+2}; in general,
  V̂_t^(k) = Σ_{i=t}^{t+k−1} γ^(i−t) r_i + γ^k V(s_{t+k}) ≈ Q^π(s_t, a_t)
  is the k-step return estimator
▪ Now, the k-step advantage estimator is defined as
  Â_t^(k) = Σ_{i=t}^{t+k−1} γ^(i−t) r_i + γ^k V(s_{t+k}) − V(s_t) = V̂_t^(k) − V(s_t)
  where V(s_t) is the baseline, which lowers the expectation in the event of
  bad actions
▪ The Generalized Advantage Estimator (GAE) is then defined as the
exponentially weighted average of the k-step estimators,
  Â_t^GAE = (1 − λ)( Â_t^(1) + λ Â_t^(2) + λ² Â_t^(3) + ⋯ )
  which simplifies to
  Â_t^GAE = Σ_{l=0}^∞ (γλ)^l δ_{t+l}^V
  where δ_t^V = r_t + γ V(s_{t+1}) − V(s_t) is the TD residual term
▪ Using Â_t^GAE, it is possible to estimate value functions for all the
states in an episode (a minimal code sketch follows below)
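A minimal sketch of this computation for one episode, following the standard GAE recursion over TD residuals; the γ and λ values are illustrative, not the paper's settings.

```python
# Minimal GAE sketch: compute advantages and value targets for one episode
# by accumulating discounted TD residuals backwards in time.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: r_t for t = 0..T-1; values: V(s_t) for t = 0..T (last entry bootstraps)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t^V
        running = delta + gamma * lam * running                  # A_t = delta_t + (gamma*lam) * A_{t+1}
        advantages[t] = running
    value_targets = advantages + values[:-1]                     # V_hat_t = A_t + V(s_t)
    return advantages, value_targets

rewards = np.array([0.1, 0.0, 0.3])
values = np.array([0.50, 0.55, 0.60, 0.00])   # includes V(s_T) for bootstrapping
adv, targets = gae(rewards, values)
```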
Proximal Policy Optimization (PPO)
▪ A standard PG method typically performs one gradient update in
the policy direction for every data sample
▪ The maximization objective can be represented as a loss function,
  L^PG(θ) = E[ log π_θ(a_t | s_t) Â_t^GAE ]
  whose gradient is the policy gradient above, and where the policy π is
  parameterized by θ (e.g. the weights of a neural network)
▪ If θ_old is the vector of policy parameters before an update, then
  r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)
  is the probability ratio of taking a given action under the current
  policy to taking it under the old policy
▪ The loss function can now be modified as
  L^PPO = E[ min( r_t(θ) Â_t^GAE, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^GAE ) ]
  where the clip function keeps r_t(θ) between 1 − ε and 1 + ε to prevent
  an excessively large update to the policy, with ε being a
  hyperparameter, usually about 0.2 (a minimal sketch of this loss follows below)
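A minimal NumPy sketch of this clipped objective; the sign is flipped so it can be minimized, and a full implementation would also include value-function and entropy terms.

```python
# Minimal sketch of the PPO clipped surrogate loss for one batch of samples.
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages  # clip(r_t, 1-eps, 1+eps) * A_t
    return -np.mean(np.minimum(unclipped, clipped))              # negated: minimizing this maximizes the objective

# Example: the first action became 30% more likely, so its term is clipped at 1.2 * A_t.
logp_old = np.log(np.array([0.10, 0.25]))
logp_new = np.log(np.array([0.13, 0.20]))
advantages = np.array([2.0, -0.5])
print(ppo_clip_loss(logp_new, logp_old, advantages))   # ≈ -1.0
```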
Methodology
▪ A pool of 384 rollout workers, each with 16 CPU cores, is used,
while optimization is performed on a single machine with 8 GPUs
▪ Each worker runs the current version of the policy on a sample drawn
from the distribution of randomizations
▪ States are observed and actions determined by the policy network,
while returns are predicted by value network. These two make up
the PPO. The two networks have the same architecture (LSTM), but
independent parameters.
▪ An episode ends when 50 successive orientations are achieved, when the
policy fails to achieve the desired orientation within 8 s, or when
the object is dropped
▪ For better transfer to real world, simulated object pose is
determined from rendered images by a pose estimator CNN.
3 RGB cameras are used on the physical robot for this
▪ Distributed infrastructure during
training of rollout workers
▪ Workers randomly connect to a
Redis server, to which the policy
parameters are communicated
▪ Experiences are sent from Redis to
GPU through a buffer
▪ Gradients are computed locally on
each GPU before MPI averages them
across all threads to update the
network parameters (a rough sketch
of this exchange follows)
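A rough sketch of the worker/optimizer exchange described here, assuming a single Redis instance, pickled payloads and illustrative key names; this is not the paper's actual protocol and requires a running Redis server.

```python
# Rough sketch of the Redis-based exchange (key names and payload format are assumptions).
import pickle
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

# Optimizer side: publish the current policy parameters for workers to pull.
r.set("policy_params", pickle.dumps({"weights": np.zeros(10)}))

# Worker side: fetch the latest parameters, generate a (dummy) rollout, push it to the buffer.
params = pickle.loads(r.get("policy_params"))
rollout = {"observations": np.random.randn(5, 60),
           "actions": np.random.randn(5, 20),
           "rewards": np.random.randn(5)}
r.lpush("experience_buffer", pickle.dumps(rollout))

# Optimizer side: block until experience arrives, then unpack it for the next gradient step.
_, payload = r.brpop("experience_buffer")
batch = pickle.loads(payload)
```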
▪ The policy network (left) and value
network (right) determine actions
and predict returns respectively
▪ A normalization block ensures uniform
mean and std. dev. for all observations
(a minimal sketch follows)
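A minimal PyTorch-style sketch of the architecture these bullets describe; the layer sizes, input dimensions, and the use of LayerNorm in place of the running-statistics normalization block are assumptions.

```python
# Minimal sketch of the shared policy/value architecture: normalization,
# a fully connected layer, an LSTM, and a linear output head.
import torch
import torch.nn as nn

class RecurrentNet(nn.Module):
    def __init__(self, obs_dim, out_dim, hidden=256):
        super().__init__()
        self.norm = nn.LayerNorm(obs_dim)   # stand-in for the observation-normalization block
        self.fc = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, obs_seq, state=None):
        x = torch.relu(self.fc(self.norm(obs_seq)))
        x, state = self.lstm(x, state)
        return self.head(x), state

# Same architecture, independent parameters: one copy outputs action parameters,
# the other a single predicted return.
policy_net = RecurrentNet(obs_dim=60, out_dim=20)
value_net = RecurrentNet(obs_dim=60, out_dim=1)
obs = torch.zeros(1, 8, 60)                 # (batch, time, observation)
actions, _ = policy_net(obs)
values, _ = value_net(obs)
```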
Results
▪ The ShadowHand policy learns several grasping and manipulating
strategies without any incentivization or demonstration
▪ Grasps found in human adults were rediscovered, and adapted as
per the Hand’s limitations and strengths
▪ PhaseSpace trackers on fingers perform better than vision-based
pose estimation in both simulation and real world
▪ A policy learned on a cube, when applied to a differently shaped
object, performs much better in simulation than in the real world
▪ Randomized training performs better in
real world with 13 median rotations
▪ Without any randomization, median
rotations achieved reduces to 0
▪ Median rotations with PhaseSpace (13)
and vision tracking (11.5) are
comparable after randomized training
(Figure captions: training the Hand with all randomizations requires more
time; training with memory enables the Hand to achieve more rotations faster)
▪ Keeping the batch size per GPU
fixed, having 16 GPUs and 12,288
rollout CPU cores is the optimum
▪ Markers for object orientation are
not always possible in real world
▪ However, prediction error in
orientation in real world is still less
than noise during observations
Conclusions
▪ The success is mainly due to (1) domain randomizations, (2) policy
with memory (LSTM), and (3) large scale distributed RL
▪ Although the Hand is equipped with tactile and pressure sensors, they
were used neither in simulation nor in the real world, because a
lower-dimensional state space is easier to model
▪ Only a solid cube was used in simulation, yet the policies were
general enough to be used with other objects in the real world,
albeit with lower accuracy
▪ This work demonstrates that current RL algorithms can be used
effectively for real-world problems
References
Literature
▪ OpenAI, Andrychowicz M., et al., ‘Learning Dexterous In-Hand Manipulation’, arXiv
preprint arXiv:1808.00177, 2019
▪ Schulman J., Moritz P., Levine S., Jordan M. & Abbeel P., ‘High-Dimensional Continuous
Control using Generalized Advantage Estimation’, arXiv preprint arXiv:1506.02438, 2015
▪ Schulman J., Wolski F., Dhariwal P., Radford A. & Klimov O., ‘Proximal Policy
Optimization Algorithms’, arXiv preprint arXiv:1707.06347, 2017
▪ Mnih V., Kavukcuoglu K., et al., ‘Human-level Control through Deep Reinforcement
Learning’, Nature, 2015, 518, p. 529
Blogs
▪ openai.com/blog/learning-dexterity/
▪ karpathy.github.io/2016/05/31/rl/
▪ openai.com/blog/openai-baselines-ppo/
▪ openai.com/five/
Youtube
▪ RL course by David Silver (youtu.be/2pWv7GOvuf0)
▪ John Schulman: Deep Reinforcement Learning (youtu.be/aUrX-rP_ss4)
▪ Arxiv Insights (youtu.be/JgvyzIkgxF0)