Learning agile and dynamic motor skills
for legged robots
Kohei Nishimura, DeepX,Inc.
Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso,
Vassilios Tsounis, Vladlen Koltun, Marco Hutter
Robotic Systems Lab, ETH Zurich, Switzerland
Introduction
• Proposes a control method for legged robots that require complex motor
control, combining improved simulation modeling with deep
reinforcement learning
Background
• Legged robots are attracting attention as robots capable of operating
in various environments
• In research on legged robots there are many methods for modeling
and controlling actuator behavior,
• but they have not been studied with generalization performance,
ease of tuning, and efficiency in mind
https://www.bostondynamics.com/spot-mini
http://biomimetics.mit.edu
https://www.anybotics.com/anymal/
Previous research
• Controlling a legged robot as a combination of modules
- ex. Assume that only the robot's center of gravity has mass, with
massless joints attached, and compute the optimal control
- Disadvantages
• Modeling inaccuracy causes control inaccuracies
• The control parameters must be modified for every new robot or
new model, and the modules must be modeled and parameterized
from scratch every time the task changes
(this takes several months even for skilled engineers)
• Control by trajectory optimization
- Control using two modules: planning and tracking
- Parameter tuning for trajectory optimization is cumbersome and may fall
into local optima
- Trajectory optimization is computationally heavy and not suitable
for controlling the robot in real time
Robot control using RL
• Research on robot control using reinforcement learning has
been conducted as a learning-based alternative
• There are two major approaches to sim-to-real robot control
using reinforcement learning
1. Make the simulator's behavior faithful to reality and obtain
policies that transfer easily to the real world
ex. Use direct-drive actuators (whose behavior can be modeled analytically)
(Sim-to-real: Learning agile locomotion for quadruped robots)
2. Randomize the variables in the simulator and obtain a policy with
high generalization performance
ex. Randomize the dynamics, add noise to the observations
(Learning Dexterous In-Hand Manipulation)
Overview of proposed method
• The policy is learned via reinforcement learning in simulation only
- The policy receives the state and outputs the action (target joint angles)
for the actuators
- To bridge the gap between the simulator and the real world:
• Accelerate the simulation of ground contact
• Learn the relationship between actions and the torques of the
real-world actuators with a neural network (NN)
• Randomize the simulator conditions (stochastic model) and learn
the policy under them
Technique details : Improve contact simulation
Technique details : Improve contact simulation
• Requires a simulator that can handle the complicated contacts
generated by motion in a stable, accurate, and fast manner
• The common approach is the penalty method (also adopted
in MuJoCo)
- A small interpenetration of objects is permitted and a corresponding
repulsive force is generated
- Easy to implement with low computational cost, but poor simulation
accuracy for very rigid objects
• A more accurate approach is the PGS (projected Gauss-Seidel) method
- It computes the contact forces from the physical constraint
conditions
- It solves the constraint equations by iterative convergence, but the
drawback is that the number of iterations required is not stable
• Convergence can take a long time, e.g. at the moment of a collision
• The authors extend the PGS method with a bisection scheme,
propose a solver that is stable and fast, and use it for the
experiments in this work
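For intuition, here is a minimal numpy sketch of a plain projected Gauss-Seidel sweep for a frictionless contact problem posed as a linear complementarity problem (LCP). This is only the textbook baseline, not the authors' bisection-based extension; the matrix and vector below are illustrative.

```python
import numpy as np

def projected_gauss_seidel(A, b, iters=100, tol=1e-9):
    """Plain PGS for the frictionless-contact LCP:
    find lam >= 0 such that A @ lam + b >= 0 and lam.T @ (A @ lam + b) = 0,
    where lam are the contact impulses."""
    lam = np.zeros_like(b)
    for _ in range(iters):
        max_change = 0.0
        for i in range(len(b)):
            residual = A[i] @ lam + b[i]
            new_val = max(0.0, lam[i] - residual / A[i, i])
            max_change = max(max_change, abs(new_val - lam[i]))
            lam[i] = new_val
        if max_change < tol:   # iterations needed to converge vary per problem,
            break              # which is the instability noted above
    return lam

# Tiny example with two coupled contacts
A = np.array([[2.0, 0.5], [0.5, 1.5]])
b = np.array([-1.0, 0.3])
print(projected_gauss_seidel(A, b))   # ~[0.5, 0.0]
```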
Technique details : Actuator Net
• Learn the relationship between actions and the torques of the
real-world actuators with a neural network
- Input
• The position error (difference between the commanded and actual
joint angle) at times t, t - 0.01, t - 0.02, and the joint angular velocity
- Output
• Torque at time t
- Network structure
• MLP with 3 hidden layers
• The activation function is softsign
- Training data
• Joint angle, joint angular velocity, and torque data collected at 400
Hz for 4 minutes
• The robot walks with a simple control model while disturbances
are applied during walking
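A minimal PyTorch sketch of such a network follows. The hidden width of 32 units and the 6-dimensional input (position error and joint velocity at each of the three past time steps) are assumptions; the slide only fixes the 3 hidden layers and the softsign activation.

```python
import torch
import torch.nn as nn

class ActuatorNet(nn.Module):
    """Sketch: position-error / joint-velocity history -> torque estimate."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.Softsign(),     # 6 = (error, velocity) x 3 steps
            nn.Linear(hidden, hidden), nn.Softsign(),
            nn.Linear(hidden, hidden), nn.Softsign(),
            nn.Linear(hidden, 1),                    # torque at time t
        )

    def forward(self, x):          # x: (batch, 6)
        return self.net(x)

# Training would regress the output against the torques measured at 400 Hz,
# e.g. with torch.nn.MSELoss().
```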
Technique details : Actuator Net
Technique details : Reinforcement learning
At every time step t the agent receives an observation o_t ∈ O,
performs an action a_t ∈ A, and obtains a scalar reward r_t ∈ R.
The aim is to find a policy that maximizes the discounted sum of rewards over
an infinite horizon:
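The objective itself appears only as an image in the original deck; written out, it is the standard infinite-horizon discounted return, assuming the usual discount factor γ ∈ (0, 1):

$$\pi^{\ast} \;=\; \operatorname*{arg\,max}_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\right]$$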
Technique details : Control Policy
• Control policy
- Input
• Position of the robot
• Base orientation of the robot
• History of joint angles (most recent 3 steps)
• History of control signals (most recent 3 steps)
• Operator command (from a human controller)
- Output
• Control signal (angle command for each actuator)
• The policy is trained with TRPO
- The default TRPO parameters from the original paper are used
Trust Region Policy Optimization (TRPO) [22] is a policy gradient algorithm that
has been demonstrated to learn locomotion policies in simulation
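A rough PyTorch sketch of the policy interface this slide describes. The observation layout, the number of joints (12), and the hidden sizes are all assumptions; under TRPO the network output would parameterize a Gaussian action distribution rather than being used deterministically.

```python
import torch
import torch.nn as nn

N_JOINTS, HIST = 12, 3   # assumed: 12 actuated joints, 3-step histories

class LocomotionPolicy(nn.Module):
    """Maps the observation described above to one angle command per actuator."""
    def __init__(self, hidden=256):
        super().__init__()
        # assumed layout: base position/orientation (7) + joint-angle history
        # + control-signal history + operator command (3)
        obs_dim = 7 + N_JOINTS * HIST + N_JOINTS * HIST + 3
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, N_JOINTS),   # angle command for each actuator
        )

    def forward(self, obs):
        return self.net(obs)

policy = LocomotionPolicy()
obs = torch.zeros(1, 7 + N_JOINTS * HIST * 2 + 3)
print(policy(obs).shape)   # torch.Size([1, 12])
```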
Technique details : Learning Control Policy
• Build a stochastic model of the robot to account for modeling error
- a 15% error in the mass estimate due to un-modeled cabling and
electronics
• Randomize the simulator conditions to robustify the policy
- by training with 30 different ANYmal models with stochastically
sampled inertial properties
- The center-of-mass positions, the link masses, and the joint positions
are randomized by adding noise sampled from
U(-2, 2) cm, U(-15, 15) %, and U(-2, 2) cm, respectively
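A small numpy sketch of how such randomized model variants could be sampled; the dictionary keys and the link/joint counts below are purely illustrative, not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_randomized_model(nominal):
    """Perturb a nominal robot model the way the slide describes:
    CoM positions +/- 2 cm, link masses +/- 15 %, joint positions +/- 2 cm."""
    model = dict(nominal)
    model["com_offsets"] = nominal["com_offsets"] + rng.uniform(
        -0.02, 0.02, size=nominal["com_offsets"].shape)
    model["link_masses"] = nominal["link_masses"] * (
        1.0 + rng.uniform(-0.15, 0.15, size=nominal["link_masses"].shape))
    model["joint_positions"] = nominal["joint_positions"] + rng.uniform(
        -0.02, 0.02, size=nominal["joint_positions"].shape)
    return model

# Illustrative nominal model: 13 links and 12 joints are assumptions
nominal = {"com_offsets": np.zeros((13, 3)),
           "link_masses": np.full(13, 2.0),
           "joint_positions": np.zeros((12, 3))}
models = [sample_randomized_model(nominal) for _ in range(30)]   # 30 variants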
Technique details : Learning Control Policy
• Naive training does not learn well
- Weak penalties on torque and joint velocity lead to unnatural motion
- Strong penalties on torque and joint velocity lead to a local optimum
in which the robot does not move at all
• So first learn the overall motion broadly, then refine the motion
afterwards
- ex. The penalties on torque and joint speed are small at first and
are increased in the second half of training
- Introduce curriculum variables k_c and k_d, where k_c = 1 corresponds to
the full (difficult) task
- Update k_c with the formula k_{c,j+1} ← (k_{c,j})^{k_d}
- j is the reinforcement learning iteration
- In the experiments of this paper, k_{c,0} = 0.3 and k_d = 0.997 were used
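Since k_d < 1 and 0 < k_c < 1, repeated exponentiation pushes k_c monotonically toward 1. A tiny sketch of the schedule (iteration count is illustrative):

```python
def curriculum_schedule(k_c0=0.3, k_d=0.997, n_iters=5000):
    """k_c starts small (relaxed penalties) and approaches 1 (full penalties),
    because raising a number in (0, 1) to a power k_d < 1 increases it."""
    k_c = k_c0
    values = []
    for _ in range(n_iters):
        values.append(k_c)
        k_c = k_c ** k_d          # k_{c,j+1} = (k_{c,j})^{k_d}
    return values

ks = curriculum_schedule()
print(ks[0], ks[1000], ks[-1])    # 0.3, then progressively closer to 1.0
```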
Diagram of proposed method
• The network and the variables are summarized
Technique details : Deployment on the physical system
• The custom MLP implementation and the trained parameter set were
ported to the robot’s onboard PC.
• This network was evaluated at 200 Hz for command-conditioned/high
speed locomotion and at 100 Hz for recovery from a fall.
• Performance was surprisingly insensitive to the control rate.
• Even at 100 Hz, evaluation of the network uses only 0.25% of the
computation available on a single CPU core.
Accuracy of Actuator net
            actuator net    ideal model
  train     0.740 N·m       3.55 N·m
  valid     0.996 N·m       5.74 N·m
• The collected data is split 9:1 into training and validation sets
• Compared against a numerical solution that assumes an ideal actuator
- no communication delay, zero mechanical response time
- The RMS error of the actuator net is smaller than that of the ideal-actuator solution
Exp. 1: Command-conditioned locomotion
• Experiment contents
- The robot is given a command and is controlled to move
according to it
- The command consists of the forward speed, the lateral speed, and
the heading of the robot
• Reward function
- Terms on angular velocity, linear velocity, torque, and joint speed (see appendix)
• Training time
- 4 hours of real time (corresponding to about 9 days of simulated time)
Exp. 1 : Comparison method
• A model-based approach is used as the baseline for comparison
- Define a cost function for the task
- Under the constraint conditions, compute the Hessian and Jacobian of the
cost function and obtain the optimal center-of-mass position
and foot coordinates by quadratic programming
- Compute the optimal accelerations and friction forces, solve for the
torques as another quadratic program, and send the commands to
the robot
Exp. 1 : Comparison method
• Experimental results (proposed method)
Result 1: Command-conditioned locomotion
• Comparison of actuator modeling methods
– left : analytical actuator model, right : ideal actuator model
Result 1: Command-conditioned locomotion
Result 1: Command-conditioned locomotion
• Evaluate the difference between the simulator and the
real machine in terms of how faithfully the robot's velocity
is reproduced
- The behavior of the simulator is quite close to that of the real machine
Result 1: Command-conditioned locomotion
• Control error with respect to the command, and control efficiency
(torque, power consumption)
- Comparison with previous work
Exp. 2 : High-speed locomotion
• Experiment contents
- Task to run as fast as possible
• Reward design and learning time are the same as in
Experiment 1
Exp. 2 : High-speed locomotion
• Result
- Maximum speed in previous research: 1.2 m/s
- Maximum speed in this method: 1.6 m/s
• Discussion of the results
- The maximum speed depends on hardware such as the actuators and
other parts
- With existing control methods, the planning computation is too
heavy to keep up in the real environment, so high-speed control
is not possible
Exp. 3: Recovery from a fall
• Experiment contents
- Task of getting up from a fallen state
- Experiments with nine initial configurations
• Reward function
- Terms on torque, joint speed, joint acceleration, ... (see appendix)
• Training time
- 11 hours of real time (corresponding to about 76 days of simulated time)
Exp. 3: Recovery from a fall
Conclusions
• Proposes a method for accurate and efficient control trained by
reinforcement learning in simulation only, and applies it to the real
machine
• The proposed method learns a control policy that is robust to the
machine's state
- The policy could control the real machine without any retuning, even when
deployed about 3 months after training
• Future work
- Designing the reward and the initial-state distribution is difficult, so
the authors would like to improve this
- They would like to handle multiple tasks by giving the control policy a
hierarchical structure
- The paper has already been posted to arXiv (https://arxiv.org/pdf/1901.07517.pdf)
Impressions
• It is great that the computational requirements for both
training and inference are modest enough for real-time control
• Many videos of the simulated agents were uploaded to YouTube, but
I would like to know more details about the simulator
• Reward design seems to be very difficult
• The kick given to the robot in the Experiment 1 video looks rather gentle
References
• Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso,
Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning Agile
and Dynamic Motor Skills for Legged Robots. Science Robotics,
4(26):eaau5872, 2019.
• J. Hwangbo, J. Lee, M. Hutter, Per-contact iteration method for
solving contact dynamics. IEEE Robot. Autom. Lett. 3, 895–902
(2018).
• C. D. Bellicoso, F. Jenelten, C. Gehring, M. Hutter, Dynamic
locomotion through online nonlinear motion optimization for
quadrupedal robots. IEEE Robot. Autom. Lett. 3, 2261–2268 (2018).
Notation used for Reward function
Reward function in Exp. 1 & 2
• The K that appears below is a logistic kernel
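For reference, the logistic kernel used in the paper's reward terms has, to my reading of the paper's appendix (please verify against the original), the form below; it maps an error x to a bounded positive reward that peaks at x = 0:

$$K(x) \;=\; \frac{1}{e^{x} + 2 + e^{-x}}, \qquad K(0) = \tfrac{1}{4}, \quad K(x) \to 0 \ \text{as } |x| \to \infty$$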
• The reward function is the sum of the following terms
– k_c is the curriculum variable
Reward function in Exp. 1 & 2
Appendix. Reward function in Exp. 3
• The angleDiff() that appears below returns the smaller of the two
possible differences between two angles