Coordinated Multi-Agent Control
Utilizing Deep Reinforcement Learning
Alex Gao, Kevin Frans
Henry M. Gunn High School
Introduction
In today’s world, nearly everything is automated. But behind this veil of automation, a programmer must explicitly tell the computer how to act in each situation. As problems become more complex, this manual method becomes impractical, as it requires excessive amounts of work.
With autonomous learning, we allow the computer to figure out
a strategy on its own. This strategy is referred to as a policy,
and the computer program following it is an agent.
In a standard setup, many simulations are run on an unknown
task, starting from a random policy. How the simulation data
can be used to improve the policy is an active area of research,
especially with the rise of modern computing power.
Objectives
In this experiment, our goal is to identify new learning methods in order to:
• Improve performance on the task. Each task defines
success in a different manner, and an optimal strategy
may involve cooperation between multiple agents.
• Speed up training time. An algorithm that requires fewer simulations can learn at a faster rate. Additionally, algorithms that require less compute power can be executed faster in terms of wall-clock time.
Reinforcement Learning
When a human child learns to walk, it isn’t told how. Instead,
the child keeps trying things until it figures out how each of
its muscles should move. Reinforcement learning, a subset of
machine learning, follows the same methodology. Through continuous trial and error, a policy can gradually be improved upon.
We represent our policy as a deep neural network, mapping
from an observation vector (positions and angles of joints) to an
agent’s actions (joint torque). The policy is stochastic: actions
are represented as normal distributions, parametrized by a mean
and a standard deviation. This network can be thought of as an
agent’s "brain".
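The sketch below illustrates one plausible form of such a Gaussian policy network, written in PyTorch (the framework, layer sizes, and method names are illustrative assumptions; the poster does not specify an implementation):

```python
# Minimal sketch of a stochastic Gaussian policy: a small network maps the
# observation vector to the mean of a Normal distribution over joint torques,
# with a learned, state-independent standard deviation.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned log std

    def distribution(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

    def act(self, obs):
        # Sample a torque vector and keep its log-probability for the later update.
        dist = self.distribution(obs)
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1)
```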
Each task gives out a reward value based on how well the agent is performing. The reward differs by task, ranging from ‘how far forward the agent has traveled’ to ‘how close the agent is to a desired point’.
By repeatedly running simulations of a task, we can record which
actions, on average, resulted in a higher reward.
Utilizing this information, we backpropagate through the neural
network, increasing the likelihood of actions that have proven to
be beneficial.
This paradigm is known as the policy gradient, and is the
basis for how our learning agent is trained.
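As a rough illustration of this update, the following REINFORCE-style step (an assumption about the specific estimator; the poster only names the policy gradient in general) increases the log-likelihood of actions in proportion to the reward that followed them:

```python
# Minimal policy-gradient step: `optimizer` holds the policy's parameters,
# `log_probs` are log pi(a_t | s_t) recorded during simulation, and `returns`
# are the rewards that followed each action.
import torch

def policy_gradient_step(optimizer, log_probs, returns):
    # Normalize returns as a simple baseline to reduce variance.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = -(log_probs * returns).mean()  # minimize the negative expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```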
Hypothesis
Traditional methods use one large policy to account for all of the agents’ actions. However, this setup has trouble scaling to more complex tasks. Instead, we split this large network into many smaller, distinct networks — one for each agent. We view each of a robot’s joints as a separate agent, allowing independent decisions to be made. In practice, this means training a distinct neural network for each joint. Our method is decentralized, and each network can be trained and executed independently of the others.
Figure: Left: traditional network. Right: decentralized network.
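In code, the decentralized setup can be sketched as one small policy per joint. The per-agent split below (every agent sees the full observation and emits a single torque) is an illustrative assumption, reusing the GaussianPolicy sketched earlier:

```python
# Decentralized policies: each joint gets its own small GaussianPolicy.
# Each agent sees the full observation but outputs only its own torque.
import torch

def make_agents(obs_dim, num_joints, hidden=32):
    # One single-output policy per joint, trainable independently.
    return [GaussianPolicy(obs_dim, act_dim=1, hidden=hidden) for _ in range(num_joints)]

def joint_action(agents, obs):
    # Each agent samples its own torque; concatenate into the full action vector.
    actions, log_probs = zip(*(agent.act(obs) for agent in agents))
    return torch.cat(actions, dim=-1), torch.stack(log_probs, dim=-1)
```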
Tasks
Figure: Swimmer
Figure: Hopper
Figure: Reacher
Figure: Delivery
Rationale
Large neural networks, especially those with multiple objectives, often struggle to encode observations effectively. For example, a robot will move its leg joint differently depending on the position of its foot; however, that information is not useful for controlling its right arm. A single network is forced to encode all of this information at once, which can lead to interference between unrelated parts of the task.
By utilizing multiple networks, we allow each network to
specialize, and only encode the information useful to its specific
role in the task.
Additionally, decentralizing our policy enables training to scale up through parallelization. In a centralized setup, the entire network must finish one update before the next can start. With the policy split into separate networks, each can be trained on a different computer in parallel, allowing for faster training as the system scales.
Results
Algorithm
Algorithm 1 Decentralized Policy Gradient
 1: initialize π for every agent
 2: for iteration = 0, 1, 2, ... until convergence do
 3:     for episode = 0, 1, 2, ..., 10000 do
 4:         Reset environment
 5:         for timestep = 0, 1, 2, ... until episode end do
 6:             Receive observation
 7:             Every agent takes an action according to its policy π
 8:             Record tuple (observation, action, reward)
 9:     Estimate value function Q(s, a) using recorded tuples
10:     Compute gradient to increase expected value
11:     Update policies in the gradient direction
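A hedged end-to-end sketch of this loop, reusing the helpers from the earlier sketches and assuming a classic Gym-style environment interface (env.reset returning an observation, env.step returning an (observation, reward, done, info) tuple), might look like:

```python
# Decentralized policy-gradient training loop: collect episodes, estimate
# reward-to-go as a simple stand-in for Q(s, a), then update every agent's
# network independently.
import torch

def train(env, agents, iterations=100, episodes_per_iter=20, gamma=0.99):
    optimizers = [torch.optim.Adam(a.parameters(), lr=3e-4) for a in agents]
    for _ in range(iterations):
        batch_logps = [[] for _ in agents]
        batch_returns = []
        for _ in range(episodes_per_iter):
            obs, rewards = env.reset(), []
            logps = [[] for _ in agents]
            done = False
            while not done:
                obs_t = torch.as_tensor(obs, dtype=torch.float32)
                action, logp = joint_action(agents, obs_t)
                obs, reward, done, _ = env.step(action.detach().numpy())
                rewards.append(reward)
                for i in range(len(agents)):
                    logps[i].append(logp[i])
            # Discounted reward-to-go for each timestep of the episode.
            ret, rtg = 0.0, []
            for r in reversed(rewards):
                ret = r + gamma * ret
                rtg.append(ret)
            rtg.reverse()
            batch_returns.extend(rtg)
            for i in range(len(agents)):
                batch_logps[i].extend(logps[i])
        returns = torch.as_tensor(batch_returns, dtype=torch.float32)
        # Each agent updates independently, which is what permits parallel training.
        for i in range(len(agents)):
            policy_gradient_step(optimizers[i], torch.stack(batch_logps[i]), returns)
```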
Conclusion
We tested our algorithm’s performance on a variety of standard control tasks, and showed an increase of up to two times the performance of the single-network method. Our method also learned an optimal solution in fewer simulations. Repeated trials on each task validated our results.
While the total computational cost of a decentralized method may be higher, the neural networks can be trained in parallel across multiple computers, effectively reducing wall-clock training time. This is more beneficial for practical use, since the smaller neural networks can be distributed among the agents without intercommunication.
Our method’s improvements over the standard single-network approach, in terms of performance and training time, allow deep reinforcement learning to be applied in many practical scenarios. Real-world problems such as assembly-line robotics have traditionally been explicitly coded; however, as autonomous learning becomes more efficient, artificial intelligence may take over these jobs.
Future Work
Our work provides a reliable basis for practical cooperative learning in a multitude of environments, and paves the way for future research in the emerging field of multi-agent control.
Our algorithm has many practical applications requiring cooperative learning that can be explored in future work, including:
• Drone package delivery network management and collision
avoidance
• Robotic bee pollination as a substitute for declining bee populations
• Planetary exploration with teams of rovers
• Prosthetics for disabled persons
