Diversity is All You Need :
Learning Skills without a Reward Function
김예찬(Paul Kim)
Index
1. Abstract
2. Introduction
3. Related Work
4. Diversity is All You Need
4.1 How it Works
4.2 Implementation
5. What Skills are Learned?
6. Harnessing Learned Skills
6.1 Adapting Skills to Maximize Reward
6.2 Using Skills for Hierarchical RL
6.3 Imitating an Expert
7. Conclusion
Abstract
1. Abstract
DIAYN (Diversity is All You Need)
- The agent can explore its environment and learn useful skills without supervision
- DIAYN can learn useful skills without a reward function
- It maximizes an information-theoretic objective using a maximum entropy policy
- DIAYN is presented as an effective pretraining method, addressing RL's problems with exploration and data efficiency
Introduction
2. Introduction
DRL has been demonstrated to effectively learn a wide range of reward-driven skills, including
1. playing games
2. controlling robots
3. navigation
DIAYN, by contrast, is not reward driven.
2. Introduction
DIAYN : Unsupervised skill discovery
- Learning useful skills without supervision can aid exploration in sparse-reward tasks
- For long-horizon tasks, skills discovered without reward can serve as primitives for HRL, effectively shortening the episode length
- It also reduces the need for human feedback, e.g. reward design
- There is no need to invest a lot of time in designing a reward function
2. Introduction
What is a Skill?
- A skill is a policy that changes the environment's state in a consistent way
- Some skills might be useless
- Skills should be not only distinguishable, but also as diverse as possible
- Diverse skills are robust to perturbations and explore the environment better
2. Introduction
Key idea
Acquire skills that are both distinguishable and diverse
- objective based on mutual information
- applications : HRL, imitation learning
2. Introduction
Five contributions
1. A method for learning useful skills without any rewards
- maximizing an information-theoretic objective with a maximum entropy policy
2. This simple exploration objective results in the unsupervised emergence of diverse skills
- e.g. running and jumping; some of the learned skills solve the benchmark task
3. A simple method for using learned skills for HRL, and this method solves challenging tasks
4. How discovered skills can be quickly adapted to solve a new task
5. Discovered skills can be used for imitation learning
Related Work
3. Related Work
HRL Perspective
Previous work
- HRL has learned skills to maximize a single, known reward function by jointly learning a set of skills and a meta-controller
- In joint training, the meta-policy does not select 'bad' options, so these options do not receive any reward signal to improve
DIAYN
- uses a random meta-policy instead
- learns skills with no reward
3. Related Work
Connection between RL and information theory
Previous work
- mutual information between states and actions as a notion of empowerment for an intrinsically motivated agent
- a discriminability objective is equivalent to maximizing the mutual information between the latent skill z and some aspect of the corresponding trajectory
- settings with many tasks and reward functions
- settings with a single task reward
DIAYN
- maximizes the mutual information between states and skills (which can be interpreted as maximizing the empowerment of a hierarchical agent whose action space is the set of skills)
3. Related Work
Connection between RL and information theory
DIAYN
- uses maximum entropy policies to force skills to be diverse
- fixes the distribution p(z) rather than learning it, preventing p(z) from collapsing to sampling only a handful of skills
- the discriminator looks at every state, which provides an additional reward signal
3. Related Work
Neuroevolution and evolutionary algorithms
- Neuroevolution and evolutionary algorithms have studied how complex behaviors can be learned by directly maximizing diversity
DIAYN
- acquires complex skills with minimal supervision to improve efficiency
- focuses on deriving a general, information-theoretic objective that does not require manual design of distance metrics and can be applied to any RL task without additional engineering
3. Related Work
Intrinsic motivation
- Previous works use an intrinsic motivation objective to learn a single policy
DIAYN
- proposes an objective for learning many, diverse policies
Diversity is All You Need
4. Diversity is All You Need
Unsupervised RL paradigm
- The agent is allowed an unsupervised "exploration" stage followed by a supervised stage
- The aim of the unsupervised stage is to learn skills that eventually make it easier to maximize the task reward in the supervised stage
- Conveniently, because skills are learned without a priori knowledge of the task, the learned skills can be used for many different tasks
Unsupervised stage : the agent explores the environment but does not receive any task reward → learn skills
Supervised stage : the agent receives the task reward, and its goal is to learn the task by maximizing that reward
4.1 How it Works?
DIAYN : three ideas
1. The skill dictates the states that the agent visits
- maximize the mutual information between skills and states, MI(s, z), so that the skill controls which states the agent visits
2. To distinguish skills, we use states, not actions
- to ensure that states, not actions, are used to distinguish skills, minimize the mutual information between skills and actions given the state, MI(a, z | s)
3. The skills should be as diverse as possible
- maximize the entropy of the mixture of policies (the collection of skills together with p(z))
Combined, these ideas yield the objective written out below.
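Reconstructed in the paper's notation (with a learned skill discriminator q_phi(z | s) giving the variational lower bound that is actually optimized):

\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S)
                    = \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z]

\mathcal{F}(\theta) \ge \mathcal{H}[A \mid S, Z]
    + \mathbb{E}_{z \sim p(z),\, s \sim \pi(z)} \big[ \log q_\phi(z \mid s) - \log p(z) \big]
    = \mathcal{G}(\theta, \phi)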
4.2 Implementation
- Uses soft actor-critic (SAC) to learn the skill-conditioned policy
- The entropy regularizer is scaled by alpha
- found empirically : 0.01
- alpha trades off exploration against discriminability
- Uses a pseudo-reward r_z(s) = log q_phi(z | s) - log p(z); SAC maximizes this pseudo-reward together with its entropy bonus (a minimal sketch follows below)
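A minimal sketch (not the authors' code) of how the pseudo-reward could be computed with a learned skill discriminator; the PyTorch model, its sizes, and the helper names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillDiscriminator(nn.Module):
    """q_phi(z | s): predicts which skill generated the visited state."""
    def __init__(self, state_dim, num_skills, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state):
        return self.net(state)  # unnormalized logits over skills

def diayn_pseudo_reward(discriminator, state, skill, num_skills):
    """r_z(s) = log q_phi(z | s) - log p(z), with p(z) fixed to uniform."""
    with torch.no_grad():
        log_q = F.log_softmax(discriminator(state), dim=-1)
    log_q_z = log_q.gather(-1, skill.unsqueeze(-1)).squeeze(-1)  # log-prob of the active skill
    log_p_z = -torch.log(torch.tensor(float(num_skills)))        # log of the uniform prior
    return log_q_z - log_p_z

In this sketch the discriminator would be trained with a standard cross-entropy loss on (state, skill) pairs produced by the skill-conditioned policy, and SAC then maximizes the pseudo-reward plus its entropy bonus scaled by alpha.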
What Skills are Learned?
5. What Skills are Learned?
1. Does entropy regularization lead to more diverse skills?
- With a small alpha, the agent learns skills that move large distances but fail to explore large parts of the state space
- As alpha increases, the skills visit a more diverse set of states, which may help with exploration in complex state spaces
- It becomes difficult to discriminate between skills when alpha is increased even further
(figure: skills plotted by orientation and forward velocity)
5. What Skills are Learned?
2. How does the distribution of skills change during training?
- Skills for the inverted pendulum and mountain car become increasingly diverse throughout training
- Skills are learned with no reward, so it is natural that some skills correspond to small task reward while others correspond to large task reward
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- Environments: half-cheetah, hopper, and ant
- DIAYN learns diverse locomotion primitives
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- Evaluate all skills on three reward functions: running (maximize the X coordinate), jumping (maximize the Z coordinate), and moving (maximize the L2 distance from the origin); simple versions of these are sketched below
- DIAYN learns some skills that achieve high reward
- DIAYN optimizes a collection of policies, which enables more diverse exploration
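For concreteness, the three evaluation rewards named above can be expressed as simple functions of the agent's position; the coordinate layout and the planar distance are illustrative assumptions, not the benchmark's exact definitions.

import numpy as np

def running_reward(position):   # maximize the X coordinate
    return float(position[0])

def jumping_reward(position):   # maximize the Z coordinate
    return float(position[2])

def moving_reward(position):    # maximize the L2 distance from the origin (here, in the x-y plane)
    return float(np.linalg.norm(position[:2]))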
5. What Skills are Learned?
4. Does DIAYN ever learn skills that solve a benchmark task?
- Yes: the half-cheetah and hopper learn skills that run and hop forward quickly
Harnessing Learned Skills
6. Harnessing Learned Skills
Three perhaps less obvious applications of the learned skills:
1. adapting skills to maximize a task reward
2. hierarchical RL
3. imitation learning
6.1 Adapting Skills to Maximize Reward
- After DIAYN learns task-agnostic skills without supervision, we can quickly adapt the skills to solve a desired task
- This is akin to computer vision researchers using models pre-trained on ImageNet
- DIAYN serves as (unsupervised) pre-training in resource-constrained settings
6.1 Adapting Skills to Maximize Reward
5. Can we use learned skills to directly maximize the task reward?
- The fine-tuning approach differs from the baseline only in how the weights are initialized, and this unsupervised initialization helps on the benchmark tasks (a sketch of such initialization follows below)
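A hedged sketch of what "differs only in how weights are initialized" could look like in practice: copy the parameters of the best unsupervised skill into a fresh SAC learner and continue training on the task reward. make_sac_agent, agent.policy, and agent.train_step are hypothetical placeholders, not the authors' code.

import copy

def adapt_best_skill(skill_policy_state_dict, make_sac_agent, env, finetune_steps=100_000):
    """Initialize a fresh SAC learner from a DIAYN skill's weights, then fine-tune on task reward."""
    agent = make_sac_agent(env)                                        # hypothetical factory
    agent.policy.load_state_dict(copy.deepcopy(skill_policy_state_dict))
    for _ in range(finetune_steps):
        agent.train_step(env)   # ordinary SAC update; the reward now comes from the environment
    return agent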
6.2 Using Skills for Hierarchical RL
- In theory, hierarchical RL should decompose a complex task into motion primitives, which may be reused for multiple tasks
- In practice, algorithms for hierarchical RL encounter many difficulties:
1. each motion primitive reduces to a single action [9]
2. the hierarchical policy only samples a single motion primitive [24]
3. all motion primitives attempt to do the entire task
DIAYN discovers diverse, task-agnostic skills, which hold the promise of acting as building blocks for hierarchical RL (a sketch of such a hierarchy follows below)
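One way to read "building block for hierarchical RL" in code: freeze the skill-conditioned policy and train only a meta-controller that picks a skill every few steps. This is a hedged sketch; the meta_policy and skill_policy interfaces and the old-style gym step signature are assumptions, not the paper's implementation.

def rollout_hierarchical(env, meta_policy, skill_policy, skill_len=100):
    """The meta-policy picks a skill index z every skill_len steps;
    the frozen DIAYN skill-conditioned policy supplies the low-level actions."""
    state = env.reset()
    done, episode_return = False, 0.0
    while not done:
        z = meta_policy.select_skill(state)           # meta-action = which skill to run
        segment_return, start_state = 0.0, state
        for _ in range(skill_len):
            action = skill_policy.act(state, z)       # frozen, skill-conditioned policy
            state, reward, done, _ = env.step(action)
            segment_return += reward
            if done:
                break
        meta_policy.update(start_state, z, segment_return, state)  # learn which skills pay off
        episode_return += segment_return
    return episode_return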
6.2 Using Skills for Hierarchical RL
6. Are skills discovered by DIAYN useful for hierarchical RL?
- The hierarchical agent built on DIAYN skills outperforms all baselines: TRPO and SAC are competitive on-policy and off-policy RL algorithms, while VIME includes an auxiliary objective to promote efficient exploration
6.2 Using Skills for Hierarchical RL
7. How can DIAYN leverage prior knowledge about what skills will be useful?
- In particular, we can condition the discriminator on only a subset of the observation, forcing DIAYN to find skills that are diverse in this subspace (but potentially indistinguishable along other dimensions); a sketch follows below
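A hedged sketch of this prior-knowledge trick: feed the discriminator only the chosen observation dimensions, so skills must differ in that subspace. The particular index choice (the first two coordinates, e.g. an x-y position) is an illustrative assumption.

def discriminator_input(state, subspace_idx=(0, 1)):
    """Keep only the observation dimensions the designer cares about
    before passing them to q_phi(z | s)."""
    return state[..., list(subspace_idx)]

# logits = discriminator(discriminator_input(state))   # the rest of training is unchanged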
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- Consider the setting where we are given an expert trajectory consisting of states (not actions)
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- Given the expert trajectory, we use our learned discriminator to estimate which skill was most likely to have generated the trajectory
- This optimization problem, which we can solve for categorical z by simple enumeration, is equivalent to an M-projection (a sketch follows below)
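A small sketch of the enumeration described above: score every categorical skill z by the discriminator's total log-probability over the expert's states and return the argmax. It reuses the PyTorch discriminator from the earlier sketch and is an illustration, not the authors' code.

import torch
import torch.nn.functional as F

def retrieve_skill(discriminator, expert_states):
    """Return z* = argmax_z sum_t log q_phi(z | s_t) over the expert trajectory."""
    with torch.no_grad():
        log_q = F.log_softmax(discriminator(expert_states), dim=-1)  # [T, num_skills]
        scores = log_q.sum(dim=0)                                    # sum over time steps
    return int(torch.argmax(scores))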
6.3 Imitating an Expert
9. How does DIAYN differ from Variational Intrinsic Control (VIC)?
- DIAYN uses maximum entropy policies and does not learn the prior p(z)
- The DIAYN method consistently matched the expert trajectory more closely than VIC baselines lacking these elements
- If the distribution over skills p(z) is learned, the model may encounter a rich-get-richer problem
Conclusion
7. Conclusion
- This paper proposes DIAYN, a method for learning skills without reward functions
- DIAYN learns diverse skills for complex tasks, often solving benchmark tasks with one of the learned skills without actually receiving any task reward
7. Conclusion
- The paper also proposes methods for using the learned skills
(1) to quickly adapt to a new task
(2) to solve complex tasks via hierarchical RL
(3) to imitate an expert
- As a rule of thumb, DIAYN may make learning a task easier by replacing the task's complex action space with a set of useful skills
- DIAYN could be combined with methods for augmenting the observation space and reward function
7. Conclusion
- Using the common language of information theory, a joint objective can likely be derived
- DIAYN may also learn from human preferences more efficiently by having humans select among the learned skills
- Finally, for creativity and education, the skills produced by DIAYN might be used by game designers to allow players to control complex robots and by artists to design dancing robots
Thank you
