Preference learning for guiding the tree searches
in continuous POMDPs
Jiyong Ahn, Sanghyeon Son, Dongryung Lee, Jisu Han, Dongwon Son, Beomjoon Kim
Intelligent Mobile Manipulation Lab
7th Conference on Robot Learning (2023), Atlanta, USA
Goal
Rashly fetching the object without first gathering information could make the occluding object fall.
The robot needs to efficiently perform information-gathering actions for robust operation.
Enable robots to create plans that consider the uncertainties of the objects in an unstructured environment.
Challenges
This is a challenging problem that involves partial observability over high-dimensional, continuous state, action, and observation spaces.
[1] Z. Sunberg and M. Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 28, pages 259–263, 2018.
The primary difficulty in using POMDP solvers is their computational complexity.
• In recent work, POMCPOW [1] effectively handles POMDPs with large-scale continuous observation and action spaces using Monte Carlo Tree Search (MCTS).
• However, the computational challenge stemming from continuous, high-dimensional spaces is still significant.
[Figure: POMCPOW search tree.]
Learning to Guide Tree Search
[2] B. Kim, L. Shimanuki, L. Kaelbling, and T. Lozano-Perez. Representation, learning, and planning algorithms for geometric task and motion planning. The International Journal of Robotics Research, 2022.
GANDI [2] (Kim et al., 2018) proposes a method for learning to guide planning for challenging long-horizon task and motion planning problems in continuous space, in a fashion similar to AlphaGo.
Learning to Guide Tree Search
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
To deal with the computational challenge in the game of Go, AlphaGo [3] proposed learning a value function and policy from past planning experience to guide the search.
However, these methods are limited to fully observable environments and cannot handle state uncertainty.
Learning to Guide Tree Search
In this work, we extend GANDI [2] to a partially observable setup.
• Previously, the value function and policy operated on a fully observable state.
• Ours operate on the action-observation history $h$ and guide POMCPOW [1].
In AlphaGo, the value function and policy are functions of a state. Under partial observability in a POMDP, they are instead functions of the action-observation history:
$h_t = (o_0, a_1, o_1, \ldots, a_t, o_t)$
$\pi(h, a) = \Pr(a_{t+1} = a \mid h_t = h)$
$V^\pi(h) = \mathbb{E}_\pi\left[ R_t \mid h_t = h \right]$
Learning to Guide Tree Search
However,
• this approach would require a prohibitively large amount of data to train an effective value function, because there are infinitely many possible histories containing high-dimensional observations.
• since generating a dataset involves performing a tree search in a POMDP, such large-scale data generation would take a significant amount of time.
$h_t = (o_0, a_1, o_1, \ldots, a_t, o_t)$
$\pi(h, a) = \Pr(a_{t+1} = a \mid h_t = h)$
$V^\pi(h) = \mathbb{E}_\pi\left[ R_t \mid h_t = h \right]$
In a POMDP, the value function and policy are functions of a history.
How can we deal with this need for a large amount of data?
Two observations:
1. A search tree for a POMDP typically consists of a few success histories that led to a goal and a large number of other histories that did not.
2. All we need to efficiently guide a tree search is a ranking among the histories, specifying which one is more likely to lead to the goal, not their actual values.
Based on these two observations, we propose a value function learning algorithm that learns the ranking among histories.
Preference-based Learning
Given a search tree, we define:
• Success history: a history that reached a goal.
• Failure history: a history that did not reach a goal.
and learn a value function such that it prefers success histories over failure histories.
Preference-based Learning
However, this simple success-versus-failure preference labeling scheme does not capture any notion of optimality.
To incorporate optimality,
1. we create additional data by pairing two successful histories,
2. and then label the one that is closer to the goal as more preferred.
In scenarios with limited data, preference learning proves more robust than regression: it is less sensitive to the exact magnitudes of value differences and therefore more resilient to noise, whereas regression exhibits higher variance.
Proposed method: PGP (Preference-guided POMCPOW)
1. If the $i$-th history is part of a success history and the $j$-th history is not,
⇒ we prefer the $i$-th history.
2. If both are part of success histories and have an equal number of remaining time steps to the goal,
⇒ they are equally preferred.
3. If both are success histories but the $j$-th history is closer to the goal,
⇒ we prefer the $j$-th history (see the labeling sketch below).
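To make these three rules concrete, here is a minimal, hypothetical labeling sketch (not the authors' code); the `History` fields and the {1.0, 0.5, 0.0} label convention are illustrative assumptions.

```python
# Illustrative sketch: assigning a preference label to a pair of histories
# taken from a search tree, following the three rules above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class History:
    is_success: bool               # part of a history that reached the goal
    steps_to_goal: Optional[int]   # remaining time steps to the goal (None for failures)


def preference_label(h_i: History, h_j: History) -> Optional[float]:
    """1.0 if h_i is preferred, 0.0 if h_j is preferred, 0.5 if equal, None if unlabeled."""
    if h_i.is_success and not h_j.is_success:
        return 1.0                 # rule 1: success preferred over failure
    if h_j.is_success and not h_i.is_success:
        return 0.0                 # rule 1 (mirrored)
    if h_i.is_success and h_j.is_success:
        if h_i.steps_to_goal == h_j.steps_to_goal:
            return 0.5             # rule 2: equally close to the goal
        return 1.0 if h_i.steps_to_goal < h_j.steps_to_goal else 0.0  # rule 3
    return None                    # two failure histories: this pair is not used
```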
Proposed method: PGP (Preference-guided POMCPOW)
We train the Preference-V and Preference-Q functions, denoted $V$ and $Q$, by treating them as preference predictors.
1. Approximate the probability of preferring history $h_i$ over $h_j$ with a softmax over their values.
2. Learn the function by optimizing the parameters of $V$ with respect to the cross-entropy loss.
3. Train $V$ and $Q$ concurrently by designing the neural network to share the same backbone, with separate heads.
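A minimal training sketch follows, assuming PyTorch; the MLP backbone, layer sizes, and loss wiring are stand-ins for illustration, not the paper's transformer architecture.

```python
# Sketch: shared-backbone network with separate V and Q heads, trained with a
# softmax (Bradley-Terry) preference probability and a cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceNet(nn.Module):
    def __init__(self, hist_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # shared backbone over an encoded history (a transformer in the paper;
        # a small MLP stands in here for brevity)
        self.backbone = nn.Sequential(nn.Linear(hist_dim, hidden), nn.ReLU())
        self.v_head = nn.Linear(hidden, 1)             # Preference-V(h)
        self.q_head = nn.Linear(hidden + act_dim, 1)   # Preference-Q(h, a)

    def value(self, h):
        return self.v_head(self.backbone(h)).squeeze(-1)

    def q_value(self, h, a):
        return self.q_head(torch.cat([self.backbone(h), a], dim=-1)).squeeze(-1)


def preference_loss(net, h_i, h_j, label):
    # P(h_i preferred over h_j): softmax over the two predicted values
    logits = torch.stack([net.value(h_i), net.value(h_j)], dim=-1)
    p_i = F.softmax(logits, dim=-1)[..., 0]
    # cross-entropy against labels in {1.0, 0.5, 0.0} from the rules above
    return F.binary_cross_entropy(p_i, label)
```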
VAE Approximated Energy-based Q-function
For training a policy, we could, in principle, imitate the actions on success histories from past planning experiences using regression.
However,
• to facilitate efficient exploration during tree search, we need a multi-modal policy rather than a uni-modal policy [5];
• if we only imitate success histories, we are limited to a small number of histories per search tree, which is data-inefficient.
[Figure: supervised policy network training pipeline and architecture in AlphaGo [3].]
[5] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.
VAE Approximated Energy-based Q-function
We instead propose to use an energy-based model, where the energy is based on a Q-function; this effectively defines a multi-modal policy whose probability of selecting an action grows with its Q-value.
We train the Q-function simultaneously with the value function by implementing the two functions as two heads that share the same backbone transformer but take different inputs.
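As a sketch, the induced policy can be written as follows; treating the Preference-Q value as the (negative) energy and omitting any temperature term is an illustrative assumption, not taken verbatim from the paper.

```latex
% Energy-based policy induced by the Preference-Q function (illustrative):
% actions with higher Q(h, a) are exponentially more likely to be chosen.
\pi(a \mid h) \;=\;
  \frac{\exp\!\big(Q(h, a)\big)}{\int_{\mathcal{A}} \exp\!\big(Q(h, a')\big)\, \mathrm{d}a'}
```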
VAE Approximated Energy-based Q-function
The problem with an energy-based function over a continuous space is that exact sampling is intractable.
Instead, we propose to use a VAE to approximate the Preference-Q-based energy function.
At every gradient step, we uniformly sample $N$ actions at random and minimize a Q-function-weighted loss to train the VAE.
The higher the Q-value of an action, the higher its chance of being sampled from the VAE.
This approach
• generates higher-quality samples than uniform random sampling, and
• has a much faster inference time than MCMC sampling.
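A minimal sketch of one such gradient step follows, assuming PyTorch and a hypothetical conditional VAE with `encode`/`decode` methods; using a softmax over Q-values as the per-sample weight is an illustrative assumption.

```python
# Sketch: one Q-weighted training step for a conditional VAE action sampler.
import torch
import torch.nn.functional as F


def vae_training_step(vae, q_fn, h, action_low, action_high, n_samples=64):
    # uniformly sample N candidate actions within the action bounds
    act_dim = action_low.shape[-1]
    a = torch.rand(n_samples, act_dim) * (action_high - action_low) + action_low
    h_rep = h.expand(n_samples, -1)          # repeat the encoded history

    # weight each candidate by its (softmaxed) Preference-Q value, so actions
    # with higher Q contribute more to the reconstruction objective
    with torch.no_grad():
        w = F.softmax(q_fn(h_rep, a), dim=0)

    # standard conditional-VAE terms (reconstruction + KL), weighted per sample
    mu, logvar = vae.encode(a, h_rep)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    a_rec = vae.decode(z, h_rep)
    recon = ((a_rec - a) ** 2).sum(dim=-1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
    return (w * (recon + kl)).sum()
```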
Proposed method: PGP (Preference-guided POMCPOW)
4. We use importance sampling to train the policy $\pi_\theta$, by sampling actions from a uniform distribution and evaluating the importance weights according to $\pi_\theta$.
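One way to write this importance-sampled fitting objective is sketched below, with assumed notation: $p_\psi$ for the VAE sampler being trained and $u$ for the uniform proposal over the action space.

```latex
% Self-normalized importance-sampling estimate (sketch) of the policy-fitting
% objective: candidates a_i come from a uniform proposal u and are reweighted
% toward the energy-based target policy pi_theta.
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid h)}\!\left[\log p_\psi(a \mid h)\right]
\;\approx\; \sum_{i=1}^{N} w_i \,\log p_\psi(a_i \mid h),
\qquad
w_i = \frac{\pi_\theta(a_i \mid h)\,/\,u(a_i)}{\sum_{j=1}^{N} \pi_\theta(a_j \mid h)\,/\,u(a_j)},
\quad a_i \sim u .
```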
Experiments
Domains
Domain #1. Light-dark room
Objective: navigate to a goal position in a 2D plane with the minimum number of steps.
Domain #2. Object fetching with known object classes
Domain #3. Object fetching with unknown object classes
Objective (Domains #2 and #3): fetch the target object, which is completely occluded by a non-target object, from the cabinet and place it in the goal region in a minimum number of steps.
Experiments
2D Light-Dark room domain - Quantitative results
[Plots: success rate (%) and average number of time steps vs. number of simulations.]
Experiments
2D Light-Dark room domain - Qualitative results
[Panels: trajectories of IGP, SF-PGP, and PGP.]
PGP and SF-PGP (preference-based) reach the goal region in fewer time steps.
Experiments
Fetching domain with known object classes - Quantitative results
[Plots: success rate (%) and average number of time steps.]
Experiments
Diverse object fetching with unknown object classes (Simulation)
[Plot: success rate (%).]
Experiments
Real-world object fetching with unknown object classes - Qualitative results
• PGP achieves the highest success rate and the most near-optimal success trajectories in the real-robot experiment.
• IGP tends to rashly fetch the object at the last step, dropping the occluder and failing.