Preference learning for guiding the tree searches
in continuous POMDPs
Jiyong Ahn, Sanghyeon Son, Dongryung Lee, Jisu Han, Dongwon Son, Beomjoon Kim
Intelligent Mobile Manipulation Lab
7th Conference on Robot Learning (2023), Atlanta, USA
Goal
Rashly fetching the object without first gathering information could make the occluding object fall.
The robot needs to efficiently perform information-gathering actions for robust operation.
Enable robots to create plans that consider the uncertainties of the objects in an unstructured environment.
Challenges
This is a challenging problem that involves partial observability over high-dimensional, continuous state, action, and observation spaces.
[1] Z. Sunberg and M. Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 28, pages 259–263, 2018.
The primary difficulty in using POMDP solvers is their computational complexity.
• In recent work, POMCPOW [1] effectively handles POMDPs with large-scale continuous observation and action spaces using Monte Carlo Tree Search (MCTS).
• However, the computational challenge stemming from continuous, high-dimensional spaces is still significant.
[Figure: POMCPOW search tree.]
Learning to Guide Tree Search
[2] B. Kim, L. Shimanuki, L. Kaelbling, and T. Lozano-Perez. Representation, learning, and planning algorithms for geometric task and motion planning. The International Journal of Robotics Research, 2022.
GANDI [2] (Kim et al., 2018) proposes a method for learning to guide planning for challenging long-horizon task and motion planning problems in continuous space, in a fashion similar to AlphaGo.
Learning to Guide Tree Search
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
To deal with the computational challenge in the game of Go, AlphaGo [3] proposed learning a value function and policy from past planning experience to guide the search.
However, these methods are limited to fully observable environments and cannot handle state uncertainty.
Learning to Guide Tree Search
In this work, we extend GANDI [2] to a partially observable setup.
• Previously, the value function and policy operated on a fully observable state.
• Ours operate on the action-observation history $h$ and guide POMCPOW [1].
In AlphaGo, the value function and policy are functions of a state. Under partial observability in a POMDP, they are instead functions of the action-observation history:
$h_t = (o_0, a_1, o_1, \ldots, a_t, o_t)$
$\pi(h, a) = \Pr(a_{t+1} = a \mid h_t = h)$
$V^\pi(h) = \mathbb{E}_\pi\left[ R_t \mid h_t = h \right]$
Learning to Guide Tree Search
However,
• this approach would require a prohibitively large amount of data to train an effective value function, because there are infinitely many possible histories containing high-dimensional observations.
• since generating a dataset involves performing a tree search in a POMDP, such large-scale data generation would take a significant amount of time.
$h_t = (o_0, a_1, o_1, \ldots, a_t, o_t)$
$\pi(h, a) = \Pr(a_{t+1} = a \mid h_t = h)$
$V^\pi(h) = \mathbb{E}_\pi\left[ R_t \mid h_t = h \right]$
In a POMDP, the value function and policy are functions of a history.
How can we deal with this need for a large amount of data?
Two observations:
1. A search tree for a POMDP typically consists of a few success histories that led to a goal and a large number of other histories that did not.
2. All we need to efficiently guide a tree search is a ranking among the histories, specifying which one is more likely to lead to the goal, not their actual values.
Based on these two observations, we propose a value function learning algorithm that learns the ranking among histories.
Preference-based Learning
Given a search tree, we define:
• Success history: a history that reached a goal.
• Failure history: a history that did not reach a goal.
and learn a value function such that it prefers success histories over failure histories.
Preference-based Learning
However, this simple success-versus-failure preference labeling scheme does not capture any notion of optimality.
To incorporate optimality,
1. we create additional data by pairing two successful histories,
2. and then label the one that is closer to the goal as more preferred.
In scenarios with limited data, preference learning proves more robust than regression: it is less sensitive to the exact magnitudes of value differences and therefore more resilient to noise, whereas regression exhibits higher variance.
Proposed method: PGP (Preference-guided POMCPOW)
1. If the $i$-th history is part of a success history and the $j$-th history is not,
⇒ we prefer the $i$-th history.
2. If both are part of success histories and have an equal number of remaining time steps to the goal,
⇒ they are equally preferred.
3. If both are success histories but the $j$-th history is closer to the goal,
⇒ we prefer the $j$-th history (see the labeling sketch below).
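To make these three rules concrete, here is a minimal, hypothetical labeling sketch (not the authors' code); the `History` fields and the {1.0, 0.5, 0.0} label convention are illustrative assumptions.

```python
# Illustrative sketch: assigning a preference label to a pair of histories
# taken from a search tree, following the three rules above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class History:
    is_success: bool               # part of a history that reached the goal
    steps_to_goal: Optional[int]   # remaining time steps to the goal (None for failures)


def preference_label(h_i: History, h_j: History) -> Optional[float]:
    """1.0 if h_i is preferred, 0.0 if h_j is preferred, 0.5 if equal, None if unlabeled."""
    if h_i.is_success and not h_j.is_success:
        return 1.0                 # rule 1: success preferred over failure
    if h_j.is_success and not h_i.is_success:
        return 0.0                 # rule 1 (mirrored)
    if h_i.is_success and h_j.is_success:
        if h_i.steps_to_goal == h_j.steps_to_goal:
            return 0.5             # rule 2: equally close to the goal
        return 1.0 if h_i.steps_to_goal < h_j.steps_to_goal else 0.0  # rule 3
    return None                    # two failure histories: this pair is not used
```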
Proposed method: PGP (Preference-guided POMCPOW)
We train the Preference-V and Preference-Q functions, denoted $V$ and $Q$, by treating them as preference predictors.
1. Approximate the probability of preferring history $h_i$ over $h_j$ with a softmax over their values.
2. Learn the function by optimizing the parameters of $V$ with respect to the cross-entropy loss.
3. Train $V$ and $Q$ concurrently by designing the neural network to share the same backbone, with separate heads.
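A minimal training sketch follows, assuming PyTorch; the MLP backbone, layer sizes, and loss wiring are stand-ins for illustration, not the paper's transformer architecture.

```python
# Sketch: shared-backbone network with separate V and Q heads, trained with a
# softmax (Bradley-Terry) preference probability and a cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceNet(nn.Module):
    def __init__(self, hist_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # shared backbone over an encoded history (a transformer in the paper;
        # a small MLP stands in here for brevity)
        self.backbone = nn.Sequential(nn.Linear(hist_dim, hidden), nn.ReLU())
        self.v_head = nn.Linear(hidden, 1)             # Preference-V(h)
        self.q_head = nn.Linear(hidden + act_dim, 1)   # Preference-Q(h, a)

    def value(self, h):
        return self.v_head(self.backbone(h)).squeeze(-1)

    def q_value(self, h, a):
        return self.q_head(torch.cat([self.backbone(h), a], dim=-1)).squeeze(-1)


def preference_loss(net, h_i, h_j, label):
    # P(h_i preferred over h_j): softmax over the two predicted values
    logits = torch.stack([net.value(h_i), net.value(h_j)], dim=-1)
    p_i = F.softmax(logits, dim=-1)[..., 0]
    # cross-entropy against labels in {1.0, 0.5, 0.0} from the rules above
    return F.binary_cross_entropy(p_i, label)
```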
VAE Approximated Energy-based Q-function
For training a policy, we could, in principle, imitate the actions on success histories from past planning experiences using regression.
However,
• to facilitate efficient exploration during tree search, we need a multi-modal policy rather than a uni-modal policy [5];
• if we only imitate success histories, we are limited to a small number of histories per search tree, which is data-inefficient.
[Figure: supervised policy network training pipeline and architecture in AlphaGo [3].]
[5] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.
VAE Approximated Energy-based Q-function
We instead propose to use an energy-based model, where the energy is based on a Q-function; this effectively defines a multi-modal policy whose probability of selecting an action grows with its Q-value.
We train the Q-function simultaneously with the value function by implementing the two functions as two heads that share the same backbone transformer but take different inputs.
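As a sketch, the induced policy can be written as follows; treating the Preference-Q value as the (negative) energy and omitting any temperature term is an illustrative assumption, not taken verbatim from the paper.

```latex
% Energy-based policy induced by the Preference-Q function (illustrative):
% actions with higher Q(h, a) are exponentially more likely to be chosen.
\pi(a \mid h) \;=\;
  \frac{\exp\!\big(Q(h, a)\big)}{\int_{\mathcal{A}} \exp\!\big(Q(h, a')\big)\, \mathrm{d}a'}
```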
VAE Approximated Energy-based Q-function
The problem with an energy-based function over a continuous space is that exact sampling is intractable.
Instead, we propose to use a VAE to approximate the Preference-Q-based energy function.
At every gradient step, we uniformly sample $N$ actions at random and minimize a Q-function-weighted loss to train the VAE.
The higher the Q-value of an action, the higher its chance of being sampled from the VAE.
This approach
• generates higher-quality samples than uniform random sampling, and
• has a much faster inference time than MCMC sampling.
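A minimal sketch of one such gradient step follows, assuming PyTorch and a hypothetical conditional VAE with `encode`/`decode` methods; using a softmax over Q-values as the per-sample weight is an illustrative assumption.

```python
# Sketch: one Q-weighted training step for a conditional VAE action sampler.
import torch
import torch.nn.functional as F


def vae_training_step(vae, q_fn, h, action_low, action_high, n_samples=64):
    # uniformly sample N candidate actions within the action bounds
    act_dim = action_low.shape[-1]
    a = torch.rand(n_samples, act_dim) * (action_high - action_low) + action_low
    h_rep = h.expand(n_samples, -1)          # repeat the encoded history

    # weight each candidate by its (softmaxed) Preference-Q value, so actions
    # with higher Q contribute more to the reconstruction objective
    with torch.no_grad():
        w = F.softmax(q_fn(h_rep, a), dim=0)

    # standard conditional-VAE terms (reconstruction + KL), weighted per sample
    mu, logvar = vae.encode(a, h_rep)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    a_rec = vae.decode(z, h_rep)
    recon = ((a_rec - a) ** 2).sum(dim=-1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
    return (w * (recon + kl)).sum()
```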
Proposed method: PGP (Preference-guided POMCPOW)
4. We use importance sampling to train the policy $\pi_\theta$, by sampling actions from a uniform distribution and evaluating the importance weights according to $\pi_\theta$.
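One way to write this importance-sampled fitting objective is sketched below, with assumed notation: $p_\psi$ for the VAE sampler being trained and $u$ for the uniform proposal over the action space.

```latex
% Self-normalized importance-sampling estimate (sketch) of the policy-fitting
% objective: candidates a_i come from a uniform proposal u and are reweighted
% toward the energy-based target policy pi_theta.
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid h)}\!\left[\log p_\psi(a \mid h)\right]
\;\approx\; \sum_{i=1}^{N} w_i \,\log p_\psi(a_i \mid h),
\qquad
w_i = \frac{\pi_\theta(a_i \mid h)\,/\,u(a_i)}{\sum_{j=1}^{N} \pi_\theta(a_j \mid h)\,/\,u(a_j)},
\quad a_i \sim u .
```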
Experiments
Domains
Domain #1. Light-dark room
Objective: navigate to a goal position in a 2D plane with the minimum number of steps.
Domain #2. Object fetching with known object classes
Domain #3. Object fetching with unknown object classes
Objective (Domains #2 and #3): fetch the target object, which is completely occluded by a non-target object, from the cabinet and place it in the goal region in a minimum number of steps.
Experiments
2D Light-Dark room domain - Quantitative results
[Plots: success rate (%) and average number of time steps vs. number of simulations.]
Experiments
2D Light-Dark room domain - Qualitative results
[Panels: trajectories of IGP, SF-PGP, and PGP.]
PGP and SF-PGP (preference-based) reach the goal region in fewer time steps.
Experiments
Fetching domain with known object classes - Quantitative results
[Plots: success rate (%) and average number of time steps.]
Experiments
Diverse object fetching with unknown object classes (Simulation)
[Plot: success rate (%).]
Experiments
Real-world object fetching with unknown object classes - Qualitative results
• PGP achieves the highest success rate and the most near-optimal success trajectories in the real-robot experiment.
• IGP tends to rashly fetch the object at the last step, dropping the occluder and failing.