MoE
A Mixture-of-Expert Approach to RL-based
Dialogue Management
Yinlam Chow, Aza Tulepbergenov, Ofir Nachum et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan Chen
April 27, 2023
1 / 42
MoE
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
2 / 42
MoE
Abstract
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
3 / 42
MoE
Abstract
Abstract
Language models (LMs) still face challenges in dialogue management (DM)
and in carrying on rich, multi-turn conversations.
Reinforcement learning (RL) can be used to develop a dialogue agent that
avoids being short-sighted and instead maximizes overall user satisfaction.
However, such agents must still deal with a combinatorially complex action
space, even for a medium-sized vocabulary.
4 / 42
MoE
Abstract
Contribution
In this paper, they introduce an RL-based DM using a novel mixture of
experts language model (MoE-LM) that consists of
1 An LM capable of learning diverse semantics for conversation
histories
2 A number of specialized LMs (or experts) capable of generating
utterances corresponding to a particular attribute or personality
3 An RL-based DM that performs dialogue planning with the
utterances generated by the experts
This gives the agent greater flexibility to generate sensible utterances
with different intents and allows RL to focus on conversation-level DM.
5 / 42
MoE
Introduction
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
6 / 42
MoE
Introduction
Introduction
In natural language understanding and generation, since the system
needs to satisfy the user, a good dialogue agent should not only
generate natural responses, but also be capable of pursuing the task’s
objectives and adapting to the user’s feedback on-the-fly.
There are two main ways to build such a dialogue agent.
1 Behavioral cloning, where the agent is a language model (LM)
that imitates the utterances in the training set
2 Reinforcement learning (RL) to optimize the agent’s policy
7 / 42
MoE
Introduction
Challenges
With behavioral cloning, although these LMs produce fluent and
relevant responses, it is unclear how to control them to
systematically pursue goals over a multi-turn dialogue.
With reinforcement learning, the action space is typically captured by
hand-crafted representations, which cannot handle complex
conversations.
Another issue is that RL only optimizes a scalar reward, while these
methods often need to optimize multiple objectives at once, e.g., both the
quality of the generated utterance and the conversation-level goal.
8 / 42
MoE
Introduction
This paper proposes an RL-based DM agent using a novel mixture of
experts (MoE) approach, which has 3 main components.
1 An LM capable of learning diverse semantics for conversation
histories, and as a result generating diverse utterances, which
they refer to as the primitive LM, LM_0.
2 A number of specialized LMs (or experts), {LM_i}_{i=1}^{m}, each
constructed using the latent space learned by LM_0, but trained
so that it is capable of generating utterances corresponding to a
certain intent or personality.
3 An RL-based dialogue manager (DM) that, at each turn, given the
latent state shared by the experts {LM_i}_{i=0}^{m} and the utterance
action(s) they suggest, chooses one among them for the agent to
execute.
9 / 42
MoE
Preliminaries
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
10 / 42
MoE
Preliminaries
Language Models (LMs)
In this work, they employ seq2seq LMs to generate the next utterance
in a dialogue, given a dataset of the form D = {(X^(k), Y^(k))}_{k=1}^{|D|}.
Each X = X^(k) is an L-turn conversation history, X = {X_l}_{l=0}^{L−1},
and Y is its next utterance. The upper bound on the length (number of
tokens) of each utterance is N_X.
The LM first encodes the conversation history X using an encoder Φ
into an (L × N_X)-length sequence of embeddings {(z_{l,0}, . . . , z_{l,N_X−1})}_{l=0}^{L−1},
where each z_{l,n} is a vector in the latent space, and the next utterance
Ŷ = {ŷ_n}_{n=1}^{N} is sampled token-by-token from the decoder Ψ.
11 / 42
MoE
Preliminaries
Markov Decision Processes (MDPs)
M = (S, A, P, r, s0, 𝛾).
The state space S represents the tokenized conversation history and
the initial state s0 ∈ S is the initial user’s query.
The action space A is also the tokenized language space with each
action a ∈ A being the agent’s next utterance.
The transition kernel P models the user’s response to the action taken
by the agent (bot).
The reward function r measures the user’s satisfaction.
12 / 42
MoE
Preliminaries
Markov Decision Processes (MDPs)
For this task, they treat the entire LM as a policy that maps
conversation histories to next utterances, and solve the problem by
finding a policy 𝜋* with maximum expected discounted return:

𝜋* ∈ arg max_𝜋 J(𝜋) := E[ Σ_{t=0}^{∞} 𝛾^t r_t | P, s_0, 𝜋 ]

Because the tokenized state and action spaces grow exponentially with
the size of the vocabulary, it is quite desirable to develop a novel MDP
paradigm that is more amenable to RL-based DM systems.
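For concreteness, the return of a single rollout under this objective is just a discounted sum of per-turn rewards; a trivial sketch (the reward values in the example are placeholders):

```python
def discounted_return(rewards, gamma=0.8):
    """J = sum over t of gamma^t * r_t for one rollout of per-turn rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. discounted_return([0.1, 0.0, 0.5], gamma=0.8) = 0.1 + 0.0 + 0.64 * 0.5 = 0.42
```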
13 / 42
MoE
Mixture of Experts (MoE) Language Model
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
14 / 42
MoE
Mixture of Experts (MoE) Language Model
Mixture of Experts (MoE) Language Model
15 / 42
MoE
Mixture of Experts (MoE) Language Model
Primitive Discovery
First, they use the dataset D to learn a language model
LM_0 = (Φ, G_0, Ψ).
The stochastic encoder (Φ, G_0) comprises an encoder Φ that maps
tokenized conversation histories X to a latent space Z ⊆ R^d, which is
used to construct a parameterized d-dimensional Gaussian
distribution G_0(z′ | z) = N(𝜇_0(z), 𝜎_0^2(z) I_{d×d}).
The decoder predicts the next utterance Ŷ_0 conditioned on the point z′
sampled from this latent distribution.
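A minimal sketch of the stochastic encoder head, assuming a diagonal Gaussian parameterized by two linear layers on top of the history embedding z (module and variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class StochasticEncoderHead(nn.Module):
    """Sketch of G0: map an encoded history z to a Gaussian over z' and sample from it."""
    def __init__(self, d=64):  # d is the latent dimension (illustrative)
        super().__init__()
        self.mu = nn.Linear(d, d)         # μ0(z)
        self.log_sigma = nn.Linear(d, d)  # log σ0(z), diagonal covariance

    def forward(self, z):
        mu, sigma = self.mu(z), self.log_sigma(z).exp()
        dist = torch.distributions.Normal(mu, sigma)  # G0(z' | z) = N(μ0(z), σ0^2(z) I)
        z_prime = dist.rsample()                      # reparameterized sample, usable in training
        return z_prime, dist
```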
16 / 42
MoE
Mixture of Experts (MoE) Language Model
Expert Construction
Intuitively, each G_i corresponds to an attribute and generates samples
in specific parts of the latent space Z.
This results in m LMs, {LM_i}_{i=1}^{m} with LM_i = (Φ, G_i, Ψ), each of
which corresponds to a specialized version of the original LM, LM_0,
and serves as an expert in the MoE-LM.
Upon receiving a conversation history X, each expert LM_i generates one
candidate (or more) for the next utterance Ŷ_i in the parts of the
language space that are compatible with its attribute (personality).
17 / 42
MoE
Mixture of Experts (MoE) Language Model
Dialogue Manager (DM)
The dialogue manager, denoted by 𝜇, takes as input the encoded
conversation history z = Φ(X) and the candidate action utterances
generated by the experts, {Ŷ_i}_{i=0}^{m}, and selects one of them as the action
for the bot to execute, i.e., Ŷ ∼ 𝜇(· | z, {Ŷ_i}_{i=0}^{m}).
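As a sketch of this interface, the DM can be viewed as scoring each expert's candidate against the encoded history and sampling one index; the scoring network, the pre-encoded candidates, and the softmax parameterization here are assumptions (how the DM is actually trained is described in the RL section later):

```python
import torch

def dm_select(score_net, z, candidate_embs):
    """μ(· | z, {Ŷ_i}): score each expert's candidate utterance and sample one index."""
    # candidate_embs: list of m+1 pre-encoded candidate utterances (assumed given).
    feats = torch.stack([torch.cat([z, emb]) for emb in candidate_embs])  # (m+1, 2d)
    probs = torch.softmax(score_net(feats).squeeze(-1), dim=0)            # stochastic DM policy
    return torch.multinomial(probs, 1).item()                             # index of chosen expert
```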
18 / 42
MoE
Primitive Discovery in MoE-LM
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
19 / 42
MoE
Primitive Discovery in MoE-LM
Primitive Discovery in MoE-LM
They learn the primitive LM, LM_0, of their MoE-LM by solving the
following KL-constrained optimization problem, which aims at capturing
diverse semantics:

min_{(Φ,G_0,Ψ),𝜌} Ê_{z′∼𝜌(·|z,Y), z=Φ(X)} [−log Ψ(Y | z′)],
s.t. Ê_{z=Φ(X)} [KL(𝜌(z′ | z, Y) ∥ G_0(z′ | z))] ≤ 𝜖_KL

where 𝜖_KL is a positive real-valued threshold.
In other words, they learn LM_0 by maximizing the log-likelihood of the
next utterance, while enforcing consistency between the posterior
𝜌(z′ | z, Y) and the prior G_0(z′ | z) via the KL constraint.
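A minimal sketch of one training step for this objective, relaxing the KL constraint into a penalty with weight beta; the penalty form, the module interfaces, and the tensor shapes are assumptions rather than the paper's exact training code:

```python
import torch
import torch.nn.functional as F

def primitive_loss(phi, rho, g0, psi, X_tokens, Y_tokens, beta=1.0):
    """Reconstruction NLL of Y plus a KL penalty keeping the posterior ρ close to the prior G0."""
    z = phi(X_tokens)                        # z = Φ(X)
    post = rho(z, Y_tokens)                  # ρ(z' | z, Y), a torch Normal distribution
    prior = g0(z)                            # G0(z' | z), a torch Normal distribution
    z_prime = post.rsample()
    logits = psi(z_prime, Y_tokens)          # per-token logits for Y, shape (B, T, vocab)
    nll = F.cross_entropy(logits.transpose(1, 2), Y_tokens)           # -log Ψ(Y | z')
    kl = torch.distributions.kl_divergence(post, prior).sum(-1).mean()
    return nll + beta * kl                   # penalty form of the ε_KL constraint
```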
20 / 42
MoE
Expert Construction with Plug-and-Play Language Models
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
21 / 42
MoE
Expert Construction with Plug-and-Play Language Models
Expert Construction with Plug-and-Play Language Models
Denote by ℓ_i(X, Y) ∈ R a real-valued label that characterizes the
intent of expert i ∈ {1, . . . , m}, e.g., determined by an off-the-shelf
sentiment classifier. They train the latent distribution G_i of expert i
by solving the optimization problem

min_{G_i} Ê_{z′∼G_i(·|z), z=Φ(X), Y∼Ψ(·|z′)} [−ℓ_i(X, Y)]
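A sketch of one update for G_i under this objective, using a score-function (REINFORCE) gradient since ℓ_i is not differentiable through the sampled text; the estimator choice and all names are assumptions:

```python
import torch

def expert_update(g_i, phi, psi_sample, intent_label, X_batch, optimizer):
    """One gradient step pushing G_i toward latents whose decoded utterances score high under ℓ_i."""
    z = phi(X_batch)                             # encoded histories, shape (B, d)
    dist = g_i(z)                                # G_i(· | z), a torch Normal distribution
    z_prime = dist.sample()                      # sample latents (no gradient through sampling)
    utterances = psi_sample(z_prime)             # decode candidate utterances Y ~ Ψ(· | z')
    rewards = torch.tensor([intent_label(x, y) for x, y in zip(X_batch, utterances)])
    log_prob = dist.log_prob(z_prime).sum(-1)    # log G_i(z' | z)
    loss = -(rewards * log_prob).mean()          # maximize E[ℓ_i] via REINFORCE
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```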
22 / 42
MoE
Expert Construction with Plug-and-Play Language Models
Expert Construction with Plug-and-Play Language Models
They learn each expert via reward maximization, treating ℓ_i as the
reward signal for expert i. In this RL view, both the "state" and "action"
spaces are the latent space Z, and the "policy" is the latent
distribution G_i.
The main benefit of their approach is that it does not require the target
utterance Y from the data D and is thus less vulnerable to data-imbalance
issues in D on certain intents.
The main motivation is that they want each expert to possess
particular behaviors, which can readily be achieved via greedy
maximization; long-term dialogue optimization will be handled by
the dialogue manager rather than by the experts.
23 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
24 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
Reinforcement Learning for MoE-LM Dialogue Manager
This section describes the dialogue manager (DM) of their MoE-LM and
proposes RL algorithms to train it.
The DM is a policy 𝜇 that takes the encoded conversation history
z = Φ(X) and the m + 1 candidate action utterances generated by the
experts, {Ŷ_i}_{i=0}^{m}, and stochastically selects one of them to execute, i.e.,
Ŷ ∼ 𝜇(· | z, {Ŷ_i}_{i=0}^{m}).
Note that each expert i ∈ {0, . . . , m} is an LM, LM_i, that acts as a policy
𝜋_i(· | X) mapping each conversation history X to an utterance Ŷ_i.
25 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
MoE-MDP
M̄ = (S̄, Ā, P̄, r̄, s̄_0, 𝛾).
The state space of the MoE-MDP is the product of the learned latent space
Z and the joint action space of the m + 1 experts, i.e., S̄ = Z × A^{m+1}.
Its action space consists of the m + 1 experts, i.e., Ā = {0, . . . , m}.
Its initial state is the encoding of the initial user's query together with
the utterances suggested by the experts in response to this query. The
transition still models the user's responses but is now over the joint space
of the latent states and experts' actions.
The reward function is the same as in the original MDP, i.e.,
r̄(s̄, ā) = r(X, a_ā), where s̄ = (z, {a_i}_{i=0}^{m}) with a_i ∼ 𝜋_i(· | X) and
z = Φ(X), and ā ∈ {0, . . . , m} is the expert selected by the DM.
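A sketch of one MoE-MDP transition under these definitions, with the expert LMs, user model, DM policy, and reward function passed in as opaque callables (all names are illustrative assumptions):

```python
def moe_mdp_step(history, experts, phi, dm_policy, user_model, reward_fn):
    """Experts propose candidates, the DM picks one, the (simulated) user responds."""
    z = phi(history)
    candidates = [expert.respond(history) for expert in experts]   # a_i ~ π_i(· | X), i = 0..m
    a_bar = dm_policy(z, candidates)                               # DM action: index of an expert
    utterance = candidates[a_bar]
    user_reply = user_model.respond(history + [utterance])         # transition P̄ models the user
    next_history = history + [utterance, user_reply]
    reward = reward_fn(history, utterance, next_history)           # r(X, a_ā), user satisfaction
    return (z, candidates), a_bar, reward, next_history
```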
26 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
RL algorithms for Dialogue Manager (CQL)
The first one is conservative Q-learning (CQL), a popular offline RL
algorithm.
CQL is a regularization scheme that learns a conservative Q-function
that lower-bounds the true one.
Here, the CQL regularization minimizes the difference between the
Q-values of their DM and those of the primitive. The DM policy is the
softmax of the learned Q-values:

𝜇(ā | s̄) ∝ exp(Q_𝜃(s̄, ā))
27 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
RL algorithms for Dialogue Manager (CQL)
Given the offline conversation data D, they parameterize the
Q-function by parameter 𝜃 and learn 𝜃 by minimizing the Bellman
error with behavior regularization:

min_𝜃 Σ_{(s̄, ā, r̄, s̄⁺) ∈ D} 𝛼 ( E_{ā∼𝜇}[Q_𝜃(s̄, ā)] − Q_𝜃(s̄, a_0) )
  + ( r̄ + 𝛾 Q_{𝜃_target}(s̄⁺, arg max_{ā⁺∈Ā} Q_𝜃(s̄⁺, ā⁺)) − Q_𝜃(s̄, ā) )²

where a_0 ∼ 𝜋_0 is the action suggested by the primitive LM, 𝛼 > 0 is a
regularization parameter, and 𝜃_target is the target Q-function parameter.
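A sketch of this loss, assuming a Q-network that maps an MoE-MDP state to one value per expert and that index 0 is the primitive; the batch layout and the network interface are assumptions:

```python
import torch
import torch.nn.functional as F

def cql_dm_loss(q_net, q_target, batch, gamma=0.8, alpha=1.0):
    """Behavior-regularized Bellman loss for the DM's Q-function (sketch)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    q_all = q_net(s)                                     # shape (B, m+1): Q_θ(s, ·)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)    # Q_θ(s, a) at the logged expert choice
    # Regularizer: expected Q under the softmax DM policy minus Q at the primitive (expert 0).
    mu = torch.softmax(q_all, dim=1)
    conservative = (mu * q_all).sum(1) - q_all[:, 0]
    # Double-Q style target: argmax with q_net, evaluate with the target network.
    with torch.no_grad():
        a_next = q_net(s_next).argmax(1, keepdim=True)
        target = r + gamma * q_target(s_next).gather(1, a_next).squeeze(1)
    bellman = F.mse_loss(q_sa, target, reduction="none")
    return (alpha * conservative + bellman).mean()
```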
28 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
RL algorithms for Dialogue Manager (MBRL)
The second RL algorithm they use is model-based RL (MBRL).
They learn a user utterance model
P_user(X⁺ | X, a) := E_{z=Φ_user([X,a])} [Ψ_user(X⁺ | z)] via maximum
likelihood, then generate data D_MB, whose next state ŝ⁺ encodes the
next conversation generated from roll-outs together with the corresponding
candidate actions, and finally solve the Bellman error minimization in the
MoE-MDP:

min_𝜃 Σ_{(s̄, ā, r̄, ŝ⁺) ∈ D_MB} ( r̄ + 𝛾 Q_{𝜃_target}(ŝ⁺, arg max_{ā⁺∈Ā} Q_𝜃(ŝ⁺, ā⁺)) − Q_𝜃(s̄, ā) )²
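A sketch of the maximum-likelihood step for the user utterance model P_user; the encoder/decoder interfaces mirror the earlier LM sketch and are assumptions:

```python
import torch
import torch.nn.functional as F

def user_model_mle_step(phi_user, psi_user, X_tokens, a_tokens, X_plus_tokens, optimizer):
    """One MLE step for P_user(X+ | X, a): encode [X, a], decode X+, minimize token NLL."""
    z = phi_user(torch.cat([X_tokens, a_tokens], dim=1))             # encode the pair [X, a]
    logits = psi_user(z, X_plus_tokens)                              # teacher-forced logits (B, T, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), X_plus_tokens)    # -log Ψ_user(X+ | z)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```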
29 / 42
MoE
Experiments
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
30 / 42
MoE
Experiments
Experiments Setup
They evaluate their MoE approach to dialogue management on two
benchmark open-domain tasks:
Cornell, which consists of conversations between movie characters
and has a median conversation length of 3 utterances
Reddit, a casual conversation corpus on various topics between
users, with at least 3 turns and a median conversation containing
7 utterances
They evaluate
1 The predictive power and diversity of the primitive
2 The quality of the experts
3 The overall DM performance
31 / 42
MoE
Experiments
MoE-LMs
They ran an ablation study on 4 transformer-based MoE-LMs.
MoE-1: uses a simpler architecture, with smaller latent distribution
models {G_i} than MoE-2
MoE-2: uses a simpler architecture
MoE-3: uses the same encoder architecture as BERT, with pre-trained
weights
MoE-4: uses the same encoder architecture as BERT, but trains it from
scratch
32 / 42
MoE
Experiments
EXP 1: Comparing Primitive Models
To assess their quality, for each test conversation they generated 25
utterances and reported the following 3 metrics:
1 Diversity: Sparsity of the singular values of the embedded
utterances
2 Dist-{1, 2, 3}: Ratio of unique {1, 2, 3}-grams in the generated
utterances (a small sketch follows this list)
3 Perplexity
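As a concrete reading of the Dist-n metric, a small sketch under the common definition (unique n-grams divided by total n-grams over the generated set); whether the paper normalizes in exactly this way is an assumption:

```python
def dist_n(utterances, n):
    """Dist-n: number of unique n-grams divided by the total number of generated n-grams."""
    ngrams = [tuple(toks[i:i + n]) for toks in (u.split() for u in utterances)
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

# e.g. dist_n(["how are you", "how are things"], 2) -> 3 unique bigrams / 4 total = 0.75
```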
33 / 42
MoE
Experiments
EXP 1: Comparing Primitive Models
34 / 42
MoE
Experiments
EXP 2: Quality of Experts
They use the following label functions to define the intents of experts:
1 ℓpos-sent (Y), ℓneg-sent (Y), and ℓjoy (Y), ℓoptimism (Y), ℓanger (Y),
ℓsadness (Y) quantify 6 different sentiment tones and are
constructed by a RoBERTa-based sentiment detector
2 ℓsent-coh (X, Y) measures empathy, i.e., bot’s sentiment coherence
with user’s
3 ℓquestion (Y) outputs 1 when a bot question is detected and 0
otherwise
4 ℓexp(X, Y) quantifies exploration, i.e., the tendency to avoid
repetitive contexts
35 / 42
MoE
Experiments
EXP 2: Quality of Experts
36 / 42
MoE
Experiments
EXP 3: MoE-RL Against DialoGPT Simulated Users
The DM task is to maximize total user satisfaction at the conversation
level, which is measured by both
1 The user's overall sentiment
2 The user's sentiment transition
To construct an immediate reward that serves as a surrogate for user
satisfaction, they set

r(X, a, X⁺) = 𝜆_1 ℓ_sent(X⁺) + 𝜆_2 ( ℓ_sent(X⁺) − (1 − 𝛾)/(1 − 𝛾^L) Σ_{l=0}^{L−1} 𝛾^l ℓ_sent(X_l) )

where the linear combination weights are (𝜆_1, 𝜆_2) = (0.75, 0.25).
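A direct transcription of this reward into a small helper, where next_sent is ℓ_sent(X⁺) and history_sents holds the scores ℓ_sent(X_l) for l = 0..L−1; the variable names are illustrative:

```python
def surrogate_reward(next_sent, history_sents, gamma=0.8, lam1=0.75, lam2=0.25):
    """λ1 * sentiment of the user's reply + λ2 * its improvement over a discounted,
    normalized average of the past per-utterance sentiment scores."""
    L = len(history_sents)
    baseline = (1 - gamma) / (1 - gamma ** L) * sum(gamma ** l * s for l, s in enumerate(history_sents))
    return lam1 * next_sent + lam2 * (next_sent - baseline)

# e.g. surrogate_reward(0.5, [-0.2, 0.1, 0.3])  # reward for a turn that improved user sentiment
```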
37 / 42
MoE
Experiments
EXP 3: MoE-RL Against DialoGPT Simulated Users
ℓ_sent(X) is the same RoBERTa-based sentiment labeler as in EXP 2,
which assigns a score from [−1, 1] that is proportional to the positive
sentiment and inversely proportional to the negative sentiment
prediction probabilities.
Since the DM problem lasts at most 5 turns, they use this as the
effective horizon and set 𝛾 = 1 − 1/5 = 0.8.
38 / 42
MoE
Experiments
EXP 3: MoE-RL Against DialoGPT Simulated Users
39 / 42
MoE
Concluding Remarks
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
40 / 42
MoE
Concluding Remarks
Concluding Remarks
This paper presents a mixture-of-experts (MoE) approach for RL-based
dialogue management (DM), built from three components:
1 An LM that can learn diverse semantics for conversation
histories
2 A number of specialized LMs (or experts) that can produce
utterances corresponding to a particular attribute or intent
3 An RL-based DM that performs dialogue planning with the
utterances generated by the experts
Their method improves
1 The diversity of text generation
2 The ability to generate utterances with specific intents
3 The overall DM performance
41 / 42
MoE
Concluding Remarks
Future works
Improving the language representation with information-theoretic
approaches
Fine-tuning the experts based on the DM objective
Extending the RL agent to track users’ behaviors (via abstract
belief states) and plan upon them
Preventing RL dialogue agents from generating harmful
behaviors
Evaluating their MoE-LM on more realistic problems, such as
information retrieval, recommendation, and negotiation
42 / 42
