MoE
A Mixture-of-Expert Approach to RL-based
Dialogue Management
Yinlam Chow, Aza Tulepbergenov, Ofir Nachum et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan Chen
April 27, 2023
1 / 42
MoE
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
2 / 42
MoE
Abstract
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
3 / 42
MoE
Abstract
Abstract
Language models (LMs) still face challenges in dialogue management (DM)
and in carrying on rich, multi-turn conversations.
Reinforcement learning (RL) can be used to develop a dialogue agent that
avoids being short-sighted and instead maximizes overall user satisfaction.
However, such agents must still deal with a combinatorially complex action
space, even for a medium-sized vocabulary.
4 / 42
MoE
Abstract
Contribution
In this paper, they introduce an RL-based DM using a novel mixture of
experts language model (MoE-LM) that consists of
1 An LM capable of learning diverse semantics for conversation
histories
2 A number of specialized LMs (or experts) capable of generating
utterances corresponding to a particular attribute or personality
3 An RL-based DM that performs dialogue planning with the
utterances generated by the experts
This gives the agent greater flexibility to generate sensible utterances
with different intents and allows RL to focus on conversation-level DM.
5 / 42
MoE
Introduction
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
6 / 42
MoE
Introduction
Introduction
In natural language understanding and generation, since the system
needs to satisfy the user, a good dialogue agent should not only
generate natural responses, but also be capable of pursuing the task’s
objectives and adapting to the user’s feedback on-the-fly.
There are two main ways to build such a dialogue agent.
1 Behavioral cloning, where the agent is a language model (LM)
that imitates the utterances in the training set
2 Reinforcement learning (RL) to optimize the agent’s policy
7 / 42
MoE
Introduction
Challenges
With behavioral cloning, although these LMs produce fluent and
relevant responses, it is unclear how to control them to
systematically pursue goals over a multi-turn dialogue.
With reinforcement learning, the action space is typically captured by
hand-crafted representations, which cannot handle complex
conversations.
Another issue is that RL only optimizes a scalar reward, while these
methods often need to optimize multiple objectives at once, e.g., both the
quality of the generated utterance and the conversation-level goal.
8 / 42
MoE
Introduction
This paper proposes an RL-based DM agent using a novel mixture of
experts (MoE) approach, which has 3 main components.
1 An LM capable of learning diverse semantics for conversation
histories, and as a result generating diverse utterances, which
they refer to as the primitive LM, LM_0.
2 A number of specialized LMs (or experts), {LM_i}_{i=1}^{m}, each
constructed using the latent space learned by LM_0, but trained
so that it is capable of generating utterances corresponding to a
certain intent or personality.
3 An RL-based dialogue manager (DM) that, at each turn, given the
latent state shared by the experts {LM_i}_{i=0}^{m} and the utterance
action(s) they suggest, chooses one among them for the agent to
execute.
9 / 42
MoE
Preliminaries
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
10 / 42
MoE
Preliminaries
Language Models (LMs)
In this work, they employ seq2seq LMs to generate the next utterance
in a dialogue, given a dataset of the form D = {(X^(k), Y^(k))}_{k=1}^{|D|}.
Each X = X^(k) is an L-turn conversation history, X = {X_l}_{l=0}^{L−1},
and Y is its next utterance. The upper bound on the length (number of
tokens) of each utterance is N_X.
The LM first encodes the conversation history X using an encoder Φ
into an (L × N_X)-length sequence of embeddings {(z_{l,0}, . . . , z_{l,N_X−1})}_{l=0}^{L−1},
where each z_{l,n} is a vector in the latent space, and the next utterance
Ŷ = {ŷ_n}_{n=1}^{N} is sampled token-by-token from the decoder Ψ.
11 / 42
MoE
Preliminaries
Markov Decision Processes (MDPs)
M = (S, A, P, r, s0, 𝛾).
The state space S represents the tokenized conversation history and
the initial state s0 ∈ S is the initial user’s query.
The action space A is also the tokenized language space with each
action a ∈ A being the agent’s next utterance.
The transition kernel P models the user’s response to the action taken
by the agent (bot).
The reward function r measures the user’s satisfaction.
12 / 42
MoE
Preliminaries
Markov Decision Processes (MDPs)
For this task, they treat the entire LM as a policy that maps
conversation histories to next utterances, and solve the problem by
finding a policy 𝜋* with maximum expected discounted return:

𝜋* ∈ arg max_𝜋 J(𝜋) := E[ Σ_{t=0}^{∞} 𝛾^t r_t | P, s_0, 𝜋 ]

Because the tokenized state and action spaces grow exponentially with
the size of the vocabulary, it is quite desirable to develop a novel MDP
paradigm that is more amenable to RL-based DM systems.
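For concreteness, the return of a single rollout under this objective is just a discounted sum of per-turn rewards; a trivial sketch (the reward values in the example are placeholders):

```python
def discounted_return(rewards, gamma=0.8):
    """J = sum over t of gamma^t * r_t for one rollout of per-turn rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. discounted_return([0.1, 0.0, 0.5], gamma=0.8) = 0.1 + 0.0 + 0.64 * 0.5 = 0.42
```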
13 / 42
MoE
Mixture of Experts (MoE) Language Model
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
14 / 42
MoE
Mixture of Experts (MoE) Language Model
Mixture of Experts (MoE) Language Model
15 / 42
MoE
Mixture of Experts (MoE) Language Model
Primitive Discovery
First, they use the dataset D to learn a language model
LM_0 = (Φ, G_0, Ψ).
The stochastic encoder (Φ, G_0) comprises an encoder Φ that maps
tokenized conversation histories X to a latent space Z ⊆ R^d, which is
used to construct a parameterized d-dimensional Gaussian
distribution G_0(z′ | z) = N(𝜇_0(z), 𝜎_0^2(z) I_{d×d}).
The decoder predicts the next utterance Ŷ_0 conditioned on the point z′
sampled from this latent distribution.
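A minimal sketch of the stochastic encoder head, assuming a diagonal Gaussian parameterized by two linear layers on top of the history embedding z (module and variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class StochasticEncoderHead(nn.Module):
    """Sketch of G0: map an encoded history z to a Gaussian over z' and sample from it."""
    def __init__(self, d=64):  # d is the latent dimension (illustrative)
        super().__init__()
        self.mu = nn.Linear(d, d)         # μ0(z)
        self.log_sigma = nn.Linear(d, d)  # log σ0(z), diagonal covariance

    def forward(self, z):
        mu, sigma = self.mu(z), self.log_sigma(z).exp()
        dist = torch.distributions.Normal(mu, sigma)  # G0(z' | z) = N(μ0(z), σ0^2(z) I)
        z_prime = dist.rsample()                      # reparameterized sample, usable in training
        return z_prime, dist
```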
16 / 42
MoE
Mixture of Experts (MoE) Language Model
Expert Construction
Intuitively, each G_i corresponds to an attribute and generates samples
in specific parts of the latent space Z.
This results in m LMs, {LM_i}_{i=1}^{m} with LM_i = (Φ, G_i, Ψ), each of
which corresponds to a specialized version of the original LM, LM_0,
and serves as an expert in the MoE-LM.
Upon receiving a conversation history X, each expert LM_i generates one
candidate (or more) for the next utterance Ŷ_i in the parts of the
language space that are compatible with its attribute (personality).
17 / 42
MoE
Mixture of Experts (MoE) Language Model
Dialogue Manager (DM)
The dialogue manager, denoted by 𝜇, takes as input the encoded
conversation history z = Φ(X) and the candidate action utterances
generated by the experts, {Ŷ_i}_{i=0}^{m}, and selects one of them as the action
for the bot to execute, i.e., Ŷ ∼ 𝜇(· | z, {Ŷ_i}_{i=0}^{m}).
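As a sketch of this interface, the DM can be viewed as scoring each expert's candidate against the encoded history and sampling one index; the scoring network, the pre-encoded candidates, and the softmax parameterization here are assumptions (how the DM is actually trained is described in the RL section later):

```python
import torch

def dm_select(score_net, z, candidate_embs):
    """μ(· | z, {Ŷ_i}): score each expert's candidate utterance and sample one index."""
    # candidate_embs: list of m+1 pre-encoded candidate utterances (assumed given).
    feats = torch.stack([torch.cat([z, emb]) for emb in candidate_embs])  # (m+1, 2d)
    probs = torch.softmax(score_net(feats).squeeze(-1), dim=0)            # stochastic DM policy
    return torch.multinomial(probs, 1).item()                             # index of chosen expert
```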
18 / 42
MoE
Primitive Discovery in MoE-LM
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
19 / 42
MoE
Primitive Discovery in MoE-LM
Primitive Discovery in MoE-LM
They learn the primitive LM, LM_0, of their MoE-LM by solving the
following KL-constrained optimization problem, which aims at capturing
diverse semantics:

min_{(Φ,G_0,Ψ),𝜌} Ê_{z′∼𝜌(·|z,Y), z=Φ(X)} [−log Ψ(Y | z′)],
s.t. Ê_{z=Φ(X)} [KL(𝜌(z′ | z, Y) ∥ G_0(z′ | z))] ≤ 𝜖_KL

where 𝜖_KL is a positive real-valued threshold.
In other words, they learn LM_0 by maximizing the log-likelihood of the
next utterance, while enforcing consistency between the posterior
𝜌(z′ | z, Y) and the prior G_0(z′ | z) via the KL constraint.
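A minimal sketch of one training step for this objective, relaxing the KL constraint into a penalty with weight beta; the penalty form, the module interfaces, and the tensor shapes are assumptions rather than the paper's exact training code:

```python
import torch
import torch.nn.functional as F

def primitive_loss(phi, rho, g0, psi, X_tokens, Y_tokens, beta=1.0):
    """Reconstruction NLL of Y plus a KL penalty keeping the posterior ρ close to the prior G0."""
    z = phi(X_tokens)                        # z = Φ(X)
    post = rho(z, Y_tokens)                  # ρ(z' | z, Y), a torch Normal distribution
    prior = g0(z)                            # G0(z' | z), a torch Normal distribution
    z_prime = post.rsample()
    logits = psi(z_prime, Y_tokens)          # per-token logits for Y, shape (B, T, vocab)
    nll = F.cross_entropy(logits.transpose(1, 2), Y_tokens)           # -log Ψ(Y | z')
    kl = torch.distributions.kl_divergence(post, prior).sum(-1).mean()
    return nll + beta * kl                   # penalty form of the ε_KL constraint
```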
20 / 42
MoE
Expert Construction with Plug-and-Play Language Models
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
21 / 42
MoE
Expert Construction with Plug-and-Play Language Models
Expert Construction with Plug-and-Play Language Models
Denote by ℓ_i(X, Y) ∈ R a real-valued label that characterizes the
intent of expert i ∈ {1, . . . , m}, e.g., determined by an off-the-shelf
sentiment classifier. They train the latent distribution G_i of expert i
by solving the optimization problem

min_{G_i} Ê_{z′∼G_i(·|z), z=Φ(X), Y∼Ψ(·|z′)} [−ℓ_i(X, Y)]
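A sketch of one update for G_i under this objective, using a score-function (REINFORCE) gradient since ℓ_i is not differentiable through the sampled text; the estimator choice and all names are assumptions:

```python
import torch

def expert_update(g_i, phi, psi_sample, intent_label, X_batch, optimizer):
    """One gradient step pushing G_i toward latents whose decoded utterances score high under ℓ_i."""
    z = phi(X_batch)                             # encoded histories, shape (B, d)
    dist = g_i(z)                                # G_i(· | z), a torch Normal distribution
    z_prime = dist.sample()                      # sample latents (no gradient through sampling)
    utterances = psi_sample(z_prime)             # decode candidate utterances Y ~ Ψ(· | z')
    rewards = torch.tensor([intent_label(x, y) for x, y in zip(X_batch, utterances)])
    log_prob = dist.log_prob(z_prime).sum(-1)    # log G_i(z' | z)
    loss = -(rewards * log_prob).mean()          # maximize E[ℓ_i] via REINFORCE
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```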
22 / 42
MoE
Expert Construction with Plug-and-Play Language Models
Expert Construction with Plug-and-Play Language Models
They learn each expert via reward maximization, treating ℓ_i as the
reward signal for expert i. In this RL view, both the "state" and "action"
spaces are the latent space Z, and the "policy" is the latent
distribution G_i.
The main benefit of their approach is that it does not require the target
utterance Y from the data D and is thus less vulnerable to data-imbalance
issues in D on certain intents.
The main motivation is that they want each expert to possess
particular behaviors, which can readily be achieved via greedy
maximization; long-term dialogue optimization will be handled by
the dialogue manager rather than by the experts.
23 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
24 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
Reinforcement Learning for MoE-LM Dialogue Manager
This section describes the dialogue manager (DM) of their MoE-LM and
proposes RL algorithms to train it.
The DM is a policy 𝜇 that takes the encoded conversation history
z = Φ(X) and the m + 1 candidate action utterances generated by the
experts, {Ŷ_i}_{i=0}^{m}, and stochastically selects one of them to execute, i.e.,
Ŷ ∼ 𝜇(· | z, {Ŷ_i}_{i=0}^{m}).
Note that each expert i ∈ {0, . . . , m} is an LM, LM_i, that acts as a policy
𝜋_i(· | X) mapping each conversation history X to an utterance Ŷ_i.
25 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
MoE-MDP
M̄ = (S̄, Ā, P̄, r̄, s̄_0, 𝛾).
The state space of the MoE-MDP is the product of the learned latent space
Z and the joint action space of the m + 1 experts, i.e., S̄ = Z × A^{m+1}.
Its action space consists of the m + 1 experts, i.e., Ā = {0, . . . , m}.
Its initial state is the encoding of the initial user's query together with
the utterances suggested by the experts in response to this query. The
transition still models the user's responses but is now over the joint space
of the latent states and experts' actions.
The reward function is the same as in the original MDP, i.e.,
r̄(s̄, ā) = r(X, a_ā), where s̄ = (z, {a_i}_{i=0}^{m}) with a_i ∼ 𝜋_i(· | X) and
z = Φ(X), and ā ∈ {0, . . . , m} is the expert selected by the DM.
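A sketch of one MoE-MDP transition under these definitions, with the expert LMs, user model, DM policy, and reward function passed in as opaque callables (all names are illustrative assumptions):

```python
def moe_mdp_step(history, experts, phi, dm_policy, user_model, reward_fn):
    """Experts propose candidates, the DM picks one, the (simulated) user responds."""
    z = phi(history)
    candidates = [expert.respond(history) for expert in experts]   # a_i ~ π_i(· | X), i = 0..m
    a_bar = dm_policy(z, candidates)                               # DM action: index of an expert
    utterance = candidates[a_bar]
    user_reply = user_model.respond(history + [utterance])         # transition P̄ models the user
    next_history = history + [utterance, user_reply]
    reward = reward_fn(history, utterance, next_history)           # r(X, a_ā), user satisfaction
    return (z, candidates), a_bar, reward, next_history
```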
26 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
RL algorithms for Dialogue Manager (CQL)
The first one is conservative Q-learning (CQL), a popular offline RL
algorithm.
CQL is a regularization scheme that learns a conservative Q-function
that lower-bounds the true one.
Here, the CQL regularization minimizes the difference between the
Q-values of their DM and those of the primitive. The DM policy is the
softmax of the learned Q-values:

𝜇(ā | s̄) ∝ exp(Q_𝜃(s̄, ā))
27 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
RL algorithms for Dialogue Manager (CQL)
Given the offline conversation data D, they parameterize the
Q-function by parameter 𝜃 and learn 𝜃 by minimizing the Bellman
error with behavior regularization:

min_𝜃 Σ_{(s̄, ā, r̄, s̄⁺) ∈ D} 𝛼 ( E_{ā∼𝜇}[Q_𝜃(s̄, ā)] − Q_𝜃(s̄, a_0) )
  + ( r̄ + 𝛾 Q_{𝜃_target}(s̄⁺, arg max_{ā⁺∈Ā} Q_𝜃(s̄⁺, ā⁺)) − Q_𝜃(s̄, ā) )²

where a_0 ∼ 𝜋_0 is the action suggested by the primitive LM, 𝛼 > 0 is a
regularization parameter, and 𝜃_target is the target Q-function parameter.
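A sketch of this loss, assuming a Q-network that maps an MoE-MDP state to one value per expert and that index 0 is the primitive; the batch layout and the network interface are assumptions:

```python
import torch
import torch.nn.functional as F

def cql_dm_loss(q_net, q_target, batch, gamma=0.8, alpha=1.0):
    """Behavior-regularized Bellman loss for the DM's Q-function (sketch)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    q_all = q_net(s)                                     # shape (B, m+1): Q_θ(s, ·)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)    # Q_θ(s, a) at the logged expert choice
    # Regularizer: expected Q under the softmax DM policy minus Q at the primitive (expert 0).
    mu = torch.softmax(q_all, dim=1)
    conservative = (mu * q_all).sum(1) - q_all[:, 0]
    # Double-Q style target: argmax with q_net, evaluate with the target network.
    with torch.no_grad():
        a_next = q_net(s_next).argmax(1, keepdim=True)
        target = r + gamma * q_target(s_next).gather(1, a_next).squeeze(1)
    bellman = F.mse_loss(q_sa, target, reduction="none")
    return (alpha * conservative + bellman).mean()
```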
28 / 42
MoE
Reinforcement Learning for MoE-LM Dialogue Manager
RL algorithms for Dialogue Manager (MBRL)
The second RL algorithm they use is model-based RL (MBRL).
They learn a user utterance model
P_user(X⁺ | X, a) := E_{z=Φ_user([X,a])} [Ψ_user(X⁺ | z)] via maximum
likelihood, then generate data D_MB, whose next state ŝ⁺ encodes the
next conversation generated from roll-outs together with the corresponding
candidate actions, and finally solve the Bellman error minimization in the
MoE-MDP:

min_𝜃 Σ_{(s̄, ā, r̄, ŝ⁺) ∈ D_MB} ( r̄ + 𝛾 Q_{𝜃_target}(ŝ⁺, arg max_{ā⁺∈Ā} Q_𝜃(ŝ⁺, ā⁺)) − Q_𝜃(s̄, ā) )²
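A sketch of the maximum-likelihood step for the user utterance model P_user; the encoder/decoder interfaces mirror the earlier LM sketch and are assumptions:

```python
import torch
import torch.nn.functional as F

def user_model_mle_step(phi_user, psi_user, X_tokens, a_tokens, X_plus_tokens, optimizer):
    """One MLE step for P_user(X+ | X, a): encode [X, a], decode X+, minimize token NLL."""
    z = phi_user(torch.cat([X_tokens, a_tokens], dim=1))             # encode the pair [X, a]
    logits = psi_user(z, X_plus_tokens)                              # teacher-forced logits (B, T, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), X_plus_tokens)    # -log Ψ_user(X+ | z)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```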
29 / 42
MoE
Experiments
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
30 / 42
MoE
Experiments
Experiments Setup
They evaluate their MoE approach to dialogue management on two
benchmark open-domain tasks:
Cornell, which consists of conversations between movie characters
and has a median conversation length of 3 utterances
Reddit, a casual conversation corpus on various topics between
users, with at least 3 turns and a median conversation containing
7 utterances
They evaluate
1 The predictive power and diversity of the primitive
2 The quality of the experts
3 The overall DM performance
31 / 42
MoE
Experiments
MoE-LMs
They ran an ablation study on 4 transformer-based MoE-LMs.
MoE-1: uses a simpler architecture, with smaller latent distribution
models {G_i} than MoE-2
MoE-2: uses a simpler architecture
MoE-3: uses the same encoder architecture as BERT, with pre-trained
weights
MoE-4: uses the same encoder architecture as BERT, but trains it from
scratch
32 / 42
MoE
Experiments
EXP 1: Comparing Primitive Models
To assess their quality, for each test conversation they generated 25
utterances and reported the following 3 metrics:
1 Diversity: Sparsity of the singular values of the embedded
utterances
2 Dist-{1, 2, 3}: Ratio of unique {1, 2, 3}-grams in the generated
utterances (a small sketch follows this list)
3 Perplexity
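As a concrete reading of the Dist-n metric, a small sketch under the common definition (unique n-grams divided by total n-grams over the generated set); whether the paper normalizes in exactly this way is an assumption:

```python
def dist_n(utterances, n):
    """Dist-n: number of unique n-grams divided by the total number of generated n-grams."""
    ngrams = [tuple(toks[i:i + n]) for toks in (u.split() for u in utterances)
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

# e.g. dist_n(["how are you", "how are things"], 2) -> 3 unique bigrams / 4 total = 0.75
```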
33 / 42
MoE
Experiments
EXP 1: Comparing Primitive Models
34 / 42
MoE
Experiments
EXP 2: Quality of Experts
They use the following label functions to define the intents of experts:
1 ℓpos-sent (Y), ℓneg-sent (Y), and ℓjoy (Y), ℓoptimism (Y), ℓanger (Y),
ℓsadness (Y) quantify 6 different sentiment tones and are
constructed by a RoBERTa-based sentiment detector
2 ℓsent-coh (X, Y) measures empathy, i.e., bot’s sentiment coherence
with user’s
3 ℓquestion (Y) outputs 1 when a bot question is detected and 0
otherwise
4 ℓexp(X, Y) quantifies exploration, i.e., the tendency to avoid
repetitive contexts
35 / 42
MoE
Experiments
EXP 2: Quality of Experts
36 / 42
MoE
Experiments
EXP 3: MoE-RL Against DialoGPT Simulated Users
The DM task is to maximize total user satisfaction at the conversation
level, which is measured by both
1 The user's overall sentiment
2 The user's sentiment transition
To construct an immediate reward that serves as a surrogate for user
satisfaction, they set

r(X, a, X⁺) = 𝜆_1 ℓ_sent(X⁺) + 𝜆_2 ( ℓ_sent(X⁺) − (1 − 𝛾)/(1 − 𝛾^L) Σ_{l=0}^{L−1} 𝛾^l ℓ_sent(X_l) )

where the linear combination weights are (𝜆_1, 𝜆_2) = (0.75, 0.25).
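A direct transcription of this reward into a small helper, where next_sent is ℓ_sent(X⁺) and history_sents holds the scores ℓ_sent(X_l) for l = 0..L−1; the variable names are illustrative:

```python
def surrogate_reward(next_sent, history_sents, gamma=0.8, lam1=0.75, lam2=0.25):
    """λ1 * sentiment of the user's reply + λ2 * its improvement over a discounted,
    normalized average of the past per-utterance sentiment scores."""
    L = len(history_sents)
    baseline = (1 - gamma) / (1 - gamma ** L) * sum(gamma ** l * s for l, s in enumerate(history_sents))
    return lam1 * next_sent + lam2 * (next_sent - baseline)

# e.g. surrogate_reward(0.5, [-0.2, 0.1, 0.3])  # reward for a turn that improved user sentiment
```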
37 / 42
MoE
Experiments
EXP 3: MoE-RL Against DialoGPT Simulated Users
ℓ_sent(X) is the same RoBERTa-based sentiment labeler as in EXP 2,
which assigns a score from [−1, 1] that is proportional to the positive
sentiment and inversely proportional to the negative sentiment
prediction probabilities.
Since the DM problem lasts at most 5 turns, they use this as the
effective horizon and set 𝛾 = 1 − 1/5 = 0.8.
38 / 42
MoE
Experiments
EXP 3: MoE-RL Against DialoGPT Simulated Users
39 / 42
MoE
Concluding Remarks
Table of contents
1 Abstract
2 Introduction
3 Preliminaries
4 Mixture of Experts (MoE) Language Model
5 Primitive Discovery in MoE-LM
6 Expert Construction with Plug-and-Play Language Models
7 Reinforcement Learning for MoE-LM Dialogue Manager
8 Experiments
9 Concluding Remarks
40 / 42
MoE
Concluding Remarks
Concluding Remarks
This paper presents a mixture-of-experts (MoE) approach for RL-based
dialogue management (DM), built from three components:
1 An LM that can learn diverse semantics for conversation
histories
2 A number of specialized LMs (or experts) that can produce
utterances corresponding to a particular attribute or intent
3 An RL-based DM that performs dialogue planning with the
utterances generated by the experts
Their method improves
1 The diversity of text generation
2 The ability to generate utterances with specific intents
3 The overall DM performance
41 / 42
MoE
Concluding Remarks
Future works
Improving the language representation with information-theoretic
approaches
Fine-tuning the experts based on the DM objective
Extending the RL agent to track users’ behaviors (via abstract
belief states) and plan upon them
Preventing RL dialogue agents from generating harmful
behaviors
Evaluating their MoE-LM on more realistic problems, such as
information retrieval, recommendation, and negotiation
42 / 42
