Rasa Developer Summit - Bing Liu - Interactive Learning of Task-Oriented Dialog Systems

Interactive Learning of Task-Oriented Dialog Systems
Bing Liu
Research Scientist, Facebook Conversational AI
PhD, Carnegie Mellon University
Rasa Developer Summit - 2019

Task-Oriented Dialog System
❖ Dialog systems
➢ Chit-chat bot, QA bot, task-oriented dialog system, ...
❖ Get stuff done - assist users in completing specific tasks
➢ Personal assistants (e.g. Siri, Alexa, Google Assistant, Hey Portal)
➢ Voice commands in vehicles and smart homes
➢ Customer service; sales and marketing

Modular Dialog System Architecture

Task-Oriented Dialog System
❖ Highly handcrafted
❖ Process interdependent
❖ Data driven end-to-end (E2E) systems
➢ [Wen et al. 2016]: E2E supervised training of a neural dialog model
➢ [Bordes and Weston, 2017]: E2E model with memory network
➢ [Madotto et al., 2018]: Mem2Seq for incorporating knowledge into an E2E system
❖ Interactive learning for E2E system with less human supervision

Why Learn through Interactions?
❖ Task-oriented dialog as a sequential decision making process over multiple steps
❖ State space grows exponentially with number of dialog turns
❖ Extremely hard to
➢ Design all possible dialog paths
➢ Collect a dialog corpus that is large enough to cover all dialog scenarios
→ Continuously learn through the interaction with users and improve over time

How can we learn an end-to-end task-oriented dialog system effectively through interaction with users?

End-to-End Task-Oriented Dialog Modeling
❖ Dialog context modeling with hierarchical RNN
B. Liu et al., "Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems", NAACL 2018.

End-to-End Task-Oriented Dialog Modeling
End-to-End Modeling of SLU (spoken language understanding), DST (dialog state tracking), and Dialog Policy

Supervised Pre-training
❖ Supervised model pre-training on dialog corpus with MLE
➢ Objective function: linear interpolation of cross-entropy losses for
■ Dialog state tracking, i.e. user goal estimation, and
■ Dialog policy, i.e. system action prediction
➢ Optimization: Stochastic gradient descent, Adam
[Equation: joint objective with one loss term for user goal estimation and one for system action prediction]
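
The interpolated objective described above can be sketched as follows. This is a minimal illustration, not the formula from the slide: the function names, the per-slot dictionary layout, and the weights `slot_weight` and `action_weight` are assumptions introduced here.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a predicted distribution."""
    return -math.log(probs[target_index])

def joint_loss(slot_probs, slot_targets, action_probs, action_target,
               slot_weight=1.0, action_weight=1.0):
    """Weighted sum of the state-tracking losses (one per slot) and the action loss for one turn.

    slot_weight and action_weight are hypothetical interpolation weights.
    """
    tracking = sum(cross_entropy(slot_probs[s], slot_targets[s]) for s in slot_targets)
    policy = cross_entropy(action_probs, action_target)
    return slot_weight * tracking + action_weight * policy

# One dialog turn: two slots with 3 candidate values each, and 4 candidate system actions.
slot_probs = {"movie": [0.7, 0.2, 0.1], "date": [0.1, 0.8, 0.1]}
slot_targets = {"movie": 0, "date": 1}
action_probs = [0.05, 0.6, 0.25, 0.1]
loss = joint_loss(slot_probs, slot_targets, action_probs, action_target=1)
```

In training, this per-turn loss would be summed over all turns in the corpus and minimized with a gradient-based optimizer.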

Learn Interactively from User Feedback
❖ Interactive dialog learning with user feedback
[Figure: supervised pre-training on human-human dialog corpora; users then provide feedback for policy optimization]

❖ Use user feedback as dialog reward
❖ Introduce step penalty to encourage shorter dialog for task completion
❖ Optimize dialog model end-to-end with policy gradient RL
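
A minimal sketch of how the reward design above could feed a policy gradient update. It assumes a REINFORCE-style estimator, which is one common policy gradient method; the slide does not say which variant was used. The penalty size, feedback scale, and discount factor are hypothetical.

```python
def turn_rewards(num_turns, user_feedback, step_penalty=0.05):
    """Each turn costs step_penalty; the user's feedback score is added at the final turn."""
    rewards = [-step_penalty] * num_turns
    rewards[-1] += user_feedback
    return rewards

def discounted_returns(rewards, gamma=0.95):
    """Return-to-go for every turn, computed backwards from the end of the dialog."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def reinforce_loss(log_probs, returns):
    """REINFORCE surrogate loss: minimizing it raises the log-probability of high-return actions."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))

# Same positive feedback, different dialog lengths (gamma=1.0 to isolate the step penalty).
short = discounted_returns(turn_rewards(3, 1.0, step_penalty=0.1), gamma=1.0)[0]
long_ = discounted_returns(turn_rewards(6, 1.0, step_penalty=0.1), gamma=1.0)[0]
```

With equal feedback, the shorter dialog earns the higher return, which is the effect the step penalty is meant to produce.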

❖ Policy optimization with RL can be slow due to sparse reward
❖ Dialog state distribution mismatch between offline training and interactive learning leads to compounding errors
❖ Agent may learn to recover from a bad state with RL, but the search process can be very inefficient
→ Ask the user for a correction/demonstration when the agent fails at a task, and learn to act from it

Learn Interactively from User Teaching
❖ Interactive dialog learning with user teaching
[Figure: supervised pre-training on human-human dialog corpora; new dialogs are driven by the agent’s own policy; the user corrects mistakes and demonstrates the desired dialog agent behavior; the result is added to the existing corpora]
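
The teaching loop on this slide can be sketched with a toy example. Everything below is an illustrative assumption: `TablePolicy` is a lookup table standing in for the neural dialog model, dialog states and actions are plain strings, and the `teacher` callable stands in for a human user.

```python
from collections import Counter, defaultdict

class TablePolicy:
    """Toy policy: picks the action seen most often for a dialog state in the corpus."""
    def __init__(self, default_action="request_more_info"):
        self.counts = defaultdict(Counter)
        self.default_action = default_action

    def fit(self, corpus):
        """Supervised (re-)training on (state, action) pairs."""
        self.counts = defaultdict(Counter)
        for state, action in corpus:
            self.counts[state][action] += 1

    def act(self, state):
        if state not in self.counts:
            return self.default_action
        return self.counts[state].most_common(1)[0][0]

def teaching_round(policy, corpus, visited_states, teacher):
    """Let the agent act on states it reached, add the teacher's corrections, then retrain."""
    for state in visited_states:
        proposed = policy.act(state)
        desired = teacher(state)
        if proposed != desired:          # the user corrects a mistake
            corpus.append((state, desired))
    policy.fit(corpus)
    return corpus

corpus = [("user_greets", "greet"), ("user_gives_movie", "request_date")]
policy = TablePolicy()
policy.fit(corpus)
teacher = {"user_greets": "greet", "user_gives_date": "request_time"}.get
teaching_round(policy, corpus, ["user_greets", "user_gives_date"], teacher)
```

The key point is that the states come from the agent's own behavior, so the corrections cover situations the original corpus never contained.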

Evaluation
❖ Movie booking domain simulation (M2M)
Slots: theatre name, movie, date, time, num of people
SL: Supervised pre-training model
IL: Imitation learning with user teaching
RL: Reinforcement learning with user feedback
Table: Human evaluation results. Mean and standard deviation of crowd worker scores (1-5)
B. Liu et al., "Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems", NAACL 2018.

What if a user does not provide any feedback? Can we still learn anything from the interaction?

Can we learn a dialog reward function?
❖ User feedback serves as reward to RL optimization
❖ Task completion based reward requires prior knowledge of user’s goal
→ NOT usually accessible in real world user interactions
❖ In practice, user feedback can be inconsistent and is NOT always available

Adversarial Dialog Learning
❖ Reward a machine-agent for conducting task-oriented dialog in a way that is indistinguishable from the way human-agents do it.
B. Liu and I. Lane, "Adversarial Learning of Task-Oriented Neural Dialog Models", SIGDIAL 2018.

Discriminative Reward Model
❖ Input:
➢ Sequence of dialog turns
❖ Representation:
➢ BiLSTM with max-pooling
❖ Output:
➢ Prob. of a dialog being successfully completed by a human agent
[Figure labels: User’s Turn, Agent’s Turn, External Entity Info]
B. Liu and I. Lane, "Adversarial Learning of Task-Oriented Neural Dialog Models", SIGDIAL 2018.
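
The pooling and output stages listed above can be sketched in plain Python. This is not the BiLSTM itself: the per-turn vectors are assumed to be the outputs of some turn-level encoder, and the `weights` and `bias` of the output layer are hypothetical values rather than learned parameters.

```python
import math

def max_pool(turn_vectors):
    """Element-wise maximum over the sequence of per-turn vectors."""
    return [max(dim) for dim in zip(*turn_vectors)]

def success_probability(turn_vectors, weights, bias=0.0):
    """Pool the turn vectors into one dialog vector, then apply a logistic output layer."""
    pooled = max_pool(turn_vectors)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Three dialog turns, each already encoded as a 3-dimensional vector (assumed encoder output).
turn_vectors = [[0.1, -0.3, 0.5], [0.4, 0.2, -0.1], [0.0, 0.6, 0.3]]
pooled = max_pool(turn_vectors)
prob = success_probability(turn_vectors, weights=[1.0, -1.0, 0.5])
```

Max-pooling gives a fixed-size dialog representation regardless of the number of turns, so dialogs of different lengths can share one output layer.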

Model Training
❖ Supervised pre-training with an initial set of positive and negative samples
➢ Pre-train dialog agent G on positive dialog samples with MLE
➢ Pre-train discriminative reward function D on positive and negative samples
❖ Interactive learning cycle
➢ Collect new dialog sample(s) between agent G and users
➢ Update dialog agent G with RL using the reward produced by D
➢ Update reward function D using the newly collected sample(s)
➢ Continue for next learning cycle
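
The order of operations in one learning cycle can be sketched with stubs. The classes below do not learn anything; they only record which step ran, and every class, method name, and value here is a hypothetical stand-in for the real agent G and reward function D.

```python
class StubAgent:
    """Stand-in for dialog agent G; a real agent would be a neural dialog model."""
    def __init__(self):
        self.log = []

    def converse(self, user):
        self.log.append("collect")
        return ["hi", "book a movie", "done"]  # hypothetical dialog sample

    def rl_update(self, dialog, reward):
        self.log.append(f"update_G(reward={reward:.2f})")

class StubRewardModel:
    """Stand-in for reward function D; returns a fixed score instead of a learned one."""
    def __init__(self, log):
        self.log = log

    def score(self, dialog):
        return 0.5

    def update(self, human_dialogs, agent_dialogs):
        self.log.append("update_D")

def learning_cycle(agent, reward_model, user, human_dialogs):
    dialog = agent.converse(user)                  # collect a new dialog sample
    reward = reward_model.score(dialog)            # reward produced by D
    agent.rl_update(dialog, reward)                # update G with RL
    reward_model.update(human_dialogs, [dialog])   # update D with the new sample
    return reward

agent = StubAgent()
reward_model = StubRewardModel(agent.log)
for _ in range(2):
    learning_cycle(agent, reward_model, user=None, human_dialogs=[])
```

Updating G before D within a cycle follows the order of the bullets above; the reward used for G therefore comes from the D trained in the previous cycle.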

Evaluation
❖ Comparing different reward functions
B. Liu and I. Lane, "Adversarial Learning of Task-Oriented Neural Dialog Models", SIGDIAL 2018.

Summary
❖ The multi-turn nature of task-oriented dialogs makes it especially important for a system to learn through interaction with users
❖ Learning a task-oriented dialog model end-to-end with user teaching and feedback
❖ Adversarial dialog learning addresses the challenges of missing or inconsistent user feedback, with less human supervision
