SECOND THOUGHTS
Second Thoughts are Best: Learning to Re-Align
With Human Values from Text Edits
Ruibo Liu, Chenyan Jia, Ge Zhang et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan, Chen
December 6, 2022
1 / 33
SECOND THOUGHTS
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
2 / 33
SECOND THOUGHTS
Abstract
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
3 / 33
SECOND THOUGHTS
Abstract
Abstract
This paper presents SECOND THOUGHTS, a new learning paradigm
that enables language models (LMs) to re-align with human values.
It models the chain-of-edits between value-unaligned and
value-aligned text, fine-tunes an LM on the edit-augmented data, and
further refines the model through reinforcement learning.
The generated editing steps also offer better interpretability and
make interactive error correction easier.
4 / 33
SECOND THOUGHTS
Introduction
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
5 / 33
SECOND THOUGHTS
Introduction
Motivation
Current large-scale pre-trained language models (LMs) have shown
great success in many knowledge-recalling tasks.
Their ability to select socially good text from bad (or to generate
prosocial text) in open-world settings is still limited, even when the
models are scaled up to hundreds of billions of parameters.
6 / 33
SECOND THOUGHTS
Introduction
Fine-tuned language models (LMs) for generating text
7 / 33
SECOND THOUGHTS
Introduction
Contribution
Presenting a new learning paradigm that can make current LMs aware
of human value alignment.
Trained with SECOND THOUGHTS, LMs can not only re-align
their generation with human values, even when the context has
already been poisoned, but also show the chain of editing steps for
ease of interpretability and to facilitate further edits.
The experiments confirm that simply scaling LMs is not adequate for
good alignment with human values, which echoes the findings of
recent studies.
8 / 33
SECOND THOUGHTS
Previous work
Previous work
This section reviews existing work that considers in-context
explanations during prompting or training, and summarizes other
value alignment methods for language models.
Learning From In-Context Instructions
Human Value Alignment for Language Models
9 / 33
SECOND THOUGHTS
Previous work
Learning From In-Context Instructions
The few-shot performance of LMs can be enhanced by learning from
in-context instructions.
Ex: task descriptions, answer demonstrations ...
However, such instructions normally require careful human design,
which is costly, and their quality greatly affects performance.
10 / 33
SECOND THOUGHTS
Previous work
Human Value Alignment for Language Models
Trained on unfiltered and problematic language from the web, current
large-scale LMs have been shown to be poorly aligned with human
values.
Existing general purpose remedies include filtering the training data,
attribute-control generation, and modifying the decoding algorithm
with hard or soft constraints.
However, the experiments show that these methods have limited
performance when the context has already been poisoned, and the
remedies for this situation are costly and not always available in
every value alignment dataset.
11 / 33
SECOND THOUGHTS
Approach
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
12 / 33
SECOND THOUGHTS
Approach
SECOND THOUGHTS
SECOND THOUGHTS comprises two main steps.
1 Inferring the chain-of-edits automatically from source and target
responses with a dynamic programming algorithm, and fine-tuning
an LM on the edits-augmented training data.
2 Deploying a reinforcement learning stage to refine the generation,
by either adversarial imitation learning or value modeling.
13 / 33
SECOND THOUGHTS
Approach
Comparison
14 / 33
SECOND THOUGHTS
Approach
Problem Statement of Re-alignment
Problem Statement of Re-alignment
Value alignment datasets normally consist of contexts, value-aligned
responses, and value-unaligned responses.
Existing alignment methods formulate the value alignment task as a
conditional generation problem: given a situation as the context,
train a model that can generate responses resembling a value-aligned
target rather than a not-aligned wrong target.
15 / 33
SECOND THOUGHTS
Approach
Problem Statement of Re-alignment
Problem Statement of Re-alignment
However, many studies have shown that LMs trained with such a
paradigm can be easily derailed by poisoned contexts —i.e., contexts
that already include value-unaligned content, either from the model’s
own generation or from malicious users.
To teach a model how to re-align, they deliberately add the
value-unaligned response into the context, referred to as the source,
and keep the value-aligned response as the target.
16 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
Augmented Edits Modeling
1 DP-based Edits Inference
2 Augmented Edits Modeling (AEM)
17 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
DP-based Edits Inference
Given two text strings, source and target, one can find unlimited ways
to edit source to produce target.
Thus, they apply two constraints to the editing:
1 The edits should be combinations of generic editing
operations—inserting, deleting, and replacing a single token.
2 Each edit operation has a cost and the goal is to infer the
chain-of-edits that has minimum cost.
Under these constraints, the edits inference problem can be converted
to a token-level “edit distance problem”, which can be solved by
dynamic programming (DP).
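As a concrete illustration, the token-level edit distance plus a backtracking pass
to recover one minimum-cost chain-of-edits could look roughly as follows; this is a
minimal Python sketch in which the cost values, function name, and example tokens
are assumptions, not the authors' implementation.

def infer_chain_of_edits(source, target, ins_cost=1, del_cost=1, rep_cost=1):
    # Token-level edit distance via dynamic programming; dp[i][j] is the
    # minimum cost of turning source[:i] into target[:j].
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * del_cost
    for j in range(1, m + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if source[i - 1] == target[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j] + del_cost,      # delete a source token
                               dp[i][j - 1] + ins_cost,      # insert a target token
                               dp[i - 1][j - 1] + rep_cost)  # replace a token
    # Backtrack from (n, m) to recover one minimum-cost chain-of-edits.
    edits, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                               # tokens match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + rep_cost:
            edits.append(("replace", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + del_cost:
            edits.append(("delete", source[i - 1]))
            i -= 1
        else:
            edits.append(("insert", target[j - 1]))
            j -= 1
    return list(reversed(edits))

# Varying the insertion/deletion/replacement costs changes which minimum-cost
# chain is recovered for the same source-target pair.
print(infer_chain_of_edits("he was very rude".split(), "he was polite".split()))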
18 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
Augmented Edits Modeling (AEM)
To augment the edits, they run the DP algorithm on the same source
and target pairs with a variety of editing costs, creating a collection of
chains-of-edits for each source-target pair, which they call positive
demonstrations (y+).
An LM is then fine-tuned on these source-edits-target text inputs.
The paper also constructs negative demonstrations (y−) by using the
targets from other contexts, leading to inferred chains-of-edits that
generate value-aligned responses which are incoherent with the given
context.
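For illustration, one plausible way to serialize such a demonstration into a single
training string is sketched below; the field names, separator format, and example
text are hypothetical and not taken from the paper.

def build_demonstration(context, source, edits, target):
    # Serialize one (context, source, chain-of-edits, target) tuple into a
    # single training string; the tags below are illustrative only.
    edit_str = " ".join(f"<{op}> {' '.join(args)}" for op, *args in edits)
    return f"Context: {context} Source: {source} Edits: {edit_str} Target: {target}"

# Positive demonstration y+: edits inferred between the paired source and target.
example = build_demonstration(
    context="A coworker asks for feedback on a draft.",
    source="Your draft is useless.",
    edits=[("replace", "useless", "promising")],
    target="Your draft is promising.",
)
# Negative demonstration y-: the same procedure with a target borrowed from a
# different context, giving a value-aligned yet incoherent response.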
19 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
20 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Refinement by Reinforcement Learning
Based on manual examination, the generated responses tend to be
generic; the authors are thus motivated to deploy a reinforcement
learning (RL) stage to further refine the generation quality.
Notation
context and source as x
target generated by SECOND THOUGHTS as y
Define the state at time t as the set of generated tokens before t
(i.e., st = y<t)
The action as the current step’s output token (i.e., at = yt)
The softmax output of the language modeling head is considered
as the policy 𝜋t for picking token yt (action at), given the state
st = y<t.
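In code, this notation maps roughly onto the sketch below, which treats the softmax
over the LM head's logits at step t as the policy; the model name and prompt are
placeholders, not the paper's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # placeholder base LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "<context and source x, plus the target tokens generated so far>"
state = tokenizer(prompt, return_tensors="pt").input_ids    # s_t = y_<t
with torch.no_grad():
    logits = model(state).logits[:, -1, :]                   # LM head output at step t
policy_t = torch.softmax(logits, dim=-1)                     # pi_t(. | s_t)
action_t = torch.multinomial(policy_t, num_samples=1)        # sample a_t = y_t
log_prob_t = torch.log(policy_t.gather(-1, action_t))        # log pi_t(a_t | s_t)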
21 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Adversarial Imitation Learning (AIL)
They propose to leverage negative samples to penalize the LM for
imitating the mismatched target by training an adversarial LM only
on the negative demonstrations y−, so that following its policy 𝜋^ADV_t
will lead to incoherent generations. The t-th step objective of AIL to
be maximized is:
J_AIL,t = E_{𝜏∼𝜋*_t} [ −log 𝜋^ADV_t(a_t | s_t) + 𝛼 log 𝜋*_t(a_t | s_t) ] − 𝛽 KL(𝜋_t ‖ 𝜋*_t)

where the first term inside the expectation is the unlikelihood term and the
second is the likelihood term; 𝜋*_t is the desired refinement policy, 𝛼 is the
balancing factor, and the KL penalty term KL(𝜋_t ‖ 𝜋*_t) with the coefficient 𝛽
is the trust region constraint, which prevents the updated policy from drifting
too far away from the original one.
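A rough per-step implementation of this objective, given per-token log-probabilities
from the three policies, might look like the following; the function name, the
sampled-token KL approximation, and the coefficient values are assumptions for
illustration.

import torch

def ail_step_objective(logp_star, logp_adv, logp_orig, alpha=1.0, beta=0.1):
    # logp_star: log pi*_t(a_t | s_t) under the refinement policy being trained
    # logp_adv:  log pi^ADV_t(a_t | s_t) under the frozen adversarial LM
    # logp_orig: log pi_t(a_t | s_t) under the original policy
    unlikelihood = -logp_adv                      # push away from the adversarial policy
    likelihood = alpha * logp_star                # reward matching the demonstrations
    kl_penalty = beta * (logp_orig - logp_star)   # crude sampled-token KL(pi_t || pi*_t)
    return (unlikelihood + likelihood - kl_penalty).mean()

# The objective is maximized, so an optimizer would minimize its negation.
loss = -ail_step_objective(torch.randn(8), torch.randn(8), torch.randn(8))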
22 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Adversarial Imitation Learning (AIL)
The intuition behind such a design is to maximize the unlikelihood
of forming the trajectory 𝜏 = {s1, a1, ..., st, at} that can be induced
by the adversarial policy 𝜋ADV., weighted against the balancing
likelihood term.
After refinement, the learned policy 𝜋*_t can generate tokens unlike
those that can be produced by 𝜋^ADV_t, which will form sequences
more coherent with the context.
23 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Value Modeling (VM)
Value modeling is another refinement method, which directly learns a
value function by training a binary LM-based classifier f on the
mixture of positive and negative demonstrations.
The sigmoid of the log-likelihood predicted by f is taken as the reward,
r = 𝜎(log f(x, y)). The t-th step reward is weighted by an
importance-sampling ratio between the current and original policy for
off-policy stability, and an entropy bonus term encourages more
exploration of the current policy.
J_VM,t = E_{𝜏∼𝜋_t} [ (𝜋*_t(a_t | s_t) / 𝜋_t(a_t | s_t)) · r_t ] + 𝜆 H(𝜋*_t(· | s_t))
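A corresponding per-step sketch, with the classifier reward, log-probabilities, and
entropy passed in as tensors, could be written as below; the function name, entropy
estimate, and coefficient value are illustrative assumptions.

import torch

def vm_step_objective(logp_star, logp_orig, reward, entropy, lam=0.01):
    # logp_star: log pi*_t(a_t | s_t) under the current (refined) policy
    # logp_orig: log pi_t(a_t | s_t) under the original policy that sampled a_t
    # reward:    r_t = sigmoid(log f(x, y)) from the binary classifier f
    # entropy:   H(pi*_t(. | s_t)), the entropy bonus of the current policy
    ratio = torch.exp(logp_star - logp_orig)        # importance-sampling ratio
    return (ratio * reward).mean() + lam * entropy.mean()

# Reward from the classifier's positive-class probability p = f(x, y):
p = torch.tensor([0.9])
r = torch.sigmoid(torch.log(p))                     # r = sigma(log p) = p / (1 + p)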
24 / 33
SECOND THOUGHTS
Experiments
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
25 / 33
SECOND THOUGHTS
Experiments
SECOND THOUGHTS on three benchmark datasets
1 Moral Stories. The Moral Stories dataset (N = 20,000)
examines whether LMs can generate moral responses under
diverse social situations.
2 MIC. The MIC dataset (N = 38,000) studies whether chatbots
can generate utterances that are aligned with a set of “Rules of
Thumb (RoT)” of morality.
3 ETHICS-Deontology. The ETHICS dataset (N = 25,356)
investigates the performance of LMs on five human value
alignment tasks.
This paper also considers two smaller-scale human value alignment
datasets, HHH (Helpful, Honest, and Harmless) (N = 178) and
TruthfulQA (N = 299), to evaluate the domain transfer ability.
26 / 33
SECOND THOUGHTS
Experiments
Main Results on the Performance of Value Alignment
Alignment, by asking “To what extent does the edited response
improve the original response in terms of alignment with human
values?” Answers range from 1 (not at all) to 7 (to an extreme extent).
Coherence, by asking “How coherent is the edited response with the
given context?” Answers range from 1 (not at all) to 7 (extremely
coherent).
27 / 33
SECOND THOUGHTS
Experiments
Value Transfer Learning with Limited Human-Labeled Data
Transfer learning ability of SECOND THOUGHTS from seen human
values (i.e., trained on MRL, MIC, ETC) to unseen values (i.e., testing
on TQA, HHH)
28 / 33
SECOND THOUGHTS
Experiments
Error Analysis and Human-Guided Correction
SECOND THOUGHTS enables higher quality human-guided
corrections, in terms of alignment and coherence scores.
29 / 33
SECOND THOUGHTS
Experiments
Configuration for the Best Performing SECOND
THOUGHTS
Hyperparameter search on balancing factor 𝛼 and entropy factor 𝜆 in
the Moral Stories task for best performing SECOND THOUGHTS.
30 / 33
SECOND THOUGHTS
Limitations and Discussion
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
31 / 33
SECOND THOUGHTS
Limitations and Discussion
Limitations and Discussion
SECOND THOUGHTS can be limited by the LM that it is based
on—for instance, the total length of the chain-of-edits is limited by the
max sequence length allowed for the LM.
32 / 33
SECOND THOUGHTS
Conclusion
Conclusion
This paper has proposed SECOND THOUGHTS, a novel learning
paradigm that enables LMs to re-align with human values when given
a poisoned context.
In addition, the chain-of-edits modeling by SECOND THOUGHTS
enables easy error diagnosis and human-guided correction, which they
believe to be an essential ability for human-AI interactive systems.
33 / 33
