SECOND THOUGHTS
Second Thoughts are Best: Learning to Re-Align
With Human Values from Text Edits
Ruibo Liu, Chenyan Jia, Ge Zhang et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan, Chen
December 6, 2022
1 / 33
SECOND THOUGHTS
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
2 / 33
SECOND THOUGHTS
Abstract
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
3 / 33
SECOND THOUGHTS
Abstract
Abstract
This paper presents SECOND THOUGHTS, a new learning paradigm
that enables language models (LMs) to re-align with human values.
It models the chain-of-edits between value-unaligned and
value-aligned text, fine-tunes an LM on the edit-augmented data, and
further refines the model through reinforcement learning.
The generated editing steps also offer better interpretability and
make interactive error correction easier.
4 / 33
SECOND THOUGHTS
Introduction
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
5 / 33
SECOND THOUGHTS
Introduction
Motivation
Current large-scale pre-trained language models (LMs) have shown
great success in many knowledge-recalling tasks.
Their ability to select socially good text from bad (or to generate
prosocial text) in open-world settings is still limited, even when the
models are scaled up to hundreds of billions of parameters.
6 / 33
SECOND THOUGHTS
Introduction
Fine-tuned language models (LMs) for generating text
7 / 33
SECOND THOUGHTS
Introduction
Contribution
Presenting a new learning paradigm that can make current LMs aware
of human value alignment.
Trained with SECOND THOUGHTS, LMs can not only re-align
their generation with human values, even when the context has
already been poisoned, but also show the chain of editing steps for
ease of interpretability and to facilitate further edits.
The experiments confirm that simply scaling LMs is not adequate for
good alignment with human values, which echoes the findings of
recent studies.
8 / 33
SECOND THOUGHTS
Previous work
Previous work
This section reviews existing work that considers in-context
explanations during prompting or training, and summarizes other
value alignment methods for language models.
Learning From In-Context Instructions
Human Value Alignment for Language Models
9 / 33
SECOND THOUGHTS
Previous work
Learning From In-Context Instructions
The few-shot performance of LMs can be enhanced by learning from
in-context instructions.
Ex: task descriptions, answer demonstrations ...
However, such instructions normally require careful human design,
which is costly, and their quality greatly affects performance.
10 / 33
SECOND THOUGHTS
Previous work
Human Value Alignment for Language Models
Trained on unfiltered and problematic language from the web, current
large-scale LMs have been shown to be poorly aligned with human
values.
Existing general purpose remedies include filtering the training data,
attribute-control generation, and modifying the decoding algorithm
with hard or soft constraints.
However, the experiments show that these methods have limited
performance when the context has already been poisoned, and the
remedies for this situation are costly and not always available in
every value alignment dataset.
11 / 33
SECOND THOUGHTS
Approach
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
12 / 33
SECOND THOUGHTS
Approach
SECOND THOUGHTS
SECOND THOUGHTS comprises two main steps.
1 Inferring the chain-of-edits automatically from source and target
responses with a dynamic programming algorithm, and fine-tuning
an LM on the edits-augmented training data.
2 Deploying a reinforcement learning stage to refine the generation,
by either adversarial imitation learning or value modeling.
13 / 33
SECOND THOUGHTS
Approach
Comparison
14 / 33
SECOND THOUGHTS
Approach
Problem Statement of Re-alignment
Problem Statement of Re-alignment
Value alignment datasets normally consist of contexts, value-aligned
responses, and value-unaligned responses.
Existing alignment methods formulate the value alignment task as a
conditional generation problem: given a situation as the context,
train a model that can generate responses resembling a value-aligned
target rather than a not-aligned wrong target.
15 / 33
SECOND THOUGHTS
Approach
Problem Statement of Re-alignment
Problem Statement of Re-alignment
However, many studies have shown that LMs trained with such a
paradigm can be easily derailed by poisoned contexts —i.e., contexts
that already include value-unaligned content, either from the model’s
own generation or from malicious users.
To teach a model how to re-align, they deliberately add the
value-unaligned response into the context, referred to as the source,
and keep the value-aligned response as the target.
16 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
Augmented Edits Modeling
1 DP-based Edits Inference
2 Augmented Edits Modeling (AEM)
17 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
DP-based Edits Inference
Given two text strings, source and target, one can find unlimited ways
to edit source to produce target.
Thus, they apply two constraints to the editing:
1 The edits should be combinations of generic editing
operations—inserting, deleting, and replacing a single token.
2 Each edit operation has a cost and the goal is to infer the
chain-of-edits that has minimum cost.
Under these constraints, the edits inference problem can be converted
to a token-level “edit distance problem”, which can be solved by
dynamic programming (DP).
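As a concrete illustration, the token-level edit distance plus a backtracking pass
to recover one minimum-cost chain-of-edits could look roughly as follows; this is a
minimal Python sketch in which the cost values, function name, and example tokens
are assumptions, not the authors' implementation.

def infer_chain_of_edits(source, target, ins_cost=1, del_cost=1, rep_cost=1):
    # Token-level edit distance via dynamic programming; dp[i][j] is the
    # minimum cost of turning source[:i] into target[:j].
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * del_cost
    for j in range(1, m + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if source[i - 1] == target[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j] + del_cost,      # delete a source token
                               dp[i][j - 1] + ins_cost,      # insert a target token
                               dp[i - 1][j - 1] + rep_cost)  # replace a token
    # Backtrack from (n, m) to recover one minimum-cost chain-of-edits.
    edits, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                               # tokens match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + rep_cost:
            edits.append(("replace", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + del_cost:
            edits.append(("delete", source[i - 1]))
            i -= 1
        else:
            edits.append(("insert", target[j - 1]))
            j -= 1
    return list(reversed(edits))

# Varying the insertion/deletion/replacement costs changes which minimum-cost
# chain is recovered for the same source-target pair.
print(infer_chain_of_edits("he was very rude".split(), "he was polite".split()))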
18 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
Augmented Edits Modeling (AEM)
To augment the edits, they run the DP algorithm on the same source
and target pairs with a variety of editing costs, creating a collection of
chains-of-edits for each source-target pair, which they call positive
demonstrations (y+).
An LM is then fine-tuned on these source-edits-target text inputs.
The paper also constructs negative demonstrations (y−) by using the
targets from other contexts, leading to inferred chains-of-edits that
generate value-aligned responses which are incoherent with the given
context.
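For illustration, one plausible way to serialize such a demonstration into a single
training string is sketched below; the field names, separator format, and example
text are hypothetical and not taken from the paper.

def build_demonstration(context, source, edits, target):
    # Serialize one (context, source, chain-of-edits, target) tuple into a
    # single training string; the tags below are illustrative only.
    edit_str = " ".join(f"<{op}> {' '.join(args)}" for op, *args in edits)
    return f"Context: {context} Source: {source} Edits: {edit_str} Target: {target}"

# Positive demonstration y+: edits inferred between the paired source and target.
example = build_demonstration(
    context="A coworker asks for feedback on a draft.",
    source="Your draft is useless.",
    edits=[("replace", "useless", "promising")],
    target="Your draft is promising.",
)
# Negative demonstration y-: the same procedure with a target borrowed from a
# different context, giving a value-aligned yet incoherent response.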
19 / 33
SECOND THOUGHTS
Approach
Augmented Edits Modeling
20 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Refinement by Reinforcement Learning
Based on manual examination, the generated responses tend to be
generic; the authors are thus motivated to deploy a reinforcement
learning (RL) stage to further refine the generation quality.
Notation
context and source as x
target generated by SECOND THOUGHTS as y
Define the state at time t as the set of generated tokens before t
(i.e., st = y<t)
The action as the current step’s output token (i.e., at = yt)
The softmax output of the language modeling head is considered
as the policy 𝜋t for picking token yt (action at), given the state
st = y<t.
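In code, this notation maps roughly onto the sketch below, which treats the softmax
over the LM head's logits at step t as the policy; the model name and prompt are
placeholders, not the paper's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # placeholder base LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "<context and source x, plus the target tokens generated so far>"
state = tokenizer(prompt, return_tensors="pt").input_ids    # s_t = y_<t
with torch.no_grad():
    logits = model(state).logits[:, -1, :]                   # LM head output at step t
policy_t = torch.softmax(logits, dim=-1)                     # pi_t(. | s_t)
action_t = torch.multinomial(policy_t, num_samples=1)        # sample a_t = y_t
log_prob_t = torch.log(policy_t.gather(-1, action_t))        # log pi_t(a_t | s_t)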
21 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Adversarial Imitation Learning (AIL)
They propose to leverage negative samples to penalize the LM for
imitating the mismatched target by training an adversarial LM only
on the negative demonstrations y−, so that following its policy 𝜋^ADV_t
will lead to incoherent generations. The t-th step objective of AIL to
be maximized is:
J_AIL,t = E_{𝜏∼𝜋*_t} [ −log 𝜋^ADV_t(a_t | s_t) + 𝛼 log 𝜋*_t(a_t | s_t) ] − 𝛽 KL(𝜋_t ‖ 𝜋*_t)

where the first term inside the expectation is the unlikelihood term and the
second is the likelihood term; 𝜋*_t is the desired refinement policy, 𝛼 is the
balancing factor, and the KL penalty term KL(𝜋_t ‖ 𝜋*_t) with the coefficient 𝛽
is the trust region constraint, which prevents the updated policy from drifting
too far away from the original one.
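A rough per-step implementation of this objective, given per-token log-probabilities
from the three policies, might look like the following; the function name, the
sampled-token KL approximation, and the coefficient values are assumptions for
illustration.

import torch

def ail_step_objective(logp_star, logp_adv, logp_orig, alpha=1.0, beta=0.1):
    # logp_star: log pi*_t(a_t | s_t) under the refinement policy being trained
    # logp_adv:  log pi^ADV_t(a_t | s_t) under the frozen adversarial LM
    # logp_orig: log pi_t(a_t | s_t) under the original policy
    unlikelihood = -logp_adv                      # push away from the adversarial policy
    likelihood = alpha * logp_star                # reward matching the demonstrations
    kl_penalty = beta * (logp_orig - logp_star)   # crude sampled-token KL(pi_t || pi*_t)
    return (unlikelihood + likelihood - kl_penalty).mean()

# The objective is maximized, so an optimizer would minimize its negation.
loss = -ail_step_objective(torch.randn(8), torch.randn(8), torch.randn(8))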
22 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Adversarial Imitation Learning (AIL)
The intuition behind such a design is to maximize the unlikelihood
of forming the trajectory 𝜏 = {s1, a1, ..., st, at} that can be induced
by the adversarial policy 𝜋ADV., weighted against the balancing
likelihood term.
After refinement, the learned policy 𝜋*_t can generate tokens unlike
those that can be produced by 𝜋^ADV_t, which will form sequences
more coherent with the context.
23 / 33
SECOND THOUGHTS
Approach
Refinement by Reinforcement Learning
Value Modeling (VM)
Value modeling is another refinement method, which directly learns a
value function by training a binary LM-based classifier f on the
mixture of positive and negative demonstrations.
The sigmoid of the log-likelihood predicted by f is taken as the reward,
r = 𝜎(log f(x, y)). The t-th step reward is weighted by an
importance-sampling ratio between the current and original policy for
off-policy stability, and an entropy bonus term encourages more
exploration of the current policy.
J_VM,t = E_{𝜏∼𝜋_t} [ (𝜋*_t(a_t | s_t) / 𝜋_t(a_t | s_t)) · r_t ] + 𝜆 H(𝜋*_t(· | s_t))
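A corresponding per-step sketch, with the classifier reward, log-probabilities, and
entropy passed in as tensors, could be written as below; the function name, entropy
estimate, and coefficient value are illustrative assumptions.

import torch

def vm_step_objective(logp_star, logp_orig, reward, entropy, lam=0.01):
    # logp_star: log pi*_t(a_t | s_t) under the current (refined) policy
    # logp_orig: log pi_t(a_t | s_t) under the original policy that sampled a_t
    # reward:    r_t = sigmoid(log f(x, y)) from the binary classifier f
    # entropy:   H(pi*_t(. | s_t)), the entropy bonus of the current policy
    ratio = torch.exp(logp_star - logp_orig)        # importance-sampling ratio
    return (ratio * reward).mean() + lam * entropy.mean()

# Reward from the classifier's positive-class probability p = f(x, y):
p = torch.tensor([0.9])
r = torch.sigmoid(torch.log(p))                     # r = sigma(log p) = p / (1 + p)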
24 / 33
SECOND THOUGHTS
Experiments
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
25 / 33
SECOND THOUGHTS
Experiments
SECOND THOUGHTS on three benchmark datasets
1 Moral Stories. The Moral Stories dataset (N = 20,000)
examines whether LMs can generate moral responses under
diverse social situations.
2 MIC. The MIC dataset (N = 38,000) studies whether chatbots
can generate utterances that are aligned with a set of “Rules of
Thumb (RoT)” of morality.
3 ETHICS-Deontology. The ETHICS dataset (N = 25,356)
investigates the performance of LMs on five human value
alignment tasks.
This paper also considers two smaller-scale human value alignment
datasets, HHH (Helpful, Honest, and Harmless) (N = 178) and
TruthfulQA (N = 299), to evaluate the domain transfer ability.
26 / 33
SECOND THOUGHTS
Experiments
Main Results on the Performance of Value Alignment
Alignment, by asking “To what extent does the edited response
improve the original response in terms of alignment with human
values?” Answers range from 1 (not at all) to 7 (to an extreme extent).
Coherence, by asking “How coherent is the edited response with the
given context?” Answers range from 1 (not at all) to 7 (extremely
coherent).
27 / 33
SECOND THOUGHTS
Experiments
Value Transfer Learning with Limited Human-Labeled Data
Transfer learning ability of SECOND THOUGHTS from seen human
values (i.e., trained on MRL, MIC, ETC) to unseen values (i.e., testing
on TQA, HHH)
28 / 33
SECOND THOUGHTS
Experiments
Error Analysis and Human-Guided Correction
SECOND THOUGHTS enables higher quality human-guided
corrections, in terms of alignment and coherence scores.
29 / 33
SECOND THOUGHTS
Experiments
Configuration for the Best Performing SECOND
THOUGHTS
Hyperparameter search on balancing factor 𝛼 and entropy factor 𝜆 in
the Moral Stories task for best performing SECOND THOUGHTS.
30 / 33
SECOND THOUGHTS
Limitations and Discussion
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
Problem Statement of Re-alignment
Augmented Edits Modeling
Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
31 / 33
SECOND THOUGHTS
Limitations and Discussion
Limitations and Discussion
SECOND THOUGHTS can be limited by the LM that it is based
on—for instance, the total length of the chain-of-edits is limited by the
max sequence length allowed for the LM.
32 / 33
SECOND THOUGHTS
Conclusion
Conclusion
This paper has proposed SECOND THOUGHTS, a novel learning
paradigm that enables LMs to re-align with human values when given
a poisoned context.
In addition, the chain-of-edits modeling by SECOND THOUGHTS
enables easy error diagnosis and human-guided correction, which they
believe to be an essential ability for human-AI interactive systems.
33 / 33
