Instruction Tuning and Reinforcement
Learning from Human Feedback
IE686 Large Language Models and Agents
Credits
• This slide set is based on slides from
– Jiaxin Huang
– Mrinmaya Sachan
– Tatsunori Hashimoto
• Many thanks to all of you!
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
Recap: Language Models over Time
• Simple n-gram models followed by shallow neural methods
and RNNs
• The Transformer architecture started the age of pre-trained
language models
– Large-scale Pre-training followed by task-specific fine-tuning
➔ Transfer Learning
Recap: Pre-training Data
Recap: Pre-training Decoder-only
Language Modeling ≠ Solving Tasks
• Language modelling with next-token prediction does not by itself make the model a competent task solver
• How can we adapt the model so that it solves tasks correctly?
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information
Processing Systems, 35, pp.27730-27744.
Pre-train/Fine-tune Paradigm of PLMs
• The pre-training stage lets language models learn generic
representations and knowledge from large corpora, but they are
not fine-tuned on any form of user tasks.
• To adapt a language model to a specific downstream task, use a comparatively small task-specific dataset for fine-tuning
➔ Transfer knowledge from pre-training, show the model what we want the output to look like, and subsequently perform well on that one task
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
Scaling up Language Models
• Scaling in three dimensions has been shown to strongly increase task-solving capability and generalization
– Model size in terms of parameters
– Amount of pre-training data
– Available training compute
Emergent Abilities of LLMs
• “Abilities that are not present in small models but arise in
large models”
• Three typical emergent abilities:
– In-context learning: After providing the LLM with one or several
task demonstrations in the prompt, it can generate the expected
output (next week)
– Instruction following: Fine-tuning the model with instructions for various tasks at once leads to strong performance on unseen tasks (instruction tuning -> our focus today)
– Step-by-step reasoning: LLMs can perform complex tasks by
breaking down a problem into smaller steps. The chain-of-thought
prompting mechanism is a popular example (next week)
J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022
Emergent Abilities of LLMs
J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022
• Emergent abilities can lead to sudden leaps in performance
on various tasks
Typical LLM Training Procedure
1. Self-supervised pre-training
(next token prediction)
2. Supervised training on human-written prompt/answer pairs (Step 1)
3. LLM tasked to generate multiple
outputs for a prompt, which are
ranked by a human and used to
train a reward model (Step 2)
4. The LLM is optimized with
reinforcement learning using
the reward model (Step 3)
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information
Processing Systems, 35, pp.27730-27744.
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
LLM Training Framework
(Figure labels: Instruction Tuning | Reinforcement Learning from Human Feedback)
Instruction Tuning
• Leverage emergent ability of the models
• Incorporate instructions into the fine-tuning procedure by
prepending a “description” of each task to be carried out
• Examples
– Sentiment -> “Is the sentiment of this movie review positive or
negative?”
– Translation (En to De) -> “Translate the following sentence into
German:”
– …
• Some simple templates are used to transform existing
datasets into an instructional format
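As an illustration of such a template (a minimal sketch, not taken from the slides; the template strings and field names are made up), an existing labeled example can be converted into instruction format like this:

# Minimal sketch: turn an existing labeled example into an instruction-style
# training example by prepending a task description. Templates are illustrative.
def to_instruction_example(task, example):
    templates = {
        "sentiment": "Is the sentiment of this movie review positive or negative?\n\n{text}",
        "translation_en_de": "Translate the following sentence into German:\n\n{text}",
    }
    prompt = templates[task].format(text=example["text"])
    return {"input": prompt, "target": example["label"]}

# Example usage:
# to_instruction_example("sentiment", {"text": "A wonderful film.", "label": "positive"})
# -> {"input": "Is the sentiment of this movie review positive or negative?\n\nA wonderful film.",
#     "target": "positive"}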
Instruction Tuning
• Fine-tune on many tasks at once
• Teaches the language model to follow different natural language instructions, so that it performs well on downstream tasks and even generalizes to unseen tasks
Increasing Generalization
Instruction Tuning: Adding Diversity
• There is a gap between NLP tasks and user needs…
• More diversity needs to be added to the data...
Adding Diversity via Task Prompts
• Example Task: Summarization
• Create diversity from the same example via prompt
variations
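A minimal sketch of this idea (illustrative only; the prompt wordings below are invented): the same document/summary pair is rendered with several paraphrased instruction templates, yielding several training examples.

# Sketch: create multiple instruction-style examples from one summarization
# instance by varying the prompt wording. The templates are made up.
PROMPT_VARIANTS = [
    "Summarize the following article:\n\n{document}",
    "{document}\n\nWrite a short summary of the text above.",
    "{document}\n\nTL;DR:",
]

def expand_with_prompt_variants(document, summary):
    return [{"input": t.format(document=document), "target": summary}
            for t in PROMPT_VARIANTS]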
T0 – An Instruction-tuned LLM
Sanh, V. et al., 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations.
T0 Training Sets
• Collected from multiple public NLP datasets covering a variety of tasks
Training Mixtures and Unseen Sets
• Training Mixtures:
– Question answering, structure-to-text, summarization
– Sentiment analysis, topic classification, paraphrase identification
• Unseen test set:
– Sentence completion, BIG-Bench
– Natural language inference, coreference resolution, word sense
disambiguation
• T0 is trained using the T5 transformer (11B model)
Task Adaptation with Prompt Templates
• Instead of directly using input/output pairs, specific
instructions are added to explain each task
• The outputs are natural language tokens instead of class
labels
Performance on Unseen Tasks
• For T5 and T0, each dot represents one evaluation prompt
Effect of Prompt Variations
• Increasing the number of paraphrasing prompts generally
leads to better performance
Effects of More Training Datasets
• Adding more datasets consistently leads to higher median
performance
Crowdsourcing for Instruction Tuning
• Crowdsourcing as a source of diverse instruction data
• Large dataset of natural language instructions created
– For 61 distinct tasks
– 193K instances (input/output pairs)
• Annotators follow a fixed instruction schema
Mishra, S. et al., 2022, May. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3470-3487).
Proposed Data Schema
• Title: High-level description of task
• Definition: Core detailed instructions
of task
• Things to avoid: Instructions regarding
undesirable annotations that need to
be avoided
• Emphasis/caution: Highlights statements to be emphasized or warned against
• Positive example: Example of desired
input/output pair
• Negative example: Example of
undesired input/output pair
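One possible way to represent this schema in code (a sketch with hypothetical field names that simply mirror the bullets above):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class InstructionTask:
    # Fields mirror the crowdsourcing schema described above.
    title: str                # high-level description of the task
    definition: str           # core detailed instructions
    things_to_avoid: str      # undesirable annotations to avoid
    emphasis_caution: str     # statements to emphasize or warn against
    positive_examples: List[Tuple[str, str]] = field(default_factory=list)  # desired input/output pairs
    negative_examples: List[Tuple[str, str]] = field(default_factory=list)  # undesired input/output pairs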
An Example in this Schema
Crowdsourced Dataset
• Random splitting of tasks (12 evaluation, 49 supervision)
• Leave-one-category-out
Generalization to Unseen Tasks
• Model: BART (140M,
instruction-tuned)
• All instruction elements
help improve model
performance on unseen
tasks, apart from negative
examples
Number of Training Tasks
• Generalization to unseen tasks improves with more
observed tasks
Comparison to the GPT3 LLM
• Model: BART (140M params., instruction-tuned)
• Baseline: GPT3 (175B params., not instruction-tuned)
• Instructions consistently
improve model performance
on unseen tasks
• BART with instruction tuning can often outperform GPT3 without it, despite being a much smaller model
Using LLMs to generate Instructions
• (Good) human-written instruction data is expensive
• Is it possible to reduce the labeling effort?
• Idea: generate instructions using an off-the-shelf LLM (GPT-3), starting from human-written seed tasks
Wang, Y., et al., 2023, July. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of
the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13484-13508).
Self-Instruct Framework
• Classify whether the generated instruction is a classification
task
• Output-first: avoid bias towards one class label
Self-Instruct Framework
• Filter out instructions that are too similar to existing ones
• Add newly generated tasks into the task pool for next
iteration
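A highly simplified sketch of this generate-filter-grow loop (illustrative only; generate_fn stands in for the LLM call, and the real pipeline uses ROUGE-L for the similarity filter):

import difflib

def too_similar(new_instruction, task_pool, threshold=0.7):
    # Stand-in similarity check; Self-Instruct filters with ROUGE-L against the pool.
    return any(difflib.SequenceMatcher(None, new_instruction, existing).ratio() > threshold
               for existing in task_pool)

def self_instruct(seed_tasks, generate_fn, num_iterations=10):
    task_pool = list(seed_tasks)                  # start from human-written seed tasks
    for _ in range(num_iterations):
        for candidate in generate_fn(task_pool):  # LLM proposes new instructions from sampled pool tasks
            if not too_similar(candidate, task_pool):
                task_pool.append(candidate)       # keep only novel instructions for the next iteration
    return task_pool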
Selected Tasks Generated by GPT-3
Self-Instruct Experiments
• Use GPT-3-davinci to generate new instruction tasks and
use them to subsequently fine-tune the model itself
• 175 seed tasks -> 52K instructions and 82K instances
Self-Instruct Evaluation
LIMA: Less is More for Alignment
• Hypothesis: A model’s knowledge and capabilities are
learned almost entirely during pre-training, while
instruction tuning teaches the right format to use when
interacting with users
• Is a small amount of data enough to achieve this goal and
still generalize to new unseen tasks?
Zhou, C., et al., 2024. Lima: Less is More for Alignment. Advances in Neural Information Processing Systems, 36.
LIMA: Less is More for Alignment
• Only 1,000 training examples: no self-generation and only a few manual annotations
– 750 top questions/answers selected from community forums
– 250 examples (prompt and response) manually written to exemplify the desired response style of the model
• Finally, instruction-tune a 65B Llama model on these 1,000 examples
Comparing LIMA with other LLMs
• By asking human crowd workers and GPT-4 which model
response is the better one (binary decision)
(Figure panels: Human Evaluation | GPT-4 Evaluation)
Important Factors
• Quality Control:
– Public data: select data with high user ratings
– Manually generated examples: make sure tone and format are
uniform
• Diversity Control:
– Public data: stratified sampling to increase domain diversity
– Manually generated examples: Create with wide range of
tasks/scenarios
Quality vs. Quantity vs. Diversity
• Scaling up training data does not necessarily improve the
model response quality
• Quality and diversity are important before quantity
Filtered Stack Exchange: diverse and high quality
Unfiltered Stack Exchange: diverse but low quality
wikiHow: high quality but low diversity
Format Constraint Impact on Response
• LIMA with or without 6 format-constraint examples
– Generating a product page with highlights, an about-the-product section, and usage instructions
– Paper reviews with summary, strengths, weaknesses, and potential
Comparing Instruction Datasets
• There is not a single best instruction tuning dataset across
all tasks
• Combining datasets results in the best overall performance
Wang, Y., et al., 2023. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources. Advances in Neural Information Processing Systems, 36, pp.74764-74786.
Impact of Base Model
• Base model quality is extremely important for downstream
task performance
• Llama is pre-trained on more tokens than other models
Impact of Model Size
• Smaller models benefit more from instruction-tuning
• Instruction-tuning does not help to enhance strong
capabilities already existing in the original model
Summary: Instruction Tuning
• Instruction tuning enables language models to follow novel
user instructions that are not seen during fine-tuning
➔This is what users want!
• Instruction-tuned models perform well on many tasks, not just a single one as with task-specific fine-tuning
• Limitations:
– Data collection is expensive, especially for complex tasks (quality and diversity control are necessary)
– Many tasks do not have a single acceptable output (or format); many different outputs can be considered correct
– Instruction tuning does not directly model human preferences
Summary: Instruction Tuning
• All presented techniques are used today to prepare
instruction-tuning data for LLMs
– Reformulating existing tasks into natural language format
– Crowdsourcing instructions and answers
– Generating instructions with LLMs themselves
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
The Problem of Supervised Fine-tuning
• There is still a misalignment between the ML objective –
maximizing the likelihood of a specific piece of human-
written text – and what humans actually want – generation
of high-quality outputs as determined by humans
• Language models go through another phase of learning,
called alignment, where they learn how to present
information to users and align to human preferences, e.g.:
– Helpfulness
– Honesty
– Harmlessness
• Do you see a problem with these preferences?
LLM Pre-training Framework
(Figure labels: Instruction Tuning | Reinforcement Learning from Human Feedback)
Reinforcement Learning Model
• An agent has a policy function, which can take an action A_t according to the current state S_t
• As a result of the action, the agent receives a reward R_t from the environment and transitions to the next state S_{t+1}
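As a generic illustration of this interaction loop (a sketch that assumes a gym-style env with reset/step and a policy callable; nothing here is specific to language models):

def run_episode(env, policy, max_steps=100):
    # The agent repeatedly picks an action A_t with its policy for the current
    # state S_t, receives a reward R_t, and transitions to the next state S_{t+1}.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward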
InstructGPT
• Agent: language model
• Action: predict the next token
• Policy: the output distribution over the next token
• Reward: a reward model trained from human evaluations of model responses
➔ Removes the need for a human in the loop during RL optimization
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information
Processing Systems, 35, pp.27730-27744.
Reward Model Training
• Prompt the supervised fine-tuned language model to produce pairs of answers
• Human annotators decide which one is preferred (preferred answer y_w, rejected answer y_l)
• The reward model is trained to score y_w higher than y_l
• The reward model is often initialized from π^SFT with a linear layer that produces a scalar reward value
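Written out, the standard pairwise ranking loss used to train such a reward model r_φ (with σ the logistic function, y_w the preferred and y_l the rejected answer) looks as follows:

\mathcal{L}_{RM}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]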
RLHF: Proximal Policy Optimization
• Optimize the language model with feedback from the
reward model
• Prevents mode collapse to single high-reward answers
• Prevents the model from deviating too far from the
distribution where the reward model is accurate
Schulman, J. et al., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
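In this stage, one typically maximizes a KL-penalized objective of the following form (a sketch; π^SFT is the supervised fine-tuned reference model, r_φ the learned reward model, and β controls the strength of the penalty):

\max_{\theta}\;\; \mathbb{E}_{x\sim \mathcal{D},\; y\sim \pi_\theta(\cdot\mid x)}\big[\, r_\phi(x, y)\,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi^{SFT}(\cdot\mid x)\big)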
Comparison with Baselines
• RLHF models are more preferred by human labelers
Evaluations on Different Aspects
Limitations of PPO Methods
• Need to train multiple models
– Reward model
– Policy model
• Needs sampling from Language model during fine-tuning
• Complicated reinforcement learning training process
• Is it possible to directly train a language model from human
preference annotations?
Direct Preference Optimization
• Removes the iterative
reinforcement learning
process by directly tuning
the model on human
preferences
• DPO eliminates the need to
– train a reward model
– sample from the LM during fine-tuning
– perform a large hyperparameter search
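Concretely, DPO fits the policy with a single classification-style loss on preference pairs (π_ref is the frozen reference/SFT model and β a temperature-like hyperparameter):

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]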
DPO versus Baselines
• DPO provides a higher expected reward compared to PPO (left)
• Higher win-rate compared to human-written summaries, as evaluated by GPT-4 (right)
Comparison between PPO and DPO
• Proximal policy optimization
– Complex reinforcement learning
– Iterative process
– Can handle more informative
human feedback (e.g. numerical
ratings)
• Direct preference optimization
– Simpler fine-tuning process: the policy is fit directly on preference data (implicitly defining the reward)
– Cheaper and more stable training
– Can only handle binary preference signals
Fine-grained Human Feedback
• Assigning a single score to the model output may not be
informative enough
Wu, Z. et al., 2024. Fine-grained Human Feedback gives Better Rewards for
Language Model Training. Advances in Neural Information Processing Systems, 36.
Multiple Reward Functions
• Provide a reward after every segment (e.g. a sentence) is generated
• Different feedback types: factual incorrectness, irrelevance, and information incompleteness
• Combined reward: a weighted sum of the individual fine-grained rewards (see the sketch below)
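As a sketch of this combination (notation loosely follows the fine-grained RLHF setup; K is the number of feedback types and w_k the weight of type k), the reward assigned at the end of a segment t can be written as:

r_t \;=\; \sum_{k=1}^{K} w_k\, r^{(k)}_t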
Example: Detoxification
• Measure toxicity
– 0: non-toxic
– 1: toxic
Example: Detoxification
• Learning from denser fine-grained rewards is more sample
efficient than learning from holistic rewards
• Fine-grained location of toxic content is a stronger training
signal than a single scalar value for the whole text.
Customizing LLM Behavior
• Keep factualness/completeness reward weights fixed
• Vary the relevance reward weight: 0.4/0.3/0.2
• Relevance reward penalizes referencing passages and
auxiliary information
Open Issues with RLHF
• There remain challenges within each of the three steps
– Human feedback
– Reward model
– Policy
Casper, S., et al., 2023. Open Problems and Fundamental Limitations of Reinforcement Learning
from Human Feedback. Transactions on Machine Learning Research.
Challenges: Human Feedback
• Biases of human evaluators
– Studies found that ChatGPT became politically biased after RLHF
• Good oversight is difficult
– Evaluators are paid per example and may make mistakes given time
constraints
– Poor feedback when evaluating difficult tasks
• Data Quality
– Cost/Quality tradeoff
• Tradeoff between richness and efficiency of feedback types
– Comparison-based feedback, scalar feedback, correction feedback,
language feedback, …
Challenges: Reward Model
• A single reward model cannot represent a diverse society of
humans
• Reward misgeneralization: the reward model may fit the human preference data by relying on unexpected (spurious) features
• Evaluation of reward model is difficult and expensive
Challenges: Policy
• Robust reinforcement learning is difficult
– Balance between exploring new actions and exploiting known
rewards
– Challenge increases in high-dimensional or sparse reward settings
• Policy misgeneralization: training and deployment
environments are different
Summary: RLHF
• Reinforcement Learning from Human Feedback allows us to directly model human preferences and to generalize beyond the labelled data
• Reinforcement Learning from Human Feedback can improve over doing only instruction tuning
• Tricky to get right
• “Alignment Tax”: performance on tasks may suffer in favour
of modelling outputs to human preference
Summary: RLHF
• Human preferences are unreliable!
– “Reward hacking” is a common problem in RL
– Chatbots are rewarded for producing responses that seem authoritative and helpful, regardless of truth, which can result in hallucinations
• Models of human preferences are even more unreliable!
• Still very data expensive
• Very underexplored and fast-moving research area
Current Developments
• Focus on Reasoning LLMs (OpenAI o1/o3, DeepSeek, etc.)
– Incorporation of chain-of-thought prompting (next week) into the training procedure
– Introducing additional tokens to “give the model time to think” has also been shown to be helpful
• Reinforcement learning is used to automatically generate reasoning examples (e.g. DeepSeek)
– Problem: How do we verify that the final output is correct if we do not have labels?
→ Use domains where the correct answer can be programmatically derived (math, coding, ...) – see the sketch below
OpenAI Blog: https://openai.com/index/learning-to-reason-with-llms/
OpenAI o1 system card: https://cdn.openai.com/o1-system-card-20241205.pdf
Deepseek R1 paper: https://arxiv.org/abs/2501.12948
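A toy sketch of such a programmatically checkable reward for math-style tasks (illustrative only; the answer-extraction convention is made up, and real systems use far more robust parsing and checking):

def extract_final_answer(model_output):
    # Toy convention: assume the model ends its reasoning with "Answer: <value>".
    marker = "Answer:"
    return model_output.rsplit(marker, 1)[-1].strip() if marker in model_output else ""

def verifiable_reward(model_output, reference_answer):
    # Reward 1.0 if the extracted answer matches the reference exactly, else 0.0.
    return 1.0 if extract_final_answer(model_output) == reference_answer.strip() else 0.0

# verifiable_reward("Let's reason step by step ... Answer: 42", "42")  ->  1.0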
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
A Problem for Open Research
• The presented training procedures for creating performant LLMs require huge amounts of compute resources for extended periods of time (weeks to months)
• Public research institutions mostly do not have this kind of infrastructure/funding
• ChatGPT/Claude/Gemini/etc. are closed-source, proprietary models: we do not know their pre-training corpora and we cannot access their weights
➔ We can use them but we can only operate on assumptions
regarding their training data and specifics of the training
procedure
Llama: Open-Source Language Models
• Open-source models by Meta
• Available in various versions and sizes ranging from 7B to 405B
parameters
• The pre-training corpus is transparent and the models are freely
available for anyone
– Pre-training corpus: English CommonCrawl, C4, Github, Wikipedia,
Gutenberg and Books3, ArXiv, Stack Exchange
– Researchers with limited computing resources can use smaller models to
understand how and why these language models work
➔ Currently the best alternative for research institutions to
investigate topics like instruction tuning and reinforcement learning
from human feedback
Touvron, H. et al., 2023. Llama: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
Touvron, H. et al., 2023. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
Dubey, A. et al., 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
Existing Large Language Models
• Many of the publicly available LLMs are based on the Llama series of models by Meta
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
Existing Large Language Models
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
See you next week!
• Next time: Prompt engineering and efficient adaptation
– Zero-shot, in-context learning, chain-of-thought, …
– Prompt tuning, adapter tuning, LoRA, …
82

More Related Content

PDF
LLM Agents and Tool Use Data and Web Science Group IE686 Large Language Model...
PDF
Introduction and Organization Data and Web Science Group IE686 Large Language...
PDF
Project Topic Presentation Data and Web Science Group IE686 Large Language Mo...
PDF
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
PDF
Why xAPI? A Business Leader's Getting Started Guide
PPTX
Teaching with MATLAB
PPT
Open lw reference architecture project
PPTX
IEEE SocialCom 2009: NetViz Nirvana (NodeXL Learnability)
LLM Agents and Tool Use Data and Web Science Group IE686 Large Language Model...
Introduction and Organization Data and Web Science Group IE686 Large Language...
Project Topic Presentation Data and Web Science Group IE686 Large Language Mo...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
Why xAPI? A Business Leader's Getting Started Guide
Teaching with MATLAB
Open lw reference architecture project
IEEE SocialCom 2009: NetViz Nirvana (NodeXL Learnability)

Similar to Instruction Tuning and Reinforcement Learning from Human Feedback Data and Web Science Group IE686 Large Language Models and Agents (20)

PPTX
Model-Driven Spreadsheet Development
PPTX
Empowering End-Users to Collaboratively Structure Knowledge-Intensive Processes
PPT
VII Jornadas eMadrid "Education in exponential times". Mesa redonda eMadrid L...
PPT
Final ec2 kt
DOC
Resume
PDF
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
PPTX
2211 APSIPA
PDF
poster_eduacion_ai_universidad_catolica_chile.pdf
PPTX
College Monitoring system BY: Geekssay.com
PPT
Cb Cetis June 2007 Final
PDF
Teaching Data-driven Video Processing via Crowdsourced Data Collection
PDF
Data-X-Sparse-v2
PPTX
2. Evaluation design of the cofimvaba ict4 red initiative - Bridge 2014 version
PPTX
PhD_Research_Proposal_Machine_Learning.pptx
PDF
Modellbildung, Berechnung und Simulation in Forschung und Lehre
PPT
DRESD Project Presentation - December 2006
PDF
A new Moodle module supporting automatic verification of VHDL-based assignmen...
PPTX
Optimization Software in Operational Research Analysis in a Public University...
PPTX
Towards Open Architectures and Interoperability for Learning Analytics
PPTX
BPM Cluster Meeting 2018
Model-Driven Spreadsheet Development
Empowering End-Users to Collaboratively Structure Knowledge-Intensive Processes
VII Jornadas eMadrid "Education in exponential times". Mesa redonda eMadrid L...
Final ec2 kt
Resume
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
2211 APSIPA
poster_eduacion_ai_universidad_catolica_chile.pdf
College Monitoring system BY: Geekssay.com
Cb Cetis June 2007 Final
Teaching Data-driven Video Processing via Crowdsourced Data Collection
Data-X-Sparse-v2
2. Evaluation design of the cofimvaba ict4 red initiative - Bridge 2014 version
PhD_Research_Proposal_Machine_Learning.pptx
Modellbildung, Berechnung und Simulation in Forschung und Lehre
DRESD Project Presentation - December 2006
A new Moodle module supporting automatic verification of VHDL-based assignmen...
Optimization Software in Operational Research Analysis in a Public University...
Towards Open Architectures and Interoperability for Learning Analytics
BPM Cluster Meeting 2018
Ad

Recently uploaded (20)

PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
CHAPTER 2 - PM Management and IT Context
PDF
medical staffing services at VALiNTRY
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
top salesforce developer skills in 2025.pdf
PPTX
Transform Your Business with a Software ERP System
PDF
AI in Product Development-omnex systems
PPTX
VVF-Customer-Presentation2025-Ver1.9.pptx
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
Nekopoi APK 2025 free lastest update
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PPTX
Essential Infomation Tech presentation.pptx
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
How Creative Agencies Leverage Project Management Software.pdf
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PPTX
L1 - Introduction to python Backend.pptx
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Which alternative to Crystal Reports is best for small or large businesses.pdf
Design an Analysis of Algorithms I-SECS-1021-03
CHAPTER 2 - PM Management and IT Context
medical staffing services at VALiNTRY
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
top salesforce developer skills in 2025.pdf
Transform Your Business with a Software ERP System
AI in Product Development-omnex systems
VVF-Customer-Presentation2025-Ver1.9.pptx
Adobe Illustrator 28.6 Crack My Vision of Vector Design
Nekopoi APK 2025 free lastest update
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
Essential Infomation Tech presentation.pptx
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
How Creative Agencies Leverage Project Management Software.pdf
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
L1 - Introduction to python Backend.pptx
Ad

Instruction Tuning and Reinforcement Learning from Human Feedback Data and Web Science Group IE686 Large Language Models and Agents

  • 1. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning and Reinforcement Learning from Human Feedback IE686 Large Language Models and Agents 1
  • 2. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Credits • This slide set is based on slides from – Jiaxin Huang – Mrinmaya Sachan – Tatsunori Hashimoto • Many thanks to all of you! 2
  • 3. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 3
  • 4. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Recap: Language Models over Time • Simple n-gram models followed by shallow neural methods and RNNs • The Transformer architecture started the age of pre-trained language models – Large-scale Pre-training followed by task-specific fine-tuning ➔ Transfer Learning 4
  • 5. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Recap: Pre-training Data 5
  • 6. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Recap: Pre-training Decoder-only 6 Original Sentence:
  • 7. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Language Modeling ≠ Solving Tasks • Language modelling with next token prediction does not make the model a competent task solver • How to adapt to correctly solving tasks? 7 Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.
  • 8. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Pre-train/Fine-tune Paradigm of PLMs • The pre-training stage lets language models learn generic representations and knowledge from large corpora, but they are not fine-tuned on any form of user tasks. • To adapt language models to a specific downstream task, use comparably small task-specific datasets for fine-tuning ➔ Transfer knowledge from pre-training, show the model what we want the output to look like and subsequently perform well on one task 8
  • 9. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 9
  • 10. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Scaling up Language Models • Scaling in three dimensions has been shown to strongly increase task solving capability and generalization – Model size in terms of parameters – Increasing pre-training data – Available training compute 10
  • 11. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Emergent Abilities of LLMs • “Abilities that are not present in small models but arise in large models” • Three typical emergent abilities: – In-context learning: After providing the LLM with one or several task demonstrations in the prompt, it can generate the expected output (next week) – Instruction following: Fine-tuning the model with instructions for various tasks at once, leads to strong performance on unseen tasks (instruction tuning -> our focus today) – Step-by-step reasoning: LLMs can perform complex tasks by breaking down a problem into smaller steps. The chain-of-thought prompting mechanism is a popular example (next week) 11 J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022
  • 12. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Emergent Abilities of LLMs 12 J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022 • Emergent abilities can lead to sudden leaps in performance on various tasks
  • 13. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Typical LLM Training Procedure 1. Self-supervised pre-training (next token prediction) 2. Supervised training on pairs of human-written prompt/answer pairs (Step 1) 3. LLM tasked to generate multiple outputs for a prompt, which are ranked by a human and used to train a reward model (Step 2) 4. The LLM is optimized with reinforcement learning using the reward model (Step 3) 13 Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.
  • 14. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 14
  • 15. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LLM Training Framework 15 Instruction-Tuning Reinforcement Learning from Human Feedback
  • 16. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning • Leverage emergent ability of the models • Incorporate instructions into the fine-tuning procedure by prepending a “description” of each task to be carried out • Examples – Sentiment -> “Is the sentiment of this movie review positive or negative?” – Translation (En to De) -> “Translate the following sentence into German:” – … • Some simple templates are used to transform existing datasets into an instructional format 16
  • 17. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning • Fine-tune on many tasks at once • Teaches language model to follow different natural language instructions, so that it can perform well on downstream tasks and even generalize to unseen tasks 17
  • 18. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Increasing Generalization 18
  • 19. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning: Adding Diversity • There is a gap between NLP tasks and user needs… • More diversity needs to be added to the data... 19
  • 20. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Adding Diversity via Task Prompts • Example Task: Summarization • Create diversity from the same example via prompt variations 20
  • 21. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group T0 – An Instruction-tuned LLM 21 Sanh, V. et al., Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations.
  • 22. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group T0 Training Sets • Collected from multiple public NLP datasets and variety of tasks 22
  • 23. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Training Mixtures and Unseen Sets • Training Mixtures: – Question answering, structure-to-text, summarization – Sentiment analysis, topic classification, paraphrase identification • Unseen test set: – Sentence completion, BIG-Bench – Natural language inference, coreference resolution, word sense disambiguation • T0 is trained using the T5 transformer (11B model) 23
  • 24. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Task Adaptation with Prompt Templates • Instead of directly using input/output pairs, specific instructions are added to explain each task • The outputs are natural language tokens instead of class labels 24
  • 25. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Performance on Unseen Tasks • For T5 and T0, each dot represents one evaluation prompt 25
  • 26. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Effect of Prompt Variations • Increasing the number of paraphrasing prompts generally leads to better performance 26
  • 27. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Effects of More Training Datasets • Adding more datasets consistently leads to higher median performance 27
  • 28. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Crowdsourcing for Instruction Tuning • Crowdsourcing as source for diverse instruction data • Large dataset of natural language instructions created – For 61 distinct tasks – 193K instances (input/output pairs) • Using a set instruction schema for the annotators 28 Mishra, S. et al., 2022, May. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3470-3487).
  • 29. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Proposed Data Schema • Title: High-level description of task • Definition: Core detailed instructions of task • Things to avoid: Instructions regarding undesirable annotations that need to be avoided • Emphasis/caution: highlights statements to be emphasized or warned against • Positive example: Example of desired input/output pair • Negative example: Example of undesired input/output pair 29
  • 30. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group An Example in this Schema 30
  • 31. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Crowdsourced Dataset • Random splitting of tasks (12 evaluation, 49 supervision) • Leave-one-category-out 31
  • 32. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Generalization to Unseen Tasks • Model: BART (140M, instruction-tuned) • All instruction elements help improve model performance on unseen tasks, apart from negative examples 32
  • 33. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Number of Training Tasks • Generalization to unseen tasks improves with more observed tasks 33
  • 34. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Comparison to the GPT3 LLM • Model: BART (140M params., instruction-tuned) • Baseline: GPT3 (175B params., not instruction-tuned) 34 • Instructions consistently improve model performance on unseen tasks • BART with instruction-tuning can often outperform GPT3 without, albeit being a much smaller model
  • 35. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Using LLMs to generate Instructions • (Good) Human-written instruction data is expensive • Possible to reduce the labeling effort? • Idea: generate instructions using an off-the-shelf LLM (GPT- 3) with human written seed tasks 35 Wang, Y., et al., 2023, July. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13484-13508).
  • 36. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Framework • Classify whether the generated instruction is a classification task • Output-first: avoid bias towards one class label 36
  • 37. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Framework • Filter out instructions similar with existing ones • Add newly generated tasks into the task pool for next iteration 37
  • 38. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Selected Tasks Generated by GPT-3 38
  • 39. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Experiments • Use GPT-3-davinci to generate new instruction tasks and use them to subsequently fine-tune the model itself • 175 seed tasks -> 52K instructions and 82K instances 39
  • 40. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Evaluation 40
  • 41. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LIMA: Less is More for Alignment • Hypothesis: A model’s knowledge and capabilities are learned almost entirely during pre-training, while instruction tuning teaches the right format to use when interacting with users • Is a small amount of data enough to achieve this goal and still generalize to new unseen tasks? 41 Zhou, C., et al., 2024. Lima: Less is More for Alignment. Advances in Neural Information Processing Systems, 36.
  • 42. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LIMA: Less is More for Alignment • Only 1000 training examples: no self-generation and only few manual annotations – 750 top questions/answers selected from community forums – 250 examples (prompt and response) manually written to exemplify the desired response style of the model • Finally instruction-tune 65B Llama model on these 1000 examples 42
  • 43. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Comparing LIMA with other LLMs • By asking human crowd workers and GPT-4 which model response is the better one (binary decision) 43 Human Evaluation GPT4 Evaluation
  • 44. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Important Factors • Quality Control: – Public data: select data with high user ratings – Manually generated examples: make sure tone and format are uniform • Diversity Control: – Public data: stratified sampling to increase domain diversity – Manually generated examples: Create with wide range of tasks/scenarios 44
  • 45. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Quality vs. Quantity vs. Diversity • Scaling up training data does not necessarily improve the model response quality • Quality and diversity are important before quantity 45 Filtered Stack Exchange: diverse and high quality Unfiltered Stack Exchange: diverse but low quality wikiHow: high quality but low diversity
  • 46. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Format Constraint Impact on Response • LIMA with or without 6 format constraint examples – Generating product page with highlights, about the product and how to use – Paper reviews with summary, strengths, weaknesses and potentials 46
  • 47. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Comparing Instruction Datasets • There is not a single best instruction tuning dataset across all tasks • Combining datasets results in the best overall performance 47 Wang, Y., et al., 2023. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. Advances in Neural Information Processing Systems, 36, pp.74764-74786.
  • 48. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Impact of Base Model • Base model quality is extremely important for downstream task performance • Llama is pre-trained on more tokens than other models 48
  • 49. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Impact of Model Size • Smaller models benefit more from instruction-tuning • Instruction-tuning does not help to enhance strong capabilities already existing in the original model 49
  • 50. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Summary: Instruction Tuning • Instruction tuning enables language models to follow novel user instructions that are not seen during fine-tuning ➔This is what users want! • Instruction-tuned models perform well on many tasks not just a single one as with task-specific fine-tuning • Limitations: – Data collection is expensive, especially for complex tasks (quality and diversity control are necessary) – Many tasks do not have a single acceptable output (format) but many can be considered correct – Instruction tuning does not directly model human preferences 50
  • 51. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Summary: Instruction Tuning • All presented techniques are used today to prepare instruction-tuning data for LLMs – Reformulating existing tasks into natural language format – Crowdsourcing instructions and answers – Generating instructions with LLMs themselves 51 Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
  • 52. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 52
  • 53. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group The Problem of Supervised Fine-tuning • There is still a misalignment between the ML objective – maximizing the likelihood of a specific piece of human- written text – and what humans actually want – generation of high-quality outputs as determined by humans • Language models go through another phase of learning, called alignment, where they learn how to present information to users and align to human preferences, e.g.: – Helpfulness – Honesty – Harmlessness • Do you see a problem with these preferences? 53
  • 54. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LLM Pre-training Framework 54 Instruction-Tuning Reinforcement Learning from Human Feedback
Reinforcement Learning Model
• An agent has a policy function, which takes an action A_t according to the current state S_t
• As a result of the action, the agent receives a reward R_t from the environment and transitions to the next state S_{t+1} (see the sketch below)
55
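As a minimal illustration of this agent-environment loop (not tied to any specific RL library), the following Python sketch shows an agent repeatedly choosing actions, receiving rewards, and transitioning to new states; env, policy, and their methods are hypothetical stand-ins.

    # Minimal sketch of the agent-environment loop; `env` and `policy` are
    # illustrative stand-ins, not a specific RL library API.
    def rollout(env, policy, num_steps=10):
        state = env.reset()                        # initial state S_0
        for t in range(num_steps):
            action = policy.act(state)             # A_t chosen according to S_t
            next_state, reward = env.step(action)  # environment returns R_t and S_{t+1}
            policy.update(state, action, reward, next_state)  # learn from the reward
            state = next_state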
InstructGPT
• Agent: the language model
• Action: predicting the next token
• Policy: the output distribution over the next token
• Reward: a reward model trained on human evaluations of model responses
➔ Removes the need for a human-in-the-loop
56
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.
Reward Model Training
• Prompt the supervised fine-tuned language model to produce pairs of answers
• Human annotators decide which one is preferred
• The reward model is trained to score y_w higher than y_l (see the sketch below)
• The reward model is often initialized from π_SFT with a linear layer to produce a scalar reward value
57
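The ranking objective can be written as a pairwise (Bradley-Terry-style) loss, -log σ(r(x, y_w) - r(x, y_l)). Below is a minimal PyTorch sketch under the assumption that reward_model returns a scalar score for a prompt/response pair; the function and argument names are illustrative, not InstructGPT's actual code.

    import torch.nn.functional as F

    def reward_model_loss(reward_model, prompt, chosen, rejected):
        # Scalar scores for the preferred (y_w) and dispreferred (y_l) response.
        r_w = reward_model(prompt, chosen)
        r_l = reward_model(prompt, rejected)
        # Train the reward model to rank y_w above y_l: -log sigmoid(r_w - r_l)
        return -F.logsigmoid(r_w - r_l)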
RLHF: Proximal Policy Optimization
• Optimize the language model with feedback from the reward model
• Prevents mode collapse to single high-reward answers
• Prevents the model from deviating too far from the distribution where the reward model is accurate (see the sketch below)
58
Schulman, J. et al., 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
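One common way to implement the last point is to subtract a KL penalty against the frozen SFT policy from the reward that PPO maximizes, i.e. r(x, y) - β·log(π_RL(y|x)/π_SFT(y|x)). A minimal sketch, assuming per-token log-probability tensors from both models and an illustrative β:

    def kl_shaped_reward(reward_score, logprobs_policy, logprobs_sft, beta=0.1):
        # logprobs_* are 1-D tensors of per-token log-probabilities of the response.
        # The summed log-ratio approximates the KL term; the penalty keeps the
        # policy close to the SFT distribution, where the reward model is accurate.
        kl_term = (logprobs_policy - logprobs_sft).sum()
        return reward_score - beta * kl_term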
Comparison with Baselines
• RLHF models are preferred by human labelers over the baselines
59
Evaluations on Different Aspects
60
Limitations of PPO Methods
• Need to train multiple models
– Reward model
– Policy model
• Needs sampling from the language model during fine-tuning
• Complicated reinforcement learning training process
• Is it possible to directly train a language model from human preference annotations?
61
Direct Preference Optimization
• Removes the iterative reinforcement learning process by directly tuning the model on human preferences (see the sketch below)
• DPO eliminates the need to
– train a reward model
– sample from the LM during fine-tuning
– perform a large hyperparameter search
62
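The DPO objective can be stated directly as -log σ(β[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]). A minimal PyTorch sketch, assuming the summed log-probabilities of each response under the trained policy and the frozen reference (SFT) model are already computed; names and β are illustrative:

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        # Implicit reward of a response is beta * log(pi_theta / pi_ref).
        chosen_logratio = logp_w - ref_logp_w
        rejected_logratio = logp_l - ref_logp_l
        # Push the preferred response's implicit reward above the dispreferred one's.
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))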
DPO versus Baselines
• DPO provides a higher expected reward compared to PPO (left)
• Higher win rate compared to human-written summarizations, as evaluated by GPT-4 (right)
63
Comparison between PPO and DPO
• Proximal policy optimization
– Complex reinforcement learning
– Iterative process
– Can handle more informative human feedback (e.g. numerical ratings)
• Direct preference optimization
– Simpler fine-tuning process by directly fitting the (implicit) reward model
– Cheaper and more stable training
– Can only handle binary preference signals
64
Fine-grained Human Feedback
• Assigning a single score to the model output may not be informative enough
65
Wu, Z. et al., 2024. Fine-grained Human Feedback gives Better Rewards for Language Model Training. Advances in Neural Information Processing Systems, 36.
Multiple Reward Functions
• Provide a reward after every segment (e.g. a sentence) is generated
• Different feedback types: factual incorrectness, irrelevance, and information incompleteness
• Combined reward: a weighted sum of the individual fine-grained rewards over the segments (see the sketch below)
66
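A minimal sketch of such a combined reward, assuming each generated segment has already been scored by the individual reward models; the reward names and weights are illustrative, not the paper's exact values:

    def combined_reward(segment_scores, weights):
        # segment_scores: one dict per generated segment, e.g.
        #   {"factuality": 1.0, "relevance": 0.0, "completeness": 0.5}
        # weights: relative importance of each fine-grained reward type
        total = 0.0
        for scores in segment_scores:
            total += sum(weights[name] * value for name, value in scores.items())
        return total

    # Example with illustrative weights (cf. the weight variation on a later slide):
    weights = {"factuality": 0.5, "relevance": 0.4, "completeness": 0.3}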
Example: Detoxification
• Measure toxicity
– 0: non-toxic
– 1: toxic
67
Example: Detoxification
• Learning from denser fine-grained rewards is more sample efficient than learning from holistic rewards
• The fine-grained location of toxic content is a stronger training signal than a single scalar value for the whole text (see the sketch below)
68
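To make the contrast concrete, here is a minimal sketch of a holistic versus a fine-grained detoxification reward; toxicity stands for an assumed classifier returning a score in [0, 1] (0 = non-toxic, 1 = toxic), not a specific API:

    def holistic_reward(text, toxicity):
        # One scalar reward for the entire output.
        return 1.0 - toxicity(text)

    def fine_grained_rewards(sentences, toxicity):
        # One reward per sentence localizes where the toxic content occurs.
        return [1.0 - toxicity(s) for s in sentences]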
Customizing LLM Behavior
• Keep the factuality/completeness reward weights fixed
• Vary the relevance reward weight: 0.4 / 0.3 / 0.2
• The relevance reward penalizes referencing passages and auxiliary information
69
Open Issues with RLHF
• There remain challenges within each of the three steps:
– Human feedback
– Reward model
– Policy
70
Casper, S., et al., 2023. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Transactions on Machine Learning Research.
Challenges: Human Feedback
• Biases of human evaluators
– Studies found that ChatGPT became politically biased after RLHF
• Good oversight is difficult
– Evaluators are paid per example and may make mistakes given time constraints
– Poor feedback when evaluating difficult tasks
• Data quality
– Cost/quality tradeoff
• Tradeoff between richness and efficiency of feedback types
– Comparison-based feedback, scalar feedback, correction feedback, language feedback, …
71
Challenges: Reward Model
• A single reward model cannot represent a diverse society of humans
• Reward misgeneralization: the reward model may fit the human preference data via unexpected, spurious features
• Evaluation of the reward model is difficult and expensive
72
Challenges: Policy
• Robust reinforcement learning is difficult
– Balance between exploring new actions and exploiting known rewards
– The challenge increases in high-dimensional or sparse-reward settings
• Policy misgeneralization: training and deployment environments are different
73
Summary: RLHF
• Reinforcement learning from human feedback makes it possible to directly model human preferences and to generalize beyond the labelled data
• Reinforcement learning from human feedback can improve on doing only instruction tuning
• Tricky to get right
• “Alignment tax”: performance on some tasks may suffer in favour of modelling outputs to human preference
74
Summary: RLHF
• Human preferences are unreliable!
– “Reward hacking” is a common problem in RL
– Chatbots are rewarded for producing responses that seem authoritative and helpful, regardless of truth, which can result in hallucinations
• Models of human preferences are even more unreliable!
• Still very data-expensive
• A very underexplored and fast-moving research area
75
Current Developments
• Focus on reasoning LLMs (OpenAI o1/o3, Deepseek, etc.)
– Incorporation of chain-of-thought prompting (next week) into the training procedure
– Introducing additional tokens to “give the model time to think” has also been shown to be helpful
• Reinforcement learning is used to automatically generate reasoning examples (e.g. Deepseek)
– Problem: how do we verify that the final output is correct if we do not have labels?
→ Use domains where the correct answer can be programmatically derived (math, coding, ...); see the sketch below
76
OpenAI Blog: https://openai.com/index/learning-to-reason-with-llms/
OpenAI o1 system card: https://cdn.openai.com/o1-system-card-20241205.pdf
Deepseek R1 paper: https://arxiv.org/abs/2501.12948
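For the programmatic-verification idea, a minimal sketch of a rule-based reward for a math-style task with a known reference answer; the "####" answer marker and exact-match check are illustrative conventions, not Deepseek's actual implementation:

    def verifiable_math_reward(model_output: str, reference_answer: str) -> float:
        # Take the text after the final answer marker and compare it to the reference.
        final_answer = model_output.split("####")[-1].strip()
        return 1.0 if final_answer == reference_answer.strip() else 0.0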
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
77
A Problem for Open Research
• The presented training procedures for creating performant LLMs require huge amounts of compute resources for extended periods of time (weeks to months)
• Public research institutions mostly do not have this kind of infrastructure/funding
• ChatGPT/Claude/Gemini/etc. are closed-source, proprietary models: we do not know the pre-training corpus and cannot access the model weights
➔ We can use them, but we can only operate on assumptions regarding their training data and the specifics of the training procedure
78
Llama: Open-Source Language Models
• Open-source models by Meta
• Available in various versions and sizes ranging from 7B to 405B parameters
• The pre-training corpus is transparent and the models are freely available to anyone
– Pre-training corpus: English CommonCrawl, C4, GitHub, Wikipedia, Gutenberg and Books3, ArXiv, Stack Exchange
– Researchers with limited computing resources can use the smaller models to understand how and why these language models work
➔ Currently the best alternative for research institutions to investigate topics like instruction tuning and reinforcement learning from human feedback
79
Touvron, H. et al., 2023. Llama: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
Touvron, H. et al., 2023. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
Dubey, A. et al., 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
Existing Large Language Models
• Many of the publicly available LLMs are based on the Llama series of models by Meta
80
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
Existing Large Language Models
81
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
See you next week!
• Next time: Prompt engineering and efficient adaptation
– Zero-shot, in-context learning, chain-of-thought, …
– Prompt tuning, adapter tuning, LoRA, …
82