Instruction Tuning and Reinforcement
Learning from Human Feedback
IE686 Large Language Models and Agents
Credits
• This slide set is based on slides from
– Jiaxin Huang
– Mrinmaya Sachan
– Tatsunori Hashimoto
• Many thanks to all of you!
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
Recap: Language Models over Time
• Simple n-gram models followed by shallow neural methods
and RNNs
• The Transformer architecture started the age of pre-trained
language models
– Large-scale Pre-training followed by task-specific fine-tuning
➔ Transfer Learning
Recap: Pre-training Data
Recap: Pre-training Decoder-only
Language Modeling ≠ Solving Tasks
• Language modelling with next-token prediction does not by itself make the model a competent task solver
• How can we adapt the model so that it solves tasks correctly?
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information
Processing Systems, 35, pp.27730-27744.
Pre-train/Fine-tune Paradigm of PLMs
• The pre-training stage lets language models learn generic
representations and knowledge from large corpora, but they are
not fine-tuned on any form of user tasks.
• To adapt a language model to a specific downstream task, use a comparatively small task-specific dataset for fine-tuning
➔ Transfer knowledge from pre-training, show the model what we want the output to look like, and subsequently perform well on that one task
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
Scaling up Language Models
• Scaling in three dimensions has been shown to strongly increase task-solving capability and generalization
– Model size in terms of parameters
– Amount of pre-training data
– Available training compute
Emergent Abilities of LLMs
• “Abilities that are not present in small models but arise in
large models”
• Three typical emergent abilities:
– In-context learning: After providing the LLM with one or several
task demonstrations in the prompt, it can generate the expected
output (next week)
– Instruction following: Fine-tuning the model with instructions for various tasks at once leads to strong performance on unseen tasks (instruction tuning -> our focus today)
– Step-by-step reasoning: LLMs can perform complex tasks by
breaking down a problem into smaller steps. The chain-of-thought
prompting mechanism is a popular example (next week)
J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022
Emergent Abilities of LLMs
J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022
• Emergent abilities can lead to sudden leaps in performance
on various tasks
Typical LLM Training Procedure
1. Self-supervised pre-training
(next token prediction)
2. Supervised training on human-written prompt/answer pairs (Step 1)
3. LLM tasked to generate multiple
outputs for a prompt, which are
ranked by a human and used to
train a reward model (Step 2)
4. The LLM is optimized with
reinforcement learning using
the reward model (Step 3)
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information
Processing Systems, 35, pp.27730-27744.
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
LLM Training Framework
(Figure labels: Instruction Tuning | Reinforcement Learning from Human Feedback)
Instruction Tuning
• Leverage emergent ability of the models
• Incorporate instructions into the fine-tuning procedure by
prepending a “description” of each task to be carried out
• Examples
– Sentiment -> “Is the sentiment of this movie review positive or
negative?”
– Translation (En to De) -> “Translate the following sentence into
German:”
– …
• Some simple templates are used to transform existing
datasets into an instructional format
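As an illustration of such a template (a minimal sketch, not taken from the slides; the template strings and field names are made up), an existing labeled example can be converted into instruction format like this:

# Minimal sketch: turn an existing labeled example into an instruction-style
# training example by prepending a task description. Templates are illustrative.
def to_instruction_example(task, example):
    templates = {
        "sentiment": "Is the sentiment of this movie review positive or negative?\n\n{text}",
        "translation_en_de": "Translate the following sentence into German:\n\n{text}",
    }
    prompt = templates[task].format(text=example["text"])
    return {"input": prompt, "target": example["label"]}

# Example usage:
# to_instruction_example("sentiment", {"text": "A wonderful film.", "label": "positive"})
# -> {"input": "Is the sentiment of this movie review positive or negative?\n\nA wonderful film.",
#     "target": "positive"}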
Instruction Tuning
• Fine-tune on many tasks at once
• Teaches the language model to follow different natural language instructions, so that it performs well on downstream tasks and even generalizes to unseen tasks
Increasing Generalization
Instruction Tuning: Adding Diversity
• There is a gap between NLP tasks and user needs…
• More diversity needs to be added to the data...
Adding Diversity via Task Prompts
• Example Task: Summarization
• Create diversity from the same example via prompt
variations
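A minimal sketch of this idea (illustrative only; the prompt wordings below are invented): the same document/summary pair is rendered with several paraphrased instruction templates, yielding several training examples.

# Sketch: create multiple instruction-style examples from one summarization
# instance by varying the prompt wording. The templates are made up.
PROMPT_VARIANTS = [
    "Summarize the following article:\n\n{document}",
    "{document}\n\nWrite a short summary of the text above.",
    "{document}\n\nTL;DR:",
]

def expand_with_prompt_variants(document, summary):
    return [{"input": t.format(document=document), "target": summary}
            for t in PROMPT_VARIANTS]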
T0 – An Instruction-tuned LLM
Sanh, V. et al., 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations.
T0 Training Sets
• Collected from multiple public NLP datasets covering a variety of tasks
Training Mixtures and Unseen Sets
• Training Mixtures:
– Question answering, structure-to-text, summarization
– Sentiment analysis, topic classification, paraphrase identification
• Unseen test set:
– Sentence completion, BIG-Bench
– Natural language inference, coreference resolution, word sense
disambiguation
• T0 is trained using the T5 transformer (11B model)
Task Adaptation with Prompt Templates
• Instead of directly using input/output pairs, specific
instructions are added to explain each task
• The outputs are natural language tokens instead of class
labels
Performance on Unseen Tasks
• For T5 and T0, each dot represents one evaluation prompt
Effect of Prompt Variations
• Increasing the number of paraphrasing prompts generally
leads to better performance
Effects of More Training Datasets
• Adding more datasets consistently leads to higher median
performance
Crowdsourcing for Instruction Tuning
• Crowdsourcing as a source of diverse instruction data
• Large dataset of natural language instructions created
– For 61 distinct tasks
– 193K instances (input/output pairs)
• Annotators follow a fixed instruction schema
Mishra, S. et al., 2022, May. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3470-3487).
Proposed Data Schema
• Title: High-level description of task
• Definition: Core detailed instructions
of task
• Things to avoid: Instructions regarding
undesirable annotations that need to
be avoided
• Emphasis/caution: Highlights statements to be emphasized or warned against
• Positive example: Example of desired
input/output pair
• Negative example: Example of
undesired input/output pair
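One possible way to represent this schema in code (a sketch with hypothetical field names that simply mirror the bullets above):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class InstructionTask:
    # Fields mirror the crowdsourcing schema described above.
    title: str                # high-level description of the task
    definition: str           # core detailed instructions
    things_to_avoid: str      # undesirable annotations to avoid
    emphasis_caution: str     # statements to emphasize or warn against
    positive_examples: List[Tuple[str, str]] = field(default_factory=list)  # desired input/output pairs
    negative_examples: List[Tuple[str, str]] = field(default_factory=list)  # undesired input/output pairs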
An Example in this Schema
Crowdsourced Dataset
• Random splitting of tasks (12 evaluation, 49 supervision)
• Leave-one-category-out
Generalization to Unseen Tasks
• Model: BART (140M,
instruction-tuned)
• All instruction elements
help improve model
performance on unseen
tasks, apart from negative
examples
Number of Training Tasks
• Generalization to unseen tasks improves with more
observed tasks
Comparison to the GPT3 LLM
• Model: BART (140M params., instruction-tuned)
• Baseline: GPT3 (175B params., not instruction-tuned)
• Instructions consistently
improve model performance
on unseen tasks
• BART with instruction tuning can often outperform GPT3 without it, despite being a much smaller model
Using LLMs to generate Instructions
• (Good) human-written instruction data is expensive
• Is it possible to reduce the labeling effort?
• Idea: generate instructions using an off-the-shelf LLM (GPT-3), starting from human-written seed tasks
Wang, Y., et al., 2023, July. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of
the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13484-13508).
Self-Instruct Framework
• Classify whether the generated instruction is a classification
task
• Output-first: avoid bias towards one class label
Self-Instruct Framework
• Filter out instructions that are too similar to existing ones
• Add newly generated tasks into the task pool for next
iteration
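A highly simplified sketch of this generate-filter-grow loop (illustrative only; generate_fn stands in for the LLM call, and the real pipeline uses ROUGE-L for the similarity filter):

import difflib

def too_similar(new_instruction, task_pool, threshold=0.7):
    # Stand-in similarity check; Self-Instruct filters with ROUGE-L against the pool.
    return any(difflib.SequenceMatcher(None, new_instruction, existing).ratio() > threshold
               for existing in task_pool)

def self_instruct(seed_tasks, generate_fn, num_iterations=10):
    task_pool = list(seed_tasks)                  # start from human-written seed tasks
    for _ in range(num_iterations):
        for candidate in generate_fn(task_pool):  # LLM proposes new instructions from sampled pool tasks
            if not too_similar(candidate, task_pool):
                task_pool.append(candidate)       # keep only novel instructions for the next iteration
    return task_pool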
Selected Tasks Generated by GPT-3
Self-Instruct Experiments
• Use GPT-3-davinci to generate new instruction tasks and
use them to subsequently fine-tune the model itself
• 175 seed tasks -> 52K instructions and 82K instances
Self-Instruct Evaluation
LIMA: Less is More for Alignment
• Hypothesis: A model’s knowledge and capabilities are
learned almost entirely during pre-training, while
instruction tuning teaches the right format to use when
interacting with users
• Is a small amount of data enough to achieve this goal and
still generalize to new unseen tasks?
Zhou, C., et al., 2024. Lima: Less is More for Alignment. Advances in Neural Information Processing Systems, 36.
LIMA: Less is More for Alignment
• Only 1,000 training examples: no self-generation and only a few manual annotations
– 750 top questions/answers selected from community forums
– 250 examples (prompt and response) manually written to exemplify the desired response style of the model
• Finally, instruction-tune a 65B Llama model on these 1,000 examples
Comparing LIMA with other LLMs
• By asking human crowd workers and GPT-4 which model
response is the better one (binary decision)
(Figure panels: Human Evaluation | GPT-4 Evaluation)
Important Factors
• Quality Control:
– Public data: select data with high user ratings
– Manually generated examples: make sure tone and format are
uniform
• Diversity Control:
– Public data: stratified sampling to increase domain diversity
– Manually generated examples: Create with wide range of
tasks/scenarios
Quality vs. Quantity vs. Diversity
• Scaling up training data does not necessarily improve the
model response quality
• Quality and diversity are important before quantity
Filtered Stack Exchange: diverse and high quality
Unfiltered Stack Exchange: diverse but low quality
wikiHow: high quality but low diversity
Format Constraint Impact on Response
• LIMA with or without 6 format-constraint examples
– Generating a product page with highlights, an about-the-product section, and usage instructions
– Paper reviews with summary, strengths, weaknesses, and potential
Comparing Instruction Datasets
• There is not a single best instruction tuning dataset across
all tasks
• Combining datasets results in the best overall performance
Wang, Y., et al., 2023. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources. Advances in Neural Information Processing Systems, 36, pp.74764-74786.
Impact of Base Model
• Base model quality is extremely important for downstream
task performance
• Llama is pre-trained on more tokens than other models
Impact of Model Size
• Smaller models benefit more from instruction-tuning
• Instruction-tuning does not help to enhance strong
capabilities already existing in the original model
Summary: Instruction Tuning
• Instruction tuning enables language models to follow novel
user instructions that are not seen during fine-tuning
➔This is what users want!
• Instruction-tuned models perform well on many tasks, not just a single one as with task-specific fine-tuning
• Limitations:
– Data collection is expensive, especially for complex tasks (quality and diversity control are necessary)
– Many tasks do not have a single acceptable output (or format); many different outputs can be considered correct
– Instruction tuning does not directly model human preferences
Summary: Instruction Tuning
• All presented techniques are used today to prepare
instruction-tuning data for LLMs
– Reformulating existing tasks into natural language format
– Crowdsourcing instructions and answers
– Generating instructions with LLMs themselves
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
The Problem of Supervised Fine-tuning
• There is still a misalignment between the ML objective –
maximizing the likelihood of a specific piece of human-
written text – and what humans actually want – generation
of high-quality outputs as determined by humans
• Language models go through another phase of learning,
called alignment, where they learn how to present
information to users and align to human preferences, e.g.:
– Helpfulness
– Honesty
– Harmlessness
• Do you see a problem with these preferences?
LLM Pre-training Framework
(Figure labels: Instruction Tuning | Reinforcement Learning from Human Feedback)
Reinforcement Learning Model
• An agent has a policy function, which can take an action A_t according to the current state S_t
• As a result of the action, the agent receives a reward R_t from the environment and transitions to the next state S_{t+1}
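As a generic illustration of this interaction loop (a sketch that assumes a gym-style env with reset/step and a policy callable; nothing here is specific to language models):

def run_episode(env, policy, max_steps=100):
    # The agent repeatedly picks an action A_t with its policy for the current
    # state S_t, receives a reward R_t, and transitions to the next state S_{t+1}.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward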
InstructGPT
• Agent: language model
• Action: predict the next token
• Policy: the output distribution over the next token
• Reward: a reward model trained from human evaluations of model responses
➔ Removes the need for a human in the loop during RL optimization
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information
Processing Systems, 35, pp.27730-27744.
Reward Model Training
• Prompt the supervised fine-tuned language model to produce pairs of answers
• Human annotators decide which one is preferred (preferred answer y_w, rejected answer y_l)
• The reward model is trained to score y_w higher than y_l
• The reward model is often initialized from π^SFT with a linear layer that produces a scalar reward value
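Written out, the standard pairwise ranking loss used to train such a reward model r_φ (with σ the logistic function, y_w the preferred and y_l the rejected answer) looks as follows:

\mathcal{L}_{RM}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]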
RLHF: Proximal Policy Optimization
• Optimize the language model with feedback from the
reward model
• Prevents mode collapse to single high-reward answers
• Prevents the model from deviating too far from the
distribution where the reward model is accurate
Schulman, J. et al., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
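In this stage, one typically maximizes a KL-penalized objective of the following form (a sketch; π^SFT is the supervised fine-tuned reference model, r_φ the learned reward model, and β controls the strength of the penalty):

\max_{\theta}\;\; \mathbb{E}_{x\sim \mathcal{D},\; y\sim \pi_\theta(\cdot\mid x)}\big[\, r_\phi(x, y)\,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi^{SFT}(\cdot\mid x)\big)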
Comparison with Baselines
• RLHF models are more preferred by human labelers
Evaluations on Different Aspects
Limitations of PPO Methods
• Need to train multiple models
– Reward model
– Policy model
• Needs sampling from Language model during fine-tuning
• Complicated reinforcement learning training process
• Is it possible to directly train a language model from human
preference annotations?
Direct Preference Optimization
• Removes the iterative
reinforcement learning
process by directly tuning
the model on human
preferences
• DPO eliminates the need to
– train a reward model
– sample from the LM during fine-tuning
– perform a large hyperparameter search
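Concretely, DPO fits the policy with a single classification-style loss on preference pairs (π_ref is the frozen reference/SFT model and β a temperature-like hyperparameter):

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]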
DPO versus Baselines
• DPO provides a higher expected reward compared to PPO (left)
• Higher win-rate compared to human-written summaries, as evaluated by GPT-4 (right)
Comparison between PPO and DPO
• Proximal policy optimization
– Complex reinforcement learning
– Iterative process
– Can handle more informative
human feedback (e.g. numerical
ratings)
• Direct preference optimization
– Simpler fine-tuning process: the policy is fit directly on preference data (implicitly defining the reward)
– Cheaper and more stable training
– Can only handle binary preference signals
Fine-grained Human Feedback
• Assigning a single score to the model output may not be
informative enough
Wu, Z. et al., 2024. Fine-grained Human Feedback gives Better Rewards for
Language Model Training. Advances in Neural Information Processing Systems, 36.
Multiple Reward Functions
• Provide a reward after every segment (e.g. a sentence) is generated
• Different feedback types: factual incorrectness, irrelevance, and information incompleteness
• Combined reward: a weighted sum of the individual fine-grained rewards (see the sketch below)
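As a sketch of this combination (notation loosely follows the fine-grained RLHF setup; K is the number of feedback types and w_k the weight of type k), the reward assigned at the end of a segment t can be written as:

r_t \;=\; \sum_{k=1}^{K} w_k\, r^{(k)}_t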
Example: Detoxification
• Measure toxicity
– 0: non-toxic
– 1: toxic
Example: Detoxification
• Learning from denser fine-grained rewards is more sample
efficient than learning from holistic rewards
• Fine-grained location of toxic content is a stronger training
signal than a single scalar value for the whole text.
Customizing LLM Behavior
• Keep factualness/completeness reward weights fixed
• Vary the relevance reward weight: 0.4/0.3/0.2
• Relevance reward penalizes referencing passages and
auxiliary information
Open Issues with RLHF
• There remain challenges within each of the three steps
– Human feedback
– Reward model
– Policy
Casper, S., et al., 2023. Open Problems and Fundamental Limitations of Reinforcement Learning
from Human Feedback. Transactions on Machine Learning Research.
Challenges: Human Feedback
• Biases of human evaluators
– Studies found that ChatGPT became politically biased after RLHF
• Good oversight is difficult
– Evaluators are paid per example and may make mistakes given time
constraints
– Poor feedback when evaluating difficult tasks
• Data Quality
– Cost/Quality tradeoff
• Tradeoff between richness and efficiency of feedback types
– Comparison-based feedback, scalar feedback, correction feedback,
language feedback, …
Challenges: Reward Model
• A single reward model cannot represent a diverse society of
humans
• Reward misgeneralization: the reward model may fit the human preference data by relying on unexpected (spurious) features
• Evaluation of reward model is difficult and expensive
Challenges: Policy
• Robust reinforcement learning is difficult
– Balance between exploring new actions and exploiting known
rewards
– Challenge increases in high-dimensional or sparse reward settings
• Policy misgeneralization: training and deployment
environments are different
Summary: RLHF
• Reinforcement Learning from Human Feedback allows us to directly model human preferences and to generalize beyond the labelled data
• Reinforcement Learning from Human Feedback can improve over doing only instruction tuning
• Tricky to get right
• “Alignment Tax”: performance on tasks may suffer in favour
of modelling outputs to human preference
Summary: RLHF
• Human preferences are unreliable!
– “Reward hacking” is a common problem in RL
– Chatbots are rewarded for producing responses that seem authoritative and helpful, regardless of truth, which can result in hallucinations
• Models of human preferences are even more unreliable!
• Still very data expensive
• Very underexplored and fast-moving research area
Current Developments
• Focus on Reasoning LLMs (OpenAI o1/o3, DeepSeek, etc.)
– Incorporation of chain-of-thought prompting (next week) into the training procedure
– Introducing additional tokens to “give the model time to think” has also been shown to be helpful
• Reinforcement learning is used to automatically generate reasoning examples (e.g. DeepSeek)
– Problem: How do we verify that the final output is correct if we do not have labels?
→ Use domains where the correct answer can be programmatically derived (math, coding, ...) – see the sketch below
OpenAI Blog: https://openai.com/index/learning-to-reason-with-llms/
OpenAI o1 system card: https://cdn.openai.com/o1-system-card-20241205.pdf
Deepseek R1 paper: https://arxiv.org/abs/2501.12948
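A toy sketch of such a programmatically checkable reward for math-style tasks (illustrative only; the answer-extraction convention is made up, and real systems use far more robust parsing and checking):

def extract_final_answer(model_output):
    # Toy convention: assume the model ends its reasoning with "Answer: <value>".
    marker = "Answer:"
    return model_output.rsplit(marker, 1)[-1].strip() if marker in model_output else ""

def verifiable_reward(model_output, reference_answer):
    # Reward 1.0 if the extracted answer matches the reference exactly, else 0.0.
    return 1.0 if extract_final_answer(model_output) == reference_answer.strip() else 0.0

# verifiable_reward("Let's reason step by step ... Answer: 42", "42")  ->  1.0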
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
A Problem for Open Research
• The presented training procedures for creating performant LLMs require huge amounts of compute resources for extended periods of time (weeks to months)
• Public research institutions mostly do not have this kind of infrastructure/funding
• ChatGPT/Claude/Gemini/etc. are closed-source, proprietary models: we do not know their pre-training corpora and we cannot access their weights
➔ We can use them but we can only operate on assumptions
regarding their training data and specifics of the training
procedure
Llama: Open-Source Language Models
• Open-source models by Meta
• Available in various versions and sizes ranging from 7B to 405B
parameters
• The pre-training corpus is transparent and the models are freely
available for anyone
– Pre-training corpus: English CommonCrawl, C4, Github, Wikipedia,
Gutenberg and Books3, ArXiv, Stack Exchange
– Researchers with limited computing resources can use smaller models to
understand how and why these language models work
➔ Currently the best alternative for research institutions to
investigate topics like instruction tuning and reinforcement learning
from human feedback
Touvron, H. et al., 2023. Llama: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
Touvron, H. et al., 2023. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
Dubey, A. et al., 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
Existing Large Language Models
• Many of the publicly available LLMs are based on the Llama series of models by Meta
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
Existing Large Language Models
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
See you next week!
• Next time: Prompt engineering and efficient adaptation
– Zero-shot, in-context learning, chain-of-thought, …
– Prompt tuning, adapter tuning, LoRA, …
82

More Related Content

PDF
LLM Agents and Tool Use Data and Web Science Group IE686 Large Language Model...
PDF
Introduction and Organization Data and Web Science Group IE686 Large Language...
PDF
Project Topic Presentation Data and Web Science Group IE686 Large Language Mo...
PDF
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
PDF
Why xAPI? A Business Leader's Getting Started Guide
PPTX
Teaching with MATLAB
PPT
Open lw reference architecture project
PPTX
IEEE SocialCom 2009: NetViz Nirvana (NodeXL Learnability)
LLM Agents and Tool Use Data and Web Science Group IE686 Large Language Model...
Introduction and Organization Data and Web Science Group IE686 Large Language...
Project Topic Presentation Data and Web Science Group IE686 Large Language Mo...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
Why xAPI? A Business Leader's Getting Started Guide
Teaching with MATLAB
Open lw reference architecture project
IEEE SocialCom 2009: NetViz Nirvana (NodeXL Learnability)

Similar to Instruction Tuning and Reinforcement Learning from Human Feedback Data and Web Science Group IE686 Large Language Models and Agents (20)

PPTX
Model-Driven Spreadsheet Development
PPTX
Empowering End-Users to Collaboratively Structure Knowledge-Intensive Processes
PPT
VII Jornadas eMadrid "Education in exponential times". Mesa redonda eMadrid L...
PPT
Final ec2 kt
DOC
Resume
PDF
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
PPTX
2211 APSIPA
PDF
poster_eduacion_ai_universidad_catolica_chile.pdf
PPTX
College Monitoring system BY: Geekssay.com
PPT
Cb Cetis June 2007 Final
PDF
Teaching Data-driven Video Processing via Crowdsourced Data Collection
PDF
Data-X-Sparse-v2
PPTX
2. Evaluation design of the cofimvaba ict4 red initiative - Bridge 2014 version
PPTX
PhD_Research_Proposal_Machine_Learning.pptx
PDF
Modellbildung, Berechnung und Simulation in Forschung und Lehre
PPT
DRESD Project Presentation - December 2006
PDF
A new Moodle module supporting automatic verification of VHDL-based assignmen...
PPTX
Optimization Software in Operational Research Analysis in a Public University...
PPTX
Towards Open Architectures and Interoperability for Learning Analytics
PPTX
BPM Cluster Meeting 2018
Model-Driven Spreadsheet Development
Empowering End-Users to Collaboratively Structure Knowledge-Intensive Processes
VII Jornadas eMadrid "Education in exponential times". Mesa redonda eMadrid L...
Final ec2 kt
Resume
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
2211 APSIPA
poster_eduacion_ai_universidad_catolica_chile.pdf
College Monitoring system BY: Geekssay.com
Cb Cetis June 2007 Final
Teaching Data-driven Video Processing via Crowdsourced Data Collection
Data-X-Sparse-v2
2. Evaluation design of the cofimvaba ict4 red initiative - Bridge 2014 version
PhD_Research_Proposal_Machine_Learning.pptx
Modellbildung, Berechnung und Simulation in Forschung und Lehre
DRESD Project Presentation - December 2006
A new Moodle module supporting automatic verification of VHDL-based assignmen...
Optimization Software in Operational Research Analysis in a Public University...
Towards Open Architectures and Interoperability for Learning Analytics
BPM Cluster Meeting 2018
Ad

Recently uploaded (20)

PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
CHAPTER 2 - PM Management and IT Context
PDF
medical staffing services at VALiNTRY
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
top salesforce developer skills in 2025.pdf
PPTX
Transform Your Business with a Software ERP System
PDF
AI in Product Development-omnex systems
PPTX
VVF-Customer-Presentation2025-Ver1.9.pptx
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
Nekopoi APK 2025 free lastest update
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PPTX
Essential Infomation Tech presentation.pptx
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
How Creative Agencies Leverage Project Management Software.pdf
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PPTX
L1 - Introduction to python Backend.pptx
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Which alternative to Crystal Reports is best for small or large businesses.pdf
Design an Analysis of Algorithms I-SECS-1021-03
CHAPTER 2 - PM Management and IT Context
medical staffing services at VALiNTRY
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
top salesforce developer skills in 2025.pdf
Transform Your Business with a Software ERP System
AI in Product Development-omnex systems
VVF-Customer-Presentation2025-Ver1.9.pptx
Adobe Illustrator 28.6 Crack My Vision of Vector Design
Nekopoi APK 2025 free lastest update
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
Essential Infomation Tech presentation.pptx
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
How Creative Agencies Leverage Project Management Software.pdf
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
L1 - Introduction to python Backend.pptx
Ad

Instruction Tuning and Reinforcement Learning from Human Feedback Data and Web Science Group IE686 Large Language Models and Agents

  • 1. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning and Reinforcement Learning from Human Feedback IE686 Large Language Models and Agents 1
  • 2. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Credits • This slide set is based on slides from – Jiaxin Huang – Mrinmaya Sachan – Tatsunori Hashimoto • Many thanks to all of you! 2
  • 3. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 3
  • 4. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Recap: Language Models over Time • Simple n-gram models followed by shallow neural methods and RNNs • The Transformer architecture started the age of pre-trained language models – Large-scale Pre-training followed by task-specific fine-tuning ➔ Transfer Learning 4
  • 5. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Recap: Pre-training Data 5
  • 6. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Recap: Pre-training Decoder-only 6 Original Sentence:
  • 7. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Language Modeling ≠ Solving Tasks • Language modelling with next token prediction does not make the model a competent task solver • How to adapt to correctly solving tasks? 7 Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.
  • 8. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Pre-train/Fine-tune Paradigm of PLMs • The pre-training stage lets language models learn generic representations and knowledge from large corpora, but they are not fine-tuned on any form of user tasks. • To adapt language models to a specific downstream task, use comparably small task-specific datasets for fine-tuning ➔ Transfer knowledge from pre-training, show the model what we want the output to look like and subsequently perform well on one task 8
  • 9. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 9
  • 10. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Scaling up Language Models • Scaling in three dimensions has been shown to strongly increase task solving capability and generalization – Model size in terms of parameters – Increasing pre-training data – Available training compute 10
  • 11. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Emergent Abilities of LLMs • “Abilities that are not present in small models but arise in large models” • Three typical emergent abilities: – In-context learning: After providing the LLM with one or several task demonstrations in the prompt, it can generate the expected output (next week) – Instruction following: Fine-tuning the model with instructions for various tasks at once, leads to strong performance on unseen tasks (instruction tuning -> our focus today) – Step-by-step reasoning: LLMs can perform complex tasks by breaking down a problem into smaller steps. The chain-of-thought prompting mechanism is a popular example (next week) 11 J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022
  • 12. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Emergent Abilities of LLMs 12 J. Wei et al., “Emergent Abilities of Large Language Models,” CoRR, vol. abs/2206.07682, 2022 • Emergent abilities can lead to sudden leaps in performance on various tasks
  • 13. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Typical LLM Training Procedure 1. Self-supervised pre-training (next token prediction) 2. Supervised training on pairs of human-written prompt/answer pairs (Step 1) 3. LLM tasked to generate multiple outputs for a prompt, which are ranked by a human and used to train a reward model (Step 2) 4. The LLM is optimized with reinforcement learning using the reward model (Step 3) 13 Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.
  • 14. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 14
  • 15. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LLM Training Framework 15 Instruction-Tuning Reinforcement Learning from Human Feedback
  • 16. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning • Leverage emergent ability of the models • Incorporate instructions into the fine-tuning procedure by prepending a “description” of each task to be carried out • Examples – Sentiment -> “Is the sentiment of this movie review positive or negative?” – Translation (En to De) -> “Translate the following sentence into German:” – … • Some simple templates are used to transform existing datasets into an instructional format 16
  • 17. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning • Fine-tune on many tasks at once • Teaches language model to follow different natural language instructions, so that it can perform well on downstream tasks and even generalize to unseen tasks 17
  • 18. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Increasing Generalization 18
  • 19. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Instruction Tuning: Adding Diversity • There is a gap between NLP tasks and user needs… • More diversity needs to be added to the data... 19
  • 20. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Adding Diversity via Task Prompts • Example Task: Summarization • Create diversity from the same example via prompt variations 20
  • 21. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group T0 – An Instruction-tuned LLM 21 Sanh, V. et al., Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations.
  • 22. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group T0 Training Sets • Collected from multiple public NLP datasets and variety of tasks 22
  • 23. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Training Mixtures and Unseen Sets • Training Mixtures: – Question answering, structure-to-text, summarization – Sentiment analysis, topic classification, paraphrase identification • Unseen test set: – Sentence completion, BIG-Bench – Natural language inference, coreference resolution, word sense disambiguation • T0 is trained using the T5 transformer (11B model) 23
  • 24. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Task Adaptation with Prompt Templates • Instead of directly using input/output pairs, specific instructions are added to explain each task • The outputs are natural language tokens instead of class labels 24
  • 25. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Performance on Unseen Tasks • For T5 and T0, each dot represents one evaluation prompt 25
  • 26. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Effect of Prompt Variations • Increasing the number of paraphrasing prompts generally leads to better performance 26
  • 27. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Effects of More Training Datasets • Adding more datasets consistently leads to higher median performance 27
  • 28. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Crowdsourcing for Instruction Tuning • Crowdsourcing as source for diverse instruction data • Large dataset of natural language instructions created – For 61 distinct tasks – 193K instances (input/output pairs) • Using a set instruction schema for the annotators 28 Mishra, S. et al., 2022, May. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3470-3487).
  • 29. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Proposed Data Schema • Title: High-level description of task • Definition: Core detailed instructions of task • Things to avoid: Instructions regarding undesirable annotations that need to be avoided • Emphasis/caution: highlights statements to be emphasized or warned against • Positive example: Example of desired input/output pair • Negative example: Example of undesired input/output pair 29
  • 30. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group An Example in this Schema 30
  • 31. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Crowdsourced Dataset • Random splitting of tasks (12 evaluation, 49 supervision) • Leave-one-category-out 31
  • 32. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Generalization to Unseen Tasks • Model: BART (140M, instruction-tuned) • All instruction elements help improve model performance on unseen tasks, apart from negative examples 32
  • 33. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Number of Training Tasks • Generalization to unseen tasks improves with more observed tasks 33
  • 34. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Comparison to the GPT3 LLM • Model: BART (140M params., instruction-tuned) • Baseline: GPT3 (175B params., not instruction-tuned) 34 • Instructions consistently improve model performance on unseen tasks • BART with instruction-tuning can often outperform GPT3 without, albeit being a much smaller model
  • 35. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Using LLMs to generate Instructions • (Good) Human-written instruction data is expensive • Possible to reduce the labeling effort? • Idea: generate instructions using an off-the-shelf LLM (GPT- 3) with human written seed tasks 35 Wang, Y., et al., 2023, July. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13484-13508).
  • 36. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Framework • Classify whether the generated instruction is a classification task • Output-first: avoid bias towards one class label 36
  • 37. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Framework • Filter out instructions similar with existing ones • Add newly generated tasks into the task pool for next iteration 37
  • 38. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Selected Tasks Generated by GPT-3 38
  • 39. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Experiments • Use GPT-3-davinci to generate new instruction tasks and use them to subsequently fine-tune the model itself • 175 seed tasks -> 52K instructions and 82K instances 39
  • 40. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Self-Instruct Evaluation 40
  • 41. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LIMA: Less is More for Alignment • Hypothesis: A model’s knowledge and capabilities are learned almost entirely during pre-training, while instruction tuning teaches the right format to use when interacting with users • Is a small amount of data enough to achieve this goal and still generalize to new unseen tasks? 41 Zhou, C., et al., 2024. Lima: Less is More for Alignment. Advances in Neural Information Processing Systems, 36.
  • 42. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LIMA: Less is More for Alignment • Only 1000 training examples: no self-generation and only few manual annotations – 750 top questions/answers selected from community forums – 250 examples (prompt and response) manually written to exemplify the desired response style of the model • Finally instruction-tune 65B Llama model on these 1000 examples 42
  • 43. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Comparing LIMA with other LLMs • By asking human crowd workers and GPT-4 which model response is the better one (binary decision) 43 Human Evaluation GPT4 Evaluation
  • 44. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Important Factors • Quality Control: – Public data: select data with high user ratings – Manually generated examples: make sure tone and format are uniform • Diversity Control: – Public data: stratified sampling to increase domain diversity – Manually generated examples: Create with wide range of tasks/scenarios 44
  • 45. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Quality vs. Quantity vs. Diversity • Scaling up training data does not necessarily improve the model response quality • Quality and diversity are important before quantity 45 Filtered Stack Exchange: diverse and high quality Unfiltered Stack Exchange: diverse but low quality wikiHow: high quality but low diversity
  • 46. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Format Constraint Impact on Response • LIMA with or without 6 format constraint examples – Generating product page with highlights, about the product and how to use – Paper reviews with summary, strengths, weaknesses and potentials 46
  • 47. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Comparing Instruction Datasets • There is not a single best instruction tuning dataset across all tasks • Combining datasets results in the best overall performance 47 Wang, Y., et al., 2023. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. Advances in Neural Information Processing Systems, 36, pp.74764-74786.
  • 48. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Impact of Base Model • Base model quality is extremely important for downstream task performance • Llama is pre-trained on more tokens than other models 48
  • 49. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Impact of Model Size • Smaller models benefit more from instruction-tuning • Instruction-tuning does not help to enhance strong capabilities already existing in the original model 49
  • 50. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Summary: Instruction Tuning • Instruction tuning enables language models to follow novel user instructions that are not seen during fine-tuning ➔This is what users want! • Instruction-tuned models perform well on many tasks not just a single one as with task-specific fine-tuning • Limitations: – Data collection is expensive, especially for complex tasks (quality and diversity control are necessary) – Many tasks do not have a single acceptable output (format) but many can be considered correct – Instruction tuning does not directly model human preferences 50
  • 51. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Summary: Instruction Tuning • All presented techniques are used today to prepare instruction-tuning data for LLMs – Reformulating existing tasks into natural language format – Crowdsourcing instructions and answers – Generating instructions with LLMs themselves 51 Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
  • 52. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group Outline • Recap: Pre-training Language Models • Scaling up and Emergent Abilities of LLMs • Instruction Tuning • Reinforcement Learning from Human Feedback • Existing Large Language Models 52
  • 53. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group The Problem of Supervised Fine-tuning • There is still a misalignment between the ML objective – maximizing the likelihood of a specific piece of human- written text – and what humans actually want – generation of high-quality outputs as determined by humans • Language models go through another phase of learning, called alignment, where they learn how to present information to users and align to human preferences, e.g.: – Helpfulness – Honesty – Harmlessness • Do you see a problem with these preferences? 53
  • 54. University of Mannheim | IE686 LLMs and Agents | Instruction Tuning and RLHF | Version 17.02.2025 Data and Web Science Group LLM Pre-training Framework 54 Instruction-Tuning Reinforcement Learning from Human Feedback
Reinforcement Learning Model
• An agent has a policy function, which takes an action A_t according to the current state S_t
• As a result of the action, the agent receives a reward R_t from the environment and transitions to the next state S_{t+1} (see the sketch below)
55
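As a minimal illustration of this agent-environment loop (not tied to any specific RL library), the following Python sketch shows an agent repeatedly choosing actions, receiving rewards, and transitioning to new states; env, policy, and their methods are hypothetical stand-ins.

    # Minimal sketch of the agent-environment loop; `env` and `policy` are
    # illustrative stand-ins, not a specific RL library API.
    def rollout(env, policy, num_steps=10):
        state = env.reset()                        # initial state S_0
        for t in range(num_steps):
            action = policy.act(state)             # A_t chosen according to S_t
            next_state, reward = env.step(action)  # environment returns R_t and S_{t+1}
            policy.update(state, action, reward, next_state)  # learn from the reward
            state = next_state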
InstructGPT
• Agent: the language model
• Action: predicting the next token
• Policy: the output distribution over the next token
• Reward: a reward model trained on human evaluations of model responses
➔ Removes the need for a human-in-the-loop
56
Ouyang, L et al., 2022. Training Language Models to follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.
Reward Model Training
• Prompt the supervised fine-tuned language model to produce pairs of answers
• Human annotators decide which one is preferred
• The reward model is trained to score y_w higher than y_l (see the sketch below)
• The reward model is often initialized from π_SFT with a linear layer to produce a scalar reward value
57
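The ranking objective can be written as a pairwise (Bradley-Terry-style) loss, -log σ(r(x, y_w) - r(x, y_l)). Below is a minimal PyTorch sketch under the assumption that reward_model returns a scalar score for a prompt/response pair; the function and argument names are illustrative, not InstructGPT's actual code.

    import torch.nn.functional as F

    def reward_model_loss(reward_model, prompt, chosen, rejected):
        # Scalar scores for the preferred (y_w) and dispreferred (y_l) response.
        r_w = reward_model(prompt, chosen)
        r_l = reward_model(prompt, rejected)
        # Train the reward model to rank y_w above y_l: -log sigmoid(r_w - r_l)
        return -F.logsigmoid(r_w - r_l)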
RLHF: Proximal Policy Optimization
• Optimize the language model with feedback from the reward model
• Prevents mode collapse to single high-reward answers
• Prevents the model from deviating too far from the distribution where the reward model is accurate (see the sketch below)
58
Schulman, J. et al., 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
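One common way to implement the last point is to subtract a KL penalty against the frozen SFT policy from the reward that PPO maximizes, i.e. r(x, y) - β·log(π_RL(y|x)/π_SFT(y|x)). A minimal sketch, assuming per-token log-probability tensors from both models and an illustrative β:

    def kl_shaped_reward(reward_score, logprobs_policy, logprobs_sft, beta=0.1):
        # logprobs_* are 1-D tensors of per-token log-probabilities of the response.
        # The summed log-ratio approximates the KL term; the penalty keeps the
        # policy close to the SFT distribution, where the reward model is accurate.
        kl_term = (logprobs_policy - logprobs_sft).sum()
        return reward_score - beta * kl_term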
Comparison with Baselines
• RLHF models are preferred by human labelers over the baselines
59
Evaluations on Different Aspects
60
Limitations of PPO Methods
• Need to train multiple models
– Reward model
– Policy model
• Needs sampling from the language model during fine-tuning
• Complicated reinforcement learning training process
• Is it possible to directly train a language model from human preference annotations?
61
Direct Preference Optimization
• Removes the iterative reinforcement learning process by directly tuning the model on human preferences (see the sketch below)
• DPO eliminates the need to
– train a reward model
– sample from the LM during fine-tuning
– perform a large hyperparameter search
62
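The DPO objective can be stated directly as -log σ(β[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]). A minimal PyTorch sketch, assuming the summed log-probabilities of each response under the trained policy and the frozen reference (SFT) model are already computed; names and β are illustrative:

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        # Implicit reward of a response is beta * log(pi_theta / pi_ref).
        chosen_logratio = logp_w - ref_logp_w
        rejected_logratio = logp_l - ref_logp_l
        # Push the preferred response's implicit reward above the dispreferred one's.
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))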
DPO versus Baselines
• DPO provides a higher expected reward compared to PPO (left)
• Higher win rate compared to human-written summarizations, as evaluated by GPT-4 (right)
63
Comparison between PPO and DPO
• Proximal policy optimization
– Complex reinforcement learning
– Iterative process
– Can handle more informative human feedback (e.g. numerical ratings)
• Direct preference optimization
– Simpler fine-tuning process by directly fitting the (implicit) reward model
– Cheaper and more stable training
– Can only handle binary preference signals
64
Fine-grained Human Feedback
• Assigning a single score to the model output may not be informative enough
65
Wu, Z. et al., 2024. Fine-grained Human Feedback gives Better Rewards for Language Model Training. Advances in Neural Information Processing Systems, 36.
Multiple Reward Functions
• Provide a reward after every segment (e.g. a sentence) is generated
• Different feedback types: factual incorrectness, irrelevance, and information incompleteness
• Combined reward: a weighted sum of the individual fine-grained rewards over the segments (see the sketch below)
66
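A minimal sketch of such a combined reward, assuming each generated segment has already been scored by the individual reward models; the reward names and weights are illustrative, not the paper's exact values:

    def combined_reward(segment_scores, weights):
        # segment_scores: one dict per generated segment, e.g.
        #   {"factuality": 1.0, "relevance": 0.0, "completeness": 0.5}
        # weights: relative importance of each fine-grained reward type
        total = 0.0
        for scores in segment_scores:
            total += sum(weights[name] * value for name, value in scores.items())
        return total

    # Example with illustrative weights (cf. the weight variation on a later slide):
    weights = {"factuality": 0.5, "relevance": 0.4, "completeness": 0.3}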
Example: Detoxification
• Measure toxicity
– 0: non-toxic
– 1: toxic
67
Example: Detoxification
• Learning from denser fine-grained rewards is more sample efficient than learning from holistic rewards
• The fine-grained location of toxic content is a stronger training signal than a single scalar value for the whole text (see the sketch below)
68
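To make the contrast concrete, here is a minimal sketch of a holistic versus a fine-grained detoxification reward; toxicity stands for an assumed classifier returning a score in [0, 1] (0 = non-toxic, 1 = toxic), not a specific API:

    def holistic_reward(text, toxicity):
        # One scalar reward for the entire output.
        return 1.0 - toxicity(text)

    def fine_grained_rewards(sentences, toxicity):
        # One reward per sentence localizes where the toxic content occurs.
        return [1.0 - toxicity(s) for s in sentences]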
Customizing LLM Behavior
• Keep the factuality/completeness reward weights fixed
• Vary the relevance reward weight: 0.4 / 0.3 / 0.2
• The relevance reward penalizes referencing passages and auxiliary information
69
Open Issues with RLHF
• There remain challenges within each of the three steps:
– Human feedback
– Reward model
– Policy
70
Casper, S., et al., 2023. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Transactions on Machine Learning Research.
Challenges: Human Feedback
• Biases of human evaluators
– Studies found that ChatGPT became politically biased after RLHF
• Good oversight is difficult
– Evaluators are paid per example and may make mistakes given time constraints
– Poor feedback when evaluating difficult tasks
• Data quality
– Cost/quality tradeoff
• Tradeoff between richness and efficiency of feedback types
– Comparison-based feedback, scalar feedback, correction feedback, language feedback, …
71
Challenges: Reward Model
• A single reward model cannot represent a diverse society of humans
• Reward misgeneralization: the reward model may fit the human preference data via unexpected, spurious features
• Evaluation of the reward model is difficult and expensive
72
Challenges: Policy
• Robust reinforcement learning is difficult
– Balance between exploring new actions and exploiting known rewards
– The challenge increases in high-dimensional or sparse-reward settings
• Policy misgeneralization: training and deployment environments are different
73
Summary: RLHF
• Reinforcement learning from human feedback makes it possible to directly model human preferences and to generalize beyond the labelled data
• Reinforcement learning from human feedback can improve on doing only instruction tuning
• Tricky to get right
• “Alignment tax”: performance on some tasks may suffer in favour of modelling outputs to human preference
74
Summary: RLHF
• Human preferences are unreliable!
– “Reward hacking” is a common problem in RL
– Chatbots are rewarded for producing responses that seem authoritative and helpful, regardless of truth, which can result in hallucinations
• Models of human preferences are even more unreliable!
• Still very data-expensive
• A very underexplored and fast-moving research area
75
Current Developments
• Focus on reasoning LLMs (OpenAI o1/o3, Deepseek, etc.)
– Incorporation of chain-of-thought prompting (next week) into the training procedure
– Introducing additional tokens to “give the model time to think” has also been shown to be helpful
• Reinforcement learning is used to automatically generate reasoning examples (e.g. Deepseek)
– Problem: how do we verify that the final output is correct if we do not have labels?
→ Use domains where the correct answer can be programmatically derived (math, coding, ...); see the sketch below
76
OpenAI Blog: https://openai.com/index/learning-to-reason-with-llms/
OpenAI o1 system card: https://cdn.openai.com/o1-system-card-20241205.pdf
Deepseek R1 paper: https://arxiv.org/abs/2501.12948
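For the programmatic-verification idea, a minimal sketch of a rule-based reward for a math-style task with a known reference answer; the "####" answer marker and exact-match check are illustrative conventions, not Deepseek's actual implementation:

    def verifiable_math_reward(model_output: str, reference_answer: str) -> float:
        # Take the text after the final answer marker and compare it to the reference.
        final_answer = model_output.split("####")[-1].strip()
        return 1.0 if final_answer == reference_answer.strip() else 0.0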
Outline
• Recap: Pre-training Language Models
• Scaling up and Emergent Abilities of LLMs
• Instruction Tuning
• Reinforcement Learning from Human Feedback
• Existing Large Language Models
77
A Problem for Open Research
• The presented training procedures for creating performant LLMs require huge amounts of compute resources for extended periods of time (weeks to months)
• Public research institutions mostly do not have this kind of infrastructure/funding
• ChatGPT/Claude/Gemini/etc. are closed-source, proprietary models: we do not know the pre-training corpus and cannot access the model weights
➔ We can use them, but we can only operate on assumptions regarding their training data and the specifics of the training procedure
78
Llama: Open-Source Language Models
• Open-source models by Meta
• Available in various versions and sizes ranging from 7B to 405B parameters
• The pre-training corpus is transparent and the models are freely available to anyone
– Pre-training corpus: English CommonCrawl, C4, GitHub, Wikipedia, Gutenberg and Books3, ArXiv, Stack Exchange
– Researchers with limited computing resources can use the smaller models to understand how and why these language models work
➔ Currently the best alternative for research institutions to investigate topics like instruction tuning and reinforcement learning from human feedback
79
Touvron, H. et al., 2023. Llama: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
Touvron, H. et al., 2023. Llama 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
Dubey, A. et al., 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
Existing Large Language Models
• Many of the publicly available LLMs are based on the Llama series of models by Meta
80
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
Existing Large Language Models
81
Zhao et al.: A Survey of Large Language Models. 2024. arXiv:2303.18223
See you next week!
• Next time: Prompt engineering and efficient adaptation
– Zero-shot, in-context learning, chain-of-thought, …
– Prompt tuning, adapter tuning, LoRA, …
82