Paper review summary
Made by Seoung-Ho Choi
Content
• GPT-1 (Improving Language Understanding by Generative Pre-Training)
• GPT-2 (Language Models are Unsupervised Multitask Learners)
• GPT-3 (Language Models are Few-Shot Learners)
GPT-1
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Goal:
• We demonstrate that large gains on these tasks can be realized by generative pre-training of a language
model on a diverse corpus of unlabeled text.
• In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to
achieve effective transfer while requiring minimal changes to the model architecture.
• Introduction :
• Problem: Models that can leverage linguistic information from unlabeled data provide a valuable alternative
to gathering more annotation, which can be time-consuming and expensive
• Issue 1: It is unclear what type of optimization objectives are most effective at learning text
representations that are useful for transfer.
• Issue 2: There is no consensus on the most effective way to transfer these learned representations to
the target task.
• Motivation :
• we explore a semi-supervised approach for language understanding tasks using a combination of
unsupervised pre-training and supervised fine-tuning.
• Contribution:
• Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Proposed Method
• Our training procedure consists of two stages.
• The first stage is learning a high-capacity language model on a large corpus of
text. This is followed by a fine-tuning stage, where we adapt the model to a
discriminative task with labeled data
• Unsupervised pre-training
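For reference, the pre-training objective and forward pass from the paper, written out in LaTeX (U is the unlabeled token corpus, k the context window, W_e the token embedding matrix, W_p the position embedding matrix, n the number of layers):

L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)

h_0 = U W_e + W_p, \qquad h_l = \mathrm{transformer\_block}(h_{l-1}) \ \ \forall l \in [1, n], \qquad P(u) = \mathrm{softmax}(h_n W_e^{\top})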
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Proposed Method
• Our training procedure consists of two stages.
• The first stage is learning a high-capacity language model on a large corpus of
text. This is followed by a fine-tuning stage, where we adapt the model to a
discriminative task with labeled data
• Supervised fine-tuning
Left section: Transformer architecture and training objectives
Right section: Input transformations for fine-tuning on different tasks
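For reference, the supervised fine-tuning objective from the paper, written out in LaTeX (C is the labeled dataset, h_l^m the final transformer activation at the last input token, W_y the added linear output layer); the auxiliary LM term reuses the pre-training objective L_1 on C with weight \lambda:

P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)

L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m), \qquad L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})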
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Experiment
• We evaluate our approach on four types of language understanding tasks
• E.g. natural language inference, question answering, semantic similarity, and text classification.
• Experiments reported with five evaluation measures
• Comparison against state-of-the-art methods
• Analysis
• Impact of number of layers transferred
• Effect of transferring an increasing number of layers from the pre-trained language model
• Plot showing the evolution of zero-shot performance on different tasks as a function of LM pre-training updates.
• Zero-shot Behaviors
• We’d like to better understand why language model pre-training of transformers is effective
• Zero-shot := learning useful pattern recognition even with little or no training data
• Ablation studies
• we examine the performance of our method without the auxiliary LM objective during fine-tuning
• we analyze the effect of the Transformer by comparing it with a single layer 2048 unit LSTM using the same framework
• we also compare with our transformer architecture directly trained on supervised target tasks, without pre-training.
• Conclusion
• We introduced a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-
training and discriminative fine-tuning
• By pre-training on a diverse corpus with long stretches of contiguous text, our model acquires significant world knowledge and the ability
to process long-range dependencies, which are then successfully transferred to solving discriminative tasks such as question answering,
semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets studied
GPT-2
Language Models are Unsupervised Multitask
Learners, (A. Radford et al., 2019)
• Goal:
• We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new
dataset of millions of webpages called WebText
• Introduction
• Problem: Our suspicion is that the prevalence of single-task training on single-domain datasets is a major contributor to the
lack of generalization observed in language model systems
• Motivation
• Effect of conditioning generation on the task
• Existing: a single-task system models p(output | input)
• Adding condition: a general multitask system should model p(output | input, task), with the task specified in the input text (formalized after this slide)
• Connection to multitask learning:
• the global minimum of the unsupervised objective is also the global minimum of the supervised objective
• so language modeling on sufficiently varied text amounts to performing unsupervised multitask learning
• Contribution
• We demonstrate that language models begin to learn these tasks without any explicit supervision
• When conditioned on a document plus questions, the language model generates answers without using the task's training examples
• GPT-2 is a 1.5B-parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets
in a zero-shot setting.
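The task-conditioning idea behind the motivation above, written out:

p(\text{output} \mid \text{input}) \;\longrightarrow\; p(\text{output} \mid \text{input}, \text{task})

In GPT-2 the task is specified in natural language as part of the input sequence itself; the paper's translation example can be written as (translate to french, english text, french text).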
Language Models are Unsupervised Multitask
Learners, (A. Radford et al., 2019)
• Proposed Method
• Training dataset
• Our approach motivates building as large and diverse a dataset as possible in order to
collect natural language demonstrations of tasks in as varied of domains and contexts as
possible
• Input representation
• A byte-level BPE vocabulary, combining the empirical benefits of word-level LMs with the generality of byte-level
approaches
• Model
• Largely follows GPT-1, with layer normalization moved to the input of each sub-block, an additional layer norm after the final
block, and residual (skip) connections with scaled initialization (a minimal sketch follows this slide)
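A minimal sketch of the pre-norm block arrangement described above, assuming a PyTorch implementation; the layer sizes, module names, and the use of nn.MultiheadAttention are illustrative assumptions, not the released GPT-2 code:

import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2-style Transformer block: layer norm is applied at the input of
    each sub-block ("pre-norm"), and the residual (skip) connection adds each
    sub-block's output back onto its input."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Self-attention sub-block: normalize first, then add the residual.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-block, same pre-norm + residual pattern.
        x = x + self.mlp(self.ln2(x))
        return x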
Language Models are Unsupervised Multitask
Learners, (A. Radford et al., 2019)
• Experiment
• Zero-shot (out-of-distribution) language modeling performance of WebText LMs on standard benchmarks
• Performance on different categories of words (named entities, nouns, verbs, prepositions) using the Children's Book Test dataset
• Modeling long-range dependencies using the LAMBADA dataset
• Task
• Reading Comprehension, Summarization, Translation, Question Answering
• Generalization vs Memorization
• Text memorization, model capacity, diversity, robustness (which of these matters most?) (overlap check sketched at the end of this section)
• Discussion
• Much research has been dedicated to learning (Hill et al., 2016), understanding (Levy and
Goldberg, 2014), and critically evaluating (Wieting and Kiela, 2019) the representations of
both supervised and unsupervised pre-training methods
• Conclusion
• GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language
modeling datasets
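To make the Generalization vs Memorization analysis concrete: the paper measures how many 8-grams of each test set also appear in the WebText training data (using Bloom filters over lower-cased, whitespace-normalized 8-grams). A simplified sketch of that kind of overlap check, using exact sets instead of Bloom filters; the tokenization here is an illustrative assumption:

def ngrams(tokens, n=8):
    # Return the set of n-grams (as tuples) in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_percentage(train_docs, test_doc, n=8):
    # Percentage of the test document's n-grams that also occur in the training data.
    # The paper uses Bloom filters for scalability; this exact-set version only
    # illustrates the measurement.
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc.lower().split(), n)
    test_ngrams = ngrams(test_doc.lower().split(), n)
    if not test_ngrams:
        return 0.0
    return 100.0 * len(test_ngrams & train_ngrams) / len(test_ngrams)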
Reference
• A. Radford et al., Improving Language Understanding by Generative
Pre-Training, 2018.
• A. Radford et al., Language Models are Unsupervised Multitask
Learners, 2019.
• T. Brown et al., Language Models are Few-Shot Learners,
arXiv:2005.14165v4, 2020.