Paper review summary
Made by Seoung-Ho Choi
Content
• GPT-1 (Improving Language Understanding by Generative Pre-Training)
• GPT-2 (Language Models are Unsupervised Multitask Learners)
• GPT-3 (Language Models are Few-Shot Learners)
GPT-1
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Goal:
• We demonstrate that large gains on these tasks can be realized by generative pre-training of a language
model on a diverse corpus of unlabeled text.
• In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to
achieve effective transfer while requiring minimal changes to the model architecture.
• Introduction :
• Problem: Models that can leverage linguistic information from unlabeled data provide a valuable alternative
to gathering more annotation, which can be time-consuming and expensive
• Issue 1: It is unclear what type of optimization objectives are most effective at learning text
representations that are useful for transfer.
• Issue 2: There is no consensus on the most effective way to transfer these learned representations to
the target task.
• Motivation :
• we explore a semi-supervised approach for language understanding tasks using a combination of
unsupervised pre-training and supervised fine-tuning.
• Contribution:
• Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Proposed Method
• Our training procedure consists of two stages.
• The first stage is learning a high-capacity language model on a large corpus of
text. This is followed by a fine-tuning stage, where we adapt the model to a
discriminative task with labeled data
• Unsupervised pre-training
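For reference, the pre-training objective and forward pass from the paper, written out in LaTeX (U is the unlabeled token corpus, k the context window, W_e the token embedding matrix, W_p the position embedding matrix, n the number of layers):

L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)

h_0 = U W_e + W_p, \qquad h_l = \mathrm{transformer\_block}(h_{l-1}) \ \ \forall l \in [1, n], \qquad P(u) = \mathrm{softmax}(h_n W_e^{\top})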
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Proposed Method
• Our training procedure consists of two stages.
• The first stage is learning a high-capacity language model on a large corpus of
text. This is followed by a fine-tuning stage, where we adapt the model to a
discriminative task with labeled data
• Supervised fine-tuning
Left section: Transformer architecture and training objectives
Right section: Input transformations for fine-tuning on different tasks
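For reference, the supervised fine-tuning objective from the paper, written out in LaTeX (C is the labeled dataset, h_l^m the final transformer activation at the last input token, W_y the added linear output layer); the auxiliary LM term reuses the pre-training objective L_1 on C with weight \lambda:

P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)

L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m), \qquad L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})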
Improving Language Understanding by Generative
Pre-Training, (A. Radford et al., 2018)
• Experiment
• We evaluate our approach on four types of language understanding tasks
• E.g. natural language inference, question answering, semantic similarity, and text classification.
• Experiments reported with five evaluation measures
• Comparison against state-of-the-art methods
• Analysis
• Impact of number of layers transferred
• Effect of transferring an increasing number of layers from the pre-trained language model
• Plot showing the evolution of zero-shot performance on different tasks as a function of LM pre-training updates.
• Zero-shot Behaviors
• We’d like to better understand why language model pre-training of transformers is effective
• Zero-shot := learning useful pattern recognition even with little or no training data
• Ablation studies
• we examine the performance of our method without the auxiliary LM objective during fine-tuning
• we analyze the effect of the Transformer by comparing it with a single layer 2048 unit LSTM using the same framework
• we also compare with our transformer architecture directly trained on supervised target tasks, without pre-training.
• Conclusion
• We introduced a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-
training and discriminative fine-tuning
• By pre-training on a diverse corpus with long stretches of contiguous text, our model acquires significant world knowledge and the ability
to process long-range dependencies, which are then successfully transferred to solving discriminative tasks such as question answering,
semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets studied
GPT-2
Language Models are Unsupervised Multitask
Learners, (A. Radford et al., 2019)
• Goal:
• We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new
dataset of millions of webpages called WebText
• Introduction
• Problem: Our suspicion is that the prevalence of single-task training on single-domain datasets is a major contributor to the
lack of generalization observed in language model systems
• Motivation
• Effect of conditioning generation on the task
• Existing: a single-task system models p(output | input)
• Adding condition: a general multitask system should model p(output | input, task), with the task specified in the input text (formalized after this slide)
• Connection to multitask learning:
• the global minimum of the unsupervised objective is also the global minimum of the supervised objective
• so language modeling on sufficiently varied text amounts to performing unsupervised multitask learning
• Contribution
• We demonstrate that language models begin to learn these tasks without any explicit supervision
• When conditioned on a document plus questions, the language model generates answers without using the task's training examples
• GPT-2 is a 1.5B-parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets
in a zero-shot setting.
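The task-conditioning idea behind the motivation above, written out:

p(\text{output} \mid \text{input}) \;\longrightarrow\; p(\text{output} \mid \text{input}, \text{task})

In GPT-2 the task is specified in natural language as part of the input sequence itself; the paper's translation example can be written as (translate to french, english text, french text).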
Language Models are Unsupervised Multitask
Learners, (A. Radford et al., 2019)
• Proposed Method
• Training dataset
• Our approach motivates building as large and diverse a dataset as possible in order to
collect natural language demonstrations of tasks in as varied of domains and contexts as
possible
• Input representation
• A byte-level BPE vocabulary, combining the empirical benefits of word-level LMs with the generality of byte-level
approaches
• Model
• Largely follows GPT-1, with layer normalization moved to the input of each sub-block, an additional layer norm after the final
block, and residual (skip) connections with scaled initialization (a minimal sketch follows this slide)
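A minimal sketch of the pre-norm block arrangement described above, assuming a PyTorch implementation; the layer sizes, module names, and the use of nn.MultiheadAttention are illustrative assumptions, not the released GPT-2 code:

import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2-style Transformer block: layer norm is applied at the input of
    each sub-block ("pre-norm"), and the residual (skip) connection adds each
    sub-block's output back onto its input."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Self-attention sub-block: normalize first, then add the residual.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-block, same pre-norm + residual pattern.
        x = x + self.mlp(self.ln2(x))
        return x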
Language Models are Unsupervised Multitask
Learners, (A. Radford et al., 2019)
• Experiment
• Zero-shot (out-of-distribution) language modeling performance of WebText LMs on standard benchmarks
• Performance on different categories of words (named entities, nouns, verbs, prepositions) using the Children's Book Test dataset
• Modeling long-range dependencies using the LAMBADA dataset
• Task
• Reading Comprehension, Summarization, Translation, Question Answering
• Generalization vs Memorization
• Text memorization, model capacity, diversity, robustness (which of these matters most?) (overlap check sketched at the end of this section)
• Discussion
• Much research has been dedicated to learning (Hill et al., 2016), understanding (Levy and
Goldberg, 2014), and critically evaluating (Wieting and Kiela, 2019) the representations of
both supervised and unsupervised pre-training methods
• Conclusion
• GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language
modeling datasets
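To make the Generalization vs Memorization analysis concrete: the paper measures how many 8-grams of each test set also appear in the WebText training data (using Bloom filters over lower-cased, whitespace-normalized 8-grams). A simplified sketch of that kind of overlap check, using exact sets instead of Bloom filters; the tokenization here is an illustrative assumption:

def ngrams(tokens, n=8):
    # Return the set of n-grams (as tuples) in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_percentage(train_docs, test_doc, n=8):
    # Percentage of the test document's n-grams that also occur in the training data.
    # The paper uses Bloom filters for scalability; this exact-set version only
    # illustrates the measurement.
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc.lower().split(), n)
    test_ngrams = ngrams(test_doc.lower().split(), n)
    if not test_ngrams:
        return 0.0
    return 100.0 * len(test_ngrams & train_ngrams) / len(test_ngrams)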
Reference
• A. Radford et al., Improving Language Understanding by Generative
Pre-Training, 2018.
• A. Radford et al., Language Models are Unsupervised Multitask
Learners, 2019.
• T. Brown et al., Language Models are Few-Shot Learners,
arXiv:2005.14165v4, 2020.