GPT-3: Language Models are Few-Shot Learners
ALMA MATER STUDIORUM UNIVERSITY OF BOLOGNA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING – DISI
Luca Ragazzi, Giacomo Frisoni, Lorenzo Valgimigli
PhD Students, XXXVI Cycle
Department of Computer Science and Engineering – DISI
University of Bologna, Cesena, Italy
l.ragazzi@unibo.it, giacomo.frisoni@unibo.it, lorenzo.valgimigli@unibo.it
"Neural architectures: from the McCulloch-Pitts model to GPT-3" Presentation
October 29th, 2021
GPT-3: Language Models are Few-Shot Learners 2
Overview of GPT-3
• Generative Pre-trained Transformer – 3
• Introduced by OpenAI in May 2020
• The largest neural network ever created at the time
• Philosophy: the bigger, the better
What is the motivation behind it?
GPT-3: Language Models are Few-Shot Learners 3
Pre-trained language models
• Current state-of-the-art models in NLP
• Trained via semi-supervised learning on large corpora
• Both inputs (X) and targets (Y) are extracted from the text itself, with no prior labeled dataset
(see the sketch below)
• Acquire a strong capability for modeling natural language (with a task-agnostic architecture)
• Limitation (i): need for downstream task-specific datasets and fine-tuning
– Difficult to collect large supervised training datasets for every new task
• Limitation (ii): mismatch with how humans learn
– Humans do not require large supervised datasets to learn most language tasks; brief
directives or a handful of demonstrations are enough
– So, why give models a large dataset of labeled examples for every new task?
Why not try to create NLP systems with the same fluidity and
generality as humans?
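As a rough illustration of this self-supervised setup, the toy Python sketch below derives (X, Y) pairs directly from raw text. The `make_lm_examples` helper and the whitespace tokenizer are made up for this example; real models use learned subword tokenizers.

```python
# Illustrative sketch (not the authors' code): how (X, Y) training pairs for a
# language model can be derived from raw text alone, with no human-labeled dataset.
# A toy whitespace tokenizer is used here purely for readability.

def make_lm_examples(text, context_size=4):
    """Yield (X, Y) pairs where Y is the token that follows the context X."""
    tokens = text.split()
    for i in range(1, len(tokens)):
        x = tokens[max(0, i - context_size):i]  # left context (the input X)
        y = tokens[i]                           # next token (the target Y)
        yield x, y

for x, y in make_lm_examples("the model learns to predict the next word"):
    print(x, "->", y)
```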
GPT-3: Language Models are Few-Shot Learners 4
Solution: more parameters and in-context learning
• Let models develop a broad set of skills and pattern-recognition abilities during
pre-training and use them at inference time to adapt rapidly to the desired task
• Since in-context learning involves absorbing many skills and tasks within the
model's parameters, it is plausible that in-context learning ability correlates with model size
(a prompt-only example is sketched below)
OpenAI created GPT-3 to show that very large unsupervised
language models trained on vast amounts of data can multitask at
the level of fine-tuned state-of-the-art models
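A minimal illustration of in-context adaptation, using a made-up word-reversal task: the task is specified entirely by the demonstrations inside the prompt, and no model weights are updated.

```python
# The "task" (reverse each word) is defined only by the demonstrations in the prompt.
# A sufficiently capable language model is expected to infer the pattern at inference
# time; its parameters stay frozen throughout.

prompt = (
    "hello => olleh\n"
    "world => dlrow\n"
    "python =>"
)
print(prompt)  # fed to the model as-is; the hoped-for continuation is "nohtyp"
```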
GPT-3: Language Models are Few-Shot Learners 5
Model Architecture and Training Process
• The GPT-3 model architecture is essentially the same as that of its GPT-2 predecessor
– Transformer-based, built using only decoder blocks (the opposite of BERT, which uses only encoder blocks)
– Stronger at natural language generation (NLG) than at producing contextual embeddings
• An auto-regressive language model
– GPT-3 is trained on next-word prediction, outputting one token (subword) at a time
– Unlike bidirectional models such as BERT, the prediction at each step is conditioned only
on the left context (masked self-attention; sketched below)
• From an architecture perspective, GPT-3 is not actually very novel!
– … So, what makes it so special and magical? It’s really big
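The NumPy sketch below (toy dimensions, not GPT-3's actual configuration) shows how a causal, lower-triangular mask restricts each position to its left context, which is what makes the model auto-regressive.

```python
# Minimal sketch of the causal mask behind masked self-attention:
# position i may attend only to positions <= i, i.e., the left context.

import numpy as np

def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a lower-triangular (causal) mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (seq, seq) similarities
    mask = np.tril(np.ones_like(scores, dtype=bool))   # True only for left context
    scores = np.where(mask, scores, -1e9)              # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over allowed positions
    return weights @ v

x = np.random.randn(5, 8)             # 5 tokens, 8-dimensional embeddings (toy values)
out = causal_self_attention(x, x, x)  # row i depends only on rows 0..i
print(out.shape)                      # (5, 8)
```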
GPT-3: Language Models are Few-Shot Learners 6
Trained Models
• More layers, wider layers, and more data to train on
– GPT-3 comes in eight sizes, ranging from 125M to 175B parameters
– GPT-3 175B (referenced by default) → 470x BERT-Large (345M), 117x GPT-2-Large (1.5B),
and 10x the previous record holder, Turing-NLG
– The largest model ever created (at the time of writing), with 96 attention layers, each with
96 heads of dimension 128, and a 3.2M-token batch size 😱
• “With great power (size) comes great responsibility (cost)” 🦸💰
– A single training run costs over $4.6M on Tesla V100 cloud instances (3.14E23 required
FLOPs at 28 TFLOPS of hardware capacity ≈ 355 GPU-years; see the back-of-the-envelope check below)
– Time is not the only enemy: GPT-3 needs 700GB of memory to store its FP32 parameters (4 bytes
each), while the largest single-GPU memory at the time was 48GB (Quadro RTX 8000)
– OpenAI used model parallelism on a high-bandwidth cluster of V100 GPUs provided by Microsoft
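A quick back-of-the-envelope reproduction of the memory and compute figures above, using nothing but the rounded values quoted on this slide:

```python
# Back-of-the-envelope check of the figures on this slide (all values rounded).

params = 175e9              # GPT-3 parameters
fp32_bytes = 4              # bytes per FP32 parameter
print(f"FP32 weights: {params * fp32_bytes / 1e9:.0f} GB")  # ~700 GB (>> 48 GB per GPU)

train_flops = 3.14e23       # estimated total training compute
v100_flops = 28e12          # ~28 TFLOPS sustained on a V100
gpu_seconds = train_flops / v100_flops
years = gpu_seconds / (3600 * 24 * 365)
print(f"Single-GPU time: {years:.1f} GPU-years")  # ~355, as quoted above
```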
GPT-3: Language Models are Few-Shot Learners 7
Training Datasets
• Extensive training on massive unlabeled text datasets (300B tokens in total)
– Since neural networks are a compressed/compiled version of their training data, the size of the
dataset should scale accordingly with the size of the model
– The authors mainly use Common Crawl, a crawl of over 50B web pages (filtered down for quality)
• GPT-3 has a lower data compression ratio than GPT-2
– 300B/175B ≈ 1.71 (GPT-3) vs 10B/1.5B ≈ 6.67 (GPT-2); see the quick computation below… This raises the question: “Is it only a big memory?”
570GB of compressed plaintext (45TB before filtering)
Note: GPT-2 was trained on 40GB of Internet text (10B tokens)
Note: a filtering bug caused some overlaps between the training data and the dev/test sets
to be ignored, but training costs made re-training unfeasible
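The tokens-per-parameter ratios quoted above, spelled out as a tiny calculation:

```python
# The data "compression" ratios from this slide, computed explicitly.

ratios = {
    "GPT-3": 300e9 / 175e9,  # 300B training tokens / 175B parameters
    "GPT-2": 10e9 / 1.5e9,   # 10B training tokens / 1.5B parameters
}
for model, r in ratios.items():
    print(f"{model}: {r:.2f} tokens per parameter")
# GPT-3: 1.71, GPT-2: 6.67 -> GPT-3 sees far fewer tokens per parameter
```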
GPT-3: Language Models are Few-Shot Learners 8
Zero-, one-, few-shot vs fine-tuning
• GPT-3 can perform specific tasks without any special tuning 🧙 🔮
– Most other pre-trained language models require elaborate fine-tuning (sometimes with
architectural changes) on thousands of samples to perform well on downstream tasks
– GPT-3 doesn’t need a fine-tuning step and directly uses a single pre-trained model for
all downstream tasks (plug-and-play 🔌), in some cases with even superior performance
• Three evaluation settings focus on task-agnostic performance, allowing zero, one,
or a few examples to be prefixed to the model input (see the prompt sketch below)
Figure annotations: fine-tuning (repeated gradient updates using a large corpus of
example tasks) → postponed, vs. (i) zero-shot, (ii) one-shot, and (iii) few-shot,
where the context better informs the model about what it is expected to do.
“If I were to see this text somewhere on the Web, what would be the most likely next word?”
GPT-3: Language Models are Few-Shot Learners 9
Results - i
• Different sizes of GPT-3 were tested on different benchmarks across various tasks
(e.g., question answering, translation, summarization) to study the generalization
ability of such large models. In particular, each model was evaluated in three settings:
• zero-shot learning
• one-shot learning
• few-shot learning
• In every case, increasing the number of parameters made stronger generalization
capabilities emerge
GPT-3: Language Models are Few-Shot Learners 10
Results - ii
• GPT-3, in its largest version, was compared to the SOTA solutions on different
datasets
• LAMBADA, StoryCloze, HellaSwag, TriviaQA, translation benchmarks (BLEU), ...
• In many cases, it matches or outperforms the previous SOTA, i.e., neural models
fine-tuned on the target dataset
GPT-3: Language Models are Few-Shot Learners 11
Limits
• Despite its great improvements, GPT-3 still has notable weaknesses: it has
difficulty with common-sense physics and with long text generation
• Loss of coherence
• Contradictions
• Useless semantic repetitions
• Large language models are not grounded in other domains of experience, such as video
or real-world physical interaction, so they lack a large amount of context
• Although GPT-3 works well after pre-training alone, it is still far from human level
• Humans show strong zero-shot capabilities
• For now, it is impossible to say how GPT-3 learns during training
• It is even harder to understand what it learns at inference time
• Does it learn the new task from scratch? Does it reshape similar tasks it has
already learned?
• Finally, it shares some limitations common to most deep learning models
• The learned knowledge is not interpretable
• It requires large resources and a long time to train
• It is strongly affected by biases in the data
GPT-3: Language Models are Few-Shot Learners 12
Ethical Concerns - i
• GPT-3 can be misused in dangerous
ways
• fake news generation, phishing,
fraudulent academic essays
• It is affected by biases in the data on
different topics
• Gender
• Race
• Religion
GPT-3: Language Models are Few-Shot Learners 13
Ethical Concerns - ii
• The energy consumption of such a large model is a problem that needs to be
highlighted
– GPT-3 consumed several thousand petaflop/s-days during pre-training, while
GPT-2 consumed tens of petaflop/s-days
• It is important to consider how these
resources are amortized over the
lifecycle of the model
• It consumes significant resources during
training, but it is surprisingly efficient once
trained
• The full GPT-3 model can generate 100
pages of content from the trained model at
a cost of about 0.4 kWh of energy
GPT-3: Language Models are Few-Shot Learners 14
Thanks for the attention
(is all you need)