Measuring Massive Multitask
Language Understanding
San Kim
2021.04.14
Dan Hendrycks(1), Collin Burns(2), Steven Basart(3), Andy Zou(1), Mantas Mazeika(4),
Dawn Song(1), Jacob Steinhardt(1)
1. UC Berkeley
2. Columbia University
3. UChicago
4. UIUC
Language Models as Knowledge Bases? FAIR, UCL
1. Without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that
have some access to oracle knowledge.
2. BERT also does remarkably well on open-domain question answering against a supervised baseline.
3. Factual knowledge can be recovered surprisingly well from pretrained language models; however, for some
relations (particularly N-to-M relations) performance is very poor.
4. BERT-large consistently outperforms other language models in recovering factual and commonsense
knowledge while at the same time being more robust to the phrasing of a query.
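A minimal sketch of the cloze-style probing behind these findings (not the LAMA authors' exact code): a relational fact is queried by masking the object of a template sentence and reading off BERT's top fill-in predictions, with no fine-tuning. The Hugging Face fill-mask pipeline stands in for the paper's probing harness; the Dante template is the classic illustrative example.

```python
# Cloze-probe sketch: query a pretrained masked LM for a fact by masking
# the object of a template sentence. bert-base-uncased stands in for the
# models compared in the paper.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Probe for the (Dante, born-in, Florence) relation.
for pred in fill("Dante was born in [MASK].", top_k=3):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```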
Language Models as Knowledge Bases?
Language Models are Unsupervised Multitask Learners
OpenAI
Language Models are Unsupervised Multitask Learners
GPT3
There is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong
performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds
of thousands of examples specific to that task.
• the need for a large dataset of labeled examples for every new task limits the applicability of
language models.
• the potential to exploit spurious correlations in training data fundamentally grows with the
expressiveness of the model and the narrowness of the training distribution.
• humans do not require large supervised datasets to learn most language tasks - a brief directive
in natural language or at most a tiny number of demonstrations is often sufficient to enable a
human to perform a new task.
Pre-trained transformer language models
Motivation
OpenAI
GPT3
GPT3
GPT3
T5 Google
UnifiedQA
AI2, University of Washington
UnifiedQA
UnifiedQA
Measuring Massive Multitask Language Understanding
Linguistics
Commonsense
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding
Transformer models have driven this recent progress by pretraining on massive text corpora,
including all of Wikipedia, thousands of books, and numerous websites. These models
consequently see extensive information about specialized topics, most of which is not assessed by
existing NLP benchmarks.
A new benchmark for assessing models across a diverse set of subjects that humans learn:
• zero-shot and few-shot settings
• 57 subjects across STEM, the humanities, the social sciences, and more
• difficulty from an elementary level to an advanced professional level
• it tests both world knowledge and problem solving ability
• Mathematics, history, law, ethics …
• The granularity and breadth of the subjects make the benchmark ideal for identifying a
model’s blind spots.
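To make the zero-shot and few-shot settings concrete, here is a minimal sketch of how a k-shot prompt can be assembled for one of the 57 subjects. The header wording and the example fields ("question", "choices", "answer") follow the general pattern of the paper's evaluation but are assumptions, not its exact schema.

```python
# Few-shot prompt construction sketch for a four-way multiple-choice subject.
# Example dicts are assumed to look like:
#   {"question": str, "choices": [str, str, str, str], "answer": "A".."D"}
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_example(example, include_answer=True):
    """Render one question with lettered choices and an 'Answer:' line."""
    lines = [example["question"]]
    lines += [f"{l}. {c}" for l, c in zip(CHOICE_LABELS, example["choices"])]
    lines.append(f"Answer: {example['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_example, k=5):
    """k solved dev examples (k=0 gives the zero-shot prompt), then the
    unanswered test question."""
    parts = [f"The following are multiple choice questions "
             f"(with answers) about {subject}."]
    parts += [format_example(e) for e in dev_examples[:k]]
    parts.append(format_example(test_example, include_answer=False))
    return "\n\n".join(parts)
```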
Measuring Massive Multitask Language Understanding
Recent models learn enough information from pretraining that they can serve as knowledge bases.
However, no prior work has comprehensively measured the knowledge models have across many
real-world domains.
Humanities
• Law, philosophy, history, …
• Legal understanding: how to apply rules and standards to complex scenarios; understanding and
following rules and regulations (a necessary capability for constraining open-world machines)
• Philosophy: logical fallacies, formal logic, famous philosophical arguments
• Ethics: test a model’s understanding of normative statements through predicting widespread
moral intuitions
• History: covers a wide range of time periods and geographical locations, including prehistory
and other advanced subjects
Measuring Massive Multitask Language Understanding
Social Science
• Economics, sociology, politics, geography, psychology, …
• Economics: microeconomics, macroeconomics, econometrics; covers different types of problems
that require a mixture of world knowledge, qualitative reasoning, or quantitative reasoning
STEM
• Physics, computer science, mathematics, …
• Conceptual physics (harder version of the physical commonsense benchmark Physical IQA)
• College mathematics questions (like those found on the GRE mathematics test; written in LaTeX)
• STEM subjects require knowledge of empirical methods, fluid intelligence, and procedural
knowledge
Others
• Professional Medicine task, finance, accounting, marketing, and knowledge of global facts (e.g.,
statistics about poverty in different countries over time)
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding
GPT-3 (few-shot, zero-shot setting)
UnifiedQA (without any further tuning to assess its transfer accuracy)
RoBERTa-base, ALBERT-xxlarge, GPT-2 (fine-tuned on UnifiedQA training data and the dev+val sets)
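A minimal sketch (not the authors' exact harness) of how accuracy is typically scored in this setting: for each question, read out the model's log-probability of each answer letter as the continuation after "Answer:", and predict the argmax letter. How the log-probabilities are obtained (API logprobs, a local forward pass) is model-specific and left abstract here.

```python
def predict(choice_logprobs):
    """choice_logprobs: dict like {"A": -1.2, "B": -0.3, ...} giving the
    model's log-probability of each letter following the prompt."""
    return max(choice_logprobs, key=choice_logprobs.get)

def accuracy(per_question_logprobs, gold_answers):
    """Fraction of questions where the argmax letter matches the key."""
    preds = [predict(lp) for lp in per_question_logprobs]
    return sum(p == g for p, g in zip(preds, gold_answers)) / len(gold_answers)

# Toy example: two questions, gold answers "B" and "A".
lps = [{"A": -2.0, "B": -0.4, "C": -3.1, "D": -2.7},
       {"A": -0.9, "B": -1.1, "C": -2.5, "D": -2.2}]
print(accuracy(lps, ["B", "A"]))  # 1.0
```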
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding
Since language models train on vast text corpora, there is some chance that they have seen the exact
question and answer during pretraining. If they memorized the exact question and answer, then they would
attain higher accuracy than their true ability warrants. Likewise, a question’s entropy would be especially low if it were
memorized.
We also note that most of our questions came from PDFs or websites where questions and answers are on
separate pages. This suggests that our exact questions were not memorized. However, during pretraining
models encountered text related to our questions through processing Wikipedia.
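The entropy argument can be made concrete: if a model had seen a question verbatim during pretraining, its per-token entropy (negative log-likelihood) on that question should be unusually low relative to comparable text. A small sketch using the Hugging Face transformers API; GPT-2 stands in for the models studied, and the probe text is illustrative.

```python
# Memorization-probe sketch: average per-token negative log-likelihood
# (in nats) of a question under a causal LM. Unusually low values,
# relative to same-length text, would hint at verbatim memorization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def mean_nll(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy loss over the (shifted) tokens.
    return model(ids, labels=ids).loss.item()

print(mean_nll("Which of the following is a monotheistic religion?"))
```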
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding
• Models perform poorly on highly procedural problems.
• Calculation-heavy STEM subjects tend to have low accuracy compared to verbal subjects.
• For GPT-3, 9 out of the 10 lowest-accuracy tasks are STEM subjects (poor performance on
Elementary Mathematics and many other STEM subjects with “plug and chug” problems).
The tasks with near-random accuracy include calculation-heavy subjects such as physics and
mathematics and subjects related to human values such as law and morality.
Worryingly, we also find that GPT-3 does not have an accurate sense of what it does or does not
know since its average confidence can be up to 24% off from its actual accuracy.
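The 24% figure is a confidence-accuracy gap; a minimal sketch of how such a gap can be computed per task follows (the paper's exact calibration metric may differ).

```python
# Calibration sketch: compare a model's average confidence in its chosen
# answers with its actual accuracy on one task. A well-calibrated model
# has a gap near zero; GPT-3's gap reaches up to 24 points on some tasks.
def confidence_accuracy_gap(confidences, correct_flags):
    """confidences: probability assigned to the chosen answer per question.
    correct_flags: 1 if that answer was right, else 0."""
    avg_conf = sum(confidences) / len(confidences)
    acc = sum(correct_flags) / len(correct_flags)
    return abs(avg_conf - acc)

# Toy numbers: 80% average confidence but 50% accuracy -> 30-point gap.
print(confidence_accuracy_gap([0.8, 0.9, 0.7, 0.8], [1, 0, 1, 0]))
```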
• Multimodal Understanding: While text is capable of conveying an enormous number of
concepts about the world, many important concepts are conveyed mainly through other
modalities, such as images, audio, and physical interaction. … One such benchmark could be a
“Turk Test,” consisting of Amazon Mechanical Turk Human Intelligence Tasks.
• The Internet as a Training Set: A major distinction between our benchmark and previous
multitask NLP benchmarks is that we do not require large training sets. Instead, we assume that
models have acquired the requisite knowledge from reading vast quantities of diverse text from
the internet.
Measuring Massive Multitask Language Understanding
• Model Limitation: Models do not match expert-level performance (90%) on any subject, so
performance on every subject is subhuman. On average, models are only now starting to move beyond
random-chance accuracy levels. Addressing these shortcomings may be challenging. To
illustrate this, we attempted to create a better Professional Law model by pretraining on
specialized data but achieved only limited success. We collected approximately 2,000 additional
Professional Law training examples. After fine-tuning a RoBERTa-base model (Liu et al., 2019)
using this custom training set, our model attained 32.8% test accuracy. To test the impact of
additional specialized training data, we also had RoBERTa continue pretraining on approximately
1.6 million legal case summaries from Harvard’s Law Library case law corpus (case.law), but after
fine-tuning it only attained 36.1% accuracy. This suggests that while additional pretraining on
relevant high quality text can help, it may not be enough to substantially increase the
performance of current models. It is unclear whether simply scaling up existing language
models will solve the test. Current understanding indicates that a 10× increase in model size
must be accompanied by an approximate 5× increase in data (Kaplan et al., 2020). Aside from
the tremendous expense in creating multi-trillion parameter language models, data may also
become a bottleneck, as there is far less written about esoteric branches of knowledge than
about everyday situations.
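As a rough sanity check on the 10× model / ~5× data figure: Kaplan et al. (2020) report that dataset size should grow as a power of model size, approximately D ∝ N^0.74; taking that exponent as the operative assumption:

```latex
% Scaling arithmetic, assuming the Kaplan et al. (2020) data-scaling
% exponent D \propto N^{0.74}:
\[
\frac{D_2}{D_1} = \left(\frac{N_2}{N_1}\right)^{0.74} = 10^{0.74} \approx 5.5
\]
% i.e., a 10x larger model calls for roughly 5x more training data.
```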
Measuring Massive Multitask Language Understanding
Measuring Massive Multitask Language Understanding