Benchmark Comparison of
Large Language Models
On this particular test,
GPT-4 performed the best...
... but let’s see how...
Tested models, tested skillsets, tested domains
Models
GPT-4
GPT-3.5
LLaMA2
Bard
Claude
Vicuna
Alpaca
WizardLM
Tulu
Skillsets
Logical Robustness
Logical Correctness
Logical Efficiency
Factuality
Commonsense Understanding
Comprehension
Insightfulness
Completeness
Metacognition
Readability
Conciseness
Harmlessness
Domains
Language
Culture
Health
History
Natural Science
Math
Social Science
Technology
Coding
Humanities
Results of comparisons I
Results of comparisons II
Group                Open-sourced               Proprietary                Oracle
Skill                Vicuna   Alpaca   LLaMA2   GPT-3.5  Bard     Claude   GPT-4
Logical Robustness   2.29     2.04     2.65     4.00     3.51     3.59     4.25
Logical Correctness  2.61     2.41     2.96     3.83     3.52     3.68     4.25
Logical Efficiency   2.87     2.44     3.09     4.29     3.82     4.13     4.54
Factuality           3.38     2.87     3.60     3.91     3.76     3.89     4.23
Commonsense          3.49     3.13     3.77     4.13     4.02     4.09     4.50
Comprehension        3.55     2.91     3.73     3.97     3.84     4.13     4.34
Insightfulness       3.03     2.35     3.57     3.28     3.43     3.46     3.80
Completeness         3.46     2.62     3.92     3.80     3.92     4.17     4.26
Metacognition        3.69     2.13     3.98     3.74     3.34     3.92     4.33
Readability          4.65     4.43     4.74     4.86     4.68     4.82     4.85
Conciseness          4.36     4.43     3.95     4.57     3.69     4.56     4.69
Harmlessness         4.91     4.26     4.94     4.97     4.79     4.91     4.85
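To check the headline claim against the numbers on this slide, the scores can be loaded into a small script. This is only a sketch: the unweighted mean across the 12 skills is an assumption made here for illustration and is not necessarily how the FLASK authors aggregate results, and the helper names (`mean_per_model`, `top_model_per_skill`) are hypothetical.

```python
from statistics import mean

# Column order as on the slide.
MODELS = ["Vicuna", "Alpaca", "LLaMA2", "GPT-3.5", "Bard", "Claude", "GPT-4"]

# Scores copied from the slide, one row per skill.
SCORES = {
    "Logical Robustness":  [2.29, 2.04, 2.65, 4.00, 3.51, 3.59, 4.25],
    "Logical Correctness": [2.61, 2.41, 2.96, 3.83, 3.52, 3.68, 4.25],
    "Logical Efficiency":  [2.87, 2.44, 3.09, 4.29, 3.82, 4.13, 4.54],
    "Factuality":          [3.38, 2.87, 3.60, 3.91, 3.76, 3.89, 4.23],
    "Commonsense":         [3.49, 3.13, 3.77, 4.13, 4.02, 4.09, 4.50],
    "Comprehension":       [3.55, 2.91, 3.73, 3.97, 3.84, 4.13, 4.34],
    "Insightfulness":      [3.03, 2.35, 3.57, 3.28, 3.43, 3.46, 3.80],
    "Completeness":        [3.46, 2.62, 3.92, 3.80, 3.92, 4.17, 4.26],
    "Metacognition":       [3.69, 2.13, 3.98, 3.74, 3.34, 3.92, 4.33],
    "Readability":         [4.65, 4.43, 4.74, 4.86, 4.68, 4.82, 4.85],
    "Conciseness":         [4.36, 4.43, 3.95, 4.57, 3.69, 4.56, 4.69],
    "Harmlessness":        [4.91, 4.26, 4.94, 4.97, 4.79, 4.91, 4.85],
}


def mean_per_model(scores, models):
    """Unweighted mean over all skills for each model (an illustrative choice)."""
    columns = zip(*scores.values())  # transpose rows (skills) into columns (models)
    return {model: mean(col) for model, col in zip(models, columns)}


def top_model_per_skill(scores, models):
    """Name of the highest-scoring model for each skill."""
    return {skill: models[row.index(max(row))] for skill, row in scores.items()}


averages = mean_per_model(SCORES, MODELS)
winners = top_model_per_skill(SCORES, MODELS)

best_overall = max(averages, key=averages.get)
# Skills where some model other than GPT-4 has the top score.
not_gpt4 = {skill: model for skill, model in winners.items() if model != "GPT-4"}

print("Best by unweighted mean:", best_overall)
print("Skills not led by GPT-4:", not_gpt4)
```

Under this simple aggregation GPT-4 comes out on top overall, while GPT-3.5 narrowly leads on Readability and Harmlessness.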
Source
Submitted (non-reviewed) paper
Ye, Seonghyeon, et al. "FLASK: Fine-grained Language Model Evaluation based on
Alignment Skill Sets." arXiv preprint arXiv:2307.10928 (2023).
Web sources
https://github.com/kaistAI/FLASK
CC BY 4.0, Matej Varga