Benchmark comparison of Large Language Models

On this particular test (FLASK), GPT-4 scored highest on most of the evaluated skills.

Tested models, tested skillsets, tested domains
Models
GPT-4
GPT-3.5
LLaMA2
Bard
Claude
Vicuna
Alpaca
WizardLM
Tulu
Skillsets
Logical Robustness
Logical Correctness
Logical Efficiency
Factuality
Commonsense Understanding
Comprehension
Insightfulness
Completeness
Metacognition
Readability
Conciseness
Harmlessness
Domains
Language
Culture
Health
History
Natural Science
Math
Social Science
Technology
Coding
Humanities

Results of comparisons II
                      Open-sourced               Proprietary                Oracle
Skill                 Vicuna   Alpaca   LLaMA2   GPT-3.5  Bard     Claude   GPT-4
Logical Robustness    2.29     2.04     2.65     4.00     3.51     3.59     4.25
Logical Correctness   2.61     2.41     2.96     3.83     3.52     3.68     4.25
Logical Efficiency    2.87     2.44     3.09     4.29     3.82     4.13     4.54
Factuality            3.38     2.87     3.60     3.91     3.76     3.89     4.23
Commonsense           3.49     3.13     3.77     4.13     4.02     4.09     4.50
Comprehension         3.55     2.91     3.73     3.97     3.84     4.13     4.34
Insightfulness        3.03     2.35     3.57     3.28     3.43     3.46     3.80
Completeness          3.46     2.62     3.92     3.80     3.92     4.17     4.26
Metacognition         3.69     2.13     3.98     3.74     3.34     3.92     4.33
Readability           4.65     4.43     4.74     4.86     4.68     4.82     4.85
Conciseness           4.36     4.43     3.95     4.57     3.69     4.56     4.69
Harmlessness          4.91     4.26     4.94     4.97     4.79     4.91     4.85
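
To make the table easier to check, the following is a minimal Python sketch that transcribes the scores above and reports which model leads each skill. The helper names (`best_model_per_skill`, `mean_score_per_model`) are hypothetical, and the unweighted mean across skills is an assumption made here for illustration; it is not an aggregate reported by the source.

```python
# Scores transcribed from the results table above.
# Column order: Vicuna, Alpaca, LLaMA2, GPT-3.5, Bard, Claude, GPT-4.
MODELS = ["Vicuna", "Alpaca", "LLaMA2", "GPT-3.5", "Bard", "Claude", "GPT-4"]

SCORES = {
    "Logical Robustness":  [2.29, 2.04, 2.65, 4.00, 3.51, 3.59, 4.25],
    "Logical Correctness": [2.61, 2.41, 2.96, 3.83, 3.52, 3.68, 4.25],
    "Logical Efficiency":  [2.87, 2.44, 3.09, 4.29, 3.82, 4.13, 4.54],
    "Factuality":          [3.38, 2.87, 3.60, 3.91, 3.76, 3.89, 4.23],
    "Commonsense":         [3.49, 3.13, 3.77, 4.13, 4.02, 4.09, 4.50],
    "Comprehension":       [3.55, 2.91, 3.73, 3.97, 3.84, 4.13, 4.34],
    "Insightfulness":      [3.03, 2.35, 3.57, 3.28, 3.43, 3.46, 3.80],
    "Completeness":        [3.46, 2.62, 3.92, 3.80, 3.92, 4.17, 4.26],
    "Metacognition":       [3.69, 2.13, 3.98, 3.74, 3.34, 3.92, 4.33],
    "Readability":         [4.65, 4.43, 4.74, 4.86, 4.68, 4.82, 4.85],
    "Conciseness":         [4.36, 4.43, 3.95, 4.57, 3.69, 4.56, 4.69],
    "Harmlessness":        [4.91, 4.26, 4.94, 4.97, 4.79, 4.91, 4.85],
}


def best_model_per_skill(scores, models):
    """Map each skill to the list of models sharing the top score."""
    best = {}
    for skill, row in scores.items():
        top = max(row)
        best[skill] = [m for m, s in zip(models, row) if s == top]
    return best


def mean_score_per_model(scores, models):
    """Unweighted mean over all skills (an illustrative assumption)."""
    n_skills = len(scores)
    return {
        model: sum(row[i] for row in scores.values()) / n_skills
        for i, model in enumerate(models)
    }


if __name__ == "__main__":
    for skill, leaders in best_model_per_skill(SCORES, MODELS).items():
        print(f"{skill:<20} {', '.join(leaders)}")
    for model, mean in mean_score_per_model(SCORES, MODELS).items():
        print(f"{model:<8} {mean:.2f}")
```

Run on the transcribed numbers, GPT-4 leads 10 of the 12 skills, while GPT-3.5 leads Readability and Harmlessness by small margins.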

Source
Submitted (non-reviewed) paper
Ye, Seonghyeon, et al. "FLASK: Fine-grained Language Model Evaluation based on
Alignment Skill Sets." arXiv preprint arXiv:2307.10928 (2023).
Web sources
https://github.com/kaistAI/FLASK
