Comparing LLMs Using a Unified Performance
Ranking System
Maikel Leon
Department of Business Technology, Miami Herbert Business School, University of Miami, Florida, USA

International Journal of Artificial Intelligence and Applications (IJAIA), Vol.15, No.4, July 2024
DOI: 10.5121/ijaia.2024.15403
Abstract. Large Language Models (LLMs) such as OpenAI’s GPT, Meta’s LLaMA, and Google’s PaLM
have transformed natural language processing and AI-driven applications, and they have done so quickly.
Despite their transformative power, the absence of a common metric for comparing these models remains
a substantial barrier for researchers and practitioners. This research proposes a novel performance ranking
metric to satisfy the pressing demand for a complete evaluation system. Our metric combines qualitative
and quantitative evaluations to compare LLM capabilities comprehensively. Through thorough benchmarking,
we examine the advantages and disadvantages of top LLMs, providing insight into their relative performance.
This project aims to advance the development of more reliable and effective language models and to support
well-informed decisions when choosing among them.
Keywords: Large Language Models (LLMs), Performance Evaluation, Benchmarking, Qualitative Analysis, and Quantitative Metrics.
1 Introduction
Artificial intelligence (AI) has evolved significantly over the past several decades, rev-
olutionizing various industries and transforming how we interact with technology. The
journey from early AI systems to modern LLMs is marked by machine learning (ML)
and deep learning advancements. Initially, AI focused on rule-based systems and symbolic
reasoning, which laid the groundwork for more sophisticated approaches [1]. The advent
of ML introduced data-driven techniques that enabled systems to learn and improve from
experience. Deep learning further accelerated this paradigm shift by leveraging neural
networks to model complex patterns and achieve unprecedented performance levels in
tasks such as image and speech recognition. The development of LLMs, such as GPT-3
and beyond, represents the latest frontier in this evolution, harnessing vast amounts of
data and computational power to generate human-like text and perform a wide array of
language-related tasks. This paper explores the progression from traditional AI to ML,
deep learning, and the emergence of LLMs, highlighting key milestones, technological ad-
vancements, and their implications for the future of AI.
LLMs have emerged as transformative tools in Natural Language Processing (NLP),
demonstrating unparalleled capabilities in understanding and generating human language.
Models such as OpenAI’s GPT, Meta’s LLaMA, and Google’s PaLM have set new bench-
marks in tasks ranging from text completion to sentiment analysis. These advancements
have expanded the horizons of what is possible with AI and underscored the critical need
for robust evaluation frameworks that can comprehensively assess and compare the effec-
tiveness of these models. LLMs represent a culmination of advancements in deep learning,
leveraging vast amounts of data and computational power to achieve remarkable linguistic
capabilities [2]. Each iteration, from the 175-billion-parameter GPT-3 to the latest GPT-4,
has pushed the boundaries of language understanding and generation. Meta’s
LLaMA, optimized for efficiency with 65 billion parameters, excels in multilingual applica-
tions, while Google’s PaLM, with its 540 billion parameters, tackles complex multitasking
scenarios [3].
The following are some key advancements:
– GPT Series: Known for its versatility in generating coherent text across various
domains.
– LLaMA: Notable for its efficiency and performance in real-time applications and mul-
tilingual contexts.
– PaLM: Designed to handle complex question-answering and multitasking challenges
with high accuracy.
These models have revolutionized industries such as healthcare, finance, and education,
enhancing customer interactions, automating tasks, and enabling personalized learning ex-
periences [4]. Despite their advancements, the evaluation of LLMs remains fragmented and
lacks a unified methodology. Current evaluation metrics often focus on specific aspects of
model performance, such as perplexity scores or accuracy rates in predefined tasks. How-
ever, these metrics do not provide a comprehensive view of overall model effectiveness,
leading to challenges in comparing different models directly.
Some current limitations are listed below:
– Fragmented Metrics: Diverse evaluation criteria hinder direct comparisons between
LLMs.
– Qualitative vs. Quantitative: Emphasis on either qualitative insights or quantita-
tive benchmarks, but not both.
– Application-Specific Challenges: Difficulty selecting the most suitable LLM for
specific real-world applications.
These limitations underscore the need for a standardized evaluation framework inte-
grating qualitative assessments with quantitative benchmarks. To address these challenges,
this paper proposes a novel performance ranking metric to assess LLM capabilities compre-
hensively. Our approach integrates qualitative insights, such as model interpretability and
coherence in generated text, with quantitative metrics, including computational efficiency
and performance across standardized NLP benchmarks. By synthesizing these dimensions,
our metric offers a holistic perspective on LLM performance that facilitates meaningful
comparisons and supports informed decision-making in model selection [5].
The following are the objectives of the study:
– Develop a standardized evaluation framework for LLMs that captures qualitative and
quantitative aspects.
– Conduct a comparative analysis of leading models (GPT-4, LLaMA, PaLM) to high-
light strengths and limitations.
– Propose guidelines for selecting the most suitable LLM for specific NLP applications
based on comprehensive evaluation criteria.
In addition to proposing a new evaluation methodology, this study provides empirical
insights into the performance of leading LLMs across diverse application domains. Table
1 summarizes key characteristics and performance metrics, offering a structured overview
of the models under consideration.
This study’s contributions are expected to advance the field of NLP by establishing a
standardized approach to evaluating LLMs, enhancing transparency, and supporting the
development of more effective AI-driven language models. This research aims to acceler-
ate progress in AI research and applications by addressing the current gaps in evaluation
methodologies, ultimately benefiting industries and society. Developing a unified perfor-
mance ranking metric is crucial for unlocking the full potential of Large Language Models
in real-world applications. By providing a comprehensive evaluation framework, this paper
aims to contribute to the ongoing dialogue on model evaluation and drive future innova-
tions in AI-driven language processing [6].

Table 1. Comparison of Leading Large Language Models

| Model | Developer | Parameter Count | Primary Use Cases |
|-------|-----------|-----------------|-------------------|
| GPT-4 | OpenAI | Undisclosed | Text generation, code completion |
| LLaMA | Meta | 65 billion | Multilingual tasks, real-time applications |
| PaLM | Google | 540 billion | Complex question answering, multitasking |
2 Understanding Generative AI and LLMs
AI encompasses diverse methodologies and approaches tailored for specific tasks and ap-
plications. The distinction between regular AI and Generative AI, such as Large Language
Models (LLMs), lies in their fundamental approach to data processing and task execution:
– Regular AI (Symbolic AI): Traditional AI models rely on explicit programming
and predefined rules to process structured data and execute tasks. They excel in tasks
with clear rules and well-defined inputs and outputs, such as rule-based systems in
chess-playing or automated decision-making processes [7].
– Generative AI (LLMs): Generative AI, exemplified by LLMs, operates differently
by learning from vast amounts of unstructured data to generate outputs. These models
use deep learning techniques to understand and produce human-like text, exhibiting
creativity and adaptability in language tasks.
Generative AI represents a paradigm shift in AI and Natural Language Processing
(NLP), enabling machines to perform tasks that require understanding and generation
of natural language in a way that closely mimics human capabilities. In particular, LLMs
have demonstrated remarkable capabilities across various applications:
– Text Generation: LLMs like OpenAI’s GPT series can generate coherent and con-
textually relevant text, from short sentences to entire articles, based on prompts or
input text.
– Translation: Models such as Google’s T5 have shown effective translation capabilities,
converting text between multiple languages with high accuracy and fluency.
– Question Answering: LLMs are proficient in answering natural language questions
based on their understanding of context and information retrieval from large datasets.
– Creative Writing: Some LLMs have been trained to generate creative content such
as poems, stories, and even music compositions, showcasing their versatility and cre-
ativity.
– Chatbots and Virtual Assistants: AI-powered chatbots and virtual assistants lever-
age LLMs to engage in natural conversations, provide customer support, and perform
tasks such as scheduling appointments or making reservations.
These examples illustrate how Generative AI, specifically LLMs, extends beyond tra-
ditional AI applications by enabling machines to understand and generate human-like
text with contextually appropriate responses and creative outputs [8]. LLMs are a promi-
nent example of Generative AI, distinguished by their ability to process and generate
human-like text based on vast amounts of data. These models, particularly those based
on Transformer architectures, have revolutionized NLP by:
– Scale: LLMs are trained on massive datasets comprising billions of words or sentences
from diverse sources such as books, articles, and websites.
– Contextual Understanding: They exhibit a strong capability to understand and
generate text in context, allowing them to produce coherent and contextually relevant
responses.
– Generativity: LLMs can generate human-like text, including completing sentences,
answering questions, and producing creative content such as poems or stories.
– Transfer Learning: They benefit from transfer learning, where models pre-trained on
large datasets can be fine-tuned on specific tasks with smaller, task-specific datasets.
LLMs exemplify the power of Generative AI in harnessing deep learning to achieve
remarkable capabilities in understanding and generating natural language. Their ability
to generate text indistinguishable from human-generated content marks a significant ad-
vancement in AI research and applications. LLMs leverage advanced machine learning
techniques, primarily deep learning architectures, to achieve their impressive capabilities
in NLP. These models are typically based on Transformer architectures, which have be-
come the cornerstone of modern NLP tasks due to their ability to process sequential data
efficiently.
The Transformer architecture, introduced by Vaswani et al. (2017), revolutionized NLP
by replacing recurrent neural networks (RNNs) and convolutional neural networks (CNNs)
with a self-attention mechanism [9]. Key components of the Transformer include:
– Self-Attention Mechanism: Lets the model weigh the significance of different words
in a sentence, capturing long-range dependencies efficiently (a minimal sketch follows
this list).
– Multi-head Attention: Enhances the model’s ability to focus on different positions
and learn diverse input representations.
– Feedforward Neural Networks: Process the outputs of the attention mechanism to
generate context-aware representations [10].
– Layer Normalization and Residual Connections: Aid in stabilizing training and
facilitating the flow of gradients through deep networks.
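To make the self-attention step concrete, the following is a minimal NumPy sketch of scaled dot-product attention, the core operation named above (array shapes and names are illustrative, not drawn from any particular implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of value vectors

# Toy self-attention: 4 tokens with 8-dimensional representations, Q = K = V
x = np.random.default_rng(0).standard_normal((4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```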
LLMs employ Transformer-based architectures with more layers, parameters, and com-
putational resources to handle larger datasets and achieve state-of-the-art performance in
various NLP tasks. Training LLMs involves several stages and techniques to optimize
performance and efficiency:
– Pre-training: Initial training on large-scale datasets (e.g., books, articles, web text)
to learn general language patterns and representations. Models like GPT-3 are pre-
trained on massive corpora to capture broad linguistic knowledge [11].
– Fine-tuning: Further training on task-specific datasets (e.g., question answering, text
completion) to adapt the model’s parameters to specific applications. Fine-tuning en-
hances model performance and ensures applicability to real-world tasks.
– Regularization Techniques: Methods such as dropout and weight decay prevent
overfitting and improve generalization capabilities, which are crucial for robust perfor-
mance across different datasets.
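As a small illustration of the regularizers just named, here is how they typically appear in a PyTorch training setup (layer sizes and hyperparameter values are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Dropout randomly zeroes activations during training to discourage co-adaptation;
# weight decay (an L2 penalty on the parameters) is configured on the optimizer.
model = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(768, 768),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```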
In addition to machine learning architectures, LLMs rely on sophisticated data struc-
tures to efficiently manage and process vast amounts of textual data. Key data structures
include:
– Tokenizers: Convert raw text into tokens (words, subwords) suitable for model input.
Tokenization methods vary: BERT, for example, uses WordPiece, while other models use
Byte-Pair Encoding (BPE) to handle rare words and subword units effectively (see the
sketch after this list).
– Embeddings: Represent words or tokens as dense vectors in a continuous vector space.
Embeddings capture semantic relationships and contextual information, enhancing the
model’s ability to understand and generate coherent text.
– Attention Matrices: Store attention weights computed during self-attention opera-
tions. These matrices enable the model to effectively focus on relevant parts of input
sequences and learn contextual dependencies.
– Cached Computations: Optimize inference speed by caching intermediate compu-
tations during attention and feedforward operations, reducing redundant calculations
and improving efficiency [12].
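The tokenizer and embedding stages described in this list can be observed end to end with the Hugging Face transformers library; the sketch below uses BERT’s WordPiece tokenizer (the exact subword splits and tensor sizes depend on the model and input):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Tokenization splits rare words into subword units."
print(tokenizer.tokenize(text))  # subword tokens, e.g. ['token', '##ization', ...]

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # one dense vector per token

print(hidden.shape)  # (batch, sequence_length, hidden_size), e.g. torch.Size([1, 12, 768])
```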
These data structures play a critical role in LLMs’ performance and scalability, en-
abling them to handle large-scale datasets and achieve state-of-the-art results in various
NLP benchmarks. Integrating advanced machine learning techniques, such as Transformer
architectures and sophisticated data structures, is fundamental to the development and
success of Large Language Models (LLMs). These models represent a significant advancement in
natural language processing, enabling machines to understand and generate human-like
text with unprecedented accuracy and complexity. By leveraging scalable architectures and
efficient data handling mechanisms, LLMs continue to push the boundaries of AI research
and application, paving the way for transformative innovations in language understanding
and generation [13].
3 Evolution of Large Language Models
LLMs have undergone a remarkable evolution over the past decades, driven by advance-
ments in deep learning, computational resources, and the availability of large-scale datasets.
This section provides a comprehensive overview of the evolution of LLMs from their early
conception to their current capabilities, highlighting key milestones and technological
breakthroughs that have shaped their development. The concept of LLMs emerged from
early efforts in statistical language modeling and neural networks, aiming to improve the
understanding and generation of human language. Traditional approaches such as n-gram
models and Hidden Markov Models (HMMs) provided foundational insights into language
patterns but were limited in capturing semantic nuances and context. The shift towards
neural network-based approaches in the early 2000s marked a significant milestone, laying
the groundwork for more sophisticated language models capable of learning hierarchical
representations of text.
Key milestones are:
– Early 2000s: Development of neural network-based language models, focusing on
improving language modeling accuracy and efficiency.
– 2010s: Emergence of recurrent neural networks (RNNs) and Long Short-Term Memory
(LSTM) networks, which enhanced the ability to capture long-range dependencies in
language [14]. Models like LSTM-based language models showed improved performance
in tasks such as text generation and sentiment analysis.
– 2017–2020: Breakthrough with the Transformer architecture introduced in mod-
els like GPT (Generative Pre-trained Transformer) by OpenAI. Transformers revolu-
tionized language modeling by leveraging self-attention mechanisms to capture global
dependencies in text, leading to significant improvements in NLP tasks.
The evolution of LLMs has been closely intertwined with advancements in hardware
capabilities, algorithmic improvements, and the availability of large-scale datasets. The
following table provides an overview of key technological advancements and their impact
on the development of LLMs:
Table 2. Technological Advancements in LLMs

| Technological Advancement | Impact on LLM Development |
|---------------------------|---------------------------|
| Increase in computational power | Enabled training of larger and more complex models (e.g., GPT-3, GPT-4) |
| Availability of large-scale datasets (e.g., Common Crawl, Wikipedia) | Facilitated pre-training of models on vast amounts of text data, improving language understanding |
| Introduction of Transformer architecture | Revolutionized language modeling by capturing global dependencies through self-attention mechanisms |
| Optimization techniques (e.g., learning rate schedules, gradient normalization) | Enhanced training stability and convergence of deep neural networks |
These advancements have propelled LLMs from experimental prototypes to practical
tools with broad applications across industries, including healthcare, finance, and educa-
tion. Integrating advanced technologies has enhanced LLMs’ capabilities and expanded
their potential to address complex natural language understanding and generation chal-
lenges [15].
Recent advancements in LLMs have focused on enhancing model capabilities in several
key areas:
– Multimodal Understanding: Integration of vision and language capabilities in mod-
els like CLIP (Contrastive Language-Image Pre-training) and DALL-E, enabling tasks
such as image captioning and generation.
– Zero-Shot Learning: Ability to perform tasks with minimal or no task-specific train-
ing data, demonstrating generalized learning capabilities (a brief example follows this
list).
– Ethical Considerations: Increasing focus on fairness, transparency, and bias miti-
gation in model development and deployment, addressing societal concerns related to
AI ethics [16].
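As a brief illustration of the zero-shot behavior referenced above, the Hugging Face transformers pipeline can classify text against labels it was never explicitly trained on (the model choice here is one common option, and the printed scores are illustrative):

```python
from transformers import pipeline

# NLI-based zero-shot classification: no task-specific fine-tuning required.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new accelerator doubles training throughput at half the power draw.",
    candidate_labels=["hardware", "cooking", "politics"],
)
print(result["labels"][0])  # expected: "hardware" (highest-scoring label)
```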
These advancements underscore the dynamic nature of LLMs and their potential to
reshape the landscape of AI-driven technologies in the coming years. LLMs are poised to
drive innovation and address real-world challenges across diverse domains by continually
pushing the boundaries of language understanding and generation. The evolution of Large
Language Models (LLMs) from their early conception to their current capabilities reflects
significant advancements in deep learning, computational resources, and data availability.
As LLMs continue to evolve, driven by innovations in architecture and training techniques,
they promise to revolutionize diverse fields ranging from healthcare to finance and beyond.
By understanding the historical context and technological milestones of LLM development,
researchers and practitioners can better appreciate the transformative potential of these
models in advancing AI research and applications. When evaluating different LLMs, several
key parameters must be considered to determine their suitability for specific tasks and
applications; Figure 1 provides an overall perspective across well-known models.
Fig. 1. Comparison of Large Language Models (LLMs) Across Key Parameters: Model Size, Multilingual
Support, Training Data, Text Generation Capabilities, and Ease of Integration.
4 The Need for Comparing LLMs
The evaluation of LLMs poses several challenges due to the diversity in model archi-
tectures, training methodologies, and evaluation metrics. Existing evaluation frameworks
often focus on specific tasks or datasets, leading to fragmented assessments that do not
provide a holistic view of model performance across different applications. This fragmented
approach hinders effective LLM comparison, making it difficult for researchers, developers,
and industry stakeholders to select the most suitable model for specific use cases.
Some key challenges are:
– Fragmented Metrics: Current evaluation metrics emphasize task-specific perfor-
mance (e.g., accuracy, perplexity) without considering broader applicability.
– Lack of Standardization: Absence of a standardized index or benchmark for com-
paring LLMs across diverse tasks and datasets [17].
– Complexity in Model Comparison: Difficulty in interpreting and comparing re-
sults from different evaluation studies due to varied experimental setups and reporting
practices.
Addressing these challenges requires the development of a unified index that integrates
qualitative assessments and quantitative benchmarks to provide a comprehensive evalua-
tion of LLM capabilities. To bridge the gap in LLM evaluation, this paper proposes the
development of a unified performance index designed to assess and compare LLMs across
multiple dimensions.
The proposed index would incorporate the following criteria:
– Quantitative Metrics: Performance on standard NLP benchmarks (e.g., GLUE, Su-
perGLUE, SQuAD) to measure model accuracy and effectiveness in specific tasks.
– Computational Efficiency: Evaluation of inference time, memory usage, and energy
consumption, which is crucial for practical deployment (a measurement sketch follows
this list).
– Robustness and Generalization: Assessment of model robustness to domain shifts,
adversarial inputs, and generalization ability across different datasets and languages.
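For the computational-efficiency criterion, a rough measurement harness might look like the sketch below (generate stands in for any model call; tracemalloc tracks only Python heap allocations, so GPU memory and energy consumption would need dedicated tooling in practice):

```python
import time
import tracemalloc

def measure_efficiency(generate, prompt):
    """Return (output, seconds, peak_mb) for a single generation call."""
    tracemalloc.start()
    start = time.perf_counter()
    output = generate(prompt)
    seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return output, seconds, peak_bytes / 1_000_000

# Usage with a trivial stand-in for a model:
_, seconds, peak_mb = measure_efficiency(lambda p: p.upper(), "hello world")
print(f"{seconds:.4f}s, {peak_mb:.3f} MB peak")
```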
Table 3 outlines the proposed criteria for the unified performance index:
Table 3. Criteria for Unified Performance Index

| Criterion | Description |
|-----------|-------------|
| Quantitative Metrics | Performance on standardized NLP benchmarks (e.g., accuracy, F1 score) across diverse tasks |
| Computational Efficiency | Evaluation of model inference speed, memory footprint, and energy efficiency |
| Robustness and Generalization | Assessment of model performance under varying conditions and ability to generalize |
By establishing a unified index, stakeholders in academia and industry would benefit
from:
– Informed Decision-Making: Facilitated selection of LLMs based on comprehensive
performance assessments aligned with specific application requirements.
– Accelerated Research: Enhanced comparability of research findings and accelerated
progress in developing more effective LLM architectures and training methodologies.
– Industry Applications: Improved deployment of LLMs in real-world applications,
ensuring optimal performance and efficiency in diverse operational contexts.
Overall, developing a unified performance index for LLMs is essential for advancing
the field of NLP, fostering transparency, and driving innovation in AI-driven language
processing technologies. The lack of a standardized index for comparing Large Language
Models (LLMs) represents a significant challenge in current NLP research and applica-
tions. This paper aims to address this gap and contribute to advancing LLM evaluation
methodologies by proposing a unified performance index that integrates qualitative as-
sessments and quantitative benchmarks. Through systematic comparison and evaluation,
stakeholders can make informed decisions, accelerate research progress, and optimize the
deployment of LLMs in diverse real-world applications [18].
5 Designing a Metric to Evaluate the Performance of LLMs: A Fictional Example
To evaluate LLMs’ performance, we can develop a comprehensive metric that incorpo-
rates both quantitative and qualitative aspects of performance. A suitable metric should
cover accuracy, contextual understanding, coherence, fluency, and resource efficiency. The
proposed metric, the “Comprehensive Language Model Performance Index (CLMPI),”
combines these aspects into a single framework.
The CLMPI comprises the following components:
1. Accuracy (ACC):
– Definition: Measures the factual and grammatical correctness of the responses.
– Methodology: Compare LLM outputs against a curated dataset of questions and
expert answers.
– Calculation: Percentage of correct answers (factually and grammatically) over the
total number of responses.
2. Contextual Understanding (CON):
– Definition: Assesses the model’s ability to understand and integrate context from
the conversation or document history.
– Methodology: Use context-heavy dialogue or document samples to test if the
LLM maintains topic relevance and effectively utilizes the provided historical in-
formation.
– Calculation: Scoring responses for relevance and context integration on a scale
from 0 (no context used) to 5 (excellent use of context).
3. Coherence (COH):
– Definition: Evaluates how logically connected and structurally sound the responses
are.
– Methodology: Analysis of response sequences to ensure logical flow and connec-
tion of ideas.
– Calculation: Human or automated scoring of response sequences on a scale from
0 (incoherent) to 5 (highly coherent).
4. Fluency (FLU):
– Definition: Measures the linguistic smoothness and readability of the text.
– Methodology: Responses are analyzed for natural language use, grammatical cor-
rectness, and stylistic fluency.
– Calculation: Rate responses on a scale from 0 (not fluent) to 5 (very fluent).
5. Resource Efficiency (EFF):
– Definition: Assesses the computational resources (like time and memory) used by
the LLM for tasks.
– Methodology: Measure the average time and system resources consumed for gen-
erating responses.
– Calculation: Efficiency score calculated by
EFF = 1 / (Time Taken (seconds) + Memory Used (MB) / 100)
The CLMPI score would be an aggregate, weighted sum of the individual metrics:
CLMPI = (w1 × ACC) + (w2 × CON) + (w3 × COH) + (w4 × FLU) + (w5 × EFF)
where wi are the weights assigned to each metric based on the priority of aspects. These
weights are determined based on the specific needs and usage context of the LLM.
Imagine we are evaluating an LLM designed for academic research assistance:
– Accuracy: The LLM correctly answers 85 out of 100 factual questions.
ACC = 85%
– Contextual Understanding: It scores an average of 4.2 on integrating lecture notes
into its responses.
CON = 4.2
– Coherence: Responses logically flow and are well-structured, with an average score
of 4.0.
COH = 4.0
– Fluency: The text is readable and stylistically appropriate, with minimal grammatical
errors, scoring 4.5.
FLU = 4.5
– Resource Efficiency: The model uses 200 MB of memory and takes 1.5 seconds on
average for response generation.
EFF = 1 / (1.5 + 200/100) = 1 / 3.5 ≈ 0.29
Assuming equal weights for simplicity (wi = 1):
CLMPI = 0.85 + 4.2 + 4.0 + 4.5 + 0.29 ≈ 13.84
This CLMPI score provides a quantitative measure of the LLM’s performance across
various dimensions critical to its role as an academic aide. With equal weights, the three
qualitative components contribute at most 5 each, accuracy at most 1 (as a fraction), and
the efficiency term is typically well below 1, so the score rewards balance across all
components. Adjusting weights according to specific performance priorities could further
refine this metric. This example illustrates how different aspects of LLM functionality
are crucial for particular applications and how a comprehensive metric like CLMPI can
provide a balanced assessment.
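The computation above can be reproduced with a short Python sketch of the CLMPI formula (the function and its parameter names are our own illustration; the equal-weight default mirrors the example):

```python
def clmpi(acc, con, coh, flu, time_s, mem_mb, weights=(1, 1, 1, 1, 1)):
    """Comprehensive Language Model Performance Index (illustrative sketch).

    acc: accuracy as a fraction in [0, 1]; con, coh, flu: scores in [0, 5];
    time_s, mem_mb: average generation time (seconds) and memory use (MB).
    """
    eff = 1.0 / (time_s + mem_mb / 100.0)  # resource-efficiency term (EFF)
    w1, w2, w3, w4, w5 = weights
    return w1 * acc + w2 * con + w3 * coh + w4 * flu + w5 * eff

# The academic-assistant example above, with equal weights:
print(round(clmpi(0.85, 4.2, 4.0, 4.5, time_s=1.5, mem_mb=200), 2))  # 13.84
```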
Below is a comparison table for three fictional large language models (LLMs): LLM-A,
LLM-B, and LLM-C. The table compares their performance across the critical metrics
defined in the Comprehensive Language Model Performance Index (CLMPI): Accuracy
(ACC), Contextual Understanding (CON), Coherence (COH), Fluency (FLU), and Re-
source Efficiency (EFF). For this example, LLM-C is designed to outperform the other
models significantly, especially in terms of efficiency and contextual understanding.
Table 4. Comparison of LLM Performance

| Metric | LLM-A | LLM-B | LLM-C | Description |
|--------|-------|-------|-------|-------------|
| Accuracy (ACC) | 78% | 82% | 88% | Percentage of questions answered correctly |
| Contextual Understanding (CON) | 3.5 | 4.0 | 4.8 | Score out of 5, effectiveness of using context |
| Coherence (COH) | 3.8 | 4.0 | 4.5 | Score out of 5, logical structuring of text |
| Fluency (FLU) | 3.9 | 4.3 | 4.7 | Score out of 5, linguistic smoothness |
| Resource Efficiency (EFF) | 0.25 | 0.30 | 0.45 | Efficiency score, higher is better |
| Overall CLMPI Score | 2.47 | 2.71 | 3.09 | Weighted sum of component scores (accuracy as a fraction) |
The weights could be assigned in the following way:
– Accuracy (ACC): 0.25
– Contextual Understanding (CON): 0.20
– Coherence (COH): 0.20
– Fluency (FLU): 0.20
– Resource Efficiency (EFF): 0.15
Each CLMPI score is calculated as follows, assuming these weights:
– LLM-A CLMPI Calculation:
CLMPI-A = (0.78 × 0.25) + (3.5 × 0.20) + (3.8 × 0.20) + (3.9 × 0.20) + (0.25 × 0.15)
CLMPI-A = 0.195 + 0.70 + 0.76 + 0.78 + 0.0375 ≈ 2.47
– LLM-B CLMPI Calculation:
CLMPI-B = (0.82 × 0.25) + (4.0 × 0.20) + (4.0 × 0.20) + (4.3 × 0.20) + (0.30 × 0.15)
CLMPI-B = 0.205 + 0.80 + 0.80 + 0.86 + 0.045 = 2.71
– LLM-C CLMPI Calculation:
CLMPI-C = (0.88 × 0.25) + (4.8 × 0.20) + (4.5 × 0.20) + (4.7 × 0.20) + (0.45 × 0.15)
CLMPI-C = 0.22 + 0.96 + 0.90 + 0.94 + 0.0675 ≈ 3.09
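Under the same assumptions, the weighted sums above can be checked with a few lines of Python (scores taken from Table 4, with accuracy converted to a fraction):

```python
weights = {"ACC": 0.25, "CON": 0.20, "COH": 0.20, "FLU": 0.20, "EFF": 0.15}
models = {
    "LLM-A": {"ACC": 0.78, "CON": 3.5, "COH": 3.8, "FLU": 3.9, "EFF": 0.25},
    "LLM-B": {"ACC": 0.82, "CON": 4.0, "COH": 4.0, "FLU": 4.3, "EFF": 0.30},
    "LLM-C": {"ACC": 0.88, "CON": 4.8, "COH": 4.5, "FLU": 4.7, "EFF": 0.45},
}
for name, scores in models.items():
    total = sum(weights[k] * scores[k] for k in weights)
    print(f"{name}: {total:.2f}")  # LLM-A: 2.47, LLM-B: 2.71, LLM-C: 3.09
```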
LLM-C outperforms LLM-A and LLM-B across all metrics, notably in resource effi-
ciency and contextual understanding, which are critical for performance in dynamic and
resource-constrained environments. This table effectively illustrates how different models
can be evaluated against important characteristics, providing insight into their strengths
and weaknesses. Using a weighted metric system (CLMPI) allows for balanced considera-
tion of various aspects crucial for the practical deployment of LLMs.
6 Reflection
The rapid advancement of Large Language Models (LLMs) has transformed natural lan-
guage processing (NLP), offering unprecedented capabilities in tasks such as text gen-
eration, translation, and sentiment analysis. Models like OpenAI’s GPT series, Meta’s
LLaMA, and Google’s PaLM have demonstrated remarkable proficiency in understanding
and generating human language, paving the way for applications across diverse domains.
However, the absence of a standardized framework for comparing LLMs poses significant
challenges in evaluating their performance comprehensively. The landscape lacks a uni-
fied index integrating qualitative insights and quantitative metrics to assess LLMs across
various dimensions. Evaluation methodologies often focus on specific tasks or datasets,
resulting in fragmented assessments that do not provide a holistic view of model capabil-
ities. This fragmentation hinders researchers, developers, and industry stakeholders from
making informed decisions regarding model selection and deployment.
Addressing these challenges requires the development of a robust evaluation frame-
work that considers factors such as model accuracy, computational efficiency, and robust-
ness across different domains and languages. Such a framework would facilitate meaningful
comparisons between LLMs, enabling researchers to identify each model’s strengths, weak-
nesses, and optimal use cases. The need for accurately comparing Large Language Models
(LLMs) is paramount for advancing the field of natural language processing (NLP) and
maximizing the potential of AI-driven technologies in real-world applications.
By establishing a standardized evaluation framework, stakeholders in academia, indus-
try, and policy-making can benefit in several ways:
– Informed Decision-Making: Facilitated selection of LLMs based on comprehensive
performance assessments aligned with specific application requirements.
– Accelerated Research: Enhanced comparability of research findings and accelerated
progress in developing more effective LLM architectures and training methodologies.
– Optimized Applications: Improved deployment of LLMs in diverse domains, ensur-
ing optimal performance, efficiency, and ethical considerations [19].
Furthermore, a unified framework for comparing LLMs promotes transparency and
reproducibility in AI research, fostering collaboration and innovation across the global sci-
entific community. As LLMs continue to evolve and expand their capabilities, establishing
rigorous evaluation standards becomes increasingly critical to unlocking their full poten-
tial and addressing societal challenges. In conclusion, developing a standardized evaluation
framework for LLMs is essential for advancing AI research, enabling transformative appli-
cations, and ensuring responsible deployment of AI technologies. By addressing the current
gaps in LLM evaluation, we can harness the power of these models to drive innovation
and benefit society at large.
References
1. G. Nápoles, Y. Salgueiro, I. Grau, and M. Leon, “Recurrence-aware long-term cognitive network for
explainable pattern classification,” IEEE Transactions on Cybernetics, vol. 53, no. 10, pp. 6083–6094,
2023.
2. A. Upadhyay, E. Farahmand, I. Muñoz, M. Akber Khan, and N. Witte, “Influence of LLMs on
learning and teaching in higher education,” SSRN Electronic Journal, 2024.
3. J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing
the power of LLMs in practice: A survey on ChatGPT and beyond,” ACM Trans. Knowl. Discov. Data,
vol. 18, Apr. 2024.
4. M. Leon, “Business technology and innovation through problem-based learning,” in Canada Interna-
tional Conference on Education (CICE-2023) and World Congress on Education (WCE-2023), CICE-
2023, Infonomics Society, July 2023.
5. N. Capodieci, C. Sanchez-Adames, J. Harris, and U. Tatar, “The impact of generative AI and LLMs
on the cybersecurity profession,” in 2024 Systems and Information Engineering Design Symposium
(SIEDS), pp. 448–453, 2024.
6. G. Nápoles, J. L. Salmeron, W. Froelich, R. Falcon, M. Leon, F. Vanhoenshoven, R. Bello, and K. Van-
hoof, Fuzzy Cognitive Modeling: Theoretical and Practical Considerations, p. 77–87. Springer Singa-
pore, July 2019.
7. G. Nápoles, M. Leon, I. Grau, and K. Vanhoof, “FCM expert: Software tool for scenario analysis and
pattern classification based on fuzzy cognitive maps,” International Journal on Artificial Intelligence
Tools, vol. 27, no. 07, p. 1860010, 2018.
8. A. R. Asadi, “LLMs in design thinking: Autoethnographic insights and design implications,” in Proceed-
ings of the 2023 5th World Symposium on Software Engineering, WSSE ’23, (New York, NY, USA),
p. 55–60, Association for Computing Machinery, 2023.
9. E. Struble, M. Leon, and E. Skordilis, “Intelligent prevention of DDoS attacks using reinforcement
learning and smart contracts,” The International FLAIRS Conference Proceedings, vol. 37, May 2024.
10. G. Nápoles, M. L. Espinosa, I. Grau, K. Vanhoof, and R. Bello, Fuzzy cognitive maps based models for
pattern classification: Advances and challenges, vol. 360, pp. 83–98. Springer Verlag, 2018.
11. R. D. Pesl, M. Stötzner, I. Georgievski, and M. Aiello, “Uncovering LLMs for service-composition:
Challenges and opportunities,” in Service-Oriented Computing – ICSOC 2023 Workshops (F. Monti,
P. Plebani, N. Moha, H.-y. Paik, J. Barzen, G. Ramachandran, D. Bianchini, D. A. Tamburri, and
M. Mecella, eds.), (Singapore), pp. 39–48, Springer Nature Singapore, 2024.
12. M. Leon, L. Mkrtchyan, B. Depaire, D. Ruan, and K. Vanhoof, “Learning and clustering of fuzzy
cognitive maps for travel behaviour analysis,” Knowledge and Information Systems, vol. 39, no. 2,
pp. 435–462, 2013.
13. T. Han, L. C. Adams, K. Bressem, F. Busch, L. Huck, S. Nebelung, and D. Truhn, “Comparative
analysis of GPT-4Vision, GPT-4, and open-source LLMs in clinical diagnostic accuracy: A benchmark
against human expertise,” medRxiv, 2023.
14. M. Leon, “Aggregating procedure for fuzzy cognitive maps,” The International FLAIRS Conference
Proceedings, vol. 36, no. 1, 2023.
15. N. R. Rydzewski, D. Dinakaran, S. G. Zhao, E. Ruppin, B. Turkbey, D. E. Citrin, and K. R. Patel,
“Comparative evaluation of LLMs in clinical oncology,” NEJM AI, vol. 1, Apr. 2024.
16. H. DeSimone and M. Leon, “Explainable AI: The quest for transparency in business and beyond,” in
2024 7th International Conference on Information and Computer Technologies (ICICT), IEEE, Mar.
2024.
17. J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: The threats of using LLMs in soft-
ware engineering,” in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software
Engineering: New Ideas and Emerging Results, ICSE-NIER’24, (New York, NY, USA), p. 102–106,
Association for Computing Machinery, 2024.
18. J. Chen, X. Lu, Y. Du, M. Rejtig, R. Bagley, M. Horn, and U. Wilensky, “Learning agent-based
modeling with LLM companions: Experiences of novices and experts using ChatGPT & NetLogo Chat,” in
Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI ’24, (New York,
NY, USA), Association for Computing Machinery, 2024.
19. G. Nápoles, F. Hoitsma, A. Knoben, A. Jastrzebska, and M. Leon, “Prolog-based agnostic explanation
module for structured pattern classification,” Information Sciences, vol. 622, p. 1196–1227, Apr. 2023.
Author
Dr. Maikel Leon is interested in applying AI/ML techniques to modeling real-world
problems using knowledge engineering, knowledge representation, and data mining meth-
ods. His most recent research focuses on XAI and has been featured in Information
Sciences and IEEE Transactions on Cybernetics. Dr. Leon is a reviewer for the
International Journal of Knowledge and Information Systems, Journal of Experimental
and Theoretical Artificial Intelligence, Soft Computing, and IEEE Transactions on Fuzzy
Systems. He is a Committee Member of the Florida Artificial Intelligence Research Society.
He is a frequent contributor on technology topics for CNN en Español TV and the winner
of the Cuban Academy of Sciences National Award for the Most Relevant Research in
Computer Science. Dr. Leon obtained his PhD in Computer Science at Hasselt University,
Belgium, having previously studied computation (Master of Science and Bachelor of Sci-
ence) at Central University of Las Villas, Cuba.
More Related Content

PDF
Benchmarking Large Language Models with a Unified Performance Ranking Metric
PDF
Benchmarking Large Language Models with a Unified Performance Ranking Metric
PDF
Benchmarking Large Language Models with a Unified Performance Ranking Metric
PPTX
Vectorized Intent of Multilingual Large Language Models.pptx
PDF
solulab.com-Comparison of Large Language Models The Ultimate Guide (1).pdf
PDF
Crafting Your Customized Legal Mastery: A Guide to Building Your Private LLM
PDF
Top Comparison of Large Language ModelsLLMs Explained.pdf
PDF
Evaluating the top large language models.pdf
Benchmarking Large Language Models with a Unified Performance Ranking Metric
Benchmarking Large Language Models with a Unified Performance Ranking Metric
Benchmarking Large Language Models with a Unified Performance Ranking Metric
Vectorized Intent of Multilingual Large Language Models.pptx
solulab.com-Comparison of Large Language Models The Ultimate Guide (1).pdf
Crafting Your Customized Legal Mastery: A Guide to Building Your Private LLM
Top Comparison of Large Language ModelsLLMs Explained.pdf
Evaluating the top large language models.pdf

Similar to Comparing LLMs using a Unified Performance Ranking System (20)

PDF
Comparison of Large Language Models The Ultimate Guide.pdf
PDF
Top Comparison of Large Language ModelsLLMs Explained (2).pdf
PDF
Top Comparison of Large Language ModelsLLMs Explained.pdf
PDF
leewayhertz.com-How to build a private LLM (1).pdf
PDF
Large Language Models.pdf
PDF
A comprehensive guide to prompt engineering.pdf
PDF
Train foundation model for domain-specific language model
PDF
solulab.com-Top Comparison of Large Language ModelsLLMs Explained.pdf
PDF
solulab.com-Top Comparison of Large Language ModelsLLMs Explained.pdf
PDF
Rapid PHP 2025 v18.2.0.265 Crack Free Full Activated!
PDF
uTorrent 3.6.0 Crack (Pro Unlocked) for Windows Free 2025
PDF
Lumion Pro Crack 2025 With License Key [Latest] Free
PPTX
UTorrent Pro 3.5.5 Build 45231 Crack With Latest Version
PDF
A REVIEW OF PROMPT-FREE FEW-SHOT TEXT CLASSIFICATION METHODS
PDF
International Journal on Natural Language Computing (IJNLC)
PDF
A Review of Prompt-Free Few-Shot Text Classification Methods
PDF
How to Enhance NLP’s Accuracy with Large Language Models_ A Comprehensive Gui...
PDF
A comprehensive guide to prompt engineering.pdf
PDF
A comprehensive guide to prompt engineering.pdf
PDF
genai principles booklet with details of
Comparison of Large Language Models The Ultimate Guide.pdf
Top Comparison of Large Language ModelsLLMs Explained (2).pdf
Top Comparison of Large Language ModelsLLMs Explained.pdf
leewayhertz.com-How to build a private LLM (1).pdf
Large Language Models.pdf
A comprehensive guide to prompt engineering.pdf
Train foundation model for domain-specific language model
solulab.com-Top Comparison of Large Language ModelsLLMs Explained.pdf
solulab.com-Top Comparison of Large Language ModelsLLMs Explained.pdf
Rapid PHP 2025 v18.2.0.265 Crack Free Full Activated!
uTorrent 3.6.0 Crack (Pro Unlocked) for Windows Free 2025
Lumion Pro Crack 2025 With License Key [Latest] Free
UTorrent Pro 3.5.5 Build 45231 Crack With Latest Version
A REVIEW OF PROMPT-FREE FEW-SHOT TEXT CLASSIFICATION METHODS
International Journal on Natural Language Computing (IJNLC)
A Review of Prompt-Free Few-Shot Text Classification Methods
How to Enhance NLP’s Accuracy with Large Language Models_ A Comprehensive Gui...
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
genai principles booklet with details of
Ad

More from gerogepatton (20)

PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
Performance Evaluation of Block-Sized Algorithms for Majority Vote in Facial ...
PDF
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
PDF
Augmented and Synthetic Data in Artificial Intelligence
PDF
3rd International Conference on AI, Data Mining and Data Science (AIDD 2025)
PDF
July 2025 - Top 10 Read Articles in Artificial Intelligence and Applications ...
PDF
6th International Conference on Natural Language Processing and Computational...
PDF
From Insight to Impact: The Evolution of Data-Driven Decision Making in the A...
PDF
6th International Conference on Artificial Intelligence and Machine Learning ...
PDF
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
AI-Driven Vulnerability Analysis in Smart Contracts: Trends, Challenges and F...
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
6th International Conference on Artificial Intelligence and Machine Learning ...
PDF
A Thorough Introduction to Multimodal Machine Translation
PDF
International Journal of Artificial Intelligence & Applications (IJAIA)
PDF
6th International Conference on Advanced Machine Learning (AMLA 2025)
PDF
OWE-CVD: An Optimized Weighted Ensemble for Heart Disease Prediction
International Journal of Artificial Intelligence & Applications (IJAIA)
Performance Evaluation of Block-Sized Algorithms for Majority Vote in Facial ...
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
International Journal of Artificial Intelligence & Applications (IJAIA)
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
Augmented and Synthetic Data in Artificial Intelligence
3rd International Conference on AI, Data Mining and Data Science (AIDD 2025)
July 2025 - Top 10 Read Articles in Artificial Intelligence and Applications ...
6th International Conference on Natural Language Processing and Computational...
From Insight to Impact: The Evolution of Data-Driven Decision Making in the A...
6th International Conference on Artificial Intelligence and Machine Learning ...
3rd International Conference on Artificial Intelligence and IoT (AIIoT 2025)
International Journal of Artificial Intelligence & Applications (IJAIA)
AI-Driven Vulnerability Analysis in Smart Contracts: Trends, Challenges and F...
International Journal of Artificial Intelligence & Applications (IJAIA)
6th International Conference on Artificial Intelligence and Machine Learning ...
A Thorough Introduction to Multimodal Machine Translation
International Journal of Artificial Intelligence & Applications (IJAIA)
6th International Conference on Advanced Machine Learning (AMLA 2025)
OWE-CVD: An Optimized Weighted Ensemble for Heart Disease Prediction
Ad

Recently uploaded (20)

PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Welding lecture in detail for understanding
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PPTX
Construction Project Organization Group 2.pptx
PPTX
Sustainable Sites - Green Building Construction
PDF
Digital Logic Computer Design lecture notes
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
CH1 Production IntroductoryConcepts.pptx
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
web development for engineering and engineering
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PDF
composite construction of structures.pdf
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Automation-in-Manufacturing-Chapter-Introduction.pdf
Welding lecture in detail for understanding
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Construction Project Organization Group 2.pptx
Sustainable Sites - Green Building Construction
Digital Logic Computer Design lecture notes
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Lecture Notes Electrical Wiring System Components
CH1 Production IntroductoryConcepts.pptx
bas. eng. economics group 4 presentation 1.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
web development for engineering and engineering
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
UNIT-1 - COAL BASED THERMAL POWER PLANTS
composite construction of structures.pdf

Comparing LLMs using a Unified Performance Ranking System

  • 1. Comparing LLMs Using a Unified Performance Ranking System Maikel Leon Department of Business Technology, Miami Herbert Business School, Abstract. Large Language Models (LLMs) have transformed natural language processing and AI-driven applications. These advances include OpenAI’s GPT, Meta’s LLaMA, and Google’s PaLM. These advances have happened quickly. Finding a common metric to compare these models presents a substantial barrier for researchers and practitioners, notwithstanding their transformative power. This research proposes a novel performance ranking metric to satisfy the pressing demand for a complete evaluation system. Our statistic comprehensively compares LLM capacities by combining qualitative and quantitative evaluations. We examine the advantages and disadvantages of top LLMs by thorough benchmarking, providing insightful information on how they compare performance. This project aims to progress the development of more reliable and effective language models and make it easier to make well-informed decisions when choosing models. Keywords: Large Language Models (LLMs), Performance Evaluation, Benchmarking, Qualitative Anal- ysis, and Quantitative Metrics. 1 Introduction Artificial intelligence (AI) has evolved significantly over the past several decades, rev- olutionizing various industries and transforming how we interact with technology. The journey from early AI systems to modern LLMs is marked by machine learning (ML) and deep learning advancements. Initially, AI focused on rule-based systems and symbolic reasoning, which laid the groundwork for more sophisticated approaches [1]. The advent of ML introduced data-driven techniques that enabled systems to learn and improve from experience. Deep learning further accelerated This paradigm shift by leveraging neural networks to model complex patterns and achieve unprecedented performance levels in tasks such as image and speech recognition. The development of LLMs, such as GPT-3 and beyond, represents the latest frontier in this evolution, harnessing vast amounts of data and computational power to generate human-like text and perform a wide array of language-related tasks. This paper explores the progression from traditional AI to ML, deep learning, and the emergence of LLMs, highlighting key milestones, technological ad- vancements, and their implications for the future of AI. LLMs have emerged as transformative tools in Natural Language Processing (NLP), demonstrating unparalleled capabilities in understanding and generating human language. Models such as OpenAI’s GPT, Meta’s LLaMA, and Google’s PaLM have set new bench- marks in tasks ranging from text completion to sentiment analysis. These advancements have expanded the horizons of what is possible with AI and underscored the critical need for robust evaluation frameworks that can comprehensively assess and compare the effec- tiveness of these models. LLMs represent a culmination of advancements in deep learning, leveraging vast amounts of data and computational power to achieve remarkable linguistic capabilities [2]. Each iteration, from GPT-3 to the latest GPT-4 with 175 billion pa- rameters, has pushed the boundaries of language understanding and generation. Meta’s University of Miami, Florida, USA 33 International Journal of Artificial Intelligence and Applications (IJAIA), Vol.15, No.4, July 2024 DOI:10.5121/ijaia.2024.15403
  • 2. LLaMA, optimized for efficiency with 65 billion parameters, excels in multilingual applica- tions, while Google’s PaLM, with its 540 billion parameters, tackles complex multitasking scenarios [3]. The following are some key advancements: – GPT Series: Known for its versatility in generating coherent text across various domains. – LLaMA: Notable for its efficiency and performance in real-time applications and mul- tilingual contexts. – PaLM: Designed to handle complex question-answering and multitasking challenges with high accuracy. These models have revolutionized healthcare, finance, and education industries, en- hancing customer interactions, automating tasks, and enabling personalized learning ex- periences [4]. Despite their advancements, the evaluation of LLMs remains fragmented and lacks a unified methodology. Current evaluation metrics often focus on specific aspects of model performance, such as perplexity scores or accuracy rates in predefined tasks. How- ever, these metrics do not provide a comprehensive view of overall model effectiveness, leading to challenges in comparing different models directly. Some current limitations are listed below: – Fragmented Metrics: Diverse evaluation criteria hinder direct comparisons between LLMs. – Qualitative vs. Quantitative: Emphasis on either qualitative insights or quantita- tive benchmarks, but not both. – Application-Specific Challenges: Difficulty selecting the most suitable LLM for specific real-world applications. These limitations underscore the need for a standardized evaluation framework inte- grating qualitative assessments with quantitative benchmarks. To address these challenges, this paper proposes a novel performance ranking metric to assess LLM capabilities compre- hensively. Our approach integrates qualitative insights, such as model interpretability and coherence in generated text, with quantitative metrics, including computational efficiency and performance across standardized NLP benchmarks. By synthesizing these dimensions, our metric offers a holistic perspective on LLM performance that facilitates meaningful comparisons and supports informed decision-making in model selection [5]. The following are the objectives of the study: – Develop a standardized evaluation framework for LLMs that captures qualitative and quantitative aspects. – Conduct a comparative analysis of leading models (GPT-4, LLaMA, PaLM) to high- light strengths and limitations. – Propose guidelines for selecting the most suitable LLM for specific NLP applications based on comprehensive evaluation criteria. In addition to proposing a new evaluation methodology, this study provides empirical insights into the performance of leading LLMs across diverse application domains. Table 1 summarizes key characteristics and performance metrics, offering a structured overview of the models under consideration. This study’s contributions are expected to advance the field of NLP by establishing a standardized approach to evaluating LLMs, enhancing transparency, and supporting the development of more effective AI-driven language models. This research aims to acceler- ate progress in AI research and applications by addressing the current gaps in evaluation International Journal of Artificial Intelligence and Applications (IJAIA), Vol.15, No.4, July 2024 34
  • 3. Table 1. Comparison of Leading Large Language Models Model Developer Parameter Count Primary Use Cases GPT-4 OpenAI 175 billion Text generation, code completion LLaMA Meta 65 billion Multilingual tasks, real-time applica- tions PaLM Google 540 billion Complex question answering, multi- tasking methodologies, ultimately benefiting industries and society. Developing a unified perfor- mance ranking metric is crucial for unlocking the full potential of Large Language Models in real-world applications. By providing a comprehensive evaluation framework, this paper aims to contribute to the ongoing dialogue on model evaluation and drive future innova- tions in AI-driven language processing [6]. 2 Understanding Generative AI and LLMs AI encompasses diverse methodologies and approaches tailored for specific tasks and ap- plications. The distinction between regular AI and Generative AI, such as Large Language Models (LLMs), lies in their fundamental approach to data processing and task execution: – Regular AI (Symbolic AI): Traditional AI models rely on explicit programming and predefined rules to process structured data and execute tasks. They excel in tasks with clear rules and well-defined inputs and outputs, such as rule-based systems in chess-playing or automated decision-making processes [7]. – Generative AI (LLMs): Generative AI, exemplified by LLMs, operates differently by learning from vast amounts of unstructured data to generate outputs. These models use deep learning techniques to understand and produce human-like text, exhibiting creativity and adaptability in language tasks. Generative AI represents a paradigm shift in AI and Natural Language Processing (NLP), enabling machines to perform tasks that require understanding and generation of natural language in a way that closely mimics human capabilities. Particularly, LLMs have demonstrated remarkable capabilities across various applications: – Text Generation: LLMs like OpenAI’s GPT series can generate coherent and con- textually relevant text, from short sentences to entire articles, based on prompts or input text. – Translation: Models such as Google’s T5 have shown effective translation capabilities, converting text between multiple languages with high accuracy and fluency. – Question Answering: LLMs are proficient in answering natural language questions based on their understanding of context and information retrieval from large datasets. – Creative Writing: Some LLMs have been trained to generate creative content such as poems, stories, and even music compositions, showcasing their versatility and cre- ativity. – Chatbots and Virtual Assistants: AI-powered chatbots and virtual assistants lever- age LLMs to engage in natural conversations, provide customer support, and perform tasks such as scheduling appointments or making reservations. International Journal of Artificial Intelligence and Applications (IJAIA), Vol.15, No.4, July 2024 35
These examples illustrate how Generative AI, and LLMs specifically, extends beyond traditional AI applications by enabling machines to understand and generate human-like text with contextually appropriate responses and creative outputs [8]. LLMs are a prominent example of Generative AI, distinguished by their ability to process and generate human-like text based on vast amounts of data. These models, particularly those based on Transformer architectures, have revolutionized NLP through:

– Scale: LLMs are trained on massive datasets comprising billions of words or sentences from diverse sources such as books, articles, and websites.
– Contextual Understanding: They exhibit a strong capability to understand and generate text in context, allowing them to produce coherent and contextually relevant responses.
– Generativity: LLMs can generate human-like text, including completing sentences, answering questions, and producing creative content such as poems or stories.
– Transfer Learning: They benefit from transfer learning, where models pre-trained on large datasets can be fine-tuned on specific tasks with smaller, task-specific datasets.

LLMs exemplify the power of Generative AI in harnessing deep learning to achieve remarkable capabilities in understanding and generating natural language. Their ability to generate text that is often indistinguishable from human-written content marks a significant advancement in AI research and applications. LLMs leverage advanced machine learning techniques, primarily deep learning architectures, to achieve their impressive capabilities in NLP. These models are typically based on Transformer architectures, which have become the cornerstone of modern NLP due to their ability to process sequential data efficiently.

The Transformer architecture, introduced by Vaswani et al. (2017), revolutionized NLP by replacing recurrent neural networks (RNNs) and convolutional neural networks (CNNs) with a self-attention mechanism [9]. Key components of the Transformer include:

– Self-Attention Mechanism: Lets the model weigh the significance of different words in a sentence, capturing long-range dependencies efficiently (a minimal sketch of this mechanism appears after the training overview below).
– Multi-Head Attention: Enhances the model's ability to focus on different positions and learn diverse input representations.
– Feedforward Neural Networks: Process the outputs of the attention mechanism to generate context-aware representations [10].
– Layer Normalization and Residual Connections: Aid in stabilizing training and facilitating the flow of gradients through deep networks.

LLMs employ Transformer-based architectures with more layers, parameters, and computational resources to handle larger datasets and achieve state-of-the-art performance in various NLP tasks. Training LLMs involves several stages and techniques to optimize performance and efficiency:

– Pre-training: Initial training on large-scale datasets (e.g., books, articles, web text) to learn general language patterns and representations. Models like GPT-3 are pre-trained on massive corpora to capture broad linguistic knowledge [11].
– Fine-tuning: Further training on task-specific datasets (e.g., question answering, text completion) to adapt the model's parameters to specific applications. Fine-tuning enhances model performance and ensures applicability to real-world tasks.
– Regularization Techniques: Methods such as dropout and weight decay prevent overfitting and improve generalization, which is crucial for robust performance across different datasets.
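To make the self-attention mechanism concrete, the following minimal sketch computes single-head scaled dot-product attention in NumPy. The sequence length, dimensions, and random weights are purely illustrative; production Transformers add multiple heads, masking, positional information, and parameters learned end to end.

```python
# A minimal, illustrative sketch of single-head scaled dot-product
# self-attention; all sizes and weights here are arbitrary examples.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise similarities, scaled
    weights = softmax(scores, axis=-1)        # attention matrix; rows sum to 1
    return weights @ V                        # context-aware representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```

Because each row of the attention matrix sums to one, every output token is a weighted mixture of all value vectors, which is precisely how long-range dependencies are captured in a single layer.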
In addition to machine learning architectures, LLMs rely on sophisticated data structures to manage and process vast amounts of textual data efficiently. Key data structures include:

– Tokenizers: Convert raw text into tokens (words, subwords) suitable for model input. Tokenization methods vary, with models like BERT using WordPiece and others using Byte-Pair Encoding (BPE) to handle rare words and subword units effectively (a simplified sketch appears at the end of this section).
– Embeddings: Represent words or tokens as dense vectors in a continuous vector space. Embeddings capture semantic relationships and contextual information, enhancing the model's ability to understand and generate coherent text.
– Attention Matrices: Store the attention weights computed during self-attention operations. These matrices enable the model to focus on relevant parts of input sequences and learn contextual dependencies.
– Cached Computations: Optimize inference speed by caching intermediate computations during attention and feedforward operations, reducing redundant calculations and improving efficiency [12].

These data structures play a critical role in the performance and scalability of LLMs, enabling them to handle large-scale datasets and achieve state-of-the-art results on various NLP benchmarks. Integrating advanced machine learning techniques, such as Transformer architectures, with sophisticated data structures is fundamental to the development and success of Large Language Models (LLMs). These models represent a significant advancement in natural language processing, enabling machines to understand and generate human-like text with unprecedented accuracy and complexity. By leveraging scalable architectures and efficient data handling mechanisms, LLMs continue to push the boundaries of AI research and application, paving the way for transformative innovations in language understanding and generation [13].
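To illustrate the tokenizer item above, here is a simplified, self-contained sketch of WordPiece-style greedy longest-match subword tokenization. The tiny vocabulary is hypothetical; real tokenizers such as BERT's WordPiece or the byte-pair encodings used by GPT-family models learn their vocabularies from large corpora.

```python
# A simplified sketch of WordPiece-style subword tokenization using greedy
# longest-match; the vocabulary below is a made-up toy example.
VOCAB = {"un", "##believ", "##able", "believ", "##e", "the", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:                    # try the longest remaining prefix first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # non-initial pieces carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                  # word cannot be decomposed
        tokens.append(match)
        start = end
    return tokens

print(wordpiece("unbelievable"))              # ['un', '##believ', '##able']
```

Splitting rare words into known subwords keeps the vocabulary small while avoiding most out-of-vocabulary failures; the resulting token IDs are then mapped to the embedding vectors described above.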
3 Evolution of Large Language Models

LLMs have undergone a remarkable evolution over the past decades, driven by advancements in deep learning, computational resources, and the availability of large-scale datasets. This section provides a comprehensive overview of the evolution of LLMs from their early conception to their current capabilities, highlighting key milestones and technological breakthroughs that have shaped their development. The concept of LLMs emerged from early efforts in statistical language modeling and neural networks, aiming to improve the understanding and generation of human language. Traditional approaches such as n-gram models and Hidden Markov Models (HMMs) provided foundational insights into language patterns but were limited in capturing semantic nuances and context. The shift towards neural network-based approaches in the early 2000s marked a significant milestone, laying the groundwork for more sophisticated language models capable of learning hierarchical representations of text. Key milestones include:

– Early 2000s: Development of neural network-based language models, focusing on improving language modeling accuracy and efficiency.
– 2010s: Emergence of recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks, which enhanced the ability to capture long-range dependencies in language [14]. LSTM-based language models showed improved performance in tasks such as text generation and sentiment analysis.
– 2017–2020: Breakthrough of the Transformer architecture, adopted in models like GPT (Generative Pre-trained Transformer) by OpenAI. Transformers revolutionized language modeling by leveraging self-attention mechanisms to capture global dependencies in text, leading to significant improvements in NLP tasks.

The evolution of LLMs has been closely intertwined with advancements in hardware capabilities, algorithmic improvements, and the availability of large-scale datasets. The following table provides an overview of key technological advancements and their impact on the development of LLMs:

Table 2. Technological Advancements in LLMs

Technological Advancement | Impact on LLM Development
Increase in computational power | Enabled training of larger and more complex models (e.g., GPT-3, GPT-4)
Availability of large-scale datasets (e.g., Common Crawl, Wikipedia) | Facilitated pre-training of models on vast amounts of text data, improving language understanding
Introduction of the Transformer architecture | Revolutionized language modeling by capturing global dependencies through self-attention mechanisms
Optimization techniques (e.g., learning rate schedules, gradient normalization) | Enhanced training stability and convergence of deep neural networks

These advancements have propelled LLMs from experimental prototypes to practical tools with broad applications across industries, including healthcare, finance, and education. Integrating advanced technologies has enhanced LLMs' capabilities and expanded their potential to address complex natural language understanding and generation challenges [15]. Recent advancements in LLMs have focused on enhancing model capabilities in several key areas:

– Multimodal Understanding: Integration of vision and language capabilities in models like CLIP (Contrastive Language-Image Pre-training) and DALL-E, enabling tasks such as image captioning and generation.
– Zero-Shot Learning: Ability to perform tasks with minimal or no task-specific training data, demonstrating generalized learning capabilities.
– Ethical Considerations: Increasing focus on fairness, transparency, and bias mitigation in model development and deployment, addressing societal concerns related to AI ethics [16].

These advancements underscore the dynamic nature of LLMs and their potential to reshape the landscape of AI-driven technologies in the coming years. By continually pushing the boundaries of language understanding and generation, LLMs are poised to drive innovation and address real-world challenges across diverse domains. The evolution of Large Language Models (LLMs) from their early conception to their current capabilities reflects significant advancements in deep learning, computational resources, and data availability. As LLMs continue to evolve, driven by innovations in architecture and training techniques, they promise to revolutionize diverse fields ranging from healthcare to finance and beyond. By understanding the historical context and technological milestones of LLM development, researchers and practitioners can better appreciate the transformative potential of these models in advancing AI research and applications.
When evaluating different LLMs, several key parameters must be considered to determine their suitability for specific tasks and applications; Figure 1 provides an overall perspective on well-known models.

Fig. 1. Comparison of Large Language Models (LLMs) Across Key Parameters: Model Size, Multilingual Support, Training Data, Text Generation Capabilities, and Ease of Integration.

4 The need for comparing LLMs

The evaluation of LLMs poses several challenges due to the diversity in model architectures, training methodologies, and evaluation metrics. Existing evaluation frameworks often focus on specific tasks or datasets, leading to fragmented assessments that do not provide a holistic view of model performance across different applications. This fragmented approach hinders effective LLM comparison, making it difficult for researchers, developers, and industry stakeholders to select the most suitable model for specific use cases. Some key challenges are:

– Fragmented Metrics: Current evaluation metrics emphasize task-specific performance (e.g., accuracy, perplexity) without considering broader applicability.
– Lack of Standardization: Absence of a standardized index or benchmark for comparing LLMs across diverse tasks and datasets [17].
– Complexity in Model Comparison: Difficulty in interpreting and comparing results from different evaluation studies due to varied experimental setups and reporting practices.

Addressing these challenges requires a unified index that integrates qualitative assessments and quantitative benchmarks to provide a comprehensive evaluation of LLM capabilities. To bridge this gap, this paper proposes the development of a unified performance index designed to assess and compare LLMs across multiple dimensions. The proposed index would incorporate the following criteria:
– Quantitative Metrics: Performance on standard NLP benchmarks (e.g., GLUE, SuperGLUE, SQuAD) to measure model accuracy and effectiveness on specific tasks (a sketch of one such metric is given at the end of this section).
– Computational Efficiency: Evaluation of model efficiency in terms of inference time, memory usage, and energy consumption, which is crucial for practical deployment.
– Robustness and Generalization: Assessment of model robustness to domain shifts and adversarial inputs, and of the ability to generalize across different datasets and languages.

Table 3 outlines the proposed criteria for the unified performance index:

Table 3. Criteria for Unified Performance Index

Criterion | Description
Quantitative Metrics | Performance on standardized NLP benchmarks (e.g., accuracy, F1 score) across diverse tasks
Computational Efficiency | Evaluation of model inference speed, memory footprint, and energy efficiency
Robustness and Generalization | Assessment of model performance under varying conditions and ability to generalize

By establishing a unified index, stakeholders in academia and industry would benefit from:

– Informed Decision-Making: Facilitated selection of LLMs based on comprehensive performance assessments aligned with specific application requirements.
– Accelerated Research: Enhanced comparability of research findings and accelerated progress in developing more effective LLM architectures and training methodologies.
– Industry Applications: Improved deployment of LLMs in real-world applications, ensuring optimal performance and efficiency in diverse operational contexts.

Overall, developing a unified performance index for LLMs is essential for advancing the field of NLP, fostering transparency, and driving innovation in AI-driven language processing technologies. The lack of a standardized index for comparing Large Language Models (LLMs) represents a significant challenge in current NLP research and applications. This paper aims to address this gap and contribute to advancing LLM evaluation methodologies by proposing a unified performance index that integrates qualitative assessments and quantitative benchmarks. Through systematic comparison and evaluation, stakeholders can make informed decisions, accelerate research progress, and optimize the deployment of LLMs in diverse real-world applications [18].
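To ground the Quantitative Metrics criterion, the sketch below implements a simplified token-overlap F1 score in the spirit of the metric used by question-answering benchmarks such as SQuAD. Real evaluations also normalize answers (lowercasing, removing punctuation and articles) and average over many examples.

```python
# A simplified token-overlap F1 in the spirit of SQuAD-style QA evaluation;
# answer normalization and corpus-level averaging are omitted for brevity.
def f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(f1("the eiffel tower", "eiffel tower"), 2))  # 0.8
```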
5 Designing a metric to evaluate the performance of LLMs: a fictional example

To evaluate the performance of LLMs, we can develop a comprehensive metric that incorporates both quantitative and qualitative aspects. A suitable metric should cover accuracy, contextual understanding, coherence, fluency, and resource efficiency. The proposed metric, the "Comprehensive Language Model Performance Index (CLMPI)," combines these aspects into a single framework. The components of the CLMPI are:

1. Accuracy (ACC):
   – Definition: Measures the factual and grammatical correctness of the responses.
   – Methodology: Compare LLM outputs against a curated dataset of questions and expert answers.
   – Calculation: Percentage of correct answers (factually and grammatically) over the total number of responses.
2. Contextual Understanding (CON):
   – Definition: Assesses the model's ability to understand and integrate context from the conversation or document history.
   – Methodology: Use context-heavy dialogue or document samples to test whether the LLM maintains topic relevance and effectively utilizes the provided historical information.
   – Calculation: Score responses for relevance and context integration on a scale from 0 (no context used) to 5 (excellent use of context).
3. Coherence (COH):
   – Definition: Evaluates how logically connected and structurally sound the responses are.
   – Methodology: Analyze response sequences to ensure logical flow and connection of ideas.
   – Calculation: Human or automated scoring of response sequences on a scale from 0 (incoherent) to 5 (highly coherent).
4. Fluency (FLU):
   – Definition: Measures the linguistic smoothness and readability of the text.
   – Methodology: Analyze responses for natural language use, grammatical correctness, and stylistic fluency.
   – Calculation: Rate responses on a scale from 0 (not fluent) to 5 (very fluent).
5. Resource Efficiency (EFF):
   – Definition: Assesses the computational resources (such as time and memory) used by the LLM for tasks.
   – Methodology: Measure the average time and system resources consumed when generating responses.
   – Calculation: EFF = 1 / (Time Taken (seconds) + Memory Used (MB) / 100)

The CLMPI score is an aggregate, weighted sum of the individual metrics:

CLMPI = (w1 × ACC) + (w2 × CON) + (w3 × COH) + (w4 × FLU) + (w5 × EFF)

where the weights wi are assigned to each metric based on the priority of each aspect, determined by the specific needs and usage context of the LLM. Imagine we are evaluating an LLM designed for academic research assistance:

– Accuracy: The LLM correctly answers 85 out of 100 factual questions. ACC = 85%
– Contextual Understanding: It scores an average of 4.2 on integrating lecture notes into its responses. CON = 4.2
– Coherence: Responses flow logically and are well structured, with an average score of 4.0. COH = 4.0
– Fluency: The text is readable and stylistically appropriate, with minimal grammatical errors, scoring 4.5. FLU = 4.5
– Resource Efficiency: The model uses 200 MB of memory and takes 1.5 seconds on average to generate a response. EFF = 1 / (1.5 + 200/100) ≈ 0.29

Assuming equal weights for simplicity (wi = 1):

CLMPI = 0.85 + 4.2 + 4.0 + 4.5 + 0.29 = 13.84

Because the components are on different scales (accuracy as a fraction, the qualitative scores out of 5, and a small unbounded efficiency term), the raw sum is most meaningful for comparing models evaluated under the same scheme. This score provides a quantitative measure of the LLM's performance across the dimensions critical to its role as an academic aide, and adjusting the weights according to specific performance priorities can further refine the metric. This example illustrates how different aspects of LLM functionality matter for particular applications and how a comprehensive metric like the CLMPI can provide a balanced assessment.

Below is a comparison table for three fictional large language models (LLMs): LLM-A, LLM-B, and LLM-C. The table compares their performance across the critical metrics defined in the Comprehensive Language Model Performance Index (CLMPI): Accuracy (ACC), Contextual Understanding (CON), Coherence (COH), Fluency (FLU), and Resource Efficiency (EFF). In this example, LLM-C is designed to outperform the other models significantly, especially in terms of efficiency and contextual understanding.

Table 4. Comparison of LLM Performance

Metric | LLM-A | LLM-B | LLM-C | Description
Accuracy (ACC) | 78% | 82% | 88% | Percentage of questions answered correctly
Contextual Understanding (CON) | 3.5 | 4.0 | 4.8 | Score out of 5, effectiveness of using context
Coherence (COH) | 3.8 | 4.0 | 4.5 | Score out of 5, logical structuring of text
Fluency (FLU) | 3.9 | 4.3 | 4.7 | Score out of 5, linguistic smoothness
Resource Efficiency (EFF) | 0.25 | 0.30 | 0.45 | Efficiency score, higher is better
Overall CLMPI Score | 2.47 | 2.71 | 3.09 | Weighted sum of all scores

The weights are assigned as follows:

– Accuracy (ACC): 0.25
– Contextual Understanding (CON): 0.20
– Coherence (COH): 0.20
– Fluency (FLU): 0.20
– Resource Efficiency (EFF): 0.15

With these weights, each CLMPI score is calculated as follows:

– LLM-A:
  CLMPI-A = (0.78 × 0.25) + (3.5 × 0.20) + (3.8 × 0.20) + (3.9 × 0.20) + (0.25 × 0.15)
          = 0.195 + 0.70 + 0.76 + 0.78 + 0.0375 ≈ 2.47
– LLM-B:
  CLMPI-B = (0.82 × 0.25) + (4.0 × 0.20) + (4.0 × 0.20) + (4.3 × 0.20) + (0.30 × 0.15)
          = 0.205 + 0.80 + 0.80 + 0.86 + 0.045 = 2.71
– LLM-C:
  CLMPI-C = (0.88 × 0.25) + (4.8 × 0.20) + (4.5 × 0.20) + (4.7 × 0.20) + (0.45 × 0.15)
          = 0.22 + 0.96 + 0.90 + 0.94 + 0.0675 ≈ 3.09

LLM-C outperforms LLM-A and LLM-B across all metrics, most notably in resource efficiency and contextual understanding, which are critical for performance in dynamic and resource-constrained environments. This table illustrates how different models can be evaluated against important characteristics, providing insight into their strengths and weaknesses. Using a weighted metric system such as the CLMPI allows balanced consideration of the various aspects crucial to the practical deployment of LLMs; a short computational sketch of this aggregation follows.
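The following minimal sketch reproduces the CLMPI aggregation for the three fictional models, using the weights and scores above; all values are the illustrative numbers from Table 4, not measurements of real systems.

```python
# A minimal sketch of the CLMPI weighted aggregation from Section 5;
# the weights and per-model scores are the fictional values from Table 4.
WEIGHTS = {"ACC": 0.25, "CON": 0.20, "COH": 0.20, "FLU": 0.20, "EFF": 0.15}

def efficiency(time_s: float, memory_mb: float) -> float:
    """EFF = 1 / (time in seconds + memory in MB / 100)."""
    return 1.0 / (time_s + memory_mb / 100.0)  # e.g., efficiency(1.5, 200) ≈ 0.29

def clmpi(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the five CLMPI components (ACC given as a 0-1 fraction)."""
    return sum(weights[k] * scores[k] for k in weights)

models = {
    "LLM-A": {"ACC": 0.78, "CON": 3.5, "COH": 3.8, "FLU": 3.9, "EFF": 0.25},
    "LLM-B": {"ACC": 0.82, "CON": 4.0, "COH": 4.0, "FLU": 4.3, "EFF": 0.30},
    "LLM-C": {"ACC": 0.88, "CON": 4.8, "COH": 4.5, "FLU": 4.7, "EFF": 0.45},
}

for name, scores in sorted(models.items(), key=lambda kv: -clmpi(kv[1])):
    print(f"{name}: CLMPI = {clmpi(scores):.2f}")
# LLM-C: CLMPI = 3.09
# LLM-B: CLMPI = 2.71
# LLM-A: CLMPI = 2.47
```

Changing the weight vector immediately reranks the models for a given deployment context, which is the main practical appeal of a single composite index.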
6 Reflection

The rapid advancement of Large Language Models (LLMs) has transformed natural language processing (NLP), offering unprecedented capabilities in tasks such as text generation, translation, and sentiment analysis. Models like OpenAI's GPT series, Meta's LLaMA, and Google's PaLM have demonstrated remarkable proficiency in understanding and generating human language, paving the way for applications across diverse domains. However, the absence of a standardized framework for comparing LLMs poses significant challenges in evaluating their performance comprehensively. The landscape lacks a unified index integrating qualitative insights and quantitative metrics to assess LLMs across various dimensions. Evaluation methodologies often focus on specific tasks or datasets, resulting in fragmented assessments that do not provide a holistic view of model capabilities. This fragmentation hinders researchers, developers, and industry stakeholders from making informed decisions regarding model selection and deployment.

Addressing these challenges requires a robust evaluation framework that considers factors such as model accuracy, computational efficiency, and robustness across different domains and languages. Such a framework would facilitate meaningful comparisons between LLMs, enabling researchers to identify each model's strengths, weaknesses, and optimal use cases. The ability to compare Large Language Models (LLMs) accurately is paramount for advancing the field of natural language processing (NLP) and maximizing the potential of AI-driven technologies in real-world applications. By establishing a standardized evaluation framework, stakeholders in academia, industry, and policy-making can benefit in several ways:

– Informed Decision-Making: Facilitated selection of LLMs based on comprehensive performance assessments aligned with specific application requirements.
– Accelerated Research: Enhanced comparability of research findings and accelerated progress in developing more effective LLM architectures and training methodologies.
– Optimized Applications: Improved deployment of LLMs in diverse domains, ensuring optimal performance, efficiency, and ethical considerations [19].

Furthermore, a unified framework for comparing LLMs promotes transparency and reproducibility in AI research, fostering collaboration and innovation across the global scientific community. As LLMs continue to evolve and expand their capabilities, establishing rigorous evaluation standards becomes increasingly critical to unlocking their full potential and addressing societal challenges. In conclusion, developing a standardized evaluation framework for LLMs is essential for advancing AI research, enabling transformative applications, and ensuring responsible deployment of AI technologies. By addressing the current gaps in LLM evaluation, we can harness the power of these models to drive innovation and benefit society at large.

References

1. G. Nápoles, Y. Salgueiro, I. Grau, and M. Leon, "Recurrence-aware long-term cognitive network for explainable pattern classification," IEEE Transactions on Cybernetics, vol. 53, no. 10, pp. 6083–6094, 2023.
2. A. Upadhyay, E. Farahmand, I. Muñoz, M. Akber Khan, and N. Witte, "Influence of LLMs on learning and teaching in higher education," SSRN Electronic Journal, 2024.
3. J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, "Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond," ACM Trans. Knowl. Discov. Data, vol. 18, Apr. 2024.
4. M. Leon, "Business technology and innovation through problem-based learning," in Canada International Conference on Education (CICE-2023) and World Congress on Education (WCE-2023), Infonomics Society, July 2023.
5. N. Capodieci, C. Sanchez-Adames, J. Harris, and U. Tatar, "The impact of generative AI and LLMs on the cybersecurity profession," in 2024 Systems and Information Engineering Design Symposium (SIEDS), pp. 448–453, 2024.
6. G. Nápoles, J. L. Salmeron, W. Froelich, R. Falcon, M. Leon, F. Vanhoenshoven, R. Bello, and K. Vanhoof, Fuzzy Cognitive Modeling: Theoretical and Practical Considerations, pp. 77–87. Springer Singapore, July 2019.
7. G. Nápoles, M. Leon, I. Grau, and K. Vanhoof, "FCM expert: Software tool for scenario analysis and pattern classification based on fuzzy cognitive maps," International Journal on Artificial Intelligence Tools, vol. 27, no. 07, p. 1860010, 2018.
8. A. R. Asadi, "LLMs in design thinking: Autoethnographic insights and design implications," in Proceedings of the 2023 5th World Symposium on Software Engineering, WSSE '23, (New York, NY, USA), pp. 55–60, Association for Computing Machinery, 2023.
9. E. Struble, M. Leon, and E. Skordilis, "Intelligent prevention of DDoS attacks using reinforcement learning and smart contracts," The International FLAIRS Conference Proceedings, vol. 37, May 2024.
10. G. Nápoles, M. L. Espinosa, I. Grau, K. Vanhoof, and R. Bello, Fuzzy cognitive maps based models for pattern classification: Advances and challenges, vol. 360, pp. 83–98. Springer Verlag, 2018.
11. R. D. Pesl, M. Stötzner, I. Georgievski, and M. Aiello, "Uncovering LLMs for service-composition: Challenges and opportunities," in Service-Oriented Computing – ICSOC 2023 Workshops (F. Monti, P. Plebani, N. Moha, H.-y. Paik, J. Barzen, G. Ramachandran, D. Bianchini, D. A. Tamburri, and M. Mecella, eds.), (Singapore), pp. 39–48, Springer Nature Singapore, 2024.
12. M. Leon, L. Mkrtchyan, B. Depaire, D. Ruan, and K. Vanhoof, "Learning and clustering of fuzzy cognitive maps for travel behaviour analysis," Knowledge and Information Systems, vol. 39, no. 2, pp. 435–462, 2013.
13. T. Han, L. C. Adams, K. Bressem, F. Busch, L. Huck, S. Nebelung, and D. Truhn, "Comparative analysis of GPT-4Vision, GPT-4 and open source LLMs in clinical diagnostic accuracy: A benchmark against human expertise," medRxiv, 2023.
14. M. Leon, "Aggregating procedure for fuzzy cognitive maps," The International FLAIRS Conference Proceedings, vol. 36, no. 1, 2023.
15. N. R. Rydzewski, D. Dinakaran, S. G. Zhao, E. Ruppin, B. Turkbey, D. E. Citrin, and K. R. Patel, "Comparative evaluation of LLMs in clinical oncology," NEJM AI, vol. 1, Apr. 2024.
16. H. DeSimone and M. Leon, "Explainable AI: The quest for transparency in business and beyond," in 2024 7th International Conference on Information and Computer Technologies (ICICT), IEEE, Mar. 2024.
17. J. Sallou, T. Durieux, and A. Panichella, "Breaking the silence: the threats of using LLMs in software engineering," in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, ICSE-NIER '24, (New York, NY, USA), pp. 102–106, Association for Computing Machinery, 2024.
18. J. Chen, X. Lu, Y. Du, M. Rejtig, R. Bagley, M. Horn, and U. Wilensky, "Learning agent-based modeling with LLM companions: Experiences of novices and experts using ChatGPT & NetLogo Chat," in Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI '24, (New York, NY, USA), Association for Computing Machinery, 2024.
19. G. Nápoles, F. Hoitsma, A. Knoben, A. Jastrzebska, and M. Leon, "Prolog-based agnostic explanation module for structured pattern classification," Information Sciences, vol. 622, pp. 1196–1227, Apr. 2023.

Author

Dr. Maikel Leon is interested in applying AI/ML techniques to modeling real-world problems using knowledge engineering, knowledge representation, and data mining methods. His most recent research focuses on XAI and has recently been featured in the Information Sciences and IEEE Transactions on Cybernetics journals. Dr. Leon is a reviewer for the International Journal of Knowledge and Information Systems, the Journal of Experimental and Theoretical Artificial Intelligence, Soft Computing, and IEEE Transactions on Fuzzy Systems. He is a Committee Member of the Florida Artificial Intelligence Research Society, a frequent contributor on technology topics for CNN en Español TV, and the winner of the Cuban Academy of Sciences National Award for the Most Relevant Research in Computer Science. Dr. Leon obtained his PhD in Computer Science at Hasselt University, Belgium, having previously studied computation (Master of Science and Bachelor of Science) at the Central University of Las Villas, Cuba.