Eduardo Ordax’s Post

Eduardo Ordax

🤖 Generative AI Lead @ AWS ☁️ (150k+) | Startup Advisor | Public Speaker | AI Outsider

𝗧𝗵𝗲 𝗶𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝗰𝗲 𝗼𝗳 𝗟𝗟𝗠 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻! 🧠 ✔

Although evaluating the outputs of LLMs is key, it remains a challenging task for many. Deciding on the right set of evaluation metrics, along with the strategy, is imperative before moving to production. In this post I’ll help you with both!

1️⃣ 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 𝗦𝗰𝗼𝗿𝗲 (not ideal for LLMs, as these metrics don’t take semantics into account)
🔸 BLEU: evaluates the output against annotated ground truths
🔸 ROUGE: evaluates text summaries (an NLP task) and calculates recall
🔸 METEOR: calculates scores by assessing both precision and recall
🔸 Levenshtein distance: calculates the minimum number of single-character edits required to change one word into another

2️⃣ 𝗠𝗼𝗱𝗲𝗹-𝗕𝗮𝘀𝗲𝗱 𝗦𝗰𝗼𝗿𝗲 & 𝗡𝗼𝗻-𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 (rely on NLP models; more accurate, but also less reliable due to their probabilistic nature)
➡ 𝗡𝗼𝗻-𝗟𝗟𝗠 𝗦𝗰𝗼𝗿𝗲
🔸 NLI: uses Natural Language Inference models to classify whether an LLM output is logically consistent, contradictory, or unrelated
🔸 BLEURT: uses pre-trained models like BERT to score LLM outputs against expected outputs
➡ 𝗟𝗟𝗠 𝗘𝘃𝗮𝗹
🔸 G-Eval: uses LLMs to evaluate LLM outputs
🔸 Prometheus: a fully open-source LLM whose evaluation capabilities are comparable to GPT-4’s

3️⃣ 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 & 𝗠𝗼𝗱𝗲𝗹-𝗕𝗮𝘀𝗲𝗱 𝗦𝗰𝗼𝗿𝗲
🔸 BERTScore & MoverScore: rely on pre-trained language models like BERT and compute cosine similarity between embeddings
🔸 GPTScore: uses the conditional probability of generating the target text as an evaluation metric
🔸 SelfCheckGPT: a simple sampling-based approach used to fact-check LLM outputs
🔸 QAG Score: leverages LLMs’ reasoning capabilities to reliably evaluate LLM outputs

And finally, the right 𝗟𝗟𝗠 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗮𝗽𝗽𝗿𝗼𝗮𝗰𝗵 depends on:
❇ Do you have the ground truth?
❇ Does the model produce discrete outputs?
❇ Do you want to automate the process?

#LLM #AI #GenAI #Evaluation
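To make 1️⃣ concrete, here is a minimal Python sketch of two of the statistical scores: sentence-level BLEU via NLTK and a hand-rolled Levenshtein distance. It assumes a single annotated ground truth per output and plain whitespace tokenization; production setups typically use multiple references and proper tokenizers.

```python
# Minimal sketch of two statistical scores, assuming one reference per output
# and whitespace tokenization. BLEU uses NLTK (pip install nltk); the
# Levenshtein distance is implemented by hand.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU of a candidate against one annotated ground truth."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    # Smoothing avoids zero scores on short outputs with no 4-gram overlap.
    return sentence_bleu([ref_tokens], cand_tokens,
                         smoothing_function=SmoothingFunction().method1)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    out = "the cat is sitting on the mat"
    print(f"BLEU: {bleu(ref, out):.3f}")
    print(f"Levenshtein distance: {levenshtein(ref, out)}")
```

The same pattern extends to ROUGE (e.g. via the rouge-score package) and to model-based metrics such as BERTScore, which swap n-gram overlap for embedding similarity.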

[Attached: diagram]
Evgeny Krapivin

Entrepreneur • Engineering Leader • Software Architect

1y

Great visual and great insight, Eduardo!

Kaan Kabalak

Grinding, Nerding and Marketing for Data & AI Products | Data Product Manager @ Witful Vision

1y

Love these diagrams Eduardo Ordax, please keep them coming 💚

Ashish Patel 🇮🇳

Sr Principal AI Architect at Oracle | Generative AI Expert & Strategist | xIBMer | Author: Hands-on Time Series Analytics with Python | IBM Quantum ML Certified | 13+ Yrs AI | IIMA | 100K+ Followers | 6x LinkedInTopVoice

1y

How do you effectively address the potential bias and ethical considerations while implementing these evaluation metrics in real-world AI applications?

Mahesh A.

6+ years' experience as a Data Scientist and Gen AI Developer. Technical skills: Gen AI, LLMs, RAG, AI/ML, Python, Flask, APIs, Microservices, Snowflake, MySQL, Amazon Personalize. OTT content recommendation system development.

1y

Good to know!

Very informative, thanks a lot Eduardo Ordax 😊🤝👍
