1 | © Copyright 2024 Zilliz
Stefan Webb
Developer Advocate, Zilliz
stefan.webb@zilliz.com
https://www.linkedin.com/in/stefan-webb
https://x.com/stefan_webb
Webinar: Building a Principled RAG Pipeline | Host
CONTENTS
01 Introduction
02 Evaluating Foundation Models
03 Challenges and Limitations
04 Open-Source Evaluation Frameworks
01 Introduction
Why is Semantic Search Important?
90% of newly generated data in 2025 will be unstructured data (10% other)
Data Source: The Digitization of the World by IDC
RAG Refresher
RAG Pipeline Options
Credit: https://github.com/langchain-ai/rag-from-scratch
RAG Pipeline Options
Credit: Gao et al., 2024
How to Make RAG Design Choices?
Measure effects
Consider alternatives
Update and repeat
How??
Scope of Webinar
Not covered in this webinar:
● Deeper discussion of limitations of LLM-as-a-Judge
● Further / alternative methods
● Evaluating output:
○ Agents
○ Multimodal models
○ Online
● Evaluating latency
● Adversarial attacks
Covered in this webinar:
● Fundamentals of LLM-as-a-Judge
● Evaluating output:
○ LLMs
○ RAG
○ Offline
● Frameworks:
○ LM Harness
○ RAGAS
02 Evaluating Foundation Models
How to Measure “Performance”?
● Evaluation on a task vs. “introspection” (evaluating the output itself)
● Comparing the answer to ground truth vs. comparing the output to the context
● Evaluating retrieval vs. evaluating LLM output
● With ground truth vs. without ground truth
● Human evaluation is the “gold standard” but doesn’t scale well
● Surprisingly, (strong) LLMs can evaluate LLMs when there is no ground truth
Task-Based Evaluation
● Knowledge-based
○ MMLU
○ HellaSwag
○ ARC
● Instruction following
○ Flan
○ Self-instruct
○ NaturalInstructions
● Conversational
○ CoQA
○ MMDialog
○ OpenAssistant
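A minimal sketch of what task-based evaluation against a ground truth boils down to for a multiple-choice benchmark: accuracy over gold answer indices. The example records, the `predict` callback, and the `longest_choice_baseline` stand-in are hypothetical illustrations, not the format of any benchmark listed above.

```python
from typing import Callable, Sequence


def multiple_choice_accuracy(
    examples: Sequence[dict],
    predict: Callable[[str, Sequence[str]], int],
) -> float:
    """Fraction of examples where the predicted choice index equals the gold index."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        1
        for ex in examples
        if predict(ex["question"], ex["choices"]) == ex["answer"]
    )
    return correct / len(examples)


def longest_choice_baseline(question: str, choices: Sequence[str]) -> int:
    """Hypothetical stand-in for a model: always picks the longest choice."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))


# Toy data in an assumed format: "answer" is the index of the gold choice.
examples = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "five"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "answer": 0},
]
score = multiple_choice_accuracy(examples, longest_choice_baseline)
print(score)
```

In a real run, `predict` would wrap a model call, for example by choosing the option with the highest log-likelihood.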
“Introspection”-based Evaluation
● Generation-based
○ Faithfulness / groundedness
■ “Measures the factual consistency of the generated answer against the given context. If any claims are made in the answer that cannot be deduced from context, then these will be penalized.”
○ Answer Relevancy
■ “Refers to the degree to which a response directly addresses and is appropriate for a given question or context.”
● Retrieval-based
○ Context Relevance
■ “Measures how relevant retrieved contexts are to the question. Ideally, the context should only contain information necessary to answer the question.”
○ Context Recall / retrieval-based metrics
■ Measures the recall of the retrieved context using the annotated answer as ground truth.
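Two of the metrics above reduce to a simple ratio once each statement has been judged: faithfulness is the share of answer claims supported by the context, and context recall is the share of annotated-answer statements attributable to the retrieved context. The sketch below assumes the claims have already been extracted, and uses a hypothetical `word_overlap_judge` where an LLM judge would normally sit. It illustrates the idea and is not the implementation of any particular framework.

```python
from typing import Callable, Sequence

Judge = Callable[[str, str], bool]  # (statement, context) -> is it supported?


def supported_fraction(statements: Sequence[str], context: str, judge: Judge) -> float:
    """Share of statements that the judge considers supported by the context."""
    if not statements:
        return 0.0
    return sum(1 for s in statements if judge(s, context)) / len(statements)


def faithfulness(answer_claims: Sequence[str], context: str, judge: Judge) -> float:
    """Claims made in the generated answer that can be deduced from the context."""
    return supported_fraction(answer_claims, context, judge)


def context_recall(reference_statements: Sequence[str], context: str, judge: Judge) -> float:
    """Statements from the annotated answer that the retrieved context covers."""
    return supported_fraction(reference_statements, context, judge)


def word_overlap_judge(statement: str, context: str) -> bool:
    """Hypothetical stand-in for an LLM judge: every word must occur in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in statement.lower().split())


context = "milvus is an open-source vector database"
answer_claims = ["milvus is open-source", "milvus is a graph database"]
reference_statements = ["milvus is an open-source vector database", "milvus supports reranking"]

faith = faithfulness(answer_claims, context, word_overlap_judge)
recall = context_recall(reference_statements, context, word_overlap_judge)
print(faith, recall)
```

Keeping the judge as a parameter makes it easy to swap the toy rule for a prompt to a strong LLM or to one of the fine-tuned judge models.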
03 Challenges and Limitations
Position Bias
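One commonly used mitigation for position bias in pairwise judging, not taken from the slide itself, is to query the judge twice with the answer order swapped and to accept a verdict only when both passes agree. The sketch below assumes a hypothetical judge callback that returns "first", "second", or "tie". Both toy judges are illustrations only.

```python
from typing import Callable

# (question, first_answer, second_answer) -> "first" | "second" | "tie"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(question: str, answer_a: str, answer_b: str, judge: PairwiseJudge) -> str:
    """Return "A", "B", or "tie", keeping a winner only if it survives an order swap."""
    verdict_ab = judge(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge(question, answer_b, answer_a)  # B shown first
    winner_ab = {"first": "A", "second": "B"}.get(verdict_ab, "tie")
    winner_ba = {"first": "B", "second": "A"}.get(verdict_ba, "tie")
    return winner_ab if winner_ab == winner_ba else "tie"


def always_first_judge(question: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: prefers whatever is shown first."""
    return "first"


def length_judge(question: str, first: str, second: str) -> str:
    """Toy judge whose verdict depends only on the answers, not on their order."""
    return "first" if len(first) > len(second) else "second"


biased = debiased_preference("q", "short", "a much longer answer", always_first_judge)
stable = debiased_preference("q", "short", "a much longer answer", length_judge)
print(biased, stable)
```

The position-biased judge contradicts itself after the swap and is reduced to a tie, while the order-independent judge keeps its verdict.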
Verbosity Bias
Can Answer But Can’t Judge
Chain-of-Thought (CoT) Failure Mode
Data Quality
As in other parts of GenAI, the main challenge is producing the right dataset!
Fine-tuned Judge Models
● https://huggingface.co/nuclia/REMi-v0
● https://huggingface.co/grounded-ai
● https://huggingface.co/prometheus-eval
● https://huggingface.co/flowaicom/Flow-Judge-v0.1
● https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B
04 Open-Source Evaluation Frameworks
LM Evaluation Harness
● https://github.com/EleutherAI/lm-evaluation-harness
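A hedged sketch of calling the harness from Python rather than from its command line. It assumes the package is installed (`pip install lm-eval`) and that `lm_eval.simple_evaluate` accepts the arguments shown. The model name and the `limit` value are arbitrary choices for a quick smoke test, so check the repository README for the current interface before relying on it.

```python
def run_hellaswag(model_name: str = "EleutherAI/pythia-160m", limit: int = 20) -> dict:
    """Evaluate a Hugging Face model on HellaSwag and return that task's metrics."""
    import lm_eval  # imported lazily: requires `pip install lm-eval`

    results = lm_eval.simple_evaluate(
        model="hf",                             # Hugging Face transformers backend
        model_args=f"pretrained={model_name}",  # which checkpoint to load
        tasks=["hellaswag"],                    # one of the task-based benchmarks
        num_fewshot=0,
        limit=limit,                            # only a few examples: smoke test, not a real score
    )
    # Metrics are keyed by task name under "results".
    return results["results"]["hellaswag"]
```

Calling `run_hellaswag()` downloads the model and dataset, so it needs network access. Drop `limit` to score the full task.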
RAGAS
Covered in the following webinar!
ARES, HuggingFace Lighteval, etc.
Covered in the following webinar!
About Milvus
Milvus is an open-source vector database for GenAI projects. pip install on your laptop, plug into popular AI dev tools, and push to production with a single line of code.
30K GitHub Stars | 66M Docker Pulls | 400 Contributors | 2.7K Forks
● Easy Setup: Pip-install to start coding in a notebook within seconds
● Integration: Plug into OpenAI, LangChain, LlamaIndex, and many more
● Reusable Code: Write once, and deploy with one line of code into the production environment
● Feature-rich: Dense & sparse embeddings, filtering, reranking and beyond
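As a rough illustration of the pip-install workflow described above, the sketch below uses the `MilvusClient` class from `pymilvus` with a local database file (Milvus Lite). The file path, collection name, 4-dimensional toy vectors, and the `text` field are all hypothetical. A real pipeline would insert embeddings from an embedding model.

```python
def milvus_quickstart(db_path: str = "./milvus_demo.db") -> list:
    """Create a tiny collection, insert two vectors, and search for the nearest one."""
    from pymilvus import MilvusClient  # imported lazily: requires `pip install pymilvus`

    client = MilvusClient(db_path)  # a local file path selects Milvus Lite
    name = "demo_collection"        # hypothetical collection name
    if client.has_collection(collection_name=name):
        client.drop_collection(collection_name=name)
    client.create_collection(collection_name=name, dimension=4)

    client.insert(
        collection_name=name,
        data=[
            {"id": 0, "vector": [0.1, 0.2, 0.3, 0.4], "text": "RAG evaluation"},
            {"id": 1, "vector": [0.9, 0.8, 0.7, 0.6], "text": "vector search"},
        ],
    )
    # Return the single nearest neighbour of the query vector, with its text field.
    return client.search(
        collection_name=name,
        data=[[0.1, 0.2, 0.3, 0.4]],
        limit=1,
        output_fields=["text"],
    )
```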
Milvus Users
github.com/milvus-io/milvus zilliz.com/learn/generative-ai
Summary
● There are many methods, models, etc.
● We need quantitative eval for principled design
● Tasks can evaluate LLMs when there’s a ground truth
● LLMs can evaluate LLMs when there’s no ground truth (also, when there is a ground truth)
● There are several excellent open-source evaluation frameworks, although it is a relatively nascent area
THANK YOU
LET’S STAY CONNECTED!
https://milvus.io/discord
https://github.com/milvus-io/milvus
https://x.com/milvusio
https://www.linkedin.com/company/the-milvus-project
Stefan Webb
Developer Advocate, Zilliz
Become a Speaker!
Interested in speaking at and/or sponsoring a Zilliz Unstructured Data Meetup?
Fill out this form 🎤🎤🎤
Join us at our next meetup!
lu.ma/unstructured-data-meetup
Nov 13, South Bay, Sports Basement Sunnyvale
“Evaluating Retrieval-Augmented Generation”
Nov 19, San Francisco, GitHub Office
“Structured Output from Unstructured Input: Part 2”

Evaluating Retrieval-Augmented Generation - Webinar
