Evaluating Retrieval-Augmented Generation - Webinar

1 | © Copyright 2024 Zilliz
1
1 | © Copyright 10/22/23 Zilliz
Stefan Webb
Developer Advocate, Zilliz
stefan.webb@zilliz.com
https://www.linkedin.com/in/stefan-webb
https://x.com/stefan_webb
Webinar: Building a Principled RAG Pipeline | Host

2
CONTENTS
01 Introduction
02 Evaluating Foundation Models
03 Challenges and Limitations
04 Open-Source Evaluation Frameworks

3
01 Introduction

5
Why is Semantic Search Important?
90% of newly generated data in 2025 will be unstructured data
10% other
Data Source: The Digitization of the World by IDC

6
RAG Refresher

7
RAG Pipeline Options
Credit: https://github.com/langchain-ai/rag-from-scratch

8
RAG Pipeline Options
Credit: Gao et al., 2024

9
How to Make RAG Design Choices?
Measure effects
Consider alternatives
Update and repeat
How??
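
One way to read the measure / consider alternatives / update loop above is as a comparison of candidate pipelines on a fixed evaluation set. The sketch below is only an illustration: the exact-match metric, the `pick_best` helper, and the two toy pipelines are hypothetical stand-ins, not part of the webinar or of any framework.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluation set: (question, reference answer) pairs.
EvalSet = List[Tuple[str, str]]
# A "pipeline" here is anything that maps a question to an answer string.
Pipeline = Callable[[str], str]


def exact_match_score(pipeline: Pipeline, eval_set: EvalSet) -> float:
    """Fraction of questions answered with the reference (case-insensitive)."""
    if not eval_set:
        return 0.0
    hits = sum(
        pipeline(question).strip().lower() == reference.strip().lower()
        for question, reference in eval_set
    )
    return hits / len(eval_set)


def pick_best(
    candidates: Dict[str, Pipeline], eval_set: EvalSet
) -> Tuple[str, Dict[str, float]]:
    """Measure every alternative on the same data and return the best one."""
    scores = {name: exact_match_score(p, eval_set) for name, p in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for two RAG design choices (assumptions for illustration).
def baseline(question: str) -> str:
    return "paris" if "France" in question else "unknown"


def with_reranker(question: str) -> str:
    answers = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
    return answers.get(question, "unknown")


eval_set = [("Capital of France?", "Paris"), ("Capital of Japan?", "Tokyo")]
best, scores = pick_best(
    {"baseline": baseline, "with_reranker": with_reranker}, eval_set
)
print(best, scores)
```

In a real pipeline the metric would be one of the evaluation methods discussed later, and the loop would be repeated after each design change.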

10
Scope of Webinar
Not covered:
● Deeper discussion of limitations of LLM-as-a-Judge
● Further / alternative methods
● Evaluating output:
○ Agents
○ Multimodal models
○ Online
● Evaluating latency
● Adversarial attacks
Covered:
● Fundamentals of LLM-as-a-Judge
● Evaluating output:
○ LLMs
○ RAG
○ Offline
● Frameworks:
○ LM Harness
○ RAGAS

11
02 Evaluating Foundation
Models

12
How to Measure “Performance”?
● Evaluation on a task vs. evaluation of the model’s own output (“introspection”)
● Comparing the answer to ground truth vs. comparing the output to the context
● Evaluating retrieval vs. evaluating LLM output
● With ground truth vs. without ground truth
● Human evaluation is the “gold standard” but doesn’t scale well
● Surprisingly, (strong) LLMs can evaluate LLMs when there is no ground truth

13
Task-Based Evaluation
● Knowledge-based
○ MMLU
○ HellaSwag
○ ARC
● Instruction following
○ Flan
○ Self-instruct
○ NaturalInstructions
● Conversational
○ CoQA
○ MMDialog
○ OpenAssistant
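
Knowledge benchmarks of the multiple-choice kind are typically scored as plain accuracy against a ground-truth answer key. The following is a minimal sketch of that scoring, assuming a made-up item format (`MCItem`) and made-up questions; it does not reproduce the data format of MMLU, HellaSwag, or ARC.

```python
from typing import Callable, List, NamedTuple


class MCItem(NamedTuple):
    """Hypothetical multiple-choice item with a ground-truth answer index."""
    question: str
    choices: List[str]
    answer_index: int


def accuracy(predict: Callable[[MCItem], int], items: List[MCItem]) -> float:
    """Share of items where the predicted choice index equals the answer key."""
    if not items:
        return 0.0
    correct = sum(predict(item) == item.answer_index for item in items)
    return correct / len(items)


# Made-up items for illustration only.
ITEMS = [
    MCItem("2 + 2 = ?", ["3", "4", "5", "6"], 1),
    MCItem("Which of these is a vowel?", ["a", "b", "c", "d"], 0),
    MCItem("H2O is commonly called?", ["salt", "sand", "water", "air"], 2),
]


def always_first(item: MCItem) -> int:
    """Trivial baseline 'model' that always picks the first choice."""
    return 0


task_accuracy = accuracy(always_first, ITEMS)
print(f"accuracy = {task_accuracy:.3f}")
```

A real model would replace `always_first`, for example by picking the choice to which it assigns the highest likelihood.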

14
“Introspection”-based Evaluation
● Generation-based
○ Faithfulness / groundedness
■ “Measures the factual consistency of the generated
answer against the given context. If any claims are made
in the answer that cannot be deduced from context,
then these will be penalized.”
○ Answer Relevancy
■ “Refers to the degree to which a response directly
addresses and is appropriate for a given question or
context.”
● Retrieval-based
○ Context Relevance
■ “Measures how relevant retrieved contexts are to the
question. Ideally, the context should only contain
information necessary to answer the question.”
○ Context Recall / retrieval-based metrics
■ Measures the recall of the retrieved context using the
annotated answer as ground truth.
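
Both faithfulness and context recall, as quoted above, can be read as a "fraction of statements supported by the context": for faithfulness the statements are claims in the generated answer, for context recall they are statements from the annotated answer. The sketch below assumes the statements have already been extracted and uses a toy substring check as the judge; in practice the judge would usually be an LLM call, and the names `supported_fraction` and `toy_judge` are hypothetical.

```python
from typing import Callable, List

# (statement, context) -> True if the context supports the statement.
Judge = Callable[[str, str], bool]


def supported_fraction(statements: List[str], context: str, judge: Judge) -> float:
    """Fraction of statements that the judge deems supported by the context."""
    if not statements:
        return 0.0
    supported = sum(1 for statement in statements if judge(statement, context))
    return supported / len(statements)


def toy_judge(statement: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return statement.lower() in context.lower()


context = "The Eiffel Tower is in Paris. It was completed in 1889."

# Faithfulness: claims taken from the generated answer.
answer_claims = ["The Eiffel Tower is in Paris", "It is 500 metres tall"]
faithfulness = supported_fraction(answer_claims, context, toy_judge)

# Context recall: statements taken from the annotated (ground-truth) answer.
reference_statements = [
    "The Eiffel Tower is in Paris",
    "It was completed in 1889",
    "It was built for a world's fair",
]
context_recall = supported_fraction(reference_statements, context, toy_judge)

print(f"faithfulness = {faithfulness:.2f}, context recall = {context_recall:.2f}")
```

The unsupported claim lowers faithfulness, and the reference statement missing from the retrieved context lowers recall.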

15
03 Challenges and Limitations

16
Position Bias
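
A common way to probe a pairwise judge for position bias is to ask it twice with the answer order swapped and only accept a verdict that survives the swap. The sketch below assumes a hypothetical judge interface that returns "first", "second", or "tie"; both toy judges are stand-ins for an LLM call.

```python
from typing import Callable

# judge(question, first_answer, second_answer) -> "first" | "second" | "tie"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_verdict(
    judge: PairwiseJudge, question: str, answer_a: str, answer_b: str
) -> str:
    """Return "A" or "B" only if the judge agrees under both orderings."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")
    return forward_winner if forward_winner == backward_winner else "tie"


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whichever answer is shown first."""
    return "first"


def keyword_judge(question: str, first: str, second: str) -> str:
    """Toy judge that prefers the answer mentioning 'Paris', in any position."""
    first_ok, second_ok = "Paris" in first, "Paris" in second
    if first_ok == second_ok:
        return "tie"
    return "first" if first_ok else "second"


question = "What is the capital of France?"
a, b = "It is Lyon.", "It is Paris."
biased_result = debiased_verdict(position_biased_judge, question, a, b)
consistent_result = debiased_verdict(keyword_judge, question, a, b)
print(biased_result, consistent_result)
```

The position-biased judge flips its winner when the order changes, so its verdict collapses to a tie, while the order-independent judge keeps its preference.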

17
Verbosity Bias
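
One simple probe for verbosity bias is to pad an answer with text that adds no information and check whether the judge's score goes up. This is a sketch under assumptions: the scoring interface, the `verbosity_gap` helper, and both toy judges are hypothetical and stand in for an LLM-based scorer.

```python
from typing import Callable

# score(question, answer) -> numeric rating; stand-in for an LLM judge.
Scorer = Callable[[str, str], float]


def verbosity_gap(score: Scorer, question: str, answer: str, filler: str) -> float:
    """Score change caused by appending content-free filler to the answer.

    A clearly positive gap suggests the judge rewards length, not content.
    """
    padded = f"{answer} {filler}"
    return score(question, padded) - score(question, answer)


def length_loving_judge(question: str, answer: str) -> float:
    """Toy judge whose score grows with word count, capped at 10."""
    return min(10.0, float(len(answer.split())))


def keyword_judge(question: str, answer: str) -> float:
    """Toy judge that only checks for the expected keyword."""
    return 10.0 if "Paris" in answer else 0.0


question = "What is the capital of France?"
answer = "Paris."
filler = "To restate the point, the answer given above is the answer."

gap_biased = verbosity_gap(length_loving_judge, question, answer, filler)
gap_unbiased = verbosity_gap(keyword_judge, question, answer, filler)
print(gap_biased, gap_unbiased)
```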

18
Can Answer But Can’t Judge

19
C-o-T Failure Mode

20
Data Quality
Like in other parts of Gen AI, the main
challenge is producing the right dataset!

21
Fine-tuned Judge Models
● https://huggingface.co/nuclia/REMi-v0
● https://huggingface.co/grounded-ai
● https://huggingface.co/prometheus-eval
● https://huggingface.co/flowaicom/Flow-Judge-v0.1
● https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B

22
04 Open-Source Evaluation
Frameworks

23
LM Evaluation Harness
● https://github.com/EleutherAI/lm-evaluation-harness
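
As a rough sketch of how the harness can be driven from Python, the snippet below wraps its `simple_evaluate` entry point. It assumes the `lm-eval` package is installed (`pip install lm-eval`) and that model weights can be downloaded; the model name, task, batch size, and `limit` value are illustrative choices, not recommendations from the webinar. Check the project's README for the options supported by your installed version.

```python
def run_harness(model_name: str = "EleutherAI/pythia-160m", limit: int = 20) -> dict:
    """Run a small HellaSwag evaluation and return the per-task metrics.

    The import is kept inside the function so that defining it does not
    require lm-eval to be installed.
    """
    import lm_eval  # assumption: installed via `pip install lm-eval`

    results = lm_eval.simple_evaluate(
        model="hf",                              # Hugging Face backend
        model_args=f"pretrained={model_name}",   # which checkpoint to load
        tasks=["hellaswag"],                     # one of the tasks named earlier
        num_fewshot=0,
        batch_size=8,
        limit=limit,                             # evaluate a small subset only
    )
    return results["results"]


# Example (downloads the model and dataset, so it is not run here):
# print(run_harness())
```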

24
RAGAS
Covered in the following webinar!

25
ARES, HuggingFace Lighteval, etc.
Covered in the following webinar!

26
About Milvus
Milvus is an open-source vector database for
GenAI projects. pip install on your laptop, plug into
popular AI dev tools, and push to production with
a single line of code.
30K GitHub Stars
66M Docker Pulls
400 Contributors
2.7K Forks
Easy Setup: Pip-install to start coding in a notebook within seconds
Integration: Plug into OpenAI, LangChain, LlamaIndex, and many more
Reusable Code: Write once, and deploy with one line of code into the production environment
Feature-rich: Dense & sparse embeddings, filtering, reranking and beyond

27
Milvus Users

29
Summary
● There are many methods, models, etc. We need quantitative eval for principled design
● Tasks can evaluate LLMs when there’s a ground truth
● LLMs can evaluate LLMs when there’s no ground truth (also, when there is a ground truth)
● There are several excellent open-source evaluation frameworks, although it is a relatively nascent area

30
THANK YOU

31
https://milvus.io/discord
https://github.com/milvus-io/milvus
https://x.com/milvusio
https://www.linkedin.com/company/the-milvus-project
LET’S STAY CONNECTED!
Stefan Webb
Developer Advocate, Zilliz

32
Become a Speaker!
Interested in speaking at and/or sponsoring a
Zilliz Unstructured Data Meetup?
Fill out this form
🎤🎤🎤

33
Join us at our next meetup!
lu.ma/unstructured-data-meetup
Nov 13, South Bay, Sports Basement Sunnyvale
“Evaluating Retrieval-Augmented Generation”
Nov 19, San Francisco, GitHub Office
“Structured Output from Unstructured Input: Part 2”
