Retrieval Augmented Generation Evaluation with Ragas

© Copyright 11/17/23 Zilliz
Speaker
Christy Bergman
Developer Advocate, Zilliz
https://www.linkedin.com/in/christybergman/
https://github.com/milvus-io/milvus
discord: https://discord.gg/FjCMmaJng6

Image source: https://thedataquarry.com/posts/vector-db-1/

3 Pillars of Generative AI

Opportunities in Unstructured Data

THANK YOU
We need your stars!
https://github.com/milvus-io/milvus
💬 Join our discord: https://discord.gg/FjCMmaJng6

AGENDA
01 AI Hallucinations and RAG
02 4 Challenges
03
04 RAG Evaluation Methods
Demo

01 AI Hallucinations and RAG

Example AI Hallucination
gemini
wikipedia
hallucinated answer

Why do models hallucinate?
• LLMs hallucinate because …
• They are trained on sequences of words (tokens)
Sample Data
The hamster cabinet …
!!@#%# …
Monkey eats shark …
trees in the moons…

Where do Vectors Come From?
Unstructured Data → Pre-trained Deep Learning Models → Vectors (embeddings) → Vector Database

Semantic Similarity
Image from Sutor et al
Woman = [0.3, 0.4]
Queen = [0.3, 0.9]
King = [0.5, 0.7]
Man = [0.5, 0.2]
Queen - Woman + Man = King
  Queen = [0.3, 0.9]
- Woman = [0.3, 0.4]
        = [0.0, 0.5]
+ Man   = [0.5, 0.2]
  King  = [0.5, 0.7]
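
The Queen - Woman + Man arithmetic on this slide can be checked in a few lines of Python. This is a minimal sketch using the toy 2-D vectors from the slide; the helper names (`add`, `sub`, `cosine_similarity`) are illustrative, and real embedding models produce vectors with hundreds of dimensions.

```python
import math

# Toy 2-D vectors copied from the slide.
vectors = {
    "woman": [0.3, 0.4],
    "queen": [0.3, 0.9],
    "king": [0.5, 0.7],
    "man": [0.5, 0.2],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Queen - Woman + Man
result = add(sub(vectors["queen"], vectors["woman"]), vectors["man"])
result = [round(x, 10) for x in result]  # hide floating-point noise

# Find the stored word whose vector points in the most similar direction.
nearest = max(vectors, key=lambda w: cosine_similarity(vectors[w], result))
print(result, nearest)
```

With these toy values the result lands on the King vector, which is the point of the slide: directions in the vector space carry meaning.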

Retrieval Augmented Generation (RAG)
Your Data → Embedding Model → Vector Database
Question → Search → Question + Context → Gen AI Model → Reliable Answers
Question: What is the default AUTOINDEX distance metric in Milvus?
Answer: The default AUTOINDEX distance metric in Milvus is L2.
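
The RAG flow above can be sketched end to end without any external service. This is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the in-memory `index` list stands in for the vector database, and the final call to a Gen AI model is left out, so the sketch stops at the "Question + Context" prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Your Data": chunks are embedded once and stored (stand-in for the vector database).
chunks = [
    "The default AUTOINDEX distance metric in Milvus is L2.",
    "Milvus is an open-source vector database.",
    "Chunking splits documents into smaller pieces before embedding.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def search(question, top_k=1):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(question, top_k=1):
    """Combine the retrieved context with the question for the generation step."""
    context = "\n".join(search(question, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context above."

prompt = build_prompt("What is the default AUTOINDEX distance metric in Milvus?")
print(prompt)
```

In a real system the prompt would be sent to the generation model, and the quality of the answer depends on whether the search step retrieved the right chunk.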

Pain Point #3: Chunking
• Conversation Data: add conversation memory
• Documentation Data
• Lecture or Q/A Data: use Q&A tuple formatting
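
One possible reading of "Q&A tuple formatting" is to keep each question and its answer together as a single chunk, so retrieval never returns an answer without its question. The sketch below assumes that reading; `QAPair` and `qa_chunks` are hypothetical names, not part of any library.

```python
from typing import NamedTuple

class QAPair(NamedTuple):
    question: str
    answer: str

def qa_chunks(pairs):
    """Turn each (question, answer) tuple into one self-contained chunk."""
    return [f"Question: {p.question}\nAnswer: {p.answer}" for p in pairs]

pairs = [
    QAPair(
        "What is the default AUTOINDEX distance metric in Milvus?",
        "The default AUTOINDEX distance metric in Milvus is L2.",
    ),
]
chunks = qa_chunks(pairs)
print(chunks[0])
```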

Pain Point #3: Chunks need more context
Naive Chunks
Document: Tesla Roadster
Section 2018: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
Section 2023: sed do eiusmod tem
Chunk #1
Chunk #2

Naive Chunks
Tesla Roadster
2018
sed do eiusmod tem
2023
sed do eiusmod tem
Better Chunks
Each chunk carries the title 2-levels above and the title 1-level above.
Tesla Roadster 2018
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
Tesla Roadster 2023
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
LangChain: HTMLHeaderTextSplitter, ParentDocumentRetriever
LlamaIndex: HierarchicalNodeParser, AutoMergingRetriever
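
The idea behind the better chunks, prefixing every chunk with the titles above it, can be sketched without any library. This is not how HTMLHeaderTextSplitter or HierarchicalNodeParser are implemented; it is a library-free illustration that assumes a hypothetical input convention where `# ` marks the document title and `## ` marks a section heading.

```python
def chunk_with_headers(lines):
    """Split on headings and prefix each chunk with the titles above it.

    Assumed convention (hypothetical): '# ' is the document title,
    '## ' is a section heading, anything else is body text.
    """
    title, section, body, chunks = "", "", [], []

    def flush():
        # Emit the body collected so far, prefixed with its title hierarchy.
        if body:
            header = " ".join(part for part in (title, section) if part)
            chunks.append(f"{header}\n{' '.join(body)}")
            body.clear()

    for line in lines:
        line = line.strip()
        if line.startswith("## "):
            flush()
            section = line[3:]
        elif line.startswith("# "):
            flush()
            title, section = line[2:], ""
        elif line:
            body.append(line)
    flush()
    return chunks

doc = [
    "# Tesla Roadster",
    "## 2018",
    "Lorem ipsum dolor sit amet,",
    "## 2023",
    "sed do eiusmod tem",
]
chunks = chunk_with_headers(doc)
for chunk in chunks:
    print(chunk, end="\n\n")
```

Each chunk now starts with "Tesla Roadster 2018" or "Tesla Roadster 2023", so a question about a specific model year can match the right chunk.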

04 RAG Evaluation Methods

Foundation Model Evals vs Production System Evals
Foundation model evals: Arena Elo score
Production system evals: Your RAG system

RAG Evaluation Methods
https://arxiv.org/pdf/2306.05685.pdf
GPT-4 favors itself with a 10% higher win rate; Claude-v1 favors itself with a 25% higher win rate.
Open-weight Prometheus-eval aligns with human judgments up to 85% as of May 2024.

Known Problems with LLM-as-Judge
https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
GPT-4 is not a good judge of comprehensiveness.
GPT-4 matches human judgments on correctness and readability.

Known Problems with LLM-as-Judge
https://arxiv.org/pdf/2305.17926
AI scores max/min higher.
Humans score medians higher.

RAG Evaluation Methods
https://github.com/explodinggradients/ragas
Inputs: Query, Context, Response, Ground Truth Answer
Metrics: faithfulness, context_precision, context_recall, answer_relevancy, answer_correctness, answer_similarity
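
To make one of these metrics concrete, the sketch below computes Context Precision@K as described in the Ragas documentation: the mean of precision@k taken over the ranks where a relevant chunk appears. It is a simplified illustration, not the Ragas implementation. In Ragas the per-chunk relevance flags come from an LLM judge; here they are assumed to be given as a hand-labeled list.

```python
def context_precision(relevance):
    """Context Precision@K for one query.

    relevance: list of 0/1 flags, one per retrieved chunk in rank order,
    where 1 means the chunk is relevant to the ground truth.
    """
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0

# Relevant chunks ranked 1st and 3rd score higher than relevant chunks ranked 2nd and 3rd.
early = context_precision([1, 0, 1])
late = context_precision([0, 1, 1])
print(early, late)
```

The metric rewards retrievers that rank relevant chunks near the top, which is why a change in chunking strategy can move it substantially.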

05 Demo RAG Eval

RETRIEVAL +46%, GENERATION +6%
####################################################
# Avg Context Precision htmlsplitter score = 0.67 (46% improvement)
# Avg Context Precision simple score = 0.46
####################################################
####################################################
# Avg mistralai mixtral_8x7b_instruct score = 0.7031 (6% improvement over gpt-3.5-turbo)
# Avg llama3_70b_anyscale_chat score = 0.6888
# Avg llama3_70b_groq_instruct score = 0.6867
# Avg llama_3_70b_octoai_instruct score = 0.6863
# Avg llama_3_8b_ollama_instruct score = 0.6783
# Avg openai gpt-3.5-turbo score = 0.665
####################################################
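
The headline percentages are relative improvements over the baseline score. A short check of the arithmetic, using the averages printed above (the function name `relative_improvement` is illustrative):

```python
def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100

# Retrieval: htmlsplitter vs. simple chunking (Avg Context Precision)
retrieval_gain = relative_improvement(0.67, 0.46)
# Generation: mixtral_8x7b_instruct vs. gpt-3.5-turbo
generation_gain = relative_improvement(0.7031, 0.665)

print(f"Retrieval: +{retrieval_gain:.0f}%")
print(f"Generation: +{generation_gain:.0f}%")
```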
