From Hype to Reality: The RAG Technique That’s Powering Next-Gen AI - The Search for Smarter Search — A RAG Story About Precision
If you’ve been following along, you already know we’ve come a long way.
In Part 1, we built our first RAG pipeline — simple, clean, and effective. In Part 2, we taught it how to read PDFs and break them down into smart, retrievable chunks. But here’s the thing: even the smartest chunks fall short if your search engine isn’t sharp.
Let me tell you a quick story.
When “Close Enough” Isn’t Good Enough
I was testing an internal chatbot trained on thousands of corporate policy docs. I typed:
“Can I expense a cab if I miss the last train home after a client dinner?”
The bot paused. Then responded confidently with a quote about reimbursing public transport receipts… from an unrelated travel policy. It wasn’t wrong — but it wasn’t right either.
And that’s when I knew: our retrieval engine needed work. Not more documents. Not a smarter model. What we needed was precision.
Enter Hybrid Retrieval — The Yin and Yang of Search
RAG systems typically use semantic search — powerful, sure, but it sometimes overlooks exact keyword matches. Meanwhile, keyword search (like BM25) is good at catching specifics but terrible with meaning.
So what do we do? We combine them. Like peanut butter and jelly. Or Batman and Robin. Here’s how it works:
# Pseudocode illustration of hybrid retrieval
semantic_results = semantic_search(query)
keyword_results = keyword_search(query)
# Score and combine both sets
final_results = rerank_based_on_combined_scores(semantic_results, keyword_results)
This approach ensures that our system understands both the intention behind the question and the words that actually matter.
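To make that pseudocode concrete, here's a minimal sketch of one popular fusion strategy: reciprocal rank fusion (RRF). It only assumes that each search returns a list of document IDs ordered best-first; the toy lists below stand in for real search output, and k=60 is just a common default.

# Minimal reciprocal rank fusion (RRF) sketch: fuse two ranked ID lists.
def reciprocal_rank_fusion(result_lists, k=60):
    fused_scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # 1 / (k + rank) decays smoothly, rewarding docs that rank high in any list
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Best fused score first
    return sorted(fused_scores, key=fused_scores.get, reverse=True)

# Toy ranked lists standing in for semantic_search() and keyword_search() output
semantic_results = ["doc_007", "doc_003", "doc_021"]
keyword_results = ["doc_003", "doc_014", "doc_007"]
print(reciprocal_rank_fusion([semantic_results, keyword_results]))
# doc_003 and doc_007, which both lists agree on, float to the top

Because RRF works on ranks rather than raw scores, you don't have to worry about BM25 and cosine similarity living on completely different scales.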
How RAG Learns to Prioritize — Reranking to the Rescue
Once we have our results, the next challenge is choosing the best ones to show first. That's where reranking comes in. We use strategies like weighted score fusion (blending the BM25 and semantic scores), reciprocal rank fusion, and cross-encoder rerankers.
Here’s a simplified example:
doc['final_score'] = 0.5 * doc['bm25_score'] + 0.5 * doc['semantic_score']
You can tweak the weights depending on your use case; tighter compliance scenarios, for example, may call for leaning harder on keyword accuracy. Getting better-ranked context in front of the LLM is what reduces hallucinations and boosts confidence in your system.
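Here's a slightly fuller sketch of that weighted scheme. It assumes each candidate is a dict carrying a bm25_score and a semantic_score, as in the one-liner above; the min-max normalization step is my own addition, since raw BM25 and cosine scores aren't directly comparable.

def normalize(scores):
    # Min-max normalize to [0, 1] so the keyword and semantic signals are comparable
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def weighted_rerank(docs, keyword_weight=0.5, semantic_weight=0.5):
    # Blend normalized BM25 and semantic scores, then sort best-first
    bm25 = normalize([d["bm25_score"] for d in docs])
    sem = normalize([d["semantic_score"] for d in docs])
    for doc, b, s in zip(docs, bm25, sem):
        doc["final_score"] = keyword_weight * b + semantic_weight * s
    return sorted(docs, key=lambda d: d["final_score"], reverse=True)

# For a compliance-heavy corpus, lean harder on exact keyword matches:
# reranked = weighted_rerank(candidates, keyword_weight=0.7, semantic_weight=0.3)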
LangChain Retriever Wrappers: Clean Code, Cleaner Results
At this point, your codebase might start getting messy. That’s why tools like LangChain retriever wrappers are handy. They let you encapsulate hybrid search logic cleanly and plug it into your retrieval pipeline like this:
from langchain.vectorstores import Qdrant  # vector store backing the semantic side
from langchain.chains import RetrievalQA

# CustomHybridRetriever is your own class that wraps the keyword + semantic logic
retriever = CustomHybridRetriever(...)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
Boom. Cleaner integration, faster iterations, and a codebase that's a lot closer to production-ready. :-)
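If you'd rather not hand-roll the wrapper, LangChain also ships an EnsembleRetriever that fuses a BM25 retriever with a vector-store retriever for you. Here's a rough sketch; exact import paths vary by LangChain version (newer releases move these into langchain_community), and it assumes docs (your chunked documents from Part 2) and llm already exist.

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Qdrant
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# Assumes `docs` and `llm` were set up in the earlier parts of this series
vectorstore = Qdrant.from_documents(
    docs, OpenAIEmbeddings(), location=":memory:", collection_name="policies"
)

bm25_retriever = BM25Retriever.from_documents(docs)                   # keyword side (needs rank_bm25)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})   # semantic side

# The weights play the same role as the 0.5 / 0.5 blend from the reranking section
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=hybrid_retriever)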
Measuring What Matters — Evaluation & Benchmarking (Expanded)
You’ve built a fancy hybrid retriever. Your RAG pipeline looks sharp. But without measurement, it’s all guesswork.
So how do we know it's better? Let's break it down into three key evaluation pillars, and I'll show you how to write code for each.
1. Precision@K
This tells you how many of the top K retrieved results are actually relevant.
Step 1: Define your ground truth
Let’s say you have some test queries and the correct documents you expect:
test_queries = [
    {
        "query": "What is the refund policy for late cancellations?",
        "expected_doc_ids": ["doc_003", "doc_007"]
    },
    {
        "query": "How do I claim medical expenses?",
        "expected_doc_ids": ["doc_014"]
    }
]
Step 2: Evaluate precision@k
def precision_at_k(retriever, query, expected_doc_ids, k=5):
    results = retriever.get_relevant_documents(query)[:k]
    retrieved_ids = [doc.metadata["doc_id"] for doc in results]
    hits = len(set(retrieved_ids) & set(expected_doc_ids))
    return hits / k

# Run evaluation
for test in test_queries:
    score = precision_at_k(retriever, test["query"], test["expected_doc_ids"], k=5)
    print(f"Query: {test['query']}\nPrecision@5: {score:.2f}\n")
2. Answer Relevancy
Did the LLM actually answer the question correctly, based on the retrieved docs? This one needs some human-labeled Q&A pairs. Here's a basic semi-automated approach using LangChain:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
test_set = [
    {
        "query": "How do I update my billing address?",
        "expected_answer": "You can update your billing address by logging into your profile settings and selecting 'Billing Info'."
    }
]
from difflib import SequenceMatcher
def answer_similarity(answer, expected):
    return SequenceMatcher(None, answer.lower(), expected.lower()).ratio()

for test in test_set:
    response = qa_chain.run(test["query"])
    similarity = answer_similarity(response, test["expected_answer"])
    print(f"Query: {test['query']}\nAnswer Similarity Score: {similarity:.2f}\n")
You can replace this with ROUGE or BERTScore for more advanced NLP evaluation if needed.
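For instance, here's a rough sketch using the rouge-score package (pip install rouge-score); ROUGE-L F1 rewards overlapping phrases rather than raw character similarity:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

for test in test_set:
    response = qa_chain.run(test["query"])
    # score(target, prediction) returns precision/recall/fmeasure per metric
    rouge_l = scorer.score(test["expected_answer"], response)["rougeL"]
    print(f"Query: {test['query']}\nROUGE-L F1: {rouge_l.fmeasure:.2f}\n")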
3. Hallucination Rate
This is the scariest metric. It tells you when the model generates confident nonsense. We track this by checking whether each generated answer is actually grounded in the retrieved documents; anything the context can't support gets flagged.
Here’s a simple function:
def hallucination_detector(query, answer, documents):
    # Naive check: flag the answer unless it appears verbatim in the retrieved context
    context = " ".join([doc.page_content for doc in documents])
    return answer.lower() not in context.lower()

# Run a hallucination test
for test in test_set:
    docs = retriever.get_relevant_documents(test["query"])
    answer = qa_chain.run(test["query"])
    hallucinated = hallucination_detector(test["query"], answer, docs)
    print(f"Query: {test['query']}\nHallucinated: {'Yes' if hallucinated else 'No'}\n")
This is a naive version (it flags any answer that isn't quoted verbatim from the retrieved context), but in practice you can use NLI-style entailment models, an LLM acting as a judge, or RAG evaluation frameworks such as RAGAS to score how faithful each answer is to its sources.
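As one example of the LLM-as-a-judge route, here's a rough sketch that asks the same llm to grade groundedness. The prompt wording and the .predict() call are illustrative; adapt them to whatever LLM interface you're using.

JUDGE_PROMPT = """You are checking a RAG answer for hallucinations.
Context:
{context}

Answer:
{answer}

Is every factual claim in the answer supported by the context? Reply with only YES or NO."""

def llm_hallucination_check(answer, documents):
    context = "\n\n".join(doc.page_content for doc in documents)
    verdict = llm.predict(JUDGE_PROMPT.format(context=context, answer=answer))
    return verdict.strip().upper().startswith("NO")  # True means likely hallucinated

for test in test_set:
    docs = retriever.get_relevant_documents(test["query"])
    answer = qa_chain.run(test["query"])
    print(f"Query: {test['query']}\nLikely hallucinated: {llm_hallucination_check(answer, docs)}\n")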
Final Thoughts
Metrics like Precision@K, Answer Relevancy, and Hallucination Rate give you visibility into how reliable and accurate your RAG system truly is. But don't overfit to any one number. The best systems combine automated metrics, human spot-checks on real queries, and feedback loops from production usage.
In Part 4, we’ll explore how to scale this entire setup — handling document refreshes, adding monitoring, retries, and self-healing workflows. Stay tuned. The RAG factory is about to go industrial. 🏭🔥
#RAG #RetrievalAugmentedGeneration #GenerativeAI #LangChain #HybridRetrieval #VectorSearch #LLM #LargeLanguageModels #BM25 #Embeddings #Reranking #PrecisionAtK #ModelEvaluation #HallucinationDetection #QAEvaluation #AIBenchmarking #AI #ArtificialIntelligence #MachineLearning #DataScience #MLOps #PromptEngineering #AICommunity #NLPTesting