Retrieval Augmented Generation with Google Colab
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a powerful approach that combines the strengths of information retrieval with generative AI models. It enhances the model's ability to generate more accurate and contextually relevant answers by incorporating real-time retrieval of external knowledge. This hybrid method is particularly useful for tasks requiring up-to-date or specialized information that might not be captured within a generative model's static knowledge base.
How RAG Works
RAG integrates two components:
Retrieval: The AI retrieves relevant information or documents from external sources (like databases, search engines, or document stores) based on a user's query.
Generation: The retrieved information is then passed to a generative model (such as GPT, Claude, or similar models), which uses this data to create a coherent, detailed response.
This two-step process allows the model to access current information and generate content that is not limited to its pre-trained knowledge, making it ideal for tasks like customer support, research, or dynamic content generation.
Key Steps in RAG
Query Understanding: The model interprets the user query and decides what type of information is needed.
Information Retrieval: The system searches a connected database or knowledge repository for the most relevant documents or facts.
Document Ranking: The retrieved documents are ranked based on relevance to the query.
Answer Generation: The generative model uses the top-ranked documents to create a response that is more informed and contextually accurate than if it relied on pre-trained data alone.
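To make these steps concrete, here is a tiny, self-contained toy sketch in Python: retrieval and ranking are done with simple word overlap over an in-memory document list (a real system would use embeddings or BM25), and the ranked context is packed into a prompt for the generative model. All names and documents here are illustrative.

```python
# Toy illustration of the RAG steps over an in-memory document list.
# Scoring is plain word overlap; real systems use embeddings or BM25.
def retrieve_and_rank(query, documents, top_k=2):
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)       # document ranking
    return [doc for score, doc in scored[:top_k] if score > 0]

documents = [
    "Type 2 diabetes is commonly managed with metformin.",
    "RAG combines retrieval with text generation.",
]
context = retrieve_and_rank("How is type 2 diabetes treated?", documents)
prompt = f"Answer the question using only this context: {context}"
print(prompt)  # this prompt would then be passed to the generative model
```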
Example of RAG in Action
Use Case: Generating answers to medical queries
Without RAG:
A model trained up to a certain date may not have the latest medical research.
It generates answers based solely on the information it was trained on, possibly leading to outdated or incomplete responses.
With RAG:
The model can retrieve the latest medical papers, clinical guidelines, or research articles in real time.
It then uses this up-to-date information to provide a more accurate, evidence-based response.
Query Example: "What are the latest treatments for Type 2 diabetes?"
The RAG system retrieves recent medical papers or drug approvals from trusted sources.
The generative model then synthesizes the retrieved information into a detailed, current response.
Benefits of RAG
Up-to-Date Information: RAG enables models to provide real-time information, overcoming the limitations of static training data.
Improved Accuracy: By combining external retrieval with generation, responses are more precise, especially for specialized or niche domains.
Customizable Knowledge Base: Organizations can use RAG to ground models in proprietary or domain-specific datasets, such as company documents, product manuals, or legal codes.
Reduced Hallucination: Since generative models can sometimes "hallucinate" (produce incorrect or fabricated information), RAG mitigates this by grounding responses in actual retrieved data.
Practical Applications of RAG
Customer Support: RAG can provide detailed answers to customer queries by retrieving relevant knowledge from FAQ databases or internal documents, ensuring responses are accurate and aligned with company policies.
Research Assistance: RAG is highly useful in fields like law, healthcare, and academia, where real-time access to updated research or case laws is essential.
Personalized Recommendations: By retrieving user-specific information, such as previous interactions or preferences, RAG systems can generate more personalized and context-aware recommendations.
Search-Augmented Chatbots: Chatbots using RAG can handle complex questions by retrieving and integrating real-time data from external sources such as web pages, databases, or APIs.
RAG Architecture
The architecture of a RAG system typically includes:
Retrieval Component: This can be a vector search engine like Elasticsearch, Pinecone, or FAISS, or even a traditional keyword-based search engine.
Generative Component: Large Language Models (LLMs) such as GPT-4, LLaMA, or Claude, which are capable of generating fluent and coherent text.
Document Embeddings: The documents or data being retrieved are transformed into embeddings (numerical representations) for efficient search and matching.
Scoring and Ranking: Once documents are retrieved, a ranking mechanism (e.g., BM25 or cosine similarity) ensures the most relevant documents are used for the generation phase.
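As a small worked example of the scoring step, the snippet below computes cosine similarity between a query embedding and two document embeddings. The vectors are toy 3-dimensional values chosen for illustration; real embedding models produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings for illustration only.
query_vec = np.array([0.2, 0.8, 0.1])
doc_vecs = {
    "doc_a": np.array([0.1, 0.9, 0.0]),
    "doc_b": np.array([0.9, 0.1, 0.3]),
}
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))  # most relevant first
```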
Tools and Technologies for Building RAG Systems
Vector Databases: e.g., FAISS, Pinecone, or Elasticsearch, for storing and searching embeddings.
LLM Integration: frameworks such as LangChain that connect retrievers to models like GPT-4, LLaMA, or Claude.
Document Chunking: splitting large documents into smaller pieces that fit the embedding model and the LLM's context window.
Embedding Models: e.g., Sentence Transformers models such as MPNet or MiniLM, or BGE, for converting text into vectors.
Storage Solutions: local or cloud storage for the index and the source documents (in our demo, the Colab file system).
RAG Use Cases and Solutions
Client-Specific Knowledge Retrieval: A company can build a RAG system to retrieve client-specific data and generate personalized reports or responses.
Documentation Generation: Developers can use RAG to generate documentation from codebases, pulling relevant code snippets and comments to form coherent documents.
Legal Document Analysis: In legal tech, RAG can be used to retrieve case laws or regulations and generate detailed analysis or summaries, ensuring that the information is current and accurate.
Sample RAG Application: Using Our Own Data with an LLM
We will use the following tools for our demonstration.
LangChain - an LLM framework (we will look at it in more depth in tomorrow's class)
Sentence Transformers - the library that provides the sentence embedding models used to encode the text
FAISS (Facebook AI Similarity Search) - a product of Meta, used here for storing the vector embeddings locally (in the Colab file system)
Hugging Face Embeddings - to convert the chunked data into embeddings; we use MPNet (Masked and Permuted Pre-training for Language Understanding)
colab-xterm - to run Ollama locally in Colab (this gives us the ability to run the Llama 3.1 model inside Google Colab itself)
RAG has three main stages:
Data Ingestion
Data Retrieval
Data Generation
First, we import the required libraries. After importing, let us load the file for processing.
In my case, I took my CV as the sample document, since I want to demonstrate data that the LLM does not already know ;)
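A minimal sketch of the imports and the document-loading step, assuming the CV was uploaded to Colab as a PDF; the file name my_cv.pdf is a hypothetical placeholder, and PyPDFLoader needs the pypdf package.

```python
# Load the private document that the LLM has never seen.
from langchain_community.document_loaders import PyPDFLoader  # requires: pip install pypdf

loader = PyPDFLoader("my_cv.pdf")   # hypothetical file name for the uploaded CV
documents = loader.load()           # one Document per page
print(len(documents), "page(s) loaded")
```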
Here I split my entire CV into chunks using the following code:
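A sketch of the chunking step using LangChain's RecursiveCharacterTextSplitter; the chunk size and overlap below are illustrative values, not necessarily the ones used in the original notebook.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the CV into overlapping chunks small enough for the embedding model.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
print(len(chunks), "chunks created")
print(chunks[0].page_content[:200])  # peek at the first chunk
```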
The split document chunks look like the output above.
For generating text embeddings from the split text, we use a Hugging Face embedding model, chosen with the help of MTEB.
MTEB: Massive Text Embedding Benchmark
The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality.
BGE (BAAI General Embedding) - BAAI: https://huggingface.co/BAAI
Dataset size: Larger datasets generally benefit from more powerful models like MPNet.
Computational resources: If you have limited resources, BGE Small En or MiniLM might be better options.
Task complexity: For complex tasks like question answering or text summarization, MPNet is often preferred.
Embedding dimensionality: Different models produce embeddings of varying dimensions. Choose based on downstream task requirements.
Performance vs. efficiency trade-off: Decide whether you prioritize high accuracy or faster processing.
Experimentation is key. Try different models and evaluate their performance on your specific task and dataset to find the best fit.
MPNET: Masked and Permuted Pre-training for Language Understanding.
https://huggingface.co/sentence-transformers
https://huggingface.co/spaces/mteb/leaderboard
https://huggingface.co/blog/mteb
The above embedding model from Hugging Face can be accessed without any Hugging Face token. It generates the embeddings, which we then store in a local vector database called FAISS (Facebook AI Similarity Search).
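A sketch of instantiating the embedding model in LangChain; all-mpnet-base-v2 is the model discussed above, and it downloads without a Hugging Face token.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# Downloads the sentence-transformers model and runs it locally; no HF token needed.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

sample_vector = embeddings.embed_query("Retrieval-Augmented Generation")
print(len(sample_vector))  # all-mpnet-base-v2 produces 768-dimensional vectors
```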
Why Use FAISS
Efficiency: FAISS is optimized for fast similarity search over large collections of vectors.
Versatility: it supports many index types, from exact search to fast approximate methods.
Scalability: it scales to very large collections of embeddings on a single machine.
Integration: it works smoothly with Python/NumPy and with frameworks such as LangChain.
GPU Support: GPU-accelerated indexes are available for even faster search.
Security Considerations
Data Control: the embeddings stay in your own environment (here, the Colab file system).
Reduced Exposure: no data has to be sent to a third-party vector database service.
Compliance: keeping data local can make it easier to meet data-protection requirements.
Latency and Performance: local lookups avoid network round trips.
Network Security: no extra external endpoints need to be exposed or secured.
Here we load the embeddings into a vector store using FAISS:
This will create a folder named faiss_index_ in Google Colab, where the embeddings are stored.
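A sketch of building and saving the index, assuming chunks and embeddings are the objects created earlier; FAISS itself comes from the faiss-cpu package.

```python
from langchain_community.vectorstores import FAISS  # requires: pip install faiss-cpu

# Embed every chunk, build the FAISS index, then persist it locally.
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index_")
```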
Next, we create a retriever on top of the vector store to retrieve the data. Now our embedding step is done and our vector store is ready with our private data (my CV in this case).
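A sketch of the retriever; k=3 (the number of chunks returned per query) is an illustrative choice, and the query is a hypothetical example.

```python
# Expose the vector store as a retriever; k controls how many chunks come back.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Quick sanity check with a hypothetical query about the CV.
print(retriever.invoke("What work experience is listed in the CV?"))
```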
For the LLM, I wanted to try running an Ollama server inside Google Colab. So I installed the LangChain support for Ollama like this, so that I can access the Ollama Python functions inside my program:
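The exact package names vary with the LangChain version; this sketch assumes the community integrations package that ships the Ollama wrapper.

```python
# Install LangChain plus the community integrations that include the Ollama wrapper.
!pip install -q langchain langchain-community
```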
This is the most awaited part. Colab can give us an in-built terminal through the colab-xterm extension, which can be installed using pip like this:
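A sketch of the usual colab-xterm setup; the PyPI package is colab-xterm and the notebook extension is loaded as colabxterm.

```python
# Install the colab-xterm extension and load it into the notebook.
!pip install -q colab-xterm
%load_ext colabxterm
```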
After that, open xterm using the command below and install and start Ollama inside the terminal:
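Opening the terminal, and then the typical commands to install and serve Ollama inside it; the install script URL and the llama3.1 model tag are the standard Ollama ones, shown here as an assumed setup.

```python
# Open the in-notebook terminal.
%xterm

# Then, inside the xterm window, run something like:
#   curl -fsSL https://ollama.com/install.sh | sh   # install Ollama
#   ollama serve &                                  # start the Ollama server in the background
#   ollama pull llama3.1                            # download the Llama 3.1 model
```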
Import Ollama like this in Python, so that you can use the Ollama class with the LangChain framework:
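A sketch assuming the langchain-community wrapper (newer LangChain versions ship the same functionality as OllamaLLM in the langchain-ollama package).

```python
from langchain_community.llms import Ollama

# Point the LangChain wrapper at the locally running Ollama server.
llm = Ollama(model="llama3.1")
```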
Test the installed Llama 3.1 with a prompt, to make sure that it does not already know the answer to your question:
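A quick sanity check; the question is a hypothetical placeholder for something only your private document (the CV) would answer.

```python
# Ask about something that only the private document (the CV) contains.
print(llm.invoke("What do you know about <candidate name from my CV>?"))
# Expected: the bare model cannot answer correctly, since this data was never in its training set.
```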
Now we can use LangChain's RetrievalQA chain to pass our query, along with the retriever (which we built on top of the FAISS vector store), to the Ollama LLM:
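A sketch of wiring the LLM and the retriever together with LangChain's RetrievalQA chain; chain_type="stuff" simply pastes the retrieved chunks into the prompt, and the query is again a hypothetical example.

```python
from langchain.chains import RetrievalQA

# Combine the Ollama LLM with the FAISS-backed retriever.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",            # put the retrieved chunks directly into the prompt
    return_source_documents=True,  # also return which chunks were used
)

result = qa_chain.invoke({"query": "What do you know about <candidate name from my CV>?"})
print(result["result"])
```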
Now you will get a correct, grounded response, since the answer is fetched using RAG.