Retrieval Augmented Generation with Google Colab
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a powerful approach that combines the strengths of information retrieval with generative AI models. It enhances the model's ability to generate more accurate and contextually relevant answers by incorporating real-time retrieval of external knowledge. This hybrid method is particularly useful for tasks requiring up-to-date or specialized information that might not be captured within a generative model's static knowledge base.
How RAG Works
RAG integrates two components:
Retrieval: The AI retrieves relevant information or documents from external sources (like databases, search engines, or document stores) based on a user's query.
Generation: The retrieved information is then passed to a generative model (such as GPT, Claude, or similar models), which uses this data to create a coherent, detailed response.
This two-step process allows the model to access current information and generate content that is not limited to its pre-trained knowledge, making it ideal for tasks like customer support, research, or dynamic content generation.
Key Steps in RAG
Query Understanding: The model interprets the user query and decides what type of information is needed.
Information Retrieval: The system searches a connected database or knowledge repository for the most relevant documents or facts.
Document Ranking: The retrieved documents are ranked based on relevance to the query.
Answer Generation: The generative model uses the top-ranked documents to create a response that is more informed and contextually accurate than if it relied on pre-trained data alone.
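To make these steps concrete, here is a tiny, self-contained toy sketch in Python: retrieval and ranking are done with simple word overlap over an in-memory document list (a real system would use embeddings or BM25), and the ranked context is packed into a prompt for the generative model. All names and documents here are illustrative.

```python
# Toy illustration of the RAG steps over an in-memory document list.
# Scoring is plain word overlap; real systems use embeddings or BM25.
def retrieve_and_rank(query, documents, top_k=2):
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)       # document ranking
    return [doc for score, doc in scored[:top_k] if score > 0]

documents = [
    "Type 2 diabetes is commonly managed with metformin.",
    "RAG combines retrieval with text generation.",
]
context = retrieve_and_rank("How is type 2 diabetes treated?", documents)
prompt = f"Answer the question using only this context: {context}"
print(prompt)  # this prompt would then be passed to the generative model
```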
Example of RAG in Action
Use Case: Generating answers to medical queries
Without RAG:
A model trained up to a certain date may not have the latest medical research.
It generates answers based solely on the information it was trained on, possibly leading to outdated or incomplete responses.
With RAG:
The model can retrieve the latest medical papers, clinical guidelines, or research articles in real time.
It then uses this up-to-date information to provide a more accurate, evidence-based response.
Query Example: "What are the latest treatments for Type 2 diabetes?"
The RAG system retrieves recent medical papers or drug approvals from trusted sources.
The generative model then synthesizes the retrieved information into a detailed, current response.
Benefits of RAG
Up-to-Date Information: RAG enables models to provide real-time information, overcoming the limitations of static training data.
Improved Accuracy: By combining external retrieval with generation, responses are more precise, especially for specialized or niche domains.
Customizable Knowledge Base: Organizations can use RAG to ground models in proprietary or domain-specific datasets, such as company documents, product manuals, or legal codes.
Reduced Hallucination: Since generative models can sometimes "hallucinate" (produce incorrect or fabricated information), RAG mitigates this by grounding responses in actual retrieved data.
Practical Applications of RAG
Customer Support: RAG can provide detailed answers to customer queries by retrieving relevant knowledge from FAQ databases or internal documents, ensuring responses are accurate and aligned with company policies.
Research Assistance: RAG is highly useful in fields like law, healthcare, and academia, where real-time access to updated research or case laws is essential.
Personalized Recommendations: By retrieving user-specific information, such as previous interactions or preferences, RAG systems can generate more personalized and context-aware recommendations.
Search-Augmented Chatbots: Chatbots using RAG can handle complex questions by retrieving and integrating real-time data from external sources such as web pages, databases, or APIs.
RAG Architecture
The architecture of a RAG system typically includes:
Retrieval Component: This can be a vector search engine like Elasticsearch, Pinecone, or FAISS, or even a traditional keyword-based search engine.
Generative Component: Large Language Models (LLMs) such as GPT-4, LLaMA, or Claude, which are capable of generating fluent and coherent text.
Document Embeddings: The documents or data being retrieved are transformed into embeddings (numerical representations) for efficient search and matching.
Scoring and Ranking: Once documents are retrieved, a ranking mechanism (e.g., BM25 or cosine similarity) ensures the most relevant documents are used for the generation phase.
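As a small worked example of the scoring step, the snippet below computes cosine similarity between a query embedding and two document embeddings. The vectors are toy 3-dimensional values chosen for illustration; real embedding models produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings for illustration only.
query_vec = np.array([0.2, 0.8, 0.1])
doc_vecs = {
    "doc_a": np.array([0.1, 0.9, 0.0]),
    "doc_b": np.array([0.9, 0.1, 0.3]),
}
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))  # most relevant first
```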
Tools and Technologies for Building RAG Systems
Vector Databases: e.g., FAISS, Pinecone, or Elasticsearch, for storing and searching embeddings.
LLM Integration: frameworks such as LangChain that connect retrievers to models like GPT-4, LLaMA, or Claude.
Document Chunking: splitting large documents into smaller pieces that fit the embedding model and the LLM's context window.
Embedding Models: e.g., Sentence Transformers models such as MPNet or MiniLM, or BGE, for converting text into vectors.
Storage Solutions: local or cloud storage for the index and the source documents (in our demo, the Colab file system).
RAG Use Cases and Solutions
Client-Specific Knowledge Retrieval: A company can build a RAG system to retrieve client-specific data and generate personalized reports or responses.
Documentation Generation: Developers can use RAG to generate documentation from codebases, pulling relevant code snippets and comments to form coherent documents.
Legal Document Analysis: In legal tech, RAG can be used to retrieve case laws or regulations and generate detailed analysis or summaries, ensuring that the information is current and accurate.
Sample RAG Application: Using Our Own Data with an LLM
We will use the following tools for our demonstration.
LangChain - an LLM framework (we will look at it in more depth in tomorrow's class)
Sentence Transformers - the library that provides the sentence embedding models used to encode the text
FAISS (Facebook AI Similarity Search) - a product of Meta, used here for storing the vector embeddings locally (in the Colab file system)
Hugging Face Embeddings - to convert the chunked data into embeddings; we use MPNet (Masked and Permuted Pre-training for Language Understanding)
colab-xterm - to run Ollama locally in Colab (this gives us the ability to run the Llama 3.1 model inside Google Colab itself)
RAG has three main stages:
Data Ingestion
Data Retrieval
Data Generation
First, we import the required libraries. After importing, let us load the file for processing.
In my case, I took my CV as the sample document, since I want to demonstrate data that the LLM does not already know ;)
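A minimal sketch of the imports and the document-loading step, assuming the CV was uploaded to Colab as a PDF; the file name my_cv.pdf is a hypothetical placeholder, and PyPDFLoader needs the pypdf package.

```python
# Load the private document that the LLM has never seen.
from langchain_community.document_loaders import PyPDFLoader  # requires: pip install pypdf

loader = PyPDFLoader("my_cv.pdf")   # hypothetical file name for the uploaded CV
documents = loader.load()           # one Document per page
print(len(documents), "page(s) loaded")
```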
Here I split my entire CV into chunks using the following code:
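A sketch of the chunking step using LangChain's RecursiveCharacterTextSplitter; the chunk size and overlap below are illustrative values, not necessarily the ones used in the original notebook.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the CV into overlapping chunks small enough for the embedding model.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
print(len(chunks), "chunks created")
print(chunks[0].page_content[:200])  # peek at the first chunk
```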
The split document chunks look like the output above.
For generating text embeddings from the split text, we use a Hugging Face embedding model, chosen with the help of MTEB.
MTEB: Massive Text Embedding Benchmark
The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality.
BGE (BAAI General Embedding) - BAAI: https://huggingface.co/BAAI
Dataset size: Larger datasets generally benefit from more powerful models like MPNet.
Computational resources: If you have limited resources, BGE Small En or MiniLM might be better options.
Task complexity: For complex tasks like question answering or text summarization, MPNet is often preferred.
Embedding dimensionality: Different models produce embeddings of varying dimensions. Choose based on downstream task requirements.
Performance vs. efficiency trade-off: Decide whether you prioritize high accuracy or faster processing.
Experimentation is key. Try different models and evaluate their performance on your specific task and dataset to find the best fit.
MPNET: Masked and Permuted Pre-training for Language Understanding.
https://huggingface.co/sentence-transformers
https://huggingface.co/spaces/mteb/leaderboard
https://huggingface.co/blog/mteb
The above embedding model from Hugging Face can be accessed without any Hugging Face token. It generates the embeddings, which we then store in a local vector database called FAISS (Facebook AI Similarity Search).
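A sketch of instantiating the embedding model in LangChain; all-mpnet-base-v2 is the model discussed above, and it downloads without a Hugging Face token.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# Downloads the sentence-transformers model and runs it locally; no HF token needed.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

sample_vector = embeddings.embed_query("Retrieval-Augmented Generation")
print(len(sample_vector))  # all-mpnet-base-v2 produces 768-dimensional vectors
```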
Why Use FAISS
Efficiency: FAISS is optimized for fast similarity search over large collections of vectors.
Versatility: it supports many index types, from exact search to fast approximate methods.
Scalability: it scales to very large collections of embeddings on a single machine.
Integration: it works smoothly with Python/NumPy and with frameworks such as LangChain.
GPU Support: GPU-accelerated indexes are available for even faster search.
Security Considerations
Data Control: the embeddings stay in your own environment (here, the Colab file system).
Reduced Exposure: no data has to be sent to a third-party vector database service.
Compliance: keeping data local can make it easier to meet data-protection requirements.
Latency and Performance: local lookups avoid network round trips.
Network Security: no extra external endpoints need to be exposed or secured.
Here we load the embeddings into a vector store using FAISS:
This will create a folder named faiss_index_ in Google Colab, where the embeddings are stored.
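A sketch of building and saving the index, assuming chunks and embeddings are the objects created earlier; FAISS itself comes from the faiss-cpu package.

```python
from langchain_community.vectorstores import FAISS  # requires: pip install faiss-cpu

# Embed every chunk, build the FAISS index, then persist it locally.
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index_")
```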
Next, we create a retriever on top of the vector store to retrieve the data. Now our embedding step is done and our vector store is ready with our private data (my CV in this case).
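A sketch of the retriever; k=3 (the number of chunks returned per query) is an illustrative choice, and the query is a hypothetical example.

```python
# Expose the vector store as a retriever; k controls how many chunks come back.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Quick sanity check with a hypothetical query about the CV.
print(retriever.invoke("What work experience is listed in the CV?"))
```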
For the LLM, I wanted to try running an Ollama server inside Google Colab. So I installed the LangChain support for Ollama like this, so that I can access the Ollama Python functions inside my program:
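The exact package names vary with the LangChain version; this sketch assumes the community integrations package that ships the Ollama wrapper.

```python
# Install LangChain plus the community integrations that include the Ollama wrapper.
!pip install -q langchain langchain-community
```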
This is the most awaited part. Colab can give us an in-built terminal through the colab-xterm extension, which can be installed using pip like this:
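A sketch of the usual colab-xterm setup; the PyPI package is colab-xterm and the notebook extension is loaded as colabxterm.

```python
# Install the colab-xterm extension and load it into the notebook.
!pip install -q colab-xterm
%load_ext colabxterm
```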
After that, open xterm using the command below and install and start Ollama inside the terminal:
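Opening the terminal, and then the typical commands to install and serve Ollama inside it; the install script URL and the llama3.1 model tag are the standard Ollama ones, shown here as an assumed setup.

```python
# Open the in-notebook terminal.
%xterm

# Then, inside the xterm window, run something like:
#   curl -fsSL https://ollama.com/install.sh | sh   # install Ollama
#   ollama serve &                                  # start the Ollama server in the background
#   ollama pull llama3.1                            # download the Llama 3.1 model
```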
Import Ollama like this in Python, so that you can use the Ollama class with the LangChain framework:
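A sketch assuming the langchain-community wrapper (newer LangChain versions ship the same functionality as OllamaLLM in the langchain-ollama package).

```python
from langchain_community.llms import Ollama

# Point the LangChain wrapper at the locally running Ollama server.
llm = Ollama(model="llama3.1")
```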
Test the installed Llama 3.1 with a prompt, to make sure that it does not already know the answer to your question:
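A quick sanity check; the question is a hypothetical placeholder for something only your private document (the CV) would answer.

```python
# Ask about something that only the private document (the CV) contains.
print(llm.invoke("What do you know about <candidate name from my CV>?"))
# Expected: the bare model cannot answer correctly, since this data was never in its training set.
```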
Now we can use LangChain's RetrievalQA chain to pass our query, along with the retriever (which we built on top of the FAISS vector store), to the Ollama LLM:
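A sketch of wiring the LLM and the retriever together with LangChain's RetrievalQA chain; chain_type="stuff" simply pastes the retrieved chunks into the prompt, and the query is again a hypothetical example.

```python
from langchain.chains import RetrievalQA

# Combine the Ollama LLM with the FAISS-backed retriever.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",            # put the retrieved chunks directly into the prompt
    return_source_documents=True,  # also return which chunks were used
)

result = qa_chain.invoke({"query": "What do you know about <candidate name from my CV>?"})
print(result["result"])
```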
Now you will get a correct, grounded response, since the answer is fetched using RAG.