From Zero To SaaS #44 RAG

Jasper Ruijs

I build AI Sales Agents to automate your pipeline without increasing headcount 🎯 | Leverage AI to make your sales team more human 🚀 | AI + Sales Process Optimisation 🦾 | Founder of Growsteady

Published Jun 13, 2025

"The AI world is moving so fast that, as an antidote, I have chosen to focus on fundamentals," said Rik Boere, a top-notch agent builder.

One of these fundamentals is data management; no matter how advanced AI becomes, it always stores information somewhere.

A deep dive into databases showed me that:

A database is not a larger version of a Google Sheet.
RAG has its limitations.
LLMs don't understand language.

In this week's edition, I will walk you through the fundamentals of data as preparation for diving into RAG.

The Fundamentals of Digital Information

Data itself is just a bunch of 0s and 1s, but every piece of data is stored as a type. Think of types as the words on a page.

String: a sequence of characters ("hello!", '42', "John_Doe").
Float: a number with decimals (3.14, -0.001, 2.0).
Int: a whole number, without a decimal point (-10, 0, 42).
Boolean: a value that's either true or false.
List: a collection of items ([1, 2, 3], ["a", "b", "c"]).
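To make the types above concrete, here is a minimal Python sketch. The variable names are made up for illustration; the values come from the list above.

```python
# Illustrative values only; the variable names are hypothetical.
name = "John_Doe"          # string
price = 3.14               # float
count = 42                 # int
is_active = True           # boolean
items = [1, 2, 3]          # list

# '42' in quotes is a string, not a number, even though it looks like one.
looks_like_number = "42"

for value in (name, price, count, is_active, items, looks_like_number):
    # Print the type name next to the value it belongs to.
    print(type(value).__name__, repr(value))
```

Note that the quoted '42' only becomes a number after an explicit conversion such as int("42").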

To read this data, code relies on formats to recognize its structure.

Just as we read sentences to connect words into concepts, we use formats to combine values into more complex structures.

JSON: used for web APIs, configuration files, and data storage.

{
 "name": "Jasper",
 "age": 30
}

XML: used for legacy systems, SOAP APIs, and document storage (e.g., Office files).

<user>
  <name>Jasper</name>
  <age>30</age>
</user>

CSV: used for spreadsheets, tabular data, and data import/export.

name,age
Jasper,30

YAML: used for config files (e.g., Docker Compose, GitHub Actions) and infrastructure-as-code.

name: Jasper
age: 30

HTML: used for web content, templating, and rendering structured documents.

<h1> Jasper Data </h1>
<p>Name: Jasper</p>
<p>Age: 30</p>
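As a rough sketch of what "recognizing the structure" means in practice, the snippet below reads the same record from JSON, XML, and CSV using only Python's standard library. YAML is left out because parsing it needs a third-party package. The variable names are my own.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same record, written in three of the formats shown above.
json_text = '{"name": "Jasper", "age": 30}'
xml_text = "<user><name>Jasper</name><age>30</age></user>"
csv_text = "name,age\nJasper,30\n"

# JSON keeps the type: age comes back as an int.
from_json = json.loads(json_text)

# XML has no types of its own: every value comes back as text.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# CSV is also plain text: the first line is the header, the second the row.
from_csv = dict(next(csv.DictReader(io.StringIO(csv_text))))

print(from_json)
print(from_xml)
print(from_csv)
```

The point to notice is that JSON returns the age as a number, while XML and CSV return it as the string "30", so the code reading those formats has to convert it.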

So, we write words and form sentences, but to make a book, you need pages.

Each page has a page number, which in the digital information world serves as the ID of the dataset.

This is the level where databases come in.

If a spreadsheet is a page, then what is the book?

Think of a database as a gigantic stack of rows, like an apartment building with 1,000 floors and 40 apartments on each floor.
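A small sketch of the "page number as ID" idea, using Python's built-in sqlite3 module with an in-memory database. The table name, column names, and the second row are made up for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Every row gets an ID, comparable to the page number in a book.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Jasper", 30), ("Alex", 25)],  # "Alex" is a hypothetical example row
)

# Look up one row directly by its ID instead of scanning the whole table.
row = conn.execute("SELECT id, name, age FROM users WHERE id = ?", (2,)).fetchone()
print(row)
conn.close()
```

Because the ID is unique, the database can jump straight to the right row, which is what makes it different from scrolling through a sheet.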

The LLM Database

RAG stands for Retrieval-Augmented Generation: the relevant pieces of your data are retrieved, usually by comparing numerical representations of the text, and added to the prompt so that the LLM can use them when generating an answer.

When you type text into ChatGPT, it is split into tokens, transformed into numbers, and fed into the model's neural network. The model performs calculations and outputs numbers, which are converted back into text.

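When setting up a RAG database, you use an embedding model, a type of model that transforms text into vectors, or lists of numbers. These vectors cannot be translated back into text, so the database stores the original text next to each vector. You do have to use the exact same embedding model for your questions as for your stored data, because vectors from different models cannot be compared.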

When you use RAG, you improve two processes:

A) You use fewer tokens, because only the retrieved pieces of text are sent to the model instead of the entire dataset.

B) You add context that helps the LLM determine which concepts should be linked to each other.
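The retrieval step can be sketched in a few lines of Python. This is a toy under loud assumptions: the embed function below simply counts words and is only a stand-in for a real embedding model, and the chunks and question are invented. A real embedding model also handles typos and synonyms, which this word count does not.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: counts lowercase words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical knowledge base chunks.
chunks = [
    "Our support desk is open on weekdays from 9:00 to 17:00.",
    "Invoices are sent on the first day of every month.",
    "You can reset your password from the login page.",
]

# The store keeps the vector AND the original text side by side.
store = [(embed(chunk), chunk) for chunk in chunks]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are closest to the question."""
    q = embed(question)  # same "model" as used for the stored chunks
    ranked = sorted(store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

question = "When are invoices sent?"
context = retrieve(question)

# Only the retrieved chunk is added to the prompt, not the whole dataset.
prompt = (
    "Answer using only this context:\n"
    + "\n".join(context)
    + "\n\nQuestion: "
    + question
)
print(prompt)
```

The store holds the text next to each vector, and the question is embedded with the same function as the chunks, which mirrors the two requirements of a real setup.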

In which scenarios should RAG not be used?

If the answer needs a different format or more accuracy than the standard model gives, use fine-tuning instead: you retrain specific parts of the model's weights to decrease the likelihood of hallucination in particular areas, such as government administration, law, finance, and medical records.
As LLMs are built around text, I do not recommend moving table-formatted data, such as your CRM, ATS, lead generation, and advertising campaign data, into RAG.
I also find it challenging to live-sync RAG data and to keep records optimized. RAG is therefore excellent for a knowledge base assistant and for administration; for the platforms mentioned above, I prefer API calls.
Retrieval from a RAG database does not work like an SQL query with exact matches. Instead, it does semantic search: even with typos or only loosely related words, it will usually find the information you are looking for.
In some cases, you can get away with prompt caching, where the part of the input that stays the same is stored, so the model does not have to process all of that data again on every request.

My RAG protocol

1. Web scraping or finding all the relevant files

I use Firecrawl to crawl the entire website, and then Firecrawl again to scrape the markup of every page.

2. Data Cleaning

Then I create a loop that goes over each page and adds context at the beginning and at the end of it, which enhances retrieval.

If you have multiple documents, it is best to convert the PDFs and other files into TXT files and then combine all the TXT files into one.
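The cleaning loop could look roughly like the sketch below. Which context to add is my own assumption (site name, page title, and URL in a header, plus a closing line), and the pages are placeholders, not real scraped data.

```python
def add_context(pages: list[dict], site_name: str) -> list[str]:
    """Wrap every page in a header and footer that say where the text came from."""
    wrapped = []
    for page in pages:
        header = f"Source: {site_name} | Page: {page['title']} | URL: {page['url']}"
        footer = f"End of page: {page['title']}"
        # strip() removes leftover whitespace from scraping
        wrapped.append(f"{header}\n{page['text'].strip()}\n{footer}")
    return wrapped

# Placeholder pages; in practice these would come from the scraping step.
pages = [
    {"title": "Pricing", "url": "https://example.com/pricing", "text": "  Pricing text goes here.  "},
    {"title": "FAQ", "url": "https://example.com/faq", "text": "FAQ text goes here.\n"},
]

# Combine everything into one text, with a blank line between pages.
combined = "\n\n".join(add_context(pages, "Example site"))
print(combined)
```

Repeating the page title at the start and end means that a chunk cut from either edge of a page still carries a hint about its origin.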

3. Vectorisation Setup

I ran a test on https://platform.vectorize.io/ to find the settings for vectorisation.

Pinecone is the easiest to set up, but you have less control over it.

Supabase is a bit more challenging to set up, but you have more control and can see how each item is stored, which is why I recommend it if you are setting up RAG for the first time.

Then set up a vector database.

4. Vectorisation Settings

I configure the settings in n8n.

You always have to A) select the right embedding model, B) set the dimensions, C) set the chunk size, and D) set the overlap.

Then, I execute the workflow, and at the end, I check with the database provider whether the records have arrived.
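To show what chunk size and overlap actually do, here is a minimal character-based splitter. It is a sketch, not how n8n implements it; real splitters may count tokens instead of characters and may try to break at sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters that share overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

With a chunk size of 4 and an overlap of 1, the text "abcdefghij" becomes "abcd", "defg", and "ghij": each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.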

5. Setting up the retrieval agent.

I ensure that the retrieval agent uses the correct embedding model and has sufficient memory to process the data from the Pinecone tool; I usually set it to 5.

6. Using ChatGPT to create the agent prompt

I always start with the prompt: 'You are a prompt engineering genius, you need to make a prompt for a RAG agent, which uses the X tool to receive the data.'

You help the user to do X.

Start with the mission.

Then, describe how to use the tool.

Give an example of the input and output.

7. Optimise the agent prompt

I then ask: 'If this prompt is an 8, what do you need to make it a 10?'

For those who are not into reading, I am filming a video on how to RAG, which will be available either next week or the week after.

Happy Building!
