"Scaling RAG Applications to serve millions of users", Kevin Goedecke

Scaling RAG applications to serve millions of users
Lessons learned scaling LLMs, Vector databases and more…

📝 Agenda
What we’ll cover today…
- Who is SlideSpeak? What do we do?
- Our Growth so far
- Mistakes we’ve made
- Architecture Overview
- Challenges of scaling RAG applications
- Challenges of scaling Vector databases
- What we’re currently busy with…
- Q&A

What we do
SlideSpeak is an AI-powered platform to create and automate presentations
Some of our key features:
- Create presentations from Word, PDF or Excel files
- Create presentations for any topic

We save people hours of work by speeding up manual presentation workflows

OUR GROWTH SO FAR
We launched 6 months ago, here’s what we’ve achieved:
- >2m files uploaded
- >2m LLM tokens consumed per minute
- >1000 LLM calls per minute
- 250k MAUs

Quick recap on RAG
[Diagram: User → Query → Retrieval (Vector DB) → Context → Prompt → Inference (LLM) → Response]

Mistakes we’ve made… so far…
- Using LLMs for everything
  If not absolutely necessary, avoid using LLMs
- Storing vector data forever
  Storing vectors is expensive, like really expensive…
- No monitoring
  Downtimes cost us thousands of $$$

Challenges scaling RAG Applications
- Scaling LLM providers
  Rate limits, Downtimes, …
- Scaling Vector Databases
  Slow queries, index building, …
- Prompts, JSON Mode, Regression
  LLM specific prompts, Pydantic being unreliable, …

Scaling LLM providers
The problem
OpenAI Rate Limits (6/10/24)
- Rate limits are not as high as they seem
- Difficult to balance RPM (requests per minute) and TPM (tokens per minute)
- A 40-page document has on avg. 24k tokens; with a 2m tokens-per-minute limit that’s 83 documents per minute, or about 1.4 per second
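The arithmetic behind the document throughput figure can be checked with a few lines of Python. This is a minimal sketch: the 2m TPM limit and the 24k tokens per document come from the slide, while the RPM limit of 10,000 is a hypothetical value used only to show how the two limits interact.

```python
def docs_per_minute(tpm_limit: int, tokens_per_doc: int) -> float:
    """Documents that fit into the tokens-per-minute budget."""
    return tpm_limit / tokens_per_doc


def effective_requests_per_minute(rpm_limit: int, tpm_limit: int,
                                  tokens_per_request: int) -> float:
    """Whichever limit is hit first caps the real throughput."""
    return min(rpm_limit, tpm_limit / tokens_per_request)


TPM_LIMIT = 2_000_000     # from the slide
TOKENS_PER_DOC = 24_000   # from the slide (40-page document on average)
RPM_LIMIT = 10_000        # hypothetical, for illustration only

per_minute = docs_per_minute(TPM_LIMIT, TOKENS_PER_DOC)
per_second = per_minute / 60
capped = effective_requests_per_minute(RPM_LIMIT, TPM_LIMIT, TOKENS_PER_DOC)

print(f"{per_minute:.1f} documents/minute, {per_second:.2f} documents/second")
print(f"effective requests/minute with both limits: {capped:.1f}")
```

With large documents the token budget is exhausted long before the request budget, which is why a generous-looking RPM limit says little about real capacity.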

Scaling LLM providers
The solution
- We’ve migrated to Azure OpenAI
- Not because the rate limits are higher, but because you can load balance 🤯
https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models

Scaling LLM providers
Load Balancing in Azure
Source: https://techcommunity.microsoft.com/t5/fasttrack-for-azure/smart-load-balancing-for-openai-endpoints-and-azure-api/ba-p/3991616
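The idea of balancing across several deployments can be sketched in plain Python. This is not the Azure implementation from the linked article, only an illustration under assumptions: each backend is a hypothetical callable returning `(status, body, retry_after_seconds)`, the region names are made up, and a backend that answers 429 is skipped until its cooldown has passed.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Backend:
    name: str
    # Hypothetical transport: returns (http_status, body, retry_after_seconds)
    send: Callable[[str], Tuple[int, str, float]]
    available_at: float = 0.0  # monotonic time when the backend may be used again


class LoadBalancer:
    def __init__(self, backends: List[Backend], clock=time.monotonic):
        self.backends = backends
        self.clock = clock
        self._next = 0  # index of the backend to try first on the next call

    def call(self, prompt: str) -> Tuple[str, str]:
        n = len(self.backends)
        for offset in range(n):
            backend = self.backends[(self._next + offset) % n]
            if backend.available_at > self.clock():
                continue  # still cooling down after a 429
            status, body, retry_after = backend.send(prompt)
            if status == 429:
                # Rate limited: park this backend and try the next one
                backend.available_at = self.clock() + retry_after
                continue
            self._next = (self._next + offset + 1) % n
            return backend.name, body
        raise RuntimeError("all backends are rate limited")


def throttled(prompt: str) -> Tuple[int, str, float]:
    return 429, "", 30.0  # simulated rate-limited deployment


def healthy(prompt: str) -> Tuple[int, str, float]:
    return 200, f"ok: {prompt}", 0.0  # simulated healthy deployment


lb = LoadBalancer([Backend("eastus", throttled), Backend("westeurope", healthy)])
first = lb.call("hello")
second = lb.call("hello again")
print(first, second)
```

Both calls end up on the healthy backend: the first one falls through after the 429, the second one skips the throttled backend without sending a request at all.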

Why do we need Vector Databases?
- Find similar information in a lot of data
- Feed that to the LLM as context
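The two steps above can be sketched without any database. This is a toy illustration: the three-dimensional vectors, the example chunks and the prompt wording are all made up, and a real system would use embeddings from a model and a vector index instead of a full scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]], k: int = 2) -> List[str]:
    """Step 1: find the k chunks most similar to the query vector."""
    ranked = sorted(chunks,
                    key=lambda c: cosine_similarity(query_vec, c[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Step 2: feed the retrieved chunks to the LLM as context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up chunks with made-up 3-dimensional embeddings
chunks = [
    ("Q3 revenue grew 12%", [0.9, 0.1, 0.0]),
    ("Office moved to Berlin", [0.0, 0.2, 0.9]),
    ("Q3 costs fell 3%", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded question

retrieved = top_k(query_vec, chunks, k=2)
prompt = build_prompt("How did Q3 go?", retrieved)
print(prompt)
```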

Challenges of scaling Vector Databases
5 Problems when scaling Postgres with PGVector
- 🦥 Slow queries
- 🧠 Memory intensive
- 🔍 Combining vector search and metadata search (also called hybrid search) needs careful query optimization

A vector is an array of 4-byte floating point numbers (1536 dimensions in the examples below).
Number of Vectors    Total Size
7 Thousand           7,000 × 1536 × 4 bytes ≈ 43 MB
1 Million            1,000,000 × 1536 × 4 bytes ≈ 6.1 GB (5.7 GiB)
10 Million           10,000,000 × 1536 × 4 bytes ≈ 61 GB (57 GiB)
1 Billion            1,000,000,000 × 1536 × 4 bytes ≈ 6.1 TB (5.6 TiB)
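The sizes in the table follow directly from vectors × dimensions × bytes per float. A small sketch to reproduce them in decimal units (the function names are illustrative; the figures cover raw vector data only, before any index or row overhead):

```python
def vector_storage_bytes(num_vectors: int, dimensions: int = 1536,
                         bytes_per_float: int = 4) -> int:
    """Raw size of the vector data, without index or row overhead."""
    return num_vectors * dimensions * bytes_per_float


def human_readable(num_bytes: float) -> str:
    """Format a byte count using decimal units (1 KB = 1000 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


for count in (7_000, 1_000_000, 10_000_000, 1_000_000_000):
    print(f"{count:>13,} vectors: {human_readable(vector_storage_bytes(count))}")
```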

Build an index over the vector data
HNSW: Efficient approximate nearest neighbor (ANN) search algorithm for high-dimensional data.
ef_construction: Defines how many candidate neighbors are considered while building the index
m: Defines how many of the closest neighbors from the ef_construction candidate list each vector gets linked to
Note: This might be more tricky if you use hybrid search (metadata + vector data)
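In pgvector both parameters are set when the index is created. The sketch below only builds the SQL statements as strings; the table name `document_chunks` and the column `embedding` are hypothetical, and the values m = 16 and ef_construction = 64 are pgvector's defaults, not the values used by SlideSpeak. Executing the statements needs a Postgres connection with the pgvector extension installed.

```python
def hnsw_index_sql(table: str, column: str,
                   m: int = 16, ef_construction: int = 64) -> str:
    """CREATE INDEX statement for a pgvector HNSW index using cosine distance."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def nearest_neighbors_sql(table: str, column: str, limit: int = 5) -> str:
    """Query template; <=> is pgvector's cosine distance operator."""
    return f"SELECT id FROM {table} ORDER BY {column} <=> %s LIMIT {limit};"


# Hypothetical table and column names
create_index = hnsw_index_sql("document_chunks", "embedding")
search = nearest_neighbors_sql("document_chunks", "embedding")
print(create_index)
print(search)
```

Higher values for either parameter tend to improve recall at the cost of slower index builds and more memory, so they are worth tuning against your own data.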

What we’ve done to scale PGVector
Use partitioning
Define what could be a good partition value for you (date, filetype, category, …)
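As an illustration of partitioning by one of the values mentioned above, here is a sketch that generates Postgres declarative partitioning DDL with `filetype` as the partition key. The table layout and names are hypothetical, not SlideSpeak's schema, and the statements are only built as strings here.

```python
from typing import List


def partition_ddl(filetypes: List[str], dimensions: int = 1536) -> List[str]:
    """DDL for a list-partitioned chunk table, one partition per file type."""
    statements = [
        "CREATE TABLE document_chunks ("
        "id bigserial, filetype text NOT NULL, "
        f"embedding vector({dimensions})"
        ") PARTITION BY LIST (filetype);"
    ]
    for filetype in filetypes:
        statements.append(
            f"CREATE TABLE document_chunks_{filetype} "
            f"PARTITION OF document_chunks FOR VALUES IN ('{filetype}');"
        )
    return statements


ddl = partition_ddl(["pdf", "docx", "xlsx"])
for statement in ddl:
    print(statement)
```

A query that filters on `filetype` then only has to touch the matching partition and its index, which keeps each index smaller.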

General tips when working with PGVector
01 Make sure to delete unnecessary vector data
02 Never use the same Postgres database for Vectors and other data
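One way to act on the first tip is a scheduled cleanup job. The sketch below only prepares a parameterized DELETE; the table `document_chunks`, the column `last_accessed_at` and the 30-day retention period are all assumptions for illustration, and the statement would be run through whatever Postgres driver you use.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def stale_vectors_delete(retention_days: int,
                         now: Optional[datetime] = None) -> Tuple[str, tuple]:
    """Return a parameterized DELETE and its parameters for stale vector rows."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Hypothetical table and column names
    sql = "DELETE FROM document_chunks WHERE last_accessed_at < %s;"
    return sql, (cutoff,)


sql, params = stale_vectors_delete(
    30, now=datetime(2024, 6, 10, tzinfo=timezone.utc)
)
print(sql, params[0].isoformat())
```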

What we’ve done to scale PGVector
If nothing helps… Scale horizontally

What we’re currently busy with…
Here’s what currently keeps us up at night…
- RAG Evaluation
  - Implementing robust evaluation testing methods
  - Exploring Ragas for advanced system performance
- Azure AI Postgres Extension
  - Enabling direct creation of embeddings within Azure
- System Extension
  - Extend system to cover images and other non-textual data

Questions?
And now it’s your turn…
