Scaling RAG applications to serve millions of users
Lessons learned scaling LLMs, Vector databases and more…
What we’ll cover today…
📝 Agenda
- Who is SlideSpeak? What do we do?
- Our Growth so far
- Mistakes we’ve made
- Architecture Overview
- Challenges of scaling RAG applications
- Challenges of scaling Vector databases
- What we’re currently busy with…
- Q&A
What we do
SlideSpeak is an AI-powered platform to create and automate presentations.
Some of our key features:
- Create presentations from Word, PDF or Excel files
- Create presentations for any topic
We save people hours of work by speeding up manual presentation workflows.
OUR GROWTH SO FAR
We launched 6 months ago; here’s what we’ve achieved:
- >2m files uploaded
- >2m LLM tokens consumed per minute
- >1000 LLM calls per minute
- 250k MAUs
Quick recap on RAG
[Diagram of the RAG flow; labels: User, Query, Retrieval, Vector DB, Context, Prompt, LLM, Inference, Response]
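The retrieval and prompt steps of the recap can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not SlideSpeak’s implementation: the hand-written 3-dimensional vectors stand in for real embeddings, the in-memory list stands in for the vector DB, and `retrieve` and `build_prompt` are hypothetical helper names. The resulting prompt is what would be sent to the LLM for inference.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, top_k=2):
    """Retrieval step: return the top_k chunk texts most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]


def build_prompt(question, context_chunks):
    """Prompt step: put the retrieved context in front of the user's question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy "vector DB": made-up vectors, purely for illustration.
STORE = [
    {"text": "Q3 revenue grew 12%.", "vector": [0.9, 0.1, 0.0]},
    {"text": "The office moved to Berlin.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Q3 costs fell 3%.", "vector": [0.8, 0.3, 0.1]},
]

query_vector = [1.0, 0.2, 0.0]  # stand-in for the embedded user query
chunks = retrieve(query_vector, STORE)
prompt = build_prompt("How did Q3 go?", chunks)
print(prompt)
```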
Mistakes we’ve made… so far…
- Using LLMs for everything: if not absolutely necessary, avoid using LLMs
- Storing vector data forever: storing vectors is expensive, like really expensive…
- No monitoring: downtimes cost us thousands of $$$
Architecture
Challenges scaling RAG Applications
- Scaling LLM providers: rate limits, downtimes, …
- Scaling Vector Databases: slow queries, index building, …
- Prompts, JSON Mode, Regression: LLM-specific prompts, Pydantic being unreliable, …
Scaling LLM providers
The problem
OpenAI Rate Limits (6/10/24)
- Rate limits are not as high as they seem
- It is difficult to balance RPM (requests per minute) and TPM (tokens per minute)
- A 40-page document has on average 24k tokens; with a 2m tokens-per-minute limit that’s 83 documents per minute, or about 1.4 per second
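The arithmetic behind the document throughput can be checked with a small helper. A minimal sketch: the 24k tokens per document and the 2m tokens-per-minute limit come from the slide, while the `rpm_limit` value and the one-request-per-document assumption are placeholders for illustration, not actual OpenAI figures.

```python
def max_documents_per_minute(tpm_limit, rpm_limit, tokens_per_document,
                             requests_per_document=1):
    """Throughput is capped by whichever limit is hit first: tokens or requests."""
    by_tokens = tpm_limit / tokens_per_document
    by_requests = rpm_limit / requests_per_document
    return min(by_tokens, by_requests)


docs_per_minute = max_documents_per_minute(
    tpm_limit=2_000_000,         # from the slide
    rpm_limit=10_000,            # placeholder value, not an actual provider limit
    tokens_per_document=24_000,  # from the slide (average 40-page document)
)
docs_per_second = docs_per_minute / 60
print(f"{docs_per_minute:.1f} documents/minute, {docs_per_second:.2f} documents/second")
# prints "83.3 documents/minute, 1.39 documents/second"
```

With these inputs the token limit is the binding one. If each document needed many small requests instead, the request limit could become the bottleneck, which is why RPM and TPM are hard to balance.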
Scaling LLM providers
The solution
- We’ve migrated to Azure OpenAI
- Not because the rate limits are higher, but because you can load balance 🤯
https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models
Scaling LLM providers
Load Balancing in Azure
Source: https://techcommunity.microsoft.com/t5/fasttrack-for-azure/smart-load-balancing-for-openai-endpoints-and-azure-api/ba-p/3991616
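The general idea of load balancing across several deployments can be sketched in application code. This is an illustration under assumptions, not the setup from the linked Microsoft article and not SlideSpeak’s code: `LoadBalancer`, `RateLimited`, the deployment names and the `send` callable are all hypothetical. The balancer rotates through the backends and skips any backend that recently answered with a rate-limit error until its retry-after window has passed.

```python
import itertools
import time


class RateLimited(Exception):
    """Raised by the send function when a backend answers with HTTP 429."""

    def __init__(self, retry_after_seconds):
        super().__init__(f"rate limited, retry after {retry_after_seconds}s")
        self.retry_after_seconds = retry_after_seconds


class LoadBalancer:
    """Round-robin over backends, skipping those that are cooling down."""

    def __init__(self, backends, clock=time.monotonic):
        self._backends = list(backends)
        self._cycle = itertools.cycle(range(len(self._backends)))
        self._blocked_until = [0.0] * len(self._backends)
        self._clock = clock

    def call(self, send, payload):
        """Try each backend at most once, starting from the next in rotation."""
        for _ in range(len(self._backends)):
            index = next(self._cycle)
            if self._clock() < self._blocked_until[index]:
                continue  # still cooling down after a rate-limit response
            try:
                return send(self._backends[index], payload)
            except RateLimited as error:
                self._blocked_until[index] = self._clock() + error.retry_after_seconds
        raise RuntimeError("all backends are rate limited")


def fake_send(backend, payload):
    """Stand-in for a real API call: deployment-a is always rate limited."""
    if backend == "deployment-a":
        raise RateLimited(retry_after_seconds=30)
    return f"{backend} handled {payload}"


balancer = LoadBalancer(["deployment-a", "deployment-b"])
first = balancer.call(fake_send, "request-1")
second = balancer.call(fake_send, "request-2")
print(first)
print(second)
```

In a real setup `send` would wrap the provider client and translate a 429 response and its retry-after hint into `RateLimited`.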
Why do we need Vector Databases?
- Find similar information in a lot of data
- Feed that to the LLM as context
Challenges of scaling Vector Databases
5 Problems when scaling Postgres with PGVector
- 🦥 Slow queries
- 🧠 Memory-intensive
- 🔍 Combining vector search and metadata search (also called hybrid search) needs careful query optimization
A vector is an array of 4-byte floating point numbers. The sizes below assume 1536 dimensions per vector and are given in binary units.
Number of Vectors    Total Size
7 thousand           7,000 × 1536 × 4 bytes ≈ 41 MiB
1 million            1,000,000 × 1536 × 4 bytes ≈ 5.7 GiB
10 million           10,000,000 × 1536 × 4 bytes ≈ 57 GiB
1 billion            1,000,000,000 × 1536 × 4 bytes ≈ 5.6 TiB
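The storage figures can be reproduced with a short calculation. A minimal sketch: it counts only the raw float data (vectors × dimensions × 4 bytes) and reports binary units; index structures and per-row database overhead come on top and are not modelled here. The function names are hypothetical.

```python
def raw_vector_bytes(num_vectors, dimensions=1536, bytes_per_float=4):
    """Raw storage for the float data alone, excluding index and row overhead."""
    return num_vectors * dimensions * bytes_per_float


def human_readable(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


for count in (7_000, 1_000_000, 10_000_000, 1_000_000_000):
    print(f"{count:>13,} vectors: {human_readable(raw_vector_bytes(count))}")
```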
Build an index over the vector data
HNSW: an efficient approximate nearest neighbor (ANN) search algorithm for high-dimensional data.
- ef_construction: defines how many similarity candidates to look for while building the index
- m: defines how many of the closest neighbors to pick from the ef_construction list
** This might be trickier if you use hybrid search (metadata + vector data)
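For reference, an HNSW index in pgvector is created with a `CREATE INDEX ... USING hnsw` statement that takes `m` and `ef_construction` as options. The helper below only builds that statement as a string. It is a sketch under assumptions: the table name `document_chunks`, the column name `embedding` and the choice of cosine distance are hypothetical, and the values 16 and 64 are pgvector’s documented defaults rather than settings from the slides.

```python
def hnsw_index_sql(table, column, m=16, ef_construction=64,
                   opclass="vector_cosine_ops"):
    """Build a pgvector CREATE INDEX statement for an HNSW index.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});"
    )


print(hnsw_index_sql("document_chunks", "embedding"))
```

Higher values of either parameter generally improve recall at the cost of slower index builds and more memory, so they are worth tuning against your own data.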
What we’ve done to scale PGVector
Use partitioning
Define what could be a good partition value for you (date, filetype, category, …)
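As an illustration of partitioning by one of those values, the sketch below generates Postgres declarative-partitioning statements that split a table by file type. The table layout, the names `document_chunks` and `filetype`, and the partition values are assumptions for the example; the slides do not show the actual schema.

```python
def partition_ddl(parent, partition_column, values):
    """Return DDL for a LIST-partitioned table with one partition per value.

    Identifiers and values are interpolated directly, so pass trusted input only.
    """
    statements = [
        f"CREATE TABLE {parent} ("
        f"id bigint NOT NULL, {partition_column} text NOT NULL, embedding vector(1536)"
        f") PARTITION BY LIST ({partition_column});"
    ]
    for value in values:
        statements.append(
            f"CREATE TABLE {parent}_{value} PARTITION OF {parent} "
            f"FOR VALUES IN ('{value}');"
        )
    return statements


for statement in partition_ddl("document_chunks", "filetype", ["pdf", "docx"]):
    print(statement)
```

Queries that filter on the partition column then only touch the matching partition, and dropping old data can be done per partition.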
General tips when working with PGVector
01 Make sure to delete unnecessary vector data
02 Never use the same Postgres database for vectors and other data
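One possible way to act on the first tip is a retention rule: pick the documents whose vectors have not been used for a while and delete them. A minimal sketch, assuming a last-accessed timestamp is tracked per document; the 30-day window and the name `expired_document_ids` are hypothetical, not SlideSpeak’s policy.

```python
from datetime import datetime, timedelta, timezone


def expired_document_ids(last_accessed, now, max_age_days=30):
    """Return ids of documents not accessed within the retention window."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(doc_id for doc_id, seen in last_accessed.items() if seen < cutoff)


now = datetime(2024, 6, 10, tzinfo=timezone.utc)
last_accessed = {
    "doc-1": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "doc-2": datetime(2024, 3, 15, tzinfo=timezone.utc),
    "doc-3": datetime(2024, 5, 1, tzinfo=timezone.utc),
}
to_delete = expired_document_ids(last_accessed, now)
print(to_delete)  # ids whose vectors could be removed from the vector table
```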
What we’ve done to scale PGVector
If nothing helps… scale horizontally
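One common way to scale horizontally is to spread vectors over several database instances by hashing a stable key. The slides do not say how SlideSpeak routes data, so this is only a sketch of the general technique; `shard_for` and the key format are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a stable key (e.g. a document or user id) to a shard index.

    Uses SHA-256 rather than the built-in hash(), which is randomized per
    process for strings and would route the same key differently after a restart.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for document_id in ("doc-1", "doc-2", "doc-3"):
    print(document_id, "-> shard", shard_for(document_id, num_shards=4))
```

Plain modulo hashing remaps most keys when the number of shards changes, so resharding needs a migration plan or a scheme such as consistent hashing.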
What we’re currently busy with…
Here’s what currently keeps us up at night…
RAG Evaluation
– Implementing robust evaluation testing methods
– Exploring Ragas for advanced system performance
Azure AI Postgres Extension
– Enabling direct creation of embeddings within Azure
System Extension
– Extending the system to cover images and other non-textual data
Questions?
And now it’s your turn…

"Scaling RAG Applications to serve millions of users", Kevin Goedecke
