Chat with your data, privately and locally
Jay Rodge, Developer Advocate - LLMs | Unstructured Data Meetup
An LLM is a Deep Neural Network
Map from “all previous words” to “next word”

Next-word prediction examples:
• “Through hard work, he supported himself and his •••” → “family”
• “Because it crossed state lines, that criminal behavior attracted the attention of the •••” → “FBI”
• “Joe Biden, who in 2011 was the •••” → “Vice”
• “// loop over the string / int i; / for (i = 0; i < •••” → “strlen”
• “This restaurant was fabulous! My star rating is •••” → “five”

A few thousand previous words provide the context; a deep neural network built on the Transformer architecture predicts the next word or group of words.
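A minimal sketch of that next-token step, assuming the Hugging Face transformers library with GPT-2 as a stand-in model (any causal LM works the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "This restaurant was fabulous! My star rating is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)
next_token_id = int(logits[0, -1].argmax())    # most likely next token
print(tokenizer.decode(next_token_id))         # prints the model's top guess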
LLMs are Powerful Tools but Not Accurate Enough
Without a connection to enterprise data sources, LLMs cannot provide accurate information.

User → Prompt → Foundation Model → Response

Key risks:
• Risk of outdated information
• Hallucinations
• Lacking proprietary knowledge
Retrieval-Augmented Generation Workflow
Enable LLMs to provide up-to-date and domain-specific answers.

User → Prompt → Framework for LLMs → Foundation Model → Response
The framework embeds the prompt with a Text Embedding Model, searches a Vector Database built from Proprietary Data, and passes the Ranked Data to the Foundation Model along with the prompt.
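Concretely, the loop looks like this. A hedged sketch, where embed, search, and generate are hypothetical stand-ins for the embedding model, vector database, and foundation model boxes above (the concrete stack appears later in the deck):

from typing import Callable, List

def rag_answer(
    question: str,
    embed: Callable[[str], List[float]],              # text embedding model
    search: Callable[[List[float], int], List[str]],  # vector database lookup
    generate: Callable[[str], str],                   # foundation model
    top_k: int = 3,
) -> str:
    query_vec = embed(question)                # embed the user prompt
    context_chunks = search(query_vec, top_k)  # ranked proprietary data
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n".join(context_chunks)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)                    # grounded response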
BENEFITS OF LOCAL AI ON RTX

Low Latency
• High responsiveness for latency-sensitive applications
• No network quality dependency

Always Available
• AI goes wherever the user goes
• No connectivity interruptions

Data Privacy/Locality
• Private/proprietary data stays on device
• No data uploads to cloud

No Server Costs
• Reduce server costs by moving compute to device
• Access more compute without growing your budget
RAG on NVIDIA RTX
using TensorRT-LLM, Milvus and LlamaIndex
RAG ON RTX MACHINES

The same workflow, instantiated with local components:
• Framework for LLMs: LlamaIndex
• Foundation Model: TRT-LLM-optimized Llama2
• Vector Database: Milvus
• Text Embedding Model: HF MiniLM L6 v2
The user's prompt is embedded, matched against Proprietary Data in the vector database, and the Ranked Data grounds the foundation model's response.
The Milvus vector database stage of this pipeline is GPU-accelerated with NVIDIA RAPIDS.
Milvus: Cloud-Native Vector Database

[Diagram: unstructured data (text, images, video, product databases) is converted into embeddings such as [ 1, 0, 3, 5 ], [ 1, 4, 6, 9 ], [ 4, 6, 2, 5 ], [ 3, 8, 6, 1 ]; the vector database handles indexing, retrieving, and querying of those embeddings for large language models, recommendation systems, and computer vision apps]
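In code, those three operations look roughly like this. A minimal sketch with pymilvus's MilvusClient (API per pymilvus 2.4, assuming a local server on port 19530; the 4-dim vectors echo the diagram):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")
client.create_collection(collection_name="docs", dimension=4)

# Indexing: store embeddings alongside their source payload
client.insert(collection_name="docs", data=[
    {"id": 0, "vector": [1.0, 0.0, 3.0, 5.0], "text": "doc A"},
    {"id": 1, "vector": [1.0, 4.0, 6.0, 9.0], "text": "doc B"},
    {"id": 2, "vector": [4.0, 6.0, 2.0, 5.0], "text": "doc C"},
    {"id": 3, "vector": [3.0, 8.0, 6.0, 1.0], "text": "doc D"},
])

# Retrieving/querying: nearest neighbors of a query embedding
hits = client.search(collection_name="docs", data=[[1.0, 0.0, 3.0, 4.0]],
                     limit=2, output_fields=["text"])
print(hits)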
Milvus: Cloud-Native Vector Database
Milvus 2.4 brings Next-Gen GPU Indexing

[Charts: index building time in seconds (lower is better) and vector search throughput in queries per second (higher is better), comparing CPU (HNSW) against GPU (CAGRA) on the OpenAI 500K 1536-dim and Cohere 1M 768-dim datasets at batch sizes 1 and 100. GPU: NVIDIA A10G, CPU: Intel Xeon 8375C (Ice Lake), SW: Milvus 2.4 (Source)]
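As a sketch, switching a collection to the new GPU index with pymilvus might look like this (index type and parameter names per the Milvus 2.4 docs; requires a GPU-enabled Milvus build, and the parameter values here are illustrative):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="GPU_CAGRA",        # RAPIDS-based GPU graph index
    metric_type="L2",
    params={"intermediate_graph_degree": 64, "graph_degree": 32},
)
client.create_index(collection_name="docs", index_params=index_params)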
Milvus: Cloud-Native Vector Database
Getting Started

Step 1: Start the Milvus server (a hedged example below)
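One way to do this, assuming Docker and the official standalone compose file (the release tag below is illustrative; check milvus.io for the current instructions):

wget https://github.com/milvus-io/milvus/releases/download/v2.4.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
docker compose up -d   # Milvus standalone now listens on localhost:19530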
Step 2: Query through the client
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    host="127.0.0.1",
    port=19530,
    dim=384,  # matches the 384-dim MiniLM-L6-v2 embedding model used here
)
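From there, the vector store plugs into a LlamaIndex index. A minimal sketch, assuming llama-index 0.10-style imports and documents under ./data:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext

# Wrap the Milvus store in a storage context and index local documents into it
storage_context = StorageContext.from_defaults(vector_store=vector_store)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Query: retrieval from Milvus plus generation by the configured LLM
print(index.as_query_engine().query("What do my documents say about X?"))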
Optimizing the LLM with TensorRT-LLM

[Diagram: the same RAG pipeline (User, LlamaIndex framework for LLMs, Milvus vector database, HF MiniLM L6 v2 text embedding model, proprietary data, ranked data), now focused on optimizing the Foundation Model stage]
TensorRT-LLM in the DL Compiler Ecosystem
TensorRT-LLM builds on TensorRT compilation

TensorRT-LLM (only for LLMs)
• Built on top of TensorRT
• Leverages TensorRT for general graph optimizations and fast kernels
• Adds LLM-specific optimizations: KV caching and custom MHA kernels; in-flight batching and paged KV cache (attention); multi-GPU, multi-node; and more

TensorRT (all AI workloads)
• General-purpose deep learning inference compiler
• Graph rewriting, constant folding, kernel fusion
• Optimized GEMMs and pointwise kernels
• Kernel auto-tuning
• Memory optimizations
• Multi-stream execution
• and more
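For a feel of the developer surface, recent TensorRT-LLM releases also ship a high-level Python API. A hedged sketch (the LLM API's exact surface varies by release, and the model name is illustrative):

from tensorrt_llm import LLM, SamplingParams

# Builds (or loads a cached) optimized engine for the model, then generates
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
outputs = llm.generate(
    ["What is retrieval-augmented generation?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)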
TENSORRT-LLM FOR WINDOWS
Fastest LLM inference backend comes to RTX

• Up to 5X faster performance
• Top models optimized: Llama 2, CodeLlama, Mistral, Gemma, Phi-2, ChatGLM2
• Available for download on GitHub/NVIDIA
• Integrated with popular open-source developer ecosystems such as Jan.ai and Oobabooga

[Charts: LLM inference performance in tokens/s on GeForce RTX 40 series (RTX 4060 and RTX 4090), previous leading backend vs. TensorRT-LLM, at batch size 1 and batch size 8. Llama 2 7B Int4 inference performance, INSEQ=100, OUTSEQ=100; previous leading backend is llama.cpp for BS=1 and HF xformers AutoGPTQ for BS=8]
Quantization
Supported Precisions & Models

• Utilizes the Hopper FP8 “Transformer Engine”
• Supports many 8-bit and 4-bit methods: FP8, INT8/INT4 weight-only, INT8 SmoothQuant, AWQ, GPTQ (support varies by model)
• Reduces model size, memory bandwidth, and compute
• Improves performance and allows for larger models per GPU
• Model optimization toolkit to quantize pre-trained models
• Precision documentation

[Table: precision support on various GPU architectures]
[Table: model quantization support with TensorRT-LLM]
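A representative quantization command from the TensorRT-LLM examples, shown as a sketch (script path, flags, and directories are illustrative and vary by release):

python examples/quantization/quantize.py \
    --model_dir ./Llama-2-13b-hf \
    --qformat int4_awq \
    --output_dir ./llama2-13b-awq-ckpt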
Optimizing Llama2 13B with TensorRT-LLM

Model from Hugging Face or NVIDIA NGC → Quantization → quantized checkpoint → TensorRT Engine Builder → Llama2.engine
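Continuing the sketch above, the quantized checkpoint then goes through the engine builder; trtllm-build is the TensorRT-LLM CLI, though the flags and paths here are illustrative:

trtllm-build \
    --checkpoint_dir ./llama2-13b-awq-ckpt \
    --output_dir ./llama2-engine \
    --gemm_plugin float16
# emits the serialized engine (the "Llama2.engine" in the diagram)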
Chat with your data, privately and locally
RAG on RTX Machines

[Diagram: the complete local pipeline: User, LlamaIndex framework for LLMs, TRT-LLM-optimized Llama2 7B foundation model, Milvus vector database, HF MiniLM L6 v2 text embedding model, proprietary data and ranked data]
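A single hedged sketch of how these pieces could be wired together in LlamaIndex (package layout per llama-index 0.10-style integrations; the LocalTensorRTLLM class, its parameters, and all paths are assumptions to verify against the current docs):

from llama_index.core import (
    Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM  # assumed integration
from llama_index.vector_stores.milvus import MilvusVectorStore

# 384-dim MiniLM embeddings, matching the Milvus collection dimension
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
# TensorRT-LLM engine built earlier (paths are illustrative)
Settings.llm = LocalTensorRTLLM(
    model_path="./llama2-engine", tokenizer_dir="./Llama-2-13b-hf")

vector_store = MilvusVectorStore(host="127.0.0.1", port=19530, dim=384)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./data").load_data(),
    storage_context=storage_context)

print(index.as_query_engine().query("Summarize my documents."))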
LlamaIndex
Data Framework for building LLM applications
GET STARTED
• Developer RAG project
• Multimodal RAG example
• Experiment with AI foundation models
AI Decoded
Your guide to the latest AI advancements powered by RTX. Get weekly updates directly in your inbox by subscribing to the AI Decoded newsletter at nvda.ws/3VcIk7C
Thank You