Chat with your data, privately and locally
Jay Rodge, Developer Advocate - LLMs | Unstructured Data Meetup
An LLM is a Deep Neural Network
Map from “all previous words” to “next word”

Next-word prediction examples:
• “Through hard work, he supported himself and his •••” → “family”
• “Because it crossed state lines, that criminal behavior attracted the attention of the •••” → “FBI”
• “Joe Biden, who in 2011 was the •••” → “Vice”
• “// loop over the string / int i; / for (i = 0; i < •••” → “strlen”
• “This restaurant was fabulous! My star rating is •••” → “five”

A few thousand previous words provide the context; a deep neural network built on the Transformer architecture predicts the next word or group of words.
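A minimal sketch of that next-token step, assuming the Hugging Face transformers library with GPT-2 as a stand-in model (any causal LM works the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "This restaurant was fabulous! My star rating is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)
next_token_id = int(logits[0, -1].argmax())    # most likely next token
print(tokenizer.decode(next_token_id))         # prints the model's top guess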
LLMs are Powerful Tools but Not Accurate Enough
Without a connection to enterprise data sources, LLMs cannot provide accurate information.

User → Prompt → Foundation Model → Response

Key risks:
• Risk of outdated information
• Hallucinations
• Lacking proprietary knowledge
Retrieval-Augmented Generation Workflow
Enable LLMs to provide up-to-date and domain-specific answers.

User → Prompt → Framework for LLMs → Foundation Model → Response
The framework embeds the prompt with a Text Embedding Model, searches a Vector Database built from Proprietary Data, and passes the Ranked Data to the Foundation Model along with the prompt.
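Concretely, the loop looks like this. A hedged sketch, where embed, search, and generate are hypothetical stand-ins for the embedding model, vector database, and foundation model boxes above (the concrete stack appears later in the deck):

from typing import Callable, List

def rag_answer(
    question: str,
    embed: Callable[[str], List[float]],              # text embedding model
    search: Callable[[List[float], int], List[str]],  # vector database lookup
    generate: Callable[[str], str],                   # foundation model
    top_k: int = 3,
) -> str:
    query_vec = embed(question)                # embed the user prompt
    context_chunks = search(query_vec, top_k)  # ranked proprietary data
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n".join(context_chunks)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)                    # grounded response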
BENEFITS OF LOCAL AI ON RTX

Low Latency
• High responsiveness for latency-sensitive applications
• No network quality dependency

Always Available
• AI goes wherever the user goes
• No connectivity interruptions

Data Privacy/Locality
• Private/proprietary data stays on device
• No data uploads to cloud

No Server Costs
• Reduce server costs by moving compute to device
• Access more compute without growing your budget
RAG on NVIDIA RTX
using TensorRT-LLM, Milvus and LlamaIndex
RAG ON RTX MACHINES

The same workflow, instantiated with local components:
• Framework for LLMs: LlamaIndex
• Foundation Model: TRT-LLM-optimized Llama2
• Vector Database: Milvus
• Text Embedding Model: HF MiniLM L6 v2
The user's prompt is embedded, matched against Proprietary Data in the vector database, and the Ranked Data grounds the foundation model's response.
The Milvus vector database stage of this pipeline is GPU-accelerated with NVIDIA RAPIDS.
Milvus: Cloud-Native Vector Database

[Diagram: unstructured data (text, images, video, product databases) is converted into embeddings such as [ 1, 0, 3, 5 ], [ 1, 4, 6, 9 ], [ 4, 6, 2, 5 ], [ 3, 8, 6, 1 ]; the vector database handles indexing, retrieving, and querying of those embeddings for large language models, recommendation systems, and computer vision apps]
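In code, those three operations look roughly like this. A minimal sketch with pymilvus's MilvusClient (API per pymilvus 2.4, assuming a local server on port 19530; the 4-dim vectors echo the diagram):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")
client.create_collection(collection_name="docs", dimension=4)

# Indexing: store embeddings alongside their source payload
client.insert(collection_name="docs", data=[
    {"id": 0, "vector": [1.0, 0.0, 3.0, 5.0], "text": "doc A"},
    {"id": 1, "vector": [1.0, 4.0, 6.0, 9.0], "text": "doc B"},
    {"id": 2, "vector": [4.0, 6.0, 2.0, 5.0], "text": "doc C"},
    {"id": 3, "vector": [3.0, 8.0, 6.0, 1.0], "text": "doc D"},
])

# Retrieving/querying: nearest neighbors of a query embedding
hits = client.search(collection_name="docs", data=[[1.0, 0.0, 3.0, 4.0]],
                     limit=2, output_fields=["text"])
print(hits)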
Milvus: Cloud-Native Vector Database
Milvus 2.4 brings Next-Gen GPU Indexing

[Charts: index building time in seconds (lower is better) and vector search throughput in queries per second (higher is better), comparing CPU (HNSW) against GPU (CAGRA) on the OpenAI 500K 1536-dim and Cohere 1M 768-dim datasets at batch sizes 1 and 100. GPU: NVIDIA A10G, CPU: Intel Xeon 8375C (Ice Lake), SW: Milvus 2.4 (Source)]
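As a sketch, switching a collection to the new GPU index with pymilvus might look like this (index type and parameter names per the Milvus 2.4 docs; requires a GPU-enabled Milvus build, and the parameter values here are illustrative):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="GPU_CAGRA",        # RAPIDS-based GPU graph index
    metric_type="L2",
    params={"intermediate_graph_degree": 64, "graph_degree": 32},
)
client.create_index(collection_name="docs", index_params=index_params)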
Milvus: Cloud-Native Vector Database
Getting Started

Step 1: Start the Milvus server (a hedged example below)
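One way to do this, assuming Docker and the official standalone compose file (the release tag below is illustrative; check milvus.io for the current instructions):

wget https://github.com/milvus-io/milvus/releases/download/v2.4.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
docker compose up -d   # Milvus standalone now listens on localhost:19530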
Step 2: Query through the client
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    host="127.0.0.1",
    port=19530,
    dim=384,  # matches the 384-dim MiniLM-L6-v2 embedding model used here
)
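From there, the vector store plugs into a LlamaIndex index. A minimal sketch, assuming llama-index 0.10-style imports and documents under ./data:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext

# Wrap the Milvus store in a storage context and index local documents into it
storage_context = StorageContext.from_defaults(vector_store=vector_store)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Query: retrieval from Milvus plus generation by the configured LLM
print(index.as_query_engine().query("What do my documents say about X?"))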
Optimizing the LLM with TensorRT-LLM

[Diagram: the same RAG pipeline (User, LlamaIndex framework for LLMs, Milvus vector database, HF MiniLM L6 v2 text embedding model, proprietary data, ranked data), now focused on optimizing the Foundation Model stage]
TensorRT-LLM in the DL Compiler Ecosystem
TensorRT-LLM builds on TensorRT compilation

TensorRT-LLM (only for LLMs)
• Built on top of TensorRT
• Leverages TensorRT for general graph optimizations and fast kernels
• Adds LLM-specific optimizations: KV caching and custom MHA kernels; in-flight batching and paged KV cache (attention); multi-GPU, multi-node; and more

TensorRT (all AI workloads)
• General-purpose deep learning inference compiler
• Graph rewriting, constant folding, kernel fusion
• Optimized GEMMs and pointwise kernels
• Kernel auto-tuning
• Memory optimizations
• Multi-stream execution
• and more
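For a feel of the developer surface, recent TensorRT-LLM releases also ship a high-level Python API. A hedged sketch (the LLM API's exact surface varies by release, and the model name is illustrative):

from tensorrt_llm import LLM, SamplingParams

# Builds (or loads a cached) optimized engine for the model, then generates
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
outputs = llm.generate(
    ["What is retrieval-augmented generation?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)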
TENSORRT-LLM FOR WINDOWS
Fastest LLM inference backend comes to RTX

• Up to 5X faster performance
• Top models optimized: Llama 2, CodeLlama, Mistral, Gemma, Phi-2, ChatGLM2
• Available for download on GitHub/NVIDIA
• Integrated with popular open-source developer ecosystems such as Jan.ai and Oobabooga

[Charts: LLM inference performance in tokens/s on GeForce RTX 40 series (RTX 4060 and RTX 4090), previous leading backend vs. TensorRT-LLM, at batch size 1 and batch size 8. Llama 2 7B Int4 inference performance, INSEQ=100, OUTSEQ=100; previous leading backend is llama.cpp for BS=1 and HF xformers AutoGPTQ for BS=8]
Quantization
Supported Precisions & Models

• Utilizes the Hopper FP8 “Transformer Engine”
• Supports many 8-bit and 4-bit methods: FP8, INT8/INT4 weight-only, INT8 SmoothQuant, AWQ, GPTQ (support varies by model)
• Reduces model size, memory bandwidth, and compute
• Improves performance and allows for larger models per GPU
• Model optimization toolkit to quantize pre-trained models
• Precision documentation

[Table: precision support on various GPU architectures]
[Table: model quantization support with TensorRT-LLM]
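A representative quantization command from the TensorRT-LLM examples, shown as a sketch (script path, flags, and directories are illustrative and vary by release):

python examples/quantization/quantize.py \
    --model_dir ./Llama-2-13b-hf \
    --qformat int4_awq \
    --output_dir ./llama2-13b-awq-ckpt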
Optimizing Llama2 13B with TensorRT-LLM

Model from Hugging Face or NVIDIA NGC → Quantization → quantized checkpoint → TensorRT Engine Builder → Llama2.engine
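Continuing the sketch above, the quantized checkpoint then goes through the engine builder; trtllm-build is the TensorRT-LLM CLI, though the flags and paths here are illustrative:

trtllm-build \
    --checkpoint_dir ./llama2-13b-awq-ckpt \
    --output_dir ./llama2-engine \
    --gemm_plugin float16
# emits the serialized engine (the "Llama2.engine" in the diagram)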
Chat with your data, privately and locally
RAG on RTX Machines

[Diagram: the complete local pipeline: User, LlamaIndex framework for LLMs, TRT-LLM-optimized Llama2 7B foundation model, Milvus vector database, HF MiniLM L6 v2 text embedding model, proprietary data and ranked data]
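A single hedged sketch of how these pieces could be wired together in LlamaIndex (package layout per llama-index 0.10-style integrations; the LocalTensorRTLLM class, its parameters, and all paths are assumptions to verify against the current docs):

from llama_index.core import (
    Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM  # assumed integration
from llama_index.vector_stores.milvus import MilvusVectorStore

# 384-dim MiniLM embeddings, matching the Milvus collection dimension
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
# TensorRT-LLM engine built earlier (paths are illustrative)
Settings.llm = LocalTensorRTLLM(
    model_path="./llama2-engine", tokenizer_dir="./Llama-2-13b-hf")

vector_store = MilvusVectorStore(host="127.0.0.1", port=19530, dim=384)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./data").load_data(),
    storage_context=storage_context)

print(index.as_query_engine().query("Summarize my documents."))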
LlamaIndex
Data Framework for building LLM applications
GET STARTED
• Developer RAG project
• Multimodal RAG example
• Experiment with AI foundation models
AI Decoded
Your guide to the latest AI advancements powered by RTX. Get weekly updates directly in your inbox by subscribing to the AI Decoded newsletter at nvda.ws/3VcIk7C
Thank You