Keeping Data Fresh: Mastering Updates in Vector Databases

1 | © Copyright 8/16/23 Zilliz
Stephen Batifol | Zilliz
MLOps.community, Zilliz Summit

Stephen Batifol
Developer Advocate, Zilliz
stephen.batifol@zilliz.com
https://www.linkedin.com/in/stephen-batifol/
https://twitter.com/stephenbtl
Speaker

27K GitHub Stars
25M Downloads
250 Contributors
2,600+ Forks
Milvus is an open-source vector database for GenAI projects. pip install on your
laptop, plug into popular AI dev tools, and push to production with a single line of
code.
• Easy Setup: pip install to start coding in a notebook within seconds.
• Reusable Code: Write once, and deploy with one line of code into the
production environment.
• Integration: Plug into OpenAI, LangChain, LlamaIndex, and many more.
• Feature-rich: Dense & sparse embeddings, filtering, reranking, and beyond.

Milvus Lite
pip install pymilvus

Seamless integration with all popular AI toolkits

Zilliz Cloud is a fully-managed vector
database built on top of OSS Milvus
• Open Source: Stable Milvus versions are continuously deployed to
Zilliz Cloud
• Flexible & Secure Deployment: Enterprise features for production-ready
• Cardinal Search Engine & Use Case Optimized Compute: Milvus completely
re-engineered to be optimized
• Pipelines, Connectors, Model Library: A streamlined unstructured data
platform

RAG (Retrieval Augmented Generation)

Basic Idea
Use RAG to force the LLM to work with your data
by injecting it via a vector database like Milvus
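
The idea above can be sketched in a few lines. This is a toy illustration, not Milvus code: the in-memory STORE list stands in for the vector database, the three-dimensional vectors are hand-made placeholders for real embeddings, and the names retrieve and build_prompt are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Inject the retrieved chunks into the prompt sent to the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy store: (text, hand-made "embedding"). A real system would use an
# embedding model and a vector database instead of this list.
STORE = [
    ("Milvus stores vectors in segments.", [1.0, 0.0, 0.0]),
    ("Upserts replace an entity by primary key.", [0.0, 1.0, 0.0]),
    ("Bananas are yellow.", [0.0, 0.0, 1.0]),
]
```

Calling build_prompt("How is data stored?", retrieve([0.9, 0.1, 0.0], STORE, k=2)) yields a prompt that contains only the two most relevant chunks.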

Basic RAG Architecture

Why Even Use a Vector DB?
Beyond High-Performance Search
• CRUD Operations: Just like traditional databases, vector
databases allow you to Create, Read, Update, and Delete data.
• Data Freshness: Vector databases ensure your data remains
up-to-date, reflecting the latest information for accurate searches.
• Persistence: Your data is securely stored and persists even if the
system restarts.
• Availability: Your data is readily accessible for search and retrieval
operations.
• Scalability: Vector databases can handle growing data volumes
efficiently.

Complete Data Management
• Data Management: Vector databases provide tools to manage
your data effectively, including data ingestion, indexing, and
querying.
• Backup and Migration: Create backups of your data for disaster
recovery and easily migrate your data between different systems.

Operational ease
• Cloud or On-Premise Deployment: Vector databases can be
deployed easily on various platforms, including cloud and
on-premise environments.
• Observability: Monitor the health and performance of your vector
database to ensure optimal operation.
• Multi-tenancy: Support multiple users or applications accessing
the same database instance securely.

Database + Vector Plugin?
• Limited functionality - Filtering, Hybrid search, Range
search
• Scalability concerns - sharing vector data has its own unique
challenges
• LLMs are evolving fast, and it's hard to keep up!

Design Principles
• Disaggregate storage and computation
• Rely fully on mature storage systems
• Microservices - scale by functionality
• Separate streaming and historical data
• Pluggable engine, storage, and index
• Log as data

Milvus Architecture
[Architecture diagram; components shown:]
• Access Layer: SDK, Load Balancer, Proxy
• Coordinator Service: Root, Query, Data, Index
• Worker Nodes: Query Node, Data Node, Index Node
• Meta Storage: etcd
• Message Storage: Log Broker
• Object Storage (Minio / S3 / AzureBlob): Log Snapshot, Delta File,
Index File
• Message flows: DDL/DCL, DML, NOTIFICATION, CONTROL SIGNAL

Entity

Milvus Data Layout - Shard
Each shard is managed by a supervisor (shard leader). This supervisor is
responsible for:
• Adding new information to the shard.
• Regularly storing the data in a safe place (object storage).
• Serving the latest information for search requests.
• Forwarding historical data requests to other cabinets (query nodes) if
needed.

Milvus Data Layout - Segments
Growing Segment:
• In-memory segment replaying data from the Log Broker.
• Uses a FLAT index to ensure data is fresh and appendable.
Sealed Segment:
• Immutable segment using alternative indexing methods for efficiency.
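
The growing/sealed split can be illustrated with a small model. This is only a sketch, not Milvus internals: the class and method names are hypothetical, and the sealed segment here still scans linearly where a real one would use a different index type.

```python
import math


def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class GrowingSegment:
    """Appendable in-memory segment, searched by brute force (FLAT-style)."""

    def __init__(self):
        self.rows = {}  # primary key -> vector

    def insert(self, pk, vector):
        # New data is searchable immediately; no index rebuild is needed.
        self.rows[pk] = list(vector)

    def search(self, query, k):
        scored = [(euclidean(query, vec), pk) for pk, vec in self.rows.items()]
        return sorted(scored)[:k]

    def seal(self):
        """Freeze the current contents into an immutable segment."""
        return SealedSegment(self.rows)


class SealedSegment:
    """Immutable segment: it has no insert method, only search."""

    def __init__(self, rows):
        self._rows = tuple((pk, tuple(vec)) for pk, vec in rows.items())

    def search(self, query, k):
        # Placeholder scan; a real sealed segment would use another index.
        scored = [(euclidean(query, vec), pk) for pk, vec in self._rows]
        return sorted(scored)[:k]
```

Because a sealed segment never changes, an index built over it once stays valid, which is what makes the more expensive index types worthwhile there.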

Behind the Scenes: How Data Gets Added and
Accessed
• Sharding: Large datasets are divided into smaller, manageable sections
called shards. Each shard is handled by a dedicated datanode.
• Write-Ahead Log (WAL): When you add new data, a proxy service writes it
to a temporary log called a WAL (e.g., Kafka, Pulsar). Think of it as a
to-do list for the datanodes.
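
A minimal sketch of this write path, under stated assumptions: the CRC32-based shard_for function is an illustrative choice (the slides do not say how Milvus assigns keys to shards), and a plain Python list per shard stands in for a Kafka or Pulsar topic. Proxy and DataNode here are toy classes, not Milvus components.

```python
import zlib
from collections import defaultdict


def shard_for(primary_key, num_shards):
    """Map a primary key to a shard with a stable hash (assumed scheme)."""
    # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
    return zlib.crc32(primary_key.encode("utf-8")) % num_shards


class Proxy:
    """Appends every write to the log of the shard that owns the key."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.wal = defaultdict(list)  # shard id -> ordered log entries

    def insert(self, primary_key, vector):
        shard = shard_for(primary_key, self.num_shards)
        self.wal[shard].append(("insert", primary_key, vector))
        return shard


class DataNode:
    """Consumes its shard's log from the last offset it processed."""

    def __init__(self, shard):
        self.shard = shard
        self.offset = 0
        self.rows = {}

    def consume(self, wal):
        entries = wal[self.shard][self.offset:]
        for operation, primary_key, vector in entries:
            if operation == "insert":
                self.rows[primary_key] = vector
        self.offset += len(entries)
        return len(entries)
```

The log is the "to-do list": the proxy only appends, and each datanode works through its own shard's entries at its own pace.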

Behind the Scenes: How Data Gets Added and
Accessed
• Datanodes subscribe to the WAL and:
    • Add new data to their assigned shard.
    • Remove outdated data (if needed).
    • Flush accumulated data to permanent storage.
• Query Nodes also subscribe to the WAL but focus on:
    • Creating and managing Segments within each shard for fast searching.
    • Ensuring searches access the latest information.

Index Building
To avoid frequent index rebuilding when data is updated, a collection in
Milvus is divided further into segments, each with its own index.

Data query
Data query refers to:
• Searching a specified collection for the k vectors nearest to a target
vector, or for all vectors within a specified distance range of the
target vector.
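
Both query types can be shown with a brute-force sketch. The function names and the dictionary-based collection are illustrative assumptions; a vector database answers the same questions using indexes rather than a full scan.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(collection, target, k):
    """Return the k (distance, primary key) pairs nearest to the target."""
    scored = ((l2(target, vec), pk) for pk, vec in collection.items())
    return heapq.nsmallest(k, scored)


def range_search(collection, target, radius):
    """Return every (distance, primary key) pair within the given radius."""
    hits = []
    for pk, vec in collection.items():
        distance = l2(target, vec)
        if distance <= radius:
            hits.append((distance, pk))
    return sorted(hits)


# Example collection: primary key -> vector.
COLLECTION = {"a": [0, 0], "b": [3, 4], "c": [6, 8]}
```

With this data, top_k(COLLECTION, [0, 0], 2) and range_search(COLLECTION, [0, 0], 5.0) both return the entries for "a" and "b": the first because they are the two nearest, the second because they lie within distance 5.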

Things to avoid!
• Auto-increment primary keys
• Auto-generated IDs based on something you don't know

Identifier Strategies
Unique Identifiers
• Ensure each piece of data (e.g., documents, images, or vectors) has a
unique identifier.
Example: Assign UUIDs to each document or vector for precise
identification and retrieval.
Metadata Tagging
• Attach metadata to each data item to facilitate quick searches and
categorization.
Example: Add tags such as date_created, author, and content_type
to documents.
For instance, a document might have metadata: {"id": "12345",
"date_created": "2024-06-06", "author": "John Doe",
"content_type": "research_paper"}
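
One way to get UUIDs that stay stable across re-ingestion is to derive them from something you do know, such as the document's source path. The sketch below assumes that approach; the namespace URL, the function names, and the choice of source path as the key are all hypothetical.

```python
import uuid

# Hypothetical namespace for this document set; any fixed UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/docs")


def document_id(source_path):
    """Derive a deterministic UUID from the document's source path."""
    # uuid5 is a name-based UUID: the same input always gives the same ID,
    # so re-ingesting a document targets the same primary key.
    return str(uuid.uuid5(NAMESPACE, source_path))


def make_entity(source_path, author, content_type, date_created):
    """Build an entity carrying the metadata tags named on the slide."""
    return {
        "id": document_id(source_path),
        "date_created": date_created,
        "author": author,
        "content_type": content_type,
    }
```

Because the ID depends only on the source path, an updated version of the same document maps to the same primary key and can be upserted rather than duplicated.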

Identifier Strategies
Composite Identifiers
• Combine multiple fields to create a unique identifier.
Example: Combine user_id and timestamp to create unique IDs for
user-generated content. E.g. {"id": "user123_20240606T123000"}
Hierarchical IDs
• Use hierarchical structures for complex data sets.
Example: For a hierarchical document system, use IDs like {"id":
"projectA_chapter1_section2"}
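
The two ID formats from the examples can be produced with small helpers. The function names are hypothetical, and the underscore separator simply follows the examples on the slide.

```python
from datetime import datetime


def composite_id(user_id, created_at):
    """Combine a user ID and a timestamp, e.g. user123_20240606T123000."""
    return f"{user_id}_{created_at.strftime('%Y%m%dT%H%M%S')}"


def hierarchical_id(*levels):
    """Join hierarchy levels, e.g. projectA_chapter1_section2."""
    return "_".join(levels)


def parent_id(identifier):
    """Drop the last level of a hierarchical ID to get its parent."""
    return identifier.rsplit("_", 1)[0]
```

Note that with an underscore separator the individual levels must not contain underscores themselves, otherwise parent_id would split in the wrong place.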

Update Process with Pymilvus
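
The code from this slide did not survive in the text, so the following is a minimal sketch rather than the speaker's example. It assumes the pymilvus MilvusClient interface and its upsert method; the collection name, the local database path, and the shape of the rows are placeholders.

```python
def connect(uri="./milvus_demo.db"):
    """Open a client; a local file path uses Milvus Lite (assumed setup)."""
    # Imported lazily so the helper below can be used without pymilvus.
    from pymilvus import MilvusClient

    return MilvusClient(uri)


def upsert_documents(client, collection_name, rows):
    """Insert new entities or overwrite existing ones by primary key.

    `rows` is a list of dicts whose keys match the collection schema,
    including the primary key field and the vector field.
    """
    return client.upsert(collection_name=collection_name, data=rows)
```

A call might look like upsert_documents(connect(), "docs", [{"id": "doc1", "vector": [...], "text": "updated text"}]), where "docs" and the field names are hypothetical and must match an existing collection.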

Update Process in a Partition
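
As with the previous slide, the original code is not in the text. The sketch below assumes the pymilvus MilvusClient methods has_partition, create_partition, and upsert with its partition_name argument; the helper name and all collection and partition names are placeholders.

```python
def upsert_into_partition(client, collection_name, partition_name, rows):
    """Upsert entities into one partition, creating it if it is missing."""
    exists = client.has_partition(
        collection_name=collection_name,
        partition_name=partition_name,
    )
    if not exists:
        client.create_partition(
            collection_name=collection_name,
            partition_name=partition_name,
        )
    # The upsert is scoped to the named partition of the collection.
    return client.upsert(
        collection_name=collection_name,
        data=rows,
        partition_name=partition_name,
    )
```

Here client is expected to be a MilvusClient instance (or any object with the same three methods).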

Upsert Entities with
Langchain

A Detailed Look
1. Identify the Data Needing an Update
a. The application prepares the updated content along with the
primary key of the document.
2. Data Transmission to the Proxy
a. The application sends an upsert request to the Milvus client
SDK.
b. The SDK transmits the request (containing updated data and
primary key) to the Proxy service.
3. Proxy communicates with Data Coordinator
a. Data Node checks the primary key (PK)
4. Upserts
a. Update the existing document or add a new one to the collection

A Detailed Look (next)
5. Data Allocation and Indexing
a. Data is in the Growing Segment
6. Datanode assigned to the Segment receives the
updated data
a. Datanode creates a sequence number for the data and stores
it
b. Data accumulates within the growing segment until a flush
condition is met ⇒ Flush the segment.
7. Flush
a. Growing Segment ⇒ Sealed Segment
b. Write the data to permanent storage (S3, etc.)
c. Ensures data persistence and avoids data loss.
8. Re-Indexing
a. Milvus triggers re-indexing for the upserted data
milvus.io
github.com/milvus-io/
@milvusio
@stephenbtl
/in/stephen-batifol
Connect with me!
Thank you!
