Building the Next-Gen Apps with Multimodal Retrieval using Twelve Labs & Milvus

Building the Next-Gen Apps with
Multimodal Retrieval using Twelve
Labs & Milvus
Hrishikesh Yadav

ABOUT
Developer Advocate @TwelveLabs
Ex AI Engineer @Shaga
Kaggle 2x Expert
Applied Gen AI Researcher
Member @SuperTeamDao
Hrishikesh Yadav
About Mysellf
Deep Learning and Applied Generative AI Researcher
I like to participate and judge the hackathon
Worked on the product - Shaga and CrimeDekho
Published Research Work around the Predictive Policing and Time
Forecasting

1.Discussion of Multimodal Embeddings
2.Embed API Exploration
3.Usecases to Explore and Build
4.Demo of Multimodal RAG with Twelve Labs and Milvus
5.Visual Similarity (Image to Video Semgents)
6.Ideas to Explore and Build
7.QnA
Agenda

• Users can find relevant content across any format
regardless of query type -
⚬ search with text to find videos
⚬ images to find videos
⚬ voice to find visuals, breaking down traditional
search limitations.
Embedding Powers Application
“And alot many with the any to any application”

Multimodal Embeddings
Encoders
Video
Image
Audio
Text
[-0.037437707,-0.015245657…]
[-0.037437707,-0.015245657…]
[-0.037437707,-0.015245657…]
[-0.037437707,-0.015245657…]
Crowd of men with a horse

Any-to-Any Search!
Any
Modality
Retrieved
Any
Modality
Data
or
QUERY
IN
Crowd of men with a horse

Why Empowering Product with
Multimodal
More modality, More power to the user,
More personalization

Building the Next-Gen Apps with Multimodal Retrieval using Twelve Labs & Milvus

Video-level Embedding
Embed
POST
input_type
file
: video, audio, image, text
: video.mp4
Embed
: 61e1127861c43d6d9b736194
GET
task_id
Embeddin
gs
[0.6,-0.2,0.3,0.4,...]
GET
‹#›
Video Video embeddings (semantic
representation)
[0.6,-0.2,0.3,0.4,...]
Clip-level Embeddings
GET
[0.6,-0.2,0.3,0.4,...],
[0.6,-0.2,0.3,0.4,...], …
Embed API

Marengo 2.6 Benchmarks
Get the entire report: https://www.twelvelabs.io/blog/introducing-marengo-2-6

Surveillance Analysis
Assistant
• Usecase in the surveillance were the
knowledge base contaning the CCTV
video Footages and details can be
loaded into the embedding format.
• Video understanding with the
embedding would save a lot of time,
searching for the particular video in the
surveillance.

Organization Documentation Archive
Assistant
• Knowledge Base containig the
organization documentation of all
modalities.
• Employees can ask natural questions
query and instantly receive relevant
results across all formats.
• Automatically connects related content
across formats - when viewing a
technical specification, instantly see
related implementation videos.

Museum Guide
Assistant
• Visitors can take a photo of any artwork
or simply describe what interests them
to receive instant insights about the
piece, related artworks across the
museum.
• Delivers audio visual tours by
understanding the visual elements of
artwork and finding the relevant info.
More Personalization, More Engagement

Demo of Multimodal RAG with Twelve
Labs and Milvus

Fashion Assistant with LLM
https://github.com/Hrishikesh332/Twelve-Labs-Fashion-chat-assistant
• The details about the application -
⚬ Vector Database - Milvus
⚬ Embedding - Marengo-2.6-retreival
⚬ LLM Model - gpt-3.5-turbo (OpenAI)
⚬ Deployment - Streamlit Cloud

Fashion Assistant with LLM and Multimodal
Retreival

Demo of Image to Video Segment

Image Query to Video Segment Retrieval

Image Query to Video Segment
Retrieval

For Detailed Working Tutorial Blog -
Scan
Here

ABOUT
Twitter - @hrishikesh_ai
Developer Advocate @TwelveLabs
Prev. AI Engineer @Shaga
Kaggle 2x Expert
Applied Gen AI Researcher
Member @SuperTeamDao

Building the Next-Gen Apps with Multimodal Retrieval using Twelve Labs & Milvus

More Related Content

Similar to Building the Next-Gen Apps with Multimodal Retrieval using Twelve Labs & Milvus (18)

More from Zilliz (20)

Recently uploaded (20)

Building the Next-Gen Apps with Multimodal Retrieval using Twelve Labs & Milvus