Unveiling the Power of Multi-Modal Large Language Models: Revolutionizing Perceptual AI
Istvan Fehervari
Director, Data and ML

The era of large language models
[Diagram: "LLM" at the center, linked to its main application areas]
• Webpage design
• Co-pilots, digital assistants
• Customer support bots
• Data exploration / SQL
• Copywriting
• Education / mental health bots

Large language models
• Revolution started in machine translation
• Context-sensitive next-token prediction via attention
• Transformer blocks composed of layers of attention and feed-forward blocks
• Encoder-decoder architectures
• Intelligence emerges through scale
Pipeline: Input text → Tokenize → Deep neural network → Decode tokens → Output text
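
In code, this pipeline is only a few lines. A minimal sketch, assuming the Hugging Face transformers package with GPT-2 as a stand-in model:

```python
# Tokenize -> deep neural network -> decode tokens, end to end.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")  # tokenize
output_ids = model.generate(**inputs, max_new_tokens=20)              # predict tokens
print(tokenizer.decode(output_ids[0]))                                # decode tokens
```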

Why the attention mechanism is critical
• Enables learnable weights of content pieces for arbitrary context → better reasoning
• Works very well for ordered and unordered sets
• Works well with external contexts, e.g., cross-attention with vision
• Most modern ML models leverage some form of attention
[Figure: attention weights over the phrase "Puppies Are Cute", ranging from low to high attention]
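
At its core, attention is a learned, softmax-weighted lookup. A minimal single-head sketch in NumPy (shapes and names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each query forms a context-sensitive weighting
    over all keys, then returns the weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # one context vector per query
```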

Building blocks of LLMs
• LLMs are composed of decoder blocks
• Inputs are all previous tokens – including predicted ones
• Decoder outputs a token distribution based on all previous tokens
• During generation we sample tokens from the output distribution
  • Temperature
  • Top-k / top-p
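
Temperature and top-k are easiest to see in code. A minimal sketch of one decoding step (default values are illustrative):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50):
    """One decoding step: temperature reshapes the distribution (<1 sharpens,
    >1 flattens), top-k keeps only the k most likely tokens, then we sample."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    top = np.argsort(logits)[-top_k:]             # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                          # renormalize over survivors
    return int(np.random.choice(top, p=probs))
```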

Training LLMs
• Supervised training
  • Input–output pairs (e.g., translation)
  • Fine-tune on a specific task
• Self-supervised training
  • Next/masked token prediction – needs a large body of data
• Reinforcement learning from human feedback (RLHF)
  • For instruction tuning, use human rankings to learn a reward function
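
The self-supervised objective is plain cross-entropy on shifted inputs. A minimal PyTorch sketch (tensor shapes are assumptions):

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Self-supervised LM objective: predict token t+1 from tokens <= t.
    logits: (batch, seq, vocab) from the model; token_ids: (batch, seq)."""
    pred = logits[:, :-1, :]             # predictions for positions 0..T-2
    target = token_ids[:, 1:]            # targets are the inputs shifted by one
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```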

Main foundational open LLMs
• LLaMA (2023/2) - 7B / 13B / 33B / 65B
• Falcon (2023/5) - 7B / 40B / 180B
• LLaMA 2 (2023/7) - 7B / 13B / 70B
• Mistral (2023/9) - 7B (LLaMA-style architecture, trained from scratch)
• Vicuna / Alpaca (fine-tuned from LLaMA)
• Phi-2 (2023/12) - 2.7B
• Mixtral (2023/12) - 8x7B

Perception via Language

Rise of a new dataset
• Annotated class labels are expensive → captions are abundant
• Era of natural language supervision
• WebImageText dataset: 400 million images with text captions
  • Created with web scraping
  • Query words are composed of all words occurring at least 100 times on Wikipedia

CLIP: combining language and vision
• Predicting captions directly does not scale well
• Instead, predict how well a text description and an image “fit together”
• First example of prompt engineering in vision
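
The “fit” is a cosine similarity between image and text embeddings. A sketch of zero-shot classification, assuming OpenAI's clip package; the image path and labels are placeholders:

```python
import torch
import clip                  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("puppy.jpg")).unsqueeze(0)        # placeholder image
labels = ["dog", "cat", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in labels])  # prompt engineering

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(prompts)
    img = img / img.norm(dim=-1, keepdim=True)                  # unit-normalize
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)               # best-fitting caption wins

print(dict(zip(labels, probs[0].tolist())))
```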

Object detection with CLIP
Grounding DINO
• Open-vocabulary detection
• Text backbone is a pretrained transformer like BERT
• Text-image and image-text cross-attention at several stages
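
A sketch of open-vocabulary detection via the Hugging Face integration of Grounding DINO; the checkpoint and thresholds follow the library's documented example, and the image and prompt are placeholders:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

ckpt = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt)

image = Image.open("street.jpg")             # placeholder image
text = "a dog. a bicycle."                   # classes as free-form text
inputs = processor(images=image, text=text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```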

Image segmentation with CLIP
Grounded SAM
• Open-vocabulary segmentation
• Detect boxes with Grounding DINO → predict masks with SAM
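
The SAM half of the pipeline is a few lines with the segment-anything package; the checkpoint path, image, and box (standing in for a Grounding DINO detection) are placeholders:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real HxWx3 image
predictor.set_image(image)

box = np.array([100, 100, 300, 250])              # stand-in detector box (xyxy)
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)                        # (1, H, W) boolean mask
```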

Image segmentation with CLIP
• CLIPSeg – image segmentation with prompts
Lüddecke et al. - Image Segmentation Using Text and Image Prompts, CVPR 2022
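
CLIPSeg is available through transformers as well. A sketch of prompt-driven segmentation, assuming the model card's checkpoint; the image and prompts are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("kitchen.jpg")                  # placeholder image
prompts = ["a cup", "a knife"]                     # segmentation driven by text
inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                # one low-res map per prompt
masks = torch.sigmoid(logits)                      # (n_prompts, 352, 352)
```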

Image generation with CLIP
• Stable Diffusion uses CLIP text embeddings
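
A sketch with the diffusers library showing where CLIP enters: the pipeline's CLIP text encoder embeds the prompt, and those embeddings condition every denoising step. Model ID and prompt are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is tokenized and embedded by the pipeline's CLIP text encoder
# (pipe.text_encoder); those embeddings steer the denoiser.
image = pipe("a photorealistic puppy wearing a tiny hat").images[0]
image.save("puppy.png")
```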

LLMs with Vision

Learning paradigms for (V)LLMs
• We want our models to reason over visual input
• What data is needed?

Training V-LLMs
Dataset building via
1. Image captioning (ideally with bounding-box ground truth)
2. Visual QA datasets
3. Synthetic: create (2) from (1)
Can be done manually or LLM-assisted, as in the sketch below
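
A sketch of the LLM-assisted route, turning captioning data (1) into VQA pairs (2); the llm callable and the prompt template are hypothetical:

```python
def caption_to_qa(llm, caption):
    """Hypothetical LLM-assisted conversion of an image caption into a
    visual-QA training pair; `llm` is any text-completion callable."""
    prompt = (
        "Given this image caption, write one question a user could ask "
        "about the image, and its answer.\n"
        f"Caption: {caption}\n"
        "Format: Q: ... A: ..."
    )
    return llm(prompt)

# e.g., caption_to_qa(llm, "A puppy chasing a red ball on the grass")
# -> "Q: What is the puppy chasing? A: A red ball."
```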

Learning paradigms for (V)LLMs
• (Multi-modal) in-context learning (e.g., Otter)
  • Inject a demonstration set into the context
  • Requires a large context
  • Can be used to teach LLMs to use external tools
• (Multi-modal) chain-of-thought (e.g., ScienceQA)
  • Intermediate reasoning steps for superior output
  • Adaptive or pre-defined chain configuration
  • Chain construction: infilling or predicting
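
In-context learning is ultimately prompt assembly. A sketch of a multimodal few-shot prompt; the <image_k> markers and demonstrations are hypothetical:

```python
# Hypothetical few-shot prompt for a multimodal model; <image_k> marks where
# the model's image tokens would be spliced into the context.
demos = [
    ("<image_1> What animal is shown?", "A golden retriever puppy."),
    ("<image_2> How many people are visible?", "Three."),
]
question = "<image_3> What is the person holding?"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos) + f"\nQ: {question}\nA:"
print(prompt)   # demonstrations first, then the real query
```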

LLMs with vision capabilities
• Learnable interface between modalities
• Expert model translating (e.g., vision) into text
• Special tokens / function calling to access auxiliary models
Modality bridging:
• Learnable interface
  • Query-based: InstructBLIP, VisionLLM, Macaw-LLM
  • Projection-based: LLaVA, PandaGPT, Video-ChatGPT
  • Parameter-efficient tuning: LaVIN
• Expert model: VideoChat-Text, LLaMA-Adapter V2

Modality bridging with shallow alignment
• Use CLIP to map vision and language tokens to the same latent space (shallow alignment) – LLaVA-1.5
• Keep the LLM and image encoder frozen – only train a shallow projection layer
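
A sketch of the trainable piece: a small MLP projecting frozen vision features into the LLM's token-embedding space. LLaVA-1.5 uses a two-layer MLP; the dimensions here are assumptions:

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Shallow alignment: only this small MLP is trained; the vision encoder
    and the LLM stay frozen. Dimensions below are illustrative."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):    # (batch, n_patches, vision_dim)
        # Output tokens are concatenated with text embeddings in the LLM input
        return self.proj(patch_features)  # (batch, n_patches, llm_dim)
```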

Deeper alignment: Mixture of experts with vision
• Visual experts – mixture of experts with vision, e.g., CogVLM
• Experts are separate feed-forward layers
• Only a few experts are activated during inference
• Learn a gating network
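
A minimal sketch of sparse expert routing with a learned gate; dimensions, expert count, and top-k are illustrative, and CogVLM's actual visual-expert design differs in detail:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sparse mixture-of-experts layer: a learned gate routes each token to
    its top-k expert feed-forward networks; the rest stay idle."""
    def __init__(self, dim=512, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)    # learned routing network
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)        # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only top-k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```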

LLM-aided visual reasoning
• LLM function calling
  • Controller – task planning
  • Decision maker – summarize, continue or not
  • Semantic refiner – generate text w.r.t. context
• Strong generalization
• Emergent ability (e.g., understanding meme images)
• Better control
[Diagram: instruction/question → LLM (reasoning) orchestrating an object detector and a segmentation model → answer]
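
A sketch of the loop in the diagram; the tools and the plan callable standing in for the LLM controller are hypothetical stand-ins:

```python
def detect_objects(image, query):      # stand-in for, e.g., Grounding DINO
    return [{"label": query, "box": (10, 10, 80, 60), "score": 0.9}]

def segment_box(image, box):           # stand-in for, e.g., SAM
    return {"mask_area": 1234}

TOOLS = {"detect": detect_objects, "segment": segment_box}

def visual_reason(plan, image, question):
    """Controller loop: the LLM (here the hypothetical `plan` callable)
    picks a tool and its arguments; its output feeds the final answer."""
    step = plan(question)              # e.g., {"tool": "detect", "args": {"query": "dog"}}
    result = TOOLS[step["tool"]](image, **step["args"])
    return f"{step['tool']} result: {result}"
```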

Vision-language on the Edge

Applications
• Programming → natural-language instructions
• Training-free solutions
• Shorter time-to-market
• Short lead time to adapt to changing environments

Applications
• Control: more natural, frictionless UX
  • Voice or chat to control/monitor devices/networks
  • Answer usability questions (no more manuals)
  • Personalized onboarding to new devices
• Feedback:
  • Output is interpreted without a human in the loop (e.g., alarm systems)

Challenges
• LLMs need a lot of resources (compute, memory)
• Visual input and CoT require larger contexts
• Latency is still an issue on the edge
• Output control of LLMs is still unsolved; they are prone to hallucinations
• AI safety – bias is an unsolved issue
• AI alignment is an emerging field of research

Conclusions
• Language as control brought tremendous improvements
• LLMs can operate very well with visual signals
• Future products will be more user-friendly, more natural
• Faster time to market, better adaptability, both technical and business
• Performance on the edge is a challenge today but will be solved
• AI safety / alignment is the new challenge, without a clear answer in sight

Questions?

Resources
• Yin et al. - A Survey on Multimodal Large Language Models
• Zhang et al. - Vision-Language Models for Vision Tasks: A Survey
• Lüddecke et al. - Image Segmentation Using Text and Image Prompts
• Liu et al. - Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
• Kirillov et al. - Segment Anything
• Wang et al. - CogVLM: Visual Expert for Pretrained Language Models
• Liu et al. - LLaVA: Large Language and Vision Assistant
• Li et al. - Otter: A Multi-Modal Model with In-Context Instruction Tuning