Unveiling the Power of Multi-Modal Large Language Models: Revolutionizing Perceptual AI
Istvan Fehervari
Director, Data and ML

The era of large language models
[Diagram: "LLM" at the center, linked to its main application areas]
• Webpage design
• Co-pilots, digital assistants
• Customer support bots
• Data exploration / SQL
• Copywriting
• Education / mental health bots

Large language models
• Revolution started in machine translation
• Context-sensitive next-token prediction via attention
• Transformer blocks composed of layers of attention and feed-forward blocks
• Encoder-decoder architectures
• Intelligence emerges through scale
Pipeline: Input text → Tokenize → Deep neural network → Decode tokens → Output text
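
In code, this pipeline is only a few lines. A minimal sketch, assuming the Hugging Face transformers package with GPT-2 as a stand-in model:

```python
# Tokenize -> deep neural network -> decode tokens, end to end.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")  # tokenize
output_ids = model.generate(**inputs, max_new_tokens=20)              # predict tokens
print(tokenizer.decode(output_ids[0]))                                # decode tokens
```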

Why the attention mechanism is critical
• Enables learnable weights of content pieces for arbitrary context → better reasoning
• Works very well for ordered and unordered sets
• Works well with external contexts, e.g., cross-attention with vision
• Most modern ML models leverage some form of attention
[Figure: attention weights over the phrase "Puppies Are Cute", ranging from low to high attention]
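
At its core, attention is a learned, softmax-weighted lookup. A minimal single-head sketch in NumPy (shapes and names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each query forms a context-sensitive weighting
    over all keys, then returns the weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # one context vector per query
```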

Building blocks of LLMs
• LLMs are composed of decoder blocks
• Inputs are all previous tokens – including predicted ones
• Decoder outputs a token distribution based on all previous tokens
• During generation we sample tokens from the output distribution
  • Temperature
  • Top-k / top-p
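
Temperature and top-k are easiest to see in code. A minimal sketch of one decoding step (default values are illustrative):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50):
    """One decoding step: temperature reshapes the distribution (<1 sharpens,
    >1 flattens), top-k keeps only the k most likely tokens, then we sample."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    top = np.argsort(logits)[-top_k:]             # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                          # renormalize over survivors
    return int(np.random.choice(top, p=probs))
```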

Training LLMs
• Supervised training
  • Input–output pairs (e.g., translation)
  • Fine-tune on a specific task
• Self-supervised training
  • Next/masked token prediction – needs a large body of data
• Reinforcement learning from human feedback (RLHF)
  • For instruction tuning, use human rankings to learn a reward function
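
The self-supervised objective is plain cross-entropy on shifted inputs. A minimal PyTorch sketch (tensor shapes are assumptions):

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Self-supervised LM objective: predict token t+1 from tokens <= t.
    logits: (batch, seq, vocab) from the model; token_ids: (batch, seq)."""
    pred = logits[:, :-1, :]             # predictions for positions 0..T-2
    target = token_ids[:, 1:]            # targets are the inputs shifted by one
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```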

Main foundational open LLMs
• LLaMA (2023/2) - 7B / 13B / 33B / 65B
• Falcon (2023/5) - 7B / 40B / 180B
• LLaMA 2 (2023/7) - 7B / 13B / 70B
• Mistral (2023/9) - 7B (LLaMA-style architecture, trained from scratch)
• Vicuna / Alpaca (fine-tuned from LLaMA)
• Phi-2 (2023/12) - 2.7B
• Mixtral (2023/12) - 8x7B

Perception via Language

Rise of a new dataset
• Annotated class labels are expensive → captions are abundant
• Era of natural language supervision
• WebImageText dataset: 400 million images with text captions
  • Created with web scraping
  • Query words are composed of all words occurring at least 100 times on Wikipedia

CLIP: combining language and vision
• Predicting captions directly does not scale well
• Instead, predict how well a text description and an image “fit together”
• First example of prompt engineering in vision
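
The “fit” is a cosine similarity between image and text embeddings. A sketch of zero-shot classification, assuming OpenAI's clip package; the image path and labels are placeholders:

```python
import torch
import clip                  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("puppy.jpg")).unsqueeze(0)        # placeholder image
labels = ["dog", "cat", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in labels])  # prompt engineering

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(prompts)
    img = img / img.norm(dim=-1, keepdim=True)                  # unit-normalize
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)               # best-fitting caption wins

print(dict(zip(labels, probs[0].tolist())))
```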

Object detection with CLIP
Grounding DINO
• Open-vocabulary detection
• Text backbone is a pretrained transformer like BERT
• Text-image and image-text cross-attention at several stages
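
A sketch of open-vocabulary detection via the Hugging Face integration of Grounding DINO; the checkpoint and thresholds follow the library's documented example, and the image and prompt are placeholders:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

ckpt = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt)

image = Image.open("street.jpg")             # placeholder image
text = "a dog. a bicycle."                   # classes as free-form text
inputs = processor(images=image, text=text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```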

Image segmentation with CLIP
Grounded SAM
• Open-vocabulary segmentation
• Detect boxes with Grounding DINO → predict masks with SAM
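
The SAM half of the pipeline is a few lines with the segment-anything package; the checkpoint path, image, and box (standing in for a Grounding DINO detection) are placeholders:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real HxWx3 image
predictor.set_image(image)

box = np.array([100, 100, 300, 250])              # stand-in detector box (xyxy)
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)                        # (1, H, W) boolean mask
```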

Image segmentation with CLIP
• CLIPSeg – image segmentation with prompts
Lüddecke et al. - Image Segmentation Using Text and Image Prompts, CVPR 2022
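
CLIPSeg is available through transformers as well. A sketch of prompt-driven segmentation, assuming the model card's checkpoint; the image and prompts are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("kitchen.jpg")                  # placeholder image
prompts = ["a cup", "a knife"]                     # segmentation driven by text
inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                # one low-res map per prompt
masks = torch.sigmoid(logits)                      # (n_prompts, 352, 352)
```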

Image generation with CLIP
• Stable Diffusion uses CLIP text embeddings
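
A sketch with the diffusers library showing where CLIP enters: the pipeline's CLIP text encoder embeds the prompt, and those embeddings condition every denoising step. Model ID and prompt are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is tokenized and embedded by the pipeline's CLIP text encoder
# (pipe.text_encoder); those embeddings steer the denoiser.
image = pipe("a photorealistic puppy wearing a tiny hat").images[0]
image.save("puppy.png")
```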

LLMs with Vision

Learning paradigms for (V)LLMs
• We want our models to reason over visual input
• What data is needed?

Training V-LLMs
Dataset building via
1. Image captioning (ideally with bounding-box ground truth)
2. Visual QA datasets
3. Synthetic: create (2) from (1)
Can be done manually or LLM-assisted, as in the sketch below
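
A sketch of the LLM-assisted route, turning captioning data (1) into VQA pairs (2); the llm callable and the prompt template are hypothetical:

```python
def caption_to_qa(llm, caption):
    """Hypothetical LLM-assisted conversion of an image caption into a
    visual-QA training pair; `llm` is any text-completion callable."""
    prompt = (
        "Given this image caption, write one question a user could ask "
        "about the image, and its answer.\n"
        f"Caption: {caption}\n"
        "Format: Q: ... A: ..."
    )
    return llm(prompt)

# e.g., caption_to_qa(llm, "A puppy chasing a red ball on the grass")
# -> "Q: What is the puppy chasing? A: A red ball."
```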

Learning paradigms for (V)LLMs
• (Multi-modal) in-context learning (e.g., Otter)
  • Inject a demonstration set into the context
  • Requires a large context
  • Can be used to teach LLMs to use external tools
• (Multi-modal) chain-of-thought (e.g., ScienceQA)
  • Intermediate reasoning steps for superior output
  • Adaptive or pre-defined chain configuration
  • Chain construction: infilling or predicting
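
In-context learning is ultimately prompt assembly. A sketch of a multimodal few-shot prompt; the <image_k> markers and demonstrations are hypothetical:

```python
# Hypothetical few-shot prompt for a multimodal model; <image_k> marks where
# the model's image tokens would be spliced into the context.
demos = [
    ("<image_1> What animal is shown?", "A golden retriever puppy."),
    ("<image_2> How many people are visible?", "Three."),
]
question = "<image_3> What is the person holding?"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos) + f"\nQ: {question}\nA:"
print(prompt)   # demonstrations first, then the real query
```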

LLMs with vision capabilities
• Learnable interface between modalities
• Expert model translating (e.g., vision) into text
• Special tokens / function calling to access auxiliary models
Modality bridging:
• Learnable interface
  • Query-based: InstructBLIP, VisionLLM, Macaw-LLM
  • Projection-based: LLaVA, PandaGPT, Video-ChatGPT
  • Parameter-efficient tuning: LaVIN
• Expert model: VideoChat-Text, LLaMA-Adapter V2

Modality bridging with shallow alignment
• Use CLIP to map vision and language tokens to the same latent space (shallow alignment) – LLaVA-1.5
• Keep the LLM and image encoder frozen – only train a shallow projection layer
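
A sketch of the trainable piece: a small MLP projecting frozen vision features into the LLM's token-embedding space. LLaVA-1.5 uses a two-layer MLP; the dimensions here are assumptions:

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Shallow alignment: only this small MLP is trained; the vision encoder
    and the LLM stay frozen. Dimensions below are illustrative."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):    # (batch, n_patches, vision_dim)
        # Output tokens are concatenated with text embeddings in the LLM input
        return self.proj(patch_features)  # (batch, n_patches, llm_dim)
```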

Deeper alignment: Mixture of experts with vision
• Visual experts – mixture of experts with vision, e.g., CogVLM
• Experts are separate feed-forward layers
• Only a few experts are activated during inference
• Learn a gating network
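
A minimal sketch of sparse expert routing with a learned gate; dimensions, expert count, and top-k are illustrative, and CogVLM's actual visual-expert design differs in detail:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sparse mixture-of-experts layer: a learned gate routes each token to
    its top-k expert feed-forward networks; the rest stay idle."""
    def __init__(self, dim=512, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)    # learned routing network
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)        # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only top-k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```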

LLM-aided visual reasoning
• LLM function calling
  • Controller – task planning
  • Decision maker – summarize, continue or not
  • Semantic refiner – generate text w.r.t. context
• Strong generalization
• Emergent ability (e.g., understanding meme images)
• Better control
[Diagram: instruction/question → LLM (reasoning) orchestrating an object detector and a segmentation model → answer]
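
A sketch of the loop in the diagram; the tools and the plan callable standing in for the LLM controller are hypothetical stand-ins:

```python
def detect_objects(image, query):      # stand-in for, e.g., Grounding DINO
    return [{"label": query, "box": (10, 10, 80, 60), "score": 0.9}]

def segment_box(image, box):           # stand-in for, e.g., SAM
    return {"mask_area": 1234}

TOOLS = {"detect": detect_objects, "segment": segment_box}

def visual_reason(plan, image, question):
    """Controller loop: the LLM (here the hypothetical `plan` callable)
    picks a tool and its arguments; its output feeds the final answer."""
    step = plan(question)              # e.g., {"tool": "detect", "args": {"query": "dog"}}
    result = TOOLS[step["tool"]](image, **step["args"])
    return f"{step['tool']} result: {result}"
```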

Vision-language on the Edge

Applications
• Programming → natural-language instructions
• Training-free solutions
• Shorter time-to-market
• Short lead time to adapt to changing environments

Applications
• Control: more natural, frictionless UX
  • Voice or chat to control/monitor devices/networks
  • Answer usability questions (no more manuals)
  • Personalized onboarding to new devices
• Feedback:
  • Output is interpreted without a human in the loop (e.g., alarm systems)

Challenges
• LLMs need a lot of resources (compute, memory)
• Visual input and CoT require larger contexts
• Latency is still an issue on the edge
• Output control of LLMs is still unsolved; they are prone to hallucinations
• AI safety – bias is an unsolved issue
• AI alignment is an emerging field of research

Conclusions
• Language as control brought tremendous improvements
• LLMs can operate very well with visual signals
• Future products will be more user-friendly, more natural
• Faster time to market, better adaptability, both technical and business
• Performance on the edge is a challenge today but will be solved
• AI safety / alignment is the new challenge, without a clear answer in sight

Questions?

Resources
• Yin et al. - A Survey on Multimodal Large Language Models
• Zhang et al. - Vision-Language Models for Vision Tasks: A Survey
• Lüddecke et al. - Image Segmentation Using Text and Image Prompts
• Liu et al. - Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
• Kirillov et al. - Segment Anything
• Wang et al. - CogVLM: Visual Expert for Pretrained Language Models
• Liu et al. - LLaVA: Large Language and Vision Assistant
• Li et al. - Otter: A Multi-Modal Model with In-Context Instruction Tuning