Vision-Language-Action Models for Embodied AI
A Deep Dive into the 2024 Survey
Mohibkhan Pathan
CMPE 258 - Deep Learning
Professor Vijay Eranti
Table of contents
01. What is Embodied AI?
02. Why Do We Need VLA Models?
03. What Are VLA Models?
04. The Three Main Parts of VLA Models
05. Key Components
06. Low-Level Control
07. High-Level Task Planners
08. Training Data and Benchmarks
09. Challenges
10. The Future of VLA Models
11. References
What is Embodied AI?
● AI that operates in the real world, not just on a screen
● Can see, understand, and take action
● Example: a robot that follows the command “bring me a cup”
● Combines vision, language, and movement
● Unlike ChatGPT or CLIP, which do not interact with the physical world
Why Do We Need VLA Models?
● Real-world tasks are complex and multi-step
● Robots must understand what to do, what they see, and how to act
● A model for only vision or only language is not enough
● VLA models let robots follow natural-language commands, much as a person would
● Useful in homes, hospitals, factories, and more
What Are VLA Models?
● Vision-Language-Action (VLA) models combine vision, language understanding, and action generation in a single system
● Input: what the robot sees (camera images) plus a natural-language instruction
● Output: robot actions, from low-level motions to multi-step plans
● Typically built on pretrained vision and language encoders (e.g. CLIP, BERT) adapted for robot control
● Examples covered in the survey: CLIPort, BC-Z, RT-1, SayCan
The Three Main Parts of VLA Models
● VLA models have three main parts:
  1. Components – visual/language encoders, dynamics models, and world models
  2. Low-Level Control – small step actions (e.g. move, pick)
  3. High-Level Planners – break big tasks into small steps
● This structure helps models plan and act better
● Like a team: the planner is the brain, the controller is the hands (a minimal sketch of this loop follows below)
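To make the planner/controller split concrete, here is a minimal Python sketch, assuming hypothetical class names and a hard-coded plan; nothing here is the API of a specific model in the survey.

```python
# Minimal sketch of the planner/controller hierarchy.
# All class names and the hard-coded plan are hypothetical placeholders.
from typing import List


class HighLevelPlanner:
    """Breaks a long instruction into short, executable subtasks."""

    def plan(self, instruction: str) -> List[str]:
        # In a real system an LLM would produce this decomposition;
        # one example is hard-coded here purely for illustration.
        if instruction == "clean the room":
            return ["pick up the toy", "put it in the bin", "wipe the table"]
        return [instruction]


class LowLevelController:
    """Turns one subtask plus the current observation into motor commands."""

    def execute(self, subtask: str, image) -> None:
        # A learned policy (e.g. an RT-1-style transformer) would act here.
        print(f"executing: {subtask}")


def run(instruction: str) -> None:
    planner, controller = HighLevelPlanner(), LowLevelController()
    for subtask in planner.plan(instruction):    # planner = brain
        controller.execute(subtask, image=None)  # controller = hands (image would come from the camera)


run("clean the room")
```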
Key Components
● Vision Encoder – helps the robot “see” (e.g. CLIP, R3M, MVP)
● Language Encoder – understands commands (e.g. BERT, GPT)
● Dynamics Model – learns how actions change the world
● World Model – predicts what will happen next (like a mini-simulator)
● These parts work together to help the robot think before it moves (a toy composition sketch follows below)
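As a rough picture of how these pieces fit together, here is a toy forward pass in PyTorch. The sizes are assumptions, and the linear layers merely stand in for pretrained encoders such as CLIP/R3M (vision) and BERT/GPT (language); this is not the architecture of any specific model in the survey.

```python
# Toy composition of VLA components (hypothetical dimensions throughout).
import torch
import torch.nn as nn


class TinyVLAPolicy(nn.Module):
    """Encode the image and the instruction, fuse them, and predict an action."""

    def __init__(self, img_dim: int = 512, txt_dim: int = 512, act_dim: int = 7):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 32 * 32, img_dim)   # "see"
        self.language_encoder = nn.Linear(300, txt_dim)         # "understand"
        self.action_head = nn.Sequential(                       # "act"
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )

    def forward(self, image: torch.Tensor, instruction: torch.Tensor) -> torch.Tensor:
        v = self.vision_encoder(image.flatten(1))
        l = self.language_encoder(instruction)
        return self.action_head(torch.cat([v, l], dim=-1))


policy = TinyVLAPolicy()
action = policy(torch.randn(1, 3, 32, 32), torch.randn(1, 300))  # e.g. a 7-DoF command
```

A dynamics or world model would sit alongside such a policy, predicting the next observation from the current one plus the chosen action, which is what lets the robot "think before it moves".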
Low-Level Control
● Low-level control = small actions like pick, move, turn
● Uses information from the camera plus the instruction to act in real time
● Common fusion methods (FiLM is sketched below):
  ○ FiLM – adjusts visual features using language
  ○ Cross-Attention – connects vision and language deeply
  ○ Concatenation – joins both inputs together
● Models: CLIPort, BC-Z, RT-1, UniPi
● Some use transformers or even learn from videos
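To make the FiLM option concrete, here is a minimal sketch in the same toy PyTorch style; the dimensions and module names are illustrative assumptions, not taken from CLIPort, RT-1, or any other specific model.

```python
# FiLM-style fusion: the language embedding predicts a per-channel scale (gamma)
# and shift (beta) that modulate the visual feature map.
import torch
import torch.nn as nn


class FiLMFusion(nn.Module):
    def __init__(self, lang_dim: int = 512, vis_channels: int = 64):
        super().__init__()
        # One small layer predicts both gamma and beta from the language embedding.
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * vis_channels)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) visual feature map; lang_feat: (B, lang_dim)
        gamma, beta = self.to_gamma_beta(lang_feat).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over the spatial dimensions
        beta = beta[:, :, None, None]
        return gamma * vis_feat + beta    # language decides what the vision branch emphasizes


fusion = FiLMFusion()
out = fusion(torch.randn(2, 64, 16, 16), torch.randn(2, 512))  # -> (2, 64, 16, 16)
```

Cross-attention would instead let visual tokens attend to language tokens, while concatenation simply stacks the two feature vectors, as in the toy policy sketched earlier.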
High-Level Task Planners
● The planner breaks long tasks into smaller actions (e.g. “clean room” → pick up toy → wipe table)
● Two common types (a code-based example follows below):
  ○ Language-based: an LLM writes out the steps in text
  ○ Code-based: an LLM generates commands using functions like pick() or move()
● Helps the robot know what to do next
● Well-known examples: SayCan, Inner Monologue, ProgPrompt
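For the code-based style, the sketch below shows the kind of short program an LLM planner might emit for "put the toy in the bin". The primitives pick(), place(), and move_to() are hypothetical robot skills used only for illustration; they are not functions from ProgPrompt or any real robot library.

```python
# Hypothetical robot skill primitives (stubs for illustration).
def move_to(location: str) -> None:
    print(f"moving to {location}")


def pick(obj: str) -> None:
    print(f"picking up {obj}")


def place(obj: str, target: str) -> None:
    print(f"placing {obj} in {target}")


# --- plan generated by the LLM, then executed step by step by the robot ---
move_to("living room")
pick("toy")
move_to("bin")
place("toy", "bin")
```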
Training Data and Benchmarks
● Real robot data is hard to collect and expensive
● Simulators are used for faster and safer training (e.g. Habitat, AI2-THOR)
● Some models learn from human videos or internet data
● Benchmarks help test models (a simplified evaluation loop is sketched below):
  ○ EmbodiedQA: answer questions while exploring
  ○ RLBench: robot manipulation
  ○ EgoPlan / PlanBench: test planning skills
● Benchmarks make it possible to compare models fairly
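As a rough sketch of how such benchmarks compare models, the loop below reports a success rate over many episodes. The env interface (reset/step/success) is a simplified assumption, not the actual API of Habitat, AI2-THOR, RLBench, or the other benchmarks listed above.

```python
# Simplified, hypothetical evaluation loop: run a VLA policy for many episodes
# in a simulator and report the fraction of tasks it completes.
def evaluate(policy, env, episodes: int = 100, max_steps: int = 200) -> float:
    successes = 0
    for _ in range(episodes):
        obs, instruction = env.reset()          # new scene + language goal
        for _ in range(max_steps):
            action = policy(obs, instruction)   # the VLA model picks the next action
            obs, done = env.step(action)
            if done:
                break
        successes += int(env.success())         # did the robot complete the task?
    return successes / episodes                 # success rate, comparable across models
```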
Challenges
● Real data is hard to get – robot demonstrations take time
● Models are slow – they need to act faster in the real world
● Systems are complex – many parts must work together
● Struggle with new tasks – generalization is still weak
● No standard tests – hard to compare different models
● Safety matters – robots must earn people’s trust
The Future of VLA Models
● Smarter planning with better world models
● Faster and smaller models for real-time use
● Use in homes, hospitals, factories, and more
● Safer and more human-friendly robot behavior
● Learning from the world just like humans do
References
Ma, Y., Song, Z., Zhuang, Y., Hao, J., & King, I. (2024). A Survey on Vision-Language-Action Models for Embodied AI. arXiv preprint arXiv:2408.14496. https://arxiv.org/abs/2408.14496