1
Building Customizable Generative Models through
Compositional Generation
Yilun Du
July 2024
2
Is Data All You Need?
Is Data All You Need?
Question: Is the blue cup below the red
bowl?
GPT-4V: No, the blue cup is not below the red
bowl. The blue cup is to the right side of the
picture, and the red bowl is not visible in the
image…
3
4
Why Does Generative AI Work Poorly in Other Settings?
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Compared to pixels in images (and many other continuous distributions), natural
language is much simpler (structured for effective communication) and naturally
compositional (infinite use of finite means).
Natural Language
Training
Distribution
Real World Distribution Real World Distribution
Training
Distribution
Embodied Data
5
Composition of Learned Factors Yields Strong Generalization
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Natural Language
Training
Distribution
Real World Distribution Real World Distribution
Training
Distribution
Embodied Data
Energy Based Models (EBMs) provide a probabilistic way to represent the real
distribution as a composition of tractable factors!
6
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Factor (Axis 1)
Factor (Axis 2)
Composition of Learned Factors Yields Strong Generalization
Natural Language
Training
Distribution
Real World Distribution
Training
Distribution
Embodied Data
Energy Based Models (EBMs) provide a probabilistic way to represent the real
distribution as a composition of tractable factors!
$p(x) \propto \prod_{i=1}^{N} p_i(x_i)$
Real World Distribution
7
Composition of Learned Factors Yields Strong Generalization
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Factors can be combined at prediction
time to represent the entire
distribution (even parts with no data).
Factor (Axis 2)
Factor (Axis 1)
Natural Language
Training
Distribution
Real World Distribution
Training
Distribution
Embodied Data
Real World Distribution
$p(x) \propto \prod_{i=1}^{N} p_i(x_i)$
8
Composition of Learned Factors Yields Strong Generalization
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
New compositions of factors enable
fast customization of generative
models to new settings.
Factor (Axis 2)
Factor (Axis 1)
Natural Language
Training
Distribution
Real World Distribution
Training
Distribution
Embodied Data
Real World Distribution
$p(x) \propto \prod_{i=1}^{N} p_i(x_i)$
9
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
10
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
High Energy
11
Low Energy
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
12
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
E(x)
Low Energy
13
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
Encodes the probability distribution: $p(x) \propto e^{-E(x)}$
E(x)
Low Energy
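To make the mapping concrete, here is a toy sketch (not from the slides) of how energies over a few discrete candidates turn into probabilities via $p(x) \propto e^{-E(x)}$:

```python
import numpy as np

# Toy illustration: Boltzmann probabilities from energies over three
# candidate images; the lowest-energy candidate is the most probable.
energies = np.array([0.5, 2.0, 4.0])       # E(x) for three candidates
unnormalized = np.exp(-energies)           # exp(-E(x))
probs = unnormalized / unnormalized.sum()  # normalize to obtain p(x)
print(probs)
```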
14
Sampling from Probability Distributions
E
Red Truck
Energy Function
Langevin Dynamics
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
$x_t = x_{t-1} - \nabla_x E(x_{t-1}) + \mathcal{N}(0, \sigma)$
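A minimal sketch of this sampler in Python, assuming a generic `grad_E` callback; the step and noise scales here are illustrative, not the tuned schedule from [1]:

```python
import numpy as np

def langevin_sample(grad_E, x0, n_steps=500, step=0.01, noise=0.005):
    """Approximate sample from p(x) ∝ exp(-E(x)) by noisy gradient
    descent on the energy (the Langevin update above)."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * grad_E(x) + noise * np.random.randn(*x.shape)
    return x

# Example: E(x) = ||x - mu||^2 / 2 gives grad_E(x) = x - mu, so samples
# concentrate near mu (a Gaussian centered at mu).
mu = np.array([1.0, -2.0])
print(langevin_sample(lambda x: x - mu, x0=np.zeros(2)))
```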
15
• Maximum likelihood training of EBMs corresponds to minimizing the loss:
Learning Energy Functions
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
$\nabla_\theta \mathcal{L}_{\text{ML}} = \mathbb{E}_{x \sim p(x)}[\nabla_\theta E(x)] - \mathbb{E}_{x \sim p_\theta(x)}[\nabla_\theta E(x)]$
(first term: drawn from data; second term: drawn from the learned distribution)
16
Learning Energy Functions
• Maximum likelihood training of EBMs corresponds to minimizing the loss:
• We showed how to scale EBMs to modern generative tasks by leveraging a
combination of Langevin dynamics + replay buffers [1].
$\nabla_\theta \mathcal{L}_{\text{ML}} = \mathbb{E}_{x \sim p(x)}[\nabla_\theta E(x)] - \mathbb{E}_{x \sim p_\theta(x)}[\nabla_\theta E(x)]$
(first term: drawn from data; second term: drawn from the learned distribution)
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
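A rough PyTorch sketch of this recipe, with an illustrative fixed-size replay buffer; names, sizes, and Langevin hyperparameters are assumptions, not the exact configuration from [1]:

```python
import torch

class ReplayBuffer:
    """Minimal buffer of past negative samples (an assumption; the actual
    buffer in [1] also mixes in fresh noise initializations)."""
    def __init__(self, shape, size=1000):
        self.data = torch.randn(size, *shape)

    def sample(self, n):
        return self.data[torch.randint(len(self.data), (n,))]

    def add(self, x):
        self.data[torch.randint(len(self.data), (len(x),))] = x

def ebm_ml_step(energy_net, x_real, buffer, n_langevin=20, step=1.0, noise=0.01):
    # Negatives: Langevin dynamics initialized from the replay buffer.
    x_neg = buffer.sample(len(x_real)).requires_grad_(True)
    for _ in range(n_langevin):
        grad = torch.autograd.grad(energy_net(x_neg).sum(), x_neg)[0]
        x_neg = (x_neg - step * grad +
                 noise * torch.randn_like(x_neg)).detach().requires_grad_(True)
    buffer.add(x_neg.detach())
    # Contrastive maximum-likelihood gradient: push energy down on data,
    # up on model samples (the two expectations above).
    return energy_net(x_real).mean() - energy_net(x_neg.detach()).mean()
```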
Diffusion Models are Energy Based Models
• Diffusion models are Energy Based Models where the gradient of the
energy function is the learned denoising function:
• Can implement compositionality with diffusion models by treating the
denoising function as the gradient of the energy function [2].
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
[2] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and
MCMC. ICML 2023
$x_t = x_{t-1} - \nabla_x E(x_{t-1}) + \mathcal{N}(0, \sigma) \quad\Leftrightarrow\quad x_t = x_{t-1} - \varepsilon_\theta(x_{t-1}) + \mathcal{N}(0, \sigma)$
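A sketch of how this identity licenses composition, assuming two pretrained denoisers with the same signature; the real samplers in [2] add MCMC correction steps, which are omitted here:

```python
import torch

def composed_reverse_step(x_t, denoisers, weights, sigma):
    """One reverse-diffusion step from a product of models: since each
    eps_theta plays the role of grad_x E, a weighted sum of noise
    predictions is the gradient of the summed (composed) energy."""
    eps = sum(w * d(x_t) for w, d in zip(weights, denoisers))
    return x_t - eps + sigma * torch.randn_like(x_t)

# Usage sketch: compose "red truck" and "desert" conditioned denoisers.
# x = torch.randn(1, 3, 64, 64)
# for sigma in noise_schedule:
#     x = composed_reverse_step(x, [eps_truck, eps_desert], [1.0, 1.0], sigma)
```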
18
[6] 3D Shapes
[4] Video Plans
[3] Policies
[1] Images [2] Robot Plans
[5] Soft Robots
[1] Liu*, Li*, Du* et al. Composable Visual Generation with Diffusion Models. ECCV 2022
[2] Janner*, Du* et al. Planning with Diffusion for Flexible Behavior Synthesis. ICML 2022
[3] Chi, Feng, Du et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.
[4] Du et al. Learning Universal Policies through Text-Conditioned Video Generation. NeurIPS 2023
[5] Wang, Zheng, Ma, Du* et al. Diffusebot: Breeding Soft Robots with Physics Augmented Diffusion Models. NeurIPS 2023
[6] Zhou, Du et al. 3D Shape Generation and Completion through Point-Voxel Diffusion. ICCV 2021
EBMs Can Represent Distributions Across High Dimensional Inputs
19
Composing Different Energy Functions at Prediction Time
E
Minimize Energy
E1
Red Truck
Energy Function
20
Composing Different Energy Functions at Prediction Time
E
Minimize Energy
E1
En
Red Truck
Energy Function
Desert
Energy Function
21
Combining Probability Distributions with Energy Functions
[1] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. ICML 2023
Products: $p(x) \propto p_1(x)\, p_2(x)$
22
Combining Probability Distributions with Energy Functions
[1] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. ICML 2023
Products: $p(x) \propto p_1(x)\, p_2(x)$
Mixtures: $p(x) \propto p_1(x) + p_2(x)$
23
Combining Probability Distributions with Energy Functions
[1] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. ICML 2023
Products: $p(x) \propto p_1(x)\, p_2(x)$
Mixtures: $p(x) \propto p_1(x) + p_2(x)$
Inversion: $p(x) \propto p_1(x) / p_2(x)^{a}$
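In energy space (where $p \propto e^{-E}$), the three operators reduce to simple arithmetic on energies; a small sketch, dropping normalization constants as the slides do:

```python
import torch

def product_energy(E1, E2):
    # p1(x) * p2(x)  <->  E1(x) + E2(x)
    return lambda x: E1(x) + E2(x)

def mixture_energy(E1, E2):
    # p1(x) + p2(x)  <->  -log(exp(-E1(x)) + exp(-E2(x)))
    return lambda x: -torch.logsumexp(torch.stack([-E1(x), -E2(x)]), dim=0)

def inversion_energy(E1, E2, a=0.5):
    # p1(x) / p2(x)^a  <->  E1(x) - a * E2(x)
    return lambda x: E1(x) - a * E2(x)
```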
24
Unsupervised Discovery of Composable Components
Training
Distribution
How can we discover the independent factors of a distribution?
25
Factor
2
Factor 1
Training
Distribution
$p(x) \propto \prod_{i=1}^{N} p_i(x)$
Train a generative model with factorized structure through maximum likelihood.
Unsupervised Discovery of Composable Components
26
[Figure: per-component inferred energy functions $E(\cdot \mid \mathrm{Enc}_k(x))$ composed and jointly minimized]
$x = \arg\min_{\tilde{x}} \sum_k E(\tilde{x} \mid \mathrm{Enc}_k(x))$
Encourage independence through an information bottleneck.
Unsupervised Discovery of Composable Components
$p_\theta(x \mid I) \propto \prod_{i=1}^{N} p_\theta^i(x \mid \mathrm{Enc}_i(I))$
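A sketch of this inference step with illustrative names: `encoders` infer one latent per component and `energy(x, z)` is the conditional energy; reconstruction minimizes their sum by gradient descent:

```python
import torch

def compose_and_reconstruct(x, encoders, energy, n_steps=50, step=0.1):
    """Infer K latent components z_k = Enc_k(x), then reconstruct x by
    gradient descent on the *sum* of the conditional energies."""
    zs = [enc(x) for enc in encoders]              # per-component latents
    x_tilde = torch.randn_like(x, requires_grad=True)
    opt = torch.optim.SGD([x_tilde], lr=step)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = sum(energy(x_tilde, z) for z in zs)  # composed energy
        loss.backward()
        opt.step()
    return x_tilde.detach()
```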
27
[1] Du et al. Unsupervised Learning of Compositional Energy Concepts. NeurIPS 2021
Discovering Composable Components in Natural Images
Input Image Inferred Energy
Functions
Compose Energy
Function
$p_\theta(x \mid z_1)$, $p_\theta(x \mid z_2)$, $p_\theta(x \mid z_3)$, $p_\theta(x \mid z_4)$
28
Compositional Generalization of Inferred Components
Dataset 1 Dataset 2
Images
[1] Su*, Du*, Liu* et al. Compositional Image Decomposition with Diffusion Models.
29
Compositional Generalization of Inferred Components
Dataset 1 Dataset 2
Images
[1] Su*, Du*, Liu* et al. Compositional Image Decomposition with Diffusion Models.
30
Compositional Generalization of Inferred Components
Dataset 1 Dataset 2
Images Composition
[1] Su*, Du*, Liu* et al. Compositional Image Decomposition with Diffusion Models.
Sample from the
composed
distribution:
31
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
32
Generalizing Beyond Demonstrations through Composition
We want to construct
robotic agents that can
generalize beyond the
demonstrations they
have seen!
Trajectory Synthesis
Demonstrations
Robot Actions
33
Planning through Compositional Generation
Decompose
demonstrations
into a model
capturing the
dynamics of the
environment + a
model to capture
the goal of a task.
Trajectory Synthesis
Demonstrations
Goal
Dynamics
Robot Actions
(, goal)
34
Planning through Compositional Generation
Trajectory Synthesis
Demonstrations
Goal
Dynamics
Robot Actions
(, goal)
Enables
generalization to
new combinations
of states + goals
through
probabilistic
planning!
35
Planning with Energy Minimization
E
Minimize Energy
E1
E2
Trajectory Energy
Function
Cost Functions
(Goal, Value Functions,
Test-Time Constraints)
[1] Du et al. Model Based Planning with Energy Based Models. CoRL 2019.
[2] Janner*, Du* et al. Planning with Diffusion for Flexible Behavior Synthesis. ICML 2022
[3] Ajay*, Du*, Gupta* et al. Is Conditional Generative Modeling all You Need For Decision Making? ICLR 2023
Planning/reinforcement learning through inference-time search on composed distributions.
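A minimal sketch of this inference-time search, assuming a differentiable trajectory energy and arbitrary test-time cost functions (all names illustrative):

```python
import torch

def plan(traj_energy, cost_fns, horizon, state_dim, n_steps=100, lr=0.05):
    """Planning as inference-time composition: optimize a whole trajectory
    against the learned trajectory energy (demo/dynamics plausibility)
    plus any test-time costs (goals, value functions, constraints)."""
    tau = torch.randn(horizon, state_dim, requires_grad=True)
    opt = torch.optim.Adam([tau], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        energy = traj_energy(tau) + sum(c(tau) for c in cost_fns)
        energy.backward()
        opt.step()
    return tau.detach()

# Example: add a goal-reaching cost at test time, without retraining:
# goal_cost = lambda tau: ((tau[-1] - goal) ** 2).sum()
```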
36
Reinforcement Learning through Value Composition
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ scored by a sum of composed value functions]
37
Reinforcement Learning through Value Composition
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ scored by a sum of composed value functions]
Solve many different tasks with one model!
38
Reinforcement Learning through Value
Composition
39
Goal Planning through Optimization
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ optimized so that it ends at a goal state $\mathbf{g}$, given start $\mathbf{s}_0$ and goal $g$]
Construct a zero-shot goal-seeking policy!
40
Goal Planning through Optimization
41
Goal Planning through Optimization
Single Task Planning Multi-Task Planning
42
Goal Planning through Optimization
Single Task Planning Multi-Task Planning
Baselines require per task retraining
while our trained model can be applied
across tasks!
43
Test-Time Cost Functions
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ with additional test-time cost terms composed into the trajectory energy]
44
Optimizing Hand-Specified Cost Functions with Energy Minimization
45
Planning From Partial Visual Observations on Real Robots
[1] Chi, Feng, Du et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.
Can learn complex trajectory planning from only visual observations,
given very few (50) demonstrations.
46
Fast Task Learning by Finetuning the Task Energy Function
E
Minimize Energy
E1
E2
Trajectory Energy
Function
Cost Functions
(Goal, Value Functions,
Test-Time Constraints)
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Can learn new tasks from
very few demonstrations!
47
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Train Tasks
Highway
Exit
Merge
Intersection
48
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Test Task
(5 Demos)
Few-shot Energy
Composition
Behavioral
Cloning
In Context
Learning
49
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Train Tasks
(217 Demos)
Push on Surface Push around Bowl
Pick and Place on Book Pick and Place on Table
50
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Test Task
(10 Demos)
Push on Book
51
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
52
Solving Long Horizon Tasks by Composing Foundation Models
Solving a long horizon
decision-making task requires
generalization across many
sources of knowledge.
Make A Cup of Tea
Decision Making
Demonstrations
Embodied Agents
Composing Foundation Models for Embodied Agents
Compose foundation
models representing
each axis of information!
Make A Cup of Tea
Decision Making
Demonstrations
Embodied Agents
Visual
Knowledge
Procedural Information
Composing Foundation Models for Embodied Agents
Compose foundation
models representing
each axis of information!
Make A Cup of Tea
Decision Making
Demonstrations
Embodied Agents
Visual
Knowledge
Procedural Information
55
Hierarchical Planning with Foundation Models
Language Model Video Model Egocentric
Action
Model
Task Information Motion
Information
Kinematics Information
[1] Ajay*, Han*, Du* et al. Compositional Foundation Models for Hierarchical Planning. NeurIPS 2023
We want to construct a plan to make a cup of tea that is
semantically, geometrically, and physically executable on a
robot!
1) Look for a tea kettle
2) Heat water
3) Find teabag
4) …
56
Hierarchical Planning with Foundation Models
Language Model Video Model Egocentric
Action
Model
Task Information Motion
Information
Kinematics Information
[1] Ajay*, Han*, Du* et al. Compositional Foundation Models for Hierarchical Planning. NeurIPS 2023
We want to construct a plan to make a cup of tea that is
semantically, geometrically, and physically executable on a
robot!
57
Hierarchical Planning with Foundation Models
High Level
Action
Physical
Plausibility
Kinematic
Plausibility
Low Level Action
[1] Ajay*, Han*, Du* et al. Compositional Foundation Models for Hierarchical Planning. NeurIPS 2023
Language Model Video Model Egocentric
Action
Model
Task Information Motion
Information
Kinematics Information
We want to construct a plan to make a cup of tea that is
semantically, geometrically, and physically executable on a
robot!
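At a very high level, the consensus across the three models might look like the following sketch; all function names are hypothetical stand-ins, not the actual interface from [1]:

```python
def hierarchical_plan(lm_propose, video_model, video_likelihood, action_model,
                      goal, image, n_candidates=8):
    """Sketch: find a plan the three levels agree is semantically,
    geometrically, and physically executable."""
    # Task level: the language model proposes candidate subtask sequences.
    plans = [lm_propose(goal, image) for _ in range(n_candidates)]
    # Motion level: the video model renders each sequence; keep the rollout
    # that is most physically plausible under the video likelihood.
    rollouts = [video_model(image, plan) for plan in plans]
    best = max(range(len(rollouts)), key=lambda i: video_likelihood(rollouts[i]))
    # Kinematic level: the egocentric action model converts the chosen
    # video plan into low-level robot actions.
    return action_model(rollouts[best])
```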
58
Hierarchical Decision Making with Multimodal Models
Stack red block on a
cyan block and place
a brown block to the
right of stack.
Goal
Start Image
Place cyan block in brown box.
Generated Video Plan
Generated Language Plan
59
Place white block in a cyan bowl.
Hierarchical Decision Making with Multimodal Models
Stack red block on a
cyan block and place
a brown block to the
right of stack.
Goal
Start Image
Generated Video Plan
Generated Language Plan
AI
Goal: Stack red block on top of
brown block and place yellow
block to the left of the stack
Hierarchical Decision Making with Multimodal Models – Execution
61
The Issue of Compositional Sampling Across Models
The previous consensus procedure is slow because the language model often
proposes plans not grounded in vision, and the video likelihood is hard to
optimize.
High Level Action
Physical
Plausibility
Language Model Video Model
Task Information Motion
Information
62
Effective Planning with Vision Language Models
Use a vision language model to propose visually grounded plans and to serve
as a heuristic that accelerates consensus.
High Level Action
Physical
Plausibility
Vision Language
Model
Video Model
Task Information Motion
Information
[1] Du et al. Video Language Planning. ICLR 2024.
63
Vision Language Models as Text Policies
Use the VLM as a policy $\pi_{\text{VLM}}(x, g)$ which, given an image $x$ and text goal $g$,
outputs possible text actions $a$ to execute next.
Vision Language
Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Image x
Text Goal g
Put the fruits in
the top drawer.
Open top drawer.
Place banana in
top drawer.
64
Video Models as Text Conditioned Video Planners
Video Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Image x
Text Action a
Place banana in
top drawer
Continue to use the video model $f_{\text{VM}}(x, a)$ as a video-space planner which, given
an image $x$ and a text action $a$, synthesizes an image plan.
65
Video Models as Text Conditioned Video Planners
Video Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Image x
Text Action a
Open top drawer
Continue to use the video model $f_{\text{VM}}(x, a)$ as a video-space planner which, given
an image $x$ and a text action $a$, synthesizes an image plan.
66
Vision Language Models as Heuristic Functions
Use the VLM as a heuristic function $H_{\text{VLM}}(\tau, g)$ which, given a partial video
plan $\tau$ and text goal $g$, estimates completion of the goal.
Vision Language
Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Video Plan
Heuristic Estimate
Text Goal g
Put the fruits in
the top drawer.
67
Vision Language Models as Heuristic Functions
Use the VLM as a heuristic function $H_{\text{VLM}}(\tau, g)$ which, given a partial video
plan $\tau$ and text goal $g$, estimates completion of the goal.
Vision Language
Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Video Plan
Heuristic Estimate
Text Goal g
Put the fruits in
the top drawer.
68
Zero-Shot Tree Search with Foundation Models
[1] Du et al. Video Language Planning. ICLR 2024.
Enables efficient consensus using tree search!
$\tau_{\text{image}} = \operatorname{argmin}_{\tau_{\text{image}} \sim f_{\text{VM}},\, \pi_{\text{VLM}}} H_{\text{VLM}}(\tau_{\text{image}}, g)$
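A sketch of that search loop, assuming the heuristic scores remaining distance to completion (hence the argmin above); `vlm_policy`, `video_model`, and `vlm_heuristic` are illustrative stand-ins for the PALM-E / UniPi components:

```python
def video_language_plan(vlm_policy, video_model, vlm_heuristic,
                        image, goal, depth=3, branch=4, beam=2):
    """Beam-style tree search over video plans (a sketch of the procedure
    in [1], not its exact implementation)."""
    frontier = [([], image)]              # (video plan so far, last frame)
    for _ in range(depth):
        candidates = []
        for plan, frame in frontier:
            for action in vlm_policy(frame, goal, n=branch):  # text actions
                segment = video_model(frame, action)          # frames rollout
                candidates.append((plan + [segment], segment[-1]))
        # Keep the partial plans the heuristic judges closest to completion.
        candidates.sort(key=lambda c: vlm_heuristic(c[0], goal))
        frontier = candidates[:beam]
    return frontier[0][0]                 # best full video plan
```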
69
Zero-Shot Planning and Execution with Foundation Models
Task Information Motion
Information
Kinematics Information
[1] Du et al. Video Language Planning. ICLR 2024
We can plan and execute unseen long horizon tasks on the real robot
without any explicit task training!
Vision Language Model
(PALM-E)
Video Model
(UniPi)
Action Model
(Goal-Conditioned RT-2)
70
Zero-Shot Planning and Execution with Foundation Models
Goal: Put the fruit into
the top drawer.
AI
Move the red circle to the left of the yellow
hexagon
Move the green circle closer to the red star
Move the blue triangle to the top left of the red
circle
Move the blue cube to the left of the blue triangle
Move the green circle to the center
Push the green circle towards the yellow heart
Move the blue triangle to the right of the green
circle
....
Subgoal sequence: dynamically generated
Synthesized sequence of video plans
Goal: Make a line
Goal: Make a line – Execution on Real Robot
73
Long Horizon Decision Making through Hierarchical Composition
[1] Du et al. Video Language Planning. ICLR 2024.
74
Long Horizon Decision Making through Hierarchical Composition
[1] Du et al. Video Language Planning. ICLR 2024.
Hierarchical planning substantially outperforms monolithic generative models (RT-2 / PALM-E)
75
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
76
Building Customizable Generative Models through
Compositional Generation
Yilun Du
July 2024
Editor's Notes

  • #1: Section on foundation models needs to go from abstract to concrete. Slow down and explain the mathematical derivation. Factorizing: optimize in intelligent generative ways and switch the example. Multiagent communication is factorization XXX. Again, to factorize the energy function we do XXX. Too much stuff, repeats things; list all the works on a new slide. Speak a little less slowly; pause more in the talk for interesting parts. Define the failures of the current method. To the Test Set and Beyond: Learning and Composing EBMs for Generalization. Go deep, then go quick. What is the training data you used in each example? High-level comments: ending is too fluffy – more descriptions, remove citations. Utility/Probability/Sets is confusing. Highlight the technical challenges; everything sounds very similar.
  • #2: Give the insight of each paper in each section. Sometimes I get lost from the main topic – jumping too much (the cure is rhetoric; titles of the slides and of the talk are not important): limited amount of data, different distributions, compositionality. Hit compositionality every time and put it in the context of compositionality. Now what if we know this but don't know... Speech contours make everything clearer. Too much energy function, too little compositionality. Hey, this is the plan: hold hands and tell people what is connected. Too many papers in too little time. Title of the talk is not... At the end of each section, return to the main idea of the talk. Emphasize the explicit and implicit learning more; make a set of bullets. Implicit functions drive the information, something about compositionality. Write down all the assertions, allocate time as the score. A few well-placed diagrams and formulas. The dashes are not consistent.
  • #3: Visually indicate the incorrect things
  • #9: I'll divide my talk into four separate sections. First I'll introduce the idea of energy based models and show how they enable separate models to be composed together. Next I'll present a set of different compositional operators to combine energy functions, and illustrate that they enable generalization beyond the original training distribution. I'll further show how such operations can be generalized to generate trajectories of actions to execute in an environment. Finally, I'll illustrate how such compositionality may be more broadly applied across a set of different heterogeneous foundation models to construct complex multimodal systems.
  • #10: Now we can do this in detail, comes from statistics physics, bases of Boltzmann, Helmholtz machine, challenge is doing it for high dimensional input data
  • #11: Simultaneously, such an energy function assigns low energy to an image where the underlying person is smiling
  • #12: Simultaneously, such an energy function assigns low energy to an image where the underlying person is smiling
  • #13: Simultaneously, such an energy function assigns low energy to an image where the underlying person is smiling
  • #14: Cite people, write the form of the sampling: Du et al., NeurIPS XX; Smith and Jones, AAI. Need to motivate generation. If we have this energy function, we can use it to find high-scoring xs, and here is how: we can do gradient descent on the energy. Talk about different applications of the energy function (MAP estimation, draw samples from this energy function, sample from the probability distribution).
  • #15: For a long time people have been pursuing this idea – emphasize the things that make it effective. Given samples from a distribution P, I would like to learn an energy function representing log P. We want to find θ to maximize P_θ(D), i.e., minimize -log P_θ(D). Do stuff step by step, add θ. List the data we have – have some smiling people. One approach to training models is to generate random noise and train the energy. Take a random example – call it a negative example – and try to make it better according to the energy function. Minimize the energy of smiling people, maximize the energy of hallucinations (sampling from the energy function). Have an illustration showing a set of data, where we want low energy at the data points and high energy everywhere else. Move the relation stuff here.
  • #16: Same note as #15.
  • #17: Same note as #15.
  • #18: Make a matrix
  • #19: Make it super clear that these energy functions are trained differently. Different examples: what if I have these two models that I have learned in different settings? Show iteratively adding stuff. Add "no additional training" somewhere.
  • #20: Same note as #19.
  • #21: Can draw the PDFs for the combinations. When I'm using generative models to represent distributions, I can do the following things: show 2D distributions.
  • #22: Same note as #21.
  • #23: Same note as #21.
  • #24: Make sure to explain the examples clearly – we trained some separate things, and then we used them to answer some question. Steps: 1) First we train a thing to generate. 2) Now here is a query – imagine a scene that satisfies this thing. 3) Then show the StyleGAN one. 4) Then show our result.
  • #25: Explain how you discover the right compositional structure
  • #26: Pose the problem – now we have an input and we want to discover the components; adopt a color scheme for things that you know and that you don't know. Technical challenges / show more methods: how to decompose the components. Decide on 3-5 points. Show half colored / half not colored. Now, here is another useful thing you can do – score things. In the previous picture the composition was at test time; this one is at training time.
  • #27: For every 2 slides, have a declarative sentence
  • #31: Same roadmap note as #9.
  • #32: If we wanted to train a robot policy to directly solve each of the possible skills we could encounter in the world, it would require a huge range of different possible demonstrations. We would need demonstrations illustrating how to flip bacon, for instance, across all the different possible configurations in a kitchen and the locations and positions of agents. In practice, this is pretty infeasible; demonstrations often only cover a subset of the space in which I would like to execute skills.
  • #33: One way to effectively use these demonstrations is to decompose them into different subcomponents: learn a world model or simulator of the world, and a separate model that specifies what the successful execution of a skill corresponds to – a behavioral classifier.
  • #34: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #35: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #36: Add note: can change this any time Say that non learnable things can be added also
  • #37: Add note: can change this any time Say that non learnable things can be added also
  • #38: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #39: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #40: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #41: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #42: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #43: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #44: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #45: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #46: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #47: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #48: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #49: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #50: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #51: Same roadmap note as #9.
  • #52: To effectively construct one large model to solve the decision making task, we would need a huge amount of data, covering all different combinations of semantic information, visual, physical and control knowledge. We need to
  • #53: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #54: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #55: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #56: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #57: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #58: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #59: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #61: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #62: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #63: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #64: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #65: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #66: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #67: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #68: Explain this slide better
  • #69: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #70: Explain this slide better
  • #73: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #74: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #75: Same roadmap note as #9.