1
Building Customizable Generative Models through
Compositional Generation
Yilun Du
July 2024
2
Is Data All You Need?
Is Data All You Need?
Question: Is the blue cup below the red
bowl?
GPT-4V: No, the blue cup is not below the red
bowl. The blue cup is to the right side of the
picture, and the red bowl is not visible in the
image…
3
4
Why Does Generative AI Work Poorly in Other Settings?
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Compared to pixels in images (and many other continuous distributions), natural
language is much simpler (structured for effective communication) and naturally
compositional (infinite use of finite means).
Natural Language
Training
Distribution
Real World Distribution Real World Distribution
Training
Distribution
Embodied Data
5
Composition of Learned Factors Yields Strong Generalization
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Natural Language
Training
Distribution
Real World Distribution Real World Distribution
Training
Distribution
Embodied Data
Energy Based Models (EBMs) provide a probabilistic way to represent the real
distribution as a composition of tractable factors!
6
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Factor (Axis 1)
Factor (Axis 2)
Composition of Learned Factors Yields Strong Generalization
Natural Language
Training
Distribution
Real World Distribution
Training
Distribution
Embodied Data
Energy Based Models (EBMs) provide a probabilistic way to represent the real
distribution as a composition of tractable factors!
$p(x) \propto \prod_{i=1}^{N} p_i(x_i)$
Real World Distribution
7
Composition of Learned Factors Yields Strong Generalization
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
Factors can be combined at prediction
time to represent the entire
distribution (even parts with no data).
Factor (Axis 2)
Factor (Axis 1)
Natural Language
Training
Distribution
Real World Distribution
Training
Distribution
Embodied Data
Real World Distribution
$p(x) \propto \prod_{i=1}^{N} p_i(x_i)$
8
Composition of Learned Factors Yields Strong Generalization
An astronaut riding a
horse
in a photorealistic style.
A large blue metal cube in front of a large
cyan metal cylinder. A large blue metal
cube on the left of small metal sphere.
New compositions of factors enable
fast customization of generative
models to new settings.
Factor (Axis 2)
Factor (Axis 1)
Natural Language
Training
Distribution
Real World Distribution
Training
Distribution
Embodied Data
Real World Distribution
$p(x) \propto \prod_{i=1}^{N} p_i(x_i)$
9
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
10
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
High Energy
11
Low Energy
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
12
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
E(x)
Low Energy
13
Implicit Modeling with Energy Functions
E
Red Truck
Energy Function
Encodes the probability distribution: $p(x) \propto e^{-E(x)}$
E(x)
Low Energy
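To make the mapping concrete, here is a toy sketch (not from the slides) of how energies over a few discrete candidates turn into probabilities via $p(x) \propto e^{-E(x)}$:

```python
import numpy as np

# Toy illustration: Boltzmann probabilities from energies over three
# candidate images; the lowest-energy candidate is the most probable.
energies = np.array([0.5, 2.0, 4.0])       # E(x) for three candidates
unnormalized = np.exp(-energies)           # exp(-E(x))
probs = unnormalized / unnormalized.sum()  # normalize to obtain p(x)
print(probs)
```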
14
Sampling from Probability Distributions
E
Red Truck
Energy Function
Langevin Dynamics
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
$x_t = x_{t-1} - \nabla_x E(x_{t-1}) + \mathcal{N}(0, \sigma)$
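A minimal sketch of this sampler in Python, assuming a generic `grad_E` callback; the step and noise scales here are illustrative, not the tuned schedule from [1]:

```python
import numpy as np

def langevin_sample(grad_E, x0, n_steps=500, step=0.01, noise=0.005):
    """Approximate sample from p(x) ∝ exp(-E(x)) by noisy gradient
    descent on the energy (the Langevin update above)."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * grad_E(x) + noise * np.random.randn(*x.shape)
    return x

# Example: E(x) = ||x - mu||^2 / 2 gives grad_E(x) = x - mu, so samples
# concentrate near mu (a Gaussian centered at mu).
mu = np.array([1.0, -2.0])
print(langevin_sample(lambda x: x - mu, x0=np.zeros(2)))
```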
15
• Maximum likelihood training of EBMs corresponds to minimizing the loss:
Learning Energy Functions
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
$\nabla_\theta \mathcal{L}_{\text{ML}} = \mathbb{E}_{x \sim p(x)}[\nabla_\theta E(x)] - \mathbb{E}_{x \sim p_\theta(x)}[\nabla_\theta E(x)]$
(first term: drawn from data; second term: drawn from the learned distribution)
16
Learning Energy Functions
• Maximum likelihood training of EBMs corresponds to minimizing the loss:
• We showed how to scale EBMs to modern generative tasks by leveraging a
combination of Langevin dynamics + replay buffers [1].
$\nabla_\theta \mathcal{L}_{\text{ML}} = \mathbb{E}_{x \sim p(x)}[\nabla_\theta E(x)] - \mathbb{E}_{x \sim p_\theta(x)}[\nabla_\theta E(x)]$
(first term: drawn from data; second term: drawn from the learned distribution)
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
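A rough PyTorch sketch of this recipe, with an illustrative fixed-size replay buffer; names, sizes, and Langevin hyperparameters are assumptions, not the exact configuration from [1]:

```python
import torch

class ReplayBuffer:
    """Minimal buffer of past negative samples (an assumption; the actual
    buffer in [1] also mixes in fresh noise initializations)."""
    def __init__(self, shape, size=1000):
        self.data = torch.randn(size, *shape)

    def sample(self, n):
        return self.data[torch.randint(len(self.data), (n,))]

    def add(self, x):
        self.data[torch.randint(len(self.data), (len(x),))] = x

def ebm_ml_step(energy_net, x_real, buffer, n_langevin=20, step=1.0, noise=0.01):
    # Negatives: Langevin dynamics initialized from the replay buffer.
    x_neg = buffer.sample(len(x_real)).requires_grad_(True)
    for _ in range(n_langevin):
        grad = torch.autograd.grad(energy_net(x_neg).sum(), x_neg)[0]
        x_neg = (x_neg - step * grad +
                 noise * torch.randn_like(x_neg)).detach().requires_grad_(True)
    buffer.add(x_neg.detach())
    # Contrastive maximum-likelihood gradient: push energy down on data,
    # up on model samples (the two expectations above).
    return energy_net(x_real).mean() - energy_net(x_neg.detach()).mean()
```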
Diffusion Models are Energy Based Models
• Diffusion models are Energy Based Models where the gradient of the
energy function is the learned denoising function:
• Can implement compositionality with diffusion models by treating the
denoising function as the gradient of the energy function [2].
[1] Du and Mordatch. Implicit Generation and Modeling with Energy Based Models. NeurIPS 2019
[2] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and
MCMC. ICML 2023
$x_t = x_{t-1} - \nabla_x E(x_{t-1}) + \mathcal{N}(0, \sigma) \quad\Leftrightarrow\quad x_t = x_{t-1} - \varepsilon_\theta(x_{t-1}) + \mathcal{N}(0, \sigma)$
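A sketch of how this identity licenses composition, assuming two pretrained denoisers with the same signature; the real samplers in [2] add MCMC correction steps, which are omitted here:

```python
import torch

def composed_reverse_step(x_t, denoisers, weights, sigma):
    """One reverse-diffusion step from a product of models: since each
    eps_theta plays the role of grad_x E, a weighted sum of noise
    predictions is the gradient of the summed (composed) energy."""
    eps = sum(w * d(x_t) for w, d in zip(weights, denoisers))
    return x_t - eps + sigma * torch.randn_like(x_t)

# Usage sketch: compose "red truck" and "desert" conditioned denoisers.
# x = torch.randn(1, 3, 64, 64)
# for sigma in noise_schedule:
#     x = composed_reverse_step(x, [eps_truck, eps_desert], [1.0, 1.0], sigma)
```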
18
[6] 3D Shapes
[4] Video Plans
[3] Policies
[1] Images [2] Robot Plans
[5] Soft Robots
[1] Liu*, Li*, Du* et al. Composable Visual Generation with Diffusion Models. ECCV 2022
[2] Janner*, Du* et al. Planning with Diffusion for Flexible Behavior Synthesis. ICML 2022
[3] Chi, Feng, Du et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.
[4] Du et al. Learning Universal Policies through Text-Conditioned Video Generation. NeurIPS 2023
[5] Wang, Zheng, Ma, Du* et al. Diffusebot: Breeding Soft Robots with Physics Augmented Diffusion Models. NeurIPS 2023
[6] Zhou, Du et al. 3D Shape Generation and Completion through Point-Voxel Diffusion. ICCV 2021
EBMs Can Represent Distributions Across High Dimensional Inputs
19
Composing Different Energy Functions at Prediction Time
E
Minimize Energy
E1
Red Truck
Energy Function
20
Composing Different Energy Functions at Prediction Time
E
Minimize Energy
E1
En
Red Truck
Energy Function
Desert
Energy Function
21
Combining Probability Distributions with Energy Functions
[1] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. ICML 2023
Products: $p(x) \propto p_1(x)\, p_2(x)$
22
Combining Probability Distributions with Energy Functions
[1] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. ICML 2023
Products: $p(x) \propto p_1(x)\, p_2(x)$
Mixtures: $p(x) \propto p_1(x) + p_2(x)$
23
Combining Probability Distributions with Energy Functions
[1] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. ICML 2023
Products: $p(x) \propto p_1(x)\, p_2(x)$
Mixtures: $p(x) \propto p_1(x) + p_2(x)$
Inversion: $p(x) \propto p_1(x) / p_2(x)^{a}$
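In energy space (where $p \propto e^{-E}$), the three operators reduce to simple arithmetic on energies; a small sketch, dropping normalization constants as the slides do:

```python
import torch

def product_energy(E1, E2):
    # p1(x) * p2(x)  <->  E1(x) + E2(x)
    return lambda x: E1(x) + E2(x)

def mixture_energy(E1, E2):
    # p1(x) + p2(x)  <->  -log(exp(-E1(x)) + exp(-E2(x)))
    return lambda x: -torch.logsumexp(torch.stack([-E1(x), -E2(x)]), dim=0)

def inversion_energy(E1, E2, a=0.5):
    # p1(x) / p2(x)^a  <->  E1(x) - a * E2(x)
    return lambda x: E1(x) - a * E2(x)
```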
24
Unsupervised Discovery of Composable Components
Training
Distribution
How can we discover the independent factors of a distribution?
25
Factor
2
Factor 1
Training
Distribution
$p(x) \propto \prod_{i=1}^{N} p_i(x)$
Train a generative model with factorized structure through maximum likelihood.
Unsupervised Discovery of Composable Components
26
[Figure: per-component inferred energy functions $E(\cdot \mid \mathrm{Enc}_k(x))$ composed and jointly minimized]
$x = \arg\min_{\tilde{x}} \sum_k E(\tilde{x} \mid \mathrm{Enc}_k(x))$
Encourage independence through an information bottleneck.
Unsupervised Discovery of Composable Components
$p_\theta(x \mid I) \propto \prod_{i=1}^{N} p_\theta^i(x \mid \mathrm{Enc}_i(I))$
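A sketch of this inference step with illustrative names: `encoders` infer one latent per component and `energy(x, z)` is the conditional energy; reconstruction minimizes their sum by gradient descent:

```python
import torch

def compose_and_reconstruct(x, encoders, energy, n_steps=50, step=0.1):
    """Infer K latent components z_k = Enc_k(x), then reconstruct x by
    gradient descent on the *sum* of the conditional energies."""
    zs = [enc(x) for enc in encoders]              # per-component latents
    x_tilde = torch.randn_like(x, requires_grad=True)
    opt = torch.optim.SGD([x_tilde], lr=step)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = sum(energy(x_tilde, z) for z in zs)  # composed energy
        loss.backward()
        opt.step()
    return x_tilde.detach()
```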
27
[1] Du et al. Unsupervised Learning of Compositional Energy Concepts. NeurIPS 2021
Discovering Composable Components in Natural Images
Input Image Inferred Energy
Functions
Compose Energy
Function
$p_\theta(x \mid z_1)$, $p_\theta(x \mid z_2)$, $p_\theta(x \mid z_3)$, $p_\theta(x \mid z_4)$
28
Compositional Generalization of Inferred Components
Dataset 1 Dataset 2
Images
[1] Su*, Du*, Liu* et al. Compositional Image Decomposition with Diffusion Models.
29
Compositional Generalization of Inferred Components
Dataset 1 Dataset 2
Images
[1] Su*, Du*, Liu* et al. Compositional Image Decomposition with Diffusion Models.
30
Compositional Generalization of Inferred Components
Dataset 1 Dataset 2
Images Composition
[1] Su*, Du*, Liu* et al. Compositional Image Decomposition with Diffusion Models.
Sample from the
composed
distribution:
31
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
32
Generalizing Beyond Demonstrations through Composition
We want to construct
robotic agents that can
generalize beyond the
demonstrations they
have seen!
Trajectory Synthesis
Demonstrations
Robot Actions
33
Planning through Compositional Generation
Decompose
demonstrations
into a model
capturing the
dynamics of the
environment + a
model to capture
the goal of a task.
Trajectory Synthesis
Demonstrations
Goal
Dynamics
Robot Actions
(, goal)
34
Planning through Compositional Generation
Trajectory Synthesis
Demonstrations
Goal
Dynamics
Robot Actions
(, goal)
Enables
generalization to
new combinations
of states + goals
through
probabilistic
planning!
35
Planning with Energy Minimization
E
Minimize Energy
E1
E2
Trajectory Energy
Function
Cost Functions
(Goal, Value Functions,
Test-Time Constraints)
[1] Du et al. Model Based Planning with Energy Based Models. CoRL 2019.
[2] Janner*, Du* et al. Planning with Diffusion for Flexible Behavior Synthesis. ICML 2022
[3] Ajay*, Du*, Gupta* et al. Is Conditional Generative Modeling all You Need For Decision Making? ICLR 2023
Planning/reinforcement learning through inference-time search on composed distributions.
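A minimal sketch of this inference-time search, assuming a differentiable trajectory energy and arbitrary test-time cost functions (all names illustrative):

```python
import torch

def plan(traj_energy, cost_fns, horizon, state_dim, n_steps=100, lr=0.05):
    """Planning as inference-time composition: optimize a whole trajectory
    against the learned trajectory energy (demo/dynamics plausibility)
    plus any test-time costs (goals, value functions, constraints)."""
    tau = torch.randn(horizon, state_dim, requires_grad=True)
    opt = torch.optim.Adam([tau], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        energy = traj_energy(tau) + sum(c(tau) for c in cost_fns)
        energy.backward()
        opt.step()
    return tau.detach()

# Example: add a goal-reaching cost at test time, without retraining:
# goal_cost = lambda tau: ((tau[-1] - goal) ** 2).sum()
```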
36
Reinforcement Learning through Value Composition
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ scored by a sum of composed value functions]
37
Reinforcement Learning through Value Composition
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ scored by a sum of composed value functions]
Solve many different tasks with one model!
38
Reinforcement Learning through Value
Composition
39
Goal Planning through Optimization
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ optimized so that it ends at a goal state $\mathbf{g}$, given start $\mathbf{s}_0$ and goal $g$]
Construct a zero-shot goal-seeking policy!
40
Goal Planning through Optimization
41
Goal Planning through Optimization
Single Task Planning Multi-Task Planning
42
Goal Planning through Optimization
Single Task Planning Multi-Task Planning
Baselines require per task retraining
while our trained model can be applied
across tasks!
43
Test-Time Cost Functions
[Figure: trajectory $(\mathbf{s}_0, \mathbf{a}_0, \ldots, \mathbf{s}_5, \mathbf{a}_5)$ with additional test-time cost terms composed into the trajectory energy]
44
Optimizing Hand-Specified Cost Functions with Energy Minimization
45
Planning From Partial Visual Observations on Real Robots
[1] Chi, Feng, Du et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.
Can learn complex trajectory planning from only visual observations,
given very few (50) demonstrations.
46
Fast Task Learning by Finetuning the Task Energy Function
E
Minimize Energy
E1
E2
Trajectory Energy
Function
Cost Functions
(Goal, Value Functions,
Test-Time Constraints)
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Can learn new tasks from
very few demonstrations!
47
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Train Tasks
Highway
Exit
Merge
Intersection
48
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Test Task
(5 Demos)
Few-shot Energy
Composition
Behavioral
Cloning
In Context
Learning
49
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Train Tasks
(217 Demos)
Push on Surface Push around Bowl
Pick and Place on Book Pick and Place on Table
50
Fast Task Learning by Finetuning the Task Energy Function
[1] Netanyahu, Du et al. Few-Shot Task Learning through Inverse Generative Modeling.
Test Task
(10 Demos)
Push on Book
51
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
52
Solving Long Horizon Tasks by Composing Foundation Models
Solving a long horizon
decision-making task requires
generalization across many
sources of knowledge.
Make A Cup of Tea
Decision Making
Demonstrations
Embodied Agents
Composing Foundation Models for Embodied Agents
Compose foundation
models representing
each axis of information!
Make A Cup of Tea
Decision Making
Demonstrations
Embodied Agents
Visual
Knowledge
Procedural Information
Composing Foundation Models for Embodied Agents
Compose foundation
models representing
each axis of information!
Make A Cup of Tea
Decision Making
Demonstrations
Embodied Agents
Visual
Knowledge
Procedural Information
55
Hierarchical Planning with Foundation Models
Language Model Video Model Egocentric
Action
Model
Task Information Motion
Information
Kinematics Information
[1] Ajay*, Han*, Du* et al. Compositional Foundation Models for Hierarchical Planning. NeurIPS 2023
We want to construct a plan to make a cup of tea that is
semantically, geometrically, and physically executable on a
robot!
1) Look for a tea kettle
2) Heat water
3) Find teabag
4) …
56
Hierarchical Planning with Foundation Models
Language Model Video Model Egocentric
Action
Model
Task Information Motion
Information
Kinematics Information
[1] Ajay*, Han*, Du* et al. Compositional Foundation Models for Hierarchical Planning. NeurIPS 2023
We want to construct a plan to make a cup of tea that is
semantically, geometrically, and physically executable on a
robot!
57
Hierarchical Planning with Foundation Models
High Level
Action
Physical
Plausibility
Kinematic
Plausibility
Low Level Action
[1] Ajay*, Han*, Du* et al. Compositional Foundation Models for Hierarchical Planning. NeurIPS 2023
Language Model Video Model Egocentric
Action
Model
Task Information Motion
Information
Kinematics Information
We want to construct a plan to make a cup of tea that is
semantically, geometrically, and physically executable on a
robot!
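At a very high level, the consensus across the three models might look like the following sketch; all function names are hypothetical stand-ins, not the actual interface from [1]:

```python
def hierarchical_plan(lm_propose, video_model, video_likelihood, action_model,
                      goal, image, n_candidates=8):
    """Sketch: find a plan the three levels agree is semantically,
    geometrically, and physically executable."""
    # Task level: the language model proposes candidate subtask sequences.
    plans = [lm_propose(goal, image) for _ in range(n_candidates)]
    # Motion level: the video model renders each sequence; keep the rollout
    # that is most physically plausible under the video likelihood.
    rollouts = [video_model(image, plan) for plan in plans]
    best = max(range(len(rollouts)), key=lambda i: video_likelihood(rollouts[i]))
    # Kinematic level: the egocentric action model converts the chosen
    # video plan into low-level robot actions.
    return action_model(rollouts[best])
```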
58
Hierarchical Decision Making with Multimodal Models
Stack red block on a
cyan block and place
a brown block to the
right of stack.
Goal
Start Image
Place cyan block in brown box.
Generated Video Plan
Generated Language Plan
59
Place white block in a cyan bowl.
Hierarchical Decision Making with Multimodal Models
Stack red block on a
cyan block and place
a brown block to the
right of stack.
Goal
Start Image
Generated Video Plan
Generated Language Plan
AI
Goal: Stack red block on top of
brown block and place yellow
block to the left of the stack
Hierarchical Decision Making with Multimodal Models – Execution
61
The Issue of Compositional Sampling Across Models
The previous consensus procedure is slow because the language model often
proposes plans not grounded in vision, and the video likelihood is hard to
optimize.
High Level Action
Physical
Plausibility
Language Model Video Model
Task Information Motion
Information
62
Effective Planning with Vision Language Models
Use a vision language model to propose visually grounded plans and to serve
as a heuristic that accelerates consensus.
High Level Action
Physical
Plausibility
Vision Language
Model
Video Model
Task Information Motion
Information
[1] Du et al. Video Language Planning. ICLR 2024.
63
Vision Language Models as Text Policies
Use the VLM as a policy $\pi_{\text{VLM}}(x, g)$ which, given an image $x$ and text goal $g$,
outputs possible text actions $a$ to execute next.
Vision Language
Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Image x
Text Goal g
Put the fruits in
the top drawer.
Open top drawer.
Place banana in
top drawer.
64
Video Models as Text Conditioned Video Planners
Video Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Image x
Text Action a
Place banana in
top drawer
Continue to use the video model $f_{\text{VM}}(x, a)$ as a video-space planner which, given
an image $x$ and a text action $a$, synthesizes an image plan.
65
Video Models as Text Conditioned Video Planners
Video Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Image x
Text Action a
Open top drawer
Continue to use the video model $f_{\text{VM}}(x, a)$ as a video-space planner which, given
an image $x$ and a text action $a$, synthesizes an image plan.
66
Vision Language Models as Heuristic Functions
Use the VLM as a heuristic function $H_{\text{VLM}}(\tau, g)$ which, given a partial video
plan $\tau$ and text goal $g$, estimates completion of the goal.
Vision Language
Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Video Plan
Heuristic Estimate
Text Goal g
Put the fruits in
the top drawer.
67
Vision Language Models as Heuristic Functions
Use the VLM as a heuristic function $H_{\text{VLM}}(\tau, g)$ which, given a partial video
plan $\tau$ and text goal $g$, estimates completion of the goal.
Vision Language
Model
Task Information
[1] Du et al. Video Language Planning. ICLR 2024.
Video Plan
Heuristic Estimate
Text Goal g
Put the fruits in
the top drawer.
68
Zero-Shot Tree Search with Foundation Models
[1] Du et al. Video Language Planning. ICLR 2024.
Enables efficient consensus using tree search!
$\tau_{\text{image}} = \operatorname{argmin}_{\tau_{\text{image}} \sim f_{\text{VM}},\, \pi_{\text{VLM}}} H_{\text{VLM}}(\tau_{\text{image}}, g)$
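A sketch of that search loop, assuming the heuristic scores remaining distance to completion (hence the argmin above); `vlm_policy`, `video_model`, and `vlm_heuristic` are illustrative stand-ins for the PALM-E / UniPi components:

```python
def video_language_plan(vlm_policy, video_model, vlm_heuristic,
                        image, goal, depth=3, branch=4, beam=2):
    """Beam-style tree search over video plans (a sketch of the procedure
    in [1], not its exact implementation)."""
    frontier = [([], image)]              # (video plan so far, last frame)
    for _ in range(depth):
        candidates = []
        for plan, frame in frontier:
            for action in vlm_policy(frame, goal, n=branch):  # text actions
                segment = video_model(frame, action)          # frames rollout
                candidates.append((plan + [segment], segment[-1]))
        # Keep the partial plans the heuristic judges closest to completion.
        candidates.sort(key=lambda c: vlm_heuristic(c[0], goal))
        frontier = candidates[:beam]
    return frontier[0][0]                 # best full video plan
```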
69
Zero-Shot Planning and Execution with Foundation Models
Task Information Motion
Information
Kinematics Information
[1] Du et al. Video Language Planning. ICLR 2024
We can plan and execute unseen long horizon tasks on the real robot
without any explicit task training!
Vision Language Model
(PALM-E)
Video Model
(UniPi)
Action Model
(Goal-Conditioned RT-2)
70
Zero-Shot Planning and Execution with Foundation Models
Goal: Put the fruit into
the top drawer.
AI
Move the red circle to the left of the yellow
hexagon
Move the green circle closer to the red star
Move the blue triangle to the top left of the red
circle
Move the blue cube to the left of the blue triangle
Move the green circle to the center
Push the green circle towards the yellow heart
Move the blue triangle to the right of the green
circle
....
Subgoal sequence: dynamically generated
Synthesized sequence of video plans
Goal: Make a line
Goal: Make a line – Execution on Real Robot
73
Long Horizon Decision Making through Hierarchical Composition
[1] Du et al. Video Language Planning. ICLR 2024.
74
Long Horizon Decision Making through Hierarchical Composition
[1] Du et al. Video Language Planning. ICLR 2024.
Hierarchical planning substantially outperforms monolithic generative models (RT-2 / PALM-E)
75
Energy Based Models
$E : \mathbb{R}^K \rightarrow \mathbb{R}$
Introduce the idea of
Energy Based Models
and compositionality.
Planning
Generalizing beyond
demonstrations
through planning
Foundation
Models
Combining many
foundation models for
intelligent decision
making.
76
Building Customizable Generative Models through
Compositional Generation
Yilun Du
July 2024
Editor's Notes

  • #1: Section on foundation models needs to go from abstract to concrete. Slow down and explain the mathematical derivation. Factorizing: optimize in intelligent generative ways and switch the example. Multiagent communication is factorization XXX. Again, to factorize the energy function we do XXX. Too much stuff, repeats things; list all the works on a new slide. Speak a little less slowly; pause more in the talk for interesting parts. Define the failures of the current method. To the Test Set and Beyond: Learning and Composing EBMs for Generalization. Go deep, then go quick. What is the training data you used in each example? High-level comments: ending is too fluffy – more descriptions, remove citations. Utility/Probability/Sets is confusing. Highlight the technical challenges; everything sounds very similar.
  • #2: Give the insight of each paper in each section. Sometimes I get lost from the main topic – jumping too much (the cure is rhetoric; titles of the slides and of the talk are not important): limited amount of data, different distributions, compositionality. Hit compositionality every time and put it in the context of compositionality. Now what if we know this but don't know... Speech contours make everything clearer. Too much energy function, too little compositionality. Hey, this is the plan: hold hands and tell people what is connected. Too many papers in too little time. Title of the talk is not... At the end of each section, return to the main idea of the talk. Emphasize the explicit and implicit learning more; make a set of bullets. Implicit functions drive the information, something about compositionality. Write down all the assertions, allocate time as the score. A few well-placed diagrams and formulas. The dashes are not consistent.
  • #3: Visually indicate the incorrect things
  • #9: I'll divide my talk into four separate sections. First I'll introduce the idea of energy based models and show how they enable separate models to be composed together. Next I'll present a set of different compositional operators to combine energy functions, and illustrate that they enable generalization beyond the original training distribution. I'll further show how such operations can be generalized to generate trajectories of actions to execute in an environment. Finally, I'll illustrate how such compositionality may be more broadly applied across a set of different heterogeneous foundation models to construct complex multimodal systems.
  • #10: Now we can do this in detail, comes from statistics physics, bases of Boltzmann, Helmholtz machine, challenge is doing it for high dimensional input data
  • #11: Simultaneously, such an energy function assigns low energy to an image where the underlying person is smiling
  • #12: Simultaneously, such an energy function assigns low energy to an image where the underlying person is smiling
  • #13: Simultaneously, such an energy function assigns low energy to an image where the underlying person is smiling
  • #14: Cite people, write the form of the sampling: Du et al., NeurIPS XX; Smith and Jones, AAI. Need to motivate generation. If we have this energy function, we can use it to find high-scoring xs, and here is how: we can do gradient descent on the energy. Talk about different applications of the energy function (MAP estimation, draw samples from this energy function, sample from the probability distribution).
  • #15: For a long time people have been pursuing this idea – emphasize the things that make it effective. Given samples from a distribution P, I would like to learn an energy function representing log P. We want to find θ to maximize P_θ(D), i.e., minimize -log P_θ(D). Do stuff step by step, add θ. List the data we have – have some smiling people. One approach to training models is to generate random noise and train the energy. Take a random example – call it a negative example – and try to make it better according to the energy function. Minimize the energy of smiling people, maximize the energy of hallucinations (sampling from the energy function). Have an illustration showing a set of data, where we want low energy at the data points and high energy everywhere else. Move the relation stuff here.
  • #16: Same note as #15.
  • #17: Same note as #15.
  • #18: Make a matrix
  • #19: Make it super clear that these energy functions are trained differently. Different examples: what if I have these two models that I have learned in different settings? Show iteratively adding stuff. Add "no additional training" somewhere.
  • #20: Same note as #19.
  • #21: Can draw the PDFs for the combinations. When I'm using generative models to represent distributions, I can do the following things: show 2D distributions.
  • #22: Same note as #21.
  • #23: Same note as #21.
  • #24: Make sure to explain the examples clearly – we trained some separate things, and then we used them to answer some question. Steps: 1) First we train a thing to generate. 2) Now here is a query – imagine a scene that satisfies this thing. 3) Then show the StyleGAN one. 4) Then show our result.
  • #25: Explain how you discover the right compositional structure
  • #26: Pose the problem – now we have an input and we want to discover the components; adopt a color scheme for things that you know and that you don't know. Technical challenges / show more methods: how to decompose the components. Decide on 3-5 points. Show half colored / half not colored. Now, here is another useful thing you can do – score things. In the previous picture the composition was at test time; this one is at training time.
  • #27: For every 2 slides, have a declarative sentence
  • #31: Same roadmap note as #9.
  • #32: If we wanted to train a robot policy to directly solve each of the possible skills we could encounter in the world, it would require a huge range of different possible demonstrations. We would need demonstrations illustrating how to flip bacon, for instance, across all the different possible configurations in a kitchen and the locations and positions of agents. In practice, this is pretty infeasible; demonstrations often only cover a subset of the space in which I would like to execute skills.
  • #33: One way to effectively use these demonstrations is to decompose them into different subcomponents: learn a world model or simulator of the world, and a separate model that specifies what the successful execution of a skill corresponds to – a behavioral classifier.
  • #34: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #35: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #36: Add note: can change this any time Say that non learnable things can be added also
  • #37: Add note: can change this any time Say that non learnable things can be added also
  • #38: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #39: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #40: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #41: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #42: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #43: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #44: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #45: We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #46: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #47: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #48: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #49: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #50: Make it different – how do you represent trajectories and value functions We may then generate from a smiling energy function by optimizing for an input image with low smiling energy, which we illustrate in the video on the left.
  • #51: Same roadmap note as #9.
  • #52: To effectively construct one large model to solve the decision making task, we would need a huge amount of data, covering all different combinations of semantic information, visual, physical and control knowledge. We need to
  • #53: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #54: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #55: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #56: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #57: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #58: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #59: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #61: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #62: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #63: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #64: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #65: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #66: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #67: Say the modality that is communicated between the models In contrast, in this paper, we propose to represent a single concept as an energy.
  • #68: Explain this slide better
  • #69: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #70: Explain this slide better
  • #73: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #74: In contrast, in this paper, we propose to represent a single concept as an energy.
  • #75: Same roadmap note as #9.