Scene Representation Networks:
Continuous 3D-Structure-Aware Neural Scene Representations
Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein
[Teaser: single image + camera pose + intrinsics → surface normals + novel views]
Observations: image + pose & intrinsics.
What can we learn about latent 3D scenes from observations?
Vision: Learn rich representations just by watching video!
Self-supervised Scene Representation Learning: from sets of observations to latent 3D scenes.
Self-supervised Scene Representation Learning
Pipeline: Observations → Model → Re-Rendered Observations, supervised with an image loss.
Neural Scene Representation: a persistent feature representation of the scene.
Neural Renderer: renders that representation from different camera perspectives.
2D baseline: Autoencoder
A conv encoder maps observations to a latent code; the code, concatenated with the output pose, is decoded by a conv decoder into the re-rendered observation, supervised with the image loss. The latent code is the scene representation; the output pose plus conv decoder is the renderer.
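A minimal PyTorch sketch of this 2D baseline (image resolution, layer sizes, and the pose encoding are illustrative assumptions, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class PoseConditionedAutoencoder(nn.Module):
    """2D baseline: conv encoder -> latent code; concat output pose; conv decoder."""
    def __init__(self, latent_dim=256, pose_dim=12):  # pose as flattened 3x4 [R|t], an assumption
        super().__init__()
        self.encoder = nn.Sequential(                 # 3x64x64 image -> latent code
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(256 * 8 * 8, latent_dim),
        )
        self.decoder = nn.Sequential(                 # latent code + pose -> 3x64x64 image
            nn.Linear(latent_dim + pose_dim, 256 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (256, 8, 8)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, output_pose):
        z = self.encoder(image)                       # latent code = scene representation
        return self.decoder(torch.cat([z, output_pose], dim=-1))
```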
Doesn't capture 3D properties of scenes.
Trained on ~2,500 ShapeNet cars with 50 observations each.
Need a 3D inductive bias!
Related Work
Two axes: 3D inductive bias / 3D structure, and self-supervision with posed images.
Scene Representation Learning (self-supervised, but no 3D structure): Tatarchenko et al. 2015; Worrall et al. 2017; Eslami et al. 2018; …
2D Generative Models (no 3D structure): Goodfellow et al. 2014; Kingma et al. 2013; Kingma et al. 2018; …
3D Computer Vision (3D structure, but supervised in 3D): Choy et al. 2016; Huang et al. 2018; Park et al. 2018; …
Voxel-based Representations (both, with caveats): Sitzmann et al. 2019; Lombardi et al. 2019; Nguyen-Phuoc et al. 2019; …
• Memory inefficient: 𝑂(𝑛³).
• Don't parameterize scene surfaces smoothly.
• Generalization is hard.
Scene Representation Networks
Same pipeline: Observations → Neural Scene Representation → Neural Renderer → Re-Rendered Observations, supervised with an image loss.
[Toy scene: objects (circle, square, triangle) in free space, coordinates (𝑥₁, 𝑥₂)]
Model the scene as a function Φ that maps coordinates to features:
Φ: ℝ³ → ℝⁿ
Every coordinate 𝒙 inside an object maps to a feature vector describing local scene properties ("blue triangle", "red square"); every coordinate in free space maps to a "free space" feature.
A Scene Representation Network parameterizes Φ as an MLP:
Φ: ℝ³ → ℝⁿ
The scene is implicitly encoded in the weights of the network.
• Can sample anywhere, at arbitrary resolutions.
• Parameterizes scene surfaces smoothly.
• Memory scales with scene complexity, not spatial resolution.
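For concreteness, a minimal sketch of Φ as an MLP (width, depth, and activation are assumptions; see the paper for the exact architecture):

```python
import torch
import torch.nn as nn

class SceneRepresentationNetwork(nn.Module):
    """Phi: R^3 -> R^n, a fully connected network mapping a 3D coordinate to a
    feature vector describing local scene properties. The scene is implicitly
    encoded in the network's weights."""
    def __init__(self, feature_dim=256, hidden_dim=256, num_layers=4):
        super().__init__()
        layers = [nn.Linear(3, hidden_dim), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers.append(nn.Linear(hidden_dim, feature_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (..., 3) world coordinates
        return self.net(x)       # (..., feature_dim) features

phi = SceneRepresentationNetwork()
features = phi(torch.rand(1024, 3))   # sample anywhere, at arbitrary resolution
```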
We now have the scene representation; we still require a neural renderer to supervise it with observations only.
Neural Renderer Step 1: Intersection Testing
Place a camera in the toy scene; for each camera ray, find the point where it intersects scene geometry.
Idea: march along the ray until we arrive at a surface.
Start at a point 𝐱₀ on the ray, close to the camera. At each step, sample the scene representation Φ: ℝ³ → ℝⁿ at the current world coordinates 𝐱ᵢ, yielding a feature vector 𝒗ᵢ.
A Ray Marching LSTM translates 𝒗ᵢ into a step length 𝛿ᵢ₊₁, which advances the ray to 𝐱ᵢ₊₁.
One feasible step length, defined at every coordinate: the distance to the closest scene surface.
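A sketch of one marching step (the hidden size and the linear projection from LSTM state to step length are assumptions):

```python
import torch
import torch.nn as nn

class RayMarchingLSTM(nn.Module):
    """Translates the feature at the current ray point into a step length:
    (delta_{i+1}, h_{i+1}, c_{i+1}) = LSTM(v_i, h_i, c_i)."""
    def __init__(self, feature_dim=256, hidden_dim=16):
        super().__init__()
        self.cell = nn.LSTMCell(feature_dim, hidden_dim)
        self.to_step = nn.Linear(hidden_dim, 1)   # scalar step length per ray

    def forward(self, v, state=None):
        h, c = self.cell(v, state)    # v: (num_rays, feature_dim)
        delta = self.to_step(h)       # (num_rays, 1) predicted step lengths
        return delta, (h, c)

# One marching step, given points x, unit ray directions dirs, and the SRN phi:
#   delta, state = lstm(phi(x), state)
#   x = x + delta * dirs
```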
[Animation: ray-marching iterations 0, 1, 2, 3, … until each ray arrives at the surface]
Neural Renderer Step 2: Color Generation
Sample the scene representation Φ: ℝ³ → ℝⁿ a final time at the intersection estimates and translate the features into colors with a color MLP.
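A sketch of this step (sizes are assumptions; since the MLP acts on each pixel independently, it can equivalently be written as 1 × 1 convolutions):

```python
import torch
import torch.nn as nn

class PixelGenerator(nn.Module):
    """Maps the feature at each ray's final intersection estimate to an RGB color."""
    def __init__(self, feature_dim=256, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, v):    # v: (num_rays, feature_dim) features at intersections
        return self.net(v)   # (num_rays, 3) colors
```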
Can now train end-to-end with posed images only!
Observations → Neural Scene Representation (Φ: ℝ³ → ℝⁿ) → Neural Renderer → Re-Rendered Observations, with an image loss.
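Assembled into a hypothetical training step (ray generation from pose and intrinsics is elided; `phi`, `lstm`, and `pixel_gen` are the modules sketched above; the initial depth and step count are assumptions):

```python
import torch

def render(phi, lstm, pixel_gen, origins, dirs, d0=0.05, num_steps=10):
    """Differentiable rendering: march each ray, then color the final point."""
    d = torch.full((origins.shape[0], 1), d0)    # per-ray depth along the ray
    state = None
    for _ in range(num_steps):
        x = origins + d * dirs                   # current world coordinates
        delta, state = lstm(phi(x), state)       # LSTM predicts the step length
        d = d + delta                            # depth update
    return pixel_gen(phi(origins + d * dirs))    # colors at the final points

# One optimization step against a posed observation:
#   colors = render(phi, lstm, pixel_gen, ray_origins, ray_dirs)
#   loss = torch.nn.functional.mse_loss(colors, observed_pixels)  # image loss
#   loss.backward(); optimizer.step()
```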
Generalizing across a class of scenes
Each scene is represented by its own SRN, with parameters 𝜙ᵢ ∈ ℝˡ.
Assume the 𝜙ᵢ live on a 𝑘-dimensional subspace of ℝˡ, 𝑘 < 𝑙.
Represent each scene with a low-dimensional embedding 𝑧ᵢ ∈ ℝᵏ.
A hypernetwork maps embeddings to SRN parameters:
Ψ: ℝᵏ → ℝˡ, 𝑧ᵢ ↦ Ψ(𝑧ᵢ) = 𝜙ᵢ
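A minimal hypernetwork sketch; for illustration it generates the weights of a single linear layer of Φ, whereas the full model generates all of Φ's parameters (dimensions are assumptions):

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Psi: R^k -> R^l, mapping a scene embedding z_i to SRN parameters phi_i
    (here: the weight and bias of one hidden layer of Phi)."""
    def __init__(self, embedding_dim=128, in_dim=256, out_dim=256):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.net = nn.Linear(embedding_dim, out_dim * in_dim + out_dim)

    def forward(self, z, x):
        params = self.net(z)                       # flat parameter vector phi_i
        w = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = params[self.out_dim * self.in_dim:]
        return torch.relu(x @ w.T + b)             # apply the generated layer

z = torch.randn(128)        # embedding of one scene
x = torch.randn(1024, 256)  # hidden activations for 1024 sampled points
out = HyperNetwork()(z, x)  # (1024, 256)
```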
Results
Novel View Synthesis – Baseline Comparison
ShapeNet v2 – single-shot reconstruction of objects in a held-out test set.
Methods: SRNs (Ours); Tatarchenko et al. 2015; deterministic GQN, adapted (Eslami et al. 2018); Worrall et al. 2017.
Training: ShapeNet cars / chairs, 50 observations per object.
Testing: cars / chairs from the unseen test set, from a single observation (input pose shown).
Novel View Synthesis – SRN Output
ShapeNet v2 – single-shot reconstruction of objects in the held-out test set (input pose shown).
Sampling at arbitrary resolutions
32×32, 64×64, 128×128, 256×256, 512×512 (surface normals and RGB).
Generalization to unseen camera poses
Camera roll and camera close-up: SRNs vs. Tatarchenko et al., which doesn't reconstruct geometry.
Latent code interpolation
Surface normals and RGB.
SRNs can represent room-scale scenes, but they aren't compositional.
Training-set novel-view synthesis on GQN rooms (Eslami et al. 2018) with ShapeNet cars, 50 observations.
Work in progress: compositional SRNs generalize to unseen numbers of objects!
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations
Interpolation – single-shot reconstruction – camera pose extrapolation.
Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein
Find me at Poster #71!
Looking for research positions in scene representation learning.
@vincesitzmann – vsitzmann.github.io
SRNs have three main parts: the scene representation Φ: ℝ³ → ℝⁿ, a differentiable ray marcher, and a pixel generator.
Neural Renderer Θ (for all 𝑤 × ℎ pixels (𝑢, 𝑣)):
Differentiable ray marching, for 𝑛 iteration steps:
• Ray layer: 𝐱ᵢ = 𝐫ᵤ,ᵥ(𝑑ᵢ), given camera pose [𝐑, 𝐭] and intrinsics 𝐊; initialized with depth 𝑑₀ and LSTM state 𝐡₀, 𝐜₀.
• Sample Φ at the world coordinates 𝐱ᵢ to obtain features 𝐯ᵢ.
• Ray Marching LSTM: (𝛿ᵢ₊₁, 𝐡ᵢ₊₁, 𝐜ᵢ₊₁) = LSTM(𝐯ᵢ, 𝐡ᵢ, 𝐜ᵢ).
• Depth update: 𝑑ᵢ₊₁ = 𝑑ᵢ + 𝛿ᵢ₊₁.
Output rendering: the features 𝐯ₙ at the final world coordinates are mapped to pixels by the pixel generator (1 × 1 conv).
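For the ray layer, a sketch under common pinhole-camera conventions ([𝐑, 𝐭] taken as camera-to-world; the paper's exact conventions may differ):

```python
import torch

def ray_layer(u, v, d, K, R, t):
    """r_{u,v}(d): world coordinates of pixel (u, v) at depth d along its ray.
    K: (3, 3) intrinsics; R: (3, 3), t: (3,) camera-to-world pose; u, v, d: (num_rays,)."""
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)  # homogeneous pixel coords
    dirs_cam = pix @ torch.inverse(K).T                    # back-project through intrinsics
    dirs_world = dirs_cam @ R.T                            # rotate into the world frame
    return t + d.unsqueeze(-1) * dirs_world                # x_i = r_{u,v}(d_i)
```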
Intuition: the hypernetwork learns priors over scene properties.
Conditioned on an embedding 𝑧, the scene representation Φ: ℝ³ → ℝⁿ maps world coordinates 𝒙 to a feature vector 𝒗.
Novel View Synthesis – Baseline Comparison
ShapeNet v2 cars – training-set objects.
Methods: SRNs; Tatarchenko et al.; deterministic GQN; Worrall et al.
Training: 2,434 cars, 50 observations each.
Testing: the same 2,434 cars, 250 novel views rendered in an Archimedean spiral around each object.
Novel-View Synthesis – SRN Output
ShapeNet v2 cars – training-set objects.
Novel View Synthesis – Baseline Comparison
ShapeNet v2 chairs – training-set objects.
Methods: SRNs; Tatarchenko et al.; deterministic GQN; Worrall et al.
Training: 4,612 chairs, 50 observations each.
Testing: the same 4,612 chairs, 250 novel views rendered in an Archimedean spiral around each object.
Novel-View Synthesis – SRN Output
ShapeNet v2 chairs – training-set objects (SRN normal map, SRN output, ground truth).
A voxel grid discretizes Φ: ℝ³ → ℝⁿ. It maps discrete coordinates to feature vectors, so Φ(𝒙) is the feature stored in the voxel containing 𝒙 (including a "free space" feature for empty voxels).
• Memory inefficient: 𝑂(𝑛³).
• Doesn't parameterize scene surfaces smoothly.
• Generalization is hard.
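For contrast, a sketch of the discrete lookup a voxel grid performs (resolution and scene bounds are illustrative): memory is 𝑂(𝑛³) in the grid resolution 𝑛 regardless of scene complexity, and the output is piecewise constant rather than smooth.

```python
import torch

n, feature_dim = 64, 32
grid = torch.zeros(n, n, n, feature_dim)   # O(n^3) memory, even for empty space

def phi_voxel(x, bounds=1.0):
    """Nearest-neighbor voxel lookup for x in [-bounds, bounds]^3.
    Piecewise constant, so surfaces are not parameterized smoothly."""
    idx = ((x + bounds) / (2 * bounds) * (n - 1)).round().long().clamp(0, n - 1)
    return grid[idx[..., 0], idx[..., 1], idx[..., 2]]

features = phi_voxel(torch.rand(1024, 3) * 2 - 1)   # (1024, feature_dim)
```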
Surfaces are implicitly defined as the 2D subspace where the feature output of Φ: ℝ³ → ℝⁿ (the Scene Representation Network) changes.
But alas: knowing only that features change from "free space" to "object" at a surface, where along a camera ray does Φ: ℝ³ → ℝⁿ make that transition? We cannot project explicit surface points – we can only query the network.
Signed distance functions as a special case of SRNs
Φ: ℝ³ → ℝ¹ maps every (x, y, z) coordinate to the Euclidean distance to the closest surface point.
From computer graphics: sphere tracing
With a signed distance function Φ: ℝ³ → ℝ¹:
• Step along the ray.
• Step length is equal to the signed distance at the current point.
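A sketch of classic sphere tracing against an analytic SDF (a unit sphere stands in for the scene; the iteration count and convergence threshold are assumptions):

```python
import torch

def sdf_sphere(x, radius=1.0):
    """Phi: R^3 -> R^1, signed distance to a sphere at the origin."""
    return x.norm(dim=-1, keepdim=True) - radius

def sphere_trace(origins, dirs, sdf, num_steps=50, eps=1e-4):
    """Step along each ray; the SDF value is always a safe step length."""
    d = torch.zeros(origins.shape[0], 1)        # depth along each ray
    for _ in range(num_steps):
        dist = sdf(origins + d * dirs)          # distance to the closest surface
        d = d + dist                            # step exactly that far
        if dist.abs().max() < eps:              # all rays have converged
            break
    return origins + d * dirs                   # intersection estimates

origins = torch.tensor([[0.0, 0.0, -3.0]])
dirs = torch.tensor([[0.0, 0.0, 1.0]])
print(sphere_trace(origins, dirs, sdf_sphere))  # ~[[0, 0, -1]]
```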
Neural Ray Marcher
Begin at a point 𝐱₀ on the ray, close to the camera.
Sample the SRN: the scene representation Φ: ℝ³ → ℝⁿ maps the current world coordinates 𝐱ᵢ to a feature vector 𝒗ᵢ.
Predict the step length with the Ray Marching LSTM: (𝛿ᵢ₊₁, 𝐡ᵢ₊₁, 𝐜ᵢ₊₁) = LSTM(𝐯ᵢ, 𝐡ᵢ, 𝐜ᵢ).
Update the intersection estimate: advance along the ray by 𝛿ᵢ₊₁ to 𝐱ᵢ₊₁.
Iterate a fixed number of times – differentiable ray marching.
Sample the SRN one last time and translate features to colors with an MLP: the pixel generator (1 × 1 conv), giving the final map to ℝ³ colors.
Can now train end-to-end with posed images only: Neural Scene Representation (Φ: ℝ³ → ℝⁿ) + Neural Renderer, supervised with a 2D re-rendering loss.
Neural Renderer Step 1: Intersection Testing – stopping at the surface.
Upon arriving at a surface, the Ray Marching LSTM can output 𝛿ᵢ₊₁ = 0, so that 𝐱ᵢ₊₁ = 𝐱ᵢ and the march stops advancing.
Novel View Synthesis – Baseline Comparison
Shepard-Metzler objects: SRNs vs. deterministic GQN (Eslami et al., 2018), comparing reconstructed appearance and reconstructed geometry; the GQN baseline doesn't reconstruct geometry.
Training: 1,000 Shepard-Metzler objects, 15 observations each.
Testing: the same 1,000 Shepard-Metzler objects, 250 novel views rendered in an Archimedean spiral around each object.
Novel View Synthesis – SRN Output
Shepard-Metzler Objects
Failure Cases
Small gaps; fine detail; out-of-distribution samples at test time; unseen object positions / object counts in rooms.
Editor's Notes
  • #2: Continuous, 3D-structure-aware neural scene representation.
  • #3: Single observation of object → 3D neural scene representation, enabling multi-view consistent novel view synthesis & full shape reconstruction.
  • #4: Let me formalize. We are given a set of 3D scenes – for now, simple scenes with just a single object. For each scene, we receive a set of observations. One observation = image, camera pose, camera intrinsics. Question: What can we learn about latent 3D scenes by looking at observations? Vision: Learn a rich representation of our world, just by watching video. (I refer to this problem as self-supervised scene representation learning: self-supervised, because we use only observations, and we're trying to learn a scene representation.)
  • #5: Our model is going to take as input a set of observations – process them – and produce as output re-renderings of these same observations, allowing us to supervise the model with a 2D image loss.
  • #6: First, infer the neural scene representation: a feature representation that accumulates all information from the observations.
  • #7: The neural renderer then renders observations.
  • #8: Let me convince you this problem is hard. Let's build a 2D baseline: an autoencoder that maps to a latent code. Concatenate the latent code with the pose we want to render, and decode into the observation.
  • #9: Latent code = scene representation. Output pose + conv decoder = renderer.
  • #10: Here's what you get training on ShapeNet cars, 50 observations each. It fails to discover latent 3D structure. We argue: need a 3D inductive bias!
  • #11: Let's look at related work along two dimensions: do models have a 3D inductive bias, and are they self-supervised only with images? Scene representation learning, such as Neural Scene Representation and Rendering, is self-supervised with images, but lacks strong 3D structure. Similarly, 2D generative models lack 3D structure. 3D computer vision deals with reconstruction of object geometry; it generally has 3D structure – such as signed distance functions – but needs supervision in 3D. (DeepSDF and occupancy networks were a huge inspiration!) Voxel-based neural scene representations, such as our own work – DeepVoxels – or HoloGAN, seemingly check both boxes, but have their own problems: voxel grids scale poorly, don't parameterize scene surfaces smoothly, and generalization is hard as a result.
  • #12: We’re now ready to formulate scene representation networks.
  • #13: We’ll start with the scene representation itself.
  • #14: Let's consider this simple scene. The grey background represents free space. There are three objects – a circle, a square, and a triangle – and we have defined a coordinate system with coordinate variable x.
  • #15: We propose to model scenes as functions that map 3D coordinates to feature representations of local scene properties. For instance, all coordinates within the blue triangle should be mapped to a feature representation that contains “triangle” and “blue”. Similarly, all points in the square should be mapped to “square, red”. Lastly, all the points in free space should be mapped to some feature that says “free space”.
  • #16: Of course, that function is generally intractable. We thus propose to parameterize this function as a fully connected neural network. This is a Scene Representation Network: A fully connected neural network that maps a 3D coordinate to a feature representation of what is at that 3D coordinate. The scene is thus implicitly encoded in the weights of the network.
  • #17: Has some neat properties. First, it enforces a 3D inductive bias, because it lives in 3D. Second, it can be sampled anywhere and at arbitrary resolutions: a continuous function defined on ℝ³. It parameterizes the scene smoothly, with no discretization. Memory scales with scene complexity, not with spatial resolution as with voxel grids.
  • #18: We now have the scene representation.
  • #19: We still require a renderer in order to supervise it only with observations.
  • #20: Let’s go back to our toy scene.
  • #22: We now place a camera in our scene, and we want to render the image that is observed here.
  • #23: We first have to find the points where camera rays intersect scene geometry. Please note that this is not trivial because we do not have access to explicit surface points which we could project to the image plane – we can instead only query our scene representation network. We will solve this via ray marching: stepping along the ray until we find a surface. Naively, we could pick a small constant step size – but can we do better?
  • #24: Let’s march along this ray. We start at a point x_0 close to the camera. At every step, we sample our scene representation at the current point. This yields a feature vector that describes local scene properties.
  • #25: We now use a neural network to translate that feature vector into a step length, and then advance along the ray according to this step length. What could the feature vector possibly encode about the correct step length? One feasible step size that is defined for every coordinate in this scene is the distance to the closest scene surface, here visualized with a circle that touches the red square. That specific choice of step length defines the so-called sphere tracing algorithm developed for rendering signed distance fields. Intuitively, we’re proposing a generalization of sphere tracing – please check the paper for details on that.
  • #26–#31: We have thus found the intersections of each ray with scene geometry.
  • #32: The last iterate is our intersection estimate.
  • #33: To render an image, we sample the scene representation a final time at these intersections. We then translate the features at these locations into colors via another fully connected network.
  • #34: We have now assembled the full pipeline and can train a scene representation network given only a set of observations.
  • #41: Let's begin with a baseline comparison to other scene representation learning methods. We're comparing to recent scene representation learning models, including a deterministic implementation of the Generative Query Network by Eslami et al. We're training on ShapeNet chairs and cars, with 50 observations per object. We're reconstructing objects from a held-out test set *from a single observation*. You can see these observations in the lower right-hand corner. As you can see, the proposed SRNs at the bottom are the only model that succeeds at discovering the underlying 3D geometry.
  • #42: Here’s some more!
  • #71: We iterate this algorithm a fixed number of times. When we arrive at the surface, the feature can encode a step size of zero to avoid us stepping any further.