Scene Representation Networks:
Continuous 3D-Structure-Aware Neural Scene Representations
Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein
[Teaser: single image + camera pose + intrinsics → surface normals + novel views]
Observations: image + pose & intrinsics.
What can we learn about latent 3D scenes from observations?
Vision: Learn rich representations just by watching video!
Self-supervised Scene Representation Learning: from sets of observations to latent 3D scenes.
Self-supervised Scene Representation Learning
Pipeline: Observations → Model → Re-Rendered Observations, supervised with an image loss.
Neural Scene Representation: a persistent feature representation of the scene.
Neural Renderer: renders that representation from different camera perspectives.
2D baseline: Autoencoder
A conv encoder maps observations to a latent code; the code, concatenated with the output pose, is decoded by a conv decoder into the re-rendered observation, supervised with the image loss. The latent code is the scene representation; the output pose plus conv decoder is the renderer.
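A minimal PyTorch sketch of this 2D baseline (image resolution, layer sizes, and the pose encoding are illustrative assumptions, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class PoseConditionedAutoencoder(nn.Module):
    """2D baseline: conv encoder -> latent code; concat output pose; conv decoder."""
    def __init__(self, latent_dim=256, pose_dim=12):  # pose as flattened 3x4 [R|t], an assumption
        super().__init__()
        self.encoder = nn.Sequential(                 # 3x64x64 image -> latent code
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(256 * 8 * 8, latent_dim),
        )
        self.decoder = nn.Sequential(                 # latent code + pose -> 3x64x64 image
            nn.Linear(latent_dim + pose_dim, 256 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (256, 8, 8)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, output_pose):
        z = self.encoder(image)                       # latent code = scene representation
        return self.decoder(torch.cat([z, output_pose], dim=-1))
```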
Doesn't capture 3D properties of scenes.
Trained on ~2,500 ShapeNet cars with 50 observations each.
Need a 3D inductive bias!
Related Work
Two axes: 3D inductive bias / 3D structure, and self-supervision with posed images.
Scene Representation Learning (self-supervised, but no 3D structure): Tatarchenko et al. 2015; Worrall et al. 2017; Eslami et al. 2018; …
2D Generative Models (no 3D structure): Goodfellow et al. 2014; Kingma et al. 2013; Kingma et al. 2018; …
3D Computer Vision (3D structure, but supervised in 3D): Choy et al. 2016; Huang et al. 2018; Park et al. 2018; …
Voxel-based Representations (both, with caveats): Sitzmann et al. 2019; Lombardi et al. 2019; Nguyen-Phuoc et al. 2019; …
• Memory inefficient: 𝑂(𝑛³).
• Don't parameterize scene surfaces smoothly.
• Generalization is hard.
Scene Representation Networks
Same pipeline: Observations → Neural Scene Representation → Neural Renderer → Re-Rendered Observations, supervised with an image loss.
[Toy scene: objects (circle, square, triangle) in free space, coordinates (𝑥₁, 𝑥₂)]
Model the scene as a function Φ that maps coordinates to features:
Φ: ℝ³ → ℝⁿ
Every coordinate 𝒙 inside an object maps to a feature vector describing local scene properties ("blue triangle", "red square"); every coordinate in free space maps to a "free space" feature.
A Scene Representation Network parameterizes Φ as an MLP:
Φ: ℝ³ → ℝⁿ
The scene is implicitly encoded in the weights of the network.
• Can sample anywhere, at arbitrary resolutions.
• Parameterizes scene surfaces smoothly.
• Memory scales with scene complexity, not spatial resolution.
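For concreteness, a minimal sketch of Φ as an MLP (width, depth, and activation are assumptions; see the paper for the exact architecture):

```python
import torch
import torch.nn as nn

class SceneRepresentationNetwork(nn.Module):
    """Phi: R^3 -> R^n, a fully connected network mapping a 3D coordinate to a
    feature vector describing local scene properties. The scene is implicitly
    encoded in the network's weights."""
    def __init__(self, feature_dim=256, hidden_dim=256, num_layers=4):
        super().__init__()
        layers = [nn.Linear(3, hidden_dim), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers.append(nn.Linear(hidden_dim, feature_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (..., 3) world coordinates
        return self.net(x)       # (..., feature_dim) features

phi = SceneRepresentationNetwork()
features = phi(torch.rand(1024, 3))   # sample anywhere, at arbitrary resolution
```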
We now have the scene representation; we still require a neural renderer to supervise it with observations only.
Neural Renderer Step 1: Intersection Testing
Place a camera in the toy scene; for each camera ray, find the point where it intersects scene geometry.
Idea: march along the ray until we arrive at a surface.
Start at a point 𝐱₀ on the ray, close to the camera. At each step, sample the scene representation Φ: ℝ³ → ℝⁿ at the current world coordinates 𝐱ᵢ, yielding a feature vector 𝒗ᵢ.
A Ray Marching LSTM translates 𝒗ᵢ into a step length 𝛿ᵢ₊₁, which advances the ray to 𝐱ᵢ₊₁.
One feasible step length, defined at every coordinate: the distance to the closest scene surface.
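A sketch of one marching step (the hidden size and the linear projection from LSTM state to step length are assumptions):

```python
import torch
import torch.nn as nn

class RayMarchingLSTM(nn.Module):
    """Translates the feature at the current ray point into a step length:
    (delta_{i+1}, h_{i+1}, c_{i+1}) = LSTM(v_i, h_i, c_i)."""
    def __init__(self, feature_dim=256, hidden_dim=16):
        super().__init__()
        self.cell = nn.LSTMCell(feature_dim, hidden_dim)
        self.to_step = nn.Linear(hidden_dim, 1)   # scalar step length per ray

    def forward(self, v, state=None):
        h, c = self.cell(v, state)    # v: (num_rays, feature_dim)
        delta = self.to_step(h)       # (num_rays, 1) predicted step lengths
        return delta, (h, c)

# One marching step, given points x, unit ray directions dirs, and the SRN phi:
#   delta, state = lstm(phi(x), state)
#   x = x + delta * dirs
```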
[Animation: ray-marching iterations 0, 1, 2, 3, … until each ray arrives at the surface]
Neural Renderer Step 2: Color Generation
Sample the scene representation Φ: ℝ³ → ℝⁿ a final time at the intersection estimates and translate the features into colors with a color MLP.
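A sketch of this step (sizes are assumptions; since the MLP acts on each pixel independently, it can equivalently be written as 1 × 1 convolutions):

```python
import torch
import torch.nn as nn

class PixelGenerator(nn.Module):
    """Maps the feature at each ray's final intersection estimate to an RGB color."""
    def __init__(self, feature_dim=256, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, v):    # v: (num_rays, feature_dim) features at intersections
        return self.net(v)   # (num_rays, 3) colors
```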
Can now train end-to-end with posed images only!
Observations → Neural Scene Representation (Φ: ℝ³ → ℝⁿ) → Neural Renderer → Re-Rendered Observations, with an image loss.
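Assembled into a hypothetical training step (ray generation from pose and intrinsics is elided; `phi`, `lstm`, and `pixel_gen` are the modules sketched above; the initial depth and step count are assumptions):

```python
import torch

def render(phi, lstm, pixel_gen, origins, dirs, d0=0.05, num_steps=10):
    """Differentiable rendering: march each ray, then color the final point."""
    d = torch.full((origins.shape[0], 1), d0)    # per-ray depth along the ray
    state = None
    for _ in range(num_steps):
        x = origins + d * dirs                   # current world coordinates
        delta, state = lstm(phi(x), state)       # LSTM predicts the step length
        d = d + delta                            # depth update
    return pixel_gen(phi(origins + d * dirs))    # colors at the final points

# One optimization step against a posed observation:
#   colors = render(phi, lstm, pixel_gen, ray_origins, ray_dirs)
#   loss = torch.nn.functional.mse_loss(colors, observed_pixels)  # image loss
#   loss.backward(); optimizer.step()
```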
Generalizing across a class of scenes
Each scene is represented by its own SRN, with parameters 𝜙ᵢ ∈ ℝˡ.
Assume the 𝜙ᵢ live on a 𝑘-dimensional subspace of ℝˡ, 𝑘 < 𝑙.
Represent each scene with a low-dimensional embedding 𝑧ᵢ ∈ ℝᵏ.
A hypernetwork maps embeddings to SRN parameters:
Ψ: ℝᵏ → ℝˡ, 𝑧ᵢ ↦ Ψ(𝑧ᵢ) = 𝜙ᵢ
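A minimal hypernetwork sketch; for illustration it generates the weights of a single linear layer of Φ, whereas the full model generates all of Φ's parameters (dimensions are assumptions):

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Psi: R^k -> R^l, mapping a scene embedding z_i to SRN parameters phi_i
    (here: the weight and bias of one hidden layer of Phi)."""
    def __init__(self, embedding_dim=128, in_dim=256, out_dim=256):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.net = nn.Linear(embedding_dim, out_dim * in_dim + out_dim)

    def forward(self, z, x):
        params = self.net(z)                       # flat parameter vector phi_i
        w = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = params[self.out_dim * self.in_dim:]
        return torch.relu(x @ w.T + b)             # apply the generated layer

z = torch.randn(128)        # embedding of one scene
x = torch.randn(1024, 256)  # hidden activations for 1024 sampled points
out = HyperNetwork()(z, x)  # (1024, 256)
```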
Results
Novel View Synthesis – Baseline Comparison
ShapeNet v2 – single-shot reconstruction of objects in a held-out test set.
Methods: SRNs (Ours); Tatarchenko et al. 2015; deterministic GQN, adapted (Eslami et al. 2018); Worrall et al. 2017.
Training: ShapeNet cars / chairs, 50 observations per object.
Testing: cars / chairs from the unseen test set, from a single observation (input pose shown).
Novel View Synthesis – SRN Output
ShapeNet v2 – single-shot reconstruction of objects in the held-out test set (input pose shown).
Sampling at arbitrary resolutions
32×32, 64×64, 128×128, 256×256, 512×512 (surface normals and RGB).
Generalization to unseen camera poses
Camera roll and camera close-up: SRNs vs. Tatarchenko et al., which doesn't reconstruct geometry.
Latent code interpolation
Surface normals and RGB.
SRNs can represent room-scale scenes, but they aren't compositional.
Training-set novel-view synthesis on GQN rooms (Eslami et al. 2018) with ShapeNet cars, 50 observations.
Work in progress: compositional SRNs generalize to unseen numbers of objects!
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations
Interpolation – single-shot reconstruction – camera pose extrapolation.
Vincent Sitzmann, Michael Zollhöfer, Gordon Wetzstein
Find me at Poster #71!
Looking for research positions in scene representation learning.
@vincesitzmann – vsitzmann.github.io
SRNs have three main parts: the scene representation Φ: ℝ³ → ℝⁿ, a differentiable ray marcher, and a pixel generator.
Neural Renderer Θ (for all 𝑤 × ℎ pixels (𝑢, 𝑣)):
Differentiable ray marching, for 𝑛 iteration steps:
• Ray layer: 𝐱ᵢ = 𝐫ᵤ,ᵥ(𝑑ᵢ), given camera pose [𝐑, 𝐭] and intrinsics 𝐊; initialized with depth 𝑑₀ and LSTM state 𝐡₀, 𝐜₀.
• Sample Φ at the world coordinates 𝐱ᵢ to obtain features 𝐯ᵢ.
• Ray Marching LSTM: (𝛿ᵢ₊₁, 𝐡ᵢ₊₁, 𝐜ᵢ₊₁) = LSTM(𝐯ᵢ, 𝐡ᵢ, 𝐜ᵢ).
• Depth update: 𝑑ᵢ₊₁ = 𝑑ᵢ + 𝛿ᵢ₊₁.
Output rendering: the features 𝐯ₙ at the final world coordinates are mapped to pixels by the pixel generator (1 × 1 conv).
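For the ray layer, a sketch under common pinhole-camera conventions ([𝐑, 𝐭] taken as camera-to-world; the paper's exact conventions may differ):

```python
import torch

def ray_layer(u, v, d, K, R, t):
    """r_{u,v}(d): world coordinates of pixel (u, v) at depth d along its ray.
    K: (3, 3) intrinsics; R: (3, 3), t: (3,) camera-to-world pose; u, v, d: (num_rays,)."""
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)  # homogeneous pixel coords
    dirs_cam = pix @ torch.inverse(K).T                    # back-project through intrinsics
    dirs_world = dirs_cam @ R.T                            # rotate into the world frame
    return t + d.unsqueeze(-1) * dirs_world                # x_i = r_{u,v}(d_i)
```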
Intuition: the hypernetwork learns priors over scene properties.
Conditioned on an embedding 𝑧, the scene representation Φ: ℝ³ → ℝⁿ maps world coordinates 𝒙 to a feature vector 𝒗.
Novel View Synthesis – Baseline Comparison
ShapeNet v2 cars – training-set objects.
Methods: SRNs; Tatarchenko et al.; deterministic GQN; Worrall et al.
Training: 2,434 cars, 50 observations each.
Testing: the same 2,434 cars, 250 novel views rendered in an Archimedean spiral around each object.
Novel-View Synthesis – SRN Output
ShapeNet v2 cars – training-set objects.
Novel View Synthesis – Baseline Comparison
ShapeNet v2 chairs – training-set objects.
Methods: SRNs; Tatarchenko et al.; deterministic GQN; Worrall et al.
Training: 4,612 chairs, 50 observations each.
Testing: the same 4,612 chairs, 250 novel views rendered in an Archimedean spiral around each object.
Novel-View Synthesis – SRN Output
ShapeNet v2 chairs – training-set objects (SRN normal map, SRN output, ground truth).
A voxel grid discretizes Φ: ℝ³ → ℝⁿ. It maps discrete coordinates to feature vectors, so Φ(𝒙) is the feature stored in the voxel containing 𝒙 (including a "free space" feature for empty voxels).
• Memory inefficient: 𝑂(𝑛³).
• Doesn't parameterize scene surfaces smoothly.
• Generalization is hard.
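For contrast, a sketch of the discrete lookup a voxel grid performs (resolution and scene bounds are illustrative): memory is 𝑂(𝑛³) in the grid resolution 𝑛 regardless of scene complexity, and the output is piecewise constant rather than smooth.

```python
import torch

n, feature_dim = 64, 32
grid = torch.zeros(n, n, n, feature_dim)   # O(n^3) memory, even for empty space

def phi_voxel(x, bounds=1.0):
    """Nearest-neighbor voxel lookup for x in [-bounds, bounds]^3.
    Piecewise constant, so surfaces are not parameterized smoothly."""
    idx = ((x + bounds) / (2 * bounds) * (n - 1)).round().long().clamp(0, n - 1)
    return grid[idx[..., 0], idx[..., 1], idx[..., 2]]

features = phi_voxel(torch.rand(1024, 3) * 2 - 1)   # (1024, feature_dim)
```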
Surfaces are implicitly defined as the 2D subspace where the feature output of Φ: ℝ³ → ℝⁿ (the Scene Representation Network) changes.
But alas: knowing only that features change from "free space" to "object" at a surface, where along a camera ray does Φ: ℝ³ → ℝⁿ make that transition? We cannot project explicit surface points – we can only query the network.
Signed distance functions as a special case of SRNs
Φ: ℝ³ → ℝ¹ maps every (x, y, z) coordinate to the Euclidean distance to the closest surface point.
From computer graphics: sphere tracing
With a signed distance function Φ: ℝ³ → ℝ¹:
• Step along the ray.
• Step length is equal to the signed distance at the current point.
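A sketch of classic sphere tracing against an analytic SDF (a unit sphere stands in for the scene; the iteration count and convergence threshold are assumptions):

```python
import torch

def sdf_sphere(x, radius=1.0):
    """Phi: R^3 -> R^1, signed distance to a sphere at the origin."""
    return x.norm(dim=-1, keepdim=True) - radius

def sphere_trace(origins, dirs, sdf, num_steps=50, eps=1e-4):
    """Step along each ray; the SDF value is always a safe step length."""
    d = torch.zeros(origins.shape[0], 1)        # depth along each ray
    for _ in range(num_steps):
        dist = sdf(origins + d * dirs)          # distance to the closest surface
        d = d + dist                            # step exactly that far
        if dist.abs().max() < eps:              # all rays have converged
            break
    return origins + d * dirs                   # intersection estimates

origins = torch.tensor([[0.0, 0.0, -3.0]])
dirs = torch.tensor([[0.0, 0.0, 1.0]])
print(sphere_trace(origins, dirs, sdf_sphere))  # ~[[0, 0, -1]]
```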
Neural Ray Marcher
Begin at a point 𝐱₀ on the ray, close to the camera.
Sample the SRN: the scene representation Φ: ℝ³ → ℝⁿ maps the current world coordinates 𝐱ᵢ to a feature vector 𝒗ᵢ.
Predict the step length with the Ray Marching LSTM: (𝛿ᵢ₊₁, 𝐡ᵢ₊₁, 𝐜ᵢ₊₁) = LSTM(𝐯ᵢ, 𝐡ᵢ, 𝐜ᵢ).
Update the intersection estimate: advance along the ray by 𝛿ᵢ₊₁ to 𝐱ᵢ₊₁.
Iterate a fixed number of times – differentiable ray marching.
Sample the SRN one last time and translate features to colors with an MLP: the pixel generator (1 × 1 conv), giving the final map to ℝ³ colors.
Can now train end-to-end with posed images only: Neural Scene Representation (Φ: ℝ³ → ℝⁿ) + Neural Renderer, supervised with a 2D re-rendering loss.
Neural Renderer Step 1: Intersection Testing – stopping at the surface.
Upon arriving at a surface, the Ray Marching LSTM can output 𝛿ᵢ₊₁ = 0, so that 𝐱ᵢ₊₁ = 𝐱ᵢ and the march stops advancing.
Novel View Synthesis – Baseline Comparison
Shepard-Metzler objects: SRNs vs. deterministic GQN (Eslami et al., 2018), comparing reconstructed appearance and reconstructed geometry; the GQN baseline doesn't reconstruct geometry.
Training: 1,000 Shepard-Metzler objects, 15 observations each.
Testing: the same 1,000 Shepard-Metzler objects, 250 novel views rendered in an Archimedean spiral around each object.
Novel View Synthesis – SRN Output
Shepard-Metzler Objects
Failure Cases
Small gaps; fine detail; out-of-distribution samples at test time; unseen object positions / object counts in rooms.
Editor's Notes
  • #2: Continuous, 3D-structure-aware neural scene representation.
  • #3: Single observation of object → 3D neural scene representation, enabling multi-view consistent novel view synthesis & full shape reconstruction.
  • #4: Let me formalize. We are given a set of 3D scenes – for now, simple scenes with just a single object. For each scene, we receive a set of observations. One observation = image, camera pose, camera intrinsics. Question: What can we learn about latent 3D scenes by looking at observations? Vision: Learn a rich representation of our world, just by watching video. (I refer to this problem as self-supervised scene representation learning: self-supervised, because we use only observations, and we're trying to learn a scene representation.)
  • #5: Our model is going to take as input a set of observations – process them – and produce as output re-renderings of these same observations, allowing us to supervise the model with a 2D image loss.
  • #6: First, infer the neural scene representation: a feature representation that accumulates all information from the observations.
  • #7: The neural renderer then renders observations.
  • #8: Let me convince you this problem is hard. Let's build a 2D baseline: an autoencoder that maps to a latent code. Concatenate the latent code with the pose we want to render, and decode into the observation.
  • #9: Latent code = scene representation. Output pose + conv decoder = renderer.
  • #10: Here's what you get training on ShapeNet cars, 50 observations each. It fails to discover latent 3D structure. We argue: need a 3D inductive bias!
  • #11: Let's look at related work along two dimensions: do models have a 3D inductive bias, and are they self-supervised only with images? Scene representation learning, such as Neural Scene Representation and Rendering, is self-supervised with images, but lacks strong 3D structure. Similarly, 2D generative models lack 3D structure. 3D computer vision deals with reconstruction of object geometry; it generally has 3D structure – such as signed distance functions – but needs supervision in 3D. (DeepSDF and occupancy networks were a huge inspiration!) Voxel-based neural scene representations, such as our own work – DeepVoxels – or HoloGAN, seemingly check both boxes, but have their own problems: voxel grids scale poorly, don't parameterize scene surfaces smoothly, and generalization is hard as a result.
  • #12: We’re now ready to formulate scene representation networks.
  • #13: We’ll start with the scene representation itself.
  • #14: Let's consider this simple scene. The grey background represents free space. There are three objects – a circle, a square, and a triangle – and we have defined a coordinate system with coordinate variable x.
  • #15: We propose to model scenes as functions that map 3D coordinates to feature representations of local scene properties. For instance, all coordinates within the blue triangle should be mapped to a feature representation that contains “triangle” and “blue”. Similarly, all points in the square should be mapped to “square, red”. Lastly, all the points in free space should be mapped to some feature that says “free space”.
  • #16: Of course, that function is generally intractable. We thus propose to parameterize this function as a fully connected neural network. This is a Scene Representation Network: A fully connected neural network that maps a 3D coordinate to a feature representation of what is at that 3D coordinate. The scene is thus implicitly encoded in the weights of the network.
  • #17: Has some neat properties. First, it enforces a 3D inductive bias, because it lives in 3D. Second, it can be sampled anywhere and at arbitrary resolutions: a continuous function defined on ℝ³. It parameterizes the scene smoothly, with no discretization. Memory scales with scene complexity, not with spatial resolution as with voxel grids.
  • #18: We now have the scene representation.
  • #19: We still require a renderer in order to supervise it only with observations.
  • #20: Let’s go back to our toy scene.
  • #22: We now place a camera in our scene, and we want to render the image that is observed here.
  • #23: We first have to find the points where camera rays intersect scene geometry. Please note that this is not trivial because we do not have access to explicit surface points which we could project to the image plane – we can instead only query our scene representation network. We will solve this via ray marching: stepping along the ray until we find a surface. Naively, we could pick a small constant step size – but can we do better?
  • #24: Let’s march along this ray. We start at a point x_0 close to the camera. At every step, we sample our scene representation at the current point. This yields a feature vector that describes local scene properties.
  • #25: We now use a neural network to translate that feature vector into a step length, and then advance along the ray according to this step length. What could the feature vector possibly encode about the correct step length? One feasible step size that is defined for every coordinate in this scene is the distance to the closest scene surface, here visualized with a circle that touches the red square. That specific choice of step length defines the so-called sphere tracing algorithm developed for rendering signed distance fields. Intuitively, we’re proposing a generalization of sphere tracing – please check the paper for details on that.
  • #26–#31: We have thus found the intersections of each ray with scene geometry.
  • #32: The last iterate is our intersection estimate.
  • #33: To render an image, we sample the scene representation a final time at these intersections. We then translate the features at these locations into colors via another fully connected network.
  • #34: We have now assembled the full pipeline and can train a scene representation network given only a set of observations.
  • #41: Let's begin with a baseline comparison to other scene representation learning methods. We're comparing to recent scene representation learning models, including a deterministic implementation of the Generative Query Network by Eslami et al. We're training on ShapeNet chairs and cars, with 50 observations per object. We're reconstructing objects from a held-out test set *from a single observation*. You can see these observations in the lower right-hand corner. As you can see, the proposed SRNs at the bottom are the only model that succeeds at discovering the underlying 3D geometry.
  • #42: Here’s some more!
  • #71: We iterate this algorithm a fixed number of times. When we arrive at the surface, the feature can encode a step size of zero to avoid us stepping any further.