Deep single view 3 d object reconstruction with visual hull

Deep Single-View 3D Object Reconstruction with Visual Hull Embedding
Hanqing Wang (1,2), Jiaolong Yang (2), Wei Liang (1), Xin Tong (2)
(1) Beijing Institute of Technology, Beijing, China
(2) Microsoft Research Asia, Beijing, China
AAAI 2019

Single-View 3D Reconstruction
• Input: a single RGB(D) image
• Output: the corresponding 3D representation

• Deep Learning based Methods:
[Girdhar ECCV’16]
[Choy ECCV’16]
Other works:
Yan NIPS’16; Wu NIPS’16; Tulsiani CVPR’17; Zhu ICCV’17…

• Problems of Existing Deep Learning based Methods:
  • 1. Arbitrary-view images vs. canonical-view aligned 3D shapes
  • 2. Unsatisfactory results
    • Missing shape details
    • Plausible shapes yet inconsistent with input images
Generation or reconstruction?

Core Idea
• Goal: Reconstruct the object precisely from the given image
• Idea: Explicitly embed the 3D-2D projection geometry into a network
• Approach: Estimate a single-view visual hull inside the network
[Figure: multi-view visual hull vs. single-view visual hull]

Our Approach
• Perspective camera model
• Volumetric shape representation
• Method overview

Components
[Figure: network diagram with panels (a) to (e); blocks include 2D encoders, 3D decoders, a 2D decoder, a 3D encoder, and a regressor that outputs (R, T)]
• (a) V-Net: coarse shape prediction
• (b) P-Net: object pose and camera parameters estimation
• (c) S-Net: silhouette prediction
• (d) PSVH layer: visual hull generation
• (e) R-Net: coarse shape refinement
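
The idea behind component (d) can be illustrated with a minimal sketch: every voxel is projected into the image and takes the silhouette probability of the pixel it lands on. This is not the authors' PSVH layer. The function names, the nearest-pixel lookup, the grid extent, and the toy camera below are all assumptions made for illustration.

```python
def single_view_visual_hull(silhouette, project, grid_size, extent=1.0):
    """Fill a grid_size^3 volume with silhouette probabilities.

    silhouette: 2D list of values in [0, 1], indexed as silhouette[v][u]
    project:    hypothetical callable mapping (X, Y, Z) to pixel coords (u, v)
    extent:     half side length of the cube covered by the grid (assumed)
    """
    height, width = len(silhouette), len(silhouette[0])
    step = 2.0 * extent / grid_size
    hull = [[[0.0] * grid_size for _ in range(grid_size)]
            for _ in range(grid_size)]
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                # Voxel centre in object coordinates.
                x = -extent + (i + 0.5) * step
                y = -extent + (j + 0.5) * step
                z = -extent + (k + 0.5) * step
                u, v = project(x, y, z)
                # Nearest-pixel lookup (an assumption; other sampling
                # schemes are possible).
                col, row = int(round(u)), int(round(v))
                if 0 <= row < height and 0 <= col < width:
                    hull[i][j][k] = silhouette[row][col]
                # Voxels projecting outside the image keep probability 0.
    return hull


# Toy 4x4 silhouette: the object covers the central 2x2 pixels.
silhouette = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]


def toy_project(x, y, z, f=8.0, distance=4.0, u0=1.5, v0=1.5):
    """Toy pinhole camera looking down the Z axis (illustrative values)."""
    depth = z + distance
    return f * x / depth + u0, f * y / depth + v0


hull = single_view_visual_hull(silhouette, toy_project, grid_size=4)
```

In this toy setup, voxels near the optical axis fall inside the silhouette and receive probability 1, while corner voxels project onto background pixels and stay at 0.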

Projection Details
The relationship between a 3D point $(X, Y, Z)$ and its projected pixel location $(u, v)$ on the image is
$$Z\,[u, v, 1]^T = K\left(R\,[X, Y, Z]^T + t\right) \tag{1}$$
where
$$K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$
is the camera intrinsic matrix, $R \in SO(3)$ is the rotation matrix generated by three Euler angles, denoted $[\theta_1, \theta_2, \theta_3]$, and $t = [t_X, t_Y, t_Z]^T \in \mathbb{R}^3$ is the translation vector. For translation we estimate $t_Z$ and a 2D vector $[t_u, t_v]$ which centers the object on the image plane, and obtain $t$ via
$$t = \left[\frac{t_u}{f}\, t_Z,\ \frac{t_v}{f}\, t_Z,\ t_Z\right]^T.$$
In summary, we parameterize the pose as a 6-D vector
$$p = [\theta_1, \theta_2, \theta_3, t_u, t_v, t_Z]^T.$$
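
A small sketch may help make the 6-D pose parameterization concrete. It builds the translation from $[t_u, t_v, t_Z]$ and projects a point with Eq. (1). The slides do not state the Euler-angle convention, so the order R = Rz(θ3)·Ry(θ2)·Rx(θ1) used here is an assumption, as are the function names and the numeric values. The scale factor on the left of Eq. (1) is interpreted as the camera-space depth.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(theta1, theta2, theta3):
    """Assumed convention: R = Rz(theta3) * Ry(theta2) * Rx(theta1)."""
    c1, s1 = math.cos(theta1), math.sin(theta1)
    c2, s2 = math.cos(theta2), math.sin(theta2)
    c3, s3 = math.cos(theta3), math.sin(theta3)
    rx = [[1.0, 0.0, 0.0], [0.0, c1, -s1], [0.0, s1, c1]]
    ry = [[c2, 0.0, s2], [0.0, 1.0, 0.0], [-s2, 0.0, c2]]
    rz = [[c3, -s3, 0.0], [s3, c3, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))


def project(point, pose, f, u0, v0):
    """Project a 3D point with pose p = [th1, th2, th3, t_u, t_v, t_Z]."""
    theta1, theta2, theta3, t_u, t_v, t_z = pose
    rot = rotation_from_euler(theta1, theta2, theta3)
    # Translation recovered from the image-plane offset and the depth.
    t = (t_u / f * t_z, t_v / f * t_z, t_z)
    cam = [sum(rot[i][k] * point[k] for k in range(3)) + t[i]
           for i in range(3)]
    # Apply K and divide by the camera-space depth.
    u = f * cam[0] / cam[2] + u0
    v = f * cam[1] / cam[2] + v0
    return u, v


# With zero rotation, the object origin lands at (u0 + t_u, v0 + t_v).
u, v = project((0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 5.0, -3.0, 2.0),
               f=100.0, u0=64.0, v0=64.0)
```

The usage lines show why $[t_u, t_v]$ acts as an image-plane offset: whatever the value of $t_Z$, the object origin projects to the principal point shifted by $(t_u, t_v)$.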

Network Architecture
• Overview:

Training Loss
We use the binary cross-entropy loss to train V-Net, S-Net and R-Net. Let $p_n$ be the estimated probability at location $n$; the loss is defined as
$$l = -\frac{1}{N} \sum_{n} \left( p_n^* \log p_n + (1 - p_n^*) \log(1 - p_n) \right) \tag{2}$$
where $p_n^*$ is the target probability and $N$ is the number of locations.
For P-Net, we use the $L_1$ regression loss to train the network:
$$l = \sum_{i=1,2,3} \alpha\,|\theta_i - \theta_i^*| + \sum_{j=u,v} \beta\,|t_j - t_j^*| + \gamma\,|t_Z - t_Z^*| \tag{3}$$
where we set $\alpha = 1$, $\gamma = 1$, $\beta = 0.01$.
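
The two losses can be written out as a short sketch in plain Python. The function names are hypothetical, and the clamping constant `eps` is an added assumption for numerical stability that does not appear on the slides. The default weights follow the stated values α = 1, β = 0.01, γ = 1.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over N locations, as in Eq. (2)."""
    total = 0.0
    for p, p_star in zip(pred, target):
        # Clamp to avoid log(0); an assumption, not part of Eq. (2).
        p = min(max(p, eps), 1.0 - eps)
        total += p_star * math.log(p) + (1.0 - p_star) * math.log(1.0 - p)
    return -total / len(pred)


def pose_l1_loss(pose, pose_target, alpha=1.0, beta=0.01, gamma=1.0):
    """Weighted L1 loss over p = [th1, th2, th3, t_u, t_v, t_Z], as in Eq. (3)."""
    angle_term = sum(alpha * abs(a - b)
                     for a, b in zip(pose[:3], pose_target[:3]))
    shift_term = sum(beta * abs(a - b)
                     for a, b in zip(pose[3:5], pose_target[3:5]))
    depth_term = gamma * abs(pose[5] - pose_target[5])
    return angle_term + shift_term + depth_term


bce = binary_cross_entropy([0.5, 0.5], [1.0, 0.0])
pose_loss = pose_l1_loss([0.1, 0.2, 0.3, 10.0, 20.0, 2.0],
                         [0.0, 0.0, 0.0, 0.0, 0.0, 1.5])
```

One plausible reading of the small β is that $t_u$ and $t_v$ are measured in pixels and would otherwise dominate the angle and depth terms; the slides do not state the reason.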

Experiments
• Object categories: car, airplane, chair, sofa
• Datasets:
  • 3D-R2N2 dataset – rendered ShapeNet objects
  • PASCAL 3D+ dataset – real images manually associated with a limited set of CAD models

Experiments
• Implementation details:
  • Network implemented in TensorFlow
  • Input image size: 128×128×3
  • Output voxel grid: 32×32×32
• Running time:
  • ~18 ms for one image (i.e. running at about 55 fps)
  • (Tested with a batch of 24 images on an NVIDIA Tesla M40 GPU)

Experiments
• Results on the 3D-R2N2 dataset (rendered ShapeNet objects)
• Ablation study:

Experiments
• Results on the 3D-R2N2 dataset (rendered ShapeNet objects)

Experiments
• Results on the PASCAL 3D+ dataset (real images)

Summary
• A novel 3D reconstruction neural network structure
• Embedding domain knowledge (3D-2D perspective geometry) into a DNN
• Performing reconstruction jointly with segmentation and pose estimation
• A novel, GPU-friendly Probabilistic Single-view Visual Hull layer
