The 5th 3D Study Group @ Kanto (3D勉強会@関東)
Deep Reinforcement Learning of Volume-guided
Progressive View Inpainting for 3D Point Scene
Completion from a Single Depth Image
Aoyama Gakuin University, Sumi Laboratory
Seiya Ito (D2)
Self-introduction
Seiya Ito
• Affiliation
• Aoyama Gakuin University, Graduate School of Science and Engineering, Sumi Laboratory, D2
(still deciding between industry and academia as a career path)
• Aoyama Gakuin University, Advanced Information Technology Research Center, RA
• Research topics
• SfM, MVS, depth estimation, depth completion
• Semantic segmentation
Deep Modular Network [Ito+, ECCVW2018]
@seiyaito93
Paper Information
Xiaoguang Han1,3, Zhaoxuan Zhang2,3, Dong Du3,4, Mingdai Yang1,3,
Jingming Yu5, Pan Pan4, Xin Yang2, Ligang Liu4, Zixiang Xiong6, Shuguang Cui1,3
1The Chinese University of Hong Kong (Shenzhen), 2Dalian University of Technology,
3Shenzhen Research Institute of Big Data, 4University of Science and Technology of China,
5Alibaba Group, 6Texas A&M University
Deep Reinforcement Learning of Volume-guided Progressive View Inpainting for
3D Point Scene Completion from a Single Depth Image
In CVPR 2019 (Oral)
arXiv: https://arxiv.org/abs/1903.04019
Oral: https://www.youtube.com/watch?v=0lLnHe0xbZE&t=187s
Summary
• Point-based scene completion from a single input depth image using reinforcement learning
• Contributions of the paper
• A point-based algorithm for scene completion that directly generates the missing points from a single depth image
• A new reinforcement learning approach that determines the optimal view sequence for cumulative* scene completion
• A volume-guided view inpainting network that makes full use of global context
• SOTA on SUNCG
Abstract
We present a deep reinforcement learning method of progressive view inpainting for 3D point scene completion under volume guidance, achieving high-quality scene reconstruction from only a single depth image with severe occlusion. Our approach is end-to-end, consisting of three modules: 3D scene volume reconstruction, 2D depth map inpainting, and multi-view selection for completion. Given a single depth image, our method first goes through the 3D volume branch to obtain a volumetric scene reconstruction as a guide to the next view inpainting step, which attempts to make up the missing information; the third step involves projecting the volume under the same view of the input, concatenating them to complete the current view depth, and integrating all depth into the point cloud. Since the occluded areas are unavailable, we resort to a deep Q-Network to glance around and pick the next best view for large hole completion progressively until a scene is adequately reconstructed while guaranteeing validity. All steps are learned jointly to achieve robust and consistent results. We perform qualitative and quantitative evaluations with extensive experiments on the SUNCG data, obtaining better results than the state of the art.
Figure 1. Surface-based scene completion. (a) A single-view depth map as input; (b) Visible surface from the depth map, represented as a point cloud (the colors of depth and point cloud are for visualization only); (c) Scene completion results: directly recovering the missing points of the occluded regions, shown from two views for a better display.
(a) Input depth image
(b) (a) seen from a different viewpoint
(c) After scene completion
* "progressive" in the paper
Related Works
(Embedded slide figure: first page of "Data-Driven Structural Priors for Shape Completion", Sung, Kim, Angst, and Guibas [Sung+ SIGGRAPH2015]: completing a single partial scan using learned part structure, symmetry, and database priors.)
• Geometric & template-based approach
• Data-driven Structural Priors [Sung+ SIGGRAPH2015]
• Deep learning-based approach
• High-Resolution Shape Completion [Han+ ICCV2017]
• Volume-based scene completion
• SSCNet [Song+ CVPR2017]
• ScanComplete [Dai+ CVPR2018]
Overview
• Goal: generate a complete point cloud from the input depth image D_0
• Approach: represent the incomplete point cloud as multi-view depth images and treat completion as a 2D inpainting task
→ keep the inferred points and use them for the next inference (progressive processing)
Figure 2 (pipeline of the method): Given a single depth image D_0, it is converted to a point cloud P, shown in two different views. A DQN is used to seek the next-best view, under which the point cloud is projected to a new depth image D_1, causing holes. In parallel, P is also completed in volumetric space by SSCNet, resulting in V. Under the view of D_1, V is projected and guides the inpainting of D_1 with a 2D CNN. Repeating this process several times achieves the final high-quality scene completion.
Overview
1. Convert the input depth image D_0 into a point cloud P_0
2. Start by treating D_0 as the rendering of P_0 from viewpoint v_0
3. Render P_0 from a new viewpoint v_1 to obtain a depth image D_1
→ this depth image contains many holes
(described later: the best view path is found with a DQN)
Overview
4. Apply 2D inpainting to the depth image D_1 to obtain the inpainted depth image D̂_1
5. Convert D̂_1 into points and merge them with P_0 to obtain a denser point cloud P_1
6. Repeat steps 3-5 (a pseudocode sketch of this loop follows below)
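A minimal Python sketch of this progressive completion loop. The helper callables bundled in `components` (backproject, complete_volume, select_next_view, render_depth, project_volume, inpaint_depth, merge, hole_area) are hypothetical stand-ins for the modules described on the surrounding slides, not the authors' implementation; the 5% hole-area stopping criterion comes from the DQN slides later in the deck.

```python
def progressive_scene_completion(D0, camera, components, max_steps=10, hole_thresh=0.05):
    """Progressively complete a scene from a single depth image D0."""
    c = components
    P = c.backproject(D0, camera, view=0)              # 1. depth -> point cloud P_0
    V_c = c.complete_volume(P)                         # volumetric completion (SSCNet branch)
    init_holes = c.hole_area(P)
    for i in range(1, max_steps + 1):
        v_i = c.select_next_view(P)                    # DQN picks the next-best view
        D_i = c.render_depth(P, camera, v_i)           # 3. re-rendering leaves holes
        D_i_c = c.project_volume(V_c, camera, v_i)     # volume guidance under the same view
        D_hat = c.inpaint_depth(D_i, D_i_c)            # 4. 2D depth inpainting
        P = c.merge(P, c.backproject(D_hat, camera, v_i))  # 5. accumulate the new points
        if c.hole_area(P) / init_holes < hole_thresh:  # convergence: almost all holes filled
            break
    return P
```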
Overview
• ① 2D CNN (volume-guided view inpainting network)
• ② DQN
Volume-guided View Inpainting
• Composed of three subnetworks
• Volume Completion
Completes the volumetric occupancy grid V of P_0 into V^c
• Depth Inpainting
Takes the depth image D_i and the depth image D_i^c obtained by projecting V^c to viewpoint v_i as its two inputs, and produces the inpainted depth image D̂_i
• Projection Layer
A differentiable layer connecting D_i^c and V^c
(Slide figure: detailed pipeline of the volume-guided view inpainting network: the input D_0 is completed by SSCNet into V^c, the Projection Layer renders V^c under view v_i as D_i^c, and D_i together with D_i^c is fed to the Depth Inpainting network to produce D̂_i.)
Volume-guided View Inpainting
Volume Completion
• Uses SSCNet [Song+ CVPR2017] (f: V ↦ V^c)
• The original network also predicts semantic labels; here no semantic labels are predicted
→ trained as voxel-wise binary classification (empty/full), as sketched below
• From an input resolution of 240x144x240, outputs a 60x36x60 probability volume (1/4 resolution)
[Song+ CVPR2017]: S. Song et al. "Semantic Scene Completion from a Single Depth Image", In CVPR, 2017.
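A minimal sketch of the voxel-wise binary objective described above. `completion_net` is a hypothetical 3D CNN standing in for the modified SSCNet; the tensor shapes follow the resolutions stated on the slide, and the loss choice (binary cross-entropy) is an assumption consistent with "binary classification (empty/full)".

```python
import torch
import torch.nn.functional as F

def volume_completion_loss(completion_net, V_in, V_gt):
    """Voxel-wise empty/full objective for the volume branch.

    V_in:  (B, 1, 240, 144, 240) occupancy grid built from the visible points
    V_gt:  (B, 1, 60, 36, 60) ground-truth occupancy at 1/4 resolution
    completion_net: 3D CNN mapping V_in to per-voxel occupancy logits
    """
    logits = completion_net(V_in)                       # (B, 1, 60, 36, 60)
    return F.binary_cross_entropy_with_logits(logits, V_gt)
```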
Volume-guided View Inpainting
Depth Inpainting
• Depth images are rendered as 512x512 grayscale images
• Adopts [Liu+ ECCV2018], which handles irregular holes
• The depth images D_i and D_i^c are concatenated (concat) and fed into a U-Net-like network
• The output image size is 512x512, the same as the input
(Embedded slide figure: "Image Inpainting for Irregular Holes Using Partial Convolutions", G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, B. Catanzaro, NVIDIA Corporation.)
Partial convolution at each sliding-window location (W: convolution filter weights, b: bias, X: feature values in the current window, M: the corresponding binary mask, ⊙: element-wise multiplication, 1: a tensor of ones with the same shape as M):
x' = W^T (X ⊙ M) · sum(1)/sum(M) + b  if sum(M) > 0, otherwise x' = 0
Output values depend only on unmasked inputs; the scaling factor sum(1)/sum(M) adjusts for the varying number of valid (unmasked) inputs.
[Difference from U-Net] Partial Convolutional Layer
[Liu+ ECCV2018]: G. Liu et al. “Image Inpainting for Irregular Holes Using Partial Convolutions”, In ECCV, 2018.
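A simplified sketch of a partial convolution layer following the formula above; it is not NVIDIA's implementation, and the mask update is reduced to a single-channel "any valid pixel in the window" rule for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Simplified partial convolution: convolve only valid pixels and
    renormalize by the number of valid inputs in each window."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("ones", torch.ones(1, in_ch, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # x: (B, C, H, W) features; mask: (B, C, H, W) with 1 = valid, 0 = hole.
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
            scale = self.ones.sum() / valid.clamp(min=1.0)   # sum(1) / sum(M)
            new_mask = (valid > 0).float()                   # mask update step
        out = self.conv(x * mask)                            # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        # Rescale the unbiased response, zero out windows with no valid input.
        return (out - bias) * scale * new_mask + bias * new_mask, new_mask
```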
Volume-guided View Inpainting
Projection Layer
• Lets the 2D loss be used to optimize the parameters of the 3D CNN
• Differentiable projection layer [Tulsiani+ CVPR2017]
Depth D(x) of pixel x:
D(x) = Σ_{k=1}^{N_x} P_k^x d_k
P_k^x = (1 − V_k) Π_{j=1}^{k−1} V_j,  k = 1, …, N_x
∂D(x)/∂V_k = Σ_{i=k}^{N_x} (d_{i+1} − d_i) Π_{1≤j≤i, j≠k} V_j
k: index of a voxel hit by the ray through pixel x (smaller means nearer; N_x is the number of voxels the ray passes through)
P_k^x: probability that the ray first hits the k-th voxel
V_k: occupancy value of the k-th voxel (probability of being empty)
d_k: distance to the k-th voxel
[Tulsiani+ CVPR2017]: S. Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency”, In CVPR, 2017.
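A per-ray sketch of the expected-depth computation above. It ignores the case where the ray escapes all voxels and relies on autograd to reproduce the gradient formula, so it is an illustration under those assumptions rather than the paper's projection layer.

```python
import torch

def expected_ray_depth(V_ray, d_ray):
    """Differentiable expected depth along one ray.

    V_ray: (N,) probability that each traversed voxel is empty, ordered near to far
    d_ray: (N,) distance from the camera to each traversed voxel
    Returns D(x) = sum_k (1 - V_k) * prod_{j<k} V_j * d_k.
    """
    ones = torch.ones(1, dtype=V_ray.dtype, device=V_ray.device)
    empty_before = torch.cumprod(torch.cat([ones, V_ray[:-1]]), dim=0)  # prod_{j<k} V_j
    hit_prob = (1.0 - V_ray) * empty_before                             # P_k^x
    return (hit_prob * d_ray).sum()

# Usage: because D(x) is differentiable in V_ray, a 2D depth loss backpropagates
# into the volume; autograd reproduces the closed-form gradient on the slide.
V_ray = torch.rand(32, requires_grad=True)
d_ray = torch.linspace(0.5, 4.0, 32)
loss = (expected_ray_depth(V_ray, d_ray) - 2.0).abs()
loss.backward()
```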
Volume-guided View Inpainting
Joint Training
• Training is split into three stages to guarantee convergence (a schedule sketch follows below)
1. Train SSCNet separately
2. Freeze SSCNet and train the Depth Inpainting network
3. Fine-tune the entire network
• Training data come from SUNCG (synthetic scenes)
• N depth images are rendered from random scenes and viewpoints
• Each depth image is converted to a point cloud and projected to m viewpoints sampled randomly near the original one
→ this avoids opening overly large holes and keeps sufficient context available
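A small sketch of the three-stage schedule as a freeze/unfreeze routine. The optimizer choice and learning rates are assumptions (the slide does not state them); `sscnet` and `inpaint_net` are any `nn.Module` instances standing in for the two branches.

```python
import torch

def configure_stage(stage, sscnet, inpaint_net, lr=1e-4):
    """Return an optimizer for the given training stage by freezing or
    unfreezing the two branches as described on the slide."""
    if stage == 1:                         # 1. train SSCNet alone
        params = list(sscnet.parameters())
    elif stage == 2:                       # 2. freeze SSCNet, train depth inpainting
        for p in sscnet.parameters():
            p.requires_grad = False
        params = list(inpaint_net.parameters())
    else:                                  # 3. fine-tune everything end-to-end
        for p in sscnet.parameters():
            p.requires_grad = True
        params = list(sscnet.parameters()) + list(inpaint_net.parameters())
        lr = lr * 0.1                      # assumed smaller fine-tuning rate
    return torch.optim.Adam([p for p in params if p.requires_grad], lr=lr)
```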
Overview
• ① 2D CNN (volume-guided view inpainting network)
• ② DQN
Progressive Scene Completion
• The optimal view sequence is found with a DQN
• From the incomplete point cloud P_0, obtain v_1, …, v_n
• Markov decision process (MDP)
• State
• The updated point cloud
• Action
• k uniformly spaced viewpoints on a support sphere centered on the scene
• Here, 10 viewpoints each on the equator and the 45° latitude circle (20 in total)
• Reward
• Inpainting accuracy R_i^acc
• Hole area R_i^hole
(Diagram: the standard RL loop. Given state s_t, the agent takes action a_t in the environment and receives reward r_{t+1} and the next state s_{t+1}.)
DQN
• Reward
• Inpainting accuracy
R_i^acc = −(1/|Ω|) L¹_Ω(D̂_i, D_i^gt)
Ω: set of pixels inside the holes, L¹: L1 loss, D_i^gt: ground-truth depth rendered in advance
• Hole area
R_i^hole = (Area_h(P_{i−1}) − Area_h(P_i)) / Area_h(P_0) − 1
Area_h(P): total hole area when P is projected to all cameras
• Final reward (a sketch of the reward computation follows below)
R_i^total = w R_i^acc + (1 − w) R_i^hole
• Convergence
• Area_h(P_i) / Area_h(P_0) < 5% … almost all holes have been completed
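A minimal sketch of the reward terms above. The Area_h values are assumed to be precomputed by projecting the point cloud to all cameras and summing the hole areas; the function names are illustrative, not from the paper.

```python
import torch

def inpainting_reward(D_hat, D_gt, hole_mask):
    """R_i^acc: negative mean L1 error over the pixels inside the holes (Ω)."""
    err = (D_hat - D_gt).abs() * hole_mask
    return -(err.sum() / hole_mask.sum().clamp(min=1.0))

def hole_reward(area_prev, area_cur, area_init):
    """R_i^hole: relative reduction of the total hole area, with the constant
    per-step offset as written on the slide."""
    return (area_prev - area_cur) / area_init - 1.0

def total_reward(r_acc, r_hole, w=0.7):
    """R_i^total = w * R_i^acc + (1 - w) * R_i^hole (w = 0.7 per the slides)."""
    return w * r_acc + (1.0 - w) * r_hole
```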
(Figure 3, DQN architecture: multi-view depth renderings of the point-cloud state pass through per-view CNNs, a view-pooling layer, and fully connected layers of size 512 and 256 to produce 20 Q-values. To support the DQN, a set of scene-centric camera views forms the discrete action space: P_0 is placed upright in its bounding sphere, two circular paths are created on the equator and the 45-degree latitude line, and 20 camera views are uniformly selected on these paths, 10 per circle, all facing the center of the bounding sphere. The views are fixed for all training samples and denoted C = {c_1, c_2, ..., c_20}.)
DQN
• Based on MVCNN [Su+ ICCV2015]
• Multi-view depth images are generated from P_{i−1}
• Outputs a Q-value for each action
• Depth images are 224x224
• Dueling DQN structure [Wang+ ICML2016]
• Loss function (a sketch follows below)
• Standard DQN:
Loss(θ) = E[(r + γ max_{v_{i+1}} Q(P_i, v_{i+1}; θ′) − Q(P_{i−1}, v_i; θ))²]
• This work removes the upward bias introduced by the max_{v_{i+1}} Q(P_i, v_{i+1}; θ′) term [Hasselt+ AAAI2016]:
Loss(θ) = E[(r + γ Q(P_i, argmax_{v_{i+1}} Q(P_i, v_{i+1}; θ); θ′) − Q(P_{i−1}, v_i; θ))²]
[Su+ ICCV2015]: H. Su et al. "Multi-view Convolutional Neural Networks for 3D Shape Recognition", In ICCV, 2015.
[Wang+ ICML2016]: Z. Wang et al. "Dueling Network Architectures for Deep Reinforcement Learning", In ICML, 2016.
[Hasselt+ AAAI2016]: H. van Hasselt et al. "Deep Reinforcement Learning with Double Q-Learning", In AAAI, 2016.
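A minimal sketch of the double-DQN loss above. `q_net` and `target_net` are hypothetical networks mapping a batched state representation (e.g. the multi-view depth rendering of the point cloud) to 20 Q-values; `v` is a batch of chosen view indices and `r` a batch of rewards.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, P_prev, v, r, P_next, gamma=0.9):
    """Double-DQN target: the action is chosen by the online network and
    evaluated by the target network, removing the upward bias of the max."""
    q_pred = q_net(P_prev).gather(1, v.unsqueeze(1)).squeeze(1)    # Q(P_{i-1}, v_i; θ)
    with torch.no_grad():
        best_next = q_net(P_next).argmax(dim=1, keepdim=True)      # argmax_v Q(P_i, v; θ)
        q_target = r + gamma * target_net(P_next).gather(1, best_next).squeeze(1)
    return F.mse_loss(q_pred, q_target)
```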
DQN
• ε-greedy policy
• Draw a random number; with probability ε choose the action with the maximum Q-value, otherwise act randomly
• (the usual convention is the opposite: random with probability ε, maximum Q-value with probability 1 − ε)
• Experience replay
• 200 episodes are recorded
• Tuples (P_{i−1}, v_i, r, P_i) are randomly sampled from these episodes as training data
• Implementation details (a small sketch of the action selection and replay buffer follows below)
• Reward weight w is 0.7
• Discount factor is 0.9
• ε goes from 0.9 to 0.2 (then fixed) over 10,000 steps
• Training takes 3 days; inference takes 60 seconds (5 views on average)
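A small sketch of the ε-greedy selection (using the slide's non-standard convention) and a replay buffer. The buffer here stores individual transitions rather than whole episodes, and the linear ε schedule shape is an assumption; only the 0.9 → 0.2 range and the 10,000-step horizon come from the slide.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (P_prev, v, r, P_next) transitions."""
    def __init__(self, capacity=200):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def select_view(q_values, epsilon):
    """With probability epsilon pick the greedy (max-Q) view, otherwise random,
    matching the convention described on the slide."""
    if random.random() < epsilon:
        return int(max(range(len(q_values)), key=lambda a: q_values[a]))
    return random.randrange(len(q_values))

def epsilon_schedule(step, start=0.9, end=0.2, decay_steps=10_000):
    """Assumed linear decay from 0.9 to 0.2 over 10,000 steps, then fixed."""
    t = min(step / decay_steps, 1.0)
    return start + (end - start) * t
```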
Dataset
• SUNCG dataset
• 2D CNN (Depth Inpainting)
• 30,000 depth images (N = 3,000, m = 10)
• Viewpoints occluded by doors or walls are removed
• 3,000 for testing, the rest for training
• DQN (View Path Planning)
• N = 2,500
• 2,300 for training, 200 for testing
Comparison Against SOTA
• Proposes a way to compare volume-based and point-based methods
• The accuracy obtained from the ground-truth volume can be treated as an upper bound for all existing volume-based scene completion methods
• The GT volume, SSCNet, and the proposed method are rendered from various viewpoints to obtain depth maps, which are converted to point clouds
• Quantitative evaluation (a sketch of both metrics follows below)
• Chamfer distance (CD)
• Completeness
C_r(P, P_GT) = |{x ∈ P_GT : d(x, P) < r}| / |{y : y ∈ P_GT}|
d(x, P): distance between x and the point cloud P, |·|: number of elements
r: distance threshold (0.02, 0.04, 0.06, 0.08, 0.10)
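A minimal sketch of the two metrics using nearest-neighbor queries. The exact normalization of the paper's Chamfer distance is not stated on the slide, so the symmetric mean-distance form below is an assumption; the completeness follows the formula above directly.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (N,3) and Q (M,3)."""
    d_pq, _ = cKDTree(Q).query(P)   # nearest-neighbor distances P -> Q
    d_qp, _ = cKDTree(P).query(Q)   # nearest-neighbor distances Q -> P
    return d_pq.mean() + d_qp.mean()

def completeness(P, P_gt, r):
    """C_r: fraction of ground-truth points within distance r of the prediction P."""
    d, _ = cKDTree(P).query(P_gt)
    return float((d < r).mean())
```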
Comparison Against SOTA
Figure 5. Comparisons against the state of the art: given different inputs and the reference ground truth, the completion results of three methods are shown (columns: Input & GT, SSCNet, Volume-GT, Ours), with the corresponding point-cloud error maps below and zoom-in areas beside.
Top: estimated results (left) and zoom of the red box (right); bottom: error maps (left) and zoom of the red box (right)

Table 1. Quantitative comparisons against existing methods (CD metric and completeness metric w.r.t. different thresholds).
             SSCNet   Volume GT1  ScanComplete  Volume GT2  U5       U10      DQN w/o hole  Ours
CD           0.5162   0.5140      0.2193        0.2058      0.1642   0.1841   0.1495        0.1148
C r=0.002 %  14.61    13.28       34.46         31.18       79.18    80.17    79.22         79.26
C r=0.004 %  30.10    32.23       58.83         61.11       83.33    84.15    83.50         83.68
C r=0.006 %  52.82    50.14       74.60         74.88       85.81    86.56    86.02         86.28
C r=0.008 %  71.24    72.33       79.59         81.04       87.66    88.33    87.81         88.20
C r=0.010 %  78.23    78.96       81.01         81.61       89.06    89.70    89.24         89.68
Ablation Studies
• Depth Inpainting
• DepIn w/o VG: without volume guidance
• DepIn w/o PBP: without projection back-propagation
• View Path Planning
• U5, U10: uniform sampling with a fixed number of views
• DQN w/o-hole: reward uses only R_i^acc

Table 2. Quantitative ablation studies on the inpainting network.
        DepIn w/o VG  DepIn w/o PBP  Ours
L1_Ω    0.0717        0.0574         0.0470
PSNR    22.15         23.12          24.73
SSIM    0.910         0.926          0.930

Figure 4. Comparisons on variants of the depth inpainting network (columns: Input & GT, DepIn w/o VG, DepIn w/o PBP, Ours).
(On view path planning, from the paper: without a DQN, a straightforward alternative is to uniformly sample a fixed number of views from C and directly perform depth inpainting on them; two such baselines with 5 and 10 views are evaluated, shown as U5 and U10 above. Qualitative comparison columns: Input & GT, U5, U10, DQN w/o-hole, Ours.)
Conclusion
• Proposed a point-based 3D scene completion method from a single depth image
• The missing points are estimated by inpainting multi-view depth images
• Volume-guided view inpainting is proposed to guarantee accurate and consistent results
• A reinforcement learning framework is devised to search for the optimal view path
• Future work
1. Use texture information for depth-image inpainting → take RGB-D as input
2. Perform texture completion together with depth-image inpainting to output textured scenes