The 5th 3D Study Group @ Kanto (3D勉強会@関東)
Deep Reinforcement Learning of Volume-guided
Progressive View Inpainting for 3D Point Scene
Completion from a Single Depth Image
Aoyama Gakuin University, Sumi Laboratory
Seiya Ito (D2)
Self-introduction
Seiya Ito
• Affiliation
• Aoyama Gakuin University, Graduate School of Science and Engineering, Sumi Laboratory, D2
(still deciding between industry and academia as a career path)
• Aoyama Gakuin University, Advanced Information Technology Research Center, RA
• Research topics
• SfM, MVS, depth estimation, depth completion
• Semantic segmentation
Deep Modular Network [Ito+, ECCVW2018]
@seiyaito93
Paper Information
Xiaoguang Han1,3, Zhaoxuan Zhang2,3, Dong Du3,4, Mingdai Yang1,3,
Jingming Yu5, Pan Pan4, Xin Yang2, Ligang Liu4, Zixiang Xiong6, Shuguang Cui1,3
1The Chinese University of Hong Kong (Shenzhen), 2Dalian University of Technology,
3Shenzhen Research Institute of Big Data, 4University of Science and Technology of China,
5Alibaba Group, 6Texas A&M University
Deep Reinforcement Learning of Volume-guided Progressive View Inpainting for
3D Point Scene Completion from a Single Depth Image
In CVPR 2019 (Oral)
arXiv: https://arxiv.org/abs/1903.04019
Oral: https://www.youtube.com/watch?v=0lLnHe0xbZE&t=187s
Summary
• Point-based scene completion from a single input depth image using reinforcement learning
• Contributions of the paper
• A point-based algorithm for scene completion that directly generates the missing points from a single depth image
• A new reinforcement learning approach that determines the optimal view sequence for cumulative* scene completion
• A volume-guided view inpainting network that makes full use of global context
• SOTA on SUNCG
Abstract
We present a deep reinforcement learning method of progressive view inpainting for 3D point scene completion under volume guidance, achieving high-quality scene reconstruction from only a single depth image with severe occlusion. Our approach is end-to-end, consisting of three modules: 3D scene volume reconstruction, 2D depth map inpainting, and multi-view selection for completion. Given a single depth image, our method first goes through the 3D volume branch to obtain a volumetric scene reconstruction as a guide to the next view inpainting step, which attempts to make up the missing information; the third step involves projecting the volume under the same view of the input, concatenating them to complete the current view depth, and integrating all depth into the point cloud. Since the occluded areas are unavailable, we resort to a deep Q-Network to glance around and pick the next best view for large hole completion progressively until a scene is adequately reconstructed while guaranteeing validity. All steps are learned jointly to achieve robust and consistent results. We perform qualitative and quantitative evaluations with extensive experiments on the SUNCG data, obtaining better results than the state of the art.
Figure 1. Surface-based scene completion. (a) A single-view depth map as input; (b) Visible surface from the depth map, represented as a point cloud (the colors of depth and point cloud are for visualization only); (c) Scene completion results: directly recovering the missing points of the occluded regions, shown from two views for a better display.
(a) Input depth image
(b) (a) seen from a different viewpoint
(c) After scene completion
* "progressive" in the paper
Related Works
(Embedded slide figure: first page of "Data-Driven Structural Priors for Shape Completion", Sung, Kim, Angst, and Guibas [Sung+ SIGGRAPH2015]: completing a single partial scan using learned part structure, symmetry, and database priors.)
• Geometric & template-based approach
• Data-driven Structural Priors [Sung+ SIGGRAPH2015]
• Deep learning-based approach
• High-Resolution Shape Completion [Han+ ICCV2017]
• Volume-based scene completion
• SSCNet [Song+ CVPR2017]
• ScanComplete [Dai+ CVPR2018]
Overview
• Goal: generate a complete point cloud from the input depth image D_0
• Approach: represent the incomplete point cloud as multi-view depth images and treat completion as a 2D inpainting task
→ keep the inferred points and use them for the next inference (progressive processing)
Figure 2 (pipeline of the method): Given a single depth image D_0, it is converted to a point cloud P, shown in two different views. A DQN is used to seek the next-best view, under which the point cloud is projected to a new depth image D_1, causing holes. In parallel, P is also completed in volumetric space by SSCNet, resulting in V. Under the view of D_1, V is projected and guides the inpainting of D_1 with a 2D CNN. Repeating this process several times achieves the final high-quality scene completion.
Overview
1. Convert the input depth image D_0 into a point cloud P_0
2. Start by treating D_0 as the rendering of P_0 from viewpoint v_0
3. Render P_0 from a new viewpoint v_1 to obtain a depth image D_1
→ this depth image contains many holes
(described later: the best view path is found with a DQN)
Overview
4. Apply 2D inpainting to the depth image D_1 to obtain the inpainted depth image D̂_1
5. Convert D̂_1 into points and merge them with P_0 to obtain a denser point cloud P_1
6. Repeat steps 3-5 (a pseudocode sketch of this loop follows below)
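A minimal Python sketch of this progressive completion loop. The helper callables bundled in `components` (backproject, complete_volume, select_next_view, render_depth, project_volume, inpaint_depth, merge, hole_area) are hypothetical stand-ins for the modules described on the surrounding slides, not the authors' implementation; the 5% hole-area stopping criterion comes from the DQN slides later in the deck.

```python
def progressive_scene_completion(D0, camera, components, max_steps=10, hole_thresh=0.05):
    """Progressively complete a scene from a single depth image D0."""
    c = components
    P = c.backproject(D0, camera, view=0)              # 1. depth -> point cloud P_0
    V_c = c.complete_volume(P)                         # volumetric completion (SSCNet branch)
    init_holes = c.hole_area(P)
    for i in range(1, max_steps + 1):
        v_i = c.select_next_view(P)                    # DQN picks the next-best view
        D_i = c.render_depth(P, camera, v_i)           # 3. re-rendering leaves holes
        D_i_c = c.project_volume(V_c, camera, v_i)     # volume guidance under the same view
        D_hat = c.inpaint_depth(D_i, D_i_c)            # 4. 2D depth inpainting
        P = c.merge(P, c.backproject(D_hat, camera, v_i))  # 5. accumulate the new points
        if c.hole_area(P) / init_holes < hole_thresh:  # convergence: almost all holes filled
            break
    return P
```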
Overview
• ① 2D CNN (volume-guided view inpainting network)
• ② DQN
Volume-guided View Inpainting
• Composed of three subnetworks
• Volume Completion
Completes the volumetric occupancy grid V of P_0 into V^c
• Depth Inpainting
Takes the depth image D_i and the depth image D_i^c obtained by projecting V^c to viewpoint v_i as its two inputs, and produces the inpainted depth image D̂_i
• Projection Layer
A differentiable layer connecting D_i^c and V^c
(Slide figure: detailed pipeline of the volume-guided view inpainting network: the input D_0 is completed by SSCNet into V^c, the Projection Layer renders V^c under view v_i as D_i^c, and D_i together with D_i^c is fed to the Depth Inpainting network to produce D̂_i.)
Volume-guided View Inpainting
Volume Completion
• Uses SSCNet [Song+ CVPR2017] (f: V ↦ V^c)
• The original network also predicts semantic labels; here no semantic labels are predicted
→ trained as voxel-wise binary classification (empty/full), as sketched below
• From an input resolution of 240x144x240, outputs a 60x36x60 probability volume (1/4 resolution)
[Song+ CVPR2017]: S. Song et al. "Semantic Scene Completion from a Single Depth Image", In CVPR, 2017.
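A minimal sketch of the voxel-wise binary objective described above. `completion_net` is a hypothetical 3D CNN standing in for the modified SSCNet; the tensor shapes follow the resolutions stated on the slide, and the loss choice (binary cross-entropy) is an assumption consistent with "binary classification (empty/full)".

```python
import torch
import torch.nn.functional as F

def volume_completion_loss(completion_net, V_in, V_gt):
    """Voxel-wise empty/full objective for the volume branch.

    V_in:  (B, 1, 240, 144, 240) occupancy grid built from the visible points
    V_gt:  (B, 1, 60, 36, 60) ground-truth occupancy at 1/4 resolution
    completion_net: 3D CNN mapping V_in to per-voxel occupancy logits
    """
    logits = completion_net(V_in)                       # (B, 1, 60, 36, 60)
    return F.binary_cross_entropy_with_logits(logits, V_gt)
```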
Volume-guided View Inpainting
Depth Inpainting
• Depth images are rendered as 512x512 grayscale images
• Adopts [Liu+ ECCV2018], which handles irregular holes
• The depth images D_i and D_i^c are concatenated (concat) and fed into a U-Net-like network
• The output image size is 512x512, the same as the input
(Embedded slide figure: "Image Inpainting for Irregular Holes Using Partial Convolutions", G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, B. Catanzaro, NVIDIA Corporation.)
Partial convolution at each sliding-window location (W: convolution filter weights, b: bias, X: feature values in the current window, M: the corresponding binary mask, ⊙: element-wise multiplication, 1: a tensor of ones with the same shape as M):
x' = W^T (X ⊙ M) · sum(1)/sum(M) + b  if sum(M) > 0, otherwise x' = 0
Output values depend only on unmasked inputs; the scaling factor sum(1)/sum(M) adjusts for the varying number of valid (unmasked) inputs.
[Difference from U-Net] Partial Convolutional Layer
[Liu+ ECCV2018]: G. Liu et al. “Image Inpainting for Irregular Holes Using Partial Convolutions”, In ECCV, 2018.
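A simplified sketch of a partial convolution layer following the formula above; it is not NVIDIA's implementation, and the mask update is reduced to a single-channel "any valid pixel in the window" rule for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Simplified partial convolution: convolve only valid pixels and
    renormalize by the number of valid inputs in each window."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("ones", torch.ones(1, in_ch, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # x: (B, C, H, W) features; mask: (B, C, H, W) with 1 = valid, 0 = hole.
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
            scale = self.ones.sum() / valid.clamp(min=1.0)   # sum(1) / sum(M)
            new_mask = (valid > 0).float()                   # mask update step
        out = self.conv(x * mask)                            # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        # Rescale the unbiased response, zero out windows with no valid input.
        return (out - bias) * scale * new_mask + bias * new_mask, new_mask
```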
Volume-guided View Inpainting
Projection Layer
• Lets the 2D loss be used to optimize the parameters of the 3D CNN
• Differentiable projection layer [Tulsiani+ CVPR2017]
Depth D(x) of pixel x:
D(x) = Σ_{k=1}^{N_x} P_k^x d_k
P_k^x = (1 − V_k) Π_{j=1}^{k−1} V_j,  k = 1, …, N_x
∂D(x)/∂V_k = Σ_{i=k}^{N_x} (d_{i+1} − d_i) Π_{1≤j≤i, j≠k} V_j
k: index of a voxel hit by the ray through pixel x (smaller means nearer; N_x is the number of voxels the ray passes through)
P_k^x: probability that the ray first hits the k-th voxel
V_k: occupancy value of the k-th voxel (probability of being empty)
d_k: distance to the k-th voxel
[Tulsiani+ CVPR2017]: S. Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency”, In CVPR, 2017.
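A per-ray sketch of the expected-depth computation above. It ignores the case where the ray escapes all voxels and relies on autograd to reproduce the gradient formula, so it is an illustration under those assumptions rather than the paper's projection layer.

```python
import torch

def expected_ray_depth(V_ray, d_ray):
    """Differentiable expected depth along one ray.

    V_ray: (N,) probability that each traversed voxel is empty, ordered near to far
    d_ray: (N,) distance from the camera to each traversed voxel
    Returns D(x) = sum_k (1 - V_k) * prod_{j<k} V_j * d_k.
    """
    ones = torch.ones(1, dtype=V_ray.dtype, device=V_ray.device)
    empty_before = torch.cumprod(torch.cat([ones, V_ray[:-1]]), dim=0)  # prod_{j<k} V_j
    hit_prob = (1.0 - V_ray) * empty_before                             # P_k^x
    return (hit_prob * d_ray).sum()

# Usage: because D(x) is differentiable in V_ray, a 2D depth loss backpropagates
# into the volume; autograd reproduces the closed-form gradient on the slide.
V_ray = torch.rand(32, requires_grad=True)
d_ray = torch.linspace(0.5, 4.0, 32)
loss = (expected_ray_depth(V_ray, d_ray) - 2.0).abs()
loss.backward()
```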
Volume-guided View Inpainting
Joint Training
• Training is split into three stages to guarantee convergence (a schedule sketch follows below)
1. Train SSCNet separately
2. Freeze SSCNet and train the Depth Inpainting network
3. Fine-tune the entire network
• Training data come from SUNCG (synthetic scenes)
• N depth images are rendered from random scenes and viewpoints
• Each depth image is converted to a point cloud and projected to m viewpoints sampled randomly near the original one
→ this avoids opening overly large holes and keeps sufficient context available
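A small sketch of the three-stage schedule as a freeze/unfreeze routine. The optimizer choice and learning rates are assumptions (the slide does not state them); `sscnet` and `inpaint_net` are any `nn.Module` instances standing in for the two branches.

```python
import torch

def configure_stage(stage, sscnet, inpaint_net, lr=1e-4):
    """Return an optimizer for the given training stage by freezing or
    unfreezing the two branches as described on the slide."""
    if stage == 1:                         # 1. train SSCNet alone
        params = list(sscnet.parameters())
    elif stage == 2:                       # 2. freeze SSCNet, train depth inpainting
        for p in sscnet.parameters():
            p.requires_grad = False
        params = list(inpaint_net.parameters())
    else:                                  # 3. fine-tune everything end-to-end
        for p in sscnet.parameters():
            p.requires_grad = True
        params = list(sscnet.parameters()) + list(inpaint_net.parameters())
        lr = lr * 0.1                      # assumed smaller fine-tuning rate
    return torch.optim.Adam([p for p in params if p.requires_grad], lr=lr)
```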
Overview
• ① 2D CNN (volume-guided view inpainting network)
• ② DQN
Progressive Scene Completion
• The optimal view sequence is found with a DQN
• From the incomplete point cloud P_0, obtain v_1, …, v_n
• Markov decision process (MDP)
• State
• The updated point cloud
• Action
• k uniformly spaced viewpoints on a support sphere centered on the scene
• Here, 10 viewpoints each on the equator and the 45° latitude circle (20 in total)
• Reward
• Inpainting accuracy R_i^acc
• Hole area R_i^hole
(Diagram: the standard RL loop. Given state s_t, the agent takes action a_t in the environment and receives reward r_{t+1} and the next state s_{t+1}.)
DQN
• Reward
• Inpainting accuracy
R_i^acc = −(1/|Ω|) L¹_Ω(D̂_i, D_i^gt)
Ω: set of pixels inside the holes, L¹: L1 loss, D_i^gt: ground-truth depth rendered in advance
• Hole area
R_i^hole = (Area_h(P_{i−1}) − Area_h(P_i)) / Area_h(P_0) − 1
Area_h(P): total hole area when P is projected to all cameras
• Final reward (a sketch of the reward computation follows below)
R_i^total = w R_i^acc + (1 − w) R_i^hole
• Convergence
• Area_h(P_i) / Area_h(P_0) < 5% … almost all holes have been completed
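A minimal sketch of the reward terms above. The Area_h values are assumed to be precomputed by projecting the point cloud to all cameras and summing the hole areas; the function names are illustrative, not from the paper.

```python
import torch

def inpainting_reward(D_hat, D_gt, hole_mask):
    """R_i^acc: negative mean L1 error over the pixels inside the holes (Ω)."""
    err = (D_hat - D_gt).abs() * hole_mask
    return -(err.sum() / hole_mask.sum().clamp(min=1.0))

def hole_reward(area_prev, area_cur, area_init):
    """R_i^hole: relative reduction of the total hole area, with the constant
    per-step offset as written on the slide."""
    return (area_prev - area_cur) / area_init - 1.0

def total_reward(r_acc, r_hole, w=0.7):
    """R_i^total = w * R_i^acc + (1 - w) * R_i^hole (w = 0.7 per the slides)."""
    return w * r_acc + (1.0 - w) * r_hole
```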
(Figure 3, DQN architecture: multi-view depth renderings of the point-cloud state pass through per-view CNNs, a view-pooling layer, and fully connected layers of size 512 and 256 to produce 20 Q-values. To support the DQN, a set of scene-centric camera views forms the discrete action space: P_0 is placed upright in its bounding sphere, two circular paths are created on the equator and the 45-degree latitude line, and 20 camera views are uniformly selected on these paths, 10 per circle, all facing the center of the bounding sphere. The views are fixed for all training samples and denoted C = {c_1, c_2, ..., c_20}.)
DQN
• Based on MVCNN [Su+ ICCV2015]
• Multi-view depth images are generated from P_{i−1}
• Outputs a Q-value for each action
• Depth images are 224x224
• Dueling DQN structure [Wang+ ICML2016]
• Loss function (a sketch follows below)
• Standard DQN:
Loss(θ) = E[(r + γ max_{v_{i+1}} Q(P_i, v_{i+1}; θ′) − Q(P_{i−1}, v_i; θ))²]
• This work removes the upward bias introduced by the max_{v_{i+1}} Q(P_i, v_{i+1}; θ′) term [Hasselt+ AAAI2016]:
Loss(θ) = E[(r + γ Q(P_i, argmax_{v_{i+1}} Q(P_i, v_{i+1}; θ); θ′) − Q(P_{i−1}, v_i; θ))²]
[Su+ ICCV2015]: H. Su et al. "Multi-view Convolutional Neural Networks for 3D Shape Recognition", In ICCV, 2015.
[Wang+ ICML2016]: Z. Wang et al. "Dueling Network Architectures for Deep Reinforcement Learning", In ICML, 2016.
[Hasselt+ AAAI2016]: H. van Hasselt et al. "Deep Reinforcement Learning with Double Q-Learning", In AAAI, 2016.
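A minimal sketch of the double-DQN loss above. `q_net` and `target_net` are hypothetical networks mapping a batched state representation (e.g. the multi-view depth rendering of the point cloud) to 20 Q-values; `v` is a batch of chosen view indices and `r` a batch of rewards.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, P_prev, v, r, P_next, gamma=0.9):
    """Double-DQN target: the action is chosen by the online network and
    evaluated by the target network, removing the upward bias of the max."""
    q_pred = q_net(P_prev).gather(1, v.unsqueeze(1)).squeeze(1)    # Q(P_{i-1}, v_i; θ)
    with torch.no_grad():
        best_next = q_net(P_next).argmax(dim=1, keepdim=True)      # argmax_v Q(P_i, v; θ)
        q_target = r + gamma * target_net(P_next).gather(1, best_next).squeeze(1)
    return F.mse_loss(q_pred, q_target)
```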
DQN
• ε-greedy policy
• Draw a random number; with probability ε choose the action with the maximum Q-value, otherwise act randomly
• (the usual convention is the opposite: random with probability ε, maximum Q-value with probability 1 − ε)
• Experience replay
• 200 episodes are recorded
• Tuples (P_{i−1}, v_i, r, P_i) are randomly sampled from these episodes as training data
• Implementation details (a small sketch of the action selection and replay buffer follows below)
• Reward weight w is 0.7
• Discount factor is 0.9
• ε goes from 0.9 to 0.2 (then fixed) over 10,000 steps
• Training takes 3 days; inference takes 60 seconds (5 views on average)
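A small sketch of the ε-greedy selection (using the slide's non-standard convention) and a replay buffer. The buffer here stores individual transitions rather than whole episodes, and the linear ε schedule shape is an assumption; only the 0.9 → 0.2 range and the 10,000-step horizon come from the slide.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (P_prev, v, r, P_next) transitions."""
    def __init__(self, capacity=200):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def select_view(q_values, epsilon):
    """With probability epsilon pick the greedy (max-Q) view, otherwise random,
    matching the convention described on the slide."""
    if random.random() < epsilon:
        return int(max(range(len(q_values)), key=lambda a: q_values[a]))
    return random.randrange(len(q_values))

def epsilon_schedule(step, start=0.9, end=0.2, decay_steps=10_000):
    """Assumed linear decay from 0.9 to 0.2 over 10,000 steps, then fixed."""
    t = min(step / decay_steps, 1.0)
    return start + (end - start) * t
```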
Dataset
• SUNCG dataset
• 2D CNN (Depth Inpainting)
• 30,000 depth images (N = 3,000, m = 10)
• Viewpoints occluded by doors or walls are removed
• 3,000 for testing, the rest for training
• DQN (View Path Planning)
• N = 2,500
• 2,300 for training, 200 for testing
Comparison Against SOTA
• Proposes a way to compare volume-based and point-based methods
• The accuracy obtained from the ground-truth volume can be treated as an upper bound for all existing volume-based scene completion methods
• The GT volume, SSCNet, and the proposed method are rendered from various viewpoints to obtain depth maps, which are converted to point clouds
• Quantitative evaluation (a sketch of both metrics follows below)
• Chamfer distance (CD)
• Completeness
C_r(P, P_GT) = |{x ∈ P_GT : d(x, P) < r}| / |{y : y ∈ P_GT}|
d(x, P): distance between x and the point cloud P, |·|: number of elements
r: distance threshold (0.02, 0.04, 0.06, 0.08, 0.10)
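A minimal sketch of the two metrics using nearest-neighbor queries. The exact normalization of the paper's Chamfer distance is not stated on the slide, so the symmetric mean-distance form below is an assumption; the completeness follows the formula above directly.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (N,3) and Q (M,3)."""
    d_pq, _ = cKDTree(Q).query(P)   # nearest-neighbor distances P -> Q
    d_qp, _ = cKDTree(P).query(Q)   # nearest-neighbor distances Q -> P
    return d_pq.mean() + d_qp.mean()

def completeness(P, P_gt, r):
    """C_r: fraction of ground-truth points within distance r of the prediction P."""
    d, _ = cKDTree(P).query(P_gt)
    return float((d < r).mean())
```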
Comparison Against SOTA
Figure 5. Comparisons against the state of the art: given different inputs and the reference ground truth, the completion results of three methods are shown (columns: Input & GT, SSCNet, Volume-GT, Ours), with the corresponding point-cloud error maps below and zoom-in areas beside.
Top: estimated results (left) and zoom of the red box (right); bottom: error maps (left) and zoom of the red box (right)

Table 1. Quantitative comparisons against existing methods (CD metric and completeness metric w.r.t. different thresholds).
             SSCNet   Volume GT1  ScanComplete  Volume GT2  U5       U10      DQN w/o hole  Ours
CD           0.5162   0.5140      0.2193        0.2058      0.1642   0.1841   0.1495        0.1148
C r=0.002 %  14.61    13.28       34.46         31.18       79.18    80.17    79.22         79.26
C r=0.004 %  30.10    32.23       58.83         61.11       83.33    84.15    83.50         83.68
C r=0.006 %  52.82    50.14       74.60         74.88       85.81    86.56    86.02         86.28
C r=0.008 %  71.24    72.33       79.59         81.04       87.66    88.33    87.81         88.20
C r=0.010 %  78.23    78.96       81.01         81.61       89.06    89.70    89.24         89.68
Ablation Studies
• Depth Inpainting
• DepIn w/o VG: without volume guidance
• DepIn w/o PBP: without projection back-propagation
• View Path Planning
• U5, U10: uniform sampling with a fixed number of views
• DQN w/o-hole: reward uses only R_i^acc

Table 2. Quantitative ablation studies on the inpainting network.
        DepIn w/o VG  DepIn w/o PBP  Ours
L1_Ω    0.0717        0.0574         0.0470
PSNR    22.15         23.12          24.73
SSIM    0.910         0.926          0.930

Figure 4. Comparisons on variants of the depth inpainting network (columns: Input & GT, DepIn w/o VG, DepIn w/o PBP, Ours).
(On view path planning, from the paper: without a DQN, a straightforward alternative is to uniformly sample a fixed number of views from C and directly perform depth inpainting on them; two such baselines with 5 and 10 views are evaluated, shown as U5 and U10 above. Qualitative comparison columns: Input & GT, U5, U10, DQN w/o-hole, Ours.)
Conclusion
• Proposed a point-based 3D scene completion method from a single depth image
• The missing points are estimated by inpainting multi-view depth images
• Volume-guided view inpainting is proposed to guarantee accurate and consistent results
• A reinforcement learning framework is devised to search for the optimal view path
• Future work
1. Use texture information for depth-image inpainting → take RGB-D as input
2. Perform texture completion together with depth-image inpainting to output textured scenes