DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time
Richard A. Newcombe, Dieter Fox, Steven M. Seitz
CVPR 2015, Best Paper Award
Paper introduction: Ken Sakurada (Tokyo Institute of Technology), June 23, 2015
Presenter
Ken Sakurada ( http://www.vision.is.tohoku.ac.jp/us/member/sakurada )
• Postdoctoral researcher, Tokyo Institute of Technology (from April 2015)
• Graduated from the Okatani Lab, Tohoku University (March 2015)
• Twitter ID: @sakuDken
• Research topics
– Spatio-temporal modeling of cities using vehicle-mounted cameras
– SfM, MVS, CNN, ...
• Main publications
– CVPR (Poster) 1 paper
– BMVC (Poster) 1 paper
– ACCV (Oral) 1 paper
• "Best Application Paper Honorable Mention Award"
– IROS, ...
If you notice anything about the content, I would appreciate hearing from you.
Email: sakurada@ok.ctrl.titech.ac.jp
sakurada@vision.is.tohoku.ac.jp
Authors of DynamicFusion
University of Washington
• Richard Newcombe (PhD)
– Author of KinectFusion and DTAM
• Dieter Fox (Professor)
– Author of "Probabilistic Robotics" (the standard SLAM reference)
• Steven Seitz (Professor)
– A leading figure in SfM and related areas
DynamicFusion
A dense SLAM system
• Fuses depth images to reconstruct a dynamic scene in 3D in real time
– Extends KinectFusion to dynamic scenes
Video: https://www.youtube.com/watch?v=i1eZekcc_lM
Figure 2 (from the paper): DynamicFusion takes an online stream of noisy depth maps (a: initial frame at t = 0 s; b: raw, noisy depth maps at t = 1, 10, 15, 20 s) and outputs a real-time dense reconstruction of the moving scene (d: canonical model; e: canonical model warped into its live frame). To achieve this, we estimate a volumetric warp (motion) field that transforms the canonical model space into the live frame, enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d; f: model normals). Simultaneously, the structure of the warp field is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through a k-nearest node average in the canonical frame (c: node distance). The resulting per-frame warp field estimate enables the progressively denoised and completed scene geometry to be transformed into the live frame in real time (e).
KinectFusion
A dense SLAM system for static scenes
• Builds a dense surface model from multiple depth images
• Estimates the camera pose by registering the latest depth image against the model
"KinectFusion: Real-Time Dense Surface Mapping and Tracking" (ISMAR 2011)
Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison,
Pushmeet Kohli, Jamie Shotton, Steve Hodges, Andrew Fitzgibbon
(Image: title page of the KinectFusion paper.) Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure. Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).
Video: https://www.youtube.com/watch?v=quGhaggn3cQ
KinectFusion: Processing pipeline
KinectFusion: Camera motion
• The camera pose is represented as a 6-DoF rigid-body transformation
– Transforms the local depth map into the global surface coordinate frame
KinectFusion: From depth map to vertices and normals
• Apply a bilateral filter to the depth map (noise reduction)
• Vertices (3D points):
  $\mathbf{v}_k = D_k(\mathbf{u})\, K^{-1} (u, v, 1)^\top$
  $D_k(\mathbf{u})$: depth at pixel $\mathbf{u} = (u, v)^\top$, $K$: camera intrinsic matrix
• Transform the 3D points from camera to world coordinates:
  $\mathbf{v}^w = T_w \mathbf{v}_k$
• Unit normal vectors
– Estimated from the 3D points of neighboring pixels:
  $\mathbf{N}(x, y) = \dfrac{\mathbf{a} \times \mathbf{b}}{\|\mathbf{a} \times \mathbf{b}\|}$
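To make the back-projection and normal estimation concrete, here is a minimal NumPy sketch (not the authors' code; the intrinsics K and depth array are assumed inputs, and normals are computed with simple forward differences):

```python
import numpy as np

def depth_to_vertex_map(depth, K):
    """Back-project a depth map: v_k = D_k(u) * K^-1 [u, v, 1]^T for every pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (h, w, 3)
    rays = pix @ np.linalg.inv(K).T            # K^-1 [u, v, 1]^T per pixel
    return rays * depth[..., None]             # scale each ray by its depth

def vertex_to_normal_map(vertices):
    """Estimate unit normals from neighbouring 3D points: N = (a x b) / |a x b|."""
    a = vertices[:, 1:, :] - vertices[:, :-1, :]   # difference to the right neighbour
    b = vertices[1:, :, :] - vertices[:-1, :, :]   # difference to the lower neighbour
    n = np.cross(a[:-1, :, :], b[:, :-1, :])       # crop so the shapes match
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12)

# Usage with dummy data (intrinsic values are illustrative, not from the paper):
K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])
depth = np.full((480, 640), 1.5)               # flat wall 1.5 m away
V = depth_to_vertex_map(depth, K)
N = vertex_to_normal_map(V)
```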
KinectFusion: Processing pipeline
Joint estimation of camera motion and surface
(Pipeline diagram: from depth to a dense oriented point cloud. Raw depth → Depth Map Conversion (measured vertex and normal maps) → Model-Frame Camera Track (ICP; ICP outliers rejected) → Volumetric Integration (TSDF Fusion, using the 6DoF pose and raw depth) → Model Rendering (TSDF Raycast) → predicted vertex and normal maps, fed back into tracking.)
3D reconstruction is possible once the camera motion is known
If the camera motion is known...
...the measurements can be fused (surface generation)...
...and once the 3D shape is known...
...a new surface measurement can be registered against it...
...minimizing the surface measurement error...
...yields the camera pose, so the depth maps can be fused...
KinectFusion: Surface representation
KinectFusion: Surface representation
Problems
• Depth maps have large measurement errors
• Depth maps contain holes (missing measurements)
Solution
• Use an implicit surface representation
Reference: http://slideplayer.com/slide/3892185/#
KinectFusion: Surface representation
Truncated Signed Distance Function (TSDF)
• Voxel grid
• [TSDF value] = [pixel depth] − [distance from the sensor to the voxel]
Reference: http://slideplayer.com/slide/3892185/#
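A hedged sketch of this projective TSDF value for a single voxel: project the voxel centre into the depth image and take the truncated difference between the measured depth and the voxel's depth along the optical axis (function and parameter names, and the truncation distance, are my own; the depth difference along z is a common simplification of the distance along the ray):

```python
import numpy as np

def projective_tsdf(voxel_center_cam, depth, K, trunc=0.03):
    """[TSDF value] = [pixel depth] - [depth of the voxel], truncated and normalised."""
    x, y, z = voxel_center_cam
    if z <= 0:
        return None                              # behind the camera: no measurement
    u = int(round(K[0, 0] * x / z + K[0, 2]))    # project the voxel centre into the image
    v = int(round(K[1, 1] * y / z + K[1, 2]))
    if not (0 <= u < depth.shape[1] and 0 <= v < depth.shape[0]):
        return None                              # projects outside the image
    d = depth[v, u]
    if d <= 0:
        return None                              # hole in the depth map
    sdf = d - z                                  # positive in front of the surface, negative behind
    if sdf < -trunc:
        return None                              # far behind the surface: occluded, do not update
    return min(sdf, trunc) / trunc               # truncate and normalise to [-1, 1]
```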
KinectFusion: Surface representation
Updating the TSDF $(F_k, W_k)$ ($F_k$: TSDF value, $W_k$: weight)
• Simply setting $W_{R_k}(\mathbf{p}) = 1$ already gives good results

From the paper: storing a weight $W_k(\mathbf{p})$ with each value allows the global minimum of the convex L2 de-noising metric to be obtained incrementally as more data terms are added, using a simple weighted running average, defined point-wise for $\{\mathbf{p} \mid F_{R_k}(\mathbf{p}) \neq \mathrm{null}\}$:

$F_k(\mathbf{p}) = \dfrac{W_{k-1}(\mathbf{p})\, F_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p})\, F_{R_k}(\mathbf{p})}{W_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p})} \qquad (11)$

$W_k(\mathbf{p}) = W_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p}) \qquad (12)$

No update on the global TSDF is performed for values resulting from unmeasurable regions. While $W_k(\mathbf{p})$ provides weighting of the TSDF proportional to the uncertainty of the surface measurement, in practice simply letting $W_{R_k}(\mathbf{p}) = 1$, resulting in a simple average, provides good results. Moreover, the updated weight is truncated at some value $W_\eta$:

$W_k(\mathbf{p}) \leftarrow \min\!\left( W_{k-1}(\mathbf{p}) + W_{R_k}(\mathbf{p}),\ W_\eta \right) \qquad (13)$

(Annotation on the equation: the first term is the accumulation so far, the second term is the new observation.)
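A minimal sketch of this weighted running average over a voxel grid, assuming per-frame TSDF values are stored with NaN for unmeasured voxels, $W_{R_k} = 1$ as suggested above, and a hypothetical cap standing in for $W_\eta$:

```python
import numpy as np

def fuse_tsdf(F_prev, W_prev, F_new, W_new=1.0, w_max=100.0):
    """Weighted running average of TSDF values (Eqs. 11-13):
    F_k = (W_{k-1} F_{k-1} + W_Rk F_Rk) / (W_{k-1} + W_Rk),  W_k = min(W_{k-1} + W_Rk, W_eta)."""
    valid = ~np.isnan(F_new)                     # only update voxels with a new measurement
    F, W = F_prev.copy(), W_prev.copy()
    F[valid] = (W_prev[valid] * F_prev[valid] + W_new * F_new[valid]) / (W_prev[valid] + W_new)
    W[valid] = np.minimum(W_prev[valid] + W_new, w_max)
    return F, W
```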
KinectFusion: Sensor pose estimation
Minimize the point-to-plane energy (distance from points to planes)
From the paper: utilising the surface prediction $(\hat{V}_{k-1}, \hat{N}_{k-1})$, the global point-plane energy under the L2 norm for the desired camera pose estimate $T_{g,k}$ is

$E(T_{g,k}) = \sum_{\substack{\mathbf{u} \in \mathcal{U} \\ \Omega_k(\mathbf{u}) \neq \mathrm{null}}} \left\| \left( T_{g,k}\, \dot{\mathbf{V}}_k(\mathbf{u}) - \hat{\mathbf{V}}^g_{k-1}(\hat{\mathbf{u}}) \right)^{\!\top} \hat{\mathbf{N}}^g_{k-1}(\hat{\mathbf{u}}) \right\|_2 \qquad (16)$

where each global frame surface prediction is obtained using the previous fixed pose estimate $T_{g,k-1}$. The projective data association algorithm produces the set of vertex correspondences $\{\mathbf{V}_k(\mathbf{u}), \hat{\mathbf{V}}_{k-1}(\hat{\mathbf{u}}) \mid \Omega(\mathbf{u}) \neq \mathrm{null}\}$ by computing the perspectively projected point $\hat{\mathbf{u}} = \pi(K \tilde{T}_{k-1,k}\, \dot{\mathbf{V}}_k(\mathbf{u}))$ using an estimate of the frame-to-frame transform $\tilde{T}^z_{k-1,k} = T^{-1}_{g,k-1} \tilde{T}^z_{g,k}$, and testing the predicted and measured vertex and normal for compatibility. A threshold on the distance of vertices and the difference in normals suffices to reject grossly incorrect correspondences.

Point-to-plane ICP (Low 2004): introduced by Chen and Medioni, the object of minimization is the sum of squared distances between each source point and the tangent plane at its corresponding destination point. If $\mathbf{s}_i = (s_{ix}, s_{iy}, s_{iz}, 1)^\top$ is a source point, $\mathbf{d}_i = (d_{ix}, d_{iy}, d_{iz}, 1)^\top$ is the corresponding destination point, and $\mathbf{n}_i = (n_{ix}, n_{iy}, n_{iz}, 0)^\top$ is the unit normal at $\mathbf{d}_i$, then the goal of each ICP iteration is to find

$\mathbf{M}_{\mathrm{opt}} = \arg\min_{\mathbf{M}} \sum_i \left( (\mathbf{M}\mathbf{s}_i - \mathbf{d}_i) \cdot \mathbf{n}_i \right)^2 \qquad (1)$

where $\mathbf{M}$ and $\mathbf{M}_{\mathrm{opt}}$ are 4×4 3D rigid-body transformation matrices.

(Figure 1 of Low 2004: point-to-plane error between a source surface and a destination surface; each source point $\mathbf{s}_i$ is paired with a destination point $\mathbf{d}_i$, its unit normal $\mathbf{n}_i$, and the tangent plane at $\mathbf{d}_i$.)

Figure reference: Low, Kok-Lim. "Linear least-squares optimization for point-to-plane ICP surface registration." Chapel Hill, University of North Carolina (2004).

Error between the two surfaces
KinectFusion: Sensor pose estimation
• Minimize the point-to-plane distance
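As a concrete illustration of minimizing the point-to-plane energy, here is a minimal sketch of one linearized Gauss-Newton step (small-angle approximation; correspondences are assumed given, whereas KinectFusion finds them by projective data association; names are my own):

```python
import numpy as np

def point_to_plane_step(src, dst, normals):
    """One linearized step of point-to-plane ICP.
    Minimizes sum_i ((R(w) s_i + t - d_i) . n_i)^2 with R(w) ~ I + [w]_x."""
    # Residual r_i = (s_i - d_i) . n_i; Jacobian row is [s_i x n_i, n_i] w.r.t. (w, t).
    A = np.hstack([np.cross(src, normals), normals])      # (N, 6)
    b = -np.sum((src - dst) * normals, axis=1)             # (N,)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)               # x = [w, t]
    wx, wy, wz, tx, ty, tz = x
    R = np.array([[1, -wz, wy],
                  [wz, 1, -wx],
                  [-wy, wx, 1]])                            # small-angle rotation
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, [tx, ty, tz]
    return T                                                # apply to src, then iterate
```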
KinectFusion: Sensor pose estimation
Point-plane correspondence and pose optimization
• Use vertex and normal map pyramids
– Build depth and normal maps at $L = 3$ resolutions: $\mathbf{V}^{l \in \{1..L\}}$, $\mathbf{N}^{l \in \{1..L\}}$
• Coarse-to-fine optimization (see the sketch below)
Figure reference: http://razorvision.tumblr.com/post/15039827747/how-kinect-and-kinect-fusion-kinfu-work
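A sketch of the L = 3 coarse-to-fine depth pyramid (simple 2×2 block averaging that ignores missing pixels; the real system works on the bilateral-filtered depth and recomputes vertex/normal maps per level, so treat this only as an illustration):

```python
import numpy as np

def build_depth_pyramid(depth, levels=3):
    """Return [full res, 1/2 res, 1/4 res] depth maps; zeros are treated as missing."""
    pyramid = [depth]
    for _ in range(levels - 1):
        d = pyramid[-1]
        h, w = (d.shape[0] // 2) * 2, (d.shape[1] // 2) * 2
        blocks = d[:h, :w].reshape(h // 2, 2, w // 2, 2)   # group pixels into 2x2 blocks
        valid = blocks > 0
        summed = (blocks * valid).sum(axis=(1, 3))
        count = valid.sum(axis=(1, 3))
        pyramid.append(np.where(count > 0, summed / np.maximum(count, 1), 0.0))
    return pyramid
```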
KinectFusion: Experimental results
Frame-to-frame tracking
• Accumulated error causes drift
Frame-to-model tracking
• Scan matching against the global model
• Drift-free
• More accurate than frame-to-frame tracking
KinectFusion: Experimental results
Voxel resolution vs. processing time
From top to bottom:
• Depth map integration
• Raycasting for surface generation
• Camera pose optimization using the pyramid maps
• Correspondence search at each pyramid scale
• Depth map preprocessing
(Plot: processing time in ms versus voxel resolution, from 64³ to 512³.) Figure 12 (from the paper): A reconstruction result using 1/64 the memory (64³ voxels) of the previous figures, and using only every 6th sensor frame, demonstrating graceful degradation with drastic reductions in memory and processing resources.
KinectFusion: Limitations
• Drift over long trajectories
– Explicit loop closing is required
• Sufficient geometric constraints are needed
– e.g., a single plane constrains only 3 degrees of freedom
• Extension to large-scale scenes
– A uniform voxel grid requires enormous memory and computation
– Since most of the volume is empty, an octree-based SDF can be used
From KinectFusion to DynamicFusion
Assumption in KinectFusion
• Most of the observed scene does not change
DynamicFusion
• Extends KinectFusion to dynamic, non-rigid scenes while keeping real-time performance
(Images: title pages of the two papers.)
"KinectFusion: Real-Time Dense Surface Mapping and Tracking" (Newcombe et al., ISMAR 2011) — see the citation above.
"DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time" — Richard A. Newcombe, Dieter Fox, Steven M. Seitz, University of Washington, Seattle.
Figure 1 (DynamicFusion): Real-time reconstructions of a moving scene with DynamicFusion; both the person and the camera are moving. The initially noisy and incomplete model is progressively denoised and completed over time (left to right).
DynamicFusion: Overview
• Input: a stream of noisy depth images
• Output: a dense 3D reconstruction of the dynamic scene in real time
DynamicFusion: Overview
• The warp field is represented as a weighted average of the warps (6 DoF each) of sparse nodes
• Each voxel of the canonical space is warped into the live frame
(Figure 2 shown again.) From the paper: for each canonical point $v_c \in S$, $T_{lc} = \mathcal{W}(v_c)$ transforms that point from canonical space into the live, non-rigidly deformed frame of reference; the warp is interpolated from the k-nearest transformation nodes, and $SE3(\cdot)$ converts the blended result into an SE(3) transformation matrix.
(Figure labels: sparse nodes, interpolation)
Estimate the warp field $\mathcal{W}_t$
DynamicFusion: Overview
TSDF values are fused with respect to the live frame
• Rays of the live frame become curved in the canonical space
Figure 3 (from the paper): An illustration of how each point in the canonical frame maps, through a correct warp field, onto a ray in the live camera frame when observing a deforming scene. In (a) the first view of a dynamic scene is observed; in the corresponding canonical frame, the warp is initialized to the identity transform and the three rays shown in the live frame also map as straight lines in the canonical frame. As the scene deforms in the live frame (b), the warp function transforms each point from the canonical frame into the corresponding live-frame location, causing the corresponding rays to bend (c). Note that this warp can be achieved with two 6D deformation nodes (shown as circles), where the left node applies a clockwise twist. In (d) a new scene is shown that includes a cube about to occlude the bar (panels d-f illustrate the introduction of an occlusion).
DynamicFusion: Geometric representation
Truncated Signed Distance Function (TSDF)
For a $256^3$ voxel grid:
• KinectFusion
– Estimates only the camera pose (6 parameters)
• DynamicFusion
– Would have to estimate $6 \times 256^3$ parameters per frame
– About 10 million times more than KinectFusion
Estimating a warp field for every voxel is infeasible
DynamicFusion: Dense non-rigid warp field
Warp function
Basis: transformations of sparse nodes
$n$ transformation nodes $\mathcal{N}^t_{\mathbf{warp}} = \{\mathbf{dg}_v, \mathbf{dg}_w, \mathbf{dg}_{se3}\}$
• $\mathbf{dg}^i_v \in \mathbb{R}^3$: position in the canonical space
• $\mathbf{dg}^i_{se3} = T_{ic}$: transformation parameters of node $i$
• $\mathbf{dg}^i_w$: radial basis weight (influence radius of the node)
Influence of a node at a given point:
$w_i(x_c) = \exp\!\left( - \left\| \mathbf{dg}^i_v - x_c \right\|^2 \big/ \left( 2 (\mathbf{dg}^i_w)^2 \right) \right)$
$\mathbf{S}$: canonical space, $x_c \in \mathbf{S}$: center of each voxel
DynamicFusion: Dense non-rigid warp field
Warp function
Interpolation: a dense per-voxel warp
$\mathcal{W}(x_c) \equiv SE3(\mathbf{DQB}(x_c))$
$\mathbf{DQB}(x_c) = \dfrac{\sum_{k \in N(x_c)} \mathbf{w}_k(x_c)\, \hat{\mathbf{q}}_{kc}}{\left\| \sum_{k \in N(x_c)} \mathbf{w}_k(x_c)\, \hat{\mathbf{q}}_{kc} \right\|}$
Unit dual quaternion: $\hat{\mathbf{q}}_{kc} \in \mathbb{R}^8$
$SE3(\cdot)$: converts a dual quaternion into a 3D Euclidean transformation matrix
Background: Dual quaternions
Quaternions (William Rowan Hamilton, 1843)
• Rotation only
$\mathbf{q} = \cos\frac{\theta}{2} + (u_x \mathbf{i} + u_y \mathbf{j} + u_z \mathbf{k}) \sin\frac{\theta}{2}$
$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{ijk} = -1$
Unit vector: $\mathbf{u} = u_x \mathbf{i} + u_y \mathbf{j} + u_z \mathbf{k}$
• Rotating a 3D point $\mathbf{p} = p_x \mathbf{i} + p_y \mathbf{j} + p_z \mathbf{k}$:
$\mathbf{p}' = \mathbf{q}\, \mathbf{p}\, \mathbf{q}^{-1}$
Background: Dual quaternions
Dual quaternions (William Kingdon Clifford, 1873)
• Rotation + translation
$\dot{\mathbf{q}} = \mathbf{q}_r + \epsilon\, \mathbf{q}_d, \qquad \epsilon^2 = 0$
$\mathbf{p}' = \dot{\mathbf{q}}\, \mathbf{p}\, \dot{\mathbf{q}}^{-1}$
Rotation: $\mathbf{q}_r = r_w + r_x \mathbf{i} + r_y \mathbf{j} + r_z \mathbf{k} = \cos\frac{\theta}{2} + \sin\frac{\theta}{2} \cdot \mathbf{u}$
Translation: $\mathbf{q}_d = 0 + \frac{t_x}{2} \mathbf{i} + \frac{t_y}{2} \mathbf{j} + \frac{t_z}{2} \mathbf{k}$
Background: Dual Quaternion Blending (DQB)
Linear Blend Skinning (LBS)
$\mathbf{p}'_i = \sum_{j=1}^{n} w_{ij}\, T_j\, \mathbf{p}_i$
• Volume-loss (collapse) problem
– Caused by scale changes in the blended transformation
Reference: Kavan, Ladislav, et al. "Skinning with dual quaternions." Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007.
Background: Dual Quaternion Blending (DQB)
Dual Quaternion Skinning (DQS)
$\dot{\mathbf{q}} = \dfrac{\sum_{i=1}^{n} w_i\, \dot{\mathbf{q}}_i}{\left\| \sum_{i=1}^{n} w_i\, \dot{\mathbf{q}}_i \right\|}$
Reference: Kavan, Ladislav, et al. "Skinning with dual quaternions." Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007.
(Figure: comparison of LBS vs. DQS deformation)
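A minimal sketch of dual quaternion blending of several node transforms (pure NumPy; helper names are my own; each node's rotation is given as a unit quaternion [w, x, y, z] plus a translation vector):

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def to_dual_quat(q_rot, t):
    """Dual quaternion q_real + eps * q_dual with q_dual = 0.5 * (0, t) * q_real."""
    return q_rot, 0.5 * quat_mul(np.array([0.0, *t]), q_rot)

def dqb(weights, dual_quats):
    """Dual Quaternion Blending: normalised weighted sum of unit dual quaternions."""
    real = sum(w * dq[0] for w, dq in zip(weights, dual_quats))
    dual = sum(w * dq[1] for w, dq in zip(weights, dual_quats))
    norm = np.linalg.norm(real)
    return real / norm, dual / norm

def dual_quat_to_rt(q_real, q_dual):
    """Recover the rotation quaternion and translation: t = 2 * q_dual * conj(q_real)."""
    conj = q_real * np.array([1, -1, -1, -1])
    return q_real, 2.0 * quat_mul(q_dual, conj)[1:]

# Toy example: blend an identity node with a node translated by (0.1, 0, 0), weights 0.7 / 0.3.
dq_a = to_dual_quat(np.array([1.0, 0, 0, 0]), np.zeros(3))
dq_b = to_dual_quat(np.array([1.0, 0, 0, 0]), np.array([0.1, 0.0, 0.0]))
print(dual_quat_to_rt(*dqb([0.7, 0.3], [dq_a, dq_b])))   # translation ~ (0.03, 0, 0)
```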
DynamicFusion: Dense non-rigid warp field
Warp field
$\mathcal{W}_t(x_c) = T_{lw}\, SE3(\mathbf{DQB}(x_c))$
• $T_{lw}$: rigid transformation (camera motion)
• $SE3(\mathbf{DQB}(x_c))$: per-voxel warp blended from the nodes
DynamicFusion: Dense non-rigid surface fusion
Sampled TSDF
$\mathcal{V}(\mathbf{x}) \mapsto [\,v(\mathbf{x}) \in \mathbb{R},\ w(\mathbf{x}) \in \mathbb{R}\,]^\top$
$v(\mathbf{x})$: weighted average of all projective TSDF values
$w(\mathbf{x})$: sum of the associated weights
DynamicFusion: Dense non-rigid surface fusion
After acquiring the latest depth image $D_t$:
• Transform each voxel center of the canonical space into the live-frame coordinate system:
$(x_t^\top, 1)^\top = \mathcal{W}_t(x_c)\, (x_c^\top, 1)^\top$
• Projective Signed Distance Function (psdf):
$\mathbf{psdf}(x_c) = \left[ K^{-1} D_t(u_c)\, (u_c^\top, 1)^\top \right]_z - \left[ x_t \right]_z$
$u_c = \pi(K x_t)$: pixel onto which the voxel center projects
(Figure labels: measured depth point, voxel center)
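A hedged sketch of the psdf computation for one canonical voxel centre: warp it into the live frame with $\mathcal{W}_t$, project with the intrinsics, and compare against the measured depth (the warp is represented here as a plain 4×4 matrix; function and variable names are my own):

```python
import numpy as np

def psdf(x_c, warp_matrix, depth, K):
    """psdf(x_c) = [K^-1 D_t(u_c) (u_c, 1)]_z - [x_t]_z, with x_t the warped voxel centre."""
    x_t = (warp_matrix @ np.append(x_c, 1.0))[:3]            # x_t from W_t(x_c)
    if x_t[2] <= 0:
        return None                                           # behind the camera
    u = K @ x_t
    u_c = np.round(u[:2] / u[2]).astype(int)                  # u_c = pi(K x_t)
    if not (0 <= u_c[0] < depth.shape[1] and 0 <= u_c[1] < depth.shape[0]):
        return None
    d = depth[u_c[1], u_c[0]]
    if d <= 0:
        return None                                           # hole in the depth map
    measured = np.linalg.inv(K) @ np.array([u_c[0], u_c[1], 1.0]) * d
    return measured[2] - x_t[2]                                # signed distance along the camera z-axis
```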
DynamicFusion: Dense non-rigid surface fusion
TSDF update:
$\mathcal{V}(\mathbf{x})_t = \begin{cases} [\,v'(\mathbf{x}),\ w'(\mathbf{x})\,]^\top & \text{if } \mathbf{psdf}(\mathbf{dc}(\mathbf{x})) > -\tau \\ \mathcal{V}(\mathbf{x})_{t-1} & \text{otherwise} \end{cases}$
$\mathbf{dc}(\cdot)$: maps a voxel index to the voxel center coordinates (in the TSDF volume)
$\tau > 0$: truncation distance threshold
$v'(\mathbf{x}) = \dfrac{v(\mathbf{x})_{t-1}\, w(\mathbf{x})_{t-1} + \min(\rho, \tau)\, w(\mathbf{x})}{w(\mathbf{x})_{t-1} + w(\mathbf{x})}, \qquad \rho = \mathbf{psdf}(\mathbf{dc}(\mathbf{x}))$
$w'(\mathbf{x}) = \min\!\left( w(\mathbf{x})_{t-1} + w(\mathbf{x}),\ w_{\max} \right)$
$w(\mathbf{x}) \propto \dfrac{1}{k} \sum_{i \in N(x_c)} \left\| \mathbf{dg}^i_v - x_c \right\|_2$
DynamicFusion: Estimating the warp field $\mathcal{W}_t$
Energy function
$E(\mathcal{W}_t, \mathcal{V}, D_t, \mathcal{E}) = \mathbf{Data}(\mathcal{W}_t, \mathcal{V}, D_t) + \lambda\, \mathbf{Reg}(\mathcal{W}_t, \mathcal{E})$
• Data: dense ICP cost from the model to the latest frame
• Reg: penalty on non-smooth motion fields
$\mathcal{V}$: current 3D geometry
$D_t$: latest depth map
$\mathcal{E}$: set of edges
DynamicFusion: Estimating the warp field $\mathcal{W}_t$
Current surface model
• Extracted from the TSDF $\mathcal{V}$ with the marching cubes algorithm
– The surface is generated from the zero-level isosurface
• Stored as a polygon mesh in the canonical space
– Vertex-normal pairs: $\hat{\mathcal{V}}_c \equiv \{V_c, N_c\}$
DynamicFusion: Estimating the warp field $\mathcal{W}_t$
Data term
$\mathbf{Data}(\mathcal{W}, \mathcal{V}, D_t) \equiv \sum_{u \in \Omega} \psi_{\mathbf{data}}\!\left( \hat{\mathbf{n}}_u^\top (\hat{\mathbf{v}}_u - \mathbf{vl}_{\tilde{u}}) \right)$
$\psi_{\mathbf{data}}$: robust Tukey penalty function
• $\hat{\mathbf{v}}_u, \hat{\mathbf{n}}_u$: predicted vertex and normal (warped from the canonical space into the live frame)
• $\mathbf{vl}_{\tilde{u}}$: measured point (depth)
• The residual is the error along the vertex normal direction in the live frame
Excerpt on robust estimation (iteratively reweighted least squares): this gives the "solution" as a simple least-squares problem,
$\hat{a} = \left( \sum_i w_i\, x_i x_i^\top \right)^{-1} \sum_i w_i\, y_i x_i \qquad (8)$
Note that this solution depends on the $w_i$ values, which in turn depend on $\hat{a}$. The idea is to alternate between calculating $\hat{a}$ and recalculating $w_i = w\!\left( (y_i - \hat{a}^\top x_i)/\sigma_i \right)$. For the Cauchy $\rho$ function the associated weight is
$w_C(u) = \dfrac{u}{1 + (u/c)^2} \qquad (9)$
and for the Beaton-Tukey $\rho$ function
$w_T(u) = \begin{cases} \left( 1 - \left( \frac{u}{a} \right)^2 \right)^2 & |u| \le a \\ 0 & |u| > a \end{cases} \qquad (10)$
(Plot: example of the Beaton-Tukey $\rho$ function.)
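For reference, a small sketch of the Beaton-Tukey weight used for $\psi_{\mathbf{data}}$ in an IRLS scheme (the constant a is a free parameter; the value 4.685, common for unit-variance residuals, is my assumption and not from the paper):

```python
import numpy as np

def tukey_weight(u, a=4.685):
    """Beaton-Tukey weight: (1 - (u/a)^2)^2 for |u| <= a, 0 otherwise."""
    u = np.asarray(u, dtype=float)
    w = (1.0 - (u / a) ** 2) ** 2
    return np.where(np.abs(u) <= a, w, 0.0)

# Residuals far from zero are down-weighted to exactly zero (outlier rejection):
print(tukey_weight([0.0, 2.0, 10.0]))   # -> [1.0, ~0.67, 0.0]
```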
DynamicFusion: Estimating the warp field $\mathcal{W}_t$
Regularization term
• Constrains the motion in regions not observed in the latest frame
$\mathbf{Reg}(\mathcal{W}, \mathcal{E}) \equiv \sum_{i=0}^{n} \sum_{j \in \mathcal{E}(i)} \alpha_{ij}\, \psi_{\mathbf{reg}}\!\left( \mathbf{T}_{ic}\, \mathbf{dg}^j_v - \mathbf{T}_{jc}\, \mathbf{dg}^j_v \right)$
$\psi_{\mathbf{reg}}$: Huber penalty
$\alpha_{ij} = \max\!\left( \mathbf{dg}^i_w,\ \mathbf{dg}^j_w \right)$
• A constraint that keeps the edge between nodes $i$ and $j$ as rigid as possible
Huber loss function:
$L_\delta(a) = \begin{cases} \frac{1}{2} a^2 & |a| \le \delta \\ \delta \left( |a| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}$
(Plot: example of the Huber loss function compared with the squared error loss.)
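Similarly, a sketch of the Huber penalty used for $\psi_{\mathbf{reg}}$ (the threshold δ below is purely illustrative):

```python
import numpy as np

def huber_loss(a, delta=0.01):
    """Huber penalty: 0.5*a^2 for |a| <= delta, delta*(|a| - 0.5*delta) otherwise."""
    a = np.asarray(a, dtype=float)
    quad = 0.5 * a ** 2
    lin = delta * (np.abs(a) - 0.5 * delta)
    return np.where(np.abs(a) <= delta, quad, lin)

# Quadratic near zero, linear growth for large deviations between neighbouring node transforms:
print(huber_loss([0.0, 0.005, 0.1]))
```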
DynamicFusion: Estimating the warp field $\mathcal{W}_t$
Minimize the energy function $E$
• Gauss-Newton method
– Approximate Hessian: $\mathbf{J}^\top \mathbf{J} = \mathbf{J}_d^\top \mathbf{J}_d + \lambda\, \mathbf{J}_r^\top \mathbf{J}_r$
– Block arrow-head matrix
– Solved efficiently with a block Cholesky decomposition
(Figure: structure of the arrow-head matrix)
DynamicFusion: Inserting new nodes
• Unsupported surface vertices (see the sketch below):
$\min_{k \in N(v_c)} \dfrac{\left\| \mathbf{dg}^k_v - v_c \right\|}{\mathbf{dg}^k_w} \ge 1$
• New nodes $\mathbf{dg}^*_v \in \widetilde{\mathbf{dg}}_v$
– The transformation parameters of a new node are initialized from the surrounding nodes using DQB:
$\mathbf{dg}^*_{se3} \leftarrow \mathcal{W}_t(\mathbf{dg}^*_v)$
• Update:
$\mathcal{N}^t_{\mathbf{warp}} = \mathcal{N}^{t-1}_{\mathbf{warp}} \cup \{ \widetilde{\mathbf{dg}}_v, \widetilde{\mathbf{dg}}_{se3}, \widetilde{\mathbf{dg}}_w \}$
(Figure labels: surface vertices; node position and influence radius)
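A sketch of the node-insertion test and initialization above (brute-force nearest-node search; the paper additionally subsamples the candidate vertices to keep node density roughly uniform, which is omitted here; the default radius and all names are my own assumptions):

```python
import numpy as np

def unsupported_vertices(vertices, node_pos, node_radius):
    """A vertex is unsupported if min_k ||dg_v^k - v_c|| / dg_w^k >= 1 (outside every node's radius)."""
    d = np.linalg.norm(vertices[:, None, :] - node_pos[None, :, :], axis=-1)   # (V, N) distances
    return np.min(d / node_radius[None, :], axis=1) >= 1.0

def insert_nodes(node_pos, node_se3, node_radius, vertices, warp_fn, radius=0.025):
    """Insert new nodes at unsupported vertices; initialise dg_se3* <- W_t(dg_v*) via warp_fn."""
    mask = unsupported_vertices(vertices, node_pos, node_radius)
    new_pos = vertices[mask]
    new_se3 = np.stack([warp_fn(p) for p in new_pos]) if len(new_pos) else np.empty((0, 4, 4))
    new_radius = np.full(len(new_pos), radius)
    return (np.vstack([node_pos, new_pos]),
            np.concatenate([node_se3, new_se3]),
            np.concatenate([node_radius, new_radius]))
```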
DynamicFusion: Experimental results
Figure 5 (from the paper): Real-time non-rigid reconstructions for two deforming scenes, "drinking from a cup" and "crossing fingers". The upper rows of (a) and (b) show the canonical models as they evolve over time; the lower rows show the corresponding warped geometries tracking the scene. In (a) complete models of the arm and the cup are obtained. Note the system's ability to deal with large motion and to add surfaces not visible in the initial scene, such as the bottom of the cup and the back side of the arm. In (b) full body motions are shown, including clasping of the hands, where the model stays consistent throughout the interaction.
DynamicFusion: Limitations
• Weak against rapid changes of the graph structure
– Prone to failure when the topology suddenly transitions from closed to open
– Nodes that are connected cannot suddenly be separated
• Failure modes shared with other real-time incremental tracking methods
– Failure to close loops
– Too large motion between frames
Summary
• Proposed a dense, real-time 3D reconstruction method for dynamic scenes
– No static-scene assumption
• Generalized 3D TSDF fusion to non-rigid scenes
• Estimates a 3D (6-DoF) warp field in real time
END