Burnaev and Notchenko. Skoltech. Bridging gap between 2D and 3D with Deep Learning

Bridging the gap between 2D and 3D
with Deep Learning
Evgeny Burnaev (PhD) <e.burnaev@skoltech.ru>
assoc. prof. Skoltech
Alexandr Notchenko <a.notchenko@skoltech.ru>
PhD student

[Figure: ImageNet top-5 error over the years. Legend: deep learning based methods, feature based methods, human performance.]

Supervised Deep Learning data
Type | Supervision
2D image classification, detection, segmentation | class label, object detection box, segmentation contours
Pose estimation | structure of the “skeleton” in the image

3D deep learning is gaining popularity
Workshops:
● Deep Learning for Robotic Vision Workshop, CVPR 2017
● Geometry Meets Deep Learning, ECCV 2016
● 3D Deep Learning Workshop @ NIPS 2016
● Large Scale 3D Data: Acquisition, Modelling and Analysis, CVPR 2016
● 3D from a Single Image, CVPR 2015
A Google Scholar search for "3D" "Deep Learning" returns:
Year | # articles
2012 | 410
2013 | 627
2014 | 1210
2015 | 2570
2016 | 5440

Representation of 3D data for Deep Learning
Method | Pros (+) | Cons (-)
Many 2D projections | preserve surface texture; many 2D DL methods are available | redundant representation; vulnerable to optical illusions
Voxels | simple; can be sparse; have volumetric properties | lose surface properties
Point cloud | can be sparse | loses surface properties and volumetric properties
2.5D images | cheap measurement devices; sense depth | self-occlusion of bodies in a scene; a lot of noise in measurements

Learning Rich Features from RGB-D Images for
Object Detection and Segmentation
[10]

Latest development in
SLAM family of methods

LSD-SLAM (Large-Scale Direct Monocular Simultaneous Localization and Mapping)
[5]
LSD-SLAM - direct (feature-less) monocular SLAM

ElasticFusion
ElasticFusion - dense SLAM without a pose graph
[7]

DynamicFusion
The technique won the prestigious CVPR 2015 best paper award.
[9]

Problems of SLAM algorithms
● Don’t represent objects (only surfaces are known)
● Mostly dense representation (requires a lot of data)
● The whole scene is one big surface, so objects that are close to each other cannot be separated

3D Design Phase
• Massive repositories of 3D CAD models exist, e.g. GrabCAD
Examples: chairs, mechanical parts

3D Design Phase
• Designers spend about 60% of their time searching for the right information
• Massive and complex CAD models are usually archived in a disorderly way in enterprises, which makes design reuse a difficult task
3D model retrieval can significantly shorten product lifecycles

3D Shape-based Model Retrieval
• 3D models are complex, so there are no clear search rules
• Text-based search has its limitations, e.g. 3D models are often poorly annotated
• There is some commercial software for 3D CAD model search, e.g.
➢ Exalead OnePart by Dassault Systèmes,
➢ Geolus Search by Siemens PLM, and others
• However, the methods they use
➢ are time-consuming,
➢ are often based on hand-crafted descriptors,
➢ could be limited to a specific class of shapes,
➢ are not robust to scaling, rotations, etc.

Sparse 3D Convolutional Neural Networks for
Large-Scale Shape Retrieval
Alexandr Notchenko, Ermek Kapushev, Evgeny Burnaev
Presented at 3D Deep Learning Workshop at NIPS 2016

Sparsity of voxel representation
● 30^3 voxels are already enough to understand a simple shape, though texture information would make it even easier
● Sparsity over all classes of the ModelNet40 training set at voxel resolution 40 is only 5.5%
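To make the sparsity figure concrete, here is a minimal sketch of how a shape could be voxelized and its occupancy measured. It assumes the shape is given as a point cloud and that "sparsity" means the fraction of occupied voxels; both are assumptions of this example, not statements from the slides. The function names `voxelize` and `occupancy` are hypothetical.

```python
import numpy as np

def voxelize(points, resolution=40):
    """Map an (N, 3) point cloud to a boolean occupancy grid of shape (R, R, R)."""
    points = np.asarray(points, dtype=np.float64)
    mins = points.min(axis=0)
    # Scale uniformly so that the longest side of the bounding box spans the grid.
    extent = (points.max(axis=0) - mins).max()
    if extent == 0:
        extent = 1.0  # degenerate cloud: all points coincide
    idx = np.floor((points - mins) / extent * (resolution - 1)).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def occupancy(grid):
    """Fraction of voxels that are occupied (assumed meaning of 'sparsity')."""
    return grid.sum() / grid.size

# Example: points sampled on a unit sphere occupy only a thin shell of the grid.
rng = np.random.default_rng(0)
sphere = rng.normal(size=(20000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
print(f"occupied fraction: {occupancy(voxelize(sphere)):.3f}")
```

Because only surface voxels are set, the occupied fraction shrinks as the resolution grows, which is the motivation for sparse representations at large resolutions.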

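Shape Retrieval
● Query: a 3D shape (e.g. a plane) is passed through Sparse3DCNN, giving its feature vector V_plane
● The feature vectors of the dataset (V_car, V_person, ...) are precomputed
● Retrieved items: dataset shapes ranked by cosine distance to V_plane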
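The retrieval step can be sketched as a nearest-neighbour search under cosine distance. This is a minimal illustration, assuming the feature vectors have already been produced by the network and are non-zero; the function names `cosine_distances` and `retrieve` are hypothetical and not taken from the authors' code.

```python
import numpy as np

def cosine_distances(query, database):
    """Cosine distance between a query vector (D,) and each row of database (N, D)."""
    query = np.asarray(query, dtype=np.float64)
    database = np.asarray(database, dtype=np.float64)
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return 1.0 - db @ q

def retrieve(query, database, k=5):
    """Return indices and distances of the k database items closest to the query."""
    d = cosine_distances(query, database)
    order = np.argsort(d)[:k]
    return order, d[order]

# Toy example with 2D "feature vectors" standing in for network outputs.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, distances = retrieve([2.0, 0.1], features, k=3)
print(indices, distances)
```

Since the dataset vectors do not depend on the query, they can be normalized once and stored, so each query costs one matrix-vector product.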

Triplet loss
The representation can be efficiently learned by minimizing triplet loss.
A triplet is a set (a, p, n), where
● a - anchor object
● p - positive object that is similar to the anchor object
● n - negative object that is not similar to the anchor object
The loss for one triplet is
$L(a, p, n) = \max(0,\ \mu + \delta_{+} - \delta_{-})$,
where $\mu$ is a margin parameter, and $\delta_{+}$ and $\delta_{-}$ are the distances between p and a and between n and a.
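A minimal sketch of the triplet loss in its standard hinge form, max(0, margin + d(a, p) - d(a, n)), is given below. The Euclidean distance and the default margin of 1.0 are assumptions made for this example; the slides do not state which distance or margin value was used in training.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss for one triplet or a batch of triplets.

    Inputs have shape (D,) or (B, D). Euclidean distance is an assumption.
    """
    anchor = np.asarray(anchor, dtype=np.float64)
    positive = np.asarray(positive, dtype=np.float64)
    negative = np.asarray(negative, dtype=np.float64)
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # distance between p and a
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # distance between n and a
    return np.maximum(0.0, margin + d_pos - d_neg)

# The loss is zero once the negative is farther than the positive by the margin.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 3.0]))
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))
```

Minimizing this quantity pulls similar shapes together in feature space and pushes dissimilar ones at least a margin further away, which is what makes distance-based retrieval meaningful.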

Our approach
● Use very large resolutions and sparse representations.
● Use triplet learning for 3D shapes.
● Use the large-scale shape datasets ModelNet and ShapeNet.

Represent voxel shape as vector

Conclusions
● For small datasets of shapes, or for sparse 3D tensors, voxels can work.
● Voxels don’t scale to hundreds of “classes” and lose texture information.
● They cannot encode complicated object domains.


Robotics in human
environments

Robotic Control in Human Environments

Commodity sensors to create 2.5D images
Intel RealSense Series
Asus Xtion Pro
Microsoft Kinect v2
Structure Sensor
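A 2.5D image from such a sensor can be turned into a point cloud by back-projecting each pixel through a pinhole camera model. The sketch below assumes known intrinsics (`fx`, `fy`, `cx`, `cy`) and treats a depth of zero as a missing measurement; both the parameter values and the function name `depth_to_points` are hypothetical and not tied to any particular device or SDK.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth image into an (M, 3) point cloud.

    Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with depth <= 0 are treated as missing and dropped (assumption).
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # u indexes columns, v indexes rows; both arrays have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Toy 2x2 depth image with one missing pixel and made-up intrinsics.
cloud = depth_to_points([[2.0, 0.0], [2.0, 2.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
print(cloud)
```

Only surfaces facing the camera produce points, which is the self-occlusion limitation of 2.5D data.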

What do they have in common?
They require understanding the whole scene

Problem of “Holistic” Scene understanding

Lin D., Fidler S., Urtasun R. Holistic scene understanding for 3D object detection with RGBD cameras // Proceedings of the IEEE International Conference on Computer Vision. 2013. pp. 1417-1424.
● Human environments are often designed by humans
● Most of the objects are created by humans
● Context provides information through joint probability functions
● Textures are caused by materials and can therefore explain the function and structure of an object

Connecting 3 families of CV algorithms is inevitable
● Learnable Computer Vision Systems (Deep Learning)
● Geometric Computer Vision (SLAMs)
● Probabilistic Computer Vision (Bayesian methods)

Connecting the three families: Probabilistic Inverse Graphics

Probabilistic Inverse Graphics enables a system to
● Take into account setting information (shop: shelves and products | street: buildings, cars, pedestrians)
● Make maximum likelihood estimates from data and a model (or give directions on how best to reduce uncertainty)
● Learn the structure of objects (materials and textures / 3D shape / intrinsic dynamics)

Thank you.
Alexandr Notchenko Ermek Kapushev Evgeny Burnaev

Citations and Links
1. Deep Learning NIPS’2015 Tutorial by Geoff Hinton, Yoshua Bengio & Yann LeCun
2. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015). 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1912-1920).
3. C. Nash, C. Williams. Generative Models of Part-Structured 3D Objects.
4. Qin, Fei-wei, et al. "A deep learning approach to the classification of 3D CAD models." Journal of Zhejiang University SCIENCE C 15.2 (2014): 91-106.
5. Engel, Jakob, Thomas Schöps, and Daniel Cremers. "LSD-SLAM: Large-scale direct monocular SLAM." European Conference on Computer Vision. Springer International Publishing, 2014.
6. Su, Hang, et al. "Multi-view convolutional neural networks for 3D shape recognition." Proceedings of the IEEE International Conference on Computer Vision. 2015.
7. Whelan, Thomas, et al. "ElasticFusion: Dense SLAM Without A Pose Graph." Robotics: Science and Systems. Vol. 11. 2015.
8. Notchenko, Alexandr, Ermek Kapushev, and Evgeny Burnaev. "Sparse 3D Convolutional Neural Networks for Large-Scale Shape Retrieval." arXiv preprint arXiv:1611.09159 (2016).
9. Newcombe, Richard A., Dieter Fox, and Steven M. Seitz. "DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
10. Gupta, Saurabh, et al. "Learning rich features from RGB-D images for object detection and segmentation." European Conference on Computer Vision. Springer International Publishing, 2014.
