Final_From 2D Image To 3D Object.pptx

From 2D Image To 3D Object
Extracting human-object interaction from a 2D image

Outline
Project Overview (How the inference works)
Previous Work (PROX model)
Methodology
Problem Statement
PHOSA Model Challenges
Future Work (Drawbacks)
Deployment Challenges
Summary
Introduction

Introduction
Why extract 3D human-object interaction?
Domains that use 3D objects:
● Entertainment(VR, Gaming)
● Animation
● Simulation
● 3D Printing
● Manufacturing
● Advertising and Marketing
3D modeling enables companies to display their products in an
ideal state.
● Sciences and Geology

Introduction
Why extract 3D human-object interaction?
The time needed to create a 3D character depends on many factors, such as the
level of complexity and the experience of the modeler. With these factors in
mind, it can take approximately 100 to 200 hours.

Problem Statement
We want to understand human-object interaction. Without reasoning about
the interaction, different 3D interpretations of a scene can produce
the same 2D projection, so the reconstruction is ambiguous.
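
The ambiguity can be illustrated with a minimal sketch, assuming a simple pinhole camera (the function names and the sample coordinates are hypothetical, not taken from the project): an object that is three times larger and three times farther away projects to exactly the same 2D points.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def scale_about_camera(points, factor):
    """Scale every point away from the camera center at the origin."""
    return [(factor * x, factor * y, factor * z) for x, y, z in points]


# A small triangle close to the camera ...
near_small = [(-0.5, 0.0, 2.0), (0.5, 0.0, 2.0), (0.0, 1.0, 2.0)]
# ... and the same shape, three times larger and three times farther away.
far_large = scale_about_camera(near_small, 3.0)

projection_near = [project(p) for p in near_small]
projection_far = [project(p) for p in far_large]
```

Both lists contain the same image coordinates, so the image alone cannot tell the two arrangements apart.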

Previous Work (PROX model)
PROX considers humans and their environments using a known 3D scene:
it recovers 3D meshes of humans in relation to that scene. This
approach depends on existing 3D captures of the scene, which are
not available in the wild. PHOSA attempts to reconstruct 3D humans
and objects without 3D scans.

System Architecture:
[Diagram. Model architecture components: Image, Detectron (PointRend), Frankmocap (Bodymocap, SMPL), PoseOptimizer, 3D Objects, 3D Human, PHOSA (3D global reasoning). The model is served through a Flask API.]

Project Overview (How the inference works)
We chose the paper “Perceiving 3D Human-Object Spatial Arrangements from
a Single Image in the Wild” from Facebook AI Research, presented at the
European Conference on Computer Vision (ECCV) 2020,
which concentrates on extracting human-object interactions without any
scene- or object-level 3D supervision. We aim to take a 2D image
as input and output a plausible 3D human-object interaction
scene through a Flask API.

Methodology
1. Data acquisition (COCO Dataset)
2. Data preparation (preparing the objects/JSON files)
3. Modeling (Explain PHOSA model)
4. Evaluation (Explain loss weights)
5. Deployment

- COCO is a large-scale object detection, segmentation, and
captioning dataset with about 330K images.
- It contains challenging images of humans interacting with everyday
objects, obtained in uncontrolled settings.
Methodology → 1) Data acquisition (COCO Dataset)

The generality of the approach is demonstrated by evaluating on objects
from 8 categories of varying size and interaction type.
● Bicycle
● Baseball_bat
● Tennis_Racket
● Motorcycle
● Laptop
● Bench
● Surfboard
● Skateboard
Methodology → 1) Data acquisition (COCO Dataset)
“We need to add a mesh (3D model) for each of these classes.”

● Resize the input image.
● Obtain a mesh for each category and prepare it using MeshLab.
● All the meshes are pre-processed to be watertight and are simplified to make the
optimization more efficient.
○ For the pre-processing, we first fill in the holes of the raw mesh models (e.g. the
holes in the wheels or tennis racket) to make the projection of the 3D models
consistent with the silhouettes obtained by the instance segmentation
algorithm.
○ Finally, we reduce the number of mesh vertices using MeshLab.
Methodology → 2) Data preparation
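
As a rough illustration of what "watertight" means, here is a minimal sketch of a common edge-based check: in a closed triangle mesh, every edge is shared by exactly two faces. This is not MeshLab's procedure, and the function name and the tiny tetrahedron are hypothetical examples.

```python
from collections import Counter


def is_watertight(faces):
    """Return True if every edge of a triangle mesh is shared by exactly two faces.

    faces: list of (i, j, k) vertex-index triples.
    """
    edge_counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            # Store edges with sorted endpoints so direction does not matter.
            edge_counts[(min(u, v), max(u, v))] += 1
    return bool(edge_counts) and all(n == 2 for n in edge_counts.values())


# A closed tetrahedron, and the same mesh with one face removed (a hole).
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
open_tetrahedron = tetrahedron[:-1]
```

A check like this can be run after hole filling to confirm that no open boundary edges remain before the mesh is simplified.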

Filling in the holes of the raw mesh models and reducing
the number of mesh vertices

The basic idea behind the model's performance is that
it exploits the interaction areas between
humans and each of the objects listed previously.
We therefore had to label the interaction areas manually on the
meshes for each new class using MeshLab.
Part labeling

Generating the corresponding file
● This is a sample file listing the specific vertices of
the bench object that are in contact with the
human, generated as a JSON file after labeling in
MeshLab.
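
A minimal sketch of how such a file could be read and sanity-checked is shown below. The JSON layout (part name mapped to a list of vertex indices), the part names, and the function name are assumptions made for illustration; the project's actual file format may differ.

```python
import json

# Hypothetical file content: part name -> contact vertex indices on the bench mesh.
sample_json = '{"seat": [12, 13, 14, 40], "back": [101, 102, 102, 150]}'


def load_contact_parts(text, num_vertices):
    """Parse part labels, drop duplicate indices, and check they are in range."""
    parts = json.loads(text)
    cleaned = {}
    for name, indices in parts.items():
        unique = sorted(set(indices))
        if unique and (unique[0] < 0 or unique[-1] >= num_vertices):
            raise ValueError(f"part {name!r} references a vertex outside the mesh")
        cleaned[name] = unique
    return cleaned


bench_parts = load_contact_parts(sample_json, num_vertices=200)
```

Validating indices against the vertex count matters here because the meshes are simplified after download, so labels made on an older version of a mesh can point at vertices that no longer exist.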

Methodology → 3) Modeling (Explain PHOSA model)
An optimization-based model that aims to produce plausible human-object interactions
PHOSA Model
(Theoretical perspective)

PHOSA Model
(Technical perspective)
PHOSA Submodels:
● PoseOptimizer
● PHOSA
PHOSA External Models:
● Detectron (PointRend)
● Frankmocap (BodyMocap)
● Multiperson (neural_renderer)

PHOSA Model
1. Detectron: predicts human instance segmentation masks and object detections.
2. Frankmocap: the human pose estimator, used to predict 3D human meshes.
3. PoseOptimizer: finds the optimal pose of each object based on the object instances
segmented by Detectron.
4. PHOSA: optimizes the human-object interaction using several loss terms.
5. neural_renderer: used for 3D mesh rendering and visualization.

PHOSA Model
[Pipeline diagram: Detectron (PointRend); Frankmocap (Bodymocap, SMPL); PoseOptimizer; 3D Objects; 3D Human; PHOSA]

PHOSA Model (PointRend)
Uses Detectron (PointRend) to predict human instance
segmentation masks.

PHOSA Model
[Pipeline diagram, stage shown: Detectron (PointRend); Frankmocap (Bodymocap, SMPL)]

PHOSA Model
[Pipeline diagram, stage shown: Detectron (PointRend); Frankmocap (Bodymocap, SMPL); PoseOptimizer]

PHOSA Model
[Diagram labels: object instance mask (from PointRend); differentiable renderer or PoseOptimizer; object]
Differentiable renderer → used to solve for
the object rotation and translation
that minimize the silhouette
reprojection error, which helps in object
pose estimation.

PHOSA Model
[Diagram labels: object instance mask (from PointRend); differentiable renderer or PoseOptimizer; object]
- We visualize the final distribution
of object sizes learned for the
COCO-2017 test set after
optimizing for human interaction.

PHOSA Model
[Pipeline diagram, stage shown: Detectron (PointRend); Frankmocap (Bodymocap, SMPL); PoseOptimizer; 3D Objects; 3D Human]

PHOSA Model
Given the independently estimated 3D humans and 3D
objects, PHOSA optimizes a weighted sum of loss terms
with respect to the 6-DoF pose and the intrinsic scale factor of each
object and human.

PHOSA Model
[Pipeline diagram, complete: Detectron (PointRend); Frankmocap (Bodymocap, SMPL); PoseOptimizer; 3D Objects; 3D Human; PHOSA]

Methodology → 4) Evaluation (Explain loss weights)
Loss Weights
Weighted sum of losses
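
A minimal sketch of a weighted sum of loss terms follows. The term names mirror the losses discussed in these slides, but the weight values, the loss values, and the function name are hypothetical placeholders, not the settings used in the project.

```python
def total_loss(loss_terms, loss_weights):
    """Combine individual loss values into one objective: sum of weight * term."""
    missing = set(loss_terms) - set(loss_weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(loss_weights[name] * value for name, value in loss_terms.items())


# Hypothetical weights and per-term loss values, for illustration only.
loss_weights = {"silhouette": 1.0, "interaction": 2.0, "depth": 0.5,
                "collision": 4.0, "scale": 0.25}
loss_terms = {"silhouette": 0.5, "interaction": 1.0, "depth": 2.0,
              "collision": 0.25, "scale": 4.0}

objective = total_loss(loss_terms, loss_weights)
```

Raising a term's weight makes the optimizer trade off more of the other terms to reduce it, which is why the weights have to be tuned together rather than one at a time.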

Loss Terms
Occlusion-aware silhouette loss:
For optimizing object pose. Given an image, a 3D mesh
model, and instance masks, our occlusion-aware silhouette
loss finds the 6-DoF pose that most closely matches the
target mask (bottom right).
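
The idea of ignoring occluded pixels can be sketched as follows. This is a simplified illustration, not the paper's full loss: it only computes a squared pixel difference between a rendered silhouette and the target mask, and the mask encoding (1 = object, 0 = background, -1 = occluded by another instance) is an assumption made for this example.

```python
def occlusion_aware_silhouette_loss(rendered, target):
    """Sum of squared differences over pixels that are not marked as occluded.

    rendered: 2D list of silhouette values in [0, 1] from a renderer.
    target:   2D list with 1 (object), 0 (background), -1 (occluded, ignored).
    """
    loss = 0.0
    for rendered_row, target_row in zip(rendered, target):
        for r, t in zip(rendered_row, target_row):
            if t == -1:
                continue  # another instance hides this pixel: no penalty either way
            loss += (r - t) ** 2
    return loss


target = [[0, 1, 1, -1],
          [0, 1, 1, -1]]
# Covers the visible object pixels and extends behind the occluder.
well_aligned = [[0.0, 1.0, 1.0, 1.0],
                [0.0, 1.0, 1.0, 1.0]]
# Same shape shifted to the left, covering background instead.
shifted = [[1.0, 1.0, 0.0, 0.0],
           [1.0, 1.0, 0.0, 0.0]]
```

The well-aligned rendering is not penalized for the part hidden behind the occluder, while the shifted one is penalized both for the background it covers and for the object pixels it misses.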

Loss Terms
Human-object interaction loss:
- A part-labeling approach, inspired
by PROX.
- Contact regions are labeled on
the human body and on each
object mesh to encode the
interacting parts.

Loss Terms
To identify human-object interaction, we first determine
whether two parts interact by using 3D bounding boxes.
For example, people usually hold tennis rackets
by the handle, using their hands. As shown here,
the handle of the tennis racket and the hand of
the person are not in contact, but their 3D bounding boxes overlap.
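
The bounding-box test can be sketched as below, assuming axis-aligned 3D boxes computed from the part vertices (the box type, the function names, and the coordinates are assumptions for illustration).

```python
def bounding_box(points):
    """Axis-aligned 3D bounding box of a point set: (min corner, max corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def boxes_overlap(box_a, box_b):
    """Two axis-aligned boxes overlap if their extents overlap on every axis."""
    (min_a, max_a), (min_b, max_b) = box_a, box_b
    return all(min_a[i] <= max_b[i] and min_b[i] <= max_a[i] for i in range(3))


# Hypothetical part vertices (in meters).
hand = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1)]
racket_handle = [(0.08, 0.05, 0.05), (0.3, 0.1, 0.1)]
distant_handle = [(1.0, 1.0, 1.0), (1.2, 1.1, 1.1)]

near_interacts = boxes_overlap(bounding_box(hand), bounding_box(racket_handle))
far_interacts = boxes_overlap(bounding_box(hand), bounding_box(distant_handle))
```

Only part pairs whose boxes overlap are treated as interacting, so a handle placed far from the hand is left alone.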

Loss Terms
We then apply our loss, which pulls the interacting parts closer
together. Here, we visualize the improved arrangement.
For identifying human-object interaction, the initial size of
the objects is very important.
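
One simple way to express "pulling the interacting parts together" is a squared distance between the centroids of the two parts, minimized by gradient descent on the object translation. The sketch below assumes that formulation; the function names, step count, and learning rate are hypothetical.

```python
def centroid(points):
    """Mean position of a set of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def interaction_loss(human_part, object_part, translation):
    """Squared distance between part centroids after translating the object part."""
    h = centroid(human_part)
    o = centroid(object_part)
    return sum((o[i] + translation[i] - h[i]) ** 2 for i in range(3))


def pull_together(human_part, object_part, steps=50, learning_rate=0.1):
    """Gradient descent on the object translation to reduce the interaction loss."""
    h = centroid(human_part)
    o = centroid(object_part)
    translation = [0.0, 0.0, 0.0]
    for _ in range(steps):
        for i in range(3):
            gradient = 2.0 * (o[i] + translation[i] - h[i])
            translation[i] -= learning_rate * gradient
    return tuple(translation)


hand = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1)]
handle = [(0.3, 0.0, 0.0), (0.5, 0.1, 0.1)]

loss_before = interaction_loss(hand, handle, (0.0, 0.0, 0.0))
loss_after = interaction_loss(hand, handle, pull_together(hand, handle))
```

In the full model this term is only one part of the objective, so the object does not simply collapse onto the hand: the silhouette, depth, and collision terms pull against it.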

Loss Terms
Taking a prior on human and object intrinsic
scale:
- If the surfboard is initially a reasonable
size, say two and a half meters long, then
the surfboard is close enough to the
person to correctly detect interaction.
- If we were to initialize the surfboard
to be unrealistically large,
the 3D bounding boxes would no longer
overlap.

Loss Terms
Taking a prior on human and object intrinsic scale:
To start our optimization process, we use an
internet search to find the usual size of objects.
The red caret denotes the size resulting from the
hand-picked scale used for initialization. The
blue line denotes the size produced by the
empirical mean scale of all category instances at
the end of optimization.

Loss Terms
Ordinal depth loss (Correct depth ordering):
The depth ordering inferred from the 3D placement should match that of the
image.
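
A minimal sketch of an ordinal depth penalty is given below. It assumes per-pixel rendered depths for two instances and a segmentation label saying which instance is actually visible at each pixel; wherever the visible instance is rendered behind the other one, a soft penalty log(1 + exp(depth difference)) is added. The data layout and the function name are assumptions for illustration.

```python
import math


def ordinal_depth_loss(depth_a, depth_b, visible):
    """Penalize pixels where the rendered depth order contradicts the mask.

    depth_a, depth_b: per-pixel rendered depths (None where the instance is absent).
    visible: per-pixel label "a" or "b" from the segmentation mask (None if neither).
    """
    loss = 0.0
    for d_a, d_b, label in zip(depth_a, depth_b, visible):
        if d_a is None or d_b is None or label is None:
            continue  # ordering is only defined where both instances render
        d_visible, d_other = (d_a, d_b) if label == "a" else (d_b, d_a)
        if d_visible > d_other:
            # The instance seen in the image is placed behind the other one.
            loss += math.log(1.0 + math.exp(d_visible - d_other))
    return loss


depth_a = [2.0, 2.0, None]
depth_b = [3.0, 3.0, 3.0]

consistent = ordinal_depth_loss(depth_a, depth_b, ["a", "a", "b"])
inconsistent = ordinal_depth_loss(depth_a, depth_b, ["b", "b", "b"])
```

The penalty grows with the size of the depth violation, so the optimizer is pushed to move the visible instance in front rather than merely close to the other one.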

Loss Terms
Ordinal depth loss (Correct depth ordering):
The correct depth ordering of people and objects would also minimize the
occlusion-aware silhouette loss; the ordinal depth loss encourages that ordering directly.

Loss Terms
Collision loss:
- It is used to avoid
interpenetration, where a human
and an object occupy
the same 3D space.
- Promoting proximity
between people and objects
can exacerbate the problem
of instances occupying the
same 3D space.
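
To make the idea of an interpenetration penalty concrete, here is a deliberately simplified sketch that approximates each instance by a sphere and penalizes the squared overlap depth. This sphere proxy is a stand-in chosen for brevity, not the mesh-based collision loss used by the model, and all names and values are hypothetical.

```python
import math


def sphere_collision_penalty(center_a, radius_a, center_b, radius_b):
    """Squared penetration depth of two spheres; zero when they do not overlap."""
    distance = math.dist(center_a, center_b)
    penetration = max(0.0, radius_a + radius_b - distance)
    return penetration ** 2


# Two unit spheres whose centers are 1.5 apart overlap by 0.5.
overlapping = sphere_collision_penalty((0.0, 0.0, 0.0), 1.0, (1.5, 0.0, 0.0), 1.0)
# Centers 3.0 apart: the spheres are separated, so there is no penalty.
separated = sphere_collision_penalty((0.0, 0.0, 0.0), 1.0, (3.0, 0.0, 0.0), 1.0)
```

Because the penalty is zero for separated shapes, it does not fight the interaction loss until the parts actually start to intersect.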

Loss Terms
Importance of each loss

AFTER TESTING THE MODEL IN A COLAB NOTEBOOK,
WE MOVED ON TO THE NEXT STEP (AWS CLOUD)

• We worked on a Linux-based AWS EC2
instance, which is a
standard computing
unit on Amazon Web
Services.
Instance type

● We uploaded our project files to the AWS cloud because it facilitates our work and
speeds it up, and it also provides GPU resources.
● After some network issues, we managed to use Flask successfully to deploy our
model not only on a local server but also publicly, for anyone to use, at
35.86.170.176:5000
Deployment

● We have labeled the interaction regions on both the humans and the objects.
● Now it is time to put it all together to deploy our model.
One final step before
deployment

First, we assign each class name
to its 3D object mesh

Second we mark the contact
regions as follows:
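
The two steps above could be organized as two lookup tables, as in the sketch below. This is not the project's actual code: the mesh paths, part names, and the helper function are hypothetical and only illustrate the idea of registering a mesh and its contact regions per class.

```python
# Hypothetical mesh paths, for illustration only.
CLASS_TO_MESH = {
    "bench": "meshes/bench.obj",
    "tennis_racket": "meshes/tennis_racket.obj",
    "surfboard": "meshes/surfboard.obj",
}

# For each class: (object part, human part) pairs that are allowed to interact.
CONTACT_PAIRS = {
    "bench": [("seat", "hips"), ("back", "upper_back")],
    "tennis_racket": [("handle", "left_hand"), ("handle", "right_hand")],
    "surfboard": [("deck", "left_foot"), ("deck", "right_foot")],
}


def describe_class(class_name):
    """Return the mesh path and contact pairs registered for a class."""
    if class_name not in CLASS_TO_MESH:
        raise KeyError(f"No mesh registered for class {class_name!r}")
    return {
        "mesh": CLASS_TO_MESH[class_name],
        "contacts": CONTACT_PAIRS.get(class_name, []),
    }


bench_info = describe_class("bench")
```

Keeping both tables keyed by the same class name means that adding a new category only requires one new mesh entry and one new list of contact pairs.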

● Now our model is ready to deploy!
● We chose to display our output as four images:
● 1- Instance segmentation for the human and objects in the input
picture
● 2- Human 3D mesh
● 3- Interaction between human and object (front view)
● 4- Interaction between human and object (top view)
Deployment

● We are actually running four models together to obtain this
result:
● 1- The Detectron segmenter, to obtain the first picture
● 2- Frankmocap, to generate the human mesh
● 3- neural_renderer, to create the 3D object mask
● And last but not least, our PHOSA model, to optimize the interaction
between the 3D human mesh and the 3D object mesh
One drawback of the
app is its time complexity
1- Instead of initializing our model randomly (far away from the
best weights), during training we consistently used the
optimized weights (output) as the input for the next iterations.
2- We managed to reduce the number of iterations to only 25 instead of
100 while keeping most of the output quality (an accepted
tradeoff).
3- We can also reduce the running time by switching to a more powerful GPU,
such as an RTX or workstation GPU.
Some workarounds:
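
Why a warm start allows fewer iterations can be seen on a toy problem. The sketch below runs plain gradient descent on a one-dimensional quadratic; it is not the PHOSA optimizer, and the start value, target, and learning rate are arbitrary choices for illustration.

```python
def optimize(start, target, iterations, learning_rate=0.05):
    """Gradient descent on the toy objective (value - target) ** 2."""
    value = start
    for _ in range(iterations):
        value -= learning_rate * 2.0 * (value - target)
    return value


def error_after(start, target, iterations):
    """Distance from the optimum after a fixed number of iterations."""
    return abs(optimize(start, target, iterations) - target)


target = 3.0
cold_start = -10.0

cold_100 = error_after(cold_start, target, 100)  # full budget, poor start
cold_25 = error_after(cold_start, target, 25)    # reduced budget, poor start
warm_start = optimize(cold_start, target, 100)   # reuse a previous result
warm_25 = error_after(warm_start, target, 25)    # reduced budget, warm start
```

With a poor start, cutting the budget from 100 to 25 iterations leaves a large error, while 25 iterations from a previously optimized starting point end closer to the optimum than the original 100-iteration run.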

● After developing our application, we tried our best to make it
public and easy for anyone to access, despite the huge size
of our model and the conflicting libraries during installation.
So,
What is next?

● We provide our project as a Colab notebook for a faster and easier interface
● Here:
https://colab.research.google.com/drive/18zkZd46CZ2GGIa5yYi8hBlAyLgggRN_j?usp=sharing
Colab notebook

● One of the difficulties we faced while pushing the project to GitHub was
the huge size of the project, since the file size limit is only 100 MB.
● We used Git LFS to work around this problem.
● You can find our project repo at:
https://github.com/MohsenAziz/2d-images-to-3d-meshes_app-deployment
GitHub repository

● We tried to deploy the project on the Heroku platform, but unfortunately it did not work due to
the huge size of the project.
● We also tried to dockerize our project as an image, but this failed due to CUDA installation errors.
Further steps

PHOSA Model
Challenges
● Randomization
● Inference time
● Detectron different versions
● Torch version conflicts
● Prepare objects & JSON files
● Data annotation
● Sequential processes of the project

PHOSA Drawbacks
Human pose failure
Our human pose estimator
sometimes incorrectly estimates
the pose of the person
Object pose failure
The predicted masks are sometimes
unreliable for estimating the pose of
the object.

PHOSA Drawbacks
Incorrect reasoning about interaction due to scale

Future Work
● Add more classes
● Improve results on complex images
● Make human and object pose estimation more reliable
● Replace body motion capture with whole-body
motion capture (body + hands)
● Real-time human-object interaction
● Define mesh for specific human features
