3D Interpretation from Stereo Images
for Autonomous Driving
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Object-Centric Stereo Matching for 3D Object Detection
• Triangulation Learning Network: from Monocular to Stereo 3D Object Detection
• Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object
Detection for Autonomous Driving
• Stereo R-CNN based 3D Object Detection for Autonomous Driving
• 3D Object Proposals for Accurate Object Class Detection
Object-Centric Stereo Matching for 3D
Object Detection
• The current SoA for stereo 3D object detection takes an existing PSMNet stereo matching
network without modification, converts the estimated disparities into a 3D point cloud,
and feeds this point cloud into a LiDAR-based 3D object detector.
• The issue with existing stereo matching networks is that they are designed for disparity
estimation, not 3D object detection; the shape and accuracy of object point clouds are not
the focus.
• Stereo matching networks commonly suffer from inaccurate depth estimates at object
boundaries, which this method terms streaking, because background (BG) and foreground (FG)
points are estimated jointly.
• Existing networks also penalize disparity error, rather than the estimated 3D position of
object points, in their loss functions (a sketch contrasting the two losses follows this list).
• To address these two issues, this work proposes a 2D box association and object-centric
stereo matching method that estimates disparities only for the objects of interest.
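A minimal sketch of the loss point above, assuming a PyTorch-style formulation with rectified stereo and known intrinsics; the function names and signatures are illustrative, not the paper's code. The idea is to back-project predicted and ground-truth disparities into 3D and penalize the point positions rather than the raw disparity:

```python
import torch
import torch.nn.functional as F

def disparity_to_points(disp, u, v, fx, fy, cu, cv, baseline):
    """Back-project pixels (u, v) with disparity disp into 3D camera coordinates."""
    z = fx * baseline / disp.clamp(min=1e-6)   # depth from disparity
    x = (u - cu) * z / fx
    y = (v - cv) * z / fy
    return torch.stack([x, y, z], dim=-1)

def point_cloud_loss(disp_pred, disp_gt, u, v, fx, fy, cu, cv, baseline):
    """Penalize the 3D position of object points instead of raw disparity."""
    pts_pred = disparity_to_points(disp_pred, u, v, fx, fy, cu, cv, baseline)
    pts_gt   = disparity_to_points(disp_gt,   u, v, fx, fy, cu, cv, baseline)
    return F.smooth_l1_loss(pts_pred, pts_gt)

def disparity_loss(disp_pred, disp_gt):
    """Conventional stereo-matching loss on disparity only."""
    return F.smooth_l1_loss(disp_pred, disp_gt)
```

Because depth is inversely proportional to disparity, the same disparity error causes a much larger 3D error for distant objects, which is what a point-cloud loss accounts for.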
Object-Centric Stereo Matching for 3D
Object Detection
First, a 2D detector generates 2D boxes in Il and Ir. Next, a box association algorithm matches object detections
across both images. Each matched detection pair is passed into the object-centric stereo network, which jointly
produces a disparity map and instance segmentation mask for each object. Together, these form a disparity
map containing only the objects of interest. Lastly, the disparity map is transformed into a point cloud that can
be used by any LiDAR-based 3D object detection network to predict the 3D bounding boxes.
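As a rough sketch of the last step, assuming rectified stereo with focal lengths fx, fy, principal point (cu, cv), and a known baseline (names here are illustrative), the object-only disparity map can be back-projected into a point cloud as follows:

```python
import numpy as np

def object_disparity_to_point_cloud(disp, mask, fx, fy, cu, cv, baseline):
    """Back-project the masked (object-only) disparity map into a 3D point cloud.

    disp : (H, W) disparity map covering only the objects of interest
    mask : (H, W) boolean instance mask (True on object pixels)
    """
    v, u = np.nonzero(mask & (disp > 0))   # pixel coordinates of valid object points
    d = disp[v, u]
    z = fx * baseline / d                  # depth from disparity
    x = (u - cu) * z / fx
    y = (v - cv) * z / fy
    return np.stack([x, y, z], axis=1)     # (N, 3) points in the camera frame
```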
Object-Centric Stereo Matching for 3D
Object Detection
Qualitative results on KITTI. Ground truth and predictions are shown in red and green,
respectively. Colored points are predicted by the stereo matching network,
while LiDAR points are shown in black for visualization purposes only.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
• For 3D object detection from stereo images, the key challenge is how to effectively utilize
stereo information.
• Different from previous methods that use pixel-level depth maps, this method employs 3D
anchors to explicitly construct object-level correspondences between the RoIs in the stereo
images, from which the DNN learns to detect and triangulate the targeted object in 3D space.
• It introduces a cost-efficient channel reweighting strategy that enhances representational
features and weakens noisy signals to facilitate the learning process.
• All of these are flexibly integrated into a solid baseline detector that uses monocular images.
• It is demonstrated that both the monocular baseline and the stereo triangulation learning
network outperform the prior SoA in 3D object detection and localization on the KITTI dataset.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Overview of the 3D detection pipeline. The baseline monocular network is indicated with a
blue background and can be easily extended to stereo inputs by duplicating the baseline
and integrating the TLNet (Triangulation Learning Network).
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
• The baseline network, which takes a monocular image as input, is composed of a backbone and
three subsequent modules, i.e., front-view anchor generation, 3D box proposal, and refinement.
• The three-stage pipeline progressively reduces the search space by selecting confident
anchors, which greatly reduces computational complexity.
• The stereo 3D detection is performed by integrating a triangulation learning network (TLNet)
into the baseline model.
• In classical geometry, triangulation refers to localizing 3D points from multi-view images;
here, the objective is to localize a 3D object and estimate its size and orientation from
stereo images.
• To achieve this, an anchor triangulation scheme is introduced, in which the NN uses 3D
anchors as references to triangulate the targets (sketched below).
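A hedged sketch of the projection step behind anchor triangulation, assuming KITTI-style camera coordinates (y pointing down, yaw about the y-axis), rectified stereo, and a right camera offset by the baseline along x; the helper names are illustrative:

```python
import numpy as np

def anchor_corners(cx, cy, cz, l, w, h, ry):
    """8 corners of a 3D anchor box (camera frame, yaw ry about the y-axis)."""
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                  [ 0,          1, 0         ],
                  [-np.sin(ry), 0, np.cos(ry)]])
    return R @ np.vstack([x, y, z]) + np.array([[cx], [cy], [cz]])

def project_anchor_to_stereo(corners, fx, fy, cu, cv, baseline):
    """Project anchor corners into the left/right images and return the two RoIs."""
    X, Y, Z = corners
    u_left  = fx * X / Z + cu
    u_right = fx * (X - baseline) / Z + cu   # right camera is shifted by the baseline
    v       = fy * Y / Z + cv
    roi_l = (u_left.min(),  v.min(), u_left.max(),  v.max())
    roi_r = (u_right.min(), v.min(), u_right.max(), v.max())
    return roi_l, roi_r
```

The axis-aligned bounds of the projected corners give the left and right RoIs that RoIAlign then samples.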
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Front-view anchor generation. Potential anchors are those with high objectness in the front view. Only
the potential anchors are fed into the RPN, reducing the search space and saving computational cost.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Anchor triangulation. By projecting the 3D
anchor box onto the stereo images, a pair of
RoIs is obtained. The left RoI establishes a
geometric correspondence with the right one
via the anchor box. The nearby target is present
in both RoIs with slight positional differences.
The TLNet takes the RoI pair as input and
uses the 3D anchor as a reference to localize
the targeted object.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
The TLNet takes as input a pair of left-
right RoI features Fl and Fr with Croi
channels and size Hroi × Wroi, which are
obtained using RoIAlign by projecting
the same 3D anchor onto the left and right
frames. Left-right coherence scores are
used to reweight each channel. The
reweighted features are fused using
element-wise addition and passed to
task-specific fully connected layers to
predict the objectness confidence and
3D bounding box offsets, i.e., the 3D
geometric difference between the anchor
and the target.
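A minimal PyTorch-style sketch of this fusion, assuming the coherence score of a channel is the cosine similarity between its left and right feature maps (the paper's exact scoring may differ; the module name is illustrative):

```python
import torch
import torch.nn as nn

class ChannelReweightFusion(nn.Module):
    """Reweight each channel by a left-right coherence score, then fuse by addition."""
    def forward(self, feat_l, feat_r):
        # feat_l, feat_r: (N, C, H, W) RoIAlign features from the same 3D anchor
        n, c, h, w = feat_l.shape
        fl = feat_l.reshape(n, c, -1)
        fr = feat_r.reshape(n, c, -1)
        coherence = torch.cosine_similarity(fl, fr, dim=2)   # (N, C) per-channel score
        weight = coherence.unsqueeze(-1).unsqueeze(-1)       # broadcast over H, W
        return feat_l * weight + feat_r * weight             # fused feature for the FC heads
```

Channels that disagree between the two views (noisy signals) receive low weights, while coherent channels are enhanced before the fully connected prediction heads.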
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Orange bounding boxes are detection results, while green boxes are ground truths. For the main method, the
projected 3D bounding boxes are also visualized in the images, i.e., the first and fourth rows. The LiDAR point
clouds are visualized for reference but are not used in either training or evaluation. The triangulation learning
method reduces missed detections and improves depth prediction in distant regions.
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
• Approaches based on cheaper monocular or stereo imagery have, until now, resulted in
drastically lower accuracies, a gap that is commonly attributed to poor image-based
depth estimation.
• However, it is not the quality of the data but its representation that accounts for the majority
of the difference.
• Taking the inner workings of CNNs into consideration, image-based depth maps are converted
to pseudo-LiDAR representations, essentially mimicking the LiDAR signal.
• With this representation, different existing LiDAR-based detection algorithms can be applied.
• On the popular KITTI benchmark, this approach achieves impressive improvements over the
existing image-based state of the art, raising the detection accuracy of objects within the
30 m range from the previous state of the art of 22% to an unprecedented 74%.
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
The pipeline for image-based 3D object detection. Given stereo or monocular images, the depth map is first
predicted and then back-projected into a 3D point cloud in the LiDAR coordinate system. This representation is
referred to as pseudo-LiDAR and is processed exactly like LiDAR: any LiDAR-based detection algorithm can be applied.
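A minimal sketch of this back-projection, assuming a pinhole camera with focal lengths fx, fy and principal point (cu, cv) (names are illustrative); moving the points into the LiDAR frame would apply the camera-to-LiDAR extrinsics from the calibration, omitted here:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cu, cv):
    """Back-project a depth map into a 3D point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid
    z = depth
    x = (u - cu) * z / fx
    y = (v - cv) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep valid depths only
```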
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
Applying a single 2D convolution with a uniform kernel to the frontal-view depth map (top left) yields a depth
map (top right) that, after being back-projected into pseudo-LiDAR and displayed from the bird's-eye view (bottom
right), reveals a large depth distortion in comparison to the original pseudo-LiDAR representation (bottom left),
especially for far-away objects. Points of each car instance are marked with a color. The boxes are superimposed
and contain all points of the green and cyan cars, respectively.
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
Qualitative comparison of AVOD with LiDAR, pseudo-LiDAR, and frontal-view (stereo) inputs.
Ground-truth boxes are in red, predicted boxes in green; the observer in the pseudo-LiDAR
plots (bottom row) is at the far left looking to the right. The frontal-view approach
(right) even miscalculates the depths of nearby objects and misses far-away objects entirely.
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
• A 3D object detection method for autonomous driving that fully exploits the sparse and
dense, semantic and geometric information in stereo imagery.
• This method, called Stereo R-CNN, extends Faster R-CNN to stereo inputs to
simultaneously detect and associate objects in the left and right images.
• Extra branches are added after the stereo Region Proposal Network (RPN) to predict sparse
keypoints, viewpoints, and object dimensions, which are combined with the 2D left-right boxes
to calculate a coarse 3D object bounding box.
• The accurate 3D bounding box is then recovered by region-based photometric alignment
using the left and right RoIs (a sketch follows this list).
• This method requires neither depth input nor 3D position supervision, yet it
outperforms all existing fully supervised image-based methods.
• Code released at https://github.com/HKUST-Aerial-Robotics/Stereo-RCNN.
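A hedged, simplified sketch of the alignment idea: search over candidate center depths, warp the right image by the implied disparity, and keep the depth with the lowest photometric error over the left RoI. The paper instead solves for the depth with per-pixel offsets derived from the box geometry; all names below are illustrative:

```python
import numpy as np

def align_depth_by_photometric_error(img_l, img_r, roi, fx, baseline,
                                     z_init, search=2.0, steps=81):
    """Refine the object's center depth by dense photometric alignment (sketch).

    roi : (u_min, v_min, u_max, v_max) left-image RoI in integer pixel coordinates.
    """
    u0, v0, u1, v1 = roi
    patch_l = img_l[v0:v1, u0:u1].astype(np.float64)
    best_z, best_err = z_init, np.inf
    for z in np.linspace(z_init - search, z_init + search, steps):
        if z <= 0:
            continue
        d = int(round(fx * baseline / z))        # disparity implied by depth z
        if u0 - d < 0:
            continue
        patch_r = img_r[v0:v1, u0 - d:u1 - d].astype(np.float64)
        if patch_r.shape != patch_l.shape:
            continue
        err = np.mean((patch_l - patch_r) ** 2)  # photometric (SSD) error
        if err < best_err:
            best_z, best_err = z, err
    return best_z
```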
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
The stereo R-CNN outputs stereo boxes, keypoints, dimensions, and the viewpoint angle,
followed by the 3D box estimation and the dense 3D box alignment module.
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
Relations between the object orientation θ,
azimuth β, and viewpoint θ + β. Only identical
viewpoints lead to identical projections.
Different target assignments for RPN classification and
regression.
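Under the convention used on this slide, the viewpoint angle can be computed from the orientation and the object center; this is a sketch, and sign conventions differ across datasets, so treat the formula as an assumption:

```python
import numpy as np

def viewpoint_angle(theta, x, z):
    """Viewpoint angle of an object (sketch).

    theta : object orientation (yaw) in the camera frame
    x, z  : object center in the camera frame
    Assumes azimuth beta = arctan(x / z) and viewpoint = theta + beta.
    """
    beta = np.arctan2(x, z)
    return theta + beta
```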
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
3D semantic keypoints, the 2D perspective keypoint, and boundary keypoints.
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
Sparse constraints for the 3D box estimation
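The figure summarizes how the left/right 2D box edges and the perspective keypoint constrain the 3D box. As a hedged simplification, and not the paper's full least-squares solve over all sparse constraints, the box-level disparity between the associated left and right boxes already yields a coarse 3D center:

```python
import numpy as np

def coarse_center_from_stereo_boxes(box_l, box_r, fx, fy, cu, cv, baseline):
    """Coarse 3D object center from associated left/right 2D boxes (sketch).

    box_l, box_r : (u_min, v_min, u_max, v_max) in the left / right image.
    """
    ul = 0.5 * (box_l[0] + box_l[2])   # left-box horizontal center
    ur = 0.5 * (box_r[0] + box_r[2])   # right-box horizontal center
    vl = 0.5 * (box_l[1] + box_l[3])   # vertical center (shared row after rectification)
    disparity = max(ul - ur, 1e-6)
    z = fx * baseline / disparity      # depth from the box-level disparity
    x = (ul - cu) * z / fx
    y = (vl - cv) * z / fy
    return np.array([x, y, z])
```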
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
From top to bottom: detections on left image, right image, and bird’s eye view image.