3D Interpretation from Stereo Images
for Autonomous Driving
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Object-Centric Stereo Matching for 3D Object Detection
• Triangulation Learning Network: from Monocular to Stereo 3D Object Detection
• Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object
Detection for Autonomous Driving
• Stereo R-CNN based 3D Object Detection for Autonomous Driving
• 3D Object Proposals for Accurate Object Class Detection
Object-Centric Stereo Matching for 3D
Object Detection
• The current SoA for stereo 3D object detection takes an existing PSMNet stereo matching
network without modification, converts the estimated disparities into a 3D point cloud,
and feeds this point cloud into a LiDAR-based 3D object detector.
• The issue with existing stereo matching networks is that they are designed for disparity
estimation, not 3D object detection; the shape and accuracy of object point clouds are not
the focus.
• Stereo matching networks commonly suffer from inaccurate depth estimates at object
boundaries, which this method terms streaking, because background (BG) and foreground (FG)
points are estimated jointly.
• Existing networks also penalize disparity error, rather than the estimated 3D position of
object points, in their loss functions (a sketch contrasting the two losses follows this list).
• To address these two issues, this work proposes a 2D box association and object-centric
stereo matching method that estimates disparities only for the objects of interest.
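A minimal sketch of the loss point above, assuming a PyTorch-style formulation with rectified stereo and known intrinsics; the function names and signatures are illustrative, not the paper's code. The idea is to back-project predicted and ground-truth disparities into 3D and penalize the point positions rather than the raw disparity:

```python
import torch
import torch.nn.functional as F

def disparity_to_points(disp, u, v, fx, fy, cu, cv, baseline):
    """Back-project pixels (u, v) with disparity disp into 3D camera coordinates."""
    z = fx * baseline / disp.clamp(min=1e-6)   # depth from disparity
    x = (u - cu) * z / fx
    y = (v - cv) * z / fy
    return torch.stack([x, y, z], dim=-1)

def point_cloud_loss(disp_pred, disp_gt, u, v, fx, fy, cu, cv, baseline):
    """Penalize the 3D position of object points instead of raw disparity."""
    pts_pred = disparity_to_points(disp_pred, u, v, fx, fy, cu, cv, baseline)
    pts_gt   = disparity_to_points(disp_gt,   u, v, fx, fy, cu, cv, baseline)
    return F.smooth_l1_loss(pts_pred, pts_gt)

def disparity_loss(disp_pred, disp_gt):
    """Conventional stereo-matching loss on disparity only."""
    return F.smooth_l1_loss(disp_pred, disp_gt)
```

Because depth is inversely proportional to disparity, the same disparity error causes a much larger 3D error for distant objects, which is what a point-cloud loss accounts for.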
Object-Centric Stereo Matching for 3D
Object Detection
First, a 2D detector generates 2D boxes in Il and Ir. Next, a box association algorithm matches object detections
across both images. Each matched detection pair is passed into the object-centric stereo network, which jointly
produces a disparity map and instance segmentation mask for each object. Together, these form a disparity
map containing only the objects of interest. Lastly, the disparity map is transformed into a point cloud that can
be used by any LiDAR-based 3D object detection network to predict the 3D bounding boxes.
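As a rough sketch of the last step, assuming rectified stereo with focal lengths fx, fy, principal point (cu, cv), and a known baseline (names here are illustrative), the object-only disparity map can be back-projected into a point cloud as follows:

```python
import numpy as np

def object_disparity_to_point_cloud(disp, mask, fx, fy, cu, cv, baseline):
    """Back-project the masked (object-only) disparity map into a 3D point cloud.

    disp : (H, W) disparity map covering only the objects of interest
    mask : (H, W) boolean instance mask (True on object pixels)
    """
    v, u = np.nonzero(mask & (disp > 0))   # pixel coordinates of valid object points
    d = disp[v, u]
    z = fx * baseline / d                  # depth from disparity
    x = (u - cu) * z / fx
    y = (v - cv) * z / fy
    return np.stack([x, y, z], axis=1)     # (N, 3) points in the camera frame
```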
Object-Centric Stereo Matching for 3D
Object Detection
Qualitative results on KITTI. Ground truth and predictions are shown in red and green,
respectively. Colored points are predicted by the stereo matching network,
while LiDAR points are shown in black for visualization purposes only.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
• For 3D object detection from stereo images, the key challenge is how to effectively utilize
stereo information.
• Different from previous methods that use pixel-level depth maps, this method employs 3D
anchors to explicitly construct object-level correspondences between the RoIs in the stereo
images, from which the DNN learns to detect and triangulate the targeted object in 3D space.
• It introduces a cost-efficient channel reweighting strategy that enhances representational
features and weakens noisy signals to facilitate the learning process.
• All of these are flexibly integrated into a solid baseline detector that uses monocular images.
• It is demonstrated that both the monocular baseline and the stereo triangulation learning
network outperform the prior SoA in 3D object detection and localization on the KITTI dataset.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Overview of the 3D detection pipeline. The baseline monocular network is indicated with a
blue background and can be easily extended to stereo inputs by duplicating the baseline
and integrating the TLNet (Triangulation Learning Network).
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
• The baseline network, which takes a monocular image as input, is composed of a backbone and
three subsequent modules, i.e., front-view anchor generation, 3D box proposal, and refinement.
• The three-stage pipeline progressively reduces the search space by selecting confident
anchors, which greatly reduces computational complexity.
• The stereo 3D detection is performed by integrating a triangulation learning network (TLNet)
into the baseline model.
• In classical geometry, triangulation refers to localizing 3D points from multi-view images;
here, the objective is to localize a 3D object and estimate its size and orientation from
stereo images.
• To achieve this, an anchor triangulation scheme is introduced, in which the NN uses 3D
anchors as references to triangulate the targets (sketched below).
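A hedged sketch of the projection step behind anchor triangulation, assuming KITTI-style camera coordinates (y pointing down, yaw about the y-axis), rectified stereo, and a right camera offset by the baseline along x; the helper names are illustrative:

```python
import numpy as np

def anchor_corners(cx, cy, cz, l, w, h, ry):
    """8 corners of a 3D anchor box (camera frame, yaw ry about the y-axis)."""
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                  [ 0,          1, 0         ],
                  [-np.sin(ry), 0, np.cos(ry)]])
    return R @ np.vstack([x, y, z]) + np.array([[cx], [cy], [cz]])

def project_anchor_to_stereo(corners, fx, fy, cu, cv, baseline):
    """Project anchor corners into the left/right images and return the two RoIs."""
    X, Y, Z = corners
    u_left  = fx * X / Z + cu
    u_right = fx * (X - baseline) / Z + cu   # right camera is shifted by the baseline
    v       = fy * Y / Z + cv
    roi_l = (u_left.min(),  v.min(), u_left.max(),  v.max())
    roi_r = (u_right.min(), v.min(), u_right.max(), v.max())
    return roi_l, roi_r
```

The axis-aligned bounds of the projected corners give the left and right RoIs that RoIAlign then samples.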
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Front-view anchor generation. Potential anchors are those with high objectness in the front view. Only
the potential anchors are fed into the RPN, reducing the search space and saving computational cost.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Anchor triangulation. By projecting the 3D
anchor box onto the stereo images, a pair of
RoIs is obtained. The left RoI establishes a
geometric correspondence with the right one
via the anchor box. The nearby target is present
in both RoIs with slight positional differences.
The TLNet takes the RoI pair as input and
uses the 3D anchor as a reference to localize
the targeted object.
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
The TLNet takes as input a pair of left-
right RoI features Fl and Fr with Croi
channels and size Hroi × Wroi, which are
obtained using RoIAlign by projecting
the same 3D anchor onto the left and right
frames. Left-right coherence scores are
used to reweight each channel. The
reweighted features are fused using
element-wise addition and passed to
task-specific fully connected layers to
predict the objectness confidence and
3D bounding box offsets, i.e., the 3D
geometric difference between the anchor
and the target.
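A minimal PyTorch-style sketch of this fusion, assuming the coherence score of a channel is the cosine similarity between its left and right feature maps (the paper's exact scoring may differ; the module name is illustrative):

```python
import torch
import torch.nn as nn

class ChannelReweightFusion(nn.Module):
    """Reweight each channel by a left-right coherence score, then fuse by addition."""
    def forward(self, feat_l, feat_r):
        # feat_l, feat_r: (N, C, H, W) RoIAlign features from the same 3D anchor
        n, c, h, w = feat_l.shape
        fl = feat_l.reshape(n, c, -1)
        fr = feat_r.reshape(n, c, -1)
        coherence = torch.cosine_similarity(fl, fr, dim=2)   # (N, C) per-channel score
        weight = coherence.unsqueeze(-1).unsqueeze(-1)       # broadcast over H, W
        return feat_l * weight + feat_r * weight             # fused feature for the FC heads
```

Channels that disagree between the two views (noisy signals) receive low weights, while coherent channels are enhanced before the fully connected prediction heads.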
Triangulation Learning Network: from
Monocular to Stereo 3D Object Detection
Orange bounding boxes are detection results, while green boxes are ground truths. For the main method, the
projected 3D bounding boxes are also visualized in the images, i.e., the first and fourth rows. The LiDAR point
clouds are visualized for reference but are not used in either training or evaluation. The triangulation learning
method reduces missed detections and improves depth prediction in distant regions.
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
• Approaches based on cheaper monocular or stereo imagery have, until now, resulted in
drastically lower accuracies, a gap that is commonly attributed to poor image-based
depth estimation.
• However, it is not the quality of the data but its representation that accounts for the majority
of the difference.
• Taking the inner workings of CNNs into consideration, image-based depth maps are converted
to pseudo-LiDAR representations, essentially mimicking the LiDAR signal.
• With this representation, different existing LiDAR-based detection algorithms can be applied.
• On the popular KITTI benchmark, this approach achieves impressive improvements over the
existing image-based state of the art, raising the detection accuracy of objects within the
30 m range from the previous state of the art of 22% to an unprecedented 74%.
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
The pipeline for image-based 3D object detection. Given stereo or monocular images, the depth map is first
predicted and then back-projected into a 3D point cloud in the LiDAR coordinate system. This representation is
referred to as pseudo-LiDAR and is processed exactly like LiDAR: any LiDAR-based detection algorithm can be applied.
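A minimal sketch of this back-projection, assuming a pinhole camera with focal lengths fx, fy and principal point (cu, cv) (names are illustrative); moving the points into the LiDAR frame would apply the camera-to-LiDAR extrinsics from the calibration, omitted here:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cu, cv):
    """Back-project a depth map into a 3D point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid
    z = depth
    x = (u - cu) * z / fx
    y = (v - cv) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep valid depths only
```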
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
Applying a single 2D convolution with a uniform kernel to the frontal-view depth map (top left) yields a depth
map (top right) that, after being back-projected into pseudo-LiDAR and displayed from the bird's-eye view (bottom
right), reveals a large depth distortion in comparison to the original pseudo-LiDAR representation (bottom left),
especially for far-away objects. Points of each car instance are marked with a color. The boxes are superimposed
and contain all points of the green and cyan cars, respectively.
Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
Qualitative comparison of AVOD with LiDAR, pseudo-LiDAR, and frontal-view (stereo) inputs.
Ground-truth boxes are in red, predicted boxes in green; the observer in the pseudo-LiDAR
plots (bottom row) is at the far left looking to the right. The frontal-view approach
(right) even miscalculates the depths of nearby objects and misses far-away objects entirely.
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
• A 3D object detection method for autonomous driving that fully exploits the sparse and
dense, semantic and geometric information in stereo imagery.
• This method, called Stereo R-CNN, extends Faster R-CNN to stereo inputs to
simultaneously detect and associate objects in the left and right images.
• Extra branches are added after the stereo Region Proposal Network (RPN) to predict sparse
keypoints, viewpoints, and object dimensions, which are combined with the 2D left-right boxes
to calculate a coarse 3D object bounding box.
• The accurate 3D bounding box is then recovered by region-based photometric alignment
using the left and right RoIs (a sketch follows this list).
• This method requires neither depth input nor 3D position supervision, yet it
outperforms all existing fully supervised image-based methods.
• Code released at https://github.com/HKUST-Aerial-Robotics/Stereo-RCNN.
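A hedged, simplified sketch of the alignment idea: search over candidate center depths, warp the right image by the implied disparity, and keep the depth with the lowest photometric error over the left RoI. The paper instead solves for the depth with per-pixel offsets derived from the box geometry; all names below are illustrative:

```python
import numpy as np

def align_depth_by_photometric_error(img_l, img_r, roi, fx, baseline,
                                     z_init, search=2.0, steps=81):
    """Refine the object's center depth by dense photometric alignment (sketch).

    roi : (u_min, v_min, u_max, v_max) left-image RoI in integer pixel coordinates.
    """
    u0, v0, u1, v1 = roi
    patch_l = img_l[v0:v1, u0:u1].astype(np.float64)
    best_z, best_err = z_init, np.inf
    for z in np.linspace(z_init - search, z_init + search, steps):
        if z <= 0:
            continue
        d = int(round(fx * baseline / z))        # disparity implied by depth z
        if u0 - d < 0:
            continue
        patch_r = img_r[v0:v1, u0 - d:u1 - d].astype(np.float64)
        if patch_r.shape != patch_l.shape:
            continue
        err = np.mean((patch_l - patch_r) ** 2)  # photometric (SSD) error
        if err < best_err:
            best_z, best_err = z, err
    return best_z
```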
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
The stereo R-CNN outputs stereo boxes, keypoints, dimensions, and the viewpoint angle,
followed by the 3D box estimation and the dense 3D box alignment module.
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
Relations between the object orientation θ,
azimuth β, and viewpoint θ + β. Only identical
viewpoints lead to identical projections.
Different target assignments for RPN classification and
regression.
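Under the convention used on this slide, the viewpoint angle can be computed from the orientation and the object center; this is a sketch, and sign conventions differ across datasets, so treat the formula as an assumption:

```python
import numpy as np

def viewpoint_angle(theta, x, z):
    """Viewpoint angle of an object (sketch).

    theta : object orientation (yaw) in the camera frame
    x, z  : object center in the camera frame
    Assumes azimuth beta = arctan(x / z) and viewpoint = theta + beta.
    """
    beta = np.arctan2(x, z)
    return theta + beta
```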
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
3D semantic keypoints, the 2D perspective keypoint, and boundary keypoints.
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
Sparse constraints for the 3D box estimation
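The figure summarizes how the left/right 2D box edges and the perspective keypoint constrain the 3D box. As a hedged simplification, and not the paper's full least-squares solve over all sparse constraints, the box-level disparity between the associated left and right boxes already yields a coarse 3D center:

```python
import numpy as np

def coarse_center_from_stereo_boxes(box_l, box_r, fx, fy, cu, cv, baseline):
    """Coarse 3D object center from associated left/right 2D boxes (sketch).

    box_l, box_r : (u_min, v_min, u_max, v_max) in the left / right image.
    """
    ul = 0.5 * (box_l[0] + box_l[2])   # left-box horizontal center
    ur = 0.5 * (box_r[0] + box_r[2])   # right-box horizontal center
    vl = 0.5 * (box_l[1] + box_l[3])   # vertical center (shared row after rectification)
    disparity = max(ul - ur, 1e-6)
    z = fx * baseline / disparity      # depth from the box-level disparity
    x = (ul - cu) * z / fx
    y = (vl - cv) * z / fy
    return np.array([x, y, z])
```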
Stereo R-CNN based 3D Object Detection
for Autonomous Driving
From top to bottom: detections on left image, right image, and bird’s eye view image.