VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection

Tokyo Institute of Technology
T2 Inc.
CVPR 2024

*Indicates Equal Contribution

Inference video of weakly supervised monocular 3D object detection on a KITTI-360 subset, using the pseudo labels generated by our proposed auto-labeling and MonoFlex as the monocular 3D detector. For clearer visualization, we apply 3D NMS with an IoU threshold of 0.3.


1.Abstract

   Monocular 3D object detection poses a significant challenge in 3D scene understanding due to its inherently ill-posed nature in monocular depth estimation. Existing methods heavily rely on supervised learning using abundant 3D labels, typically obtained through expensive and labor-intensive annotation on LiDAR point clouds. To tackle this problem, we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) to train 3D object detectors without any 3D supervision but only weak 2D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage, we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering, we decompose the SDF of each instance into the SDF of a cuboid and the residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset, demonstrating that our method outperforms the existing weakly supervised 3D object detection methods.


Figure.1 Illustration of our proposed weakly supervised 3D object detection framework, which consists of multi-view 3D autolabeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage.

2.Method

2.1 Multi-View 3D Auto-Labeling

  • Visualization of instance SDF optimization process in the multi-view 3D auto-labeling stage


The instance SDF is optimized from a randomly initialized box SDF to the final instance SDF using our proposed pipeline. The views on the right show the optimization process in the bird's-eye view.


  • Pipeline Overview


Figure.2. Illustration of the pipeline of our proposed multi-view 3D auto-labeling. We represent the surface of each instance as an SDF and decompose it into the SDF of a 3D bounding box and the residual distance field (RDF), which is learned via a hypernetwork. The composed instance SDF is used to render the silhouette of the instance through our proposed instance-aware volumetric silhouette rendering. All the 3D bounding boxes are optimized based on the loss between the rendered and ground truth instance masks.
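The cuboid term of this decomposition is a standard analytic box SDF. A minimal numpy sketch (the yaw-about-the-vertical-axis convention and all names are our illustrative assumptions, not the paper's code):

```python
import numpy as np

def box_sdf(points, dimension, location, yaw):
    """Signed distance from world-space points (M, 3) to an oriented box.

    dimension: full box extents (3,); location: box center (3,);
    yaw: rotation angle about the vertical (y) axis, as in a BEV orientation.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    local = (points - location) @ R          # transform into the box frame
    q = np.abs(local) - dimension / 2.0      # distance to each face pair
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(np.max(q, axis=-1), 0.0)
    return outside + inside                  # negative inside, positive outside
```

For a 2×2×2 box centered at the origin, this gives −1 at the center and 0 on each face, as expected for a signed distance field.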


Problem Definition: Given a monocular video consisting of posed frames, each frame annotated with instance masks, our goal is to optimize the 3D bounding boxes frame by frame without 3D supervision. More specifically, for each target frame in the video, we sample multiple source frames and optimize the 3D bounding boxes $\{B_n\}_{n=1}^{N}$ in the target frame using the instance masks of the source frames as weak supervision, where $N$ denotes the number of instances in the target frame. We parameterize the $n$-th 3D bounding box in the target frame with a dimension $d_n \in \mathbb{R}^3$, a location $l_n \in \mathbb{R}^3$, and an orientation $\theta_n \in \mathbb{R}$, which is the rotation angle in the bird's-eye view. In addition to these parameters for each bounding box, we prepare a learnable instance embedding $e_n$ for each instance and a shared hypernetwork parameterized by $\phi$ for the RDF.

The learnable parameters are therefore the dimensions $\{d_n\}$, locations $\{l_n\}$, orientations $\{\theta_n\}$, and instance embeddings $\{e_n\}$, together with the hypernetwork parameters $\phi$. Our target is to optimize these parameters with respect to a loss function between the rendered and ground truth instance masks by stochastic gradient descent.
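A minimal sketch of this per-frame optimization loop, with a toy stand-in loss and central finite-difference gradients in place of autodiff through the renderer (all names, shapes, and the loss are illustrative assumptions):

```python
import numpy as np

# Hypothetical per-frame parameter set (names and shapes are illustrative).
rng = np.random.default_rng(0)
N, E = 3, 8  # number of instances, embedding size
params = {
    "dimensions": np.ones((N, 3)),           # d_n
    "locations": rng.normal(size=(N, 3)),    # l_n
    "orientations": np.zeros(N),             # theta_n (BEV yaw)
    "embeddings": rng.normal(size=(N, E)),   # e_n
}

def toy_loss(p):
    # Stand-in for the mask rendering loss: pulls the locations toward the origin.
    return float(np.sum(p["locations"] ** 2))

def sgd_step(params, loss_fn, lr=0.1, eps=1e-4):
    """One SGD step with central finite-difference gradients."""
    for v in params.values():
        grad = np.zeros_like(v)
        it = np.nditer(v, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            old = v[idx]
            v[idx] = old + eps
            lp = loss_fn(params)
            v[idx] = old - eps
            lm = loss_fn(params)
            v[idx] = old
            grad[idx] = (lp - lm) / (2.0 * eps)
        v -= lr * grad
    return params
```

In the actual pipeline, the loss is the comparison between rendered and ground truth instance masks, and gradients flow analytically through the differentiable renderer rather than via finite differences.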


  • Instance-Aware Volumetric Silhouette Rendering


Figure.3 Illustration of our proposed instance-aware volumetric silhouette rendering. The instance labels are averaged for each sampled point along a ray based on the signed distance to each instance. The averaged instance labels are integrated along the ray based on the SDF-based volume rendering formulation.


The core idea is to render instance masks and compare them with ground truth instance masks. To achieve this, we propose a novel SDF-based volumetric silhouette rendering, where instance labels instead of colors are integrated along a ray based on the SDF-based volume rendering formulation:

$$\hat{y}(\mathbf{r}) = \int T(t)\,\sigma(\mathbf{r}(t))\,\bar{y}(\mathbf{r}(t))\,dt, \qquad \bar{y}(\mathbf{x}) = \sum_{n=1}^{N} w_n(\mathbf{x})\,y_n, \qquad w_n(\mathbf{x}) = \frac{\exp\left(-s_n(\mathbf{x})/\tau\right)}{\sum_{m=1}^{N}\exp\left(-s_m(\mathbf{x})/\tau\right)},$$

where $\hat{y}(\mathbf{r})$ denotes the rendered soft instance label of the ray $\mathbf{r}$, $\bar{y}(\mathbf{x})$ denotes the weighted average instance label at the position $\mathbf{x}$, indicating how relatively close the position is to each instance, $y_n$ denotes the one-hot instance label of the $n$-th instance, and $s_n$ denotes the SDF of the $n$-th instance. $T(t)$ and $\sigma$ denote the transmittance and the density derived from the scene SDF.
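A numpy sketch of this rendering for a single ray, using a softmax over negative signed distances for the per-point label average and a Laplace-CDF density mapping common in SDF-based volume rendering (the exact density mapping and the values of tau and beta are our illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def render_silhouette(sdfs, tau=0.05, beta=0.1):
    """Render soft instance labels for one ray.

    sdfs: (S, N) signed distances of S samples along the ray to N instances.
    tau, beta: label temperature and density scale (illustrative values).
    """
    # Per-sample soft instance label: the closer an instance, the larger its weight.
    labels = softmax(-sdfs / tau, axis=-1)                         # (S, N)
    # Scene SDF as the minimum over instances, mapped to density via a Laplace CDF.
    s = sdfs.min(axis=-1)                                          # (S,)
    sigma = np.where(s > 0.0,
                     0.5 * np.exp(-s / beta),
                     1.0 - 0.5 * np.exp(s / beta)) / beta
    dt = 1.0 / len(s)                                              # uniform step along the ray
    alpha = 1.0 - np.exp(-sigma * dt)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance T
    weights = trans * alpha                                        # (S,)
    return (weights[:, None] * labels).sum(axis=0)                 # (N,) rendered soft label
```

A ray that passes through one instance and misses the others should yield a rendered label dominated by that instance, with the total opacity bounded by one.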

2.2 Training of 3D object detectors using pseudo labels

  • Confidence Assignment

Figure.4 Comparison of the confidence scores in static and dynamic scenes. The confidence scores are lower for dynamic, occluded, or truncated objects, so these objects have less influence on the subsequent training of 3D object detectors. The confidence score is computed from the multi-view projection loss.
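One possible mapping from the per-instance multi-view projection loss to a confidence score; the exponential form and the temperature are illustrative assumptions, not necessarily the paper's exact formula:

```python
import numpy as np

def confidence_from_projection_loss(losses, temperature=1.0):
    """Map per-instance multi-view projection losses to confidences in (0, 1].

    A zero loss gives confidence 1; larger losses (dynamic, occluded, or
    truncated objects) decay smoothly toward 0.
    """
    return np.exp(-np.asarray(losses, dtype=float) / temperature)
```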


  • Confidence-based Weighted Loss

$$\mathcal{L} = \frac{1}{N_{\mathrm{pos}}} \sum_{i} \hat{c}_{\pi(i)}\, \mathcal{L}_{\mathrm{box}}\!\left(\hat{B}_i,\, B_{\pi(i)}\right),$$

where $\hat{B}_i$ denotes the predicted 3D bounding boxes, $B_j$ denotes the 3D bounding boxes optimized by the proposed auto-labeling, $\hat{c}_j$ denotes the corresponding confidence scores, $N_{\mathrm{pos}}$ denotes the number of positive anchors, and $\pi$ denotes the label assigner that maps the index of an anchor to that of the matched ground truth.
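The confidence-weighted loss can be sketched as follows, with a plain L1 box loss standing in for the detector's actual regression loss (all names are illustrative):

```python
import numpy as np

def confidence_weighted_loss(pred_boxes, pseudo_boxes, confidences, assignment):
    """Confidence-weighted box regression loss over positive anchors.

    pred_boxes: (P, 7) predicted boxes for the positive anchors.
    pseudo_boxes: (G, 7) boxes optimized by the auto-labeling.
    confidences: (G,) confidence scores of the pseudo boxes.
    assignment: (P,) index of the pseudo box matched to each positive anchor.
    """
    matched = pseudo_boxes[assignment]                      # (P, 7)
    weights = confidences[assignment]                       # (P,)
    per_anchor = np.abs(pred_boxes - matched).sum(axis=-1)  # L1 stand-in box loss
    return float((weights * per_anchor).sum() / max(len(assignment), 1))
```

Down-weighting anchors matched to low-confidence pseudo boxes keeps noisy labels (dynamic, occluded, or truncated objects) from dominating the training signal.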

3.Evaluation Results

  • Evaluation results of monocular 3D object detection on the KITTI-360 test set. Some baselines are reproduced with the official code, and some use CAD models as extra data; for each model-agnostic method, the 3D detection model it is paired with is indicated.

  • Evaluation results of semi-supervised monocular 3D object detection on the KITTI validation set. We choose MonoFlex and MonoDETR as the 3D object detectors to verify the quality of the generated pseudo labels.

4.Visualization Results

BibTeX

@misc{liu2024vsrd,
        title={VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection}, 
        author={Zihua Liu and Hiroki Sakuma and Masatoshi Okutomi},
        year={2024},
        eprint={2404.00149},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
  }