VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection

Tokyo Institute of Technology
T2 Inc.
CVPR 2024

*Indicates Equal Contribution

Inference video of weakly supervised monocular 3D object detection on a KITTI-360 subset, using the pseudo labels generated by our proposed auto-labeling and MonoFlex as the monocular 3D detector. For clearer visualization, we apply 3D NMS with an IoU threshold of 0.3.


1.Abstract

   Monocular 3D object detection poses a significant challenge in 3D scene understanding due to its inherently ill-posed nature in monocular depth estimation. Existing methods heavily rely on supervised learning using abundant 3D labels, typically obtained through expensive and labor-intensive annotation on LiDAR point clouds. To tackle this problem, we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) to train 3D object detectors without any 3D supervision but only weak 2D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage, we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering, we decompose the SDF of each instance into the SDF of a cuboid and the residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset, demonstrating that our method outperforms the existing weakly supervised 3D object detection methods.


Figure.1 Illustration of our proposed weakly supervised 3D object detection framework, which consists of multi-view 3D autolabeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage.

2.Method

2.1 Multi-View 3D Auto-Labeling

  • Visualization of instance SDF optimization process in the multi-view 3D auto-labeling stage


The instance SDF is optimized from a randomly initialized box SDF to the final instance SDF using our proposed pipeline. The views on the right show the optimization process in the bird's-eye view.


  • Pipeline Overview


Figure.2. Illustration of the pipeline of our proposed multi-view 3D auto-labeling. We represent the surface of each instance as an SDF and decompose it into the SDF of a 3D bounding box and the residual distance field (RDF), which is learned via a hypernetwork. The composed instance SDF is used to render the silhouette of the instance through our proposed instance-aware volumetric silhouette rendering. All the 3D bounding boxes are optimized based on the loss between the rendered and ground truth instance masks.
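The cuboid term of this decomposition is a standard analytic box SDF. A minimal numpy sketch (the yaw-about-the-vertical-axis convention and all names are our illustrative assumptions, not the paper's code):

```python
import numpy as np

def box_sdf(points, dimension, location, yaw):
    """Signed distance from world-space points (M, 3) to an oriented box.

    dimension: full box extents (3,); location: box center (3,);
    yaw: rotation angle about the vertical (y) axis, as in a BEV orientation.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    local = (points - location) @ R          # transform into the box frame
    q = np.abs(local) - dimension / 2.0      # distance to each face pair
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(np.max(q, axis=-1), 0.0)
    return outside + inside                  # negative inside, positive outside
```

For a 2×2×2 box centered at the origin, this gives −1 at the center and 0 on each face, as expected for a signed distance field.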


Problem Definition: Given a monocular video consisting of posed frames, each frame annotated with instance masks, our goal is to optimize the 3D bounding boxes frame by frame without 3D supervision. More specifically, for each target frame in the video, we sample multiple source frames and optimize the 3D bounding boxes $\{B_n\}_{n=1}^{N}$ in the target frame using the instance masks of the source frames as weak supervision, where $N$ denotes the number of instances in the target frame. We parameterize the $n$-th 3D bounding box in the target frame with a dimension $d_n \in \mathbb{R}^3$, a location $l_n \in \mathbb{R}^3$, and an orientation $\theta_n \in \mathbb{R}$, which is the rotation angle in the bird's-eye view. In addition to these parameters for each bounding box, we prepare a learnable instance embedding $e_n$ for each instance and a shared hypernetwork parameterized by $\phi$ for the RDF.

The learnable parameters are therefore the dimensions $\{d_n\}$, locations $\{l_n\}$, orientations $\{\theta_n\}$, and instance embeddings $\{e_n\}$, together with the hypernetwork parameters $\phi$. Our target is to optimize these parameters with respect to a loss function between the rendered and ground truth instance masks by stochastic gradient descent.
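A minimal sketch of this per-frame optimization loop, with a toy stand-in loss and central finite-difference gradients in place of autodiff through the renderer (all names, shapes, and the loss are illustrative assumptions):

```python
import numpy as np

# Hypothetical per-frame parameter set (names and shapes are illustrative).
rng = np.random.default_rng(0)
N, E = 3, 8  # number of instances, embedding size
params = {
    "dimensions": np.ones((N, 3)),           # d_n
    "locations": rng.normal(size=(N, 3)),    # l_n
    "orientations": np.zeros(N),             # theta_n (BEV yaw)
    "embeddings": rng.normal(size=(N, E)),   # e_n
}

def toy_loss(p):
    # Stand-in for the mask rendering loss: pulls the locations toward the origin.
    return float(np.sum(p["locations"] ** 2))

def sgd_step(params, loss_fn, lr=0.1, eps=1e-4):
    """One SGD step with central finite-difference gradients."""
    for v in params.values():
        grad = np.zeros_like(v)
        it = np.nditer(v, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            old = v[idx]
            v[idx] = old + eps
            lp = loss_fn(params)
            v[idx] = old - eps
            lm = loss_fn(params)
            v[idx] = old
            grad[idx] = (lp - lm) / (2.0 * eps)
        v -= lr * grad
    return params
```

In the actual pipeline, the loss is the comparison between rendered and ground truth instance masks, and gradients flow analytically through the differentiable renderer rather than via finite differences.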


  • Instance-Aware Volumetric Silhouette Rendering


Figure.3 Illustration of our proposed instance-aware volumetric silhouette rendering. The instance labels are averaged for each sampled point along a ray based on the signed distance to each instance. The averaged instance labels are integrated along the ray based on the SDF-based volume rendering formulation.


The core idea is to render instance masks and compare them with ground truth instance masks. To achieve this, we propose a novel SDF-based volumetric silhouette rendering, where instance labels instead of colors are integrated along a ray based on the SDF-based volume rendering formulation:

$$\hat{y}(\mathbf{r}) = \int T(t)\,\sigma(\mathbf{r}(t))\,\bar{y}(\mathbf{r}(t))\,dt, \qquad \bar{y}(\mathbf{x}) = \sum_{n=1}^{N} w_n(\mathbf{x})\,y_n, \qquad w_n(\mathbf{x}) = \frac{\exp\left(-s_n(\mathbf{x})/\tau\right)}{\sum_{m=1}^{N}\exp\left(-s_m(\mathbf{x})/\tau\right)},$$

where $\hat{y}(\mathbf{r})$ denotes the rendered soft instance label of the ray $\mathbf{r}$, $\bar{y}(\mathbf{x})$ denotes the weighted average instance label at the position $\mathbf{x}$, indicating how relatively close the position is to each instance, $y_n$ denotes the one-hot instance label of the $n$-th instance, and $s_n$ denotes the SDF of the $n$-th instance. $T(t)$ and $\sigma$ denote the transmittance and the density derived from the scene SDF.
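A numpy sketch of this rendering for a single ray, using a softmax over negative signed distances for the per-point label average and a Laplace-CDF density mapping common in SDF-based volume rendering (the exact density mapping and the values of tau and beta are our illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def render_silhouette(sdfs, tau=0.05, beta=0.1):
    """Render soft instance labels for one ray.

    sdfs: (S, N) signed distances of S samples along the ray to N instances.
    tau, beta: label temperature and density scale (illustrative values).
    """
    # Per-sample soft instance label: the closer an instance, the larger its weight.
    labels = softmax(-sdfs / tau, axis=-1)                         # (S, N)
    # Scene SDF as the minimum over instances, mapped to density via a Laplace CDF.
    s = sdfs.min(axis=-1)                                          # (S,)
    sigma = np.where(s > 0.0,
                     0.5 * np.exp(-s / beta),
                     1.0 - 0.5 * np.exp(s / beta)) / beta
    dt = 1.0 / len(s)                                              # uniform step along the ray
    alpha = 1.0 - np.exp(-sigma * dt)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance T
    weights = trans * alpha                                        # (S,)
    return (weights[:, None] * labels).sum(axis=0)                 # (N,) rendered soft label
```

A ray that passes through one instance and misses the others should yield a rendered label dominated by that instance, with the total opacity bounded by one.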

2.2 Training of 3D object detectors using pseudo labels

  • Confidence Assignment

Figure.4 Comparison of the confidence scores in static and dynamic scenes. The confidence scores are lower for dynamic, occluded, or truncated objects, so these objects have less influence on the subsequent training of 3D object detectors. The confidence score is computed from the multi-view projection loss.
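One possible mapping from the per-instance multi-view projection loss to a confidence score; the exponential form and the temperature are illustrative assumptions, not necessarily the paper's exact formula:

```python
import numpy as np

def confidence_from_projection_loss(losses, temperature=1.0):
    """Map per-instance multi-view projection losses to confidences in (0, 1].

    A zero loss gives confidence 1; larger losses (dynamic, occluded, or
    truncated objects) decay smoothly toward 0.
    """
    return np.exp(-np.asarray(losses, dtype=float) / temperature)
```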


  • Confidence-based Weighted Loss

$$\mathcal{L} = \frac{1}{N_{\mathrm{pos}}} \sum_{i} \hat{c}_{\pi(i)}\, \mathcal{L}_{\mathrm{box}}\!\left(\hat{B}_i,\, B_{\pi(i)}\right),$$

where $\hat{B}_i$ denotes the predicted 3D bounding boxes, $B_j$ denotes the 3D bounding boxes optimized by the proposed auto-labeling, $\hat{c}_j$ denotes the corresponding confidence scores, $N_{\mathrm{pos}}$ denotes the number of positive anchors, and $\pi$ denotes the label assigner that maps the index of an anchor to that of the matched ground truth.
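The confidence-weighted loss can be sketched as follows, with a plain L1 box loss standing in for the detector's actual regression loss (all names are illustrative):

```python
import numpy as np

def confidence_weighted_loss(pred_boxes, pseudo_boxes, confidences, assignment):
    """Confidence-weighted box regression loss over positive anchors.

    pred_boxes: (P, 7) predicted boxes for the positive anchors.
    pseudo_boxes: (G, 7) boxes optimized by the auto-labeling.
    confidences: (G,) confidence scores of the pseudo boxes.
    assignment: (P,) index of the pseudo box matched to each positive anchor.
    """
    matched = pseudo_boxes[assignment]                      # (P, 7)
    weights = confidences[assignment]                       # (P,)
    per_anchor = np.abs(pred_boxes - matched).sum(axis=-1)  # L1 stand-in box loss
    return float((weights * per_anchor).sum() / max(len(assignment), 1))
```

Down-weighting anchors matched to low-confidence pseudo boxes keeps noisy labels (dynamic, occluded, or truncated objects) from dominating the training signal.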

3.Evaluation Results

  • Evaluation results of monocular 3D object detection on the KITTI-360 test set. Some baselines are reproduced with the official code, and some use CAD models as extra data; for each model-agnostic method, the 3D detection model it is paired with is indicated.

  • Evaluation results of semi-supervised monocular 3D object detection on the KITTI validation set. We choose MonoFlex and MonoDETR as the 3D object detectors to verify the quality of the generated pseudo labels.

4.Visualization Results

BibTeX

@misc{liu2024vsrd,
        title={VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection}, 
        author={Zihua Liu and Hiroki Sakuma and Masatoshi Okutomi},
        year={2024},
        eprint={2404.00149},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
  }