FMDP: Leveraging a Foundation Model for Dual-Pixel Disparity Estimation
Doehyung Lee, Zhuofeng Wu, Yusuke Monno, Masatoshi Okutomi
19th International Conference on Machine Vision Applications (MVA 2025)
In this paper, we propose a foundation-model-aided dual-pixel disparity estimation network, named FMDP, which leverages both the physical cues from dual-pixel defocus disparities and the powerful scene priors encoded by a depth-estimation foundation model. Previous dual-pixel disparity estimation methods often generalize poorly due to the lack of a large-scale training dataset. In contrast, recent depth-estimation foundation models, trained on huge amounts of data, successfully encode features of diverse real scenes. Motivated by this, our FMDP effectively integrates the features from a foundation model into a dual-pixel disparity estimation pipeline. Experimental results show that FMDP consistently outperforms prior methods on both synthetic and real scenes, demonstrating improved robustness to noise and strong generalization to unseen real scenes.
The figure above shows the overall architecture of our proposed FMDP. Our method enhances a dual-pixel (DP) disparity estimation pipeline by incorporating features from a pre-trained depth estimation foundation model (Depth Anything V2). FMDP takes left, right, and center images from a DP sensor as input. The center image is the average of the left and right images. These inputs are processed through feature encoders, correlation pyramids, and iterative refinement units (GRUs). By combining physical cues from the DP sensor with the rich scene understanding from the foundation model, FMDP achieves more accurate and robust disparity estimation.
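To make the input construction concrete, below is a minimal PyTorch sketch of how the center view can be formed as the per-pixel average of the DP left and right views, as described above. The function name, tensor layout, and stacking order are our own illustrative choices, not the paper's released code.

```python
import torch

def build_dp_inputs(left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
    """Stack the DP left/right views together with their average as the center view.

    left, right: (B, 3, H, W) dual-pixel sub-aperture images.
    Returns a (B, 3, 3, H, W) tensor ordered as (left, center, right).
    """
    center = 0.5 * (left + right)          # center image = average of the two DP views
    return torch.stack([left, center, right], dim=1)

# Example with random tensors standing in for DP captures
left = torch.rand(1, 3, 256, 256)
right = torch.rand(1, 3, 256, 256)
inputs = build_dp_inputs(left, right)
print(inputs.shape)  # torch.Size([1, 3, 3, 256, 256])
```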
We integrate features from the Depth Anything V2 Large encoder into our disparity estimation pipeline. As shown in the figure, hierarchical features are extracted from the encoder at different layers. These features are then spatially and dimensionally aligned using Feature Alignment (FA) blocks before being injected into the corresponding levels of the GRU-based iterative refinement module. This allows the network to leverage both the physical cues from the DP sensor and the powerful, general-purpose scene priors learned by the foundation model.
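As a rough illustration of the alignment step, the sketch below assumes that a Feature Alignment (FA) block amounts to a 1x1 convolution for channel (dimensional) matching followed by bilinear resizing to the spatial resolution of the target GRU level. The class name, channel widths, and feature scales are hypothetical placeholders and not taken from the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignment(nn.Module):
    """Hypothetical FA block: projects foundation-model features to the channel
    width of a GRU refinement level and resizes them to its spatial resolution
    before injection."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),  # channel (dimensional) alignment
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor, target_hw) -> torch.Tensor:
        feat = self.proj(feat)
        # spatial alignment to the GRU level's resolution
        return F.interpolate(feat, size=target_hw, mode="bilinear", align_corners=False)

# Align a hypothetical ViT feature map (1024 channels) to a 128-channel GRU level
fa = FeatureAlignment(in_ch=1024, out_ch=128)
vit_feat = torch.rand(1, 1024, 32, 32)
aligned = fa(vit_feat, target_hw=(40, 40))
print(aligned.shape)  # torch.Size([1, 128, 40, 40])
```

The aligned features could then be injected into the corresponding GRU level, e.g., by concatenation or addition with its hidden state; the exact injection scheme is not specified here.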
The network is trained on a synthetic dual-pixel image dataset. To preserve the powerful priors learned by the foundation model and prevent overfitting, we keep the weights of the pre-trained Depth Anything V2 encoder frozen during training. This strategy ensures that our model effectively learns to utilize the foundation model's features for the specific task of dual-pixel disparity estimation, leading to improved performance and generalization.
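A minimal sketch of the frozen-encoder training setup is shown below, assuming that freezing is implemented by disabling gradients on the encoder parameters and handing only the remaining trainable parameters to the optimizer. The module names, toy layers, and optimizer settings are placeholders for illustration only.

```python
import torch
import torch.nn as nn

def freeze_foundation_encoder(encoder: nn.Module) -> None:
    """Freeze a pre-trained encoder so its learned priors are preserved
    and it receives no gradient updates during training."""
    encoder.eval()                        # keep normalization layers in inference behavior
    for p in encoder.parameters():
        p.requires_grad_(False)

# Toy stand-in for the Depth Anything V2 Large encoder (hypothetical placeholder)
dav2_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
disparity_head = nn.Conv2d(64, 1, 3, padding=1)   # trainable part of the pipeline

freeze_foundation_encoder(dav2_encoder)

# Only parameters that still require gradients are passed to the optimizer.
trainable = [p for p in disparity_head.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4)
```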
Quantitative comparisons
Qualitative comparisons
The above table shows the quantitative comparison on the synthetic dataset under both noise-free and noisy settings. Across all metrics and settings, our proposed FMDP outperforms other methods. The performance gap becomes more pronounced in the noisy setting, indicating the enhanced robustness and generalization ability of our FMDP. The qualitative comparison on the noisy synthetic dataset shows that while other methods suffer from noise or fail to capture correct depth relationships, our FMDP yields consistently lower errors and demonstrates strong robustness.
Quantitative comparisons
Qualitative comparisons
On the real-world dataset, FMDP achieves significant improvements over the other methods, as shown in the table below. This provides strong evidence that incorporating features from a foundation model significantly enhances generalization. The visual results show that FMDP has a pronounced advantage. While other methods misjudge depth, are sensitive to blur, or produce errors in textured regions, FMDP does not suffer from these issues and achieves more stable and accurate results across diverse real-world scenes.
Doehyung Lee: dlee[at]ok.sc.e.titech.ac.jp
Zhuofeng Wu: zwu[at]ok.sc.e.titech.ac.jp
Yusuke Monno: ymonno[at]ok.sc.e.titech.ac.jp
Masatoshi Okutomi: mxo[at]ctrl.titech.ac.jp