FMDP: Leveraging a Foundation Model for
Dual-Pixel Disparity Estimation

MVA 2025

Doehyung Lee, Zhuofeng Wu, Yusuke Monno, Masatoshi Okutomi


Institute of Science Tokyo
Tokyo, Japan

Abstract

In this paper, we propose a foundation-model-aided dual-pixel disparity estimation network, named FMDP, which leverages both the physical cues from dual-pixel defocus disparities and the powerful scene priors encoded by a depth-estimation foundation model. Previous dual-pixel disparity estimation methods often suffer from limited generalization ability due to the lack of a large-scale training dataset. In contrast, recent depth-estimation foundation models successfully encode the features of diverse real scenes by training on huge amounts of data. Motivated by this, our FMDP effectively integrates the features from a foundation model into a dual-pixel disparity estimation pipeline. Experimental results show that our FMDP consistently outperforms prior methods on both synthetic and real scenes, in particular demonstrating improved robustness to noise and strong generalization to unseen real scenes.

Method Overview

The overview of our proposed FMDP architecture.

The figure above shows the overall architecture of our proposed FMDP. Our method enhances a dual-pixel (DP) disparity estimation pipeline by incorporating features from a pre-trained depth estimation foundation model (Depth Anything V2). FMDP takes left, right, and center images from a DP sensor as input. The center image is the average of the left and right images. These inputs are processed through feature encoders, correlation pyramids, and iterative refinement units (GRUs). By combining physical cues from the DP sensor with the rich scene understanding from the foundation model, FMDP achieves more accurate and robust disparity estimation.
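Below is a minimal sketch of the dual-pixel input preparation described above. Only the center-image computation (the average of the left and right views) follows the description; the `FMDP` forward call is hypothetical and shown only as a comment.

```python
# Minimal sketch of preparing dual-pixel (DP) inputs.
# The center image is the average of the left and right DP views.
import torch

left = torch.rand(1, 3, 480, 640)    # left DP view  (B, C, H, W), illustrative size
right = torch.rand(1, 3, 480, 640)   # right DP view (B, C, H, W)
center = 0.5 * (left + right)        # center image = average of the two views

# disparity = FMDP(left, right, center)   # hypothetical forward call of the network
```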

Integration of Depth Anything V2 Features

Integration of Depth Anything V2 features into GRU
FA Block

We integrate features from the Depth Anything V2 Large encoder into our disparity estimation pipeline. As shown in the figure, hierarchical features are extracted from the encoder at different layers. These features are then spatially and dimensionally aligned using Feature Alignment (FA) blocks before being injected into the corresponding levels of the GRU-based iterative refinement module. This allows the network to leverage both the physical cues from the DP sensor and the powerful, general-purpose scene priors learned by the foundation model.
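The following is a minimal sketch of one plausible form of a Feature Alignment (FA) block, assuming a 1x1 convolution for channel alignment and bilinear interpolation for spatial alignment; the channel sizes and resolutions are illustrative assumptions, and the actual FA design in the paper may differ.

```python
# Sketch of a Feature Alignment (FA) block: align a Depth Anything V2 encoder
# feature map to the channel count and spatial resolution of a GRU level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # 1x1 convolution aligns the channel dimension
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, target_hw: tuple) -> torch.Tensor:
        feat = self.proj(feat)                                  # channel alignment
        return F.interpolate(feat, size=target_hw,
                             mode="bilinear", align_corners=False)  # spatial alignment

# Example (illustrative): align a ViT-L feature map (1024 channels) to a
# GRU level with 128 channels at a coarser target resolution.
fa = FeatureAlign(1024, 128)
vit_feat = torch.rand(1, 1024, 34, 46)
aligned = fa(vit_feat, target_hw=(60, 80))   # ready to be injected into the GRU
```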

Training Strategy

The network is trained on a synthetic dual-pixel image dataset. To preserve the powerful priors learned by the foundation model and prevent overfitting, we keep the weights of the pre-trained Depth Anything V2 encoder frozen during training. This strategy ensures that our model effectively learns to utilize the foundation model's features for the specific task of dual-pixel disparity estimation, leading to improved performance and generalization.
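A minimal sketch of this frozen-encoder strategy is shown below. The two modules are hypothetical placeholders standing in for the pre-trained Depth Anything V2 encoder and the trainable disparity-estimation components; only the freezing and optimizer setup illustrate the strategy described above.

```python
# Sketch of training with a frozen foundation-model encoder.
import torch
import torch.nn as nn

# Hypothetical stand-ins for the frozen encoder and the trainable FMDP parts.
foundation_encoder = nn.Conv2d(3, 64, 3, padding=1)
fmdp_head = nn.Conv2d(64, 1, 3, padding=1)

for p in foundation_encoder.parameters():
    p.requires_grad_(False)          # keep the foundation-model weights fixed
foundation_encoder.eval()            # fixed normalization/dropout behavior

# Optimize only the trainable FMDP parameters.
optimizer = torch.optim.AdamW(
    [p for p in fmdp_head.parameters() if p.requires_grad], lr=2e-4
)
```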

Evaluation Results on Synthetic Dataset

Quantitative comparisons

Quantitative comparisons on the synthetic dataset

Qualitative comparisons

Qualitative comparison of disparity estimation results on the synthetic dataset

The above table shows the quantitative comparison on the synthetic dataset under both noise-free and noisy settings. Across all metrics and settings, our proposed FMDP outperforms other methods. The performance gap becomes more pronounced in the noisy setting, indicating the enhanced robustness and generalization ability of our FMDP. The qualitative comparison on the noisy synthetic dataset shows that while other methods suffer from noise or fail to capture correct depth relationships, our FMDP yields consistently lower errors and demonstrates strong robustness.

Generalization Performance on Real-world Dataset

Quantitative comparisons

Quantitative comparisons on the real-world dataset

Qualitative comparisons

Qualitative comparison of disparity estimation results on the real-world dataset

On the real-world dataset, FMDP achieves significant improvements over the other methods, as shown in the table above. This provides strong evidence that incorporating features from a foundation model substantially enhances generalization. The visual results show that FMDP has a pronounced advantage: while other methods misjudge depth, are sensitive to blur, or produce errors in textured regions, FMDP avoids these issues and achieves more stable and accurate results across diverse real-world scenes.

Publications

FMDP: Leveraging a Foundation Model for Dual-Pixel Disparity Estimation

Doehyung Lee, Zhuofeng Wu, Yusuke Monno, Masatoshi Okutomi
19th International Conference on Machine Vision Applications (MVA 2025)

[Project Page]

Contact

Doehyung Lee: dlee[at]ok.sc.e.titech.ac.jp
Zhuofeng Wu: zwu[at]ok.sc.e.titech.ac.jp
Yusuke Monno: ymonno[at]ok.sc.e.titech.ac.jp
Masatoshi Okutomi: mxo[at]ctrl.titech.ac.jp