Digging Into Normal Incorporated Stereo Matching

Zihua Liu1, Songyan Zhang2, Zhicheng Wang2, Masatoshi Okutomi1

1. Tokyo Institute of Technology 2. Tongji University

Proceedings of the 30th ACM International Conference on Multimedia (ACMMM 2022)

[paper], [supp], [code]

Method Overview


An overview of our proposed NINet. The model is mainly composed of two modules, ARL and NDP. Note that some skip connections are omitted to simplify the visualization.


Despite the remarkable progress facilitated by learning-based stereo matching algorithms, disparity estimation in low-texture, occluded, and bordered regions still remains a bottleneck that limits performance. To tackle these challenges, geometric guidance such as plane information is necessary, as it provides intuitive guidance about disparity consistency and affinity similarity. In this paper, we propose a normal incorporated joint learning framework consisting of two specific modules named non-local disparity propagation (NDP) and affinity-aware residual learning (ARL). The estimated normal map is first used to calculate a non-local affinity matrix as well as non-local offsets, which together perform spatial propagation at the disparity level. To enhance geometric consistency, especially in low-texture regions, the estimated normal map is then used to calculate a local affinity matrix, which tells the residual learning where each correction should refer and thus improves its efficiency. Extensive experiments on several public datasets, including Scene Flow, KITTI 2015, and Middlebury 2014, validate the effectiveness of the proposed method. At the time this work was completed, our approach ranked 1st for stereo matching across foreground pixels on the KITTI 2015 dataset and 3rd on the Scene Flow dataset among all published works.
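
As a brief aside on why plane information constrains disparity, the following derivation is a sketch under standard assumptions that are not stated on this page: a rectified pinhole stereo pair with focal length $f$ and baseline $B$, image coordinates $(u, v)$ measured from the principal point, and a scene plane with normal $(a, b, c)$ and offset $\rho \neq 0$. All of these symbols are introduced here for illustration only.

```latex
\begin{align}
  aX + bY + cZ &= \rho, \qquad X = \frac{uZ}{f},\quad Y = \frac{vZ}{f} \\
  \frac{1}{Z} &= \frac{a\,u + b\,v + c\,f}{f\,\rho} \\
  d(u, v) &= \frac{fB}{Z} = \frac{B}{\rho}\,\bigl(a\,u + b\,v + c\,f\bigr)
\end{align}
```

Under these assumptions, disparity is an affine function of the pixel coordinates on any single plane, so pixels that share a surface normal and lie on the same plane have mutually predictable disparities. This is one way to read the "disparity consistency" that the normal map is meant to supply.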

Non-Local Disparity Propagation (NDP)

Architecture of the Non-Local Disparity Propagation Module


Illustration of local spatial propagation (a) and non-local spatial propagation (b). (c) shows sampled points in low-texture regions, (d) shows sampled points at edges, and (e) shows sampled points in occluded regions. Red points indicate the points selected for propagating disparity to the target white/green point. The sampled locations differ across the three patterns, which indicates that our method learns to sample points for propagation dynamically.


Disparity refinement at different scales. The proposed non-local propagation yields clearer edges and structures at every disparity scale, which alleviates blurring and breakage at object edges.
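
The non-local propagation step illustrated above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `propagate_nonlocal`, the integer offsets with border clamping, and the softmax normalization of the affinities are all assumptions made here. In NINet the offsets and affinities are predicted by the network from the estimated normal map; in this sketch they are simply passed in.

```python
import math


def propagate_nonlocal(disp, offsets, logits):
    """Refine a disparity map by aggregating disparities from sampled points.

    disp:    H x W nested list of disparities.
    offsets: offsets[y][x] is a list of K integer (dy, dx) pairs, a
             hypothetical stand-in for the learned non-local offsets.
    logits:  logits[y][x] is a list of K unnormalized affinity scores.
    """
    h, w = len(disp), len(disp[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Softmax so that the affinities of one pixel sum to 1
            # (an assumption of this sketch).
            m = max(logits[y][x])
            exps = [math.exp(v - m) for v in logits[y][x]]
            total = sum(exps)
            acc = 0.0
            for (dy, dx), e in zip(offsets[y][x], exps):
                # Clamp sampled coordinates to the image border.
                sy = min(max(y + dy, 0), h - 1)
                sx = min(max(x + dx, 0), w - 1)
                acc += (e / total) * disp[sy][sx]
            out[y][x] = acc
    return out


# Toy example: one row whose third pixel is an outlier. It samples three
# other pixels with equal affinity; the remaining pixels only sample themselves.
disp = [[10.0, 10.0, 99.0, 10.0]]
offsets = [[[(0, 0)], [(0, 0)], [(0, -2), (0, -1), (0, 1)], [(0, 0)]]]
logits = [[[0.0], [0.0], [0.0, 0.0, 0.0], [0.0]]]
refined = propagate_nonlocal(disp, offsets, logits)
```

In the toy example the outlier is replaced by the average of the three sampled disparities, because it never samples its own location. A local scheme restricted to a fixed neighbourhood could not reach the first pixel of the row, which is the difference panels (a) and (b) are meant to show.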

Affinity-Aware Residual Learning (ARL)


A simple U-Net architecture that estimates surface normals from the left and right image views. The normals provide geometric guidance for disparity refinement.


Affinity-aware residual architecture and visualization of the affinity maps. Note that disparities lying on the same physical plane tend to aggregate together.
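
To make the idea of a normal-derived local affinity concrete, here is a small sketch. It is an assumption-laden illustration rather than the ARL module itself: the function name `normal_affinity`, the cosine similarity between unit normals, the 8-neighbourhood, and the `temperature` parameter are all choices made here, and the page does not specify how the network computes its affinity.

```python
import math


def normal_affinity(normals, y, x, temperature=0.1):
    """Affinity between pixel (y, x) and its in-image 8-neighbours.

    normals: H x W nested list of unit normals given as (nx, ny, nz) tuples.
    Returns a dict mapping neighbour coordinates to softmax-normalized
    weights. Assumes the pixel has at least one neighbour inside the image.
    """
    h, w = len(normals), len(normals[0])
    n0 = normals[y][x]
    neighbours, sims = [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if (dy, dx) == (0, 0) or not (0 <= ny < h and 0 <= nx < w):
                continue
            # Cosine similarity of unit normals: close to 1 on a shared plane.
            sims.append(sum(a * b for a, b in zip(n0, normals[ny][nx])))
            neighbours.append((ny, nx))
    # Temperature-scaled softmax over the neighbours (hypothetical choice).
    m = max(sims)
    exps = [math.exp((s - m) / temperature) for s in sims]
    total = sum(exps)
    return {p: e / total for p, e in zip(neighbours, exps)}


# Toy example: the centre pixel shares a plane with its left neighbour
# but not with its right neighbour.
flat = (0.0, 0.0, 1.0)
wall = (1.0, 0.0, 0.0)
aff = normal_affinity([[flat, flat, wall]], 0, 1)
```

In the toy example the coplanar neighbour receives almost all of the weight, which mirrors the caption's observation that pixels on the same physical plane aggregate together. A residual correction weighted this way would draw mainly on neighbours from the same surface.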

Quantitative Results

Extensive experiments were conducted on the Scene Flow [1], KITTI 2015 [2], and Middlebury 2014 [3] datasets, with comparisons against existing state-of-the-art methods.


Qualitative Results


Visual comparisons on the Scene Flow dataset. Note that our approach achieves better results in border, occluded, and texture-less regions.


Generalization on the Middlebury 2014 dataset.


Predicted surface normals on different datasets.




[1] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. 2016. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4040–4048.

[2] M. Menze and A. Geiger. 2015. Object Scene Flow for Autonomous Vehicles. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3061–3070. https://doi.org/10.1109/CVPR.2015.7298925

[3] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling. 2014. High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In German Conference on Pattern Recognition (GCPR).


Zihua Liu: zliu@ok.sc.e.titech.ac.jp
Songyan Zhang: spyder@tongji.edu.cn
Zhicheng Wang: zhichengwang@tongji.edu.cn
Masatoshi Okutomi: mxo@ctrl.titech.ac.jp