Structure-from-Motion Using DenseCNN Features with Keypoint Relocalization

Author
Abstract

Structure from Motion (SfM) using imagery that involves extreme appearance changes is still a challenging task due to the loss of feature repeatability. Using feature correspondences obtained by matching densely extracted convolutional neural network (CNN) features significantly improves SfM reconstruction capability. However, reconstruction accuracy is limited by the spatial resolution of the extracted CNN features, which does not reach pixel-level accuracy in the existing approach. Providing dense feature matches with precise keypoint positions is not trivial because of the memory and computational burden of dense features. To achieve accurate SfM reconstruction with highly repeatable dense features, we propose an SfM pipeline that uses dense CNN features with keypoint relocalization, which efficiently and accurately provides pixel-level feature correspondences. We then demonstrate on the Aachen Day-Night dataset that the proposed SfM using dense CNN features with keypoint relocalization outperforms a state-of-the-art SfM (COLMAP using RootSIFT) by a large margin.

Summary

As shown in the figure above, our algorithm works as follows:

  1. Densely extract features from each input image using a convolutional neural network (CNN). We use a pre-trained VGG-16 in our implementation.

  2. Perform tentative matching: compute nearest neighbours to find matches between keypoints.

  3. Relocalize keypoints, since densely extracted keypoints are not yet at pixel-level accuracy.

  4. Apply homography RANSAC to remove outlier keypoint matches.

  5. Perform the final 3D reconstruction using any 3D reconstruction pipeline.
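Steps 2 and 3 can be sketched in NumPy as below. This is a simplified illustration, not the paper's exact implementation: the function names are hypothetical, descriptors are stand-ins for dense CNN features, and the relocalization here refines each coarse keypoint with a single argmax over a finer response map inside its grid cell, whereas the paper traces keypoints back through the CNN's pooling layers.

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Tentative matching: mutual nearest neighbours between two sets of
    L2-normalized dense descriptors, shape (N, D) each."""
    sim = desc_a @ desc_b.T                      # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)                   # best b for each a
    nn_ba = sim.argmax(axis=0)                   # best a for each b
    idx_a = np.arange(len(desc_a))
    mutual = nn_ba[nn_ab] == idx_a               # keep only mutual matches
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

def grid_to_pixel(indices, grid_w, stride):
    """Map flat feature-grid indices to image coordinates (cell centers)."""
    rows, cols = indices // grid_w, indices % grid_w
    return np.stack([cols * stride + stride / 2,
                     rows * stride + stride / 2], axis=1)

def relocalize(coarse_xy, response, stride):
    """Refine each coarse keypoint to the argmax of a finer response map
    within its stride x stride cell (single-level simplification)."""
    refined = []
    for x, y in coarse_xy.astype(int):
        x0, y0 = x - stride // 2, y - stride // 2
        patch = response[y0:y0 + stride, x0:x0 + stride]
        dy, dx = np.unravel_index(patch.argmax(), patch.shape)
        refined.append((x0 + dx, y0 + dy))
    return np.array(refined, float)
```

In the actual pipeline, `response` would come from a shallower (higher-resolution) CNN layer, so the refinement stays consistent with the features that produced the coarse match.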

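Step 4 can likewise be sketched as a plain RANSAC loop around a direct-linear-transform (DLT) homography fit; in practice a library routine such as OpenCV's `findHomography` would typically be used instead. All names below are illustrative assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    """DLT homography from >= 4 point correspondences (x, y) -> (u, v)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A))
    return vt[-1].reshape(3, 3)              # null-space vector as 3x3 H

def project(H, pts):
    """Apply homography H to Nx2 points, with perspective division."""
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=200, thresh=2.0, seed=0):
    """Return a boolean inlier mask over the tentative matches."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), bool)
    for _ in range(iters):
        sample = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```

Only the matches surviving this verification are passed on to the 3D reconstruction stage in step 5.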
Paper and Code

Paper [arXiv | CVA]
Code on GitHub + instructions

@Article{Widya2018,
  author="Widya, Aji Resindra
  and Torii, Akihiko
  and Okutomi, Masatoshi",
  title="Structure from Motion Using Dense CNN Features With Keypoint Relocalization",
  journal="IPSJ Transactions on Computer Vision and Applications",
  year="2018",
  month="May",
  day="31",
  volume="10",
  number="1",
  pages="6",
  issn="1882-6695",
  doi="10.1186/s41074-018-0042-y",
  url="https://doi.org/10.1186/s41074-018-0042-y"
}

Example Results

RootSIFT

DenseCNN with keypoint relocalization (Ours)

Example of 3D reconstruction on the Aachen dataset. The figures above show a qualitative comparison between RootSIFT COLMAP and our DenseCNN with keypoint relocalization. Our method can reconstruct all 21 images in the subset, whereas RootSIFT COLMAP fails to reconstruct the night image.

Example 3D reconstruction on Castle-P30 from the Strecha dataset. This figure demonstrates that our DenseCNN with keypoint relocalization does not overfit to a specific problem but also works reasonably well in the standard (easy) case. Moreover, the proposed method achieves better camera pose estimation than DenseCNN without keypoint relocalization.

Acknowledgement

This work was partly supported by JSPS KAKENHI Grant Numbers 17H00744, 15H05313, and 16KK0002, and by the Indonesia Endowment Fund for Education.