Learning-Based Depth and Pose Estimation
for Monocular Endoscope with Loss Generalization

To be presented at the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2021

Aji Resindra Widya1, Yusuke Monno1, Masatoshi Okutomi1, Sho Suzuki2, Takuji Gotoda2, Kenji Miki3
1Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology
2Division of Gastroenterology and Hepatology, Department of Medicine, Nihon University School of Medicine
3Department of Internal Medicine, Tsujinaka Hospital Kashiwanoha
Abstract

Gastroendoscopy has been a clinical standard for diagnosing and treating conditions that affect part of a patient's digestive system, such as the stomach. Although gastroendoscopy offers many advantages for patients, it poses challenges for practitioners, such as the lack of 3D perception, including depth and endoscope pose information. These challenges make navigating the endoscope and localizing a found lesion in the digestive tract difficult. To tackle these problems, deep learning-based approaches have been proposed to provide monocular gastroendoscopy with additional yet important depth and pose information. In this paper, we propose a novel supervised approach that trains depth and pose estimation networks using consecutive endoscopy images to assist endoscope navigation in the stomach. We first generate real depth and pose training data using our previously proposed whole stomach 3D reconstruction pipeline, avoiding the poor generalization between computer-generated (CG) stomach models and real data. In addition, we propose a novel generalized photometric loss function that avoids the complicated process of finding proper weights to balance the depth and pose loss terms, which existing direct depth and pose supervision approaches require. We then experimentally show that our proposed generalized loss performs better than existing direct supervision losses.

Summary

Gastroendoscopy is one of the gold standards for finding and treating abnormalities inside a patient's digestive tract, including the stomach. Even though gastroendoscopy offers enormous advantages to the patient, such as avoiding invasive surgery, it presents challenges for medical practitioners, such as the loss of depth perception and the difficulty of assessing the endoscope pose. These challenges make it hard to navigate and understand the scene captured by the endoscope system, and thus to localize a found lesion.

Previous studies have proposed reconstructing the 3D model of a whole stomach with its texture to provide a global view of the stomach and the estimated endoscope trajectory. This enables medical practitioners to perform a second inspection with more degrees of freedom after the initial gastroendoscopy procedure. However, while whole stomach 3D reconstruction provides the depth and the endoscope trajectory, these methods cannot run alongside the gastroendoscopy procedure in real time.

Even though real-time monocular depth estimation has been proposed before, depth information alone is not enough to effectively tackle the endoscope navigation and lesion localization challenges; both continuous depth and pose information are needed. Deep learning approaches are widely adopted to provide both, and in this work we supervise a depth and a pose estimation network.

A commonly used supervision approach is to take direct Euclidean distance losses between the predicted depth and pose and their respective ground truths or references. In this approach, computer-generated (CG) and/or phantom models are commonly used to train the depth and pose estimation networks, which hurts generalization between CG and real data. In addition, the direct supervision approach needs balancing weights for the depth and pose loss terms, which are difficult to search for.
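Concretely, a direct-supervision objective typically takes the schematic form below (the notation is ours for illustration, not the paper's exact formulation):

L_direct = λ_d · ‖D_pred − D_ref‖ + λ_p · ‖T_pred − T_ref‖,

where D_pred and T_pred are the predicted depth map and pose, D_ref and T_ref their references, and λ_d, λ_p the balancing weights. Because the depth term is measured in scene units while the pose term mixes translation and rotation units, there is no physically principled choice of λ_d and λ_p, hence the difficult weight search.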

In this paper, we propose a supervised approach that simultaneously trains depth and pose networks using consecutive images from monocular endoscopy of the stomach. To avoid the generalization problem between CG and real data, we apply our whole stomach 3D reconstruction pipeline to generate reference depths and poses from real endoscope data for network training. Additionally, we propose a novel loss generalization that unifies the depth and pose losses into a photometric error loss for supervised training, removing the need for delicate weight balancing between the two. Our method achieves up to 60 fps at test time for depth and pose prediction.

Paper

To appear

[arXiv page] Learning-Based Depth and Pose Estimation for Monocular Endoscope with Loss Generalization
@misc{widya2021learningbased,
  title={Learning-Based Depth and Pose Estimation for Monocular Endoscope with Loss Generalization},
  author={Aji Resindra Widya and Yusuke Monno and Masatoshi Okutomi and Sho Suzuki and Takuji Gotoda and Kenji Miki},
  year={2021},
  eprint={2107.13263},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Loss function comparison

First, we introduce the network structure and three training methods that can be used to train these networks.

(a) Network structure, consisting of the depth and pose estimation networks.
(b) Self-supervised photometric loss.
(c) Commonly used direct depth and pose supervision loss.
(d) Proposed generalized photometric loss.

The network structure, which consists of the depth and pose estimation networks, is shown in (a). Figures (b)-(d) compare the existing self-supervised photometric loss, the existing direct depth and pose supervision loss, and our proposed generalized photometric loss. In both (c) and (d), the loss in the purple box trains the depth estimation network and the loss in the pink box trains the pose estimation network. The existing direct supervision approach trains the two networks by directly taking the Euclidean distance between the predicted depth and its reference and between the predicted pose and its reference, respectively. This approach needs balancing weights for each loss term, which are difficult to search for because the terms have different physical meanings. In our proposed generalized loss, we adjust the loss terms so that each has the same physical meaning, i.e., the photometric error. This generalization eliminates the need for the balancing weight search.
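As a concrete illustration, below is a minimal PyTorch-style sketch of this idea. The exact pairing of predicted and reference quantities (and all function names) is our assumption from the figure, not the paper's implementation: the depth network receives a photometric error computed by warping with the predicted depth and the reference pose, while the pose network receives one computed with the reference depth and the predicted pose, so both terms share the same unit and can simply be summed.

# Minimal sketch of a generalized photometric loss (PyTorch). The pairing of
# predicted and reference quantities is our reading of the figure, not the
# paper's exact implementation; occlusion/out-of-view masking is omitted.
import torch
import torch.nn.functional as F

def inverse_warp(src_img, depth_tgt, T_tgt_to_src, K):
    """Synthesize the target view by sampling the source image.
    src_img: (B,3,H,W), depth_tgt: (B,1,H,W),
    T_tgt_to_src: (B,4,4) relative pose, K: (B,3,3) intrinsics."""
    B, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float()          # (3,H,W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1).to(src_img.device)  # (B,3,HW)
    # Back-project target pixels to 3D, move them to the source frame, re-project.
    cam = torch.linalg.inv(K) @ pix * depth_tgt.reshape(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)      # (B,4,HW)
    src = (T_tgt_to_src @ cam_h)[:, :3]
    proj = K @ src
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def generalized_loss(tgt_img, src_img, depth_pred, depth_ref, pose_pred, pose_ref, K):
    # Photometric term supervising the depth network (reference pose held fixed).
    depth_term = (tgt_img - inverse_warp(src_img, depth_pred, pose_ref, K)).abs().mean()
    # Photometric term supervising the pose network (reference depth held fixed).
    pose_term = (tgt_img - inverse_warp(src_img, depth_ref, pose_pred, K)).abs().mean()
    return depth_term + pose_term  # same physical meaning, no balancing weights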

Depth estimation evaluation
Input image | Reference depth | Self-supervised | Direct supervision | Proposed generalized loss

Some examples of depth estimation results. Here we show RGB images for better visualization, though we actually used red-channel images as the network input, following the finding of our previous research. We compare the depth predictions of the self-supervised loss, the direct supervision loss, and our proposed generalized loss. Our proposed method not only estimates depth closer to the reference but also better recovers structures and boundaries, including the endoscope rod. In some cases, the direct supervision results are over-smoothed. For better visualization, we also provide videos and a point cloud reconstruction from a single frame below.

Input image | Reference depth | Self-supervised | Direct supervision | Proposed generalized loss
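For reference, the single-frame point cloud above can be produced by simple pinhole back-projection of the predicted depth map; a minimal NumPy sketch is below (fx, fy, cx, cy are placeholder intrinsics, not the actual endoscope calibration).

# Minimal sketch: back-project a depth map to a colored point cloud.
import numpy as np

def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """depth: (H, W) depth map; rgb: (H, W, 3) colors. Returns (N, 6) XYZRGB."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    x = (u - cx) * depth / fx                        # pinhole back-projection
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    cols = rgb.reshape(-1, 3)
    valid = pts[:, 2] > 0                            # keep pixels with valid depth
    return np.hstack([pts[valid], cols[valid]])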

For objective evaluation, we report depth accuracy and depth relative error on both the test and the training sets, as follows.
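We read the reported depth accuracy and relative error as the standard monocular-depth metrics; a sketch of the common definitions (not quoted from the paper) is:

# Common monocular-depth metrics; pred and ref are matched valid-depth arrays.
import numpy as np

def depth_metrics(pred, ref):
    ratio = np.maximum(pred / ref, ref / pred)
    acc = np.mean(ratio < 1.25)               # accuracy: fraction with delta < 1.25
    rel = np.abs(pred - ref) / ref            # per-pixel relative error
    return acc, rel.mean(), np.median(rel)    # accuracy, mean and median rel. error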

We can see that our proposed method outperforms the self-supervised method by a fair margin and generally performs better than the direct supervision method; even though it comes second in median relative error, the values are very close. In addition to testing on the test data (Subjects 1 and 2), we also tested each trained network on the training data. The direct supervision achieves the best results in this evaluation; however, its performance falls sharply on the test data compared to the training data, showing that the depth estimation network trained with direct supervision generalizes poorly to data never seen during training.

Pose estimation evaluation

For pose estimation evaluation, we first split the full sequences of Subjects 1 and 2 into groups of 150 consecutive frames. The predicted poses are then aligned with the reference poses using the Umeyama transform, and we use the absolute pose error (APE) to evaluate the translation and rotation components of the predicted poses against the references. The qualitative evaluations are shown in the figures below.

(a) Predicted trajectories for the first sample sequence.
(b) Predicted trajectories for the second sample sequence.

Figures (a) and (b) show the trajectory component of the predicted poses for two sample sequences. As we can see, our prediction is the closest to the reference pose. In addition, ORB-SLAM can only predict 16 and 17 poses in (a) and (b), respectively, out of the 150 input images. The table below shows the objective evaluation.

Based on the evaluation on the test data, our proposed generalized loss performs better than both the self-supervised and the direct supervision methods. Even though the direct supervision achieves the best result when tested on the training data, its performance drops sharply on the test data, mirroring the results shown in the depth estimation evaluation. We believe this is because a direct supervision loss implicitly induces poor generalization. Moreover, even without an intricate search for weight-balancing terms, our method achieves the best result.
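For completeness, below is a hand-rolled NumPy sketch of the Umeyama alignment and the translation component of APE used in the evaluation protocol above; in practice, trajectory-evaluation tools such as evo implement the same computation, and rotation APE is omitted here for brevity.

# Closed-form Umeyama alignment and translation APE (sketch).
import numpy as np

def umeyama_alignment(est, ref):
    """Similarity transform (s, R, t) such that ref ≈ s * R @ est + t.
    est, ref: (N, 3) trajectory positions."""
    mu_e, mu_r = est.mean(axis=0), ref.mean(axis=0)
    e, r = est - mu_e, ref - mu_r
    cov = r.T @ e / len(est)                       # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # handle reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / len(est)              # variance of the estimate
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_r - s * R @ mu_e
    return s, R, t

def ape_translation(est, ref):
    """Per-pose absolute translation error after Umeyama alignment."""
    s, R, t = umeyama_alignment(est, ref)
    aligned = s * (R @ est.T).T + t
    return np.linalg.norm(aligned - ref, axis=1)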