Self-Supervised Monocular Depth Estimation in Gastroendoscopy
Using GAN-Augmented Images

(To be presented at) SPIE Medical Imaging 2021

Sho Suzuki2, Takuji Gotoda2, Kenji Miki3
1Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology
2Division of Gastroenterology and Hepatology, Department of Medicine, Nihon University School of Medicine
3Department of Internal Medicine, Tsujinaka Hospital Kashiwanoha
Abstract

Gastroendoscopy is the gold standard procedure that enables medical doctors to investigate the inside of a patient's stomach. Monocular depth estimation from an endoscopic image enables the simultaneous acquisition of RGB and depth data, which can boost the capability of endoscopy for various potential diagnostic applications, such as RGB-D data acquisition for whole stomach 3D reconstruction to localize lesions and local view expansion for lesion inspection. Therefore, deep-learning-based approaches are gaining traction for providing depth information in monocular endoscopy. Since it is very difficult to obtain ground-truth RGB and depth image pairs in clinical settings, computer-generated (CG) data is usually used for training the depth estimation network. However, CG data is limited in how realistically it can reproduce RGB and depth data. In this paper, we propose a novel data generation strategy for self-supervised training to predict the depth in gastroendoscopy. To obtain dense reference depth data for training, we first reconstruct a whole stomach 3D model by exploiting chromoendoscopic images sprayed with indigo carmine (IC) blue dye. We then generate virtual no-IC images from chromoendoscopic images using CycleGAN to make our depth estimation network applicable to general endoscopic images without IC dye. We experimentally demonstrate that our proposed approach achieves plausible depth prediction on both chromoendoscopic and general endoscopic images.

Summary

Gastroendoscopy is a clinical standard for medical doctors to diagnose various lesions inside a patient's stomach. Although current gastroendoscopy is based on RGB image data, providing depth data alongside the RGB images can boost the capability of endoscopy for further potential applications such as lesion localization or whole stomach 3D reconstruction. Lately, deep-learning-based monocular depth estimation has proven its usefulness in computer vision applications on general scenes. However, it is very difficult to obtain paired RGB and depth training images for endoscope data in real clinical settings, which is a main challenge for deep-learning-based monocular depth estimation in endoscopy. Computer-generated (CG) data has been highly favoured as a way to compensate for the lack of endoscope data when training a depth estimation network in a supervised manner. However, the use of CG data can lead to suboptimal generalization to real endoscope data.

In this paper, we propose a novel data generation strategy for self-supervised training of a monocular depth estimation network in gastroendoscopy. To obtain dense reference depth images for training, we build on the finding that dense 3D reconstruction with structure from motion (SfM) can be achieved by using chromoendoscopic images sprayed with indigo carmine (IC) blue dye. Based on this finding, we first apply our stomach 3D reconstruction pipeline to IC-dye-sprayed images to obtain camera poses and a dense 3D model. Although training pairs of IC-dye-sprayed RGB and depth images can be generated using the estimated camera poses and the reconstructed 3D model, those pairs cannot be directly used to train a depth estimation network for standard no-IC RGB images. Therefore, to make our depth estimation network applicable to general no-IC images, we apply an image-to-image translation generative adversarial network (GAN) to generate virtual no-IC images from real IC-dye-sprayed images. The depth estimation network is finally trained using the generated depth images, real IC images, and GAN-augmented no-IC images mixed together.
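As a rough illustration of how a reference depth image can follow from an estimated camera pose and a reconstructed 3D model, the sketch below projects 3D points into a pinhole camera and keeps the nearest depth per pixel. It is not the renderer used in this work: it splats points (for example, mesh vertices) instead of rasterizing mesh triangles, and the function name `render_depth`, the world-to-camera convention `X_c = R X_w + t`, and the intrinsics matrix `K` are all assumptions made for the example.

```python
import numpy as np

def render_depth(points_world, R, t, K, height, width):
    """Point-splat z-buffer: per-pixel depth of the nearest projected 3D point.

    Hypothetical helper for illustration only.
    points_world: (N, 3) points in world coordinates.
    R, t: assumed world-to-camera rotation (3, 3) and translation (3,).
    K: assumed pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    # World -> camera coordinates: X_c = R @ X_w + t
    pts_cam = points_world @ R.T + t
    z = pts_cam[:, 2]

    # Discard points behind (or on) the camera plane
    in_front = z > 1e-6
    pts_cam, z = pts_cam[in_front], z[in_front]

    # Pinhole projection to integer pixel coordinates
    uv = pts_cam @ K.T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)

    # Keep only projections that land inside the image
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    depth = np.full((height, width), np.inf)
    # Unbuffered minimum: the smallest depth hitting each pixel wins
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0  # 0 marks pixels without reference depth
    return depth
```

Calling `render_depth` once per reconstructed camera would give one depth image per input frame. A point-based render leaves holes between samples, which is one reason a mesh-based render is better suited to producing dense depth.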

Paper

[To appear]

Flowchart

This section illustrates the overall flow of our proposed pipeline.

The overall flow of the proposed pipeline.

Our proposed method is divided into three main parts: (i) reference depth data generation, (ii) virtual no-IC image generation, and (iii) monocular depth estimation training.

Reference depth data generation. We first apply a stomach 3D reconstruction SfM pipeline [2] to estimate camera poses and reconstruct a 3D mesh model. Here, we use real IC-dye-sprayed texture-enhanced images (IC images) as SfM inputs, since dense 3D reconstruction cannot be achieved using standard no-IC images because of their textureless surface properties. Using the obtained 3D mesh, we then generate a dense reference depth image for each reconstructed camera.
Virtual no-IC image generation. The dense stomach 3D reconstruction pipeline [2] cannot work with no-IC images because of their textureless surfaces. Thus, reference depth data for no-IC images cannot be obtained directly, which would limit depth estimation to real IC images only. To address this issue, we propose to apply CycleGAN, which works with unpaired data, to generate virtual no-IC images. To train the CycleGAN, we used unpaired real IC and real no-IC images extracted from our experimental endoscope data. We then use the trained CycleGAN to translate the real IC images into virtual no-IC images. This approach enables us to create pairs of reference depth images and no-IC images for self-supervised depth training.
Monocular depth estimation training. Although our main goal is to predict depth from conventional endoscopic images without IC dye, which can be achieved by training on virtual no-IC images only, chromoendoscopy with IC dye is also widely applied in gastroendoscopy. For that reason, we mix both real IC and virtual no-IC images into the training set to make our network applicable to both data types in the application phase.
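The mixing step described above can be sketched as follows. This is a minimal illustration, assuming that each virtual no-IC image is translated from one real IC frame and therefore shares that frame's reference depth. The directory layout (`real_ic/`, `virtual_no_ic/`, `depth/`), the `Sample` record, and the function `build_training_set` are hypothetical names introduced for the example, not part of the released pipeline.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    rgb_path: str    # input image given to the depth network
    depth_path: str  # reference depth rendered from the 3D model
    domain: str      # "real_ic" or "virtual_no_ic"

def build_training_set(frame_ids, seed=0):
    """Pair every frame's reference depth with both image domains, then shuffle.

    All paths are hypothetical placeholders for illustration.
    """
    samples = []
    for fid in frame_ids:
        depth = f"depth/{fid}.png"
        # Real IC frame: the image that was actually reconstructed with SfM
        samples.append(Sample(f"real_ic/{fid}.png", depth, "real_ic"))
        # Virtual no-IC frame: GAN translation of the same frame, same depth
        samples.append(Sample(f"virtual_no_ic/{fid}.png", depth, "virtual_no_ic"))
    # Shuffle so that both domains are mixed within the training order
    random.Random(seed).shuffle(samples)
    return samples
```

Because the translated image keeps the geometry of its source frame, a single reference depth image supervises two inputs, which is what lets the no-IC domain be trained without reconstructing no-IC sequences.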

Reference depth and virtual no-IC image generation

Generated virtual no-IC image


Real IC-sprayed image

Generated virtual no-IC image

Generated reference depth

Predicted depth

On a full sequence of the test set

Here we show an example result of the predicted depth on a test set for both real IC-sprayed and real no-IC images. The network used for this depth prediction is trained on both real IC-sprayed and virtual no-IC images. In addition, we also show the prediction result using a network trained on publicly available colonoscopy data.


Predicted depth on real IC-sprayed images
From left to right: Input image - Predicted depth - Reference depth - Error map

Predicted depth on real no-IC images. Note that we do not have reference depth for real no-IC images at the time of writing.
From left to right: Input image - Predicted depth

Predicted depth on real no-IC images using a network trained on publicly available colonoscopy data.
From left to right: Input image - Predicted depth

Examples on a specific frame


Predicted depth from network trained using our proposed method.

Predicted depth from network trained using publicly available colonoscopy data.

Predicted depth from network trained using our proposed method.

Predicted depth from network trained using publicly available colonoscopy data.

Here we can see that the network trained on publicly available colonoscopy data fails to provide good depth prediction results compared to the network trained with our proposed pipeline.

Objective evaluation


The objective evaluation of the estimated depth tested on real IC and virtual no-IC sequences for four subjects.

From both the subjective and objective evaluations, we can see that the network trained using our proposed approach is, on average, even better than a network trained on the same modality as the test set. Moreover, the network trained with our approach performs better on both IC and no-IC sequences. This suggests that adding the virtual no-IC images as data augmentation does not degrade the performance of the network.
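This page does not state which error measures the objective evaluation reports. As a hedged sketch, the function below computes two measures that are commonly used for depth estimation, absolute relative error and root-mean-square error, together with a per-pixel absolute error map like the one shown next to the predictions on real IC-sprayed images. The name `depth_errors` and the convention that a reference depth of 0 marks a pixel without reference data are assumptions made for the example.

```python
import numpy as np

def depth_errors(pred, ref):
    """Compare a predicted depth image against a reference depth image.

    Hypothetical helper: pixels where ref == 0 are assumed to have no
    reference depth and are excluded from the scores.
    """
    valid = ref > 0
    diff = pred[valid] - ref[valid]

    # Absolute relative error: mean of |pred - ref| / ref over valid pixels
    abs_rel = float(np.mean(np.abs(diff) / ref[valid]))
    # Root-mean-square error over valid pixels
    rmse = float(np.sqrt(np.mean(diff ** 2)))

    # Per-pixel absolute error, left at 0 where no reference exists
    error_map = np.zeros_like(ref, dtype=float)
    error_map[valid] = np.abs(diff)
    return {"abs_rel": abs_rel, "rmse": rmse}, error_map
```

Averaging such scores over all frames of a sequence, and then over subjects, would give one number per training configuration to compare.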