Stomach 3D Reconstruction Based on Virtual Chromoendoscopic Image Generation

(To be presented at) Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2020

Aji Resindra Widya1, Yusuke Monno1, Masatoshi Okutomi1, Sho Suzuki2, Takuji Gotoda2, Kenji Miki3
1Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology
2Division of Gastroenterology and Hepatology, Department of Medicine, Nihon University School of Medicine
3Department of Internal Medicine, Tsujinaka Hospital Kashiwanoha
Abstract

Gastric endoscopy is a standard clinical process that enables medical practitioners to diagnose various lesions inside a patient's stomach. If any lesion is found, it is very important to perceive the location of the lesion relative to the global view of the stomach. Our previous research showed that this could be addressed by reconstructing the whole stomach shape from chromoendoscopic images using a structure-from-motion (SfM) pipeline, in which indigo carmine (IC) blue dye-sprayed images were used to increase feature matches for SfM by enhancing stomach surface's textures. However, spraying the IC dye to the whole stomach requires additional time, labor, and cost, which is not desirable for patients and practitioners. In this paper, we propose an alternative way to achieve whole stomach 3D reconstruction without the need of the IC dye by generating virtual IC-sprayed (VIC) images based on image-to-image style translation trained on unpaired real no-IC and IC-sprayed images. We have specifically investigated the effect of input and output color channel selection for generating the VIC images and found that translating no-IC green-channel images to IC-sprayed red-channel images gives the best SfM reconstruction result.

Summary

This study was conducted in accordance with the Declaration of Helsinki. The Institutional Review Board at Nihon University Hospital approved the study protocol on March 8, 2018, before patient recruitment. Informed consent was obtained from all patients before they were enrolled. This study was registered with the University Hospital Medical Information Network (UMIN) Clinical Trials Registry (identification No.: UMIN000031776) on March 17, 2018. This study was also approved by the research ethics committee of Tokyo Institute of Technology, where the 3D reconstruction experiments were conducted.

In our previous study, we tackled the problem of lesion localization by reconstructing a whole stomach 3D shape from endoscopic images using a structure-from-motion (SfM) pipeline. Unfortunately, that pipeline only works with gastroendoscopy sequences captured with indigo carmine (IC) blue dye. Although the IC dye is commonly used in gastroendoscopy, spraying it on the whole stomach surface requires additional procedure time, labor, and cost, which is undesirable for both patients and practitioners. Furthermore, because of its dark color tone, the IC dye may hinder the visibility of some lesions and darken the reconstructed stomach 3D models. In this work, we extend our previous work by proposing a framework that reconstructs the whole stomach shape from endoscope video using an SfM pipeline without the need for the IC dye.

Based on our investigation, we have found that the generated VIC images significantly increase the number of extracted SIFT feature points. We have also found that translating no-IC green-channel images to IC-sprayed red-channel images significantly improves the SfM reconstruction quality. We have experimentally demonstrated that our new approach can reconstruct the whole stomach shapes of all seven subjects and shown that the estimated camera poses can be used for lesion localization.

Paper

[To appear]

@misc{widya2020stomach,
title={Stomach 3D Reconstruction Based on Virtual Chromoendoscopic Image Generation},
author={Aji Resindra Widya and Yusuke Monno and Masatoshi Okutomi and Sho Suzuki and Takuji Gotoda and Kenji Miki},
year={2020},
eprint={2004.12288},
archivePrefix={arXiv},
primaryClass={cs.CV}
}

Flowchart

Our best result was achieved with the following setting for training the image-to-image translation network (cycleGAN):

The flow of the overall pipeline.

We train the cycleGAN to translate the green channel of no-IC images to the red channel of IC images. We chose this setting because our previous study showed that the green channel yields the largest number of extracted features for images without the IC dye, while the red channel yields the largest number for images with the IC dye. We also experimented with translating the red channel of no-IC images to the red channel of IC images, which gave less favorable results.
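The channel selection described above can be illustrated with a minimal sketch. This is not the authors' code: the function names `select_channel` and `build_unpaired_domains` are hypothetical, and images are represented as plain nested lists of (r, g, b) tuples only to keep the example dependency-free. It shows how the two unpaired training domains could be prepared before being passed to an image-to-image translation network.

```python
# Hypothetical helper names; images are lists of rows of (r, g, b) tuples.
CHANNEL_INDEX = {"r": 0, "g": 1, "b": 2}


def select_channel(image, channel):
    """Return one color channel of an RGB image as a 2D list of values."""
    idx = CHANNEL_INDEX[channel]
    return [[pixel[idx] for pixel in row] for row in image]


def build_unpaired_domains(no_ic_images, ic_images):
    """Prepare the two unpaired single-channel domains.

    Domain A: green channel of no-IC images (translation input).
    Domain B: red channel of IC-sprayed images (translation target style).
    The two lists need not have the same length or contain matching scenes.
    """
    domain_a = [select_channel(img, "g") for img in no_ic_images]
    domain_b = [select_channel(img, "r") for img in ic_images]
    return domain_a, domain_b


if __name__ == "__main__":
    # Two tiny toy images with made-up pixel values.
    no_ic = [[[(10, 20, 30), (40, 50, 60)]]]
    ic = [[[(1, 2, 3)], [(4, 5, 6)]]]
    domain_a, domain_b = build_unpaired_domains(no_ic, ic)
    print(domain_a, domain_b)
```

Switching the first domain to `"r"` would correspond to the red-to-red setting that was also tried.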

Image-to-image translation results

Real Red channel no-IC
no-ICr in paper

Real red channel no-IC to virtual red channel with IC
VICr2r and cGANr2r in paper

Real Green channel no-IC
no-ICg in paper

Real green channel no-IC to virtual red channel with IC
VICg2r and cGANg2r in paper


As the results show, both cGANr2r and cGANg2r can transfer the style of the IC-sprayed image to the input no-IC image, not only by transferring the IC pattern but also by raising or lowering the overall surface brightness. However, in the red-channel no-IC image the surface looks fairly textureless, and it is hard even for convolutional neural networks (CNNs) to extract features from such textureless images. In contrast, the green-channel no-IC image shows more texture, enabling slightly better style transfer.

Feature matching performance

First, we show the features extracted from three types of sequences below. The red dots in each video represent the extracted features. We can see that the virtual IC sequence generated by cGANg2r has the largest number of extracted features.


Real Green channel no-IC
no-ICg in paper

Real red channel no-IC to virtual red channel with IC
VICr2r and cGANr2r in paper

Real green channel no-IC to virtual red channel with IC
VICg2r and cGANg2r in paper

We then compare, in the graph below, the average number of feature matches between an anchor frame and its 10 consecutive frames. The x-axis represents the relative time stamp and the y-axis represents the average number of matches calculated using 43 samples. The graph clearly shows that the VIC images from cGANr2r have a higher number of matches across frames.


Feature matches on green channel no-IC images
Feature matches on VIC images from cGANr2r
Feature matches on VIC images from cGANg2r
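The averaging behind the comparison above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the paper: `match_count` is a hypothetical callable standing in for whatever feature matcher is used, and `anchors` is a hypothetical list of anchor frame indices (43 samples in the text above).

```python
from statistics import mean


def average_matches_per_offset(match_count, anchors, num_offsets=10):
    """Average the number of matches at each frame offset over all anchors.

    match_count(i, j) is assumed to return the number of feature matches
    between frame i and frame j. The result has one value per offset
    k = 1..num_offsets, i.e. one point per x-axis position in the graph.
    """
    return [
        mean(match_count(anchor, anchor + k) for anchor in anchors)
        for k in range(1, num_offsets + 1)
    ]


if __name__ == "__main__":
    # Toy matcher with made-up behavior: matches decay linearly with
    # the temporal distance between the two frames.
    def toy_match_count(i, j):
        return max(0, 100 - 10 * (j - i))

    curve = average_matches_per_offset(toy_match_count, anchors=[0, 5, 12])
    print(curve)
```

Running the same function once per sequence type (green-channel no-IC, cGANr2r, cGANg2r) would give the three curves being compared.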

Reconstruction results

Please note that the embedded 3D model viewers show the inner texture of the stomach and are rendered in back-face culling mode. Also note that the order of the subjects differs from that on the other page.

Subject A VIC input images. Red dots in each frame represent extracted feature points.

Subject B VIC input images. Red dots in each frame represent extracted feature points.

Subject C VIC input images. Red dots in each frame represent extracted feature points.

Subject D VIC input images. Red dots in each frame represent extracted feature points.

Subject E VIC input images. Red dots in each frame represent extracted feature points.

Subject F VIC input images. Red dots in each frame represent extracted feature points.

Subject G VIC input images. Red dots in each frame represent extracted feature points.