SimpleMapping: Real-Time Visual-Inertial Dense Mapping with Deep Multi-View Stereo

IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2023

Yingye Xin1 *     Xingxing Zuo1,2 * #     Dongyue Lu1     Stefan Leutenegger1,2        

* Equal Contribution

# Corresponding Author

1Smart Robotics Lab, Technical University of Munich   2Munich Center for Machine Learning (MCML)  



Showcase of EuRoC/V201 reconstructed in real time with SimpleMapping. The current camera frame (green) navigates across the room while the textured surface mesh is incrementally reconstructed online.

Video


Abstract

We present SimpleMapping, a real-time visual-inertial dense mapping method capable of performing incremental 3D mesh reconstruction with high quality using only sequential monocular images and inertial measurement unit (IMU) readings.

6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO), which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that can effectively leverage the informative but noisy sparse points from the VIO system. The sparse depth from VIO is first completed by a single-view depth completion network. This dense depth map, although naturally limited in accuracy, is then used as a prior to guide our MVS network in cost volume generation and regularization for accurate dense depth prediction. The depth maps predicted by the MVS network for keyframe images are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the entire dense mapping system on several public datasets as well as on our own dataset, demonstrating the system's impressive generalization capabilities and its ability to deliver high-quality 3D reconstruction online. Our proposed dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset.
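
For context, the F-score quoted above is the standard 3D reconstruction metric: the harmonic mean of the precision and recall of surface points within a distance threshold. Below is a minimal sketch of how it is typically computed; the 5 cm threshold and the sampled point-cloud inputs are common choices in the literature, not necessarily the paper's exact evaluation protocol.

```python
# A minimal sketch of the 3D-reconstruction F-score, assuming point clouds
# sampled from the predicted and ground-truth meshes. The 5 cm threshold is
# a common choice in the literature, not necessarily the paper's protocol.
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, gt_pts, tau=0.05):
    """pred_pts, gt_pts: (N, 3) arrays of surface samples; tau in meters."""
    # Precision: fraction of predicted points within tau of the GT surface.
    precision = float((cKDTree(gt_pts).query(pred_pts)[0] < tau).mean())
    # Recall: fraction of GT points within tau of the predicted surface.
    recall = float((cKDTree(pred_pts).query(gt_pts)[0] < tau).mean())
    # F-score: harmonic mean of precision and recall.
    return 2 * precision * recall / max(precision + recall, 1e-8)
```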


Method



System overview of SimpleMapping. The VIO takes monocular images and IMU data as input to estimate 6-DoF camera poses and to generate a local map containing noisy sparse 3D points. The dense mapping process first performs single-view depth completion from the VIO sparse depth map $\hat{\mathbf{D}}_{s_0}$ and the reference image frame $\mathbf{I}_0$, then adopts a multi-view stereo (MVS) network to infer a high-quality dense depth map for $\mathbf{I}_0$. The depth prior $\hat{\mathbf{D}}_0$ and the hierarchical deep feature maps $\mathcal{F}_0$ from the single-view depth completion contribute to the cost volume formulation and the 2D CNN cost volume regularization in the MVS. The high-quality dense depth prediction from the MVS, $\breve{\mathbf{D}}_0$, is then fused into a global TSDF grid map for a coherent dense reconstruction.
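
To make the prior-guided cost volume concrete, the sketch below shows per-pixel depth hypotheses centered on the completed depth prior $\hat{\mathbf{D}}_0$, followed by plane-sweep warping of source-view features into the reference view. This is a minimal illustration under assumed pinhole intrinsics and a VIO-provided reference-to-source pose; the function names and hypothesis range are ours, not the released SPA-MVSNet code.

```python
# A minimal sketch of depth-prior-guided plane sweeping, assuming pinhole
# intrinsics K and a known reference-to-source pose (R, t) from the VIO.
# Names and the hypothesis range are illustrative, not the released code.
import torch
import torch.nn.functional as F

def prior_guided_hypotheses(prior_depth, num_hyp=16, rel_range=0.25):
    """Per-pixel depth hypotheses centered on the dense depth prior.

    prior_depth: (B, 1, H, W) prior from single-view depth completion.
    Returns: (B, num_hyp, H, W) hypotheses spanning +-rel_range around it.
    """
    scales = torch.linspace(1.0 - rel_range, 1.0 + rel_range, num_hyp,
                            device=prior_depth.device, dtype=prior_depth.dtype)
    return prior_depth * scales.view(1, num_hyp, 1, 1)

def warp_source_features(src_feat, depth_hyp, K, R, t):
    """Warp source-view features into the reference view per hypothesis.

    src_feat: (B, C, H, W) source features; depth_hyp: (B, D, H, W);
    K: (B, 3, 3); R: (B, 3, 3) and t: (B, 3, 1) reference-to-source pose.
    Returns: (B, C, D, H, W) warped feature volume.
    """
    B, C, H, W = src_feat.shape
    D = depth_hyp.shape[1]
    dev, dt = src_feat.device, src_feat.dtype
    # Homogeneous pixel grid of the reference view, (B, 3, H*W).
    v, u = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                          torch.arange(W, device=dev, dtype=dt), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).reshape(1, 3, -1).expand(B, 3, H * W)
    rays = torch.linalg.solve(K, pix)                            # back-projected rays
    pts = rays.unsqueeze(1) * depth_hyp.reshape(B, D, 1, H * W)  # 3D points per hypothesis
    pts = R.unsqueeze(1) @ pts + t.unsqueeze(1)                  # into the source frame
    uvw = K.unsqueeze(1) @ pts                                   # perspective projection
    uv = uvw[:, :, :2] / uvw[:, :, 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    gx = uv[:, :, 0] / (W - 1) * 2 - 1
    gy = uv[:, :, 1] / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).reshape(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.reshape(B, C, D, H, W)
```

The warped features from each source view can then be matched against the reference features (e.g., via dot product or variance across views) to build the cost volume, which is regularized by a 2D CNN together with the completion features $\mathcal{F}_0$.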


Experimental Results

(Baseline denotes the same VIO frontend integrated with SimpleRecon.)


EuRoC Dataset

Here we present 3D reconstructions of the EuRoC V102 and V203 sequences. SimpleMapping yields consistently more detailed reconstructions, even in these challenging scenarios.


ETH3D Dataset and Self-collected Dataset

As can be observed, both the baseline system and TANDEM suffer from inconsistent geometry and noticeable noise. SimpleMapping significantly surpasses the other methods in dense mapping accuracy and robustness.


ScanNet Dataset

We showcase the comparable reconstruction performance of SimpleMapping, using only a monocular camera without an IMU, against Vox-Fusion, a state-of-the-art RGB-D dense SLAM method with a neural implicit representation. Below are the results on the ScanNet test set. Vox-Fusion tends to produce over-smoothed geometry and to drift during long-term tracking, resulting in inconsistent reconstructions, as observed in Scene0787.


Runtime Efficiency

We report the average per-keyframe runtime of each module and the average per-frame runtime of the whole pipeline, evaluated on the EuRoC dataset. SimpleMapping ensures real-time performance, requiring only 55 ms to process one frame.


BibTeX

@inproceedings{Xin2023ISMAR,
  author    = {Xin, Yingye and Zuo, Xingxing and Lu, Dongyue and Leutenegger, Stefan},
  title     = {{SimpleMapping: Real-Time Visual-Inertial Dense Mapping with Deep Multi-View Stereo}},
  booktitle = {IEEE International Symposium on Mixed and Augmented Reality (ISMAR)},
  month     = {Oct},
  year      = {2023}
}

Acknowledgements

This work was partially supported by the Munich Center for Machine Learning (MCML), Germany.