We present SimpleMapping, a real-time visual-inertial dense mapping method that performs high-quality incremental 3D mesh reconstruction using only sequential monocular images and inertial measurement unit (IMU) readings.
6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO), which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that can effectively leverage the informative but noisy sparse points from the VIO system. The sparse depth from the VIO is first completed by a single-view depth completion network. This dense depth map, although naturally limited in accuracy, is then used as a prior to guide our MVS network in cost volume generation and regularization, yielding accurate dense depth prediction. Depth maps predicted by the MVS network for keyframe images are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the complete dense mapping system on several public datasets as well as on our own dataset, demonstrating the system's impressive generalization capabilities and its ability to deliver high-quality 3D reconstruction online. Our dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset.
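The TSDF-Fusion step mentioned above admits a compact illustration: each predicted keyframe depth map is integrated into a global voxel grid as a weighted running average of truncated signed distances. The sketch below is a minimal, point-wise version under assumed conventions (grid layout, truncation handling, and the function name are illustrative, not the system's actual implementation):

```python
import numpy as np

def fuse_depth_into_tsdf(tsdf, weights, depth, K, T_wc, origin, voxel_size, trunc):
    """Integrate one keyframe depth map into a global TSDF grid (simplified sketch).

    tsdf, weights : (X, Y, Z) arrays holding the running SDF average and fusion weights
    depth         : (H, W) dense depth map predicted for the keyframe
    K             : (3, 3) camera intrinsics
    T_wc          : (4, 4) camera-to-world pose from VIO
    """
    H, W = depth.shape
    # World coordinates of every voxel center.
    idx = np.stack(np.meshgrid(*[np.arange(s) for s in tsdf.shape], indexing="ij"), -1)
    pts_w = origin + (idx + 0.5) * voxel_size
    # Transform voxel centers into the camera frame.
    T_cw = np.linalg.inv(T_wc)
    pts_c = pts_w @ T_cw[:3, :3].T + T_cw[:3, 3]
    z = pts_c[..., 2]
    # Project into the image plane.
    u = np.round(K[0, 0] * pts_c[..., 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[..., 1] / z + K[1, 2]).astype(int)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.where(valid, depth[v.clip(0, H - 1), u.clip(0, W - 1)], 0.0)
    # Signed distance along the viewing ray, truncated and normalized to [-1, 1].
    sdf = d - z
    keep = valid & (d > 0) & (sdf > -trunc)
    sdf = np.clip(sdf / trunc, -1.0, 1.0)
    # Weighted running average (new observation gets weight 1).
    w_new = weights + keep
    tsdf[:] = np.where(keep, (tsdf * weights + sdf) / np.maximum(w_new, 1), tsdf)
    weights[:] = w_new
```

Because each keyframe only updates voxels in front of (and slightly behind) the observed surface, fusion is incremental and naturally averages out per-frame depth noise.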
System overview of SimpleMapping. The VIO takes monocular images and IMU data as input to estimate 6-DoF camera poses and to generate a local map containing noisy sparse 3D points. The dense mapping process first performs single-view depth completion from the VIO sparse depth map $\hat{\mathbf{D}}_{s_0}$ and the reference image frame $\mathbf{I}_0$, then adopts a multi-view stereo (MVS) network to infer a high-quality dense depth map for $\mathbf{I}_0$. The depth prior $\hat{\mathbf{D}}_0$ and the hierarchical deep feature maps $\mathcal{F}_0$ from the single-view depth completion contribute to the cost volume formulation and the 2D CNN cost volume regularization in the MVS. The high-quality dense depth prediction of the MVS, $\breve{\mathbf{D}}_0$, is then fused into a global TSDF grid map for a coherent dense reconstruction.
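One way a depth prior can guide cost volume formulation is by centering the per-pixel depth hypotheses on the prior instead of sweeping a fixed fronto-parallel range, so samples concentrate where the surface is likely to be. The sketch below illustrates this idea only; the function names and the relative-interval parameterization are assumptions, not SPA-MVSNet's exact formulation:

```python
import numpy as np

def prior_guided_hypotheses(depth_prior, num_planes=64, rel_interval=0.1):
    """Per-pixel depth hypotheses centered on a single-view depth prior.

    depth_prior : (H, W) completed dense depth from the single-view network
    returns     : (num_planes, H, W) candidate depths per pixel, spanning
                  +/- rel_interval around the prior.
    """
    offsets = np.linspace(-1.0, 1.0, num_planes).reshape(-1, 1, 1)
    return depth_prior[None] * (1.0 + rel_interval * offsets)

def soft_argmin_depth(prob_volume, hypotheses):
    """Regress depth as the probability-weighted mean over hypotheses.

    prob_volume : (num_planes, H, W), softmax-normalized over the plane axis
                  (e.g. from the regularized cost volume).
    """
    return (prob_volume * hypotheses).sum(axis=0)
```

Compared with a fixed global depth range, this prior-centered sampling spends the same number of planes on a much narrower, better-placed interval per pixel, which is why even a coarse single-view prior can sharpen the MVS prediction.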
 
Here we present 3D reconstructions on the EuRoC V102 and V203 scenarios. SimpleMapping consistently yields more detailed reconstructions, even in these challenging scenarios.
 
 
 
As can be observed, both the baseline system and TANDEM suffer from inconsistent geometry and noticeable noise. SimpleMapping surpasses the other methods significantly in terms of dense mapping accuracy and robustness.
 
 
 
We showcase that SimpleMapping, using only a monocular camera without an IMU, achieves reconstruction performance comparable to a state-of-the-art RGB-D dense SLAM method with neural implicit representation, Vox-Fusion. Here are the results on the ScanNet test set. Vox-Fusion tends to produce over-smoothed geometry and drifts during long-term tracking, resulting in inconsistent reconstruction, as observed in Scene0787.
 
 
 
We report the average per-keyframe runtime of each module and the per-frame runtime of the whole pipeline, evaluated on the EuRoC dataset. SimpleMapping achieves real-time performance, requiring only 55 ms to process one frame (about 18 FPS).
 
 
 
@inproceedings{Xin2023ISMAR,
author = {Xin, Yingye and Zuo, Xingxing and Lu, Dongyue and Leutenegger, Stefan},
title = {{SimpleMapping: Real-Time Visual-Inertial Dense Mapping with Deep Multi-View Stereo}},
booktitle = {IEEE International Symposium on Mixed and Augmented Reality (ISMAR)},
month = {Oct},
year = {2023}
}