AirSLAM: An Efficient and Illumination-Robust Point-Line Visual SLAM System

Kuan Xu1, Yuefan Hao2, Shenghai Yuan1
Chen Wang2,  Lihua Xie1
This work is supported by the National Research Foundation of Singapore under its Medium-Sized Center for Advanced Robotics Technology Innovation.1Kuan Xu, Shenghai Yuan, and Lihua Xie are with the Centre for Advanced Robotics Technology Innovation (CARTIN), School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, {kuan.xu,shyuan,elhxie}@ntu.edu.sg.2Yuefan Hao and Chen Wang are with Spatial AI & Robotics Lab, Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA [email protected], [email protected].
Abstract

In this paper, we present an efficient visual SLAM system designed to tackle both short-term and long-term illumination challenges. Our system adopts a hybrid approach that combines deep learning techniques for feature detection and matching with traditional backend optimization methods. Specifically, we propose a unified convolutional neural network (CNN) that simultaneously extracts keypoints and structural lines. These features are then associated, matched, triangulated, and optimized in a coupled manner. Additionally, we introduce a lightweight relocalization pipeline that reuses the built map, where keypoints, lines, and a structure graph are used to match the query frame with the map. To enhance the applicability of the proposed system to real-world robots, we deploy and accelerate the feature detection and matching networks using C++ and NVIDIA TensorRT. Extensive experiments conducted on various datasets demonstrate that our system outperforms other state-of-the-art visual SLAM systems in illumination-challenging environments. Efficiency evaluations show that our system can run at a rate of 73Hz73Hz73\mathrm{Hz}73 roman_H roman_z on a PC and 40Hz40Hz40\mathrm{Hz}40 roman_H roman_z on an embedded platform. Our implementation is open-sourced: https://github.com/sair-lab/AirSLAM.

Index Terms:
Visual SLAM, Mapping, Relocalization.

I Introduction

Visual simultaneous localization and mapping (vSLAM) is essential for robot navigation due to its favorable balance between cost and accuracy [1]. Compared to LiDAR SLAM, vSLAM utilizes more cost-effective and compact sensors to achieve accurate localization, thus broadening its range of potential applications [2]. Moreover, cameras can capture richer and more detailed information, which enhances their potential for providing robust localization.

Despite the recent advancements, the present vSLAM systems still struggle with severe lighting conditions [3, 4, 5, 6], which can be summarized into two categories. First, feature detection and tracking often fail due to drastic changes or low light, severely affecting the quality of the estimated trajectory [7, 8]. Second, when the visual map is reused for relocalization, lighting variations could significantly reduce the success rate [9, 10]. In this paper, we refer to the first issue as the short-term illumination challenge, which impacts pose estimation between two temporally adjacent frames, and the second as the long-term illumination challenge, which affects matching between the query frame and an existing map.

Present methods usually focus on only one of the above challenges. For example, various image enhancement [11, 12, 13] and image normalization algorithms [14, 15] have been developed to ensure robust tracking. These methods primarily focus on maintaining either global or local brightness consistency, yet they often fall short of handling all types of challenging lighting conditions [16]. Some systems have addressed this issue by training a VO or SLAM network on large datasets containing diverse lighting conditions [17, 18, 19]. However, they have difficulty producing a map suitable for long-term localization. Some methods can provide illumination-robust relocalization, but they usually require map building under good lighting conditions [20, 21]. In real-world robot applications, these two challenges often arise simultaneously, necessitating a unified system capable of addressing both.

Furthermore, many of the aforementioned systems incorporate intricate neural networks, relying on powerful GPUs to run in real-time. They lack the efficiency necessary for deployment on resource-constrained platforms, such as warehouse robots. These limitations impede the transition of vSLAM from laboratory research to industrial applications.

In response to these gaps, this paper introduces AirSLAM. Observing that line features can improve the accuracy and robustness of vSLAM systems [5, 22, 23], we integrate both point and line features for tracking, mapping, optimization, and relocalization. To achieve a balance between efficiency and performance, we design our system as a hybrid system, employing learning-based methods for feature detection and matching, and traditional geometric approaches for pose and map optimization. Additionally, to enhance the efficiency of feature detection, we developed a unified model capable of simultaneously detecting point and line features. We also address long-term localization challenges by proposing a multi-stage relocalization strategy, which effectively reuses our point-line map. In summary, our contributions include

  • We propose a novel point-line-based vSLAM system that combines the efficiency of traditional optimization techniques with the robustness of learning-based methods. Our system is resilient to both short-term and long-term illumination challenges while remaining efficient enough for deployment on embedded platforms.

  • We develop a unified model for both keypoint and line detection, which we call PLNet. To our knowledge, PLNet is the first model capable of simultaneously detecting both point and line features. Furthermore, we associate these two types of features and jointly utilize them for tracking, mapping, and relocalization tasks.

  • We propose a multi-stage relocalization method based on both point and line features, utilizing both appearance and geometry information. This method can provide fast and illumination-robust localization in an existing visual map using only a single image.

  • We conduct extensive experiments to demonstrate the efficiency and effectiveness of the proposed methods. The results show that our system achieves accurate and robust mapping and relocalization performance under various illumination-challenging conditions. Additionally, our system is also very efficient. It runs at a rate of 73Hz73Hz73\mathrm{Hz}73 roman_H roman_z on a PC and 40Hz40Hz40\mathrm{Hz}40 roman_H roman_z on an embedded platform.

In addition, our engineering contributions include deploying and accelerating feature detection and matching networks using C++ and NVIDIA TensorRT, facilitating their deployment on real robots. We release all the source code at https://github.com/sair-lab/AirSLAM to benefit the community.

This paper extends our conference paper, AirVO [22]. AirVO utilizes SuperPoint [24] and LSD [25] for feature detection, and SuperGlue [26] for feature matching. It achieves remarkable performance in environments with changing illumination. However, as a visual-only odometry, it primarily addresses short-term illumination challenges and cannot reuse a map for drift-free relocalization. Additionally, despite carefully designed post-processing operations, the modified LSD is still not stable enough for long-term localization. It relies on image gradient information rather than environmental structural information, rendering it susceptible to varying lighting conditions. In this version, we introduce substantial improvements, including:

  • We design a unified CNN to detect both point and line features, enhancing the stability of feature detection in illumination-challenging environments. Additionally, the more efficient LightGlue [27] is used for feature matching.

  • We extend our system to support both stereo-only and stereo-inertial data, increasing its reliability when an inertial measurement unit (IMU) is available.

  • We incorporate loop closure detection and map optimization, forming a complete vSLAM system.

  • We design a multi-stage relocalization module based on both point and line features, enabling our system to effectively handle long-term illumination challenges.

The remainder of this article is organized as follows. In Section II, we discuss the relevant literature. Section III-B presents an overview of the complete system pipeline. The proposed PLNet is presented in Section IV. In Section V, we introduce the visual-inertial odometry based on PLNet. Section VI shows how to optimize the map offline and reuse it online. The detailed experimental results are presented in Section VII to verify the efficiency, accuracy, and robustness of AirSLAM. This article is concluded in Section VIII.

II Related work

II-A Keypoint and Line Detection for vSLAM

II-A1 Keypoint Detection

Various handcrafted keypoint features e.g., ORB [28], FAST [29], and BRISK [30], have been proposed and applied to VO and vSLAM systems. They are usually efficient but not robust enough in challenging environments [26, 9]. With the development of deep learning techniques, more and more learning-based features are proposed and used to replace the handcrafted features in vSLAM systems. Rong et al. [31] introduce TFeat network [32] to extract descriptors for FAST corners and apply it to a traditional vSLAM pipeline. Tang et al. [33] use a neural network to extract robust keypoints and binary feature descriptors with the same shape as the ORB. Han et al. [34] combine SuperPoint [24] feature extractor with a traditional back-end. Bruno et al. proposed LIFT-SLAM [35], where they use LIFT [36] to extract features. Li et al. [37] replace the ORB feature with SuperPoint in ORB-SLAM2 and optimize the feature extraction with the Intel OpenVINO toolkit. Zhan et al. [38] proposed a new self-supervised training scheme for learning-based features, using bundle adjustment and bi-level optimization as a supervision signal. Some other learning-based features, e.g., R2D2 [39] and DISK [40], and D2Net [41], are also being attempted to be applied to vSLAM systems, although they are not yet efficient enough [42, 43].

II-A2 Line Detection

Currently, most point-line-based vSLAM systems use the LSD [25] or EDLines [44] to detect line features because of their good efficiency [5, 45, 46, 47, 48]. Although many learning-based line detection methods, e.g., SOLD2 [49], AirLine [50], and HAWP [51], have been proposed and shown better robustness in challenging environments, they are difficult to apply to real-time vSLAM systems due to lacking efficiency. For example, Kannapiran et al. propose StereoVO [23], where they choose SuperPoint [24] and SOLD2 [49] to detect keypoints and line segments, respectively. Despite achieving good performance in dynamic lighting conditions, StereoVO can only run at a rate of about 7Hz7Hz7\mathrm{Hz}7 roman_H roman_z on a good GPU.

II-B Short-Term Illumination Challenge

Several handcrafted methods have been proposed to improve the robustness of VO and vSLAM to challenging illumination. DSO [52] models brightness changes and jointly optimizes camera poses and photometric parameters. DRMS [11] and AFE-ORB-SLAM [12] utilize various image enhancements. Some systems try different methods, such as ZNCC, the locally-scaled sum of squared differences (LSSD), and dense descriptor computation, to achieve robust tracking [14, 53, 15]. These methods mainly focus on either global or local illumination change for all kinds of images, however, lighting conditions often affect the scene differently in different areas [16]. Other related methods include that of Huang and Liu [54], which presents a multi-feature extraction algorithm to extract two kinds of image features when a single-feature algorithm fails to extract enough feature points. Kim et al. [55] employ a patch-based affine illumination model during direct motion estimation. Chen et al. [56] minimize the normalized information distance with nonlinear least square optimization for image registration. Alismail et al. [57] propose a binary feature descriptor using a descriptor assumption to avoid brightness constancy.

Compared with handcrafted methods, learning-based methods have shown better performance. Savinykh et al. [8] propose DarkSLAM, where Generative Adversarial Network (GAN) [58] is used to enhance input images. Pratap Singh et al. [59] compare different learning-based image enhancement methods for vSLAM in low-light environments. TartanVO [17], DROID-SLAM [18], and iSLAM [19] train their VO or SLAM networks on the TartanAir dataset [60], which is a large simulation dataset that contains various lighting conditions, therefore, they are very robust in challenging environments. However, they usually require good GPUs and long training times. Besides, DROID-SLAM runs very slowly and is difficult to apply to real-time applications on resource-constrained platforms. TartanVO and iSLAM are more efficient, but they cannot achieve performance as accurately as traditional vSLAM systems.

II-C Long-Term Illumination Challenge

Currently, most SLAM systems still use the bag of words (BoW) [61] for loop closure detection and relocalization due to its good balance between efficiency and effectiveness [62, 4, 63]. To make the relocalization more robust to large illumination variations, Labbé et al. [20] propose the multi-session relocalization method, where they combine multiple maps generated at different times and in various illumination conditions. DXSLAM [37] uses NetVLAD [64] for the coarse image retrieval and SuperPoint with a binary descriptor for keypoint matching between the query frame and candidates.

Another similar task in the robotics and computer vision communities is the visual place recognition (VPR) problem, where many researchers handle the localization problem with image retrieval methods [64, 65]. These VPR solutions try to find images most similar to the query image from a database. They usually cannot directly provide accurate pose estimation which is needed in robot applications. Sarlin et al. address this and propose Hloc [9]. They use a global retrieval to obtain several candidates and match local features within those candidates. The Hloc toolbox has integrated many image retrieval methods, local feature extractors, and matching methods, and it is currently the SOTA system. Yan et al. [66] propose a long-term visual localization method for mobile platforms, however, they rely on other sensors, e.g., GPS, compass, and gravity sensor, for the coarse location retrieval.

Refer to caption
Figure 1: The proposed system consists of three main parts: online stereo VO/VIO, offline map optimization, and online relocalization. The VO/VIO module uses the mapping image sequences to build an initial map. Then the initial map is processed offline and an optimized map is outputted. The optimized map can be used for the one-shot relocalization.

III System Overview

III-A Notations

In this paper, \mathbb{R}blackboard_R represents the set of real numbers, and msuperscript𝑚\mathbb{R}^{m}blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT denotes the m𝑚mitalic_m-dimensional real vector space. The transpose of a vector or matrix is written as ()superscripttop(\cdot)^{\top}( ⋅ ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT. For a vector 𝐱m𝐱superscript𝑚\mathbf{x}\in\mathbb{R}^{m}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, 𝐱delimited-∥∥𝐱\lVert\mathbf{x}\rVert∥ bold_x ∥ represents its Euclidean norm, while 𝐱Σ2subscriptsuperscriptdelimited-∥∥𝐱2Σ\lVert\mathbf{x}\rVert^{2}_{\Sigma}∥ bold_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ end_POSTSUBSCRIPT is shorthand for 𝐱Σ𝐱superscript𝐱topΣ𝐱\mathbf{x}^{\top}\Sigma\mathbf{x}bold_x start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT roman_Σ bold_x. The transformation, rotation, and translation from the b𝑏bitalic_b-coordinate system to the a𝑎aitalic_a-coordinate system are denoted by 𝐓abSE(3)subscript𝐓𝑎𝑏SE(3)\mathbf{T}_{ab}\in\text{SE(3)}bold_T start_POSTSUBSCRIPT italic_a italic_b end_POSTSUBSCRIPT ∈ SE(3), 𝐑abSO(3)subscript𝐑𝑎𝑏SO(3)\mathbf{R}_{ab}\in\text{SO(3)}bold_R start_POSTSUBSCRIPT italic_a italic_b end_POSTSUBSCRIPT ∈ SO(3), and 𝐭ab3subscript𝐭𝑎𝑏superscript3\mathbf{t}_{ab}\in\mathbb{R}^{3}bold_t start_POSTSUBSCRIPT italic_a italic_b end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT, respectively. We use ()csubscript𝑐(\cdot)_{c}( ⋅ ) start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT to indicate a vector in the camera frame and ()wsubscript𝑤(\cdot)_{w}( ⋅ ) start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT to indicate a vector in the global world frame. Throughout the paper, super/subscripts may be omitted for brevity, as long as the meaning remains unambiguous within the given context.

Refer to caption
Figure 2: We visualize the feature map (top right) and detected keypoints (bottom left) of a keypoint detection model, and the detected structural lines (bottom right) of a line detection model. The overlap of keypoints and junctions, and the edge information in the feature map inspire the design of our PLNet.

III-B System Architecture

We believe that a practical vSLAM system should possess the following features:

  • High efficiency. The system should have real-time performance on resource-constrained platforms.

  • Scalability. The system should be easily extensible for various purposes and real-world applications.

  • Easy to deploy. The system should be easy to deploy on real robots and capable of achieving robust localization.

Therefore, we design a system as shown in Fig. 1. The proposed system is a hybrid system as we need the robustness of data-driven approaches and the accuracy of geometric methods. It consists of three main components: stereo VO/VIO, offline map optimization, and lightweight relocalization. (1) Stereo VO/VIO: We propose a point-line-based visual odometry that can handle both stereo and stereo-inertial inputs. (2) Offline map optimization: We implement several commonly used plugins, such as loop detection, pose graph optimization, and global bundle adjustment. The system is easily extensible for other map-processing purposes by adding customized plugins. For example, we have implemented a plugin to train a scene-dependent junction vocabulary using the endpoints of line features, which is utilized in our lightweight multi-stage relocalization. (3) Lightweight relocalization: We propose a multi-stage relocalization method that improves efficiency while maintaining effectiveness. In the first stage, keypoints and line features are detected using the proposed PLNet, and several candidates are retrieved using a keypoint vocabulary trained on a large dataset. In the second stage, most false candidates are quickly filtered out using a scene-dependent junction vocabulary and a structure graph. In the third stage, feature matching is performed between the query frame and the remaining candidates to find the best match and estimate the pose of the query frame. Since feature matching in the third stage is typically time-consuming, the filtering process in the second stage enhances the efficiency of our system compared to other two-stage relocalization systems.

We transfer some time-consuming processes, e.g., loop closure detection, pose graph optimization, and global bundle adjustment, to the offline stage. This improves the efficiency of our online mapping module. In many practical applications, such as warehouse robotics, a map is typically built by one robot and then reused by others. Our system is designed with these applications in mind. The lightweight mapping and map reuse modules can be easily deployed on resource-constrained robots, while the offline optimization module can run on a more powerful computer for various map manipulations, such as map editing and visualization. The mapping robot uploads the initial map to the computer, which then distributes the optimized map to other robots, ensuring drift-free relocalization. In the following sections, we introduce our feature detection and visual odometry (VO) pipeline in Section IV and Section V, respectively. The offline optimization and relocalization modules are presented in Section VI.

Refer to caption
Figure 3: The framework of the proposed PLNet. It consists of the shared backbone, the keypoint module, and the line module.

IV Feature Detection

IV-A Motivation

With advancements in deep learning technology, learning-based feature detection methods have demonstrated more stable performance in illumination-challenging environments compared to traditional methods. However, existing point-line-based VO/VIO and SLAM systems typically detect keypoints and line features separately. While it is acceptable for handcrafted methods due to their efficiency, the simultaneous application of keypoint detection and line detection networks in VO/VIO or SLAM systems, especially in stereo configurations, often hinders real-time performance on resource-constrained platforms. Consequently, we aim to design an efficient unified model that can detect keypoints and line features concurrently.

However, achieving a unified model for keypoint and line detection is challenging, as these tasks typically require different real-image datasets and training procedures. Keypoint detection models are generally trained on large datasets comprising diverse images and depend on either a boosting step or the correspondences of image pairs for training [24, 40, 39]. For line detection, we find wireframe parsing methods [67, 51] can provide stronger geometric cues than the self-supervised models [49, 68] as they are able to detect longer and more complete lines. The wireframe includes all prominent straight lines and their junctions within the scene, providing an efficient and accurate representation of large-scale geometry and object shapes [69]. However, these methods are trained on the Wireframe dataset [69], which is limited in size with only 5,462 discontinuous images. In the following sections, we will address this challenge and demonstrate how to train a unified model capable of performing both tasks. It is important to note that in this paper, the term “line detection” refers specifically to the wireframe parsing task.

Refer to caption
Figure 4: We use four parameters, d,θ,θ1,θ2𝑑𝜃subscript𝜃1subscript𝜃2d,\theta,\theta_{1},\theta_{2}italic_d , italic_θ , italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, to encode a line into a point 𝐩𝐩\mathbf{p}bold_p within the attraction region field.

IV-B Architecture Design

As shown in Fig. 2, we have two findings when visualizing the results of the keypoint and line detection networks: (1) Most junctions (endpoints of lines) detected by the line detection model are also selected as keypoints by the keypoint detection model. (2) The feature maps outputted by the keypoint detection model contain the edge information. Therefore, we argue that a line detection model can be built on the backbone of a pre-trained keypoint detection model. Based on this assumption, we design the PLNet to detect keypoints and lines in a unified framework. As shown in Fig. 3, it consists of the shared backbone, the keypoint module, and the line module.

Backbone: We follow SuperPoint [24] to design the backbone for its good efficiency and effectiveness. It uses 8 convolutional layers and 3 max-pooling layers. The input is the grayscale image sized H×W𝐻𝑊H\times Witalic_H × italic_W. The outputs are H×W×64𝐻𝑊64H\times W\times 64italic_H × italic_W × 64, H2×W2×64𝐻2𝑊264\frac{H}{2}\times\frac{W}{2}\times 64divide start_ARG italic_H end_ARG start_ARG 2 end_ARG × divide start_ARG italic_W end_ARG start_ARG 2 end_ARG × 64, H4×W4×128𝐻4𝑊4128\frac{H}{4}\times\frac{W}{4}\times 128divide start_ARG italic_H end_ARG start_ARG 4 end_ARG × divide start_ARG italic_W end_ARG start_ARG 4 end_ARG × 128, H8×W8×128𝐻8𝑊8128\frac{H}{8}\times\frac{W}{8}\times 128divide start_ARG italic_H end_ARG start_ARG 8 end_ARG × divide start_ARG italic_W end_ARG start_ARG 8 end_ARG × 128 feature maps.

Keypoint Module: We also follow SuperPoint [24] to design the keypoint detection header. It has two branches: the score branch and the descriptor branch. The inputs are H8×W8×128𝐻8𝑊8128\frac{H}{8}\times\frac{W}{8}\times 128divide start_ARG italic_H end_ARG start_ARG 8 end_ARG × divide start_ARG italic_W end_ARG start_ARG 8 end_ARG × 128 feature maps outputted by the backbone. The score branch outputs a tensor sized H8×W8×65𝐻8𝑊865\frac{H}{8}\times\frac{W}{8}\times 65divide start_ARG italic_H end_ARG start_ARG 8 end_ARG × divide start_ARG italic_W end_ARG start_ARG 8 end_ARG × 65. The 65 channels correspond to an 8×8888\times 88 × 8 grid region and a dustbin indicating no keypoint. The tensor is processed by a softmax and then resized to H×W𝐻𝑊H\times Witalic_H × italic_W. The descriptor branch outputs a tensor sized H8×W8×256𝐻8𝑊8256\frac{H}{8}\times\frac{W}{8}\times 256divide start_ARG italic_H end_ARG start_ARG 8 end_ARG × divide start_ARG italic_W end_ARG start_ARG 8 end_ARG × 256, which is used for interpolation to compute descriptors of keypoints.

Line Module: This module takes feature maps from the backbone as inputs. It consists of a U-Net-like CNN and the line detection header. We modify the U-Net [70] to make it contain fewer convolutional layers and thus be more efficient. The U-Net-like CNN is to increase the receptive field as detecting lines requires a larger receptive field than detecting keypoints. The EPD LOIAlign [51] is used to process the outputs of the line module and finally outputs junctions and lines.

Refer to caption
Figure 5: We use seven types of photometric data augmentation to train our PLNet to make it more robust to challenging illumination.

IV-C Network Training

Due to the training problem described in Section IV-A and the assumption in Section IV-B, we train our PLNet in two rounds. In the first round, only the backbone and the keypoint detection module are trained, which means we need to train a keypoint detection network. In the second round, the backbone and the keypoint detection module are fixed, and we only train the line detection module on the Wireframe dataset. We skip the details of the first round as they are very similar to [24]. Instead, we present the training of the line detection module.

Refer to caption
Figure 6: The framework of our visual(-inertial) odometry. The system is split into two main threads, which are represented by two different colored regions. Note that the IMU input is not strictly required. The system is optional to use stereo data or stereo-inertial data.

Line Encoding: We adopt the attraction region field [51] to encode line segments. As shown in Fig. 4, for a line segment 𝐥=(𝐱𝟏,𝐱𝟐)𝐥subscript𝐱1subscript𝐱2\mathbf{l}=(\mathbf{x_{1}},\mathbf{x_{2}})bold_l = ( bold_x start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT ), where 𝐱𝟏subscript𝐱1\mathbf{x_{1}}bold_x start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT and 𝐱𝟐subscript𝐱2\mathbf{x_{2}}bold_x start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT are two endpoints of 𝐥𝐥\mathbf{l}bold_l, and a point 𝐩𝐩\mathbf{p}bold_p in the attraction region of 𝐥𝐥\mathbf{l}bold_l, four parameters and 𝐩𝐩\mathbf{p}bold_p are used to encode 𝐥𝐥\mathbf{l}bold_l:

𝐩(𝐥)=(d,θ,θ1,θ2),𝐩𝐥𝑑𝜃subscript𝜃1subscript𝜃2\mathbf{p}\left(\mathbf{l}\right)=\left(d,\theta,\theta_{1},\theta_{2}\right),bold_p ( bold_l ) = ( italic_d , italic_θ , italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , (1)

where d𝑑ditalic_d is the distance from 𝐩𝐩\mathbf{p}bold_p to 𝐥𝐥\mathbf{l}bold_l, θ𝜃\thetaitalic_θ is the angle between 𝐥𝐥\mathbf{l}bold_l and the 𝐯𝐯\mathbf{v}bold_v-axis of the image, θ1subscript𝜃1\theta_{1}italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is the angle between 𝐩𝐱𝟏subscript𝐩𝐱1\mathbf{px_{1}}bold_px start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT and the perpendicular line from 𝐩𝐩\mathbf{p}bold_p to 𝐥𝐥\mathbf{l}bold_l, and θ2subscript𝜃2\theta_{2}italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is the angle between 𝐩𝐱𝟐subscript𝐩𝐱2\mathbf{px_{2}}bold_px start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT and the perpendicular line. The network can predict these four parameters for point 𝐩𝐩\mathbf{p}bold_p and then 𝐥𝐥\mathbf{l}bold_l can be decoded through:

𝐥=d[cosθsinθsinθcosθ][11tanθ1tanθ2]+[𝐩𝐩].𝐥𝑑delimited-[]𝜃𝜃𝜃𝜃delimited-[]11subscript𝜃1subscript𝜃2delimited-[]𝐩𝐩\mathbf{l}=d\cdot\left[\begin{array}[]{cc}\cos{\theta}&-\sin{\theta}\\ \sin{\theta}&\cos{\theta}\end{array}\right]\left[\begin{array}[]{cc}1&1\\ \tan{\theta_{1}}&\tan{\theta_{2}}\end{array}\right]+\left[\begin{array}[]{cc}% \mathbf{p}&\mathbf{p}\end{array}\right].bold_l = italic_d ⋅ [ start_ARRAY start_ROW start_CELL roman_cos italic_θ end_CELL start_CELL - roman_sin italic_θ end_CELL end_ROW start_ROW start_CELL roman_sin italic_θ end_CELL start_CELL roman_cos italic_θ end_CELL end_ROW end_ARRAY ] [ start_ARRAY start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL roman_tan italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL roman_tan italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] + [ start_ARRAY start_ROW start_CELL bold_p end_CELL start_CELL bold_p end_CELL end_ROW end_ARRAY ] . (2)

Line Prediction: The line detection module outputs a tensor sized H4×W4×4𝐻4𝑊44\frac{H}{4}\times\frac{W}{4}\times 4divide start_ARG italic_H end_ARG start_ARG 4 end_ARG × divide start_ARG italic_W end_ARG start_ARG 4 end_ARG × 4 to predict parameters in (1) and a heatmap to predict junctions. For each decoded line segment by (2), two junctions closest to its endpoints will be selected to form a line proposal with it. Proposals with the same junctions will be deduplicated and only one is retained. Then the EPD LOIAlign [51] and a head classifier are applied to decide whether the line proposal is a true line feature.

Line Module Training: We use the L1𝐿1L1italic_L 1 loss to supervise the prediction of parameters in (1) and the binary cross-entropy loss to supervise the junction heatmap and the head classifier. The total loss is the sum of them. As shown in Fig. 5, to improve the robustness of line detection in illumination-challenging environments, seven types of photometric data augmentation are applied to process training images. The training uses the ADAM optimizer [71] with the learning rate lr=4e-4𝑙𝑟4𝑒-4lr=4e\text{-}4italic_l italic_r = 4 italic_e - 4 in the first 35 epochs and lr=4e-5𝑙𝑟4𝑒-5lr=4e\text{-}5italic_l italic_r = 4 italic_e - 5 in the last 5 epochs.

V Stereo Visual Odometry

V-A Overview

The proposed point-line-based stereo visual odometry is shown in Fig. 6. It is a hybrid VO system utilizing both the learning-based front-end and the traditional optimization backend. For each stereo image pair, we first employ the proposed PLNet to extract keypoints and line features. Then a GNN (LightGlue [27]) is used to match keypoints. In parallel, we associate line features with keypoints and match them using the keypoint matching results. After that, we perform an initial pose estimation and reject outliers. Based on the results, we triangulate the 2D features of keyframes and insert them into the map. Finally, the local bundle adjustment will be performed to optimize points, lines, and keyframe poses. In the meantime, if an IMU is accessible, its measurements will be processed using the IMU preintegration method [72], and added to the initial pose estimation and local bundle adjustment.

Applying both learning-based feature detection and matching methods to the stereo VO is time-consuming. Therefore, to improve efficiency, the following three techniques are utilized in our system. (1) For keyframes, we extract features on both left and right images and perform stereo matching to estimate the real scale. But for non-keyframes, we only process the left image. Besides, we use some lenient criteria to make the selected keyframes in our system very sparse, so the runtime and resource consumption of feature detection and matching in our system are close to that of a monocular system. (2) We convert the inference code of the CNN and GNN from Python to C++, and deploy them using ONNX and NVIDIA TensorRT, where the 16-bit floating-point arithmetic replaces the 32-bit floating-point arithmetic. (3) We design a multi-thread pipeline. A producer-consumer model is used to split the system into two main threads, i.e., the front-end thread and the backend thread. The front-end thread extracts and matches features while the backend thread performs the initial pose estimation, keyframe insertion, and local bundle adjustment.

V-B Feature Matching

We use LightGlue [27] to match keypoints. For line features, most of the current VO and SLAM systems use the LBD algorithm [73] or tracking sample points to match them. However, the LBD algorithm extracts the descriptor from a local band region of the line, so it suffers from unstable line detection due to challenging illumination or viewpoint changes. Tracking sample points can match the line detected with different lengths in two frames, but current SLAM systems usually use optical flow to track the sample points, which have a bad performance when the light conditions change rapidly or violently. Some learning-based line feature descriptors [49] are also proposed, however, they are rarely used in current SLAM systems due to the increased time complexity.

Therefore, to address both the effectiveness problem and efficiency problem, we design a fast and robust line-matching method for illumination-challenging conditions. First, we associate keypoints with line segments through their distances. Assume that M𝑀Mitalic_M keypoints and N𝑁Nitalic_N line segments are detected on the image, where the i𝑖iitalic_i-th keypoint is denoted as 𝐩i=(xi,yi)subscript𝐩𝑖subscript𝑥𝑖subscript𝑦𝑖\mathbf{p}_{i}=(x_{i},y_{i})bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and the j𝑗jitalic_j-th line segment is denoted as 𝐥j=(Aj,Bj,Cj,xj,1,yj,1,xj,2,yj,2)subscript𝐥𝑗subscript𝐴𝑗subscript𝐵𝑗subscript𝐶𝑗subscript𝑥𝑗1subscript𝑦𝑗1subscript𝑥𝑗2subscript𝑦𝑗2\mathbf{l}_{j}=(A_{j},B_{j},C_{j},x_{j,1},y_{j,1},x_{j,2},y_{j,2})bold_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT ), where (Aj,Bj,Cj)subscript𝐴𝑗subscript𝐵𝑗subscript𝐶𝑗(A_{j},B_{j},C_{j})( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) are line parameters of 𝐥jsubscript𝐥𝑗\mathbf{l}_{j}bold_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and (xj,1,yj,1,xj,2,yj,2)subscript𝑥𝑗1subscript𝑦𝑗1subscript𝑥𝑗2subscript𝑦𝑗2(x_{j,1},y_{j,1},x_{j,2},y_{j,2})( italic_x start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT ) are the endpoints. We first compute the distance between 𝐩isubscript𝐩𝑖\mathbf{p}_{i}bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝐥jsubscript𝐥𝑗\mathbf{l}_{j}bold_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT through:

dij=d(𝐩i,𝐥j)=|Ajxi+Bjyi+Cj|Aj2+Bj2.subscript𝑑𝑖𝑗𝑑subscript𝐩𝑖subscript𝐥𝑗subscript𝐴𝑗subscript𝑥𝑖subscript𝐵𝑗subscript𝑦𝑖subscript𝐶𝑗superscriptsubscript𝐴𝑗2superscriptsubscript𝐵𝑗2d_{ij}=d\left(\mathbf{p}_{i},\mathbf{l}_{j}\right)=\frac{\lvert A_{j}\cdot x_{% i}+B_{j}\cdot y_{i}+C_{j}\rvert}{\sqrt{A_{j}^{2}+B_{j}^{2}}}.italic_d start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_d ( bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = divide start_ARG | italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ⋅ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ⋅ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | end_ARG start_ARG square-root start_ARG italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG . (3)

If dij<3subscript𝑑𝑖𝑗3d_{ij}<3italic_d start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT < 3 and the projection of 𝐩isubscript𝐩𝑖\mathbf{p}_{i}bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT on the coordinate axis lies within the projections of line segment endpoints, i.e., min(xj,1,xj,2)ximax(xj,1,xj,2)subscript𝑥𝑗1subscript𝑥𝑗2subscript𝑥𝑖subscript𝑥𝑗1subscript𝑥𝑗2\min({x}_{j,1},{x}_{j,2})\leq x_{i}\leq\max({x}_{j,1},{x}_{j,2})roman_min ( italic_x start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT ) ≤ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ roman_max ( italic_x start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT ) or min(yj,1,yj,2)yimax(yj,1,yj,2)subscript𝑦𝑗1subscript𝑦𝑗2subscript𝑦𝑖subscript𝑦𝑗1subscript𝑦𝑗2\min({y}_{j,1},{y}_{j,2})\leq{y}_{i}\leq\max({y}_{j,1},{y}_{j,2})roman_min ( italic_y start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT ) ≤ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ roman_max ( italic_y start_POSTSUBSCRIPT italic_j , 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j , 2 end_POSTSUBSCRIPT ), we will say 𝐩isubscript𝐩𝑖\mathbf{p}_{i}bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT belongs to 𝐥jsubscript𝐥𝑗\mathbf{l}_{j}bold_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Then the line segments on two images can be matched based on the point-matching result of these two images. For 𝐥k,msubscript𝐥𝑘𝑚{\mathbf{l}_{k,m}}bold_l start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT on image k𝑘kitalic_k and 𝐥k+1,nsubscript𝐥𝑘1𝑛{\mathbf{l}_{k+1,n}}bold_l start_POSTSUBSCRIPT italic_k + 1 , italic_n end_POSTSUBSCRIPT on image k+1𝑘1k+1italic_k + 1, we compute a score Smnsubscript𝑆𝑚𝑛{S}_{mn}italic_S start_POSTSUBSCRIPT italic_m italic_n end_POSTSUBSCRIPT to represent the confidence of that they are the same line:

Smn=Npmmin(Nk,m,Nk+1,n),subscript𝑆𝑚𝑛subscript𝑁𝑝𝑚subscript𝑁𝑘𝑚subscript𝑁𝑘1𝑛{S}_{mn}=\frac{{N}_{pm}}{\min({{N}_{k,m}},{{N}_{k+1,n}})},italic_S start_POSTSUBSCRIPT italic_m italic_n end_POSTSUBSCRIPT = divide start_ARG italic_N start_POSTSUBSCRIPT italic_p italic_m end_POSTSUBSCRIPT end_ARG start_ARG roman_min ( italic_N start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_k + 1 , italic_n end_POSTSUBSCRIPT ) end_ARG , (4)

where Npmsubscript𝑁𝑝𝑚{N}_{pm}italic_N start_POSTSUBSCRIPT italic_p italic_m end_POSTSUBSCRIPT is the matching number between point features belonging to 𝐥k,msubscript𝐥𝑘𝑚{\mathbf{l}_{k,m}}bold_l start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT and point features belonging to 𝐥k+1,nsubscript𝐥𝑘1𝑛{\mathbf{l}_{k+1,n}}bold_l start_POSTSUBSCRIPT italic_k + 1 , italic_n end_POSTSUBSCRIPT. Nk,msubscript𝑁𝑘𝑚{{N}_{k,m}}italic_N start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT and Nk+1,nsubscript𝑁𝑘1𝑛{{N}_{k+1,n}}italic_N start_POSTSUBSCRIPT italic_k + 1 , italic_n end_POSTSUBSCRIPT are the numbers of point features belonging to 𝐥k,msubscript𝐥𝑘𝑚{\mathbf{l}_{k,m}}bold_l start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT and 𝐥k+1,nsubscript𝐥𝑘1𝑛{\mathbf{l}_{k+1,n}}bold_l start_POSTSUBSCRIPT italic_k + 1 , italic_n end_POSTSUBSCRIPT, respectively. Then if Smn>δSsubscript𝑆𝑚𝑛subscript𝛿𝑆{S}_{mn}>\delta_{S}italic_S start_POSTSUBSCRIPT italic_m italic_n end_POSTSUBSCRIPT > italic_δ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT and Npm>δNsubscript𝑁𝑝𝑚subscript𝛿𝑁{N}_{pm}>\delta_{N}italic_N start_POSTSUBSCRIPT italic_p italic_m end_POSTSUBSCRIPT > italic_δ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT, where δSsubscript𝛿𝑆\delta_{S}italic_δ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT and δNsubscript𝛿𝑁\delta_{N}italic_δ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT are two preset thresholds, we will regard 𝐥k,msubscript𝐥𝑘𝑚{\mathbf{l}_{k,m}}bold_l start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT and 𝐥k+1,nsubscript𝐥𝑘1𝑛{\mathbf{l}_{k+1,n}}bold_l start_POSTSUBSCRIPT italic_k + 1 , italic_n end_POSTSUBSCRIPT as the same line. This coupled feature matching method allows our line matching to share the robust performance of keypoint matching while being highly efficient due to that it does not need another line-matching network.

V-C 3D Feature Processing

In this part, we will introduce our 3D feature processing methods, including 3D feature representation, triangulation, i.e., constructing 3D features from 2D features, and re-projection, i.e., projecting 3D features to the image plane. For 3D point processing, a 3D point is denoted as 𝐗3𝐗superscript3\mathbf{X}\in\mathbb{R}^{3}bold_X ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. LightGlue [27] is utilized to match keypoints between the left and right images. Successfully matched keypoints are triangulated using stereo disparity information, while unmatched keypoints are triangulated leveraging multi-frame observations. The projection of 3D points onto the image plane is modeled using either the pinhole camera model or the fisheye camera model, depending on the specific camera in use. We skip the details of 3D point processing in our system as they are similar to other point-based VO and SLAM systems [74, 4]. On the contrary, compared with 3D points, 3D lines have more degrees of freedom, and they are easier to degenerate when being triangulated. Therefore, the 3D line processing will be illustrated in detail.

V-C1 3D Line Representation

We use Plücker coordinates [75] to represent a 3D spatial line:

𝐋=[𝐧𝐯]6,𝐋delimited-[]𝐧𝐯superscript6\mathbf{L}=\left[\begin{array}[]{c}\mathbf{n}\\ \mathbf{v}\end{array}\right]\in\mathbb{R}^{6},bold_L = [ start_ARRAY start_ROW start_CELL bold_n end_CELL end_ROW start_ROW start_CELL bold_v end_CELL end_ROW end_ARRAY ] ∈ blackboard_R start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT , (5)

where 𝐯𝐯\mathbf{v}bold_v is the direction vector of the line and 𝐧𝐧\mathbf{n}bold_n is the normal vector of the plane determined by the line and the origin. Plücker coordinates are used for 3D line triangulation, transformation, and projection. It is over-parameterized because it is a 6-dimensional vector, but a 3D line has only four degrees of freedom. In the graph optimization stage, the extra degrees of freedom will increase the computational cost and cause the numerical instability of the system [76]. Therefore, we also use orthonormal representation [75] to represent a 3D line:

(𝐔,𝐖)SO(3)×SO(2)𝐔𝐖SO(3)SO(2)\left(\mathbf{U},\mathbf{W}\right)\in\text{SO(3)}\times\text{SO(2)}( bold_U , bold_W ) ∈ SO(3) × SO(2) (6)

The relationship between Plücker coordinates and orthonormal representation is similar to SO(3) and so(3). Orthonormal representation can be obtained from Plücker coordinates by:

𝐋=[𝐧𝐯]=[𝐧𝐧𝐯𝐯𝐧×𝐯𝐧×𝐯]𝐔SO(3)[𝐧𝟎𝟎𝐯𝟎𝟎]Σ3×2,\mathbf{L}=\left[\mathbf{n}\mid\mathbf{v}\right]=\underbrace{\left[\begin{% array}[]{ccc}\frac{\mathbf{n}}{\lVert\mathbf{n}\lVert}&\frac{\mathbf{v}}{% \lVert\mathbf{v}\lVert}&\frac{\mathbf{n}\times\mathbf{v}}{\lVert\mathbf{n}% \times\mathbf{v}\lVert}\end{array}\right]}_{\mathbf{U}\in\text{SO(3)}}% \underbrace{\left[\begin{array}[]{cc}\lVert\mathbf{n}\lVert&\mathbf{0}\\ \mathbf{0}&\lVert\mathbf{v}\lVert\\ \mathbf{0}&\mathbf{0}\end{array}\right]}_{\Sigma_{3\times 2}},bold_L = [ bold_n ∣ bold_v ] = under⏟ start_ARG [ start_ARRAY start_ROW start_CELL divide start_ARG bold_n end_ARG start_ARG ∥ bold_n ∥ end_ARG end_CELL start_CELL divide start_ARG bold_v end_ARG start_ARG ∥ bold_v ∥ end_ARG end_CELL start_CELL divide start_ARG bold_n × bold_v end_ARG start_ARG ∥ bold_n × bold_v ∥ end_ARG end_CELL end_ROW end_ARRAY ] end_ARG start_POSTSUBSCRIPT bold_U ∈ SO(3) end_POSTSUBSCRIPT under⏟ start_ARG [ start_ARRAY start_ROW start_CELL ∥ bold_n ∥ end_CELL start_CELL bold_0 end_CELL end_ROW start_ROW start_CELL bold_0 end_CELL start_CELL ∥ bold_v ∥ end_CELL end_ROW start_ROW start_CELL bold_0 end_CELL start_CELL bold_0 end_CELL end_ROW end_ARRAY ] end_ARG start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT 3 × 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , (7)

where Σ3×2subscriptΣ32\Sigma_{3\times 2}roman_Σ start_POSTSUBSCRIPT 3 × 2 end_POSTSUBSCRIPT is a diagonal matrix and its two non-zero entries defined up to scale can be represented by an SO(2) matrix:

𝐖=1𝐧2+𝐯2[𝐧𝐯𝐯𝐧]SO(2).\mathbf{W}=\frac{1}{\sqrt{{\lVert\mathbf{n}\lVert}^{2}+{\lVert\mathbf{v}\lVert% }^{2}}}\left[\begin{array}[]{cc}{\lVert\mathbf{n}\lVert}&-{\lVert\mathbf{v}% \lVert}\\ {\lVert\mathbf{v}\lVert}&{\lVert\mathbf{n}\lVert}\end{array}\right]\in\text{SO% (2)}.bold_W = divide start_ARG 1 end_ARG start_ARG square-root start_ARG ∥ bold_n ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ bold_v ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG [ start_ARRAY start_ROW start_CELL ∥ bold_n ∥ end_CELL start_CELL - ∥ bold_v ∥ end_CELL end_ROW start_ROW start_CELL ∥ bold_v ∥ end_CELL start_CELL ∥ bold_n ∥ end_CELL end_ROW end_ARRAY ] ∈ SO(2) . (8)

In practice, this conversion can be done simply and quickly with the QR decomposition.

V-C2 Triangulation

Triangulation is to initialize a 3D line from two or more 2D line features. In our system, we use two methods to triangulate a 3D line. The first is similar to the line triangulation algorithm B𝐵Bitalic_B in [77], where the pose of a 3D line can be computed from two planes. To achieve this, we select two line segments, 𝐥1subscript𝐥1\mathbf{l}_{1}bold_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝐥2subscript𝐥2\mathbf{l}_{2}bold_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, on two images, which are two observations of a 3D line. Note that the two images can come from the stereo pair of the same keyframe or two different keyframes. 𝐥1subscript𝐥1\mathbf{l}_{1}bold_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝐥2subscript𝐥2\mathbf{l}_{2}bold_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT can be back-projected and construct two 3D planes, π1subscript𝜋1\mathbf{\pi}_{1}italic_π start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and π2subscript𝜋2\mathbf{\pi}_{2}italic_π start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Then the 3D line can be regarded as the intersection of π1subscript𝜋1\mathbf{\pi}_{1}italic_π start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and π2subscript𝜋2\mathbf{\pi}_{2}italic_π start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

However, triangulating a 3D line is more difficult than triangulating a 3D point, because it suffers more from degenerate motions[77]. Therefore, we also employ a second line triangulation method if the above method fails, where points are utilized to compute the 3D line. In Section V-B, we have associated point features with line features. So to initialize a 3D line, two triangulated points 𝐗1subscript𝐗1\mathbf{X}_{1}bold_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝐗2subscript𝐗2\mathbf{X}_{2}bold_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, which belong to this line and have the shortest distance from this line on the image plane are selected. Then the Plücker coordinates of this line can be obtained through:

𝐋=[𝐧𝐯]=[𝐗1×𝐗2𝐗1𝐗2𝐗1𝐗2].\mathbf{L}=\left[\begin{array}[]{c}\mathbf{n}\\ \mathbf{v}\end{array}\right]=\left[\begin{array}[]{c}\mathbf{X}_{1}\times% \mathbf{X}_{2}\\ \frac{\mathbf{X}_{1}-\mathbf{X}_{2}}{\lVert\mathbf{X}_{1}-\mathbf{X}_{2}\lVert% }\end{array}\right].bold_L = [ start_ARRAY start_ROW start_CELL bold_n end_CELL end_ROW start_ROW start_CELL bold_v end_CELL end_ROW end_ARRAY ] = [ start_ARRAY start_ROW start_CELL bold_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × bold_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL divide start_ARG bold_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - bold_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG ∥ bold_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - bold_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ end_ARG end_CELL end_ROW end_ARRAY ] . (9)

This method requires little extra computation because the selected 3D points have been triangulated in the point triangulating stage. It is very efficient and robust.

V-C3 Re-projection

Re-projection is used to compute the re-projection errors. We use Plücker coordinates to transform and re-project 3D lines. First, we convert the 3D line from the world frame to the camera frame:

𝐋c=[𝐧c𝐯c]=[𝐑cw[𝐭cw]×𝐑cw𝟎𝐑cw][𝐧w𝐯w]=𝐇cw𝐋w,subscript𝐋𝑐delimited-[]subscript𝐧𝑐subscript𝐯𝑐delimited-[]subscript𝐑𝑐𝑤subscriptdelimited-[]subscript𝐭𝑐𝑤subscript𝐑𝑐𝑤0subscript𝐑𝑐𝑤delimited-[]subscript𝐧𝑤subscript𝐯𝑤subscript𝐇𝑐𝑤subscript𝐋𝑤\mathbf{L}_{c}=\left[\begin{array}[]{c}\mathbf{n}_{c}\\ \mathbf{v}_{c}\end{array}\right]=\left[\begin{array}[]{cc}{\mathbf{R}}_{cw}&% \left[{\mathbf{t}_{cw}}\right]_{\times}{\mathbf{R}_{cw}}\\ \mathbf{0}&{\mathbf{R}_{cw}}\end{array}\right]\left[\begin{array}[]{c}\mathbf{% n}_{w}\\ \mathbf{v}_{w}\end{array}\right]={\mathbf{H}_{cw}}{\mathbf{L}_{w}},bold_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = [ start_ARRAY start_ROW start_CELL bold_n start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL bold_v start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] = [ start_ARRAY start_ROW start_CELL bold_R start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT end_CELL start_CELL [ bold_t start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT × end_POSTSUBSCRIPT bold_R start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL bold_0 end_CELL start_CELL bold_R start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] [ start_ARRAY start_ROW start_CELL bold_n start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL bold_v start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] = bold_H start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT bold_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , (10)

where 𝐋csubscript𝐋𝑐\mathbf{L}_{c}bold_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and 𝐋wsubscript𝐋𝑤\mathbf{L}_{w}bold_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT are Plücker coordinates of 3D line in the camera frame and world frame, respectively. 𝐑cwSO(3)subscript𝐑𝑐𝑤SO(3){\mathbf{R}_{cw}}\in\text{SO(3)}bold_R start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT ∈ SO(3) is the rotation matrix from world frame to camera frame and 𝐭cw3subscript𝐭𝑐𝑤superscript3{\mathbf{t}_{cw}}\in\mathbb{R}^{3}bold_t start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT is the translation vector. []×subscriptdelimited-[]\left[\cdot\right]_{\times}[ ⋅ ] start_POSTSUBSCRIPT × end_POSTSUBSCRIPT denotes the skew-symmetric matrix of a vector and 𝐇cwsubscript𝐇𝑐𝑤{\mathbf{H}_{cw}}bold_H start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT is the transformation matrix of 3D lines from world frame to camera frame.

Then the 3D line 𝐋csubscript𝐋𝑐\mathbf{L}_{c}bold_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT can be projected to the image plane through a line projection matrix 𝐏csubscript𝐏𝑐{\mathbf{P}_{c}}bold_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT:

𝐥=[ABC]=𝐏c𝐋c[:3]=[fx000fy0fycxfxcyfxfy]𝐧c,𝐥delimited-[]𝐴𝐵𝐶subscript𝐏𝑐subscriptsubscript𝐋𝑐delimited-[]:absent3delimited-[]subscript𝑓𝑥000subscript𝑓𝑦0subscript𝑓𝑦subscript𝑐𝑥subscript𝑓𝑥subscript𝑐𝑦subscript𝑓𝑥subscript𝑓𝑦subscript𝐧𝑐\mathbf{l}=\left[\begin{array}[]{c}A\\ B\\ C\end{array}\right]={\mathbf{P}_{c}}{\mathbf{L}_{c}}_{[:3]}=\left[\begin{array% }[]{ccc}f_{x}&0&0\\ 0&f_{y}&0\\ -f_{y}c_{x}&-f_{x}c_{y}&f_{x}f_{y}\end{array}\right]{\mathbf{n}_{c}},bold_l = [ start_ARRAY start_ROW start_CELL italic_A end_CELL end_ROW start_ROW start_CELL italic_B end_CELL end_ROW start_ROW start_CELL italic_C end_CELL end_ROW end_ARRAY ] = bold_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT bold_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT [ : 3 ] end_POSTSUBSCRIPT = [ start_ARRAY start_ROW start_CELL italic_f start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_f start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - italic_f start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL start_CELL - italic_f start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL start_CELL italic_f start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] bold_n start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , (11)

where 𝐥=[ABC]𝐥superscriptdelimited-[]𝐴𝐵𝐶top\mathbf{l}=\left[\begin{array}[]{ccc}A&B&C\end{array}\right]^{\top}bold_l = [ start_ARRAY start_ROW start_CELL italic_A end_CELL start_CELL italic_B end_CELL start_CELL italic_C end_CELL end_ROW end_ARRAY ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT is the re-projected 2D line on image plane. 𝐋c[:3]subscriptsubscript𝐋𝑐delimited-[]:absent3{\mathbf{L}_{c}}_{[:3]}bold_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUBSCRIPT [ : 3 ] end_POSTSUBSCRIPT donates the first three rows of vector 𝐋csubscript𝐋𝑐\mathbf{L}_{c}bold_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT.

V-D Keyframe Selection

Observing that the learning-based data association method used in our system is able to track two frames that have a large baseline, so different from the frame-by-frame tracking strategy used in other VO or SLAM systems, we only match the current frame with the last keyframe. We argue this strategy can reduce the accumulated tracking error.

Therefore, the keyframe selection is essential for our system. On the one hand, as described in Section V-A, we want to make keyframes sparse to reduce the consumption of computational resources. On the other hand, the sparser the keyframes, the more likely tracking failure happens. To balance the efficiency and the tracking robustness, a frame will be selected as a keyframe if any of the following conditions is satisfied:

  • The tracked features are less than α1Nssubscript𝛼1subscript𝑁𝑠\alpha_{1}\cdot N_{s}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋅ italic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT.

  • The average parallax of tracked features between the current frame and the last keyframe is larger than α2WHsubscript𝛼2𝑊𝐻\alpha_{2}\cdot\sqrt{WH}italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ square-root start_ARG italic_W italic_H end_ARG.

  • The number of tracked features is less than Nkfsubscript𝑁𝑘𝑓N_{kf}italic_N start_POSTSUBSCRIPT italic_k italic_f end_POSTSUBSCRIPT.

In the above, α1subscript𝛼1\alpha_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, α2subscript𝛼2\alpha_{2}italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and Nkfsubscript𝑁𝑘𝑓N_{kf}italic_N start_POSTSUBSCRIPT italic_k italic_f end_POSTSUBSCRIPT are all preset thresholds. Nssubscript𝑁𝑠N_{s}italic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT is the number of detected features. W𝑊Witalic_W and H𝐻Hitalic_H respectively represent the width and height of the input image.

V-E Local Graph Optimization

To improve the accuracy, we perform the local bundle adjustment when a new keyframe is inserted. Nosubscript𝑁𝑜N_{o}italic_N start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT latest neighboring keyframes are selected to construct a local graph, where map points, 3D lines, and keyframes are vertices and pose constraints are edges. We use point constraints and line constraints as well as IMU constraints if an IMU is accessible. Their related error terms are defined as follows.

V-E1 Point Re-projection Error

If the frame i𝑖iitalic_i can observe the 3D map point 𝐗psubscript𝐗𝑝\mathbf{X}_{p}bold_X start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT, then the re-projection error is defined as:

𝐫i,Xp=𝐱~i,pπ(𝐑cw𝐗p+𝐭cw),subscript𝐫𝑖subscript𝑋𝑝subscript~𝐱𝑖𝑝𝜋subscript𝐑𝑐𝑤subscript𝐗𝑝subscript𝐭𝑐𝑤\mathbf{r}_{i,X_{p}}={\tilde{\mathbf{x}}_{i,p}}-\pi\left({\mathbf{R}_{cw}}{% \mathbf{X}_{p}}+{\mathbf{t}_{cw}}\right),bold_r start_POSTSUBSCRIPT italic_i , italic_X start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUBSCRIPT = over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i , italic_p end_POSTSUBSCRIPT - italic_π ( bold_R start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT bold_X start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + bold_t start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT ) , (12)

where 𝐱~i,psubscript~𝐱𝑖𝑝{\tilde{\mathbf{x}}_{i,p}}over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i , italic_p end_POSTSUBSCRIPT is the observation of 𝐗psubscript𝐗𝑝{\mathbf{X}_{p}}bold_X start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT on frame i𝑖iitalic_i and π()𝜋\pi\left(\cdot\right)italic_π ( ⋅ ) represents the camera projection.

V-E2 Line Re-projection Error

If the frame i𝑖iitalic_i can observe the 3D line 𝐋qsubscript𝐋𝑞\mathbf{L}_{q}bold_L start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT, then the re-projection error is defined as:

𝐫i,Lqsubscript𝐫𝑖subscript𝐿𝑞\displaystyle\mathbf{r}_{i,L_{q}}bold_r start_POSTSUBSCRIPT italic_i , italic_L start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUBSCRIPT =el(𝐥~i,q,𝐏c(𝐇cw𝐋q)[:3])2,absentsubscript𝑒𝑙subscript~𝐥𝑖𝑞subscript𝐏𝑐subscriptsubscript𝐇𝑐𝑤subscript𝐋𝑞delimited-[]:absent3superscript2\displaystyle=e_{l}\left({\tilde{\mathbf{l}}_{i,q}},{\mathbf{P}_{c}}\left({{% \mathbf{H}_{cw}}{\mathbf{L}_{q}}}\right)_{[:3]}\right)\in\mathbb{R}^{2},= italic_e start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ( over~ start_ARG bold_l end_ARG start_POSTSUBSCRIPT italic_i , italic_q end_POSTSUBSCRIPT , bold_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( bold_H start_POSTSUBSCRIPT italic_c italic_w end_POSTSUBSCRIPT bold_L start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT [ : 3 ] end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (13a)
el(𝐥~i,q,𝐥i,q)subscript𝑒𝑙subscript~𝐥𝑖𝑞subscript𝐥𝑖𝑞\displaystyle e_{l}\left({\tilde{\mathbf{l}}_{i,q}},\mathbf{l}_{i,q}\right)italic_e start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ( over~ start_ARG bold_l end_ARG start_POSTSUBSCRIPT italic_i , italic_q end_POSTSUBSCRIPT , bold_l start_POSTSUBSCRIPT italic_i , italic_q end_POSTSUBSCRIPT ) =[d(𝐩~i,q1,𝐥i,q)d(𝐩~i,q2,𝐥i,q)],absentsuperscriptdelimited-[]𝑑subscript~𝐩𝑖𝑞1subscript𝐥𝑖𝑞𝑑subscript~𝐩𝑖𝑞2subscript𝐥𝑖𝑞top\displaystyle=\left[\begin{array}[]{cc}d\left(\tilde{\mathbf{p}}_{i,q1},% \mathbf{l}_{i,q}\right)&d\left(\tilde{\mathbf{p}}_{i,q2},\mathbf{l}_{i,q}% \right)\end{array}\right]^{\top},= [ start_ARRAY start_ROW start_CELL italic_d ( over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_i , italic_q 1 end_POSTSUBSCRIPT , bold_l start_POSTSUBSCRIPT italic_i , italic_q end_POSTSUBSCRIPT ) end_CELL start_CELL italic_d ( over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_i , italic_q 2 end_POSTSUBSCRIPT , bold_l start_POSTSUBSCRIPT italic_i , italic_q end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARRAY ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , (13c)

where 𝐥~i,qsubscript~𝐥𝑖𝑞{\tilde{\mathbf{l}}_{i,q}}over~ start_ARG bold_l end_ARG start_POSTSUBSCRIPT italic_i , italic_q end_POSTSUBSCRIPT is the observation of 𝐋qsubscript𝐋𝑞\mathbf{L}_{q}bold_L start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT on frame i𝑖iitalic_i, 𝐩~i,q1subscript~𝐩𝑖𝑞1{\tilde{\mathbf{p}}_{i,q1}}over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_i , italic_q 1 end_POSTSUBSCRIPT and 𝐩~i,q2subscript~𝐩𝑖𝑞2{\tilde{\mathbf{p}}_{i,q2}}over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_i , italic_q 2 end_POSTSUBSCRIPT are the endpoints of 𝐥~i,qsubscript~𝐥𝑖𝑞{\tilde{\mathbf{l}}_{i,q}}over~ start_ARG bold_l end_ARG start_POSTSUBSCRIPT italic_i , italic_q end_POSTSUBSCRIPT, and d(𝐩,𝐥)𝑑𝐩𝐥d\left(\mathbf{p},\mathbf{l}\right)italic_d ( bold_p , bold_l ) is the distance between point 𝐩𝐩\mathbf{p}bold_p and line 𝐥𝐥\mathbf{l}bold_l which is computed through (3).

V-E3 IMU Residuals

We first follow [72] to pre-integrate IMU measurements between the frame i𝑖iitalic_i and the frame j𝑗jitalic_j:

Δ𝐑~ijΔsubscript~𝐑𝑖𝑗\displaystyle\Delta{\mathbf{\tilde{R}}_{ij}}roman_Δ over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT =k=ij1Exp((𝝎~k𝐛kg𝜼kgd)Δt),absentsuperscriptsubscriptproduct𝑘𝑖𝑗1Expsubscriptbold-~𝝎𝑘superscriptsubscript𝐛𝑘𝑔superscriptsubscript𝜼𝑘𝑔𝑑Δ𝑡\displaystyle=\prod\limits_{k=i}^{j-1}\text{Exp}\left(\left(\bm{\tilde{\omega}% }_{k}-\mathbf{b}_{k}^{g}-\bm{\eta}_{k}^{gd}\right)\Delta t\right),= ∏ start_POSTSUBSCRIPT italic_k = italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j - 1 end_POSTSUPERSCRIPT Exp ( ( overbold_~ start_ARG bold_italic_ω end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT - bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g italic_d end_POSTSUPERSCRIPT ) roman_Δ italic_t ) , (14a)
Δ𝐯~ijΔsubscript~𝐯𝑖𝑗\displaystyle\Delta{\mathbf{\tilde{v}}_{ij}}roman_Δ over~ start_ARG bold_v end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT =k=ij1Δ𝐑~ik(𝐚~k𝐛ka𝜼kad)Δt,absentsuperscriptsubscript𝑘𝑖𝑗1Δsubscript~𝐑𝑖𝑘subscript~𝐚𝑘superscriptsubscript𝐛𝑘𝑎superscriptsubscript𝜼𝑘𝑎𝑑Δ𝑡\displaystyle=\sum\limits_{k=i}^{j-1}\Delta{\mathbf{\tilde{R}}_{ik}}\left(% \mathbf{\tilde{a}}_{k}-\mathbf{b}_{k}^{a}-\bm{\eta}_{k}^{ad}\right)\Delta t,= ∑ start_POSTSUBSCRIPT italic_k = italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j - 1 end_POSTSUPERSCRIPT roman_Δ over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ( over~ start_ARG bold_a end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT - bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a italic_d end_POSTSUPERSCRIPT ) roman_Δ italic_t , (14b)
Δ𝐩~ijΔsubscript~𝐩𝑖𝑗\displaystyle\Delta{\mathbf{\tilde{p}}_{ij}}roman_Δ over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT =k=ij1(Δ𝐯~ikΔt+12Δ𝐑~ik(𝐚~k𝐛ka𝜼kad)Δt2),absentsuperscriptsubscript𝑘𝑖𝑗1Δsubscript~𝐯𝑖𝑘Δ𝑡12Δsubscript~𝐑𝑖𝑘subscript~𝐚𝑘superscriptsubscript𝐛𝑘𝑎superscriptsubscript𝜼𝑘𝑎𝑑Δsuperscript𝑡2\displaystyle=\sum\limits_{k=i}^{j-1}\left(\Delta{\mathbf{\tilde{v}}_{ik}}% \Delta t+\frac{1}{2}\Delta{\mathbf{\tilde{R}}_{ik}}\left(\mathbf{\tilde{a}}_{k% }-\mathbf{b}_{k}^{a}-\bm{\eta}_{k}^{ad}\right)\Delta t^{2}\right),= ∑ start_POSTSUBSCRIPT italic_k = italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j - 1 end_POSTSUPERSCRIPT ( roman_Δ over~ start_ARG bold_v end_ARG start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT roman_Δ italic_t + divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_Δ over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ( over~ start_ARG bold_a end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT - bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a italic_d end_POSTSUPERSCRIPT ) roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , (14c)

where 𝝎~𝒌subscriptbold-~𝝎𝒌\bm{\tilde{\omega}_{k}}overbold_~ start_ARG bold_italic_ω end_ARG start_POSTSUBSCRIPT bold_italic_k end_POSTSUBSCRIPT and 𝐚~ksubscript~𝐚𝑘\mathbf{\tilde{a}}_{k}over~ start_ARG bold_a end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT are respectively the angular velocity and the acceleration. 𝐛kgsuperscriptsubscript𝐛𝑘𝑔\mathbf{b}_{k}^{g}bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT and 𝐛kasuperscriptsubscript𝐛𝑘𝑎\mathbf{b}_{k}^{a}bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT are biases of the sensor and they are modeled as constants between two keyframes through 𝐛kg=𝐛k+1gsuperscriptsubscript𝐛𝑘𝑔superscriptsubscript𝐛𝑘1𝑔\mathbf{b}_{k}^{g}=\mathbf{b}_{k+1}^{g}bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = bold_b start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT and 𝐛ka=𝐛k+1asuperscriptsubscript𝐛𝑘𝑎superscriptsubscript𝐛𝑘1𝑎\mathbf{b}_{k}^{a}=\mathbf{b}_{k+1}^{a}bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT = bold_b start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT. 𝜼kgdsuperscriptsubscript𝜼𝑘𝑔𝑑\bm{\eta}_{k}^{gd}bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g italic_d end_POSTSUPERSCRIPT and 𝜼kadsuperscriptsubscript𝜼𝑘𝑎𝑑\bm{\eta}_{k}^{ad}bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a italic_d end_POSTSUPERSCRIPT are Gaussian noises. Then IMU residuals are defined as:

𝐫ΔRijsubscript𝐫Δsubscript𝑅𝑖𝑗\displaystyle\mathbf{r}_{\Delta{{R}_{ij}}}bold_r start_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT =Log((Δ𝐑~ijExp(Δ𝐑ij𝐛gδ𝐛g))𝐑i𝐑j),absentLogsuperscriptΔsubscript~𝐑𝑖𝑗ExpΔsubscript𝐑𝑖𝑗superscript𝐛𝑔𝛿superscript𝐛𝑔topsuperscriptsubscript𝐑𝑖topsubscript𝐑𝑗\displaystyle=\text{Log}\left(\left(\Delta{\mathbf{\tilde{R}}_{ij}}\text{Exp}% \left(\frac{\partial\Delta{\mathbf{R}_{ij}}}{\partial\mathbf{b}^{g}}\delta% \mathbf{b}^{g}\right)\right)^{\top}\mathbf{R}_{i}^{\top}\mathbf{R}_{j}\right),= Log ( ( roman_Δ over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT Exp ( divide start_ARG ∂ roman_Δ bold_R start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG ∂ bold_b start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_ARG italic_δ bold_b start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) , (15a)
𝐫Δvijsubscript𝐫Δsubscript𝑣𝑖𝑗\displaystyle\mathbf{r}_{\Delta{{v}_{ij}}}bold_r start_POSTSUBSCRIPT roman_Δ italic_v start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT =𝐑i(𝐯j𝐯i𝐠Δtij)absentsuperscriptsubscript𝐑𝑖topsubscript𝐯𝑗subscript𝐯𝑖𝐠Δsubscript𝑡𝑖𝑗\displaystyle=\mathbf{R}_{i}^{\top}\left(\mathbf{v}_{j}-\mathbf{v}_{i}-\mathbf% {g}\Delta t_{ij}\right)= bold_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( bold_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - bold_g roman_Δ italic_t start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT )
(Δ𝐯~ij+Δ𝐯ij𝐛gδ𝐛g+Δ𝐯ij𝐛aδ𝐛a),Δsubscript~𝐯𝑖𝑗Δsubscript𝐯𝑖𝑗superscript𝐛𝑔𝛿superscript𝐛𝑔Δsubscript𝐯𝑖𝑗superscript𝐛𝑎𝛿superscript𝐛𝑎\displaystyle-\left(\Delta\tilde{\mathbf{v}}_{ij}+\frac{\partial\Delta{\mathbf% {v}_{ij}}}{\partial\mathbf{b}^{g}}\delta\mathbf{b}^{g}+\frac{\partial\Delta{% \mathbf{v}_{ij}}}{\partial\mathbf{b}^{a}}\delta\mathbf{b}^{a}\right),- ( roman_Δ over~ start_ARG bold_v end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + divide start_ARG ∂ roman_Δ bold_v start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG ∂ bold_b start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_ARG italic_δ bold_b start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT + divide start_ARG ∂ roman_Δ bold_v start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG ∂ bold_b start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT end_ARG italic_δ bold_b start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ) , (15b)
𝐫Δpijsubscript𝐫Δsubscript𝑝𝑖𝑗\displaystyle\mathbf{r}_{\Delta{{p}_{ij}}}bold_r start_POSTSUBSCRIPT roman_Δ italic_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT =𝐑i(𝐩j𝐩i𝐯iΔtij12𝐠Δtij2)absentsuperscriptsubscript𝐑𝑖topsubscript𝐩𝑗subscript𝐩𝑖subscript𝐯𝑖Δsubscript𝑡𝑖𝑗12𝐠Δsuperscriptsubscript𝑡𝑖𝑗2\displaystyle=\mathbf{R}_{i}^{\top}\left(\mathbf{p}_{j}-\mathbf{p}_{i}-\mathbf% {v}_{i}\Delta t_{ij}-\frac{1}{2}\mathbf{g}\Delta t_{ij}^{2}\right)= bold_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_Δ italic_t start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG bold_g roman_Δ italic_t start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT )
(Δ𝐩~ij+Δ𝐩ij𝐛gδ𝐛g+Δ𝐩ij𝐛aδ𝐛a),Δsubscript~𝐩𝑖𝑗Δsubscript𝐩𝑖𝑗superscript𝐛𝑔𝛿superscript𝐛𝑔Δsubscript𝐩𝑖𝑗superscript𝐛𝑎𝛿superscript𝐛𝑎\displaystyle-\left(\Delta\tilde{\mathbf{p}}_{ij}+\frac{\partial\Delta{\mathbf% {p}_{ij}}}{\partial\mathbf{b}^{g}}\delta\mathbf{b}^{g}+\frac{\partial\Delta{% \mathbf{p}_{ij}}}{\partial\mathbf{b}^{a}}\delta\mathbf{b}^{a}\right),- ( roman_Δ over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + divide start_ARG ∂ roman_Δ bold_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG ∂ bold_b start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_ARG italic_δ bold_b start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT + divide start_ARG ∂ roman_Δ bold_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG ∂ bold_b start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT end_ARG italic_δ bold_b start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ) , (15c)
𝐫bijsubscript𝐫subscript𝑏𝑖𝑗\displaystyle\mathbf{r}_{b_{ij}}bold_r start_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT =[(𝐛jg𝐛ig)(𝐛ja𝐛ia)]absentsuperscriptdelimited-[]superscriptsuperscriptsubscript𝐛𝑗𝑔superscriptsubscript𝐛𝑖𝑔topsuperscriptsuperscriptsubscript𝐛𝑗𝑎superscriptsubscript𝐛𝑖𝑎toptop\displaystyle=\left[\begin{array}[]{cc}\left(\mathbf{b}_{j}^{g}-\mathbf{b}_{i}% ^{g}\right)^{\top}&\left(\mathbf{b}_{j}^{a}-\mathbf{b}_{i}^{a}\right)^{\top}% \end{array}\right]^{\top}= [ start_ARRAY start_ROW start_CELL ( bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT - bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL start_CELL ( bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT - bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARRAY ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT (15e)

where 𝐠𝐠\mathbf{g}bold_g is the gravity vector in world coordinates. In our system, we combine the initialization process in [4] and [62] to estimate 𝐠𝐠\mathbf{g}bold_g and initial values of biases.

The factor graph is optimized by the g2o toolbox [78]. The cost function is defined as:

𝐄𝐄\displaystyle\mathbf{E}bold_E =𝐫i,XpΣP2+𝐫i,LqΣL2+𝐫ΔRijΣΔR2\displaystyle=\sum{\lVert\mathbf{r}_{i,X_{p}}\lVert}^{2}_{\Sigma_{P}}+\sum{% \lVert\mathbf{r}_{i,L_{q}}\lVert}^{2}_{\Sigma_{L}}+\sum{\lVert\mathbf{r}_{% \Delta{{R}_{ij}}}\lVert}^{2}_{\Sigma_{\Delta{R}}}= ∑ ∥ bold_r start_POSTSUBSCRIPT italic_i , italic_X start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ ∥ bold_r start_POSTSUBSCRIPT italic_i , italic_L start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ ∥ bold_r start_POSTSUBSCRIPT roman_Δ italic_R start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT roman_Δ italic_R end_POSTSUBSCRIPT end_POSTSUBSCRIPT
+𝐫ΔvijΣΔv2+𝐫ΔpijΣΔp2+𝐫bijΣb2.\displaystyle+\sum{\lVert\mathbf{r}_{\Delta{{v}_{ij}}}\lVert}^{2}_{\Sigma_{% \Delta{v}}}+\sum{\lVert\mathbf{r}_{\Delta{{p}_{ij}}}\lVert}^{2}_{\Sigma_{% \Delta{p}}}+\sum{\lVert\mathbf{r}_{b_{ij}}\lVert}^{2}_{\Sigma_{b}}.+ ∑ ∥ bold_r start_POSTSUBSCRIPT roman_Δ italic_v start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT roman_Δ italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ ∥ bold_r start_POSTSUBSCRIPT roman_Δ italic_p start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT roman_Δ italic_p end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ ∥ bold_r start_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT end_POSTSUBSCRIPT . (16a)

We use the Levenberg-Marquardt optimizer to minimize the cost function. The point and line outliers are also rejected in the optimization if their corresponding residuals are too large.

V-F Initial Map

As described in Section III-B, our map is optimized offline. Therefore, keyframes, map points, and 3D lines will be saved to the disk for subsequent optimization when the visual odometry is finished. For each keyframe, we save its index, pose, keypoints, keypoint descriptors, line features, and junctions. The correspondences between 2D features and 3D features are also recorded. To make the map faster to save, load, and transfer across different devices, the above information is stored in binary form, which also makes the initial map much smaller than the raw data. For example, on the OIVIO dataset [79], our initial map size is only about 2% of the raw data size.

VI Map Optimization and Reuse

VI-A Offline Map Optimization

This part aims to process an initial map generated by our VO module and outputs the optimized map that can be used for drift-free relocalization. Our offline map optimization module consists of the following several map-processing plugins.

VI-A1 Loop Closure Detection

Similar to most current vSLAM systems, we use a coarse-to-fine pipeline to detect loop closures. Our loop closure detection relies on DBoW2 [61] to retrieve candidates and LightGlue [27] to match features. We train a vocabulary for the keypoint detected by our PLNet on a database that contains 35k images. These images are selected from several large datasets [80, 81, 82] that include both indoor and outdoor scenes. The vocabulary has 4 layers, with 10 nodes at each layer, so it contains 10,000 words.

Coarse Candidate Selection: This step aims to find three candidates most similar to a keyframe 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT from a set 𝒮1={𝒦jj<i}subscript𝒮1conditional-setsubscript𝒦𝑗𝑗𝑖\mathcal{S}_{1}=\left\{\mathcal{K}_{j}\mid j<i\right\}caligraphic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = { caligraphic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ italic_j < italic_i }. Note that we do not add keyframes with an index greater than 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to the set because this may miss some loop pairs. We build a co-visibility graph for all keyframes where two are connected if they obverse at last one feature. All keyframes connected with 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT will be first removed from 𝒮1subscript𝒮1\mathcal{S}_{1}caligraphic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. Then we compute a similarity score between 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and each keyframe in 𝒮1subscript𝒮1\mathcal{S}_{1}caligraphic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT using DBoW2. Only keyframes with a score greater than 0.3Smax0.3subscript𝑆𝑚𝑎𝑥0.3\cdot S_{max}0.3 ⋅ italic_S start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT will be kept in 𝒮1subscript𝒮1\mathcal{S}_{1}caligraphic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, where Smaxsubscript𝑆𝑚𝑎𝑥S_{max}italic_S start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT is the maximum computed score. After that, we group the remaining keyframes. If two keyframes can observe more than 10 features in common, they will be in the same group. For each group, we sum up the scores of the keyframes in this group and use it as the group score. Only the top 3 groups with the highest scores will be retained. Then we select one keyframe with the highest score within the group as the candidate from each group. These three candidates will be processed in the subsequent steps.

Fine Feature Matching: For each selected candidate, we match its features with 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Then the relative pose estimation with outlier rejection will be performed. The candidate will form a valid loop pair with 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT if the inliers exceed 50.

VI-A2 Map Merging

A 3D feature observed by both frames of a loop pair is usually mistakenly used as two features. Therefore, in this part, we aim to merge the duplicated point and line features observed by loop pairs. For keypoint features, we use the above feature-matching results between loop pairs. If two matched keypoints are associated with two different map points, they will be regarded as duplicated features and only one map point will be retained. The correspondence between 2D keypoints and 3D map points, as well as the connections in the co-visibility graph, will also be updated.

For line features, we first associate 3D lines and map points through the 2D-3D feature correspondence and 2D point-line association built in Section V-B. Then we detect 3D line pairs that associate with the same map points. If two 3D lines share more than 3 associated map points, they will be regarded as duplicated and only one 3D line will be retained.

VI-A3 Global Bundle Adjustment

We perform the global bundle adjustment (GBA) after merging duplicated features. The form of the residuals and loss functions is similar to that in Section V-E; however, unlike Section V-E, all keyframes and features are jointly optimized, and loop closure residuals are also incorporated into the optimization process. In the initial stage of optimization, the re-projection errors of merged features are relatively large due to the VO drift error, so we first iterate 50 times without outlier rejection to optimize the variables to a good rough position, and then iterate another 40 times with outlier rejection.

We find that when the map is large, the initial 50 iterations can not optimize the variables to a satisfactory position. To address this, we first perform pose graph optimization (PGO) before the global bundle adjustment if a map contains more than 80k map points. Only the keyframe poses will be adjusted in the PGO and the cost function is defined as follows:

𝐄pgo=Log(Δ𝐓~ij1𝐓i1𝐓j)Σij2,\mathbf{E}_{pgo}=\sum{\lVert\text{Log}\left(\Delta\tilde{\mathbf{T}}_{ij}^{-1}% \mathbf{T}_{i}^{-1}\mathbf{T}_{j}\right)\lVert^{2}_{\Sigma_{ij}}},bold_E start_POSTSUBSCRIPT italic_p italic_g italic_o end_POSTSUBSCRIPT = ∑ ∥ Log ( roman_Δ over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_T start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , (17)

where 𝐓iSE(3)subscript𝐓𝑖SE(3)\mathbf{T}_{i}\in\text{SE(3)}bold_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ SE(3) and 𝐓jSE(3)subscript𝐓𝑗SE(3)\mathbf{T}_{j}\in\text{SE(3)}bold_T start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ SE(3) are poses of 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝒦jsubscript𝒦𝑗\mathcal{K}_{j}caligraphic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, respectively. Log()=log():SE(3)se(3):LoglogsuperscriptSE(3)se(3)\text{Log}(\cdot)=\text{log}(\cdot)^{{\vee}}:\text{SE(3)}\to\text{se(3)}Log ( ⋅ ) = log ( ⋅ ) start_POSTSUPERSCRIPT ∨ end_POSTSUPERSCRIPT : SE(3) → se(3) is the Logarithm map proposed in [83]. 𝒦isubscript𝒦𝑖\mathcal{K}_{i}caligraphic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝒦jsubscript𝒦𝑗\mathcal{K}_{j}caligraphic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT should either be adjacent or form a loop pair. After the pose graph optimization, the positions of map points and 3D lines will also be adjusted along with the keyframes in which they are first observed.

The systems with online loop detection usually perform the GBA after detecting a new loop, so they undergo repeated GBAs when a scene contains many loops. In contrast, our offline map optimization module only does the GBA after all loop closures are detected, allowing us to reduce the optimization iterations significantly compared with them.

VI-A4 Scene-Dependent Vocabulary

We train a junction vocabulary aiming to be used for relocalization. The vocabulary is built on the junctions of keyframes in the map so it is scene-dependent. Compared with the keypoint vocabulary trained in Section VI-A1, the database used to train the junction vocabulary is generally much smaller, so we set the number of layers to 3, with 10 nodes in each layer. The junction vocabulary is tiny, i.e., about 1 megabyte, as it only contains 1000 words. Its detailed usage will be introduced in Section VI-B.

VI-A5 Optimized Map

we save the optimized map for subsequent map reuse. Compared with the initial map in Section V-F, more information is saved such as the bag of words for each keyframe, the global co-visibility graph, and the scene-dependent junction vocabulary. In the meantime, the number of 3D features has decreased due to the fusion of duplicate map points and 3D lines. Therefore, the optimized map occupies a similar memory to the initial map.

VI-B Map Reuse

In this part, we present our illumination-robust relocalization using an existing optimized map. In most vSLAM systems, recognizing revisited places typically needs two steps: (1) retrieving Nkcsubscript𝑁𝑘𝑐N_{kc}italic_N start_POSTSUBSCRIPT italic_k italic_c end_POSTSUBSCRIPT keyframe candidates and (2) performing feature matching and estimating relative pose. The second step is usually time-consuming, so selecting a proper Nkcsubscript𝑁𝑘𝑐N_{kc}italic_N start_POSTSUBSCRIPT italic_k italic_c end_POSTSUBSCRIPT is very important. A larger Nkcsubscript𝑁𝑘𝑐N_{kc}italic_N start_POSTSUBSCRIPT italic_k italic_c end_POSTSUBSCRIPT will reduce the system’s efficiency while a smaller Nkcsubscript𝑁𝑘𝑐N_{kc}italic_N start_POSTSUBSCRIPT italic_k italic_c end_POSTSUBSCRIPT may prevent the correct candidate from being recalled. For example, in the loop closing module of ORB-SLAM3 [62], only the three most similar keyframes retrieved by DBoW2 [84] are used for better efficiency. It works well as two frames in a loop pair usually have a short time interval and thus the lighting conditions are relatively similar. But for challenging tasks, such as the day/night relocalization problem, retrieving so few candidates usually results in a low recall rate. However, retrieving more candidates needs to perform feature matching and pose estimation more times for each query frame, which makes it difficult to deploy for real-time applications.

To address this problem, we propose an efficient multi-stage relocalization method to make the optimized map usable in different lighting conditions. Our insight is that if most of the false candidates can be quickly filtered out, then the efficiency can be improved while maintaining or even improving the relocalization recall rate. Therefore, we add another step to the two-step pipeline mentioned above. We next introduce the proposed multi-stage pipeline in detail.

VI-B1 The First Step

This step is to retrieve the similar keyframes in the map that are similar to the query frame. For each input monocular image, we detect keypoints, junctions, and line features using our PLNet. Then a pipeline similar to the “coarse candidate selection” in Section VI-A1 will be executed, but with two differences. The first difference is that we do not filter out candidates using the co-visibility graph as the query frame is not in the graph. The second is that all candidates, not just three, will be retained for the next step.

VI-B2 The Second Step

This step filters out most of the candidates selected in the first step using junctions and line features. For query frame 𝒦qsubscript𝒦𝑞\mathcal{K}_{q}caligraphic_K start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT and each candidate 𝒦bsubscript𝒦𝑏\mathcal{K}_{b}caligraphic_K start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT, we first match their junctions by finding the same words through the junction vocabulary trained in Section VI-A4. We use {(qi,bi)qi𝒦q,bi𝒦b}conditional-setsubscript𝑞𝑖subscript𝑏𝑖formulae-sequencesubscript𝑞𝑖subscript𝒦𝑞subscript𝑏𝑖subscript𝒦𝑏\left\{\left(q_{i},b_{i}\right)\mid q_{i}\in\mathcal{K}_{q},b_{i}\in\mathcal{K% }_{b}\right\}{ ( italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∣ italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_K start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_K start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT } to denote the matching pairs. Then we construct two structure graphs, i.e., GqJsuperscriptsubscript𝐺𝑞𝐽G_{q}^{J}italic_G start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT and GbJsuperscriptsubscript𝐺𝑏𝐽G_{b}^{J}italic_G start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT, for 𝒦qsubscript𝒦𝑞\mathcal{K}_{q}caligraphic_K start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT and 𝒦bsubscript𝒦𝑏\mathcal{K}_{b}caligraphic_K start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT, respectively. The vertices are matched junctions, i.e., VqJ={qiqi𝒦q}superscriptsubscript𝑉𝑞𝐽conditional-setsubscript𝑞𝑖subscript𝑞𝑖subscript𝒦𝑞V_{q}^{J}=\left\{q_{i}\mid q_{i}\in\mathcal{K}_{q}\right\}italic_V start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT = { italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_K start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT } and VbJ={bibi𝒦b}superscriptsubscript𝑉𝑏𝐽conditional-setsubscript𝑏𝑖subscript𝑏𝑖subscript𝒦𝑏V_{b}^{J}=\left\{b_{i}\mid b_{i}\in\mathcal{K}_{b}\right\}italic_V start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT = { italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_K start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT }. The related adjacent matrices that describe the connection between vertices are defined as:

𝐀qJ=[q11q1nqn1qnn],𝐀bJ=[b11b1nbn1bnn],formulae-sequencesuperscriptsubscript𝐀𝑞𝐽delimited-[]subscript𝑞11subscript𝑞1𝑛subscript𝑞𝑛1subscript𝑞𝑛𝑛superscriptsubscript𝐀𝑏𝐽delimited-[]subscript𝑏11subscript𝑏1𝑛subscript𝑏𝑛1subscript𝑏𝑛𝑛\mathbf{A}_{q}^{J}=\left[\begin{array}[]{ccc}q_{11}&\cdots&q_{1n}\\ \vdots&\ddots&\vdots\\ q_{n1}&\cdots&q_{nn}\end{array}\right],\mathbf{A}_{b}^{J}=\left[\begin{array}[% ]{ccc}b_{11}&\cdots&b_{1n}\\ \vdots&\ddots&\vdots\\ b_{n1}&\cdots&b_{nn}\end{array}\right],bold_A start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT = [ start_ARRAY start_ROW start_CELL italic_q start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL italic_q start_POSTSUBSCRIPT 1 italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_q start_POSTSUBSCRIPT italic_n 1 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL italic_q start_POSTSUBSCRIPT italic_n italic_n end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] , bold_A start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT = [ start_ARRAY start_ROW start_CELL italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL italic_b start_POSTSUBSCRIPT 1 italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_b start_POSTSUBSCRIPT italic_n 1 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL italic_b start_POSTSUBSCRIPT italic_n italic_n end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] , (18)

where n𝑛nitalic_n is the number of junction-matching pairs. qijsubscript𝑞𝑖𝑗q_{ij}italic_q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is set to 1 if the junction qisubscript𝑞𝑖q_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and qjsubscript𝑞𝑗q_{j}italic_q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT are two endpoints of the same line, otherwise, it is set to 0. The same goes for bijsubscript𝑏𝑖𝑗b_{ij}italic_b start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT. Then the graph similarity of GqJsuperscriptsubscript𝐺𝑞𝐽G_{q}^{J}italic_G start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT and GbJsuperscriptsubscript𝐺𝑏𝐽G_{b}^{J}italic_G start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT can be computed through:

SqbG=1|qijbij|.superscriptsubscript𝑆𝑞𝑏𝐺1subscript𝑞𝑖𝑗subscript𝑏𝑖𝑗S_{qb}^{G}=\sum{1-|q_{ij}-b_{ij}|}.italic_S start_POSTSUBSCRIPT italic_q italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT = ∑ 1 - | italic_q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT - italic_b start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT | . (19)

We also compute a junction similarity score SqbJsuperscriptsubscript𝑆𝑞𝑏𝐽S_{qb}^{J}italic_S start_POSTSUBSCRIPT italic_q italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT using the junction vocabulary and the DBoW2 algorithm. Finally, the similarity score of 𝒦qsubscript𝒦𝑞\mathcal{K}_{q}caligraphic_K start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT and 𝒦bsubscript𝒦𝑏\mathcal{K}_{b}caligraphic_K start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT is given by combining the keypoint similarity, junction similarity, and structure graph similarity:

Sqb=SqbK+SqbJ(1+SqbGn),subscript𝑆𝑞𝑏superscriptsubscript𝑆𝑞𝑏𝐾superscriptsubscript𝑆𝑞𝑏𝐽1superscriptsubscript𝑆𝑞𝑏𝐺𝑛S_{qb}=S_{qb}^{K}+S_{qb}^{J}\cdot\left(1+\frac{S_{qb}^{G}}{n}\right),italic_S start_POSTSUBSCRIPT italic_q italic_b end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_q italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT + italic_S start_POSTSUBSCRIPT italic_q italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT ⋅ ( 1 + divide start_ARG italic_S start_POSTSUBSCRIPT italic_q italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT end_ARG start_ARG italic_n end_ARG ) , (20)

where SqbKsuperscriptsubscript𝑆𝑞𝑏𝐾S_{qb}^{K}italic_S start_POSTSUBSCRIPT italic_q italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT is the keypoint similarity of 𝒦qsubscript𝒦𝑞\mathcal{K}_{q}caligraphic_K start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT and 𝒦bsubscript𝒦𝑏\mathcal{K}_{b}caligraphic_K start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT computed in the first step. We compute the similarity score with the query frame for each candidate, and only the top 3 candidates with the highest similarity scores will be retained for the next step.

Analysis: We next analyze the second step. In the normal two-step pipeline that uses the DBoW method, only appearance information is used to retrieve candidates. The structural information, i.e., the necessity of the consistent spatial distribution of features between the query frame and candidate, is ignored in the first step and only used in the second step. However, in the illumination-challenging scenes, the structural information is essential as it is invariant to lighting conditions. In our second step, a portion of the structural information is utilized to select candidates. First, our PLNet uses the wireframe-parsing method to detect structural lines, which are more stable in illumination-challenging environments. Second, the similarity computed in (20) utilizes both the appearance information and the structural information. Therefore, our system can achieve good performance in illumination-challenging environments although using the efficient DBoW method.

The second step is also highly efficient. On the one hand, junctions are usually much less than keyponts. In normal scenes, our PLNet can detect more than 400 good keyponts but only about 50 junctions. On the other hand, the junction vocabulary is tiny and only contains 1,000 words. Therefore, matching junctions using DBoW2, constructing junction graphs, and computing similarity scores are all executed very efficiently. The experiment shows that the second step can be done within 0.7ms. More results will be presented in Section VII.

VI-B3 The Third Step

The third step aims to estimate the pose of the query frame. We first use LightGlue to match features between the query frame and the retained candidates. The candidate with the most matching inliers will be selected as the best candidate. Then based on the matching results of the query frame and the best candidate, we can associate the query keypoints with map points. Finally, a PnP problem is solved with RANSAC to estimate the pose. The pose will be considered valid if the inliers exceed 20.

VII Experiments

In this section, we present the experiment results. The remainder of this section is organized as follows. In Section VII-A, we evaluate the line detection performance of the proposed PLNet. In Section VII-B, we evaluate the mapping accuracy of our system by comparing it with other SOTA VO or SLAM systems. In Section VII-C, we test our system in three illumination-challenging scenarios: onboard illumination, dynamic illumination, and low illumination. The comparison of these three scenarios will show the excellent robustness of our system. In Section VII-D, we assess the performance of the proposed map reuse module in addressing the day/night localization challenges, i.e., mapping during the day and relocalization at night. In Section VII-E, we present the ablation study. In Section VII-F, we evaluate the efficiency.

We use two platforms in the experiments. Most evaluations are conducted on a personal computer with an Intel i9-13900 CPU and a NVIDIA GeForce RTX 4080 GPU. In the efficiency experiment in Section VII-F, we also deploy AirSLAM on an NVIDIA Jetson Orin to prove that our system can achieve good accuracy and efficiency on the embedded platform.

VII-A Line Detection

In this section, we evaluate the performance of our PLNet. As described in Section IV-B, we follow SuperPoint [24] to design and train our backbone and keypoint detection module, and we can even use the pre-trained model of SuperPoint, therefore, we do not evaluate the keypoint detection anymore. Instead, we assess the performance of the line detection module by comparing it with SOTA systems, as it is trained with a fixed backbone, which is different from other line detectors.

VII-A1 Datasets and Baseliens

This experiment is conducted on the Wireframe dataset [69] and the YorkUrban dataset [85]. The Wireframe dataset contains 5,000 training images and 462 test images that are all collected in man-made environments. We use them to train and test our PLNet. To validate the generalization ability, we also compare various methods on the YorkUrban dataset, which contains 102 test images. All the training and test images are resized to 512×512512512512\times 512512 × 512. We compare our method with AFM [86], AFM++ [87], L-CNN [67], LETR [88], F-Clip [89], ELSD [90], and HAWPv2 [51].

VII-A2 Evaluation Metrics

We evaluate both the accuracy and efficiency of the line detection. For accuracy, the structural average precision (sAP) [67] is the most challenging metric of the wireframe parsing task. It is inspired by the mean average precision (mAP) commonly used in object detection. A detected line l~=(𝐩~1,𝐩~2)~𝑙subscript~𝐩1subscript~𝐩2\tilde{l}=\left(\tilde{\mathbf{p}}_{1},\tilde{\mathbf{p}}_{2}\right)over~ start_ARG italic_l end_ARG = ( over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) is a True Positive (TP) if and only if it satisfies the following:

min(𝐩1,𝐩2)𝐩1𝐩~12+𝐩2𝐩~22ϑ,subscriptsubscript𝐩1subscript𝐩2superscriptnormsubscript𝐩1subscript~𝐩12superscriptnormsubscript𝐩2subscript~𝐩22italic-ϑ\min_{\left(\mathbf{p}_{1},\mathbf{p}_{2}\right)\in\mathcal{L}}\|\mathbf{p}_{1% }-\tilde{\mathbf{p}}_{1}\|^{2}+\|\mathbf{p}_{2}-\tilde{\mathbf{p}}_{2}\|^{2}% \leq\vartheta,roman_min start_POSTSUBSCRIPT ( bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∈ caligraphic_L end_POSTSUBSCRIPT ∥ bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ bold_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - over~ start_ARG bold_p end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_ϑ , (21)

where \mathcal{L}caligraphic_L is the set of ground truth, and ϑitalic-ϑ\varthetaitalic_ϑ is a predefined threshold. We follow the previous methods to set ϑitalic-ϑ\varthetaitalic_ϑ to 5, 10, and 15, then the corresponding sAP scores are represented by sAP5superscriptsAP5\text{sAP}^{5}sAP start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT, sAP10superscriptsAP10\text{sAP}^{10}sAP start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT, and sAP15superscriptsAP15\text{sAP}^{15}sAP start_POSTSUPERSCRIPT 15 end_POSTSUPERSCRIPT, respectively. For efficiency, we use the frames per second (FPS) to evaluate various systems.

Refer to caption
Figure 7: The line detection comparison between our PLNet (a wireframe parsing method) and SOLD2 (a non-wireframe-parsing method). The red lines are detected line features and the green points are endpoints of lines. Our PLNet aims to detect structural lines while SOLD2 detects more general lines with significant gradients, such as the patterns on the floor and walls.
TABLE I: The comparison of various wireframe parsing methods. The top two results are highlighted and underlined in order.
Methods1 Wireframe Dataset YorkUrban Dataset FPS
sAP5superscriptsAP5\text{sAP}^{5}sAP start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT sAP10superscriptsAP10\text{sAP}^{10}sAP start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT sAP15superscriptsAP15\text{sAP}^{15}sAP start_POSTSUPERSCRIPT 15 end_POSTSUPERSCRIPT sAP5superscriptsAP5\text{sAP}^{5}sAP start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT sAP10superscriptsAP10\text{sAP}^{10}sAP start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT sAP15superscriptsAP15\text{sAP}^{15}sAP start_POSTSUPERSCRIPT 15 end_POSTSUPERSCRIPT
AFM [86] 18.5 24.4 27.5 7.3 9.4 11.1 10.4*
AFM++ [87] 27.7 32.4 34.8 9.5 11.6 13.2 8.0*
L-CNN [67] 59.7 63.6 65.3 25.0 27.1 28.3 29.6
LETR [88] 59.2 65.2 67.7 23.9 27.6 29.7 2.0
F-Clip [89] 64.3 68.3 69.1 28.6 31.0 32.4 82.3
ELSD [90] 64.3 68.9 70.9 27.6 30.2 31.8 42.6*
HAWPv2 [51] 65.7 69.7 71.3 28.9 31.2 32.6 85.2
HAWPv2 [51] 63.6 67.7 69.5 26.6 29.0 30.3 85.2
PLNet (Ours) 65.2 69.2 70.9 29.3 32.0 33.5 79.4
  • 1

    Methods represented using the font are evaluated with color inputs and methods represented using the font are evaluated with grayscale inputs.

  • *

    These numbers are cited from the original paper.

  • The FPS of our PLNet is the speed of detecting both keypoints and lines for the Python implementation.

VII-A3 Results and Analysis

We present the results in Table I. The top-performing results are distinctly highlighted and underlined in order. It can be seen that our PLNet achieves the second-best performance on the Wireframe dataset and the best performance on the YorkUrban dataset. On the Wireframe dataset, HAWPv2, the best method, only outperforms our PLNet by 0.5, 0.5, and 0.4 points in sAP5superscriptsAP5\text{sAP}^{5}sAP start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT, sAP10superscriptsAP10\text{sAP}^{10}sAP start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT, and sAP15superscriptsAP15\text{sAP}^{15}sAP start_POSTSUPERSCRIPT 15 end_POSTSUPERSCRIPT, respectively. On the YorkUrban dataset, our method surpasses the second-best method by 0.4, 0.8, and 0.9 points on these three metrics, respectively. Overall, we can conclude that our PLNet achieves comparable accuracy with SOTA methods.

TABLE II: Translational error (RMSE) on the EuRoC dataset (unit: m), the best results are in bold.
Sensors Features Sequence
M1 S1 I1 P1 L1 MH01 MH02 MH03 MH04 MH05 V101 V102 V103 V201 V202 V203 Avg2
Without Loop DROID-SLAM [18] 0.163 0.121 0.242 0.399 0.270 0.103 0.165 0.158 0.102 0.115 0.204 0.186
VINS-Fusion [4] 0.163 0.178 0.316 0.331 0.175 0.102 0.099 0.112 0.110 0.124 0.252 0.178
Struct-VIO [47] 0.119 0.100 0.283 0.275 0.256 0.075 0.197 0.161 0.081 0.152 0.177 0.171
PLF-VINS [91] 0.143 0.178 0.221 0.240 0.260 0.069 0.099 0.166 0.083 0.125 0.183 0.161
Kimera-VIO [63] 0.110 0.100 0.160 0.240 0.350 0.050 0.080 0.070 0.080 0.100 0.210 0.141
OKVIS [92] 0.197 0.108 0.122 0.138 0.272 0.040 0.067 0.120 0.055 0.150 0.240 0.137
\rowcolorgray!20 AirVIO (Ours) 0.074 0.060 0.114 0.167 0.125 0.033 0.132 0.238 0.036 0.083 0.168 0.113
With Loop iSLAM [19] 0.302 0.460 0.363 0.936 0.478 0.355 0.391 0.301 0.452 0.416 1.133 0.508
UV-SLAM [93] 0.161 0.179 0.176 0.291 0.189 0.077 0.071 0.094 0.078 0.085 0.125 0.139
Kimera [94] 0.090 0.110 0.120 0.160 0.180 0.050 0.060 0.130 0.050 0.070 0.230 0.114
OpenVINS [95] 0.072 0.143 0.086 0.173 0.247 0.055 0.060 0.059 0.054 0.047 0.141 0.103
Structure-PLP-SLAM [96] 0.046 0.056 0.048 0.071 0.071 0.091 0.066 0.065 0.061 0.061 0.166 0.073
VINS-Fusion [4] 0.052 0.040 0.052 0.124 0.088 0.046 0.053 0.108 0.040 0.081 0.098 0.071
Maplab [97] 0.041 0.026 0.045 0.110 0.067 0.039 0.045 0.080 0.053 0.084 0.196 0.071
SP-Loop [98] 0.070 0.044 0.068 0.100 0.090 0.042 0.034 0.082 0.038 0.054 0.100 0.066
PL-SLAM [5] 0.042 0.052 0.040 0.064 0.070 0.042 0.046 0.069 0.061 0.057 0.126 0.061
Basalt [15] 0.080 0.060 0.050 0.100 0.080 0.040 0.020 0.030 0.030 0.020 0.059 0.052
DVI-SLAM [99] 0.042 0.046 0.081 0.072 0.069 0.059 0.034 0.028 0.040 0.039 0.055 0.051
ORB-SLAM3 [62] 0.036 0.033 0.035 0.051 0.082 0.038 0.014 0.024 0.032 0.014 0.024 0.035
DROID-SLAM [18] 0.015 0.013 0.035 0.048 0.040 0.037 0.011 0.020 0.018 0.015 0.017 0.024
\rowcolorgray!20 AirSLAM (Ours) 0.019 0.013 0.025 0.056 0.051 0.032 0.014 0.025 0.014 0.018 0.068 0.030
  • 1

    M denotes the monocular camera, S denotes the stereo camera, I denotes the IMU, P denotes the keypoint feature, and L denotes the line feature.

  • 2

    The average error of the successful sequences.

Generalizability Analysis

We can also conclude that the generalizability of our PLNet is better than other methods. This conclusion is based on two comparative results between our method and HAWPv2, which is the current best wireframe parsing method. First, on the Wireframe dataset, which also serves as the training dataset, HAWPv2 outperforms our PLNet. However, on the YorkUrban dataset, it is surpassed by our method. Second, the previous methods are all evaluated with color inputs in their original paper. Considering that grayscale images are also widely used in vSLAM systems, we train our PLNet with grayscale inputs. We also retrain HAWPv2 and evaluate it using grayscale images for comparison. The result shows that our PLNet significantly outperforms HAWPv2 on both datasets when the inputs are grayscale images. We think the better generalizability comes from our backbone. Other methods are trained on only 5,000 images of the Wireframe dataset, while our backbone is trained on a large diverse dataset, which gives it a stronger feature extraction capability.

Efficiency Analysis

It is worth noting that the FPS of our method in Table I is the speed of detecting both keypoints and lines, while other methods can only output lines. Nevertheless, our PLNet remains one of the fastest methods due to the design of the shared backbone. PLNet processes each image only 0.86ms0.86ms0.86\mathrm{m}\mathrm{s}0.86 roman_ms slower than the fastest algorithm, i.e., HAWPv2.

Note that the selected baselines are all wireframe parsing methods. The non-wireframe-parsing line detection methods, such as SOLD2 [49] and DeepLSD [68], are not added to the comparison as it is unfair to do so. As shown in Fig. 7, the wireframe parsing techniques aim to detect structural lines. They are usually evaluated using the sAP and compared with the ground truth. The non-wireframe-parsing methods can detect more general lines with significant gradients, however, they often detect a long line segment as multiple short line segments, which results in their poor sAP performance.

Refer to caption
Figure 8: Comparison based on the OIVIO dataset. The vertical axis is the proportion of pose errors that are less than the given alignment error threshold on the horizontal axis. Our AirSLAM achieves the most accurate result.
TABLE III: RMSE (m) on the OIVIO dataset, the best results are in bold. F represents tracking failure or large dirft error.
Sequence Kimera PL- Basalt DROID- ORB- Ours
SLAM SLAM SLAM3
MN_015_GV_01 0.169 1.238 0.216 0.286 0.066 0.054
MN_015_GV_02 2.408 0.853 0.153 0.081 0.069 0.052
MN_050_GV_01 F 1.143 0.186 0.173 0.063 0.062
MN_050_GV_02 F 0.921 0.103 0.080 0.053 0.048
MN_100_GV_01 F 0.831 0.197 0.184 0.051 0.064
MN_100_GV_02 2.238 0.609 0.092 0.090 0.063 0.042
TN_015_GV_01 0.300 1.579 0.148 0.188 0.053 0.057
TN_050_GV_01 0.280 1.736 0.521 0.313 0.082 0.065
TN_100_GV_01 0.264 1.312 0.116 0.179 0.086 0.078
Average - 1.358 0.192 0.175 0.065 0.058

VII-B Mapping Accuracy

In this section, we evaluate the mapping accuracy of our system under well-illuminated conditions. The EuRoC dataset [100] is one of the most widely used datasets for vSLAM, so we use it for the accuracy evaluation. We compare our method only with systems capable of estimating the real scale, so the selected baselines are either visual-inertial systems, stereo systems, or those incorporating both. We incorporate traditional methods, learning-based systems, and hybrid systems into the comparison. We use AirVIO to represent our system without loop detection. The root mean square error (RMSE) is used as the metric and computed by the evo [101].

Refer to caption
Figure 9: Our feature detection and matching on a challenging sequence in UMA-VI dataset. The red lines represent detected line features and the colored lines across images indicate feature association. The image may suddenly go dark due to turning off the lights, which is very difficult for vSLAM systems.

The comparison results are presented in Table II. We evaluate the systems with and without loop detection on 11 sequences. For the comparison without loop detection, our method outperforms other VIO methods: we achieve the best results on 8 out of 11 sequences. The average translational error of AirVIO is 20% lower than the second-best system, i.e., Kimera-VIO. For the comparison with loop detection, our system achieves comparable performance with ORB-SLAM3 and DROID-SLAM, which are SOTA traditional and learning-based visual SLAM systems, respectively. Another conclusion that can be drawn from Table II is that loop detection significantly improves the accuracy of our system. The average error of our system decreases by 74% after the loop detection.

VII-C Mapping Robustness

Although many vSLAM systems have achieved impressive accuracy as shown in the previous Section VII-B, complex lighting conditions usually render them ineffective when deployed in real applications. Therefore, in this section, we evaluate the robustness of various vSLAM systems to lighting conditions. We select several representative SOTA systems as baselines. They are ORB-SLAM3 [62], an accurate feature-based system, DROID-SLAM [18], a learning-based hybrid system, Basalt [15], a system that achieves illumination-robust optical flow tracking with the LSSD algorithm, Kimera [94], a direct visual-inertial SLAM system, and OKIVS [92], a system proven to be illumination-robust in our previous work [22]. We test these methods and our system in three scenarios: onboard illumination, dynamic illumination, and low-lighting environments. We first present the evaluation results in Section VII-C1, Section VII-C2, and Section VII-C3, respectively, and then give an overall analysis in Section VII-C4.

TABLE IV: RMSE (m) on the UMA-VI dataset, the best results are in bold. F represents tracking failure or large dirft error.
Sequence PL- ORB- Basalt OKVIS DROID- Ours
SLAM SLAM3 SLAM
conference-csc1 2.697 F 1.270 1.118 0.711 0.490
conference-csc2 1.596 F 0.682 0.470 0.135 0.091
conference-csc3 F 0.426 0.469 0.088 0.724 0.088
lab-module-csc-rev F 0.063 0.486 0.861 0.364 0.504
lab-module-csc F F 0.403 0.579 0.319 0.979
long-walk-eng F F 5.046 3.005 F 1.801
third-floor-csc1 4.478 0.863 0.420 0.287 0.048 0.070
third-floor-csc2 6.068 0.149 0.590 0.271 0.890 0.127
two-floors-csc1 F F 0.760 0.154 0.341 0.066
two-floors-csc2 F F 1.211 0.679 0.299 0.190
Refer to caption
Figure 10: We use the gamma nonlinearity to generate image sequences with low illumination. i𝑖iitalic_i is the brightness level.A𝐴Aitalic_A and γ𝛾\gammaitalic_γ are parameters to control the image brightness. The smaller A and B are, the darker the image.

VII-C1 Onboard Illumination

We utilize the OIVIO dataset [79] to assess the performance of various systems with onboard illumination. The OIVIO dataset collects visual-inertial data in tunnels and mines. In each sequence, the scene is illuminated by an onboard light of approximately 1300, 4500, or 9000 lumens. We used all nine sequences with ground truth acquired by the Leica TCRP1203 R300. As no loop closure exists in the selected sequences, it is fair to compare the VO systems with the SLAM systems. The performance of translational error is presented in Table III. The most accurate results are in bold, and F represents that the tracking is lost for more than 10s10s10\mathrm{s}10 roman_s or the RMSE exceeds 10m10m10\mathrm{m}10 roman_m. It can be seen that our method achieves the most accurate results on 7 out of 9 sequences and the smallest average error. The onboard illumination has almost no impact on our AirSLAM and ORB-SLAM3, however, it reduces the accuracy of OKVIS, Basalt, and PL-SLAM. Kimera suffers a lot from such illumination conditions. It even experiences tracking failures and large drift errors on three sequences.

We show a comparison of our method with selected baselines on the OIVIO TN_100_GV_01 sequence in Fig. 8. In this case, the robot goes through a mine with onboard illumination. The distance is about 150 meters and the average speed is about 0.84m/s. The plot shows the proportion of pose errors on the horizontal axis that are less than the given alignment error threshold on the horizontal axis. Our system achieves a more accurate result than other systems on this sequence.

Refer to caption
Figure 11: The comparison results on the Dark EuRoC dataset. The higher the level of low illumination on the x-axis, the darker the image. Basalt is the most stable system while our system is more accurate.

VII-C2 Dynamic Illumination

The UMA-VI dataset is a visual-inertial dataset gathered in challenging scenarios with handheld custom sensors. We selected sequences with illumination changes to evaluate our system. As shown in Fig. 9, it contains many subsequences where the image suddenly goes dark due to turning off the lights. In the subsequence of the top row in Fig. 9, the total darkness lasted for 0.16 s , and low illumination lasted for 0.64 s . In the subsequence of the bottom row, the total darkness lasted for 0.32 s , and low illumination lasted for 0.4 s . It is more challenging than the OIVIO dataset for vSLAM systems. As the ground-truth poses are only available at the beginning and the end of each sequence, we disabled the loop closure part from all the evaluated methods.

The translational errors are presented in Table IV. The most accurate results are in bold, and F represents that the tracking is lost for more than 10s10s10\mathrm{s}10 roman_s or the RMSE exceeds 10m10m10\mathrm{m}10 roman_m. It can be seen that our AirSLAM outperforms other methods. Our system achieves the best results on 7 out of 10 sequences. The UMA-VI dataset is so challenging that PL-SLAM and ORB-SLAM3 fail on most sequences. Although OKVIS and Basalt, like our system, can complete all the sequences, their accuracy is significantly lower than ours. The average RMSEs of OKVIS and Basalt are around 1.134m1.134m1.134\mathrm{m}1.134 roman_m and 0.724m0.724m0.724\mathrm{m}0.724 roman_m, respectively, while ours is around 0.441m0.441m0.441\mathrm{m}0.441 roman_m, which means our average error is only 62.6% of OKVIS and 38.9% of Basalt.

Refer to caption
Figure 12: Some image pair samples for the mapping and relocalization in the TartanAir Day/Night Localization dataset. Due to the differences in capture viewpoints and scene depths, not all the image pairs have a valid overlap.

VII-C3 Low Illumination

Inspired by [16], we process a publicly available sequence by adjusting the brightness levels of its images. Then the processed sequences are used to evaluate the performance of various SLAM systems in low-illumination conditions. We select the “V2_01_easy” of the EuRoC dataset as the base sequence. The image brightness is adjusted using the gamma nonlinearity:

Vout=AVin1γ,subscript𝑉𝑜𝑢𝑡𝐴superscriptsubscript𝑉𝑖𝑛1𝛾V_{out}=AV_{in}^{\frac{1}{\gamma}},italic_V start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT = italic_A italic_V start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_γ end_ARG end_POSTSUPERSCRIPT , (22)

where Vinsubscript𝑉𝑖𝑛V_{in}italic_V start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT and Voutsubscript𝑉𝑜𝑢𝑡V_{out}italic_V start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT are normalized input and output pixel values, respectively. A𝐴Aitalic_A and γ𝛾\gammaitalic_γ control the maximum brightness and contrast. We set 12 adjustment levels and use Lisubscript𝐿𝑖L_{i}italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to denote the i𝑖iitalic_ith level. L0subscript𝐿0L_{0}italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT represents the original sequence, i.e., A0=1subscript𝐴01A_{0}=1italic_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 1 and γ0=1subscript𝛾01\gamma_{0}=1italic_γ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 1. When i[1,12]𝑖112i\in[1,12]italic_i ∈ [ 1 , 12 ], Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and γisubscript𝛾𝑖\gamma_{i}italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT alternate in descending order to make the image progressively darker. Fig. 10 shows the values of Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and γisubscript𝛾𝑖\gamma_{i}italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and the processed image in each level. We name the processed dataset “Dark EuRoC”.

We present the comparison result in Fig. 11. As the errors of PL-SLAM are much greater than other methods, we do not show its result. Tracking failures and large drift errors, i.e., the RMSE is more than 1 m , are also marked. It can be seen that low illumination has varying degrees of impact on different systems. In this scenario, optical flow-based methods demonstrate greater stability compared to feature-based methods. Nonetheless, our approach achieves comparable performance to Basalt and DROID-SLAM, which utilize sparse and dense optical flow, respectively. The RMSEs of ORB-SLAM3 and OKVIS increase as the brightness decreases. They even experience tracking failures or large drift errors on L10subscript𝐿10L_{10}italic_L start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT and L11subscript𝐿11L_{11}italic_L start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT.

TABLE V: The relocalization comparison on the TartanAir Day/Night Localization dataset, the best results are in bold.
Global Feature1 + FPS2 Recall Rate of Sequences (%)
Matching1 + Local Feature1 P000 P001 P002 P003 P004 P005 P006 P007 P008 P009 P010 P011 Avg
Hloc[9] NV + NN + SOSNet 12.4 4.3 10.6 42.5 10.5 33.9 26.1 7.2 27.2 32.7 6.5 24.9 7.7 19.5
NV + NN + D2-Net 10.34 22.6 20.7 87.4 19.0 20.1 65.9 27.5 49.0 69.0 64.3 71.4 23.4 45.0
NV + NN + R2D2 8.14 15.7 32.3 92.4 22.7 52.7 75.5 14.4 84.9 54.8 67.7 60.7 26.0 50.0
NV + NN + SP 30.2 69.2 32.8 88.7 17.2 53.5 72.6 31.1 83.8 72.6 69.6 89.5 34.0 59.6
NV + AL + SOSNet 6.1 29.0 30.3 55.6 19.6 49.3 42.4 12.7 45.8 38.9 25.2 47.0 11.3 34.0
NV + LG + SIFT 10.8 45.9 31.3 83.0 58.5 52.4 64.7 18.8 78.8 57.2 43.6 79.8 33.6 53.7
NV + LG + DISK 12.9 22.9 35.4 97.9 62.9 60.9 84.2 33.8 89.9 90.0 85.5 74.9 33.4 64.3
NV + LG + SP 20.6 94.7 35.4 98.4 70.9 63.5 85.8 33.8 95.9 97.1 91.4 99.2 49.8 76.3
DIR + LG + SP 19.1 87.2 28.8 99.5 69.8 64.2 85.2 32.8 96.7 97.1 88.1 96.0 50.4 74.7
OpenIBL + LG + SP 21.3 88.6 35.4 97.7 70.3 64.3 86.1 34.7 96.4 91.3 95.7 99.7 45.1 75.4
EP + LG + SP 22.3 99.7 36.4 100.0 68.3 64.5 85.8 34.4 96.5 94.0 85.1 96.9 45.1 75.6
AirSLAM (Ours) 48.8 89.5 78.8 87.8 94.6 88.2 85.6 70.0 72.8 83.4 78.8 81.4 54.5 80.5
  • 1

    NV is NetVLAD, DIR is AP-GeM/DIR, EP is EigenPlace, NN is Nearest Neighbor Matching, AL is AdaLAM, LG is LightGlue, and SP is SuperPoint.

  • 2

    Running time of relocalization measured in Frame Per Second (FPS).

VII-C4 Result Analysis

We think the above three lighting conditions affect a visual system in different ways. The OIVIO dataset collects sequences in dark environments with only onboard illumination, so the light source moves along with the robot, which results in two effects. On the one hand, the lighting is uneven in the environment. The direction the robot is facing and the area closer to the robot is brighter than other areas. The uneven image brightness may lead to the uneven distribution of features. On the other hand, when the robot moves, the lighting of the same area will change, resulting in different brightness in different frames. The assumption of brightness constancy in some systems will be affected in such conditions. The UMA-VI dataset is collected under dynamic lighting conditions, where the dynamic lighting is caused by the sudden switching of lights or moving between indoor and outdoor environments. The image brightness variations in the UMA-VI dataset are much more intense than those in the OIVIO dataset, which may even make the extracted feature descriptor inconsistent in consecutive frames. In low-illumination environments, both the brightness and contrast of captured images are very low, making the vSLAM system more difficult to detect enough good features and extract distinct descriptors.

We summarize the above experiment results with the following conclusions. First, the systems that use descriptors for matching are more robust than the direct methods in illumination-dynamic environments. On the OIVIO dataset, our AirSLAM and ORB-SLAM3 outperform the other systems significantly. On the UMA-VI dataset, our method and OKVIS achieve the best and the second-best results, respectively. This is reasonable as the brightness constancy assumption constrains the direct methods. Despite Basalt uses LSSD to enhance its optical flow tracking, its accuracy still decreases significantly in these two scenarios. Second, the direct methods are more stable in the low illumination environments. This is because descriptor-based SLAM systems rely on enough high-quality features and descriptors, which are difficult to obtain on low brightness and contrast images. The direct methods use corners that are easier to detect, so the low illumination has less impact on them. Third, thanks to the robust feature detection and matching, the illumination robustness of our system is far better than that of other systems. AirSLAM achieves relatively high accuracy in these three illumination-challenging scenarios.

VII-D Map Reuse

VII-D1 Dataset

As mapping and relocalization in the same well-illuminated environment are no longer difficult for many current vSLAM systems, we only evaluate our map reuse module under illumination-challenging conditions, i.e., the day/night localization task. We use the “abandoned_factory” and “abandoned_factory_night” scenes in the TartanAir dataset [60] as they can provide consecutive stereo image sequences for the SLAM mapping and the corresponding accurate ground truth for the evaluation. The images in these two scenes are collected during the day and at night, respectively. We use the sequences in the “abandoned_factory” scene to build maps. Then, for each mapping image, the images with a relative distance of less than 3m3m3\mathrm{m}3 roman_m and a relative angle of less than 15°15arcdegree15\mathrm{\SIUnitSymbolDegree}15 ° from it in the “abandoned_factory_night” scene are selected as query images. We call the generated mapping and relocalization sequences the “TartanAir Day/Night Localization” dataset. Fig. 12 shows some sample pairs for mapping and relocalization. It is worth noting that due to the differences in capture viewpoints and scene depths, the query image selected based on the relative distance and angle may not always have valid overlapping with the mapping images.

VII-D2 Baseline

We have tried several traditional vSLAM systems, e.g., ORB-SLAM3 [62], and SOTA learning-based one-stage relocalization methods, e.g., ACE [102] on the TartanAir Day/Night Localization dataset, and find they perform badly: their relocalization recall rates are below 1%. Therefore, we only present the comparison results of our systems and some VPR methods. The Hloc toolbox [9] uses the structure from motion (SFM) method to build maps and has integrated many image retrieval methods, local feature extractors, and matching methods for localization. We mainly compare our system with these methods. Specifically, the NetVLAD [64], AP-GeM/DIR [103], OpenIBL [104], and EigenPlaces [105] are used to extract global features, the SuperPoint [24], SIFT [106], D2-Net [41], SOSNet [107], R2D2 [39], and DISK [40] are used to extract local features, and the LightGlue [27], AdaLAM [108] and Nearest Neighbor Matching are used to match features. We combine these methods into various “global feature detection + local feature matcher + local feature detection” pipelines for the mapping and relocalization. We do not add DXSLAM[37] to the comparison as it uses NetVLAD and SuperPoint with the binary descriptor, which has been included in the above pipelines.

Refer to caption
Figure 13: A point-line map of the P000 sequence built by our AirSLAM. The red points are mappoints and the blue lines are 3D lines.

VII-D3 Results

To achieve a fair comparison and balance the efficiency and effectiveness, we extract 400 local features and retrieve 3 candidates in the coarse localization stage for all methods. Unlike vSLAM systems that have the keyframe selection mechanism, the SFM mapping optimizes all input images, so it is very slow when mapping with original sequences. Therefore, to accelerate the SFM mapping while ensuring its mapping frames are more than our keyframes, we sample its mapping sequences by selecting one frame every four frames. We show a point-line map built by our AirSLAM in Fig. 13. The relocalization results are presented in Table V. We give the running time (FPS) and the relocalization recall rate of each method. We define a successful relocalization if the estimated pose of the query frame is within 2m2m2\mathrm{m}2 roman_m and 15°15arcdegree15\mathrm{\SIUnitSymbolDegree}15 ° of the ground truth. It can be seen that our AirSLAM outperforms other methods in terms of both efficiency and recall rate. Our system achieves the best results on 5 out of 11 sequences. AirSLAM has an average recall rate 4.2% higher than the second-best algorithm and is about 2.4 times faster than it.

TABLE VI: Ablation study of line features for mapping. We present the RMSE (m) of our system with and without line features.
EuRoC OIVIO UMA-VI Dark-EuRoC
w/o Lines 0.046 0.072 0.545 0.035
Ours 0.030 0.058 0.441 0.029

VII-D4 Analysis

We find that our system is more stable than the VPR methods on the TartanAir Day/Night Localization dataset. On several sequences, e.g., P000, P002, and P010, some VPR methods achieve remarkable results, with recall rates close to 100%. However, on some other sequences, e.g., P001 and P006, their recall rates are less than 40%. In contrast, our system maintains a recall rate of 70% to 90% on most sequences.

To clarify this, we examined each sequence and roughly categorized the images into three types. As shown in Fig. 12, the first type of image is captured with the camera relatively far from the features, so there is a significant overlap between each day/night image pair. Additionally, these images contain distinct buildings and landmarks. The second type of image pair also has a significant overlap. However, the common regions in these image pairs do not contain large buildings and landmarks. The third type of image is captured with the camera very close to the features. Although the camera distance between the day/night image pair is not large, they have almost no overlap, making their local feature matching impossible.

We find that the VPR methods perform very well on the first type of image but perform poorly on the second type of image. Therefore, their recall rates are very low on P001 and P006, which contain more of the second type of images. This may be because their global features are usually trained on datasets that have a lot of distinct buildings and landmarks, which makes them rely more on such semantic cues to retrieve similar images. By contrast, our system is based on the DBoW method, which only utilizes the low-level local features of images, so it achieves similar performance on the first and second types of images. This also proves the strong generalizability of our system. However, neither our system nor VPR methods can process the pairs with little overlap due to relying on local feature matching. Such image pairs are abundant in the P011.

TABLE VII: Relocalization ablation study. The recall rates (%) of our system with and without the structure graph during map reuse.
Seq. N𝒞subscript𝑁𝒞N_{\mathcal{C}}italic_N start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT=3 N𝒞subscript𝑁𝒞N_{\mathcal{C}}italic_N start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT=5 N𝒞subscript𝑁𝒞N_{\mathcal{C}}italic_N start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT=10
w/o G. Ours w/o G. Ours w/o G. Ours
P000 77.9 89.5 84.3 92.5 88.5 94.0
P001 69.2 78.8 79.3 83.8 85.4 87.9
P002 75.4 87.8 80.9 87.8 85.3 88.0
P003 86.4 94.6 92.4 95.5 93.9 95.9
P004 81.5 88.2 85.3 90.9 89.0 92.1
P005 78.4 85.6 82.9 88.8 88.4 91.5
P006 62.4 70.0 72.3 72.7 78.6 77.2
P007 60.6 72.8 63.7 73.7 69.0 76.2
P008 77.2 83.4 80.0 84.4 81.7 85.8
P009 63.4 78.8 70.6 81.7 76.9 84.8
P010 70.7 81.4 75.4 84.0 80.9 86.9
P011 48.1 54.5 55.1 58.9 62.0 64.0
Avg. 70.9 80.5 76.9 82.9 81.6 85.4

VII-E Ablation Study

VII-E1 Line Feature In Mapping

In this section, we evaluate the impact of line features on mapping performance. To this end, we remove line features from our system and compute the RMSE on the EuRoC, OIVIO, UMA-VI, and Dark-EuRoC datasets. The results are summarized in Table VI, where “w/o Lines” refers to our system without line features. As shown in the table, incorporating line features reduces the RMSE by 34.8%, 19.4%, 19.1%, and 17.1% on the four datasets, respectively. These results demonstrate that line features substantially enhance our system and improve the mapping accuracy. We believe this improvement comes from two aspects. On the one hand, a line feature is more likely to be observed across multiple frames compared to a point feature, thereby constraining more frames during pose optimization. This consistent constraint across multiple frames enhances the accuracy of pose estimation. On the other hand, we use the tracking results of a set of points to track a line feature, which makes the tracking of a single line feature more stable than that of a single point.

VII-E2 Structure Graph In Relocalization

We also verify the effectiveness of the proposed relocalization method. This experiment is conducted on the TartanAir Day/Night Localization dataset. We compare the systems with and without the second step proposed in Section VI-B2. The results are presented in Table VII, where “w/o G.” denotes our system without the structure graph, and N𝒞subscript𝑁𝒞N_{\mathcal{C}}italic_N start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT denotes the candidate number for local feature matching. It shows that using junctions, line features, and structure graphs to filter out relocalization candidates significantly improves recall rates. AirSLAM outperforms w/o G. across all sequences, and when N𝒞subscript𝑁𝒞N_{\mathcal{C}}italic_N start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT is 3, 5, and 10, the average improvements are 9.6%, 6.0%, and 3.8%, respectively, which demonstrates the effectiveness of the proposed method.

VII-F Efficiency Analysis

Efficiency is essential for robotics applications, so we also evaluate the efficiency of the proposed system. We first compare the running time of our AirSLAM with several SOTA VO and SLAM systems on a computer with an Intel i9-13900 CPU and an NVIDIA RTX 4080 GPU. Then we deploy AirSLAM on an NVIDIA Jetson AGX Orin to verify the efficiency and performance of our system on the embedded platform.

VII-F1 Odometry Efficiency

The VO/VIO efficiency experiment is conducted on the MH_01_easy sequence of the EuRoC dataset. We compare our AirSLAM with several SOTA systems. The loop detection and GBA are disabled from all the systems for a fair comparison. The metrics are the runtime per frame and the CPU usage. The results are presented in Fig. 14a, where 100% CPU usage means 1 CPU core is utilized. It should be additionally noted that DROID-SLAM actually uses 32 CPU cores, and its CPU usage in Fig. 14a is only for a compact presentation. Our system is very efficient, achieving a rate of 73 FPS. In addition, due to extracting and matching features using the GPU, our system requires relatively less CPU resources. We also test the GPU usage. It shows that DROID-SLAM requires about 8GB GPU memory, while our AirSLAM only requires around 3GB.

To investigate the contributions of the CPU and GPU to the efficiency of our system, we measured the total runtime of modules executed on the GPU and CPU using the MH_01_easy sequence and compared it with ORB-SLAM3. The loop detection and GBA are disabled from both systems. The results are presented in Table VIII. It can be seen that our feature detection and matching are highly efficient due to the use of the GPU resources and our system design, which processes only the left image in non-keyframes. Note that we perform the initial pose estimation in the back-end, whereas ORB-SLAM3 includes this process in the front-end. This leads to the differences in the runtime of the “Others” and “Back-end” components between the two systems.

Refer to caption
(a) Odometry Efficiency.
Refer to caption
(b) Mapping Efficiency.
Figure 14: For odometry efficiency, we disable the loop detection from all the methods. The GPU usage of systems using GPU is also indicated. For mapping efficiency, we compute the total mapping time and compare it with Hloc.
TABLE VIII: Evaluation of CPU and GPU contributions to our system.
Front-end Back-end
FD+FM1 Others
ORB-SLAM3 \cellcolorgreen!2032.9 s \cellcolorgreen!2025.2 s \cellcolorgreen!2039.6 s
Ours \cellcolorred!2017.5 s \cellcolorgreen!208.4 s \cellcolorgreen!2049.3 s
  • 1

    Total runtime for feature detection and feature matching.   and   represent the modules running on the GPU and CPU, respectively.

VII-F2 Mapping Efficiency

We also evaluate the mapping time, i.e., the total runtime for building the initial map and offline optimizing the map. As we compare our system with Hloc using the TartanAir dataset in the map reuse experiment, we use the same baseline and dataset in this experiment. The average mapping time per frame may differ when the map size varies, therefore, we measure the mapping time with different numbers of input images. The results are presented in Fig. 14b, where n×n\timesitalic_n × means our system is n𝑛nitalic_n times faster than Hloc. It can be seen that our system is much more efficient than Hloc, especially as the input images increase. Besides, Hloc can only use monocular images to build a map without the real scale, and the map only contains point features, while our system can build the point-line map and estimate the real scale using a stereo camera and an IMU. Therefore, our system is more stable and practical for robotics applications than Hloc.

VII-F3 Embedded Platform

Refer to caption
Figure 15: Accuracy comparison of our system on an NVIDIA Jetson Orin and a PC. The vertical axis represents ATE in meters.
TABLE IX: Efficiency comparison of our system on two platforms.
Platform Runtime CPU GPU
VIO (FPS) Optim1. ( s ) Usage (%) Usage (MB)
Ours-Jetson 40.3 57.8 224.7 989
Ours-PC 73.1 55.5 217.8 3076
  • 1

    The runtime of the offline map optimization.

We use 8 sequences in the EuRoC dataset to evaluate the efficiency of AirSLAM on the embedded platform. The suffixes, i.e., “-Jetson” and “-PC”, are added to distinguish results on different platforms. On the Jetson, we modify three parameters in our system to improve efficiency. First, we reduced the number of detected keypoints from 350 to 300. Second, we change two parameters in Section V-D to make keyframes sparser, i.e., α1subscript𝛼1\alpha_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and α2subscript𝛼2\alpha_{2}italic_α start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are changed from 0.65 and 0.1 to 0.5 and 0.2, respectively. The other parameters are the same on these two platforms. The comparisons of efficiency and absolute trajectory error (ATE) are presented in Table IX and Fig. 15, respectively. Our AirSLAM can run at a rate of 40Hz40Hz40\mathrm{Hz}40 roman_H roman_z on the Jetson while only consuming 2 CPU cores and 989MB GPU memory. We find the runtime of the offline map optimization is very close on these two platforms. This is because AirSLAM-Jetson selects fewer keyframes than AirSLAM-PC, so the loop closure and GBA are faster.

VIII Conclusion

In this work, we presented an efficient and illumination-robust hybrid vSLAM system. To be robust to challenging illumination, the proposed system employs a CNN to detect both keypoints and structural lines. Then these two features are associated and tracked using a GNN. To enhance the efficiency, we proposed PLNet, a unified model capable of simultaneously detecting both point and line features. Furthermore, a multi-stage relocalization method based on both appearance and geometry information was proposed for efficient map reuse. We designed the system with an architecture that includes online mapping, offline optimization, and online relocalization, making it easier to deploy on real robots. Extensive experiments show that the proposed system outperforms other SOTA vSLAM systems in terms of accuracy, efficiency, and robustness in illumination-challenging environments.

Despite its remarkable performance, the proposed system still has limitations. Like other point-line-based SLAM systems, our AirSLAM relies on enough line features, so it is best to apply it to man-made environments. This is because our system was originally designed for warehouse robots. In unstructured environments, the system degrades into a point-only system.

References

  • [1] A. Macario Barros, M. Michel, Y. Moline, G. Corre, and F. Carrel, “A comprehensive survey of visual slam algorithms,” Robotics, vol. 11, no. 1, p. 24, 2022.
  • [2] S. Yuan, H. Wang, and L. Xie, “Survey on localization systems and algorithms for unmanned systems,” Unmanned Systems, vol. 9, no. 02, pp. 129–163, 2021.
  • [3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [4] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
  • [5] R. Gomez-Ojeda, F.-A. Moreno, D. Zuniga-Noël, D. Scaramuzza, and J. Gonzalez-Jimenez, “Pl-slam: A stereo slam system through the combination of points and line segments,” IEEE Transactions on Robotics, vol. 35, no. 3, pp. 734–746, 2019.
  • [6] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
  • [7] D. Zuñiga-Noël, A. Jaenal, R. Gomez-Ojeda, and J. Gonzalez-Jimenez, “The uma-vi dataset: Visual–inertial odometry in low-textured and dynamic illumination environments,” The International Journal of Robotics Research, vol. 39, no. 9, pp. 1052–1060, 2020.
  • [8] A. Savinykh, M. Kurenkov, E. Kruzhkov, E. Yudin, A. Potapov, P. Karpyshev, and D. Tsetserukou, “Darkslam: Gan-assisted visual slam for reliable operation in low-light conditions,” in 2022 IEEE 95th Vehicular Technology Conference:(VTC2022-Spring).   IEEE, 2022, pp. 1–6.
  • [9] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: Robust hierarchical localization at large scale,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 716–12 725.
  • [10] C. Toft, W. Maddern, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, T. Pajdla et al., “Long-term visual localization revisited,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 2074–2088, 2020.
  • [11] Q. Gu, P. Liu, J. Zhou, X. Peng, and Y. Zhang, “Drms: Dim-light robust monocular simultaneous localization and mapping,” in 2021 International Conference on Computer, Control and Robotics (ICCCR).   IEEE, 2021, pp. 267–271.
  • [12] L. Yu, E. Yang, and B. Yang, “Afe-orb-slam: robust monocular vslam based on adaptive fast threshold and image enhancement for complex lighting environments,” Journal of Intelligent & Robotic Systems, vol. 105, no. 2, pp. 1–14, 2022.
  • [13] R. Gomez-Ojeda, Z. Zhang, J. Gonzalez-Jimenez, and D. Scaramuzza, “Learning-based image enhancement for visual odometry in challenging hdr environments,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 805–811.
  • [14] G. G. Scandaroli, M. Meilland, and R. Richa, “Improving ncc-based direct visual tracking,” in European conference on Computer Vision.   Springer, 2012, pp. 442–455.
  • [15] V. Usenko, N. Demmel, D. Schubert, J. Stückler, and D. Cremers, “Visual-inertial mapping with non-linear factor recovery,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 422–429, 2019.
  • [16] S. Park, T. Schöps, and M. Pollefeys, “Illumination change robustness in direct visual slam,” in 2017 IEEE international conference on robotics and automation (ICRA).   IEEE, 2017, pp. 4523–4530.
  • [17] W. Wang, Y. Hu, and S. Scherer, “Tartanvo: A generalizable learning-based vo,” in Conference on Robot Learning.   PMLR, 2021, pp. 1761–1772.
  • [18] Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,” Advances in neural information processing systems, vol. 34, pp. 16 558–16 569, 2021.
  • [19] T. Fu, S. Su, Y. Lu, and C. Wang, “iSLAM: Imperative SLAM,” IEEE Robotics and Automation Letters (RA-L), 2024. [Online]. Available: https://arxiv.org/pdf/2306.07894.pdf
  • [20] M. Labbé and F. Michaud, “Multi-session visual slam for illumination-invariant re-localization in indoor environments,” Frontiers in Robotics and AI, vol. 9, p. 801886, 2022.
  • [21] ——, “Rtab-map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation,” Journal of field robotics, vol. 36, no. 2, pp. 416–446, 2019.
  • [22] K. Xu, Y. Hao, S. Yuan, C. Wang, and L. Xie, “Airvo: An illumination-robust point-line visual odometry,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 3429–3436.
  • [23] S. Kannapiran, N. Bendapudi, M.-Y. Yu, D. Parikh, S. Berman, A. Vora, and G. Pandey, “Stereo visual odometry with deep learning-based point and line feature matching using an attention graph neural network,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 3491–3498.
  • [24] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236.
  • [25] R. G. Von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “Lsd: A line segment detector,” Image Processing On Line, vol. 2, pp. 35–55, 2012.
  • [26] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
  • [27] P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “Lightglue: Local feature matching at light speed,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 627–17 638.
  • [28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to sift or surf,” in 2011 International conference on computer vision.   IEEE, 2011, pp. 2564–2571.
  • [29] D. G. Viswanathan, “Features from accelerated segment test (fast),” in Proceedings of the 10th workshop on image analysis for multimedia interactive services, London, UK, 2009, pp. 6–8.
  • [30] S. Leutenegger, M. Chli, and R. Y. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 International conference on computer vision.   IEEE, 2011, pp. 2548–2555.
  • [31] R. Kang, J. Shi, X. Li, Y. Liu, and X. Liu, “Df-slam: A deep-learning enhanced visual slam system based on deep local features,” arXiv preprint arXiv:1901.07223, 2019.
  • [32] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, “Learning local feature descriptors with triplets and shallow convolutional neural networks.” in Bmvc, vol. 1, no. 2, 2016, p. 3.
  • [33] J. Tang, L. Ericson, J. Folkesson, and P. Jensfelt, “Gcnv2: Efficient correspondence prediction for real-time slam,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3505–3512, 2019.
  • [34] X. Han, Y. Tao, Z. Li, R. Cen, and F. Xue, “Superpointvo: A lightweight visual odometry based on cnn feature extraction,” in 2020 5th International Conference on Automation, Control and Robotics Engineering (CACRE).   IEEE, 2020, pp. 685–691.
  • [35] H. M. S. Bruno and E. L. Colombini, “Lift-slam: A deep-learning feature-based monocular visual slam method,” Neurocomputing, vol. 455, pp. 97–110, 2021.
  • [36] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “Lift: Learned invariant feature transform,” in European conference on computer vision.   Springer, 2016, pp. 467–483.
  • [37] D. Li, X. Shi, Q. Long, S. Liu, W. Yang, F. Wang, Q. Wei, and F. Qiao, “Dxslam: A robust and efficient visual slam system with deep features,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 4958–4965.
  • [38] Z. Zhan, D. Gao, Y.-J. Lin, Y. Xia, and C. Wang, “iMatching: Imperative correspondence learning,” in European Conference on Computer Vision (ECCV), 2024. [Online]. Available: https://arxiv.org/abs/2312.02141
  • [39] J. Revaud, C. De Souza, M. Humenberger, and P. Weinzaepfel, “R2d2: Reliable and repeatable detector and descriptor,” Advances in neural information processing systems, vol. 32, 2019.
  • [40] M. Tyszkiewicz, P. Fua, and E. Trulls, “Disk: Learning local features with policy gradient,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 254–14 265, 2020.
  • [41] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” in Proceedings of the ieee/cvf conference on computer vision and pattern recognition, 2019, pp. 8092–8101.
  • [42] H. Hu, L. Sackewitz, and M. Lauer, “Joint learning of feature detector and descriptor for visual slam,” in 2021 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2021, pp. 928–933.
  • [43] C. Anagnostopoulos, A. S. Lalos, P. Kapsalas, D. Van Nguyen, and C. Stylios, “Reviewing deep learning-based feature extractors in a novel automotive slam framework,” in 2023 31st Mediterranean Conference on Control and Automation (MED).   IEEE, 2023, pp. 107–112.
  • [44] C. Akinlar and C. Topal, “Edlines: A real-time line segment detector with a false detection control,” Pattern Recognition Letters, vol. 32, no. 13, pp. 1633–1642, 2011.
  • [45] Y. He, J. Zhao, Y. Guo, W. He, and K. Yuan, “Pl-vio: Tightly-coupled monocular visual–inertial odometry using point and line features,” Sensors, vol. 18, no. 4, p. 1159, 2018.
  • [46] L. Zhou, G. Huang, Y. Mao, S. Wang, and M. Kaess, “Edplvo: Efficient direct point-line visual odometry,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 7559–7565.
  • [47] D. Zou, Y. Wu, L. Pei, H. Ling, and W. Yu, “Structvio: Visual-inertial odometry with structural regularity of man-made environments,” IEEE Transactions on Robotics, vol. 35, no. 4, pp. 999–1013, 2019.
  • [48] Q. Chen, Y. Cao, J. Hou, G. Li, S. Qiu, B. Chen, X. Xue, H. Lu, and J. Pu, “Vpl-slam: A vertical line supported point line monocular slam system,” IEEE Transactions on Intelligent Transportation Systems, 2024.
  • [49] R. Pautrat*, J.-T. Lin*, V. Larsson, M. R. Oswald, and M. Pollefeys, “Sold2: Self-supervised occlusion-aware line description and detection,” in Computer Vision and Pattern Recognition (CVPR), 2021.
  • [50] X. Lin and C. Wang, “AirLine: Efficient learnable line detection with local edge voting,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023. [Online]. Available: https://arxiv.org/abs/2303.16500
  • [51] N. Xue, T. Wu, S. Bai, F.-D. Wang, G.-S. Xia, L. Zhang, and P. H. Torr, “Holistically-attracted wireframe parsing: From supervised to self-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [52] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
  • [53] A. Crivellaro and V. Lepetit, “Robust 3d tracking with descriptor fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3414–3421.
  • [54] J. Huang and S. Liu, “Robust simultaneous localization and mapping in low-light environment,” Computer Animation and Virtual Worlds, vol. 30, no. 3-4, p. e1895, 2019.
  • [55] P. Kim, H. Lee, and H. J. Kim, “Autonomous flight with robust visual odometry under dynamic lighting conditions,” Autonomous Robots, vol. 43, no. 6, pp. 1605–1622, 2019.
  • [56] Z. Chen and C. Heckman, “Robust pose estimation based on normalized information distance,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 2217–2223.
  • [57] H. Alismail, M. Kaess, B. Browning, and S. Lucey, “Direct visual odometry in low light using binary descriptors,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 444–451, 2016.
  • [58] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” IEEE signal processing magazine, vol. 35, no. 1, pp. 53–65, 2018.
  • [59] S. Pratap Singh, B. Mazotti, S. Mayilvahanan, G. Li, D. Manish Rajani, and M. Ghaffari, “Twilight slam: A comparative study of low-light visual slam pipelines,” in arXiv preprint arXiv:2304.11310, 2023. [Online]. Available: https://arxiv.org/abs/2304.11310
  • [60] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 4909–4916.
  • [61] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
  • [62] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
  • [63] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, “Kimera: an open-source library for real-time metric-semantic localization and mapping,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 1689–1696.
  • [64] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307.
  • [65] N. Keetha, A. Mishra, J. Karhade, K. M. Jatavallabhula, S. Scherer, M. Krishna, and S. Garg, “Anyloc: Towards universal visual place recognition,” IEEE Robotics and Automation Letters, 2023.
  • [66] S. Yan, Y. Liu, L. Wang, Z. Shen, Z. Peng, H. Liu, M. Zhang, G. Zhang, and X. Zhou, “Long-term visual localization with mobile sensors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 245–17 255.
  • [67] Y. Zhou, H. Qi, and Y. Ma, “End-to-end wireframe parsing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 962–971.
  • [68] R. Pautrat, D. Barath, V. Larsson, M. R. Oswald, and M. Pollefeys, “Deeplsd: Line segment detection and refinement with deep image gradients,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 327–17 336.
  • [69] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, “Learning to parse wireframes in images of man-made environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 626–635.
  • [70] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18.   Springer, 2015, pp. 234–241.
  • [71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), 2015. [Online]. Available: https://arxiv.org/abs/1412.6980
  • [72] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual–inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2016.
  • [73] L. Zhang and R. Koch, “An efficient and robust line segment matching approach based on lbd descriptor and pairwise geometric consistency,” Journal of visual communication and image representation, vol. 24, no. 7, pp. 794–805, 2013.
  • [74] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [75] A. Bartoli and P. Sturm, “Structure-from-motion using lines: Representation, triangulation, and bundle adjustment,” Computer vision and image understanding, vol. 100, no. 3, pp. 416–441, 2005.
  • [76] X. Zuo, X. Xie, Y. Liu, and G. Huang, “Robust visual slam with point and line features,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 1775–1782.
  • [77] Y. Yang, P. Geneva, K. Eckenhoff, and G. Huang, “Visual-inertial odometry with point and line features,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 2447–2454.
  • [78] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g 2 o: A general framework for graph optimization,” in 2011 IEEE international conference on robotics and automation.   IEEE, 2011, pp. 3607–3613.
  • [79] M. Kasper, S. McGuire, and C. Heckman, “A benchmark for visual-inertial odometry systems employing onboard illumination,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 5256–5263.
  • [80] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi, “Visual place recognition with repetitive structures,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 883–890.
  • [81] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii, “Inloc: Indoor visual localization with dense matching and view synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7199–7209.
  • [82] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1808–1817.
  • [83] Y. Wang and G. S. Chirikjian, “Nonparametric second-order theory of error propagation on motion groups,” The International journal of robotics research, vol. 27, no. 11-12, pp. 1258–1273, 2008.
  • [84] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, October 2012.
  • [85] P. Denis, J. H. Elder, and F. J. Estrada, “Efficient edge-based methods for estimating manhattan frames in urban imagery,” in Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part II 10.   Springer, 2008, pp. 197–210.
  • [86] N. Xue, S. Bai, F. Wang, G.-S. Xia, T. Wu, and L. Zhang, “Learning attraction field representation for robust line segment detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1595–1603.
  • [87] N. Xue, S. Bai, F.-D. Wang, G.-S. Xia, T. Wu, L. Zhang, and P. H. Torr, “Learning regional attraction for line segment detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 6, pp. 1998–2013, 2019.
  • [88] Y. Xu, W. Xu, D. Cheung, and Z. Tu, “Line segment detection using transformers without edges,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4257–4266.
  • [89] X. Dai, H. Gong, S. Wu, X. Yuan, and Y. Ma, “Fully convolutional line parsing,” Neurocomputing, vol. 506, pp. 1–11, 2022.
  • [90] H. Zhang, Y. Luo, F. Qin, Y. He, and X. Liu, “Elsd: Efficient line segment detector and descriptor,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2969–2978.
  • [91] J. Lee and S.-Y. Park, “Plf-vins: Real-time monocular visual-inertial slam with point-line fusion and parallel-line fusion,” IEEE Robotics and Automation Letters, 2021.
  • [92] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
  • [93] H. Lim, J. Jeon, and H. Myung, “Uv-slam: Unconstrained line-based slam using vanishing points for structural mapping,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1518–1525, 2022.
  • [94] A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: From slam to spatial perception with 3d dynamic scene graphs,” The International Journal of Robotics Research, vol. 40, no. 12-14, pp. 1510–1546, 2021.
  • [95] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “Openvins: A research platform for visual-inertial estimation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 4666–4672.
  • [96] F. Shu, J. Wang, A. Pagani, and D. Stricker, “Structure plp-slam: Efficient sparse mapping and localization using point, line and plane for monocular, rgb-d and stereo cameras,” in 2023 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2023, pp. 2105–2112.
  • [97] A. Cramariuc, L. Bernreiter, F. Tschopp, M. Fehr, V. Reijgwart, J. Nieto, R. Siegwart, and C. Cadena, “maplab 2.0–a modular and multi-modal mapping framework,” IEEE Robotics and Automation Letters, 2022.
  • [98] Y. Wang, B. Xu, W. Fan, and C. Xiang, “A robust and efficient loop closure detection approach for hybrid ground/aerial vehicles,” Drones, vol. 7, no. 2, p. 135, 2023.
  • [99] X. Peng, Z. Liu, W. Li, P. Tan, S. Y. Cho, and Q. Wang, “Dvi-slam: A dual visual inertial slam network,” in 2024 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2024, pp. 12 020–12 026.
  • [100] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
  • [101] M. Grupp, “evo: Python package for the evaluation of odometry and slam.” https://github.com/MichaelGrupp/evo, 2017.
  • [102] E. Brachmann, T. Cavallari, and V. A. Prisacariu, “Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5044–5053.
  • [103] J. Revaud, J. Almazán, R. S. Rezende, and C. R. d. Souza, “Learning with average precision: Training image retrieval with a listwise loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5107–5116.
  • [104] Y. Ge, H. Wang, F. Zhu, R. Zhao, and H. Li, “Self-supervising fine-grained region similarities for large-scale image localization,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16.   Springer, 2020, pp. 369–386.
  • [105] G. Berton, G. Trivigno, B. Caputo, and C. Masone, “Eigenplaces: Training viewpoint robust models for visual place recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 080–11 090.
  • [106] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, pp. 91–110, 2004.
  • [107] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, “Sosnet: Second order similarity regularization for local descriptor learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 016–11 025.
  • [108] L. Cavalli, V. Larsson, M. R. Oswald, T. Sattler, and M. Pollefeys, “Handcrafted outlier detection revisited,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16.   Springer, 2020, pp. 770–787.
[Uncaptioned image] Kuan Xu received the B.E. and M.E. degrees in Electrical Engineering from the Harbin Institute of Technology, Harbin, China, in 2016 and 2018, respectively. He is currently working toward the Ph.D. degree in Electrical and Electronic Engineering with the Department of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. From July 2018 to March 2020, he worked as a robot algorithm engineer at Tencent Holdings Ltd, Beijing, China. From March 2020 to April 2022, he served as a senior robot algorithm engineer at Geekplus Technology Co., Ltd., Beijing, China. His research interests include visual SLAM, robot localization, and perception.
[Uncaptioned image] Yuefan Hao received a B.E. degree in Communication Engineering from Kunming University of Science and Technology, Kunming, China, in 2017 and a M.E. degree in Electrical and Communication Engineering from University of Electronic Science and Technology of China, Chengdu, China, in 2020. From June 2020 to June 2024, he served as a robot algorithm engineer at Geekplus Technology Co., Ltd., Beijing, China. His research interests include computer vision, deep learning, and robotics.
[Uncaptioned image] Shenghai Yuan (Member, IEEE) received his B.S. and Ph.D. degrees in Electrical and Electronic Engineering in 2013 and 2019, respectively, from Nanyang Technological University, Singapore. His research focuses on robotics perception and navigation. He is a postdoctoral senior research fellow at the Centre for Advanced Robotics Technology Innovation (CARTIN) at Nanyang Technological University, Singapore. He has contributed over 70 papers to journals such as TRO, IJRR, TIE, and RAL, and to conferences including ICRA, CVPR, ICCV, NeurIPS, and IROS. He has also contributed over 10 technical disclosures and patents. Currently, he serves as an associate editor for the Unmanned Systems Journal and as a guest editor for the Electronics Special Issue on Advanced Technologies of Navigation for Intelligent Vehicles. He achieved second place in the academic track of the 2021 Hilti SLAM Challenge, third place in the visual-inertial track of the 2023 ICCV SLAM Challenge, and won the IROS 2023 Best Entertainment and Amusement Paper Award. He also received the Outstanding Reviewer Award at ICRA 2024. He served as the organizer of the CARIC UAV Swarm Challenge and Workshop at the 2023 CDC and the UG2 Anti-drone Challenge and Workshop at CVPR 2024. Currently, he is the organizer of the second CARIC UAV Swarm Challenge and Workshop at IROS 2024.
[Uncaptioned image] Chen Wang (Senior Member, IEEE) received a B.Eng. degree in Electrical Engineering from Beijing Institute of Technology (BIT) in 2014 and a Ph.D. degree in Electrical Engineering from Nanyang Technological University (NTU) Singapore in 2019. He was a Postdoctoral Fellow with the Robotics Institute at Carnegie Mellon University (CMU). Dr. Wang is an Assistant Professor and leading the Spatial AI & Robotics (SAIR) Lab at the Department of Computer Science and Engineering, University at Buffalo (UB). He is an Associate Editor for the International Journal of Robotics Research (IJRR) and IEEE Robotics and Automation Letters (RA-L) and an Associate Co-chair for the IEEE RAS Technical Committee (TC) for Computer & Robot Vision. He served as an Area Chair for the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), the IEEE International Conference on Robotics and Automation (ICRA), and the Conference on Neural Information Processing Systems (NeurIPS). His research interests include Spatial AI and Robotics.
[Uncaptioned image] Lihua Xie (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the University of Newcastle, Australia, in 1992. Since 1992, he has been with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, where he is currently a President’s Chair and Director, Center for Advanced Robotics Technology Innovation. He served as the Head of Division of Control and Instrumentation and Co-Director, Delta-NTU Corporate Lab for Cyber-Physical Systems. He held teaching appointments in the Department of Automatic Control, Nanjing University of Science and Technology from 1986 to 1989. Dr Xie’s research interests include robust control and estimation, networked control systems, multi-agent networks, and unmanned systems. He is an Editor-in-Chief for Unmanned Systems and has served as Editor of IET Book Series in Control and Associate Editor of a number of journals including IEEE Transactions on Automatic Control, Automatica, IEEE Transactions on Control Systems Technology, IEEE Transactions on Network Control Systems, and IEEE Transactions on Circuits and Systems-II. He was an IEEE Distinguished Lecturer (Jan 2012 – Dec 2014). Dr Xie is Fellow of Academy of Engineering Singapore, IEEE, IFAC, and CAA.