AirSLAM: An Efficient and Illumination-Robust Point-Line Visual SLAM System

Kuan Xu¹, Yuefan Hao², Shenghai Yuan¹,
Chen Wang², Lihua Xie¹ This work is supported by the National Research Foundation of Singapore under its Medium-Sized Center for Advanced Robotics Technology Innovation.¹Kuan Xu, Shenghai Yuan, and Lihua Xie are with the Centre for Advanced Robotics Technology Innovation (CARTIN), School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, {kuan.xu,shyuan,elhxie}@ntu.edu.sg.²Yuefan Hao and Chen Wang are with Spatial AI & Robotics Lab, Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA [email protected], [email protected].

Abstract

In this paper, we present an efficient visual SLAM system designed to tackle both short-term and long-term illumination challenges. Our system adopts a hybrid approach that combines deep learning techniques for feature detection and matching with traditional backend optimization methods. Specifically, we propose a unified convolutional neural network (CNN) that simultaneously extracts keypoints and structural lines. These features are then associated, matched, triangulated, and optimized in a coupled manner. Additionally, we introduce a lightweight relocalization pipeline that reuses the built map, where keypoints, lines, and a structure graph are used to match the query frame with the map. To enhance the applicability of the proposed system to real-world robots, we deploy and accelerate the feature detection and matching networks using C++ and NVIDIA TensorRT. Extensive experiments conducted on various datasets demonstrate that our system outperforms other state-of-the-art visual SLAM systems in illumination-challenging environments. Efficiency evaluations show that our system can run at a rate of $73\mathrm{Hz}$ on a PC and $40\mathrm{Hz}$ on an embedded platform. Our implementation is open-sourced: https://github.com/sair-lab/AirSLAM.

Index Terms:

Visual SLAM, Mapping, Relocalization.

I Introduction

Visual simultaneous localization and mapping (vSLAM) is essential for robot navigation due to its favorable balance between cost and accuracy [1]. Compared to LiDAR SLAM, vSLAM utilizes more cost-effective and compact sensors to achieve accurate localization, thus broadening its range of potential applications [2]. Moreover, cameras can capture richer and more detailed information, which enhances their potential for providing robust localization.

Despite the recent advancements, the present vSLAM systems still struggle with severe lighting conditions [3, 4, 5, 6], which can be summarized into two categories. First, feature detection and tracking often fail due to drastic changes or low light, severely affecting the quality of the estimated trajectory [7, 8]. Second, when the visual map is reused for relocalization, lighting variations could significantly reduce the success rate [9, 10]. In this paper, we refer to the first issue as the short-term illumination challenge, which impacts pose estimation between two temporally adjacent frames, and the second as the long-term illumination challenge, which affects matching between the query frame and an existing map.

Present methods usually focus on only one of the above challenges. For example, various image enhancement [11, 12, 13] and image normalization algorithms [14, 15] have been developed to ensure robust tracking. These methods primarily focus on maintaining either global or local brightness consistency, yet they often fall short of handling all types of challenging lighting conditions [16]. Some systems have addressed this issue by training a VO or SLAM network on large datasets containing diverse lighting conditions [17, 18, 19]. However, they have difficulty producing a map suitable for long-term localization. Some methods can provide illumination-robust relocalization, but they usually require map building under good lighting conditions [20, 21]. In real-world robot applications, these two challenges often arise simultaneously, necessitating a unified system capable of addressing both.

Furthermore, many of the aforementioned systems incorporate intricate neural networks, relying on powerful GPUs to run in real-time. They lack the efficiency necessary for deployment on resource-constrained platforms, such as warehouse robots. These limitations impede the transition of vSLAM from laboratory research to industrial applications.

In response to these gaps, this paper introduces AirSLAM. Observing that line features can improve the accuracy and robustness of vSLAM systems [5, 22, 23], we integrate both point and line features for tracking, mapping, optimization, and relocalization. To achieve a balance between efficiency and performance, we design our system as a hybrid system, employing learning-based methods for feature detection and matching, and traditional geometric approaches for pose and map optimization. Additionally, to enhance the efficiency of feature detection, we developed a unified model capable of simultaneously detecting point and line features. We also address long-term localization challenges by proposing a multi-stage relocalization strategy, which effectively reuses our point-line map. In summary, our contributions include

•

We propose a novel point-line-based vSLAM system that combines the efficiency of traditional optimization techniques with the robustness of learning-based methods. Our system is resilient to both short-term and long-term illumination challenges while remaining efficient enough for deployment on embedded platforms.
•

We develop a unified model for both keypoint and line detection, which we call PLNet. To our knowledge, PLNet is the first model capable of simultaneously detecting both point and line features. Furthermore, we associate these two types of features and jointly utilize them for tracking, mapping, and relocalization tasks.
•

We propose a multi-stage relocalization method based on both point and line features, utilizing both appearance and geometry information. This method can provide fast and illumination-robust localization in an existing visual map using only a single image.
•

We conduct extensive experiments to demonstrate the efficiency and effectiveness of the proposed methods. The results show that our system achieves accurate and robust mapping and relocalization performance under various illumination-challenging conditions. Additionally, our system is also very efficient. It runs at a rate of $73\mathrm{Hz}$ on a PC and $40\mathrm{Hz}$ on an embedded platform.

In addition, our engineering contributions include deploying and accelerating feature detection and matching networks using C++ and NVIDIA TensorRT, facilitating their deployment on real robots. We release all the source code at https://github.com/sair-lab/AirSLAM to benefit the community.

This paper extends our conference paper, AirVO [22]. AirVO utilizes SuperPoint [24] and LSD [25] for feature detection, and SuperGlue [26] for feature matching. It achieves remarkable performance in environments with changing illumination. However, as a visual-only odometry, it primarily addresses short-term illumination challenges and cannot reuse a map for drift-free relocalization. Additionally, despite carefully designed post-processing operations, the modified LSD is still not stable enough for long-term localization. It relies on image gradient information rather than environmental structural information, rendering it susceptible to varying lighting conditions. In this version, we introduce substantial improvements, including:

•

We design a unified CNN to detect both point and line features, enhancing the stability of feature detection in illumination-challenging environments. Additionally, the more efficient LightGlue [27] is used for feature matching.
•

We extend our system to support both stereo-only and stereo-inertial data, increasing its reliability when an inertial measurement unit (IMU) is available.
•

We incorporate loop closure detection and map optimization, forming a complete vSLAM system.
•

We design a multi-stage relocalization module based on both point and line features, enabling our system to effectively handle long-term illumination challenges.

The remainder of this article is organized as follows. In Section II, we discuss the relevant literature. Section III-B presents an overview of the complete system pipeline. The proposed PLNet is presented in Section IV. In Section V, we introduce the visual-inertial odometry based on PLNet. Section VI shows how to optimize the map offline and reuse it online. The detailed experimental results are presented in Section VII to verify the efficiency, accuracy, and robustness of AirSLAM. This article is concluded in Section VIII.

II Related work

II-A Keypoint and Line Detection for vSLAM

II-A1 Keypoint Detection

Various handcrafted keypoint features e.g., ORB [28], FAST [29], and BRISK [30], have been proposed and applied to VO and vSLAM systems. They are usually efficient but not robust enough in challenging environments [26, 9]. With the development of deep learning techniques, more and more learning-based features are proposed and used to replace the handcrafted features in vSLAM systems. Rong et al. [31] introduce TFeat network [32] to extract descriptors for FAST corners and apply it to a traditional vSLAM pipeline. Tang et al. [33] use a neural network to extract robust keypoints and binary feature descriptors with the same shape as the ORB. Han et al. [34] combine SuperPoint [24] feature extractor with a traditional back-end. Bruno et al. proposed LIFT-SLAM [35], where they use LIFT [36] to extract features. Li et al. [37] replace the ORB feature with SuperPoint in ORB-SLAM2 and optimize the feature extraction with the Intel OpenVINO toolkit. Zhan et al. [38] proposed a new self-supervised training scheme for learning-based features, using bundle adjustment and bi-level optimization as a supervision signal. Some other learning-based features, e.g., R2D2 [39] and DISK [40], and D2Net [41], are also being attempted to be applied to vSLAM systems, although they are not yet efficient enough [42, 43].

II-A2 Line Detection

Currently, most point-line-based vSLAM systems use the LSD [25] or EDLines [44] to detect line features because of their good efficiency [5, 45, 46, 47, 48]. Although many learning-based line detection methods, e.g., SOLD2 [49], AirLine [50], and HAWP [51], have been proposed and shown better robustness in challenging environments, they are difficult to apply to real-time vSLAM systems due to lacking efficiency. For example, Kannapiran et al. propose StereoVO [23], where they choose SuperPoint [24] and SOLD2 [49] to detect keypoints and line segments, respectively. Despite achieving good performance in dynamic lighting conditions, StereoVO can only run at a rate of about $7\mathrm{Hz}$ on a good GPU.

II-B Short-Term Illumination Challenge

Several handcrafted methods have been proposed to improve the robustness of VO and vSLAM to challenging illumination. DSO [52] models brightness changes and jointly optimizes camera poses and photometric parameters. DRMS [11] and AFE-ORB-SLAM [12] utilize various image enhancements. Some systems try different methods, such as ZNCC, the locally-scaled sum of squared differences (LSSD), and dense descriptor computation, to achieve robust tracking [14, 53, 15]. These methods mainly focus on either global or local illumination change for all kinds of images, however, lighting conditions often affect the scene differently in different areas [16]. Other related methods include that of Huang and Liu [54], which presents a multi-feature extraction algorithm to extract two kinds of image features when a single-feature algorithm fails to extract enough feature points. Kim et al. [55] employ a patch-based affine illumination model during direct motion estimation. Chen et al. [56] minimize the normalized information distance with nonlinear least square optimization for image registration. Alismail et al. [57] propose a binary feature descriptor using a descriptor assumption to avoid brightness constancy.

Compared with handcrafted methods, learning-based methods have shown better performance. Savinykh et al. [8] propose DarkSLAM, where Generative Adversarial Network (GAN) [58] is used to enhance input images. Pratap Singh et al. [59] compare different learning-based image enhancement methods for vSLAM in low-light environments. TartanVO [17], DROID-SLAM [18], and iSLAM [19] train their VO or SLAM networks on the TartanAir dataset [60], which is a large simulation dataset that contains various lighting conditions, therefore, they are very robust in challenging environments. However, they usually require good GPUs and long training times. Besides, DROID-SLAM runs very slowly and is difficult to apply to real-time applications on resource-constrained platforms. TartanVO and iSLAM are more efficient, but they cannot achieve performance as accurately as traditional vSLAM systems.

II-C Long-Term Illumination Challenge

Currently, most SLAM systems still use the bag of words (BoW) [61] for loop closure detection and relocalization due to its good balance between efficiency and effectiveness [62, 4, 63]. To make the relocalization more robust to large illumination variations, Labbé et al. [20] propose the multi-session relocalization method, where they combine multiple maps generated at different times and in various illumination conditions. DXSLAM [37] uses NetVLAD [64] for the coarse image retrieval and SuperPoint with a binary descriptor for keypoint matching between the query frame and candidates.

Another similar task in the robotics and computer vision communities is the visual place recognition (VPR) problem, where many researchers handle the localization problem with image retrieval methods [64, 65]. These VPR solutions try to find images most similar to the query image from a database. They usually cannot directly provide accurate pose estimation which is needed in robot applications. Sarlin et al. address this and propose Hloc [9]. They use a global retrieval to obtain several candidates and match local features within those candidates. The Hloc toolbox has integrated many image retrieval methods, local feature extractors, and matching methods, and it is currently the SOTA system. Yan et al. [66] propose a long-term visual localization method for mobile platforms, however, they rely on other sensors, e.g., GPS, compass, and gravity sensor, for the coarse location retrieval.

Refer to caption — Figure 1: The proposed system consists of three main parts: online stereo VO/VIO, offline map optimization, and online relocalization. The VO/VIO module uses the mapping image sequences to build an initial map. Then the initial map is processed offline and an optimized map is outputted. The optimized map can be used for the one-shot relocalization.

III System Overview

III-A Notations

In this paper, $\mathbb{R}$ represents the set of real numbers, and $\mathbb{R}^{m}$ denotes the $m$ -dimensional real vector space. The transpose of a vector or matrix is written as $(\cdot)^{\top}$ . For a vector $\mathbf{x}\in\mathbb{R}^{m}$ , $\lVert\mathbf{x}\rVert$ represents its Euclidean norm, while $\lVert\mathbf{x}\rVert^{2}_{\Sigma}$ is shorthand for $\mathbf{x}^{\top}\Sigma\mathbf{x}$ . The transformation, rotation, and translation from the $b$ -coordinate system to the $a$ -coordinate system are denoted by $\mathbf{T}_{ab}\in\text{SE(3)}$ , $\mathbf{R}_{ab}\in\text{SO(3)}$ , and $\mathbf{t}_{ab}\in\mathbb{R}^{3}$ , respectively. We use $(\cdot)_{c}$ to indicate a vector in the camera frame and $(\cdot)_{w}$ to indicate a vector in the global world frame. Throughout the paper, super/subscripts may be omitted for brevity, as long as the meaning remains unambiguous within the given context.

III-B System Architecture

We believe that a practical vSLAM system should possess the following features:

•

High efficiency. The system should have real-time performance on resource-constrained platforms.
•

Scalability. The system should be easily extensible for various purposes and real-world applications.
•

Easy to deploy. The system should be easy to deploy on real robots and capable of achieving robust localization.

Therefore, we design a system as shown in Fig. 1. The proposed system is a hybrid system as we need the robustness of data-driven approaches and the accuracy of geometric methods. It consists of three main components: stereo VO/VIO, offline map optimization, and lightweight relocalization. (1) Stereo VO/VIO: We propose a point-line-based visual odometry that can handle both stereo and stereo-inertial inputs. (2) Offline map optimization: We implement several commonly used plugins, such as loop detection, pose graph optimization, and global bundle adjustment. The system is easily extensible for other map-processing purposes by adding customized plugins. For example, we have implemented a plugin to train a scene-dependent junction vocabulary using the endpoints of line features, which is utilized in our lightweight multi-stage relocalization. (3) Lightweight relocalization: We propose a multi-stage relocalization method that improves efficiency while maintaining effectiveness. In the first stage, keypoints and line features are detected using the proposed PLNet, and several candidates are retrieved using a keypoint vocabulary trained on a large dataset. In the second stage, most false candidates are quickly filtered out using a scene-dependent junction vocabulary and a structure graph. In the third stage, feature matching is performed between the query frame and the remaining candidates to find the best match and estimate the pose of the query frame. Since feature matching in the third stage is typically time-consuming, the filtering process in the second stage enhances the efficiency of our system compared to other two-stage relocalization systems.

We transfer some time-consuming processes, e.g., loop closure detection, pose graph optimization, and global bundle adjustment, to the offline stage. This improves the efficiency of our online mapping module. In many practical applications, such as warehouse robotics, a map is typically built by one robot and then reused by others. Our system is designed with these applications in mind. The lightweight mapping and map reuse modules can be easily deployed on resource-constrained robots, while the offline optimization module can run on a more powerful computer for various map manipulations, such as map editing and visualization. The mapping robot uploads the initial map to the computer, which then distributes the optimized map to other robots, ensuring drift-free relocalization. In the following sections, we introduce our feature detection and visual odometry (VO) pipeline in Section IV and Section V, respectively. The offline optimization and relocalization modules are presented in Section VI.

IV Feature Detection

IV-A Motivation

With advancements in deep learning technology, learning-based feature detection methods have demonstrated more stable performance in illumination-challenging environments compared to traditional methods. However, existing point-line-based VO/VIO and SLAM systems typically detect keypoints and line features separately. While it is acceptable for handcrafted methods due to their efficiency, the simultaneous application of keypoint detection and line detection networks in VO/VIO or SLAM systems, especially in stereo configurations, often hinders real-time performance on resource-constrained platforms. Consequently, we aim to design an efficient unified model that can detect keypoints and line features concurrently.

However, achieving a unified model for keypoint and line detection is challenging, as these tasks typically require different real-image datasets and training procedures. Keypoint detection models are generally trained on large datasets comprising diverse images and depend on either a boosting step or the correspondences of image pairs for training [24, 40, 39]. For line detection, we find wireframe parsing methods [67, 51] can provide stronger geometric cues than the self-supervised models [49, 68] as they are able to detect longer and more complete lines. The wireframe includes all prominent straight lines and their junctions within the scene, providing an efficient and accurate representation of large-scale geometry and object shapes [69]. However, these methods are trained on the Wireframe dataset [69], which is limited in size with only 5,462 discontinuous images. In the following sections, we will address this challenge and demonstrate how to train a unified model capable of performing both tasks. It is important to note that in this paper, the term “line detection” refers specifically to the wireframe parsing task.

IV-B Architecture Design

As shown in Fig. 2, we have two findings when visualizing the results of the keypoint and line detection networks: (1) Most junctions (endpoints of lines) detected by the line detection model are also selected as keypoints by the keypoint detection model. (2) The feature maps outputted by the keypoint detection model contain the edge information. Therefore, we argue that a line detection model can be built on the backbone of a pre-trained keypoint detection model. Based on this assumption, we design the PLNet to detect keypoints and lines in a unified framework. As shown in Fig. 3, it consists of the shared backbone, the keypoint module, and the line module.

Backbone: We follow SuperPoint [24] to design the backbone for its good efficiency and effectiveness. It uses 8 convolutional layers and 3 max-pooling layers. The input is the grayscale image sized $H\times W$ . The outputs are $H\times W\times 64$ , $\frac{H}{2}\times\frac{W}{2}\times 64$ , $\frac{H}{4}\times\frac{W}{4}\times 128$ , $\frac{H}{8}\times\frac{W}{8}\times 128$ feature maps.

Keypoint Module: We also follow SuperPoint [24] to design the keypoint detection header. It has two branches: the score branch and the descriptor branch. The inputs are $\frac{H}{8}\times\frac{W}{8}\times 128$ feature maps outputted by the backbone. The score branch outputs a tensor sized $\frac{H}{8}\times\frac{W}{8}\times 65$ . The 65 channels correspond to an $8\times 8$ grid region and a dustbin indicating no keypoint. The tensor is processed by a softmax and then resized to $H\times W$ . The descriptor branch outputs a tensor sized $\frac{H}{8}\times\frac{W}{8}\times 256$ , which is used for interpolation to compute descriptors of keypoints.

Line Module: This module takes feature maps from the backbone as inputs. It consists of a U-Net-like CNN and the line detection header. We modify the U-Net [70] to make it contain fewer convolutional layers and thus be more efficient. The U-Net-like CNN is to increase the receptive field as detecting lines requires a larger receptive field than detecting keypoints. The EPD LOIAlign [51] is used to process the outputs of the line module and finally outputs junctions and lines.

IV-C Network Training

Due to the training problem described in Section IV-A and the assumption in Section IV-B, we train our PLNet in two rounds. In the first round, only the backbone and the keypoint detection module are trained, which means we need to train a keypoint detection network. In the second round, the backbone and the keypoint detection module are fixed, and we only train the line detection module on the Wireframe dataset. We skip the details of the first round as they are very similar to [24]. Instead, we present the training of the line detection module.

Line Encoding: We adopt the attraction region field [51] to encode line segments. As shown in Fig. 4, for a line segment $\mathbf{l}=(\mathbf{x_{1}},\mathbf{x_{2}})$ , where $\mathbf{x_{1}}$ and $\mathbf{x_{2}}$ are two endpoints of $\mathbf{l}$ , and a point $\mathbf{p}$ in the attraction region of $\mathbf{l}$ , four parameters and $\mathbf{p}$ are used to encode $\mathbf{l}$ :

\mathbf{p}\left(\mathbf{l}\right)=\left(d,\theta,\theta_{1},\theta_{2}\right),

(1)

where $d$ is the distance from $\mathbf{p}$ to $\mathbf{l}$ , $\theta$ is the angle between $\mathbf{l}$ and the $\mathbf{v}$ -axis of the image, $\theta_{1}$ is the angle between $\mathbf{px_{1}}$ and the perpendicular line from $\mathbf{p}$ to $\mathbf{l}$ , and $\theta_{2}$ is the angle between $\mathbf{px_{2}}$ and the perpendicular line. The network can predict these four parameters for point $\mathbf{p}$ and then $\mathbf{l}$ can be decoded through:

\mathbf{l}=d\cdot\left[\begin{array}[]{cc}\cos{\theta}&-\sin{\theta}\\ \sin{\theta}&\cos{\theta}\end{array}\right]\left[\begin{array}[]{cc}1&1\\ \tan{\theta_{1}}&\tan{\theta_{2}}\end{array}\right]+\left[\begin{array}[]{cc}% \mathbf{p}&\mathbf{p}\end{array}\right].

(2)

Line Prediction: The line detection module outputs a tensor sized $\frac{H}{4}\times\frac{W}{4}\times 4$ to predict parameters in (1) and a heatmap to predict junctions. For each decoded line segment by (2), two junctions closest to its endpoints will be selected to form a line proposal with it. Proposals with the same junctions will be deduplicated and only one is retained. Then the EPD LOIAlign [51] and a head classifier are applied to decide whether the line proposal is a true line feature.

Line Module Training: We use the $L1$ loss to supervise the prediction of parameters in (1) and the binary cross-entropy loss to supervise the junction heatmap and the head classifier. The total loss is the sum of them. As shown in Fig. 5, to improve the robustness of line detection in illumination-challenging environments, seven types of photometric data augmentation are applied to process training images. The training uses the ADAM optimizer [71] with the learning rate $lr=4e\text{-}4$ in the first 35 epochs and $lr=4e\text{-}5$ in the last 5 epochs.

V Stereo Visual Odometry

V-A Overview

The proposed point-line-based stereo visual odometry is shown in Fig. 6. It is a hybrid VO system utilizing both the learning-based front-end and the traditional optimization backend. For each stereo image pair, we first employ the proposed PLNet to extract keypoints and line features. Then a GNN (LightGlue [27]) is used to match keypoints. In parallel, we associate line features with keypoints and match them using the keypoint matching results. After that, we perform an initial pose estimation and reject outliers. Based on the results, we triangulate the 2D features of keyframes and insert them into the map. Finally, the local bundle adjustment will be performed to optimize points, lines, and keyframe poses. In the meantime, if an IMU is accessible, its measurements will be processed using the IMU preintegration method [72], and added to the initial pose estimation and local bundle adjustment.

Applying both learning-based feature detection and matching methods to the stereo VO is time-consuming. Therefore, to improve efficiency, the following three techniques are utilized in our system. (1) For keyframes, we extract features on both left and right images and perform stereo matching to estimate the real scale. But for non-keyframes, we only process the left image. Besides, we use some lenient criteria to make the selected keyframes in our system very sparse, so the runtime and resource consumption of feature detection and matching in our system are close to that of a monocular system. (2) We convert the inference code of the CNN and GNN from Python to C++, and deploy them using ONNX and NVIDIA TensorRT, where the 16-bit floating-point arithmetic replaces the 32-bit floating-point arithmetic. (3) We design a multi-thread pipeline. A producer-consumer model is used to split the system into two main threads, i.e., the front-end thread and the backend thread. The front-end thread extracts and matches features while the backend thread performs the initial pose estimation, keyframe insertion, and local bundle adjustment.

V-B Feature Matching

We use LightGlue [27] to match keypoints. For line features, most of the current VO and SLAM systems use the LBD algorithm [73] or tracking sample points to match them. However, the LBD algorithm extracts the descriptor from a local band region of the line, so it suffers from unstable line detection due to challenging illumination or viewpoint changes. Tracking sample points can match the line detected with different lengths in two frames, but current SLAM systems usually use optical flow to track the sample points, which have a bad performance when the light conditions change rapidly or violently. Some learning-based line feature descriptors [49] are also proposed, however, they are rarely used in current SLAM systems due to the increased time complexity.

Therefore, to address both the effectiveness problem and efficiency problem, we design a fast and robust line-matching method for illumination-challenging conditions. First, we associate keypoints with line segments through their distances. Assume that $M$ keypoints and $N$ line segments are detected on the image, where the $i$ -th keypoint is denoted as $\mathbf{p}_{i}=(x_{i},y_{i})$ and the $j$ -th line segment is denoted as $\mathbf{l}_{j}=(A_{j},B_{j},C_{j},x_{j,1},y_{j,1},x_{j,2},y_{j,2})$ , where $(A_{j},B_{j},C_{j})$ are line parameters of $\mathbf{l}_{j}$ and $(x_{j,1},y_{j,1},x_{j,2},y_{j,2})$ are the endpoints. We first compute the distance between $\mathbf{p}_{i}$ and $\mathbf{l}_{j}$ through:

d_{ij}=d\left(\mathbf{p}_{i},\mathbf{l}_{j}\right)=\frac{\lvert A_{j}\cdot x_{% i}+B_{j}\cdot y_{i}+C_{j}\rvert}{\sqrt{A_{j}^{2}+B_{j}^{2}}}.

(3)

If $d_{ij}<3$ and the projection of $\mathbf{p}_{i}$ on the coordinate axis lies within the projections of line segment endpoints, i.e., $\min({x}_{j,1},{x}_{j,2})\leq x_{i}\leq\max({x}_{j,1},{x}_{j,2})$ or $\min({y}_{j,1},{y}_{j,2})\leq{y}_{i}\leq\max({y}_{j,1},{y}_{j,2})$ , we will say $\mathbf{p}_{i}$ belongs to $\mathbf{l}_{j}$ . Then the line segments on two images can be matched based on the point-matching result of these two images. For ${\mathbf{l}_{k,m}}$ on image $k$ and ${\mathbf{l}_{k+1,n}}$ on image $k+1$ , we compute a score ${S}_{mn}$ to represent the confidence of that they are the same line:

{S}_{mn}=\frac{{N}_{pm}}{\min({{N}_{k,m}},{{N}_{k+1,n}})},

(4)

where ${N}_{pm}$ is the matching number between point features belonging to ${\mathbf{l}_{k,m}}$ and point features belonging to ${\mathbf{l}_{k+1,n}}$ . ${{N}_{k,m}}$ and ${{N}_{k+1,n}}$ are the numbers of point features belonging to ${\mathbf{l}_{k,m}}$ and ${\mathbf{l}_{k+1,n}}$ , respectively. Then if ${S}_{mn}>\delta_{S}$ and ${N}_{pm}>\delta_{N}$ , where $\delta_{S}$ and $\delta_{N}$ are two preset thresholds, we will regard ${\mathbf{l}_{k,m}}$ and ${\mathbf{l}_{k+1,n}}$ as the same line. This coupled feature matching method allows our line matching to share the robust performance of keypoint matching while being highly efficient due to that it does not need another line-matching network.

V-C 3D Feature Processing

In this part, we will introduce our 3D feature processing methods, including 3D feature representation, triangulation, i.e., constructing 3D features from 2D features, and re-projection, i.e., projecting 3D features to the image plane. For 3D point processing, a 3D point is denoted as $\mathbf{X}\in\mathbb{R}^{3}$ . LightGlue [27] is utilized to match keypoints between the left and right images. Successfully matched keypoints are triangulated using stereo disparity information, while unmatched keypoints are triangulated leveraging multi-frame observations. The projection of 3D points onto the image plane is modeled using either the pinhole camera model or the fisheye camera model, depending on the specific camera in use. We skip the details of 3D point processing in our system as they are similar to other point-based VO and SLAM systems [74, 4]. On the contrary, compared with 3D points, 3D lines have more degrees of freedom, and they are easier to degenerate when being triangulated. Therefore, the 3D line processing will be illustrated in detail.

V-C1 3D Line Representation

We use Plücker coordinates [75] to represent a 3D spatial line:

\mathbf{L}=\left[\begin{array}[]{c}\mathbf{n}\\ \mathbf{v}\end{array}\right]\in\mathbb{R}^{6},

(5)

where $\mathbf{v}$ is the direction vector of the line and $\mathbf{n}$ is the normal vector of the plane determined by the line and the origin. Plücker coordinates are used for 3D line triangulation, transformation, and projection. It is over-parameterized because it is a 6-dimensional vector, but a 3D line has only four degrees of freedom. In the graph optimization stage, the extra degrees of freedom will increase the computational cost and cause the numerical instability of the system [76]. Therefore, we also use orthonormal representation [75] to represent a 3D line:

\left(\mathbf{U},\mathbf{W}\right)\in\text{SO(3)}\times\text{SO(2)}

(6)

The relationship between Plücker coordinates and orthonormal representation is similar to SO(3) and so(3). Orthonormal representation can be obtained from Plücker coordinates by:

\mathbf{L}=\left[\mathbf{n}\mid\mathbf{v}\right]=\underbrace{\left[\begin{% array}[]{ccc}\frac{\mathbf{n}}{\lVert\mathbf{n}\lVert}&\frac{\mathbf{v}}{% \lVert\mathbf{v}\lVert}&\frac{\mathbf{n}\times\mathbf{v}}{\lVert\mathbf{n}% \times\mathbf{v}\lVert}\end{array}\right]}_{\mathbf{U}\in\text{SO(3)}}% \underbrace{\left[\begin{array}[]{cc}\lVert\mathbf{n}\lVert&\mathbf{0}\\ \mathbf{0}&\lVert\mathbf{v}\lVert\\ \mathbf{0}&\mathbf{0}\end{array}\right]}_{\Sigma_{3\times 2}},

(7)

where $\Sigma_{3\times 2}$ is a diagonal matrix and its two non-zero entries defined up to scale can be represented by an SO(2) matrix:

\mathbf{W}=\frac{1}{\sqrt{{\lVert\mathbf{n}\lVert}^{2}+{\lVert\mathbf{v}\lVert% }^{2}}}\left[\begin{array}[]{cc}{\lVert\mathbf{n}\lVert}&-{\lVert\mathbf{v}% \lVert}\\ {\lVert\mathbf{v}\lVert}&{\lVert\mathbf{n}\lVert}\end{array}\right]\in\text{SO% (2)}.

(8)

In practice, this conversion can be done simply and quickly with the QR decomposition.

V-C2 Triangulation

Triangulation is to initialize a 3D line from two or more 2D line features. In our system, we use two methods to triangulate a 3D line. The first is similar to the line triangulation algorithm $B$ in [77], where the pose of a 3D line can be computed from two planes. To achieve this, we select two line segments, $\mathbf{l}_{1}$ and $\mathbf{l}_{2}$ , on two images, which are two observations of a 3D line. Note that the two images can come from the stereo pair of the same keyframe or two different keyframes. $\mathbf{l}_{1}$ and $\mathbf{l}_{2}$ can be back-projected and construct two 3D planes, $\mathbf{\pi}_{1}$ and $\mathbf{\pi}_{2}$ . Then the 3D line can be regarded as the intersection of $\mathbf{\pi}_{1}$ and $\mathbf{\pi}_{2}$ .

However, triangulating a 3D line is more difficult than triangulating a 3D point, because it suffers more from degenerate motions[77]. Therefore, we also employ a second line triangulation method if the above method fails, where points are utilized to compute the 3D line. In Section V-B, we have associated point features with line features. So to initialize a 3D line, two triangulated points $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$ , which belong to this line and have the shortest distance from this line on the image plane are selected. Then the Plücker coordinates of this line can be obtained through:

\mathbf{L}=\left[\begin{array}[]{c}\mathbf{n}\\ \mathbf{v}\end{array}\right]=\left[\begin{array}[]{c}\mathbf{X}_{1}\times% \mathbf{X}_{2}\\ \frac{\mathbf{X}_{1}-\mathbf{X}_{2}}{\lVert\mathbf{X}_{1}-\mathbf{X}_{2}\lVert% }\end{array}\right].

(9)

This method requires little extra computation because the selected 3D points have been triangulated in the point triangulating stage. It is very efficient and robust.

V-C3 Re-projection

Re-projection is used to compute the re-projection errors. We use Plücker coordinates to transform and re-project 3D lines. First, we convert the 3D line from the world frame to the camera frame:

\mathbf{L}_{c}=\left[\begin{array}[]{c}\mathbf{n}_{c}\\ \mathbf{v}_{c}\end{array}\right]=\left[\begin{array}[]{cc}{\mathbf{R}}_{cw}&% \left[{\mathbf{t}_{cw}}\right]_{\times}{\mathbf{R}_{cw}}\\ \mathbf{0}&{\mathbf{R}_{cw}}\end{array}\right]\left[\begin{array}[]{c}\mathbf{% n}_{w}\\ \mathbf{v}_{w}\end{array}\right]={\mathbf{H}_{cw}}{\mathbf{L}_{w}},

(10)

where $\mathbf{L}_{c}$ and $\mathbf{L}_{w}$ are Plücker coordinates of 3D line in the camera frame and world frame, respectively. ${\mathbf{R}_{cw}}\in\text{SO(3)}$ is the rotation matrix from world frame to camera frame and ${\mathbf{t}_{cw}}\in\mathbb{R}^{3}$ is the translation vector. $\left[\cdot\right]_{\times}$ denotes the skew-symmetric matrix of a vector and ${\mathbf{H}_{cw}}$ is the transformation matrix of 3D lines from world frame to camera frame.

Then the 3D line $\mathbf{L}_{c}$ can be projected to the image plane through a line projection matrix ${\mathbf{P}_{c}}$ :

\mathbf{l}=\left[\begin{array}[]{c}A\\ B\\ C\end{array}\right]={\mathbf{P}_{c}}{\mathbf{L}_{c}}_{[:3]}=\left[\begin{array% }[]{ccc}f_{x}&0&0\\ 0&f_{y}&0\\ -f_{y}c_{x}&-f_{x}c_{y}&f_{x}f_{y}\end{array}\right]{\mathbf{n}_{c}},

(11)

where $\mathbf{l}=\left[\begin{array}[]{ccc}A&B&C\end{array}\right]^{\top}$ is the re-projected 2D line on image plane. ${\mathbf{L}_{c}}_{[:3]}$ donates the first three rows of vector $\mathbf{L}_{c}$ .

V-D Keyframe Selection

Observing that the learning-based data association method used in our system is able to track two frames that have a large baseline, so different from the frame-by-frame tracking strategy used in other VO or SLAM systems, we only match the current frame with the last keyframe. We argue this strategy can reduce the accumulated tracking error.

Therefore, the keyframe selection is essential for our system. On the one hand, as described in Section V-A, we want to make keyframes sparse to reduce the consumption of computational resources. On the other hand, the sparser the keyframes, the more likely tracking failure happens. To balance the efficiency and the tracking robustness, a frame will be selected as a keyframe if any of the following conditions is satisfied:

•

The tracked features are less than $\alpha_{1}\cdot N_{s}$ .
•

The average parallax of tracked features between the current frame and the last keyframe is larger than $\alpha_{2}\cdot\sqrt{WH}$ .
•

The number of tracked features is less than $N_{kf}$ .

In the above, $\alpha_{1}$ , $\alpha_{2}$ , and $N_{kf}$ are all preset thresholds. $N_{s}$ is the number of detected features. $W$ and $H$ respectively represent the width and height of the input image.

V-E Local Graph Optimization

To improve the accuracy, we perform the local bundle adjustment when a new keyframe is inserted. $N_{o}$ latest neighboring keyframes are selected to construct a local graph, where map points, 3D lines, and keyframes are vertices and pose constraints are edges. We use point constraints and line constraints as well as IMU constraints if an IMU is accessible. Their related error terms are defined as follows.

V-E1 Point Re-projection Error

If the frame $i$ can observe the 3D map point $\mathbf{X}_{p}$ , then the re-projection error is defined as:

\mathbf{r}_{i,X_{p}}={\tilde{\mathbf{x}}_{i,p}}-\pi\left({\mathbf{R}_{cw}}{% \mathbf{X}_{p}}+{\mathbf{t}_{cw}}\right),

(12)

where ${\tilde{\mathbf{x}}_{i,p}}$ is the observation of ${\mathbf{X}_{p}}$ on frame $i$ and $\pi\left(\cdot\right)$ represents the camera projection.

V-E2 Line Re-projection Error

If the frame $i$ can observe the 3D line $\mathbf{L}_{q}$ , then the re-projection error is defined as:


$\displaystyle\mathbf{r}_{i,L_{q}}$	$\displaystyle=e_{l}\left({\tilde{\mathbf{l}}_{i,q}},{\mathbf{P}_{c}}\left({{% \mathbf{H}_{cw}}{\mathbf{L}_{q}}}\right)_{[:3]}\right)\in\mathbb{R}^{2},$	(13a)
$\displaystyle e_{l}\left({\tilde{\mathbf{l}}_{i,q}},\mathbf{l}_{i,q}\right)$	$\displaystyle=\left[\begin{array}[]{cc}d\left(\tilde{\mathbf{p}}_{i,q1},% \mathbf{l}_{i,q}\right)&d\left(\tilde{\mathbf{p}}_{i,q2},\mathbf{l}_{i,q}% \right)\end{array}\right]^{\top},$	(13c)

where ${\tilde{\mathbf{l}}_{i,q}}$ is the observation of $\mathbf{L}_{q}$ on frame $i$ , ${\tilde{\mathbf{p}}_{i,q1}}$ and ${\tilde{\mathbf{p}}_{i,q2}}$ are the endpoints of ${\tilde{\mathbf{l}}_{i,q}}$ , and $d\left(\mathbf{p},\mathbf{l}\right)$ is the distance between point $\mathbf{p}$ and line $\mathbf{l}$ which is computed through (3).

V-E3 IMU Residuals

We first follow [72] to pre-integrate IMU measurements between the frame $i$ and the frame $j$ :


$\displaystyle\Delta{\mathbf{\tilde{R}}_{ij}}$	$\displaystyle=\prod\limits_{k=i}^{j-1}\text{Exp}\left(\left(\bm{\tilde{\omega}% }_{k}-\mathbf{b}_{k}^{g}-\bm{\eta}_{k}^{gd}\right)\Delta t\right),$	(14a)
$\displaystyle\Delta{\mathbf{\tilde{v}}_{ij}}$	$\displaystyle=\sum\limits_{k=i}^{j-1}\Delta{\mathbf{\tilde{R}}_{ik}}\left(% \mathbf{\tilde{a}}_{k}-\mathbf{b}_{k}^{a}-\bm{\eta}_{k}^{ad}\right)\Delta t,$	(14b)
$\displaystyle\Delta{\mathbf{\tilde{p}}_{ij}}$	$\displaystyle=\sum\limits_{k=i}^{j-1}\left(\Delta{\mathbf{\tilde{v}}_{ik}}% \Delta t+\frac{1}{2}\Delta{\mathbf{\tilde{R}}_{ik}}\left(\mathbf{\tilde{a}}_{k% }-\mathbf{b}_{k}^{a}-\bm{\eta}_{k}^{ad}\right)\Delta t^{2}\right),$	(14c)

where $\bm{\tilde{\omega}_{k}}$ and $\mathbf{\tilde{a}}_{k}$ are respectively the angular velocity and the acceleration. $\mathbf{b}_{k}^{g}$ and $\mathbf{b}_{k}^{a}$ are biases of the sensor and they are modeled as constants between two keyframes through $\mathbf{b}_{k}^{g}=\mathbf{b}_{k+1}^{g}$ and $\mathbf{b}_{k}^{a}=\mathbf{b}_{k+1}^{a}$ . $\bm{\eta}_{k}^{gd}$ and $\bm{\eta}_{k}^{ad}$ are Gaussian noises. Then IMU residuals are defined as:


$\displaystyle\mathbf{r}_{\Delta{{R}_{ij}}}$	$\displaystyle=\text{Log}\left(\left(\Delta{\mathbf{\tilde{R}}_{ij}}\text{Exp}% \left(\frac{\partial\Delta{\mathbf{R}_{ij}}}{\partial\mathbf{b}^{g}}\delta% \mathbf{b}^{g}\right)\right)^{\top}\mathbf{R}_{i}^{\top}\mathbf{R}_{j}\right),$	(15a)
$\displaystyle\mathbf{r}_{\Delta{{v}_{ij}}}$	$\displaystyle=\mathbf{R}_{i}^{\top}\left(\mathbf{v}_{j}-\mathbf{v}_{i}-\mathbf% {g}\Delta t_{ij}\right)$
	$\displaystyle-\left(\Delta\tilde{\mathbf{v}}_{ij}+\frac{\partial\Delta{\mathbf% {v}_{ij}}}{\partial\mathbf{b}^{g}}\delta\mathbf{b}^{g}+\frac{\partial\Delta{% \mathbf{v}_{ij}}}{\partial\mathbf{b}^{a}}\delta\mathbf{b}^{a}\right),$	(15b)
$\displaystyle\mathbf{r}_{\Delta{{p}_{ij}}}$	$\displaystyle=\mathbf{R}_{i}^{\top}\left(\mathbf{p}_{j}-\mathbf{p}_{i}-\mathbf% {v}_{i}\Delta t_{ij}-\frac{1}{2}\mathbf{g}\Delta t_{ij}^{2}\right)$
	$\displaystyle-\left(\Delta\tilde{\mathbf{p}}_{ij}+\frac{\partial\Delta{\mathbf% {p}_{ij}}}{\partial\mathbf{b}^{g}}\delta\mathbf{b}^{g}+\frac{\partial\Delta{% \mathbf{p}_{ij}}}{\partial\mathbf{b}^{a}}\delta\mathbf{b}^{a}\right),$	(15c)
$\displaystyle\mathbf{r}_{b_{ij}}$	$\displaystyle=\left[\begin{array}[]{cc}\left(\mathbf{b}_{j}^{g}-\mathbf{b}_{i}% ^{g}\right)^{\top}&\left(\mathbf{b}_{j}^{a}-\mathbf{b}_{i}^{a}\right)^{\top}% \end{array}\right]^{\top}$	(15e)

where $\mathbf{g}$ is the gravity vector in world coordinates. In our system, we combine the initialization process in [4] and [62] to estimate $\mathbf{g}$ and initial values of biases.

The factor graph is optimized by the g2o toolbox [78]. The cost function is defined as:


$\displaystyle\mathbf{E}$	$\displaystyle=\sum{\lVert\mathbf{r}_{i,X_{p}}\lVert}^{2}_{\Sigma_{P}}+\sum{% \lVert\mathbf{r}_{i,L_{q}}\lVert}^{2}_{\Sigma_{L}}+\sum{\lVert\mathbf{r}_{% \Delta{{R}_{ij}}}\lVert}^{2}_{\Sigma_{\Delta{R}}}$
	$\displaystyle+\sum{\lVert\mathbf{r}_{\Delta{{v}_{ij}}}\lVert}^{2}_{\Sigma_{% \Delta{v}}}+\sum{\lVert\mathbf{r}_{\Delta{{p}_{ij}}}\lVert}^{2}_{\Sigma_{% \Delta{p}}}+\sum{\lVert\mathbf{r}_{b_{ij}}\lVert}^{2}_{\Sigma_{b}}.$	(16a)

We use the Levenberg-Marquardt optimizer to minimize the cost function. The point and line outliers are also rejected in the optimization if their corresponding residuals are too large.

V-F Initial Map

As described in Section III-B, our map is optimized offline. Therefore, keyframes, map points, and 3D lines will be saved to the disk for subsequent optimization when the visual odometry is finished. For each keyframe, we save its index, pose, keypoints, keypoint descriptors, line features, and junctions. The correspondences between 2D features and 3D features are also recorded. To make the map faster to save, load, and transfer across different devices, the above information is stored in binary form, which also makes the initial map much smaller than the raw data. For example, on the OIVIO dataset [79], our initial map size is only about 2% of the raw data size.

VI Map Optimization and Reuse

VI-A Offline Map Optimization

This part aims to process an initial map generated by our VO module and outputs the optimized map that can be used for drift-free relocalization. Our offline map optimization module consists of the following several map-processing plugins.

VI-A1 Loop Closure Detection

Similar to most current vSLAM systems, we use a coarse-to-fine pipeline to detect loop closures. Our loop closure detection relies on DBoW2 [61] to retrieve candidates and LightGlue [27] to match features. We train a vocabulary for the keypoint detected by our PLNet on a database that contains 35k images. These images are selected from several large datasets [80, 81, 82] that include both indoor and outdoor scenes. The vocabulary has 4 layers, with 10 nodes at each layer, so it contains 10,000 words.

Coarse Candidate Selection: This step aims to find three candidates most similar to a keyframe $\mathcal{K}_{i}$ from a set $\mathcal{S}_{1}=\left\{\mathcal{K}_{j}\mid j<i\right\}$ . Note that we do not add keyframes with an index greater than $\mathcal{K}_{i}$ to the set because this may miss some loop pairs. We build a co-visibility graph for all keyframes where two are connected if they obverse at last one feature. All keyframes connected with $\mathcal{K}_{i}$ will be first removed from $\mathcal{S}_{1}$ . Then we compute a similarity score between $\mathcal{K}_{i}$ and each keyframe in $\mathcal{S}_{1}$ using DBoW2. Only keyframes with a score greater than $0.3\cdot S_{max}$ will be kept in $\mathcal{S}_{1}$ , where $S_{max}$ is the maximum computed score. After that, we group the remaining keyframes. If two keyframes can observe more than 10 features in common, they will be in the same group. For each group, we sum up the scores of the keyframes in this group and use it as the group score. Only the top 3 groups with the highest scores will be retained. Then we select one keyframe with the highest score within the group as the candidate from each group. These three candidates will be processed in the subsequent steps.

Fine Feature Matching: For each selected candidate, we match its features with $\mathcal{K}_{i}$ . Then the relative pose estimation with outlier rejection will be performed. The candidate will form a valid loop pair with $\mathcal{K}_{i}$ if the inliers exceed 50.

VI-A2 Map Merging

A 3D feature observed by both frames of a loop pair is usually mistakenly used as two features. Therefore, in this part, we aim to merge the duplicated point and line features observed by loop pairs. For keypoint features, we use the above feature-matching results between loop pairs. If two matched keypoints are associated with two different map points, they will be regarded as duplicated features and only one map point will be retained. The correspondence between 2D keypoints and 3D map points, as well as the connections in the co-visibility graph, will also be updated.

For line features, we first associate 3D lines and map points through the 2D-3D feature correspondence and 2D point-line association built in Section V-B. Then we detect 3D line pairs that associate with the same map points. If two 3D lines share more than 3 associated map points, they will be regarded as duplicated and only one 3D line will be retained.

VI-A3 Global Bundle Adjustment

We perform the global bundle adjustment (GBA) after merging duplicated features. The form of the residuals and loss functions is similar to that in Section V-E; however, unlike Section V-E, all keyframes and features are jointly optimized, and loop closure residuals are also incorporated into the optimization process. In the initial stage of optimization, the re-projection errors of merged features are relatively large due to the VO drift error, so we first iterate 50 times without outlier rejection to optimize the variables to a good rough position, and then iterate another 40 times with outlier rejection.

We find that when the map is large, the initial 50 iterations can not optimize the variables to a satisfactory position. To address this, we first perform pose graph optimization (PGO) before the global bundle adjustment if a map contains more than 80k map points. Only the keyframe poses will be adjusted in the PGO and the cost function is defined as follows:

\mathbf{E}_{pgo}=\sum{\lVert\text{Log}\left(\Delta\tilde{\mathbf{T}}_{ij}^{-1}% \mathbf{T}_{i}^{-1}\mathbf{T}_{j}\right)\lVert^{2}_{\Sigma_{ij}}},

(17)

where $\mathbf{T}_{i}\in\text{SE(3)}$ and $\mathbf{T}_{j}\in\text{SE(3)}$ are poses of $\mathcal{K}_{i}$ and $\mathcal{K}_{j}$ , respectively. $\text{Log}(\cdot)=\text{log}(\cdot)^{{\vee}}:\text{SE(3)}\to\text{se(3)}$ is the Logarithm map proposed in [83]. $\mathcal{K}_{i}$ and $\mathcal{K}_{j}$ should either be adjacent or form a loop pair. After the pose graph optimization, the positions of map points and 3D lines will also be adjusted along with the keyframes in which they are first observed.

The systems with online loop detection usually perform the GBA after detecting a new loop, so they undergo repeated GBAs when a scene contains many loops. In contrast, our offline map optimization module only does the GBA after all loop closures are detected, allowing us to reduce the optimization iterations significantly compared with them.

VI-A4 Scene-Dependent Vocabulary

We train a junction vocabulary aiming to be used for relocalization. The vocabulary is built on the junctions of keyframes in the map so it is scene-dependent. Compared with the keypoint vocabulary trained in Section VI-A1, the database used to train the junction vocabulary is generally much smaller, so we set the number of layers to 3, with 10 nodes in each layer. The junction vocabulary is tiny, i.e., about 1 megabyte, as it only contains 1000 words. Its detailed usage will be introduced in Section VI-B.

VI-A5 Optimized Map

we save the optimized map for subsequent map reuse. Compared with the initial map in Section V-F, more information is saved such as the bag of words for each keyframe, the global co-visibility graph, and the scene-dependent junction vocabulary. In the meantime, the number of 3D features has decreased due to the fusion of duplicate map points and 3D lines. Therefore, the optimized map occupies a similar memory to the initial map.

VI-B Map Reuse

In this part, we present our illumination-robust relocalization using an existing optimized map. In most vSLAM systems, recognizing revisited places typically needs two steps: (1) retrieving $N_{kc}$ keyframe candidates and (2) performing feature matching and estimating relative pose. The second step is usually time-consuming, so selecting a proper $N_{kc}$ is very important. A larger $N_{kc}$ will reduce the system’s efficiency while a smaller $N_{kc}$ may prevent the correct candidate from being recalled. For example, in the loop closing module of ORB-SLAM3 [62], only the three most similar keyframes retrieved by DBoW2 [84] are used for better efficiency. It works well as two frames in a loop pair usually have a short time interval and thus the lighting conditions are relatively similar. But for challenging tasks, such as the day/night relocalization problem, retrieving so few candidates usually results in a low recall rate. However, retrieving more candidates needs to perform feature matching and pose estimation more times for each query frame, which makes it difficult to deploy for real-time applications.

To address this problem, we propose an efficient multi-stage relocalization method to make the optimized map usable in different lighting conditions. Our insight is that if most of the false candidates can be quickly filtered out, then the efficiency can be improved while maintaining or even improving the relocalization recall rate. Therefore, we add another step to the two-step pipeline mentioned above. We next introduce the proposed multi-stage pipeline in detail.

VI-B1 The First Step

This step is to retrieve the similar keyframes in the map that are similar to the query frame. For each input monocular image, we detect keypoints, junctions, and line features using our PLNet. Then a pipeline similar to the “coarse candidate selection” in Section VI-A1 will be executed, but with two differences. The first difference is that we do not filter out candidates using the co-visibility graph as the query frame is not in the graph. The second is that all candidates, not just three, will be retained for the next step.

VI-B2 The Second Step

This step filters out most of the candidates selected in the first step using junctions and line features. For query frame $\mathcal{K}_{q}$ and each candidate $\mathcal{K}_{b}$ , we first match their junctions by finding the same words through the junction vocabulary trained in Section VI-A4. We use $\left\{\left(q_{i},b_{i}\right)\mid q_{i}\in\mathcal{K}_{q},b_{i}\in\mathcal{K% }_{b}\right\}$ to denote the matching pairs. Then we construct two structure graphs, i.e., $G_{q}^{J}$ and $G_{b}^{J}$ , for $\mathcal{K}_{q}$ and $\mathcal{K}_{b}$ , respectively. The vertices are matched junctions, i.e., $V_{q}^{J}=\left\{q_{i}\mid q_{i}\in\mathcal{K}_{q}\right\}$ and $V_{b}^{J}=\left\{b_{i}\mid b_{i}\in\mathcal{K}_{b}\right\}$ . The related adjacent matrices that describe the connection between vertices are defined as:

\mathbf{A}_{q}^{J}=\left[\begin{array}[]{ccc}q_{11}&\cdots&q_{1n}\\ \vdots&\ddots&\vdots\\ q_{n1}&\cdots&q_{nn}\end{array}\right],\mathbf{A}_{b}^{J}=\left[\begin{array}[% ]{ccc}b_{11}&\cdots&b_{1n}\\ \vdots&\ddots&\vdots\\ b_{n1}&\cdots&b_{nn}\end{array}\right],

(18)

where $n$ is the number of junction-matching pairs. $q_{ij}$ is set to 1 if the junction $q_{i}$ and $q_{j}$ are two endpoints of the same line, otherwise, it is set to 0. The same goes for $b_{ij}$ . Then the graph similarity of $G_{q}^{J}$ and $G_{b}^{J}$ can be computed through:

S_{qb}^{G}=\sum{1-|q_{ij}-b_{ij}|}.

(19)

We also compute a junction similarity score $S_{qb}^{J}$ using the junction vocabulary and the DBoW2 algorithm. Finally, the similarity score of $\mathcal{K}_{q}$ and $\mathcal{K}_{b}$ is given by combining the keypoint similarity, junction similarity, and structure graph similarity:

S_{qb}=S_{qb}^{K}+S_{qb}^{J}\cdot\left(1+\frac{S_{qb}^{G}}{n}\right),

(20)

where $S_{qb}^{K}$ is the keypoint similarity of $\mathcal{K}_{q}$ and $\mathcal{K}_{b}$ computed in the first step. We compute the similarity score with the query frame for each candidate, and only the top 3 candidates with the highest similarity scores will be retained for the next step.

Analysis: We next analyze the second step. In the normal two-step pipeline that uses the DBoW method, only appearance information is used to retrieve candidates. The structural information, i.e., the necessity of the consistent spatial distribution of features between the query frame and candidate, is ignored in the first step and only used in the second step. However, in the illumination-challenging scenes, the structural information is essential as it is invariant to lighting conditions. In our second step, a portion of the structural information is utilized to select candidates. First, our PLNet uses the wireframe-parsing method to detect structural lines, which are more stable in illumination-challenging environments. Second, the similarity computed in (20) utilizes both the appearance information and the structural information. Therefore, our system can achieve good performance in illumination-challenging environments although using the efficient DBoW method.

The second step is also highly efficient. On the one hand, junctions are usually much less than keyponts. In normal scenes, our PLNet can detect more than 400 good keyponts but only about 50 junctions. On the other hand, the junction vocabulary is tiny and only contains 1,000 words. Therefore, matching junctions using DBoW2, constructing junction graphs, and computing similarity scores are all executed very efficiently. The experiment shows that the second step can be done within 0.7ms. More results will be presented in Section VII.

VI-B3 The Third Step

The third step aims to estimate the pose of the query frame. We first use LightGlue to match features between the query frame and the retained candidates. The candidate with the most matching inliers will be selected as the best candidate. Then based on the matching results of the query frame and the best candidate, we can associate the query keypoints with map points. Finally, a PnP problem is solved with RANSAC to estimate the pose. The pose will be considered valid if the inliers exceed 20.

VII Experiments

In this section, we present the experiment results. The remainder of this section is organized as follows. In Section VII-A, we evaluate the line detection performance of the proposed PLNet. In Section VII-B, we evaluate the mapping accuracy of our system by comparing it with other SOTA VO or SLAM systems. In Section VII-C, we test our system in three illumination-challenging scenarios: onboard illumination, dynamic illumination, and low illumination. The comparison of these three scenarios will show the excellent robustness of our system. In Section VII-D, we assess the performance of the proposed map reuse module in addressing the day/night localization challenges, i.e., mapping during the day and relocalization at night. In Section VII-E, we present the ablation study. In Section VII-F, we evaluate the efficiency.

We use two platforms in the experiments. Most evaluations are conducted on a personal computer with an Intel i9-13900 CPU and a NVIDIA GeForce RTX 4080 GPU. In the efficiency experiment in Section VII-F, we also deploy AirSLAM on an NVIDIA Jetson Orin to prove that our system can achieve good accuracy and efficiency on the embedded platform.

VII-A Line Detection

In this section, we evaluate the performance of our PLNet. As described in Section IV-B, we follow SuperPoint [24] to design and train our backbone and keypoint detection module, and we can even use the pre-trained model of SuperPoint, therefore, we do not evaluate the keypoint detection anymore. Instead, we assess the performance of the line detection module by comparing it with SOTA systems, as it is trained with a fixed backbone, which is different from other line detectors.

VII-A1 Datasets and Baseliens

This experiment is conducted on the Wireframe dataset [69] and the YorkUrban dataset [85]. The Wireframe dataset contains 5,000 training images and 462 test images that are all collected in man-made environments. We use them to train and test our PLNet. To validate the generalization ability, we also compare various methods on the YorkUrban dataset, which contains 102 test images. All the training and test images are resized to $512\times 512$ . We compare our method with AFM [86], AFM++ [87], L-CNN [67], LETR [88], F-Clip [89], ELSD [90], and HAWPv2 [51].

VII-A2 Evaluation Metrics

We evaluate both the accuracy and efficiency of the line detection. For accuracy, the structural average precision (sAP) [67] is the most challenging metric of the wireframe parsing task. It is inspired by the mean average precision (mAP) commonly used in object detection. A detected line $\tilde{l}=\left(\tilde{\mathbf{p}}_{1},\tilde{\mathbf{p}}_{2}\right)$ is a True Positive (TP) if and only if it satisfies the following:

\min_{\left(\mathbf{p}_{1},\mathbf{p}_{2}\right)\in\mathcal{L}}\|\mathbf{p}_{1% }-\tilde{\mathbf{p}}_{1}\|^{2}+\|\mathbf{p}_{2}-\tilde{\mathbf{p}}_{2}\|^{2}% \leq\vartheta,

(21)

where $\mathcal{L}$ is the set of ground truth, and $\vartheta$ is a predefined threshold. We follow the previous methods to set $\vartheta$ to 5, 10, and 15, then the corresponding sAP scores are represented by $\text{sAP}^{5}$ , $\text{sAP}^{10}$ , and $\text{sAP}^{15}$ , respectively. For efficiency, we use the frames per second (FPS) to evaluate various systems.

TABLE I: The comparison of various wireframe parsing methods. The top two results are highlighted and underlined in order.

Methods¹	Wireframe Dataset			YorkUrban Dataset			FPS
	$\text{sAP}^{5}$	$\text{sAP}^{10}$	$\text{sAP}^{15}$	$\text{sAP}^{5}$	$\text{sAP}^{10}$	$\text{sAP}^{15}$
AFM [86]	18.5	24.4	27.5	7.3	9.4	11.1	10.4^*
AFM++ [87]	27.7	32.4	34.8	9.5	11.6	13.2	8.0^*
L-CNN [67]	59.7	63.6	65.3	25.0	27.1	28.3	29.6
LETR [88]	59.2	65.2	67.7	23.9	27.6	29.7	2.0
F-Clip [89]	64.3	68.3	69.1	28.6	31.0	32.4	82.3
ELSD [90]	64.3	68.9	70.9	27.6	30.2	31.8	42.6^*
HAWPv2 [51]	65.7	69.7	71.3	28.9	31.2	32.6	85.2
HAWPv2 [51]	63.6	67.7	69.5	26.6	29.0	30.3	85.2
PLNet (Ours)	65.2	69.2	70.9	29.3	32.0	33.5	79.4^†

1

Methods represented using the font are evaluated with color inputs and methods represented using the font are evaluated with grayscale inputs.
*

These numbers are cited from the original paper.
†

The FPS of our PLNet is the speed of detecting both keypoints and lines for the Python implementation.

VII-A3 Results and Analysis

We present the results in Table I. The top-performing results are distinctly highlighted and underlined in order. It can be seen that our PLNet achieves the second-best performance on the Wireframe dataset and the best performance on the YorkUrban dataset. On the Wireframe dataset, HAWPv2, the best method, only outperforms our PLNet by 0.5, 0.5, and 0.4 points in $\text{sAP}^{5}$ , $\text{sAP}^{10}$ , and $\text{sAP}^{15}$ , respectively. On the YorkUrban dataset, our method surpasses the second-best method by 0.4, 0.8, and 0.9 points on these three metrics, respectively. Overall, we can conclude that our PLNet achieves comparable accuracy with SOTA methods.

TABLE II: Translational error (RMSE) on the EuRoC dataset (unit: m), the best results are in bold.

		Sensors			Features		Sequence
		M¹	S¹	I¹	P¹	L¹	MH01	MH02	MH03	MH04	MH05	V101	V102	V103	V201	V202	V203	Avg²
Without Loop	DROID-SLAM [18]	✓	✕	✕	✓	✕	0.163	0.121	0.242	0.399	0.270	0.103	0.165	0.158	0.102	0.115	0.204	0.186
	VINS-Fusion [4]	✕	✓	✓	✓	✕	0.163	0.178	0.316	0.331	0.175	0.102	0.099	0.112	0.110	0.124	0.252	0.178
	Struct-VIO [47]	✓	✕	✓	✓	✓	0.119	0.100	0.283	0.275	0.256	0.075	0.197	0.161	0.081	0.152	0.177	0.171
	PLF-VINS [91]	✕	✓	✓	✓	✓	0.143	0.178	0.221	0.240	0.260	0.069	0.099	0.166	0.083	0.125	0.183	0.161
	Kimera-VIO [63]	✕	✓	✓	✓	✕	0.110	0.100	0.160	0.240	0.350	0.050	0.080	0.070	0.080	0.100	0.210	0.141
	OKVIS [92]	✕	✓	✓	✓	✕	0.197	0.108	0.122	0.138	0.272	0.040	0.067	0.120	0.055	0.150	0.240	0.137
\rowcolorgray!20	AirVIO (Ours)	✕	✓	✓	✓	✓	0.074	0.060	0.114	0.167	0.125	0.033	0.132	0.238	0.036	0.083	0.168	0.113
With Loop	iSLAM [19]	✕	✓	✓	✕	✕	0.302	0.460	0.363	0.936	0.478	0.355	0.391	0.301	0.452	0.416	1.133	0.508
	UV-SLAM [93]	✓	✕	✓	✓	✓	0.161	0.179	0.176	0.291	0.189	0.077	0.071	0.094	0.078	0.085	0.125	0.139
	Kimera [94]	✕	✓	✓	✓	✕	0.090	0.110	0.120	0.160	0.180	0.050	0.060	0.130	0.050	0.070	0.230	0.114
	OpenVINS [95]	✕	✓	✓	✓	✕	0.072	0.143	0.086	0.173	0.247	0.055	0.060	0.059	0.054	0.047	0.141	0.103
	Structure-PLP-SLAM [96]	✕	✓	✕	✓	✓	0.046	0.056	0.048	0.071	0.071	0.091	0.066	0.065	0.061	0.061	0.166	0.073
	VINS-Fusion [4]	✕	✓	✓	✓	✕	0.052	0.040	0.052	0.124	0.088	0.046	0.053	0.108	0.040	0.081	0.098	0.071
	Maplab [97]	✓	✕	✓	✓	✕	0.041	0.026	0.045	0.110	0.067	0.039	0.045	0.080	0.053	0.084	0.196	0.071
	SP-Loop [98]	✕	✓	✓	✓	✕	0.070	0.044	0.068	0.100	0.090	0.042	0.034	0.082	0.038	0.054	0.100	0.066
	PL-SLAM [5]	✕	✓	✓	✓	✓	0.042	0.052	0.040	0.064	0.070	0.042	0.046	0.069	0.061	0.057	0.126	0.061
	Basalt [15]	✕	✓	✓	✓	✕	0.080	0.060	0.050	0.100	0.080	0.040	0.020	0.030	0.030	0.020	0.059	0.052
	DVI-SLAM [99]	✕	✓	✓	✓	✕	0.042	0.046	0.081	0.072	0.069	0.059	0.034	0.028	0.040	0.039	0.055	0.051
	ORB-SLAM3 [62]	✕	✓	✓	✓	✕	0.036	0.033	0.035	0.051	0.082	0.038	0.014	0.024	0.032	0.014	0.024	0.035
	DROID-SLAM [18]	✕	✓	✕	✓	✕	0.015	0.013	0.035	0.048	0.040	0.037	0.011	0.020	0.018	0.015	0.017	0.024
\rowcolorgray!20	AirSLAM (Ours)	✕	✓	✓	✓	✓	0.019	0.013	0.025	0.056	0.051	0.032	0.014	0.025	0.014	0.018	0.068	0.030

1

M denotes the monocular camera, S denotes the stereo camera, I denotes the IMU, P denotes the keypoint feature, and L denotes the line feature.
2

The average error of the successful sequences.

Generalizability Analysis

We can also conclude that the generalizability of our PLNet is better than other methods. This conclusion is based on two comparative results between our method and HAWPv2, which is the current best wireframe parsing method. First, on the Wireframe dataset, which also serves as the training dataset, HAWPv2 outperforms our PLNet. However, on the YorkUrban dataset, it is surpassed by our method. Second, the previous methods are all evaluated with color inputs in their original paper. Considering that grayscale images are also widely used in vSLAM systems, we train our PLNet with grayscale inputs. We also retrain HAWPv2 and evaluate it using grayscale images for comparison. The result shows that our PLNet significantly outperforms HAWPv2 on both datasets when the inputs are grayscale images. We think the better generalizability comes from our backbone. Other methods are trained on only 5,000 images of the Wireframe dataset, while our backbone is trained on a large diverse dataset, which gives it a stronger feature extraction capability.

Efficiency Analysis

It is worth noting that the FPS of our method in Table I is the speed of detecting both keypoints and lines, while other methods can only output lines. Nevertheless, our PLNet remains one of the fastest methods due to the design of the shared backbone. PLNet processes each image only $0.86\mathrm{m}\mathrm{s}$ slower than the fastest algorithm, i.e., HAWPv2.

Note that the selected baselines are all wireframe parsing methods. The non-wireframe-parsing line detection methods, such as SOLD2 [49] and DeepLSD [68], are not added to the comparison as it is unfair to do so. As shown in Fig. 7, the wireframe parsing techniques aim to detect structural lines. They are usually evaluated using the sAP and compared with the ground truth. The non-wireframe-parsing methods can detect more general lines with significant gradients, however, they often detect a long line segment as multiple short line segments, which results in their poor sAP performance.

TABLE III: RMSE (m) on the OIVIO dataset, the best results are in bold. F represents tracking failure or large dirft error.

Sequence	Kimera	PL-	Basalt	DROID-	ORB-	Ours
		SLAM		SLAM	SLAM3	Ours
MN_015_GV_01	0.169	1.238	0.216	0.286	0.066	0.054
MN_015_GV_02	2.408	0.853	0.153	0.081	0.069	0.052
MN_050_GV_01	F	1.143	0.186	0.173	0.063	0.062
MN_050_GV_02	F	0.921	0.103	0.080	0.053	0.048
MN_100_GV_01	F	0.831	0.197	0.184	0.051	0.064
MN_100_GV_02	2.238	0.609	0.092	0.090	0.063	0.042
TN_015_GV_01	0.300	1.579	0.148	0.188	0.053	0.057
TN_050_GV_01	0.280	1.736	0.521	0.313	0.082	0.065
TN_100_GV_01	0.264	1.312	0.116	0.179	0.086	0.078
Average	-	1.358	0.192	0.175	0.065	0.058

VII-B Mapping Accuracy

In this section, we evaluate the mapping accuracy of our system under well-illuminated conditions. The EuRoC dataset [100] is one of the most widely used datasets for vSLAM, so we use it for the accuracy evaluation. We compare our method only with systems capable of estimating the real scale, so the selected baselines are either visual-inertial systems, stereo systems, or those incorporating both. We incorporate traditional methods, learning-based systems, and hybrid systems into the comparison. We use AirVIO to represent our system without loop detection. The root mean square error (RMSE) is used as the metric and computed by the evo [101].

The comparison results are presented in Table II. We evaluate the systems with and without loop detection on 11 sequences. For the comparison without loop detection, our method outperforms other VIO methods: we achieve the best results on 8 out of 11 sequences. The average translational error of AirVIO is 20% lower than the second-best system, i.e., Kimera-VIO. For the comparison with loop detection, our system achieves comparable performance with ORB-SLAM3 and DROID-SLAM, which are SOTA traditional and learning-based visual SLAM systems, respectively. Another conclusion that can be drawn from Table II is that loop detection significantly improves the accuracy of our system. The average error of our system decreases by 74% after the loop detection.

VII-C Mapping Robustness

Although many vSLAM systems have achieved impressive accuracy as shown in the previous Section VII-B, complex lighting conditions usually render them ineffective when deployed in real applications. Therefore, in this section, we evaluate the robustness of various vSLAM systems to lighting conditions. We select several representative SOTA systems as baselines. They are ORB-SLAM3 [62], an accurate feature-based system, DROID-SLAM [18], a learning-based hybrid system, Basalt [15], a system that achieves illumination-robust optical flow tracking with the LSSD algorithm, Kimera [94], a direct visual-inertial SLAM system, and OKIVS [92], a system proven to be illumination-robust in our previous work [22]. We test these methods and our system in three scenarios: onboard illumination, dynamic illumination, and low-lighting environments. We first present the evaluation results in Section VII-C1, Section VII-C2, and Section VII-C3, respectively, and then give an overall analysis in Section VII-C4.

TABLE IV: RMSE (m) on the UMA-VI dataset, the best results are in bold. F represents tracking failure or large dirft error.

Sequence	PL-	ORB-	Basalt	OKVIS	DROID-	Ours
	SLAM	SLAM3			SLAM	Ours
conference-csc1	2.697	F	1.270	1.118	0.711	0.490
conference-csc2	1.596	F	0.682	0.470	0.135	0.091
conference-csc3	F	0.426	0.469	0.088	0.724	0.088
lab-module-csc-rev	F	0.063	0.486	0.861	0.364	0.504
lab-module-csc	F	F	0.403	0.579	0.319	0.979
long-walk-eng	F	F	5.046	3.005	F	1.801
third-floor-csc1	4.478	0.863	0.420	0.287	0.048	0.070
third-floor-csc2	6.068	0.149	0.590	0.271	0.890	0.127
two-floors-csc1	F	F	0.760	0.154	0.341	0.066
two-floors-csc2	F	F	1.211	0.679	0.299	0.190

VII-C1 Onboard Illumination

We utilize the OIVIO dataset [79] to assess the performance of various systems with onboard illumination. The OIVIO dataset collects visual-inertial data in tunnels and mines. In each sequence, the scene is illuminated by an onboard light of approximately 1300, 4500, or 9000 lumens. We used all nine sequences with ground truth acquired by the Leica TCRP1203 R300. As no loop closure exists in the selected sequences, it is fair to compare the VO systems with the SLAM systems. The performance of translational error is presented in Table III. The most accurate results are in bold, and F represents that the tracking is lost for more than $10\mathrm{s}$ or the RMSE exceeds $10\mathrm{m}$ . It can be seen that our method achieves the most accurate results on 7 out of 9 sequences and the smallest average error. The onboard illumination has almost no impact on our AirSLAM and ORB-SLAM3, however, it reduces the accuracy of OKVIS, Basalt, and PL-SLAM. Kimera suffers a lot from such illumination conditions. It even experiences tracking failures and large drift errors on three sequences.

We show a comparison of our method with selected baselines on the OIVIO TN_100_GV_01 sequence in Fig. 8. In this case, the robot goes through a mine with onboard illumination. The distance is about 150 meters and the average speed is about 0.84m/s. The plot shows the proportion of pose errors on the horizontal axis that are less than the given alignment error threshold on the horizontal axis. Our system achieves a more accurate result than other systems on this sequence.

VII-C2 Dynamic Illumination

The UMA-VI dataset is a visual-inertial dataset gathered in challenging scenarios with handheld custom sensors. We selected sequences with illumination changes to evaluate our system. As shown in Fig. 9, it contains many subsequences where the image suddenly goes dark due to turning off the lights. In the subsequence of the top row in Fig. 9, the total darkness lasted for 0.16 s , and low illumination lasted for 0.64 s . In the subsequence of the bottom row, the total darkness lasted for 0.32 s , and low illumination lasted for 0.4 s . It is more challenging than the OIVIO dataset for vSLAM systems. As the ground-truth poses are only available at the beginning and the end of each sequence, we disabled the loop closure part from all the evaluated methods.

The translational errors are presented in Table IV. The most accurate results are in bold, and F represents that the tracking is lost for more than $10\mathrm{s}$ or the RMSE exceeds $10\mathrm{m}$ . It can be seen that our AirSLAM outperforms other methods. Our system achieves the best results on 7 out of 10 sequences. The UMA-VI dataset is so challenging that PL-SLAM and ORB-SLAM3 fail on most sequences. Although OKVIS and Basalt, like our system, can complete all the sequences, their accuracy is significantly lower than ours. The average RMSEs of OKVIS and Basalt are around $1.134\mathrm{m}$ and $0.724\mathrm{m}$ , respectively, while ours is around $0.441\mathrm{m}$ , which means our average error is only 62.6% of OKVIS and 38.9% of Basalt.

VII-C3 Low Illumination

Inspired by [16], we process a publicly available sequence by adjusting the brightness levels of its images. Then the processed sequences are used to evaluate the performance of various SLAM systems in low-illumination conditions. We select the “V2_01_easy” of the EuRoC dataset as the base sequence. The image brightness is adjusted using the gamma nonlinearity:

V_{out}=AV_{in}^{\frac{1}{\gamma}},

(22)

where $V_{in}$ and $V_{out}$ are normalized input and output pixel values, respectively. $A$ and $\gamma$ control the maximum brightness and contrast. We set 12 adjustment levels and use $L_{i}$ to denote the $i$ th level. $L_{0}$ represents the original sequence, i.e., $A_{0}=1$ and $\gamma_{0}=1$ . When $i\in[1,12]$ , $A_{i}$ and $\gamma_{i}$ alternate in descending order to make the image progressively darker. Fig. 10 shows the values of $A_{i}$ and $\gamma_{i}$ , and the processed image in each level. We name the processed dataset “Dark EuRoC”.

We present the comparison result in Fig. 11. As the errors of PL-SLAM are much greater than other methods, we do not show its result. Tracking failures and large drift errors, i.e., the RMSE is more than 1 m , are also marked. It can be seen that low illumination has varying degrees of impact on different systems. In this scenario, optical flow-based methods demonstrate greater stability compared to feature-based methods. Nonetheless, our approach achieves comparable performance to Basalt and DROID-SLAM, which utilize sparse and dense optical flow, respectively. The RMSEs of ORB-SLAM3 and OKVIS increase as the brightness decreases. They even experience tracking failures or large drift errors on $L_{10}$ and $L_{11}$ .

TABLE V: The relocalization comparison on the TartanAir Day/Night Localization dataset, the best results are in bold.

	Global Feature¹ +	FPS²	Recall Rate of Sequences (%)
	Matching¹ + Local Feature¹		P000	P001	P002	P003	P004	P005	P006	P007	P008	P009	P010	P011	Avg
Hloc[9]	NV + NN + SOSNet	12.4	4.3	10.6	42.5	10.5	33.9	26.1	7.2	27.2	32.7	6.5	24.9	7.7	19.5
	NV + NN + D2-Net	10.34	22.6	20.7	87.4	19.0	20.1	65.9	27.5	49.0	69.0	64.3	71.4	23.4	45.0
	NV + NN + R2D2	8.14	15.7	32.3	92.4	22.7	52.7	75.5	14.4	84.9	54.8	67.7	60.7	26.0	50.0
	NV + NN + SP	30.2	69.2	32.8	88.7	17.2	53.5	72.6	31.1	83.8	72.6	69.6	89.5	34.0	59.6
	NV + AL + SOSNet	6.1	29.0	30.3	55.6	19.6	49.3	42.4	12.7	45.8	38.9	25.2	47.0	11.3	34.0
	NV + LG + SIFT	10.8	45.9	31.3	83.0	58.5	52.4	64.7	18.8	78.8	57.2	43.6	79.8	33.6	53.7
	NV + LG + DISK	12.9	22.9	35.4	97.9	62.9	60.9	84.2	33.8	89.9	90.0	85.5	74.9	33.4	64.3
	NV + LG + SP	20.6	94.7	35.4	98.4	70.9	63.5	85.8	33.8	95.9	97.1	91.4	99.2	49.8	76.3
	DIR + LG + SP	19.1	87.2	28.8	99.5	69.8	64.2	85.2	32.8	96.7	97.1	88.1	96.0	50.4	74.7
	OpenIBL + LG + SP	21.3	88.6	35.4	97.7	70.3	64.3	86.1	34.7	96.4	91.3	95.7	99.7	45.1	75.4
	EP + LG + SP	22.3	99.7	36.4	100.0	68.3	64.5	85.8	34.4	96.5	94.0	85.1	96.9	45.1	75.6
AirSLAM (Ours)		48.8	89.5	78.8	87.8	94.6	88.2	85.6	70.0	72.8	83.4	78.8	81.4	54.5	80.5

1

NV is NetVLAD, DIR is AP-GeM/DIR, EP is EigenPlace, NN is Nearest Neighbor Matching, AL is AdaLAM, LG is LightGlue, and SP is SuperPoint.
2

Running time of relocalization measured in Frame Per Second (FPS).

VII-C4 Result Analysis

We think the above three lighting conditions affect a visual system in different ways. The OIVIO dataset collects sequences in dark environments with only onboard illumination, so the light source moves along with the robot, which results in two effects. On the one hand, the lighting is uneven in the environment. The direction the robot is facing and the area closer to the robot is brighter than other areas. The uneven image brightness may lead to the uneven distribution of features. On the other hand, when the robot moves, the lighting of the same area will change, resulting in different brightness in different frames. The assumption of brightness constancy in some systems will be affected in such conditions. The UMA-VI dataset is collected under dynamic lighting conditions, where the dynamic lighting is caused by the sudden switching of lights or moving between indoor and outdoor environments. The image brightness variations in the UMA-VI dataset are much more intense than those in the OIVIO dataset, which may even make the extracted feature descriptor inconsistent in consecutive frames. In low-illumination environments, both the brightness and contrast of captured images are very low, making the vSLAM system more difficult to detect enough good features and extract distinct descriptors.

We summarize the above experiment results with the following conclusions. First, the systems that use descriptors for matching are more robust than the direct methods in illumination-dynamic environments. On the OIVIO dataset, our AirSLAM and ORB-SLAM3 outperform the other systems significantly. On the UMA-VI dataset, our method and OKVIS achieve the best and the second-best results, respectively. This is reasonable as the brightness constancy assumption constrains the direct methods. Despite Basalt uses LSSD to enhance its optical flow tracking, its accuracy still decreases significantly in these two scenarios. Second, the direct methods are more stable in the low illumination environments. This is because descriptor-based SLAM systems rely on enough high-quality features and descriptors, which are difficult to obtain on low brightness and contrast images. The direct methods use corners that are easier to detect, so the low illumination has less impact on them. Third, thanks to the robust feature detection and matching, the illumination robustness of our system is far better than that of other systems. AirSLAM achieves relatively high accuracy in these three illumination-challenging scenarios.

VII-D Map Reuse

VII-D1 Dataset

As mapping and relocalization in the same well-illuminated environment are no longer difficult for many current vSLAM systems, we only evaluate our map reuse module under illumination-challenging conditions, i.e., the day/night localization task. We use the “abandoned_factory” and “abandoned_factory_night” scenes in the TartanAir dataset [60] as they can provide consecutive stereo image sequences for the SLAM mapping and the corresponding accurate ground truth for the evaluation. The images in these two scenes are collected during the day and at night, respectively. We use the sequences in the “abandoned_factory” scene to build maps. Then, for each mapping image, the images with a relative distance of less than $3\mathrm{m}$ and a relative angle of less than $15\mathrm{\SIUnitSymbolDegree}$ from it in the “abandoned_factory_night” scene are selected as query images. We call the generated mapping and relocalization sequences the “TartanAir Day/Night Localization” dataset. Fig. 12 shows some sample pairs for mapping and relocalization. It is worth noting that due to the differences in capture viewpoints and scene depths, the query image selected based on the relative distance and angle may not always have valid overlapping with the mapping images.

VII-D2 Baseline

We have tried several traditional vSLAM systems, e.g., ORB-SLAM3 [62], and SOTA learning-based one-stage relocalization methods, e.g., ACE [102] on the TartanAir Day/Night Localization dataset, and find they perform badly: their relocalization recall rates are below 1%. Therefore, we only present the comparison results of our systems and some VPR methods. The Hloc toolbox [9] uses the structure from motion (SFM) method to build maps and has integrated many image retrieval methods, local feature extractors, and matching methods for localization. We mainly compare our system with these methods. Specifically, the NetVLAD [64], AP-GeM/DIR [103], OpenIBL [104], and EigenPlaces [105] are used to extract global features, the SuperPoint [24], SIFT [106], D2-Net [41], SOSNet [107], R2D2 [39], and DISK [40] are used to extract local features, and the LightGlue [27], AdaLAM [108] and Nearest Neighbor Matching are used to match features. We combine these methods into various “global feature detection + local feature matcher + local feature detection” pipelines for the mapping and relocalization. We do not add DXSLAM[37] to the comparison as it uses NetVLAD and SuperPoint with the binary descriptor, which has been included in the above pipelines.

VII-D3 Results

To achieve a fair comparison and balance the efficiency and effectiveness, we extract 400 local features and retrieve 3 candidates in the coarse localization stage for all methods. Unlike vSLAM systems that have the keyframe selection mechanism, the SFM mapping optimizes all input images, so it is very slow when mapping with original sequences. Therefore, to accelerate the SFM mapping while ensuring its mapping frames are more than our keyframes, we sample its mapping sequences by selecting one frame every four frames. We show a point-line map built by our AirSLAM in Fig. 13. The relocalization results are presented in Table V. We give the running time (FPS) and the relocalization recall rate of each method. We define a successful relocalization if the estimated pose of the query frame is within $2\mathrm{m}$ and $15\mathrm{\SIUnitSymbolDegree}$ of the ground truth. It can be seen that our AirSLAM outperforms other methods in terms of both efficiency and recall rate. Our system achieves the best results on 5 out of 11 sequences. AirSLAM has an average recall rate 4.2% higher than the second-best algorithm and is about 2.4 times faster than it.

TABLE VI: Ablation study of line features for mapping. We present the RMSE (m) of our system with and without line features.

	EuRoC	OIVIO	UMA-VI	Dark-EuRoC
w/o Lines	0.046	0.072	0.545	0.035
Ours	0.030	0.058	0.441	0.029

VII-D4 Analysis

We find that our system is more stable than the VPR methods on the TartanAir Day/Night Localization dataset. On several sequences, e.g., P000, P002, and P010, some VPR methods achieve remarkable results, with recall rates close to 100%. However, on some other sequences, e.g., P001 and P006, their recall rates are less than 40%. In contrast, our system maintains a recall rate of 70% to 90% on most sequences.

To clarify this, we examined each sequence and roughly categorized the images into three types. As shown in Fig. 12, the first type of image is captured with the camera relatively far from the features, so there is a significant overlap between each day/night image pair. Additionally, these images contain distinct buildings and landmarks. The second type of image pair also has a significant overlap. However, the common regions in these image pairs do not contain large buildings and landmarks. The third type of image is captured with the camera very close to the features. Although the camera distance between the day/night image pair is not large, they have almost no overlap, making their local feature matching impossible.

We find that the VPR methods perform very well on the first type of image but perform poorly on the second type of image. Therefore, their recall rates are very low on P001 and P006, which contain more of the second type of images. This may be because their global features are usually trained on datasets that have a lot of distinct buildings and landmarks, which makes them rely more on such semantic cues to retrieve similar images. By contrast, our system is based on the DBoW method, which only utilizes the low-level local features of images, so it achieves similar performance on the first and second types of images. This also proves the strong generalizability of our system. However, neither our system nor VPR methods can process the pairs with little overlap due to relying on local feature matching. Such image pairs are abundant in the P011.

TABLE VII: Relocalization ablation study. The recall rates (%) of our system with and without the structure graph during map reuse.

Seq.	$N_{\mathcal{C}}$ =3		$N_{\mathcal{C}}$ =5		$N_{\mathcal{C}}$ =10
	w/o G.	Ours	w/o G.	Ours	w/o G.	Ours
P000	77.9	89.5	84.3	92.5	88.5	94.0
P001	69.2	78.8	79.3	83.8	85.4	87.9
P002	75.4	87.8	80.9	87.8	85.3	88.0
P003	86.4	94.6	92.4	95.5	93.9	95.9
P004	81.5	88.2	85.3	90.9	89.0	92.1
P005	78.4	85.6	82.9	88.8	88.4	91.5
P006	62.4	70.0	72.3	72.7	78.6	77.2
P007	60.6	72.8	63.7	73.7	69.0	76.2
P008	77.2	83.4	80.0	84.4	81.7	85.8
P009	63.4	78.8	70.6	81.7	76.9	84.8
P010	70.7	81.4	75.4	84.0	80.9	86.9
P011	48.1	54.5	55.1	58.9	62.0	64.0
Avg.	70.9	80.5	76.9	82.9	81.6	85.4

VII-E Ablation Study

VII-E1 Line Feature In Mapping

In this section, we evaluate the impact of line features on mapping performance. To this end, we remove line features from our system and compute the RMSE on the EuRoC, OIVIO, UMA-VI, and Dark-EuRoC datasets. The results are summarized in Table VI, where “w/o Lines” refers to our system without line features. As shown in the table, incorporating line features reduces the RMSE by 34.8%, 19.4%, 19.1%, and 17.1% on the four datasets, respectively. These results demonstrate that line features substantially enhance our system and improve the mapping accuracy. We believe this improvement comes from two aspects. On the one hand, a line feature is more likely to be observed across multiple frames compared to a point feature, thereby constraining more frames during pose optimization. This consistent constraint across multiple frames enhances the accuracy of pose estimation. On the other hand, we use the tracking results of a set of points to track a line feature, which makes the tracking of a single line feature more stable than that of a single point.

VII-E2 Structure Graph In Relocalization

We also verify the effectiveness of the proposed relocalization method. This experiment is conducted on the TartanAir Day/Night Localization dataset. We compare the systems with and without the second step proposed in Section VI-B2. The results are presented in Table VII, where “w/o G.” denotes our system without the structure graph, and $N_{\mathcal{C}}$ denotes the candidate number for local feature matching. It shows that using junctions, line features, and structure graphs to filter out relocalization candidates significantly improves recall rates. AirSLAM outperforms w/o G. across all sequences, and when $N_{\mathcal{C}}$ is 3, 5, and 10, the average improvements are 9.6%, 6.0%, and 3.8%, respectively, which demonstrates the effectiveness of the proposed method.

VII-F Efficiency Analysis

Efficiency is essential for robotics applications, so we also evaluate the efficiency of the proposed system. We first compare the running time of our AirSLAM with several SOTA VO and SLAM systems on a computer with an Intel i9-13900 CPU and an NVIDIA RTX 4080 GPU. Then we deploy AirSLAM on an NVIDIA Jetson AGX Orin to verify the efficiency and performance of our system on the embedded platform.

VII-F1 Odometry Efficiency

The VO/VIO efficiency experiment is conducted on the MH_01_easy sequence of the EuRoC dataset. We compare our AirSLAM with several SOTA systems. The loop detection and GBA are disabled from all the systems for a fair comparison. The metrics are the runtime per frame and the CPU usage. The results are presented in Fig. 14a, where 100% CPU usage means 1 CPU core is utilized. It should be additionally noted that DROID-SLAM actually uses 32 CPU cores, and its CPU usage in Fig. 14a is only for a compact presentation. Our system is very efficient, achieving a rate of 73 FPS. In addition, due to extracting and matching features using the GPU, our system requires relatively less CPU resources. We also test the GPU usage. It shows that DROID-SLAM requires about 8GB GPU memory, while our AirSLAM only requires around 3GB.

To investigate the contributions of the CPU and GPU to the efficiency of our system, we measured the total runtime of modules executed on the GPU and CPU using the MH_01_easy sequence and compared it with ORB-SLAM3. The loop detection and GBA are disabled from both systems. The results are presented in Table VIII. It can be seen that our feature detection and matching are highly efficient due to the use of the GPU resources and our system design, which processes only the left image in non-keyframes. Note that we perform the initial pose estimation in the back-end, whereas ORB-SLAM3 includes this process in the front-end. This leads to the differences in the runtime of the “Others” and “Back-end” components between the two systems.

TABLE VIII: Evaluation of CPU and GPU contributions to our system.

	Front-end		Back-end
	FD+FM¹	Others
ORB-SLAM3	\cellcolorgreen!2032.9 s	\cellcolorgreen!2025.2 s	\cellcolorgreen!2039.6 s
Ours	\cellcolorred!2017.5 s	\cellcolorgreen!208.4 s	\cellcolorgreen!2049.3 s

1

Total runtime for feature detection and feature matching. and represent the modules running on the GPU and CPU, respectively.

VII-F2 Mapping Efficiency

We also evaluate the mapping time, i.e., the total runtime for building the initial map and offline optimizing the map. As we compare our system with Hloc using the TartanAir dataset in the map reuse experiment, we use the same baseline and dataset in this experiment. The average mapping time per frame may differ when the map size varies, therefore, we measure the mapping time with different numbers of input images. The results are presented in Fig. 14b, where $n\times$ means our system is $n$ times faster than Hloc. It can be seen that our system is much more efficient than Hloc, especially as the input images increase. Besides, Hloc can only use monocular images to build a map without the real scale, and the map only contains point features, while our system can build the point-line map and estimate the real scale using a stereo camera and an IMU. Therefore, our system is more stable and practical for robotics applications than Hloc.

VII-F3 Embedded Platform

TABLE IX: Efficiency comparison of our system on two platforms.

Platform	Runtime		CPU	GPU
	VIO (FPS)	Optim¹. ( s )	Usage (%)	Usage (MB)
Ours-Jetson	40.3	57.8	224.7	989
Ours-PC	73.1	55.5	217.8	3076

1

The runtime of the offline map optimization.

We use 8 sequences in the EuRoC dataset to evaluate the efficiency of AirSLAM on the embedded platform. The suffixes, i.e., “-Jetson” and “-PC”, are added to distinguish results on different platforms. On the Jetson, we modify three parameters in our system to improve efficiency. First, we reduced the number of detected keypoints from 350 to 300. Second, we change two parameters in Section V-D to make keyframes sparser, i.e., $\alpha_{1}$ and $\alpha_{2}$ are changed from 0.65 and 0.1 to 0.5 and 0.2, respectively. The other parameters are the same on these two platforms. The comparisons of efficiency and absolute trajectory error (ATE) are presented in Table IX and Fig. 15, respectively. Our AirSLAM can run at a rate of $40\mathrm{Hz}$ on the Jetson while only consuming 2 CPU cores and 989MB GPU memory. We find the runtime of the offline map optimization is very close on these two platforms. This is because AirSLAM-Jetson selects fewer keyframes than AirSLAM-PC, so the loop closure and GBA are faster.

VIII Conclusion

In this work, we presented an efficient and illumination-robust hybrid vSLAM system. To be robust to challenging illumination, the proposed system employs a CNN to detect both keypoints and structural lines. Then these two features are associated and tracked using a GNN. To enhance the efficiency, we proposed PLNet, a unified model capable of simultaneously detecting both point and line features. Furthermore, a multi-stage relocalization method based on both appearance and geometry information was proposed for efficient map reuse. We designed the system with an architecture that includes online mapping, offline optimization, and online relocalization, making it easier to deploy on real robots. Extensive experiments show that the proposed system outperforms other SOTA vSLAM systems in terms of accuracy, efficiency, and robustness in illumination-challenging environments.

Despite its remarkable performance, the proposed system still has limitations. Like other point-line-based SLAM systems, our AirSLAM relies on enough line features, so it is best to apply it to man-made environments. This is because our system was originally designed for warehouse robots. In unstructured environments, the system degrades into a point-only system.

References

[1] A. Macario Barros, M. Michel, Y. Moline, G. Corre, and F. Carrel, “A comprehensive survey of visual slam algorithms,” Robotics, vol. 11, no. 1, p. 24, 2022.
[2] S. Yuan, H. Wang, and L. Xie, “Survey on localization systems and algorithms for unmanned systems,” Unmanned Systems, vol. 9, no. 02, pp. 129–163, 2021.
[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[4] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[5] R. Gomez-Ojeda, F.-A. Moreno, D. Zuniga-Noël, D. Scaramuzza, and J. Gonzalez-Jimenez, “Pl-slam: A stereo slam system through the combination of points and line segments,” IEEE Transactions on Robotics, vol. 35, no. 3, pp. 734–746, 2019.
[6] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[7] D. Zuñiga-Noël, A. Jaenal, R. Gomez-Ojeda, and J. Gonzalez-Jimenez, “The uma-vi dataset: Visual–inertial odometry in low-textured and dynamic illumination environments,” The International Journal of Robotics Research, vol. 39, no. 9, pp. 1052–1060, 2020.
[8] A. Savinykh, M. Kurenkov, E. Kruzhkov, E. Yudin, A. Potapov, P. Karpyshev, and D. Tsetserukou, “Darkslam: Gan-assisted visual slam for reliable operation in low-light conditions,” in 2022 IEEE 95th Vehicular Technology Conference:(VTC2022-Spring). IEEE, 2022, pp. 1–6.
[9] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: Robust hierarchical localization at large scale,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 716–12 725.
[10] C. Toft, W. Maddern, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, T. Pajdla et al., “Long-term visual localization revisited,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 2074–2088, 2020.
[11] Q. Gu, P. Liu, J. Zhou, X. Peng, and Y. Zhang, “Drms: Dim-light robust monocular simultaneous localization and mapping,” in 2021 International Conference on Computer, Control and Robotics (ICCCR). IEEE, 2021, pp. 267–271.
[12] L. Yu, E. Yang, and B. Yang, “Afe-orb-slam: robust monocular vslam based on adaptive fast threshold and image enhancement for complex lighting environments,” Journal of Intelligent & Robotic Systems, vol. 105, no. 2, pp. 1–14, 2022.
[13] R. Gomez-Ojeda, Z. Zhang, J. Gonzalez-Jimenez, and D. Scaramuzza, “Learning-based image enhancement for visual odometry in challenging hdr environments,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 805–811.
[14] G. G. Scandaroli, M. Meilland, and R. Richa, “Improving ncc-based direct visual tracking,” in European conference on Computer Vision. Springer, 2012, pp. 442–455.
[15] V. Usenko, N. Demmel, D. Schubert, J. Stückler, and D. Cremers, “Visual-inertial mapping with non-linear factor recovery,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 422–429, 2019.
[16] S. Park, T. Schöps, and M. Pollefeys, “Illumination change robustness in direct visual slam,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 4523–4530.
[17] W. Wang, Y. Hu, and S. Scherer, “Tartanvo: A generalizable learning-based vo,” in Conference on Robot Learning. PMLR, 2021, pp. 1761–1772.
[18] Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,” Advances in neural information processing systems, vol. 34, pp. 16 558–16 569, 2021.
[19] T. Fu, S. Su, Y. Lu, and C. Wang, “iSLAM: Imperative SLAM,” IEEE Robotics and Automation Letters (RA-L), 2024. [Online]. Available: https://arxiv.org/pdf/2306.07894.pdf
[20] M. Labbé and F. Michaud, “Multi-session visual slam for illumination-invariant re-localization in indoor environments,” Frontiers in Robotics and AI, vol. 9, p. 801886, 2022.
[21] ——, “Rtab-map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation,” Journal of field robotics, vol. 36, no. 2, pp. 416–446, 2019.
[22] K. Xu, Y. Hao, S. Yuan, C. Wang, and L. Xie, “Airvo: An illumination-robust point-line visual odometry,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 3429–3436.
[23] S. Kannapiran, N. Bendapudi, M.-Y. Yu, D. Parikh, S. Berman, A. Vora, and G. Pandey, “Stereo visual odometry with deep learning-based point and line feature matching using an attention graph neural network,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 3491–3498.
[24] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236.
[25] R. G. Von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “Lsd: A line segment detector,” Image Processing On Line, vol. 2, pp. 35–55, 2012.
[26] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
[27] P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “Lightglue: Local feature matching at light speed,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 627–17 638.
[28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2564–2571.
[29] D. G. Viswanathan, “Features from accelerated segment test (fast),” in Proceedings of the 10th workshop on image analysis for multimedia interactive services, London, UK, 2009, pp. 6–8.
[30] S. Leutenegger, M. Chli, and R. Y. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2548–2555.
[31] R. Kang, J. Shi, X. Li, Y. Liu, and X. Liu, “Df-slam: A deep-learning enhanced visual slam system based on deep local features,” arXiv preprint arXiv:1901.07223, 2019.
[32] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, “Learning local feature descriptors with triplets and shallow convolutional neural networks.” in Bmvc, vol. 1, no. 2, 2016, p. 3.
[33] J. Tang, L. Ericson, J. Folkesson, and P. Jensfelt, “Gcnv2: Efficient correspondence prediction for real-time slam,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3505–3512, 2019.
[34] X. Han, Y. Tao, Z. Li, R. Cen, and F. Xue, “Superpointvo: A lightweight visual odometry based on cnn feature extraction,” in 2020 5th International Conference on Automation, Control and Robotics Engineering (CACRE). IEEE, 2020, pp. 685–691.
[35] H. M. S. Bruno and E. L. Colombini, “Lift-slam: A deep-learning feature-based monocular visual slam method,” Neurocomputing, vol. 455, pp. 97–110, 2021.
[36] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “Lift: Learned invariant feature transform,” in European conference on computer vision. Springer, 2016, pp. 467–483.
[37] D. Li, X. Shi, Q. Long, S. Liu, W. Yang, F. Wang, Q. Wei, and F. Qiao, “Dxslam: A robust and efficient visual slam system with deep features,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4958–4965.
[38] Z. Zhan, D. Gao, Y.-J. Lin, Y. Xia, and C. Wang, “iMatching: Imperative correspondence learning,” in European Conference on Computer Vision (ECCV), 2024. [Online]. Available: https://arxiv.org/abs/2312.02141
[39] J. Revaud, C. De Souza, M. Humenberger, and P. Weinzaepfel, “R2d2: Reliable and repeatable detector and descriptor,” Advances in neural information processing systems, vol. 32, 2019.
[40] M. Tyszkiewicz, P. Fua, and E. Trulls, “Disk: Learning local features with policy gradient,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 254–14 265, 2020.
[41] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” in Proceedings of the ieee/cvf conference on computer vision and pattern recognition, 2019, pp. 8092–8101.
[42] H. Hu, L. Sackewitz, and M. Lauer, “Joint learning of feature detector and descriptor for visual slam,” in 2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021, pp. 928–933.
[43] C. Anagnostopoulos, A. S. Lalos, P. Kapsalas, D. Van Nguyen, and C. Stylios, “Reviewing deep learning-based feature extractors in a novel automotive slam framework,” in 2023 31st Mediterranean Conference on Control and Automation (MED). IEEE, 2023, pp. 107–112.
[44] C. Akinlar and C. Topal, “Edlines: A real-time line segment detector with a false detection control,” Pattern Recognition Letters, vol. 32, no. 13, pp. 1633–1642, 2011.
[45] Y. He, J. Zhao, Y. Guo, W. He, and K. Yuan, “Pl-vio: Tightly-coupled monocular visual–inertial odometry using point and line features,” Sensors, vol. 18, no. 4, p. 1159, 2018.
[46] L. Zhou, G. Huang, Y. Mao, S. Wang, and M. Kaess, “Edplvo: Efficient direct point-line visual odometry,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 7559–7565.
[47] D. Zou, Y. Wu, L. Pei, H. Ling, and W. Yu, “Structvio: Visual-inertial odometry with structural regularity of man-made environments,” IEEE Transactions on Robotics, vol. 35, no. 4, pp. 999–1013, 2019.
[48] Q. Chen, Y. Cao, J. Hou, G. Li, S. Qiu, B. Chen, X. Xue, H. Lu, and J. Pu, “Vpl-slam: A vertical line supported point line monocular slam system,” IEEE Transactions on Intelligent Transportation Systems, 2024.
[49] R. Pautrat*, J.-T. Lin*, V. Larsson, M. R. Oswald, and M. Pollefeys, “Sold2: Self-supervised occlusion-aware line description and detection,” in Computer Vision and Pattern Recognition (CVPR), 2021.
[50] X. Lin and C. Wang, “AirLine: Efficient learnable line detection with local edge voting,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023. [Online]. Available: https://arxiv.org/abs/2303.16500
[51] N. Xue, T. Wu, S. Bai, F.-D. Wang, G.-S. Xia, L. Zhang, and P. H. Torr, “Holistically-attracted wireframe parsing: From supervised to self-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[52] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
[53] A. Crivellaro and V. Lepetit, “Robust 3d tracking with descriptor fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3414–3421.
[54] J. Huang and S. Liu, “Robust simultaneous localization and mapping in low-light environment,” Computer Animation and Virtual Worlds, vol. 30, no. 3-4, p. e1895, 2019.
[55] P. Kim, H. Lee, and H. J. Kim, “Autonomous flight with robust visual odometry under dynamic lighting conditions,” Autonomous Robots, vol. 43, no. 6, pp. 1605–1622, 2019.
[56] Z. Chen and C. Heckman, “Robust pose estimation based on normalized information distance,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 2217–2223.
[57] H. Alismail, M. Kaess, B. Browning, and S. Lucey, “Direct visual odometry in low light using binary descriptors,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 444–451, 2016.
[58] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” IEEE signal processing magazine, vol. 35, no. 1, pp. 53–65, 2018.
[59] S. Pratap Singh, B. Mazotti, S. Mayilvahanan, G. Li, D. Manish Rajani, and M. Ghaffari, “Twilight slam: A comparative study of low-light visual slam pipelines,” in arXiv preprint arXiv:2304.11310, 2023. [Online]. Available: https://arxiv.org/abs/2304.11310
[60] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916.
[61] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
[62] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
[63] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, “Kimera: an open-source library for real-time metric-semantic localization and mapping,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 1689–1696.
[64] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307.
[65] N. Keetha, A. Mishra, J. Karhade, K. M. Jatavallabhula, S. Scherer, M. Krishna, and S. Garg, “Anyloc: Towards universal visual place recognition,” IEEE Robotics and Automation Letters, 2023.
[66] S. Yan, Y. Liu, L. Wang, Z. Shen, Z. Peng, H. Liu, M. Zhang, G. Zhang, and X. Zhou, “Long-term visual localization with mobile sensors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 245–17 255.
[67] Y. Zhou, H. Qi, and Y. Ma, “End-to-end wireframe parsing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 962–971.
[68] R. Pautrat, D. Barath, V. Larsson, M. R. Oswald, and M. Pollefeys, “Deeplsd: Line segment detection and refinement with deep image gradients,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 327–17 336.
[69] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, “Learning to parse wireframes in images of man-made environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 626–635.
[70] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
[71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), 2015. [Online]. Available: https://arxiv.org/abs/1412.6980
[72] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual–inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2016.
[73] L. Zhang and R. Koch, “An efficient and robust line segment matching approach based on lbd descriptor and pairwise geometric consistency,” Journal of visual communication and image representation, vol. 24, no. 7, pp. 794–805, 2013.
[74] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[75] A. Bartoli and P. Sturm, “Structure-from-motion using lines: Representation, triangulation, and bundle adjustment,” Computer vision and image understanding, vol. 100, no. 3, pp. 416–441, 2005.
[76] X. Zuo, X. Xie, Y. Liu, and G. Huang, “Robust visual slam with point and line features,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1775–1782.
[77] Y. Yang, P. Geneva, K. Eckenhoff, and G. Huang, “Visual-inertial odometry with point and line features,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 2447–2454.
[78] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g 2 o: A general framework for graph optimization,” in 2011 IEEE international conference on robotics and automation. IEEE, 2011, pp. 3607–3613.
[79] M. Kasper, S. McGuire, and C. Heckman, “A benchmark for visual-inertial odometry systems employing onboard illumination,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 5256–5263.
[80] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi, “Visual place recognition with repetitive structures,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 883–890.
[81] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii, “Inloc: Indoor visual localization with dense matching and view synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7199–7209.
[82] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1808–1817.
[83] Y. Wang and G. S. Chirikjian, “Nonparametric second-order theory of error propagation on motion groups,” The International journal of robotics research, vol. 27, no. 11-12, pp. 1258–1273, 2008.
[84] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, October 2012.
[85] P. Denis, J. H. Elder, and F. J. Estrada, “Efficient edge-based methods for estimating manhattan frames in urban imagery,” in Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part II 10. Springer, 2008, pp. 197–210.
[86] N. Xue, S. Bai, F. Wang, G.-S. Xia, T. Wu, and L. Zhang, “Learning attraction field representation for robust line segment detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1595–1603.
[87] N. Xue, S. Bai, F.-D. Wang, G.-S. Xia, T. Wu, L. Zhang, and P. H. Torr, “Learning regional attraction for line segment detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 6, pp. 1998–2013, 2019.
[88] Y. Xu, W. Xu, D. Cheung, and Z. Tu, “Line segment detection using transformers without edges,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4257–4266.
[89] X. Dai, H. Gong, S. Wu, X. Yuan, and Y. Ma, “Fully convolutional line parsing,” Neurocomputing, vol. 506, pp. 1–11, 2022.
[90] H. Zhang, Y. Luo, F. Qin, Y. He, and X. Liu, “Elsd: Efficient line segment detector and descriptor,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2969–2978.
[91] J. Lee and S.-Y. Park, “Plf-vins: Real-time monocular visual-inertial slam with point-line fusion and parallel-line fusion,” IEEE Robotics and Automation Letters, 2021.
[92] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
[93] H. Lim, J. Jeon, and H. Myung, “Uv-slam: Unconstrained line-based slam using vanishing points for structural mapping,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1518–1525, 2022.
[94] A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: From slam to spatial perception with 3d dynamic scene graphs,” The International Journal of Robotics Research, vol. 40, no. 12-14, pp. 1510–1546, 2021.
[95] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “Openvins: A research platform for visual-inertial estimation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4666–4672.
[96] F. Shu, J. Wang, A. Pagani, and D. Stricker, “Structure plp-slam: Efficient sparse mapping and localization using point, line and plane for monocular, rgb-d and stereo cameras,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2105–2112.
[97] A. Cramariuc, L. Bernreiter, F. Tschopp, M. Fehr, V. Reijgwart, J. Nieto, R. Siegwart, and C. Cadena, “maplab 2.0–a modular and multi-modal mapping framework,” IEEE Robotics and Automation Letters, 2022.
[98] Y. Wang, B. Xu, W. Fan, and C. Xiang, “A robust and efficient loop closure detection approach for hybrid ground/aerial vehicles,” Drones, vol. 7, no. 2, p. 135, 2023.
[99] X. Peng, Z. Liu, W. Li, P. Tan, S. Y. Cho, and Q. Wang, “Dvi-slam: A dual visual inertial slam network,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 12 020–12 026.
[100] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
[101] M. Grupp, “evo: Python package for the evaluation of odometry and slam.” https://github.com/MichaelGrupp/evo, 2017.
[102] E. Brachmann, T. Cavallari, and V. A. Prisacariu, “Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5044–5053.
[103] J. Revaud, J. Almazán, R. S. Rezende, and C. R. d. Souza, “Learning with average precision: Training image retrieval with a listwise loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5107–5116.
[104] Y. Ge, H. Wang, F. Zhu, R. Zhao, and H. Li, “Self-supervising fine-grained region similarities for large-scale image localization,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. Springer, 2020, pp. 369–386.
[105] G. Berton, G. Trivigno, B. Caputo, and C. Masone, “Eigenplaces: Training viewpoint robust models for visual place recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 080–11 090.
[106] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, pp. 91–110, 2004.
[107] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, “Sosnet: Second order similarity regularization for local descriptor learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 016–11 025.
[108] L. Cavalli, V. Larsson, M. R. Oswald, T. Sattler, and M. Pollefeys, “Handcrafted outlier detection revisited,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16. Springer, 2020, pp. 770–787.