Semantic Masking and Visual Feature Matching for Robust Localization

Luisa Mao^1,2, Ryan Soussan², Brian Coltin², Trey Smith², Joydeep Biswas¹ *The NASA Game Changing Development Program (Space Technology Mission Directorate) provided funding for this work.The authors are with ¹the Department of Computer Science at the University of Texas at Austin and ²the NASA Ames Research Center {luisa.mao, joydeepb}@utexas.edu
{ryan.soussan, brian.coltin, trey.smith}@nasa.gov

Abstract

We are interested in long-term deployments of autonomous robots to aid astronauts with maintenance and monitoring operations in settings such as the International Space Station. Unfortunately, such environments tend to be highly dynamic and unstructured, and their frequent reconfiguration poses a challenge for robust long-term localization of robots. Many state-of-the-art visual feature-based localization algorithms are not robust towards spatial scene changes, and SLAM algorithms, while promising, cannot run within the low-compute budget available to space robots. To address this gap, we present a computationally efficient semantic masking approach for visual feature matching that improves the accuracy and robustness of visual localization systems during long-term deployment in changing environments. Our method introduces a lightweight check that enforces matches to be within long-term static objects and have consistent semantic classes. We evaluate this approach using both map-based relocalization and relative pose estimation and show that it improves Absolute Trajectory Error (ATE) and correct match ratios on the publicly available Astrobee dataset. While this approach was originally developed for microgravity robotic freeflyers, it can be applied to any visual feature matching pipeline to improve robustness.

I INTRODUCTION

Accurate and robust localization is required for reliable long-term robot autonomy. In environments with dynamic or movable objects, place recognition can be challenging as scene consistency is often assumed. The International Space Station (ISS) is an example of such an environment, and the Astrobee robots [1] operating onboard face constant changes as objects such as cargo bags, wires, laptops, and racks are introduced or rearranged as displayed in Fig. 2. Increasing map matching robustness in the presence of environmental differences would enable more lifelong autonomy for these and other robots.

Localization for the Astrobee robots is made possible by a specialized system which can handle the microgravity, constricted modules and planar, repeated scenes of the ISS. As the Astrobee is limited by compute, maps must be pre-built offline. The remote nature of the ISS makes it difficult to remap frequently enough to capture changes, so there are often discrepancies between the map and deployment environment. Additional challenges of the ISS, such as the limited space to move in, planar scenes, and monocular camera images, cause many state-of-the-art visual feature-matching approaches, including ORBSLAM3[2], to fail. The lack gravity and noisy IMU data also preclude other well-known localization systems, such as MAPLAB 2.0 [3] which has ingrained assumptions about gravity. On top of this, these approaches (along with other more recent and robust algorithms) are too computationally intensive to run on the Astrobee, whose compute platform [1] is roughly 10 times slower than an Intel i9-9980HK 2.4 GHz CPU, and of which only a single core is available for the graph-based localizer.

Refer to caption — Figure 1: feature matching with and without bounding boxes. Horizontal image pairs taken several years apart display multiple scene changes, including a rotated ISS flag that causes faulty associations and a failed relative pose estimate in the top image pair. With semantic masks applied to the matches (bottom image pair), detections of stable scene elements including vents (purple), lights (blue), and handrails (red) enable the pruning of faulty associations due to environment changes and successful relative pose estimation.

We are therefore interested in methods which are: 1) Computationally inexpensive, 2) Use visual features and are robust to scene changes, and 3) Can be easily added into an existing visual localization framework for ease of integration.

Bounding box-based semantic segmentation can be run relatively efficiently and provides object level understanding of a visual scene [4]. Semantic segmentation generates object classes that can be used to prune dynamic or unstable objects [5] and can improve resiliancy to scene changes by detecting stable, static classes and removing those likely to change over time.

To take advantage of the accuracy of feature-based matches and robustness of using semantics, we present a meta-algorithm that enhances visual feature matching for mapping and localization. Our contributions include:

•

A semantic masking stage applied to visual feature matching that enforces class consistency between matches using efficient bounding box detections. This approach can be used with any visual SLAM or localization algorithm to improve robustness to scene changes.
•

An evaluation using the publicly available Astrobee ISS dataset [6] demonstrating increased accuracy and robustness for both map-based pose estimates and relative correspondences in image pairs.

II RELATED WORK

II-A Geometric Approaches

ORB-SLAM3 relies on ORB features [7] and a distributed bag of words (DBoW) [8] for place recognition and loop closures. It quantizes the feature space by clustering descriptors into visual words, and queries are made by finding map frames described by similar visual words. MabLab collects BRISK [9] or FREAK [10] features to build a sparse map alongside performing online VIO, which is later optimized offline. COLMAP [11] matches SIFT features and performs bundle-adjustment using the matches. While each of these approaches are quite successful at matching images from individual activities or without large changes over time, they ignore semantics of the environment and are prone to matching errors if the surroundings change.

II-B Semantic Approaches

II-B1 Localization

Miller et al. [4] explore the use of semantic maps for localization, introducing a new mapping technique which constructs 3d heatmaps of object locations from the use of a bounding box object detector in the image space. Though adding semantic localization improved accuracy when no map-based visual features were otherwise available, it decreased it when both were accessible.

X-View [12] uses pixel level semantics to construct descriptors from segmented frames, but does not incorporate geometric features, using only an odometry source for relative pose estimation in addition to the semantic matches. Similarly, Liu et al. [13] also relies on random walk descriptors to match semantic objects. Both of these approaches rely on dense pixel-level detections that require increased computation and expensive datasets for training.

II-B2 Odometry

VSO [14] uses dense pixel-level semantics and introduces a semantic likelihood function to optimize semantic reprojection errors for visual odometry. Semantic-Direct Visual Odometry [15] also uses pixel-level semantics, but performs dense alignment of semantic images. An et al. [16] perform visual odometry using dense semantics to assign weights for sparse reprojection errors based on their semantic classes and similarly to prioritize sampling certain matches during a RANSAC-based essential matrix calculation. They additionally performed semi-dense matching between images using patches matching defined static semantic classes.

II-B3 SLAM

Wang et al. [17] demonstrate that semantics could enhance SLAM by integrating YOLO with ORB-SLAM2, evaluating on the Freiburg dataset and with an RGB-D quadcopter system. We present a different method of integration which requires structural differences to the sparse map and a more in-depth evaluation including ML baselines on data specific to our application, on which out-of-the-box SLAM approaches fail.

Bowman et al. [18] integrate image semantics with geometric features in the same SLAM algorithm but decouple these as inputs, relying on map projections into detected semantic bounding boxes and geometric feature tracking between keyframes. Civera et al. [19] use a map of objects with extracted SURF [20] features, but do not use a semantic detector to filter or classify matches. Instead they rely on a RANSAC projection algorithm to identify any detected objects in new images. Kimera [21] [22] generates a 3D metric-semantic mesh using image-space detections, but only adds semantics after performing SLAM.

II-C Learning Based Matching

Research into attention-based GNN matchers have produced algorithms such as Superglue [23] which reason about the geometry of the scene. However, the ability of Superglue to capture spatial relationships does not help when the spatial relationships between the components of the scene change. In a dynamic environment, not all parts of the scene are useful and a controlled way to select the useful portions is needed. Additionally, GNNs are difficult to interpret, whereas our approach gives the domain expert control in picking portions of the scene which have semantic meaning.

Erlich et al. explore the use of object-level features [24] for object matching across large viewpoint changes. They find that keypoint-based descriptors used with SuperGlue perform better on images with smaller viewpoint changes, but are not as robust as object-level descriptors when there are large viewpoint changes. Whereas Erlich et al. focus on robustness towards viewpoint changes for the same scene, we focus on robustness towards changes to the scene itself. Rather than combining objects and keypoints through a match score, we compare approaches using object detection either as preprocessing or post-match filtering, and provide an evaluation of a real pipeline.

III METHOD

The semantic image matching pipeline depicted in Fig. 3 improves upon traditional feature matching approaches by adding semantic filtering on a per class basis to visual feature matches.

III-A Object Detection

The semantic object detection stage in the pipeline uses a bounding box object detector fine-tuned on ISS data and with eight defined object classes [4]. Semantic bounding boxes are displayed in Fig. 3(a), where three classes (vents, lights, and handrails) are detected.

III-B Object Masking

The pipeline generates masks for each image using the detected semantic bounding boxes as shown in Fig. 3(b). Regions without semantic detections are not used for later stages of the matching pipeline. Masking is performed before feature detection to improve runtime as features only need to be calculated in masked regions.

III-C Feature Detection

The matching pipeline uses SURF [20] features and hyperparameters tuned for the ISS for feature detection. The SURF detector relies on a dynamic Hessian threshold which adjusts itself until there are between 1000 and 5000 features extracted for each image [25]. Fig. 1 shows example detections for an ISS image.

III-D Interclass Matching

III-D1 Map

The interclass matching stage relies on a prebuilt feature map that consists of extracted SURF keypoints and their triangulated 3d positions [25] augmented with semantic labels from the semantic object detector. Only keypoints with valid semantic object detections are retained in the map which drastically reduces the memory usage.

III-D2 Feature Matching

Candidate matching images in the map are obtained for each image using a DBoW query [8]. The pipeline matches features with the corresponding semantic labels using the FLANN matcher [26] with a goodness ratio of 0.7. Fig. 1 depict the image matching results with and without semantic filtering.

III-E Pose Estimation

The pose estimation stage of the matching pipeline uses the perspective-three-point algorithm [27] to estimate the camera pose from the 2d-3d matches between the image and map. A RANSAC selection procedure [25] iteratively computes poses using four randomly sampled matches at a time and returns the pose with the most inlier matches.

III-F Implementation

The Astroloc relocalization module performs visual-feature matching to a pre-built sparse map to recover pose. Due to the importance of this module of the localization pipeline as the only method of recovery should the Astrobee become lost, we choose to integrate the semantic filter into this module.

We evaluate offline, though all of the individual components of the pipeline, including the object detector model, have previously been successfully run on the Astrobee robots.

IV EXPERIMENTS

Our experiments show the effects of the semantic filter on 2d-2d matching for both classical and learning-based systems, answering the following questions:
1) Does the use of semantics improve visual feature matching as used for visual localization?
2) What are the effects of the semantic filter on learning-based matching approaches which already incorporate spatial relations? Though Astrobee is not currently capable of using these techniques, future missions may use more advanced localization techniques that may benefit from semantic masking.

To answer these, we evaluate our approach using the Astroloc [28] map-based relocalizer with and without semantic filtering. Additionally, we compare the performance of the learning-based image feature matcher Superglue on learned Superpoint [29] features both with and without semantics.

IV-A Dataset

The algorithms are evaluated using eight publicly available datasets from Astrobee deployment in the Japanese Experiment Module (JEM) on the ISS [6]. Table I shows the key to the sequence names. Our data spans from 2019 to 2022, and covers a variety of activities, viewpoints, and lighting. The repeated deployments in the same contained environment gives an opportunity to observe changes to the scene through time.

1	tb_roll	4	ff_return_journey_forward	6	iva_kibo_trans
2	tb_pitch	5	ff_return_journey_left	7	iva_kibo_rot
3	tb_yaw			8	iva_kibo_tag

TABLE I: Key of Number to Sequence Name in Astrobee Dataset

IV-B Visual Localization

IV-B1 Metrics

The Absolute Position Error (APE) in meters and the Absolute Rotation Error (ARE) in degrees are calculated for each relocalized pose in the trajectory. We report the max and median errors along with RMSE, since even a single large failure in relocalization can impact subsequent state estimations. We also calculate the Success Rate (SR), or percentage of localized poses within 0.3m and 5 degrees of groundtruth. These results can be directly compared to the evaluations of SLAM baselines in [6].

IV-C Image Feature Matching

IV-C1 Metrics

Unlike the localization setup where a global pose is recovered from the collected set of 2d-3d matches between a query image and a list of map images, the image feature matching evaluation uses Superglue to find the relative camera pose between pairs of images. Within each trajectory, each image is paired with the single most similar image from the image database and Superglue matches are used to estimate the essential matrix, from which the relative camera pose is recovered. The rotation error in degrees and the translation heading error (the angular difference between the norm of translation vectors) between the estimated and groundtruth extrinsic transformations are reported. We also report the average proportion of correctly matched keypoints (defined by having an epipolar error less than 5e-4) over the entire trajectory.

IV-C2 Segmentation as Pre-Processing

Each image pair is segmented and masked, and the masked pairs of images are matched in eight passes according to object class. All matches are collected and the essential matrix is estimated using a five-point relative pose method [30].

IV-C3 Segmentation as Post-Processing

Each image pair is matched with SuperGlue and matches where both keypoints are within bounding boxes of the same semantic class are kept. The filtered matches are again used to estimate the essential matrix.

V RESULTS

V-A Visual Localization

	max		median		RMSE
Seq	Baseline	Semantic	Baseline	Semantic	Baseline	Semantic
1	$0.015\,403$	$0.015\,633$	$0.005\,671$	$0.005\,492$	$0.007\,647$	$0.008\,46$
2	$0.026\,673$	$0.013\,847$	$0.007\,639$	$0.006\,259$	$0.009\,53$	$0.007\,35$
3	$1.350\,333$	$0.376\,626$	$0.032\,541$	$0.032\,697$	$0.301\,33$	$0.057\,787$
4	$1.437\,444$	$1.326\,094$	$1.022\,843$	$1.021\,239$	$0.978\,852$	$0.970\,929$
5	$1.043\,529$	$1.066\,863$	$0.334\,358$	$0.329\,946$	$0.337\,499$	$0.332\,322$
6	$3.405\,037$	$2.068\,955$	$0.015\,037$	$0.015\,63$	$0.428\,08$	$0.259\,888$
7	$1.005\,764$	$2.277\,931$	$0.017\,173$	$0.014\,688$	$0.098\,748$	$0.173\,117$
8	$1.167\,855$	$0.530\,592$	$0.093\,657$	$0.102\,779$	$0.229\,465$	$0.129\,597$

TABLE II: Non-Semantic vs. Semantic relocalization ATE (m) on Astrobee ISS Datasets

	max		median		RMSE
Seq	Baseline	Semantic	Baseline	Semantic	Baseline	Semantic
1	$0.001\,764$	$0.002\,231$	$0.000\,82$	$0.000\,865$	$0.000\,883$	$0.000\,961$
2	$0.006\,773$	$0.006\,228$	$0.002\,643$	$0.003\,175$	$0.003\,237$	$0.003\,351$
3	$3.141\,498$	$3.069\,498$	$0.029\,714$	$0.029\,167$	$1.474\,307$	$0.202\,956$
4	$0.184\,153$	$0.441\,073$	$0.006\,386$	$0.006\,457$	$0.014\,822$	$0.021\,205$
5	$0.190\,478$	$0.204\,514$	$0.029\,229$	$0.027\,218$	$0.036\,205$	$0.034\,287$
6	$0.563\,893$	$0.366\,709$	$0.004\,334$	$0.004\,954$	$0.076\,822$	$0.046\,994$
7	$0.357\,659$	$0.637\,306$	$0.008\,638$	$0.004\,98$	$0.034\,576$	$0.055\,242$
8	$0.889\,333$	$0.361\,835$	$0.072\,258$	$0.085\,769$	$0.179\,128$	$0.098\,38$

TABLE III: Non-semantic relocalization v.s. Semantic ARE (deg.) on Astrobee ISS Datasets

Sequence	Baseline	Semantic
1	$1$	$1$
2	$1$	$1$
3	$0.7755$	$0.9429$
4	$0.0433$	$0.0433$
5	$0.3012$	$0.3034$
6	$0.4813$	$0.544$
7	$0.7859$	$0.7103$
8	$0.4803$	$0.5263$

TABLE IV: Relocalization success rates with and without semantics on Astrobee Datasets.

Table II displays a reduction in ATE when using semantics for all but two datasets, whereas Table III shows both approaches attained low ARE. To further illustrate performance, we show the success rates for the datasets in Table IV, where semantics improve relocalization for all but one dataset. The difference between the Astroloc relocalizer with and without the semantic filter is best observed when there are modular changes to the environment.

A particularly interesting example occurs in the tb_yaw sequence, in which the robot spins around its z-axis from facing one end of the JEM to facing the other end. In the middle of its trajectory, the robot observes a flag which has been flipped upside-down and is inconsistent with its prior map as shown in Fig. 1. This causes the localized poses from the entire middle portion of its trajectory to be upside down with respect to the map. This is fixed when using semantics as displayed in Fig. 1. We further highlight this in Fig. 4 and 5 where a position offset in the non-semantic relocalizer and reversed yaw between approximately 15 and 30 seconds are both avoided when using semantics.

The errors of the iva_kibo_trans and iva_ARtag sequences from years 2022 and 2021 are also lower with the addition of the semantic filter. As shown here, there can be serious failures if changes in the environment are unnoticed and unreflected in the map. For microgravity free-flyers in particular, these issues are exacerbated by the lack of a gravity vector to verify against and the difficulty of creating maps frequently enough to capture changes due to the inaccessibility of the ISS.

In other datasets, there is negligible difference of the Astroloc relocalization module with ground truth, as the environment and the map are similar enough for the relocalizer to find the robot’s pose. Changes are usually contained within certain portions of the environment, resulting in segments of the trajectory being mislocalized, which is not well-conveyed when the absolute position or rotation error is averaged over the entire trajectory. In iva_kibo_rot, the localization with the semantic filter has greater error than without, since only using features within boxes results in less inliers with which to refine the camera pose.

Though not explicity shown, the two visual localization baselines ORB-SLAM3 and maplab 2.0 were also employed on Astrobee data. Unlike the evaluations in the Astrobee ISS dataset [6], ORB-SLAM3 and maplab 2.0 were run in localization mode on a map built several years apart from the evaluation datasets. Both algorithms failed to find loop closures with the previous map and relied only on odometry. Furthermore, maplab 2.0 could not be used without additional engineering effort due to their assumptions about the existence of a gravity vector.

We also note that since we use a map built years apart from the bags used to evaluate, there is a registration difference which causes the success rate of ff_return_journey_forward to be low, even with the map origin alignment.

V-B Image Feature Matching

Seq	error t (deg):			error R (deg):
	Baseline	Pre	Post	Baseline	Pre	Post
1	$32.560\,72$	$35.779\,97$	$13.920\,52$	$4.140\,139$	$3.537\,02$	$2.032\,448$
2	$22.860\,25$	$39.911\,08$	$13.230\,55$	$2.990\,863$	$4.956\,284$	$1.955\,149$
3	$24.5179$	$42.2196$	$14.078\,22$	$5.965\,214$	$10.6487$	$3.952\,338$
4	$35.793\,93$	$38.503\,15$	$19.028\,57$	$3.084\,002$	$6.211\,718$	$3.375\,606$
5	$35.782\,41$	$45.911\,12$	$28.893\,92$	$5.959\,15$	$6.912\,781$	$8.312\,445$
6	$43.837\,55$	$46.108\,67$	$33.418\,59$	$21.4706$	$27.546\,66$	$13.539\,62$
7	$39.785\,24$	$45.774\,26$	$42.560\,35$	$14.006\,08$	$22.156$	$17.474\,11$
8	$25.414\,95$	$59.044\,71$	$42.098\,71$	$12.490\,89$	$67.481\,47$	$21.684\,65$

TABLE V: Essential matrix estimation errors using Superglue with and without semantic pre and post processing. Since these estimates are not to scale, error is measured in translation and orientation headings.

Seq	avg correct match ratio:			success rate:
	Baseline	Pre	Post	Baseline	Pre	Post
1	$0.654\,51$	$0.812\,111$	$0.688\,818$	$1$	$0.435\,484$	$1$
2	$0.726\,771$	$0.790\,862$	$0.745\,443$	$1$	$0.981\,03$	$1$
3	$0.520\,139$	$0.595\,147$	$0.584\,439$	$1$	$0.915$	$1$
4	$0.278\,964$	$0.375\,81$	$0.323\,893$	$0.748\,815$	$0.639\,81$	$0.748\,815$
5	$0.305\,618$	$0.349\,949$	$0.339\,82$	$0.691\,304$	$0.473\,913$	$0.691\,304$
6	$0.191\,862$	$0.270\,376$	$0.212\,98$	$0.327\,103$	$0.224\,299$	$0.327\,103$
7	$0.095\,07$	$0.104\,719$	$0.098\,898$	$0.161\,702$	$0.127\,66$	$0.161\,702$
8	$0.443\,206$	$0.305\,245$	$0.425\,435$	$0.134\,868$	$0.036\,184$	$0.134\,868$

TABLE VI: Match and success ratios when running Superglue without semantics, using semantics as a pre-processing step, and using semantics as a post-processing step for a sequence of activities.

The cumulative distribution functions (cdf) in Figures 6(a) and 6(b) illustrate the improvements in essential matrix calculation when using semantics with Superglue as a post-processing step. Pre-processing performed worse than baseline, as Superglue always attempts to match 1024 points between images, which can force false matches if different instances of the same object class are detected in each images or heavily concentrate correct matches within boxes. A sample of closely clustered keypoints for essential matrix estimation yields less accurate results than if the keypoints were distributed across the image.

Table VI displays the improvement in correct match ratios when using semantics as a pre and post processing step compared to not using them. Masking images during pre-processing limits the searchable range for matches, and although the pre-processing results show the largest correct match ratio, this clustering effect yields worse essential matrix estimation compared to post-processed matching as described above. Still, the increase in the ratio of correct matches is still valuable for other applications. If the 3d landmarks have already been triangulated, 2d-2d matching before PnP (as is done in most visual localization pipelines) could be improved with using semantics. Essential matrix estimation performance is displayed in Table V, which again shows semantics as a post-processing step outperforming the baseline Superglue approach and semantics as a pre-processing step.

Since the interior of the JEM is shaped as a rectangualar prism, images consist of mostly planar surfaces. Additionally, the microgravity environment results in many query-map image pairs having only relative rotation motion between the camera poses. Despite making high ratios of correct matches on most sequences, these degenerate cases can cause high errors when estimating the essential matrix, especially the translation component. For this reason, Superglue results have much higher errors than the Astroloc evaluation method (where pose can be directly recovered from previously 3d-triangulated points using PnP).

VI CONCLUSIONS

We have presented a lightweight semantic consistency check for visual feature matching that improves the robustness of localization performance. We have shown that enforcing consistent semantic classes for feature matches improves both relocalization performance and essential matrix calculation as evaluated on a dataset of eight Astrobee activities on the ISS.

As this method is designed to be computationally efficient, we additionally plan to deploy and test our semantic relocalization approach on the Astrobee robots during future ISS activities to improve their resilience to environment changes.

In future work, we wish to explore using movable object detections as negative matches, and weighting feature matches based on their semantics or lack there of. Additionally, we are interested in further using semantic results to perform informed map updates on an object level.

ACKNOWLEDGMENT

We would like to thank Ian D. Miller and Suyoung Kang for supporting this work. We would also like to thank Marina Gouviea Moreira for her assistance during testing in the Granite Lab and the rest of the Astrobee Facilities team for their help.

References

[1] T. Smith, J. Barlow, M. Bualat, T. Fong, C. Provencher, H. Sanchez, E. Smith et al., “Astrobee: A new platform for free-flying robotics on the International Space Station,” in Int. Symp. on Artificial Intelligence, Robotics and Automation in Space, 2016.
[2] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
[3] A. Cramariuc, L. Bernreiter, F. Tschopp, M. Fehr, V. Reijgwart, J. Nieto, R. Siegwart, and C. Cadena, “maplab 2.0–a modular and multi-modal mapping framework,” IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 520–527, 2022.
[4] I. Miller, R. Soussan, B. Coltin, T. Smith, and V. Kumar, “Robust semantic mapping and localization on a free-flying robot in microgravity,” in 2022 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2022.
[5] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, “Ds-slam: A semantic visual slam towards dynamic environments,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2018, pp. 1168–1174.
[6] S. Kang, R. Soussan, D. Lee, B. Coltin, A. M. Vargas, M. Moreira, K. Hamilton, R. Garcia, R. Bualat, T. Smith, J. Barlow, J. Benavides, E. Jeong, and P. Kim, “Astrobee iss free-flyer datasets for space intra-vehicular robot navigation research,” in 2023 IEEE Robotics and Automation Letters (RA-L), Under Review.
[7] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. Ieee, 2011, pp. 2564–2571.
[8] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
[9] S. Leutenegger, M. Chli, and R. Y. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 International conference on computer vision. Ieee, 2011, pp. 2548–2555.
[10] A. Alahi, R. Ortiz, and P. Vandergheynst, “Freak: Fast retina keypoint,” in 2012 IEEE conference on computer vision and pattern recognition. Ieee, 2012, pp. 510–517.
[11] J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104–4113.
[12] A. Gawel, C. Del Don, R. Siegwart, J. Nieto, and C. Cadena, “X-view: Graph-based semantic multi-view localization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1687–1694, 2018.
[13] Y. Liu, Y. Petillot, D. Lane, and S. Wang, “Global localization with object-level semantics and topology,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 4909–4915.
[14] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, “Vso: Visual semantic odometry,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 234–250.
[15] Y. Bao, Z. Yang, Y. Pan, and R. Huan, “Semantic-direct visual odometry,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 6718–6725, 2022.
[16] L. An, X. Zhang, H. Gao, and Y. Liu, “Semantic segmentation–aided visual odometry for urban autonomous driving,” International Journal of Advanced Robotic Systems, vol. 14, no. 5, p. 1729881417735667, 2017.
[17] Y. Wang and A. Zell, “Improving feature-based visual slam by semantics,” in 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), 2018, pp. 7–12.
[18] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Probabilistic data association for semantic slam,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 1722–1729.
[19] J. Civera, D. Gálvez-López, L. Riazuelo, J. D. Tardós, and J. M. M. Montiel, “Towards semantic slam using a monocular camera,” in 2011 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2011, pp. 1277–1284.
[20] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
[21] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, “Kimera: an open-source library for real-time metric-semantic localization and mapping,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 1689–1696.
[22] Y. Chang, Y. Tian, J. P. How, and L. Carlone, “Kimera-multi: a system for distributed multi-robot metric-semantic simultaneous localization and mapping,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11 210–11 218.
[23] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
[24] C. Elich, I. Armeni, M. R. Oswald, M. Pollefeys, and J. Stueckler, “Learning-based relational object matching across views,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 5999–6005.
[25] B. Coltin, J. Fusco, Z. Moratto, O. Alexandrov, and R. Nakamura, “Localization from visual landmarks on a free-flying robot,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4377–4382.
[26] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration.” VISAPP (1), vol. 2, no. 331-340, p. 2, 2009.
[27] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng, “Complete solution classification for the perspective-three-point problem,” IEEE transactions on pattern analysis and machine intelligence, vol. 25, no. 8, pp. 930–943, 2003.
[28] R. Soussan, V. Kumar, B. Coltin, and T. Smith, “Astroloc: An efficient and robust localizer for a free-flying robot,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 4106–4112.
[29] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236.
[30] D. Nistér, “An efficient solution to the five-point relative pose problem,” IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 6, pp. 756–770, 2004.