Semantic Masking and Visual Feature Matching for Robust Localization

Luisa Mao1,2, Ryan Soussan2, Brian Coltin2, Trey Smith2, Joydeep Biswas1
*The NASA Game Changing Development Program (Space Technology Mission Directorate) provided funding for this work. The authors are with 1the Department of Computer Science at the University of Texas at Austin and 2the NASA Ames Research Center.
{luisa.mao, joydeepb}@utexas.edu
{ryan.soussan, brian.coltin, trey.smith}@nasa.gov
Abstract

We are interested in long-term deployments of autonomous robots to aid astronauts with maintenance and monitoring operations in settings such as the International Space Station. Unfortunately, such environments tend to be highly dynamic and unstructured, and their frequent reconfiguration poses a challenge for robust long-term localization of robots. Many state-of-the-art visual feature-based localization algorithms are not robust towards spatial scene changes, and SLAM algorithms, while promising, cannot run within the low-compute budget available to space robots. To address this gap, we present a computationally efficient semantic masking approach for visual feature matching that improves the accuracy and robustness of visual localization systems during long-term deployment in changing environments. Our method introduces a lightweight check that requires matches to lie within long-term static objects and to have consistent semantic classes. We evaluate this approach using both map-based relocalization and relative pose estimation and show that it improves Absolute Trajectory Error (ATE) and correct match ratios on the publicly available Astrobee dataset. While this approach was originally developed for microgravity robotic freeflyers, it can be applied to any visual feature matching pipeline to improve robustness.

I INTRODUCTION

Accurate and robust localization is required for reliable long-term robot autonomy. In environments with dynamic or movable objects, place recognition can be challenging as scene consistency is often assumed. The International Space Station (ISS) is an example of such an environment, and the Astrobee robots [1] operating onboard face constant changes as objects such as cargo bags, wires, laptops, and racks are introduced or rearranged as displayed in Fig. 2. Increasing map matching robustness in the presence of environmental differences would enable more lifelong autonomy for these and other robots.

Localization for the Astrobee robots is made possible by a specialized system which can handle the microgravity, constricted modules, and planar, repeated scenes of the ISS. As the Astrobee is limited by compute, maps must be pre-built offline. The remote nature of the ISS makes it difficult to remap frequently enough to capture changes, so there are often discrepancies between the map and deployment environment. Additional challenges of the ISS, such as the limited space to move in, planar scenes, and monocular camera images, cause many state-of-the-art visual feature-matching approaches, including ORB-SLAM3 [2], to fail. The lack of gravity and noisy IMU data also preclude other well-known localization systems, such as maplab 2.0 [3], which has ingrained assumptions about gravity. On top of this, these approaches (along with other more recent and robust algorithms) are too computationally intensive to run on the Astrobee, whose compute platform [1] is roughly 10 times slower than an Intel i9-9980HK 2.4 GHz CPU, and of which only a single core is available for the graph-based localizer.

Figure 1: Feature matching with and without bounding boxes. Horizontal image pairs taken several years apart display multiple scene changes, including a rotated ISS flag that causes faulty associations and a failed relative pose estimate in the top image pair. With semantic masks applied to the matches (bottom image pair), detections of stable scene elements including vents (purple), lights (blue), and handrails (red) enable the pruning of faulty associations due to environment changes and successful relative pose estimation.

We are therefore interested in methods which: 1) are computationally inexpensive, 2) use visual features and are robust to scene changes, and 3) can be easily added to an existing visual localization framework.

Bounding box-based semantic segmentation can be run relatively efficiently and provides object-level understanding of a visual scene [4]. Semantic segmentation generates object classes that can be used to prune dynamic or unstable objects [5] and can improve resiliency to scene changes by detecting stable, static classes and removing those likely to change over time.

To take advantage of the accuracy of feature-based matches and robustness of using semantics, we present a meta-algorithm that enhances visual feature matching for mapping and localization. Our contributions include:

  • A semantic masking stage applied to visual feature matching that enforces class consistency between matches using efficient bounding box detections. This approach can be used with any visual SLAM or localization algorithm to improve robustness to scene changes.

  • An evaluation using the publicly available Astrobee ISS dataset [6] demonstrating increased accuracy and robustness for both map-based pose estimates and relative correspondences in image pairs.

Figure 2: Astrobee free-flying robots roaming the ISS during an activity. Background objects such as laptops, wires, and cargo bags are often moved between flights and can cause localization errors for the robots.

II RELATED WORK

II-A Geometric Approaches

ORB-SLAM3 relies on ORB features [7] and a distributed bag of words (DBoW) [8] for place recognition and loop closures. It quantizes the feature space by clustering descriptors into visual words, and queries are made by finding map frames described by similar visual words. maplab collects BRISK [9] or FREAK [10] features to build a sparse map alongside performing online VIO, which is later optimized offline. COLMAP [11] matches SIFT features and performs bundle adjustment using the matches. While each of these approaches is quite successful at matching images from individual activities or without large changes over time, they ignore the semantics of the environment and are prone to matching errors if the surroundings change.

II-B Semantic Approaches

II-B1 Localization

Miller et al. [4] explore the use of semantic maps for localization, introducing a new mapping technique which constructs 3d heatmaps of object locations using a bounding box object detector in the image space. Though adding semantic localization improved accuracy when no map-based visual features were otherwise available, it decreased accuracy when both were accessible.

X-View [12] uses pixel-level semantics to construct descriptors from segmented frames, but does not incorporate geometric features, using only an odometry source for relative pose estimation in addition to the semantic matches. Similarly, Liu et al. [13] rely on random walk descriptors to match semantic objects. Both of these approaches rely on dense pixel-level detections that require increased computation and expensive datasets for training.

II-B2 Odometry

VSO [14] uses dense pixel-level semantics and introduces a semantic likelihood function to optimize semantic reprojection errors for visual odometry. Semantic-Direct Visual Odometry [15] also uses pixel-level semantics, but performs dense alignment of semantic images. An et al. [16] perform visual odometry using dense semantics to assign weights for sparse reprojection errors based on their semantic classes and similarly to prioritize sampling certain matches during a RANSAC-based essential matrix calculation. They additionally performed semi-dense matching between images using patches matching defined static semantic classes.

II-B3 SLAM

Wang et al. [17] demonstrate that semantics could enhance SLAM by integrating YOLO with ORB-SLAM2, evaluating on the Freiburg dataset and with an RGB-D quadcopter system. We present a different method of integration which requires structural differences to the sparse map and a more in-depth evaluation including ML baselines on data specific to our application, on which out-of-the-box SLAM approaches fail.

Bowman et al. [18] integrate image semantics with geometric features in the same SLAM algorithm but decouple these as inputs, relying on map projections into detected semantic bounding boxes and geometric feature tracking between keyframes. Civera et al. [19] use a map of objects with extracted SURF [20] features, but do not use a semantic detector to filter or classify matches. Instead they rely on a RANSAC projection algorithm to identify any detected objects in new images. Kimera [21] [22] generates a 3D metric-semantic mesh using image-space detections, but only adds semantics after performing SLAM.

II-C Learning Based Matching

Research into attention-based GNN matchers has produced algorithms such as SuperGlue [23], which reason about the geometry of the scene. However, the ability of SuperGlue to capture spatial relationships does not help when the spatial relationships between the components of the scene change. In a dynamic environment, not all parts of the scene are useful and a controlled way to select the useful portions is needed. Additionally, GNNs are difficult to interpret, whereas our approach gives the domain expert control in picking portions of the scene which have semantic meaning.

Erlich et al. explore the use of object-level features [24] for object matching across large viewpoint changes. They find that keypoint-based descriptors used with SuperGlue perform better on images with smaller viewpoint changes, but are not as robust as object-level descriptors when there are large viewpoint changes. Whereas Erlich et al. focus on robustness towards viewpoint changes for the same scene, we focus on robustness towards changes to the scene itself. Rather than combining objects and keypoints through a match score, we compare approaches using object detection either as preprocessing or post-match filtering, and provide an evaluation of a real pipeline.

III METHOD

Figure 3: (a) Bounding boxes. (b) Masked images. (c) Semantic filtering pipeline. The semantic image matching pipeline adds semantic segmentation stages in blue to a visual feature matching pipeline in red to improve pose estimation accuracy. The pipeline detects semantic objects in each image (Fig. 3(a)) and generates masked image-space regions for each detection in each object class (Fig. 3(b)). It then detects visual features in the masked regions and performs matching between features of the same class for each pair of images. Finally, the pipeline estimates the relative pose between the images using the resulting matches.

The semantic image matching pipeline depicted in Fig. 3 improves upon traditional feature matching approaches by adding semantic filtering on a per-class basis to visual feature matches.

III-A Object Detection

The semantic object detection stage in the pipeline uses a bounding box object detector fine-tuned on ISS data and with eight defined object classes [4]. Semantic bounding boxes are displayed in Fig. 3(a), where three classes (vents, lights, and handrails) are detected.

III-B Object Masking

The pipeline generates masks for each image using the detected semantic bounding boxes as shown in Fig. 3(b). Regions without semantic detections are not used for later stages of the matching pipeline. Masking is performed before feature detection to improve runtime as features only need to be calculated in masked regions.
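The masking step can be sketched as follows. This is a minimal illustration rather than the Astrobee implementation: the detection tuple format `(class_name, x_min, y_min, x_max, y_max)` and the function names `class_masks` and `apply_mask` are assumptions made for this example.

```python
import numpy as np

def class_masks(image_shape, detections):
    """Build one boolean mask per semantic class from bounding boxes.

    detections: iterable of (class_name, x_min, y_min, x_max, y_max) in pixels.
    This tuple layout is a hypothetical format chosen for the sketch.
    """
    height, width = image_shape[:2]
    masks = {}
    for name, x0, y0, x1, y1 in detections:
        # Clip the box to the image so partially visible objects are kept.
        x0, x1 = max(0, int(x0)), min(width, int(x1))
        y0, y1 = max(0, int(y0)), min(height, int(y1))
        mask = masks.setdefault(name, np.zeros((height, width), dtype=bool))
        mask[y0:y1, x0:x1] = True
    return masks

def apply_mask(image, mask):
    """Return a copy of the image with every pixel outside the mask set to 0."""
    masked = image.copy()
    masked[~mask] = 0
    return masked
```

Keeping one mask per class, rather than a single combined mask, lets later stages restrict both feature detection and matching to a single class at a time.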

III-C Feature Detection

The matching pipeline uses SURF [20] features and hyperparameters tuned for the ISS for feature detection. The SURF detector relies on a dynamic Hessian threshold which adjusts itself until there are between 1000 and 5000 features extracted for each image [25]. Fig. 1 shows example detections for an ISS image.
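The dynamic threshold idea can be sketched independently of any particular detector. In the sketch below, `detect` is a hypothetical callable that maps a threshold to a list of keypoints, and the multiplicative update with `step` and the iteration cap are assumptions for illustration; the actual adjustment rule is described in [25].

```python
def detect_with_dynamic_threshold(detect, threshold, min_features=1000,
                                  max_features=5000, max_iterations=10,
                                  step=2.0):
    """Retune a detector threshold until the feature count is within bounds.

    detect: hypothetical callable, threshold -> list of keypoints.
    Returns the keypoints and the threshold that produced them.
    """
    keypoints = detect(threshold)
    for _ in range(max_iterations):
        if len(keypoints) < min_features:
            threshold /= step  # too few features: make the detector more permissive
        elif len(keypoints) > max_features:
            threshold *= step  # too many features: make the detector stricter
        else:
            break
        keypoints = detect(threshold)
    return keypoints, threshold
```

The iteration cap keeps the cost bounded on images where no threshold yields a feature count in the requested range, for example heavily masked or textureless frames.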

III-D Interclass Matching

III-D1 Map

The interclass matching stage relies on a prebuilt feature map that consists of extracted SURF keypoints and their triangulated 3d positions [25], augmented with semantic labels from the semantic object detector. Only keypoints with valid semantic object detections are retained in the map, which drastically reduces memory usage.

III-D2 Feature Matching

Candidate matching images in the map are obtained for each image using a DBoW query [8]. The pipeline matches features with the corresponding semantic labels using the FLANN matcher [26] with a goodness ratio of 0.7. Fig. 1 depicts the image matching results with and without semantic filtering.
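Class-consistent matching can be sketched as below. Two assumptions are made for this example: the goodness ratio is interpreted as a nearest-neighbor ratio test, and a brute-force NumPy search stands in for the FLANN matcher so that the sketch has no external dependencies. The function name and argument layout are hypothetical.

```python
import numpy as np

def match_within_classes(query_desc, query_labels, map_desc, map_labels,
                         ratio=0.7):
    """Ratio-test matching restricted to features with the same semantic label.

    Returns a list of (query_index, map_index) pairs.
    """
    query_desc = np.asarray(query_desc, dtype=float)
    map_desc = np.asarray(map_desc, dtype=float)
    query_labels = np.asarray(query_labels)
    map_labels = np.asarray(map_labels)
    matches = []
    for label in np.unique(query_labels):
        q_idx = np.flatnonzero(query_labels == label)
        m_idx = np.flatnonzero(map_labels == label)
        if len(m_idx) < 2:
            continue  # the ratio test needs two candidate neighbors
        # Pairwise Euclidean distances between same-class descriptors.
        diff = query_desc[q_idx][:, None, :] - map_desc[m_idx][None, :, :]
        dists = np.linalg.norm(diff, axis=2)
        order = np.argsort(dists, axis=1)
        for row, q in enumerate(q_idx):
            best, second = order[row, 0], order[row, 1]
            # Accept only if the best neighbor is clearly closer than the second.
            if dists[row, best] < ratio * dists[row, second]:
                matches.append((int(q), int(m_idx[best])))
    return matches
```

Because candidates are drawn only from the same class, a descriptor that happens to be closest to a feature on a different object class can never be selected, which is the consistency check the pipeline relies on.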

III-E Pose Estimation

The pose estimation stage of the matching pipeline uses the perspective-three-point algorithm [27] to estimate the camera pose from the 2d-3d matches between the image and map. A RANSAC selection procedure [25] iteratively computes poses using four randomly sampled matches at a time and returns the pose with the most inlier matches.
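The RANSAC selection loop has the following general shape. This is a generic skeleton, not the procedure from [25]: the minimal solver `solve_pose` and the inlier test `is_inlier` are hypothetical callables supplied by the caller (for example, a P3P solver and a reprojection-error check), and the iteration count and seed are illustrative defaults.

```python
import random

def ransac_pose(matches, solve_pose, is_inlier, iterations=100,
                sample_size=4, seed=0):
    """Return the pose hypothesis supported by the most matches.

    solve_pose: hypothetical callable, list of sampled matches -> pose or None.
    is_inlier:  hypothetical callable, (pose, match) -> bool.
    """
    if len(matches) < sample_size:
        return None, []
    rng = random.Random(seed)
    best_pose, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.sample(matches, sample_size)
        pose = solve_pose(sample)
        if pose is None:
            continue  # degenerate sample, try another one
        inliers = [m for m in matches if is_inlier(pose, m)]
        if len(inliers) > len(best_inliers):
            best_pose, best_inliers = pose, inliers
    return best_pose, best_inliers
```

Semantic filtering acts before this loop: removing matches on movable objects raises the inlier fraction, so fewer iterations are needed to draw an uncontaminated sample.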

III-F Implementation

The Astroloc relocalization module performs visual-feature matching to a pre-built sparse map to recover pose. Due to the importance of this module of the localization pipeline as the only method of recovery should the Astrobee become lost, we choose to integrate the semantic filter into this module.

We evaluate offline, though all of the individual components of the pipeline, including the object detector model, have previously been successfully run on the Astrobee robots.

IV EXPERIMENTS

Our experiments show the effects of the semantic filter on 2d-2d matching for both classical and learning-based systems, answering the following questions:
1) Does the use of semantics improve visual feature matching as used for visual localization?
2) What are the effects of the semantic filter on learning-based matching approaches which already incorporate spatial relations? Though Astrobee is not currently capable of using these techniques, future missions may use more advanced localization techniques that may benefit from semantic masking.

To answer these, we evaluate our approach using the Astroloc [28] map-based relocalizer with and without semantic filtering. Additionally, we compare the performance of the learning-based image feature matcher SuperGlue on learned SuperPoint [29] features both with and without semantics.

IV-A Dataset

The algorithms are evaluated using eight publicly available datasets from Astrobee deployment in the Japanese Experiment Module (JEM) on the ISS [6]. Table I shows the key to the sequence names. Our data spans from 2019 to 2022, and covers a variety of activities, viewpoints, and lighting. The repeated deployments in the same contained environment give an opportunity to observe changes to the scene through time.

Number  Sequence
1       tb_roll
2       tb_pitch
3       tb_yaw
4       ff_return_journey_forward
5       ff_return_journey_left
6       iva_kibo_trans
7       iva_kibo_rot
8       iva_kibo_tag
TABLE I: Key of Number to Sequence Name in Astrobee Dataset

IV-B Visual Localization

IV-B1 Metrics

The Absolute Position Error (APE) in meters and the Absolute Rotation Error (ARE) in degrees are calculated for each relocalized pose in the trajectory. We report the max and median errors along with RMSE, since even a single large failure in relocalization can impact subsequent state estimations. We also calculate the Success Rate (SR), or percentage of localized poses within 0.3 m and 5 degrees of ground truth. These results can be directly compared to the evaluations of SLAM baselines in [6].
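These metrics can be computed as in the sketch below. It assumes that estimated and ground-truth poses are already time-associated and expressed in the same frame, that rotations are given as 3x3 matrices, and that the success thresholds are inclusive; the function names are hypothetical.

```python
import numpy as np

def rotation_angle_deg(R_est, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def relocalization_metrics(t_est, t_gt, R_est, R_gt, max_pos=0.3, max_rot=5.0):
    """Max, median, and RMSE of position and rotation error, plus success rate."""
    pos_err = np.linalg.norm(np.asarray(t_est, dtype=float)
                             - np.asarray(t_gt, dtype=float), axis=1)
    rot_err = np.array([rotation_angle_deg(a, b) for a, b in zip(R_est, R_gt)])

    def summary(err):
        return {"max": float(err.max()),
                "median": float(np.median(err)),
                "rmse": float(np.sqrt(np.mean(err ** 2)))}

    # A pose counts as a success only if both thresholds are met.
    success = float(np.mean((pos_err <= max_pos) & (rot_err <= max_rot)))
    return {"ape": summary(pos_err), "are": summary(rot_err),
            "success_rate": success}
```

Reporting the maximum alongside the median and RMSE matters here because a short mislocalized segment can leave the median almost unchanged while dominating the maximum.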

IV-C Image Feature Matching

IV-C1 Metrics

Unlike the localization setup where a global pose is recovered from the collected set of 2d-3d matches between a query image and a list of map images, the image feature matching evaluation uses SuperGlue to find the relative camera pose between pairs of images. Within each trajectory, each image is paired with the single most similar image from the image database and SuperGlue matches are used to estimate the essential matrix, from which the relative camera pose is recovered. The rotation error in degrees and the translation heading error (the angular difference between the normalized translation vectors) between the estimated and ground-truth extrinsic transformations are reported. We also report the average proportion of correctly matched keypoints (defined by having an epipolar error less than 5e-4) over the entire trajectory.
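The heading error and the correct match ratio can be sketched as follows. The text does not specify which epipolar error is used, so this example assumes a symmetric squared epipolar distance on normalized image coordinates, which is one common choice; the function names are hypothetical.

```python
import numpy as np

def heading_error_deg(t_est, t_gt):
    """Angle in degrees between two translation directions (scale is ignored)."""
    a = np.asarray(t_est, dtype=float)
    b = np.asarray(t_gt, dtype=float)
    cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def symmetric_epipolar_error(E, pts_a, pts_b):
    """Symmetric squared epipolar distance for N matches.

    pts_a, pts_b: (N, 2) arrays of normalized image coordinates (assumption).
    E maps points in image A to epipolar lines in image B.
    """
    a = np.column_stack([pts_a, np.ones(len(pts_a))])
    b = np.column_stack([pts_b, np.ones(len(pts_b))])
    lines_b = a @ E.T  # epipolar lines in image B, one per row
    lines_a = b @ E    # epipolar lines in image A, one per row
    residual = np.sum(b * lines_b, axis=1)  # b^T E a for each match
    return residual ** 2 * (1.0 / np.sum(lines_b[:, :2] ** 2, axis=1)
                            + 1.0 / np.sum(lines_a[:, :2] ** 2, axis=1))

def correct_match_ratio(E, pts_a, pts_b, threshold=5e-4):
    """Fraction of matches whose epipolar error is below the threshold."""
    errors = symmetric_epipolar_error(E, pts_a, pts_b)
    return float(np.mean(errors < threshold))
```

Only the direction of the translation is compared because an essential matrix determines translation up to scale.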

IV-C2 Segmentation as Pre-Processing

Each image pair is segmented and masked, and the masked pairs of images are matched in eight passes according to object class. All matches are collected and the essential matrix is estimated using a five-point relative pose method [30].

IV-C3 Segmentation as Post-Processing

Each image pair is matched with SuperGlue and matches where both keypoints are within bounding boxes of the same semantic class are kept. The filtered matches are again used to estimate the essential matrix.
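The post-processing filter can be sketched in a few lines. The detection tuple format `(class_name, x_min, y_min, x_max, y_max)` and the function names are assumptions for this example; matches are taken to be pairs of pixel coordinates, one per image.

```python
def box_classes(point, detections):
    """Set of class names whose bounding boxes contain the pixel (x, y)."""
    x, y = point
    return {name for name, x0, y0, x1, y1 in detections
            if x0 <= x <= x1 and y0 <= y <= y1}

def filter_matches(matches, detections_a, detections_b):
    """Keep matches whose two keypoints fall in boxes of a shared class.

    matches: list of ((x_a, y_a), (x_b, y_b)) keypoint pairs.
    """
    return [(pa, pb) for pa, pb in matches
            if box_classes(pa, detections_a) & box_classes(pb, detections_b)]
```

Using a set intersection handles overlapping boxes: a keypoint lying inside both a vent box and a handrail box is kept if its partner lies in a box of either class.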

V RESULTS

V-A Visual Localization

Seq   max                   median                RMSE
      Baseline  Semantic    Baseline  Semantic    Baseline  Semantic
1     0.015403  0.015633    0.005671  0.005492    0.007647  0.00846
2     0.026673  0.013847    0.007639  0.006259    0.00953   0.00735
3     1.350333  0.376626    0.032541  0.032697    0.30133   0.057787
4     1.437444  1.326094    1.022843  1.021239    0.978852  0.970929
5     1.043529  1.066863    0.334358  0.329946    0.337499  0.332322
6     3.405037  2.068955    0.015037  0.01563     0.42808   0.259888
7     1.005764  2.277931    0.017173  0.014688    0.098748  0.173117
8     1.167855  0.530592    0.093657  0.102779    0.229465  0.129597
TABLE II: Non-Semantic vs. Semantic relocalization ATE (m) on Astrobee ISS Datasets
Seq   max                   median                RMSE
      Baseline  Semantic    Baseline  Semantic    Baseline  Semantic
1     0.001764  0.002231    0.00082   0.000865    0.000883  0.000961
2     0.006773  0.006228    0.002643  0.003175    0.003237  0.003351
3     3.141498  3.069498    0.029714  0.029167    1.474307  0.202956
4     0.184153  0.441073    0.006386  0.006457    0.014822  0.021205
5     0.190478  0.204514    0.029229  0.027218    0.036205  0.034287
6     0.563893  0.366709    0.004334  0.004954    0.076822  0.046994
7     0.357659  0.637306    0.008638  0.00498     0.034576  0.055242
8     0.889333  0.361835    0.072258  0.085769    0.179128  0.09838
TABLE III: Non-Semantic vs. Semantic relocalization ARE (deg.) on Astrobee ISS Datasets
Sequence  Baseline  Semantic
1         1         1
2         1         1
3         0.7755    0.9429
4         0.0433    0.0433
5         0.3012    0.3034
6         0.4813    0.544
7         0.7859    0.7103
8         0.4803    0.5263
TABLE IV: Relocalization success rates with and without semantics on Astrobee Datasets.

Table II displays a reduction in ATE (measured by RMSE) when using semantics for all but two datasets, whereas Table III shows both approaches attained low ARE. To further illustrate performance, we show the success rates for the datasets in Table IV, where semantics match or improve relocalization for all but one dataset. The difference between the Astroloc relocalizer with and without the semantic filter is best observed when there are modular changes to the environment.

A particularly interesting example occurs in the tb_yaw sequence, in which the robot spins around its z-axis from facing one end of the JEM to facing the other end. In the middle of its trajectory, the robot observes a flag which has been flipped upside-down and is inconsistent with its prior map as shown in Fig. 1. This causes the localized poses from the entire middle portion of its trajectory to be upside down with respect to the map. This is fixed when using semantics as displayed in Fig. 1. We further highlight this in Figs. 4 and 5, where a position offset in the non-semantic relocalizer and reversed yaw between approximately 15 and 30 seconds are both avoided when using semantics.

Figure 4: (a) Baseline relocalizer. (b) Semantic relocalizer. The XYZ position of the Astrobee through time in the tb_yaw sequence is plotted. The non-semantic localizer accrues a position offset in the middle of the plot (visible as a discontinuous step) whereas the semantic localizer maintains its fixed position.

The errors of the iva_kibo_trans and iva_ARtag sequences from years 2022 and 2021 are also lower with the addition of the semantic filter. As shown here, there can be serious failures if changes in the environment are unnoticed and unreflected in the map. For microgravity free-flyers in particular, these issues are exacerbated by the lack of a gravity vector to verify against and the difficulty of creating maps frequently enough to capture changes due to the inaccessibility of the ISS.

In other datasets, the output of the Astroloc relocalization module differs negligibly from ground truth, as the environment and the map are similar enough for the relocalizer to find the robot’s pose. Changes are usually contained within certain portions of the environment, resulting in segments of the trajectory being mislocalized, which is not well conveyed when the absolute position or rotation error is averaged over the entire trajectory. In iva_kibo_rot, the localization with the semantic filter has greater error than without, since only using features within boxes results in fewer inliers with which to refine the camera pose.

Figure 5: (a) Baseline relocalizer. (b) Semantic relocalizer. The addition of semantics helps the robot track its orientation during an in-place rotation in the tb_yaw sequence. Here, the non-semantic relocalizer localizes upside-down and observes a reversed yaw around 15 seconds while the semantic version properly tracks the rotation.

Though not explicitly shown, the two visual localization baselines ORB-SLAM3 and maplab 2.0 were also employed on Astrobee data. Unlike the evaluations in the Astrobee ISS dataset [6], ORB-SLAM3 and maplab 2.0 were run in localization mode on a map built several years apart from the evaluation datasets. Both algorithms failed to find loop closures with the previous map and relied only on odometry. Furthermore, maplab 2.0 could not be used without additional engineering effort due to its assumptions about the existence of a gravity vector.

We also note that since we use a map built years apart from the bags used to evaluate, there is a registration difference which causes the success rate of ff_return_journey_forward to be low, even with the map origin alignment.

V-B Image Feature Matching

Figure 6: (a) Translation Heading Error (deg). (b) Rotational Error (deg). Translation and Rotation Error CDFs for SuperGlue with and without semantics. A curve closer to the upper left denotes lower error.
Seq   error t (deg)                    error R (deg)
      Baseline  Pre       Post         Baseline  Pre       Post
1     32.56072  35.77997  13.92052     4.140139  3.53702   2.032448
2     22.86025  39.91108  13.23055     2.990863  4.956284  1.955149
3     24.5179   42.2196   14.07822     5.965214  10.6487   3.952338
4     35.79393  38.50315  19.02857     3.084002  6.211718  3.375606
5     35.78241  45.91112  28.89392     5.95915   6.912781  8.312445
6     43.83755  46.10867  33.41859     21.4706   27.54666  13.53962
7     39.78524  45.77426  42.56035     14.00608  22.156    17.47411
8     25.41495  59.04471  42.09871     12.49089  67.48147  21.68465
TABLE V: Essential matrix estimation errors using SuperGlue with and without semantic pre- and post-processing. Since these estimates are not to scale, error is measured in translation and orientation headings.
Seq   avg correct match ratio          success rate
      Baseline  Pre       Post         Baseline  Pre       Post
1     0.65451   0.812111  0.688818     1         0.435484  1
2     0.726771  0.790862  0.745443     1         0.98103   1
3     0.520139  0.595147  0.584439     1         0.915     1
4     0.278964  0.37581   0.323893     0.748815  0.63981   0.748815
5     0.305618  0.349949  0.33982      0.691304  0.473913  0.691304
6     0.191862  0.270376  0.21298      0.327103  0.224299  0.327103
7     0.09507   0.104719  0.098898     0.161702  0.12766   0.161702
8     0.443206  0.305245  0.425435     0.134868  0.036184  0.134868
TABLE VI: Match and success ratios when running SuperGlue without semantics, using semantics as a pre-processing step, and using semantics as a post-processing step for a sequence of activities.

The cumulative distribution functions (CDFs) in Figures 6(a) and 6(b) illustrate the improvements in essential matrix calculation when using semantics with SuperGlue as a post-processing step. Pre-processing performed worse than baseline, as SuperGlue always attempts to match 1024 points between images, which can force false matches if different instances of the same object class are detected in each image, or can heavily concentrate correct matches within boxes. A sample of closely clustered keypoints for essential matrix estimation yields less accurate results than if the keypoints were distributed across the image.

Table VI displays the improvement in correct match ratios when using semantics as a pre- and post-processing step compared to not using them. Masking images during pre-processing limits the searchable range for matches, and although the pre-processing results show the largest correct match ratio, this clustering effect yields worse essential matrix estimation compared to post-processed matching as described above. The increase in the ratio of correct matches is nonetheless valuable for other applications. If the 3d landmarks have already been triangulated, 2d-2d matching before PnP (as is done in most visual localization pipelines) could be improved by using semantics. Essential matrix estimation performance is displayed in Table V, which again shows semantics as a post-processing step outperforming the baseline SuperGlue approach and semantics as a pre-processing step.

Since the interior of the JEM is shaped as a rectangular prism, images consist of mostly planar surfaces. Additionally, the microgravity environment results in many query-map image pairs having only relative rotation motion between the camera poses. Despite high ratios of correct matches on most sequences, these degenerate cases can cause high errors when estimating the essential matrix, especially the translation component. For this reason, SuperGlue results have much higher errors than the Astroloc evaluation method (where pose can be directly recovered from previously 3d-triangulated points using PnP).

VI CONCLUSIONS

We have presented a lightweight semantic consistency check for visual feature matching that improves the robustness of localization performance. We have shown that enforcing consistent semantic classes for feature matches improves both relocalization performance and essential matrix calculation as evaluated on a dataset of eight Astrobee activities on the ISS.

As this method is designed to be computationally efficient, we additionally plan to deploy and test our semantic relocalization approach on the Astrobee robots during future ISS activities to improve their resilience to environment changes.

In future work, we wish to explore using movable object detections as negative matches and weighting feature matches based on their semantics or lack thereof. Additionally, we are interested in further using semantic results to perform informed map updates at the object level.

ACKNOWLEDGMENT

We would like to thank Ian D. Miller and Suyoung Kang for supporting this work. We would also like to thank Marina Gouviea Moreira for her assistance during testing in the Granite Lab and the rest of the Astrobee Facilities team for their help.

References

  • [1] T. Smith, J. Barlow, M. Bualat, T. Fong, C. Provencher, H. Sanchez, E. Smith et al., “Astrobee: A new platform for free-flying robotics on the International Space Station,” in Int. Symp. on Artificial Intelligence, Robotics and Automation in Space, 2016.
  • [2] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
  • [3] A. Cramariuc, L. Bernreiter, F. Tschopp, M. Fehr, V. Reijgwart, J. Nieto, R. Siegwart, and C. Cadena, “maplab 2.0–a modular and multi-modal mapping framework,” IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 520–527, 2022.
  • [4] I. Miller, R. Soussan, B. Coltin, T. Smith, and V. Kumar, “Robust semantic mapping and localization on a free-flying robot in microgravity,” in 2022 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2022.
  • [5] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, “Ds-slam: A semantic visual slam towards dynamic environments,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS).   IEEE, 2018, pp. 1168–1174.
  • [6] S. Kang, R. Soussan, D. Lee, B. Coltin, A. M. Vargas, M. Moreira, K. Hamilton, R. Garcia, R. Bualat, T. Smith, J. Barlow, J. Benavides, E. Jeong, and P. Kim, “Astrobee iss free-flyer datasets for space intra-vehicular robot navigation research,” in 2023 IEEE Robotics and Automation Letters (RA-L), Under Review.
  • [7] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision.   IEEE, 2011, pp. 2564–2571.
  • [8] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
  • [9] S. Leutenegger, M. Chli, and R. Y. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 International conference on computer vision.   IEEE, 2011, pp. 2548–2555.
  • [10] A. Alahi, R. Ortiz, and P. Vandergheynst, “Freak: Fast retina keypoint,” in 2012 IEEE conference on computer vision and pattern recognition.   IEEE, 2012, pp. 510–517.
  • [11] J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104–4113.
  • [12] A. Gawel, C. Del Don, R. Siegwart, J. Nieto, and C. Cadena, “X-view: Graph-based semantic multi-view localization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1687–1694, 2018.
  • [13] Y. Liu, Y. Petillot, D. Lane, and S. Wang, “Global localization with object-level semantics and topology,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 4909–4915.
  • [14] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, “Vso: Visual semantic odometry,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 234–250.
  • [15] Y. Bao, Z. Yang, Y. Pan, and R. Huan, “Semantic-direct visual odometry,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 6718–6725, 2022.
  • [16] L. An, X. Zhang, H. Gao, and Y. Liu, “Semantic segmentation–aided visual odometry for urban autonomous driving,” International Journal of Advanced Robotic Systems, vol. 14, no. 5, p. 1729881417735667, 2017.
  • [17] Y. Wang and A. Zell, “Improving feature-based visual slam by semantics,” in 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), 2018, pp. 7–12.
  • [18] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Probabilistic data association for semantic slam,” in 2017 IEEE international conference on robotics and automation (ICRA).   IEEE, 2017, pp. 1722–1729.
  • [19] J. Civera, D. Gálvez-López, L. Riazuelo, J. D. Tardós, and J. M. M. Montiel, “Towards semantic slam using a monocular camera,” in 2011 IEEE/RSJ international conference on intelligent robots and systems.   IEEE, 2011, pp. 1277–1284.
  • [20] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
  • [21] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, “Kimera: an open-source library for real-time metric-semantic localization and mapping,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 1689–1696.
  • [22] Y. Chang, Y. Tian, J. P. How, and L. Carlone, “Kimera-multi: a system for distributed multi-robot metric-semantic simultaneous localization and mapping,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 11 210–11 218.
  • [23] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
  • [24] C. Elich, I. Armeni, M. R. Oswald, M. Pollefeys, and J. Stueckler, “Learning-based relational object matching across views,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 5999–6005.
  • [25] B. Coltin, J. Fusco, Z. Moratto, O. Alexandrov, and R. Nakamura, “Localization from visual landmarks on a free-flying robot,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 4377–4382.
  • [26] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in VISAPP (1), 2009, pp. 331–340.
  • [27] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng, “Complete solution classification for the perspective-three-point problem,” IEEE transactions on pattern analysis and machine intelligence, vol. 25, no. 8, pp. 930–943, 2003.
  • [28] R. Soussan, V. Kumar, B. Coltin, and T. Smith, “Astroloc: An efficient and robust localizer for a free-flying robot,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 4106–4112.
  • [29] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236.
  • [30] D. Nistér, “An efficient solution to the five-point relative pose problem,” IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 6, pp. 756–770, 2004.