Exploring Event Camera-based Odometry for Planetary Robots
several locations. Mars helicopters are candidate platforms to scout multiple lava tubes throughout a single mission. However, Mars helicopters cannot fly LiDARs and have to rely on passive cameras for navigation. Frame cameras are ill-suited to explore lava tubes because of the HDR conditions created by the shadow at the entrance of the tube, as well as the low-light conditions once inside. This capability gap is filled by event cameras, which offer the potential to explore and map lava tubes for tens of meters using residual light from the entrance.

Mars helicopters come with their own requirements on the state estimation system [15], [10]. They must rely on small, passive, lightweight cameras to observe the full state up to scale and gravity direction. The camera is fused with an inertial measurement unit (IMU), which makes gravity observable, enables a high estimation rate, and acts as an emergency landing sensor in case of camera failure. Finally, a laser range finder is used to observe scale in the absence of accelerometer excitation. The estimation backend must be able to handle the feature-depth uncertainty associated with helicopter hovering and rotation-only dynamics. Due to this uncertainty, successful feature triangulation is often inhibited in these cases, leading to failure of optimization-based backends, which critically rely on triangulated features. By contrast, filter-based approaches leverage priors to initialize depth measurements and thus do not suffer from this issue [16]. This proved critical in the Ingenuity Mars helicopter's sixth flight on Mars, where an image-timestamping anomaly caused roll and pitch oscillations greater than 20 degrees [17]. Such rotations cause a loss of features, which can lead to estimation failure in non-filter-based state estimation approaches, which are fundamentally unable to handle the depth uncertainty of new feature tracks without a dedicated re-initialization procedure.

State-of-the-art event-based VIO methods are unsuitable in these conditions, since they either (i) use optimization-based backends, which do not model depth uncertainty and thus feature brittle performance in mission-typical rotation-only motion or when a significant portion of features is lost [6], or (ii) show a higher tracking error due to the use of suboptimal event-based frontends [18]. Image-based VIO methods such as [15], [19] have addressed this by using depth priors [15] or motion classification [19].

In this work, we introduce EKLT-VIO, which builds on the EKF backend in [15], which handles pure rotational motion, and combines it with the state-of-the-art event-based feature tracker EKLT [20], thereby addressing the limitations above. EKLT-VIO is accurate, outperforming previous state-of-the-art frame-based and event-based methods on the challenging Event-Camera Dataset [21], with a 32% improvement in pose accuracy. Moreover, by leveraging depth uncertainty, it reduces its reliance on feature triangulation, which both increases robustness during purely rotational motion and facilitates rapid initialization. Both are limitations of existing optimization-based methods, which require lengthy bootstrapping sequences that would be impractical on Mars. Additionally, EKLT-VIO maintains a state estimate even when frame-based methods fail due to excessive motion blur. We show that our event-based EKLT frontend has a higher tracking performance than existing methods on newly collected data in Mars-like conditions. This demonstrates the viability of our EKLT-VIO on Mars. Our contributions are:

• We introduce EKLT-VIO, an event-based VIO method that combines the accurate state-of-the-art event-based feature tracker EKLT with an EKF backend. It outperforms state-of-the-art event- and frame-based methods, reducing the overall tracking error by 32%.
• We show accurate and robust tracking even in rotation-only sequences, which are closest to the hover-like scenarios experienced by Mars helicopters, outperforming optimization-based and frame-based methods.
• We outperform existing methods on newly collected Mars-like sequences recorded in the JPL Mars Yard and Wells Cave for planetary exploration.

II. RELATED WORK

Frame-based VIO: An overview of existing approaches is given in [22]. Frame-based VIO algorithms can be roughly segmented into two classes: optimization-based and filter-based algorithms [22]. While both classes track camera poses by minimizing visual and inertial residuals, optimization-based methods do so through iterative Gauss-Newton steps, while filter-based methods do so through Kalman filter updates.

Since optimizing both 3D landmarks (i.e., SLAM features) and camera poses is costly, several filter-based techniques exist that refine camera poses directly from bearing measurements (i.e., multi-state constraint Kalman filter (MSCKF) features [23]). However, MSCKF features need translational motion and provide updates only after the full feature track is known. The filter-based approach xVIO [15] combines the advantages of both feature types, offering robustness to depth uncertainty in rotation-only motion and computational efficiency with many MSCKF features.

Event-based VIO: The first event-based, 6-DOF visual odometry (VO) algorithms only started to appear recently [24], [25]. Later work incorporated an IMU to improve tracking performance and stability [26], [18], achieving impressive tracking on a fast-spinning leash [26]. Despite their robustness, these methods are affected by drift due to the differential nature of the sensors used. This is why Ultimate SLAM (USLAM) [6] used a combination of events, frames, and IMU, all provided by the Dynamic and Active Vision Sensor (DAVIS) [27]. It tracks FAST corners [28] on frames and on motion-compensated event frames separately using the Lucas-Kanade tracker (KLT) [29], and fuses these feature tracks with IMU measurements in a sliding window.

While addressing drift, USLAM still relies on a sliding-window optimization scheme, which is expensive and does not allow pose-only optimization through the use of MSCKF features. Moreover, its FAST/KLT frontend, first introduced in [26], is optimized explicitly for frame-like inputs and was shown to transfer suboptimally to event-based frames [20]. In this work, we incorporate the state-of-the-art event-based tracker EKLT [20], which takes a more principled approach
Fig. 2: We combine the feature tracker EKLT, which uses frames and events, with the filter-based backend xVIO to enable low-translation state estimation. In contrast to standard, frame-based VIO, an additional synchronization step converts asynchronous tracks to synchronous matches, which are used by the backend. This enables variable-rate backend updates.
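The synchronization step mentioned above lends itself to a short illustration. Below is a minimal sketch, assuming each asynchronous track is a time-stamped position history that is linearly interpolated at a common query time; the data layout and names are our own, not the paper's implementation.

```python
import numpy as np

def synchronize_tracks(tracks, t_query):
    """Sample asynchronous feature tracks at a common time t_query.

    tracks: dict of feature id -> (timestamps, positions), where
            timestamps is a sorted 1-D array and positions is (N, 2).
    Returns a dict of feature id -> interpolated (x, y), containing
    only the features whose track spans t_query.
    """
    matches = {}
    for fid, (ts, xy) in tracks.items():
        if ts[0] <= t_query <= ts[-1]:
            # Linear interpolation of x and y at the query time.
            x = np.interp(t_query, ts, xy[:, 0])
            y = np.interp(t_query, ts, xy[:, 1])
            matches[fid] = (x, y)
    return matches
```

The resulting synchronous matches can then be consumed by the EKF backend at a variable update rate.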
prior, and discards the MSCKF feature track [35]. This depth prior is especially useful during pure rotation or initialization, where few features can be triangulated, since it can directly contribute to reducing the state covariance.
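As a schematic illustration of this mechanism, the sketch below seeds a feature with a fixed inverse-depth prior and a deliberately large variance, so that an EKF update can shrink the covariance even before triangulation succeeds. The parameterization and the numbers (2 m prior depth, unit standard deviation) are illustrative assumptions, not xVIO's actual values.

```python
import numpy as np

def init_slam_feature(bearing, rho_prior=0.5, sigma_rho=1.0):
    """Seed a SLAM feature with an inverse-depth prior.

    bearing:   unit-norm 3-D bearing vector of the new feature.
    rho_prior: prior inverse depth (0.5 1/m, i.e., 2 m depth).
    sigma_rho: prior standard deviation of the inverse depth.

    Returns the initial feature state (bearing + inverse depth)
    and its covariance block. The large inverse-depth variance is
    what lets subsequent filter updates reduce the state
    covariance even when the feature cannot be triangulated.
    """
    state = np.append(bearing, rho_prior)
    cov = np.diag([1e-4, 1e-4, 1e-4, sigma_rho**2])
    return state, cov
```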
B. Frontend

Here we provide a summary of our EKLT frontend and refer the reader to [20] for more details. EKLT tracks Harris corners, extracted on frames, by aligning the predicted and measured brightness increments in a patch around the corners. It minimizes the normalized distance between these patches to recover the warping parameters p and the normalized optical flow v as

\{p, v\} = \arg\min_{p,v} \left\| \frac{\Delta L(u)}{\|\Delta L(u)\|} - \frac{\Delta \hat{L}(u, p, v)}{\|\Delta \hat{L}(u, p, v)\|} \right\|. \quad (6)

While ΔL is defined as an aggregation of events in a local patch, ΔL̂ is defined as the negative dot product between the local log-image gradient and the optical flow vector, following the linearized event-generation model [36]. Here, W(u, p) aligns the image gradient with the measured brightness increments according to the alignment parameters p. EKLT minimizes Eq. (6) using Gauss-Newton and the Ceres library [37], and recovers the alignment parameters p and the optical flow v.
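To make the structure of Eq. (6) concrete, the following sketch minimizes the normalized distance between the measured and predicted brightness-increment patches over a reduced warp (pure translation) and a 2-D flow, with SciPy's least-squares solver standing in for Ceres. The simplified parameterization and all names are our assumptions; the actual EKLT warp and solver differ.

```python
import numpy as np
from scipy.ndimage import shift as im_shift
from scipy.optimize import least_squares

def normalize(patch):
    # Normalized patch; guard against an all-zero increment.
    n = np.linalg.norm(patch)
    return patch / n if n > 0 else patch

def residual(params, dL_meas, grad_x, grad_y):
    """Residual of Eq. (6) for a translation-only warp W(u, p).

    params = (tx, ty, vx, vy): patch translation p and flow v.
    dL_meas: measured brightness increment (event aggregation).
    grad_x, grad_y: log-image gradient patches at the feature.
    """
    tx, ty, vx, vy = params
    # Predicted increment: negative dot product of the warped
    # gradient and the flow (linearized event-generation model).
    gx = im_shift(grad_x, (ty, tx), order=1)  # shift is (row, col)
    gy = im_shift(grad_y, (ty, tx), order=1)
    dL_pred = -(gx * vx + gy * vy)
    return (normalize(dL_meas) - normalize(dL_pred)).ravel()

def align_patch(dL_meas, grad_x, grad_y):
    # Nonzero initial flow avoids an all-zero predicted patch.
    p0 = np.array([0.0, 0.0, 1.0, 0.0])
    sol = least_squares(residual, p0, args=(dL_meas, grad_x, grad_y))
    # sol.cost can be thresholded to reject outlier tracks quickly,
    # as in the outlier-rejection step described below.
    return sol.x[:2], sol.x[2:], sol.cost  # warp p, flow v, residual
```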
As opposed to the reference implementation of EKLT, which optimizes in a sliding-window fashion after a fixed number of events, we trigger the optimization only when an adaptive number of events is reached, using each event batch only once. This entails a significant speed-up without loss in accuracy.

This variable rate allows our algorithm to adapt to the scene dynamics (Fig. 3), leading to fewer EKF updates in slow sequences (Fig. 3, left) and a lower tracking error during high-speed sequences, compared to fixed-rate updating. These features motivate the use of an event-based frontend, since a purely frame-based one is limited by the frame rate of the camera. Although this may lead to drift in purely stationary environments where no events are triggered, this can easily be amended by enforcing a minimal backend update rate, or by enforcing a no-motion prior when the event rate drops below a threshold, as in [6].
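As an illustration of this event-triggered scheme and of the minimal-rate fallback, consider the following buffering logic; the batch-size heuristic, default values, and class name are illustrative assumptions rather than EKLT-VIO's exact implementation.

```python
import time

class AdaptiveBatcher:
    """Buffer events and trigger optimization on adaptive batches.

    A minimal-rate fallback guards against drift in stationary
    scenes where no (or too few) events are triggered.
    """

    def __init__(self, base_batch=50, min_rate_hz=10.0):
        self.base_batch = base_batch
        self.min_period = 1.0 / min_rate_hz
        self.buffer = []
        self.last_update = time.monotonic()

    def batch_size(self, gradient_strength):
        # Strong local gradients produce more events per pixel of
        # motion, so allow larger batches; clamp to bound latency.
        return int(min(max(self.base_batch * gradient_strength, 10), 500))

    def add_event(self, event, gradient_strength=1.0):
        self.buffer.append(event)
        now = time.monotonic()
        if (len(self.buffer) >= self.batch_size(gradient_strength)
                or now - self.last_update > self.min_period):
            batch, self.buffer = self.buffer, []  # use each batch once
            self.last_update = now
            return batch  # caller runs alignment / EKF update on it
        return None
```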
Outlier rejection: For EKLT, we reject outliers exclusively by setting a maximum threshold on the optimized residual of the alignment score in Eq. (6). This allows outliers to be rejected quickly, without the need for costly geometric verification such as 8-point RANSAC.

IV. EXPERIMENTS

We start by validating our approach on standard benchmarks in Sec. IV-B, where we compare the performance of EKLT-VIO against state-of-the-art event-based [18], frame-based [15], and event- and frame-based [6] methods. To study the effect of the event-based feature tracker, we also study an additional baseline based on the HASTE feature tracker [38]. We then proceed to demonstrate the suitability of our approach on two important use cases motivated by the Mars exploration scenario: (i) pure rotational motion, imitating hover-like conditions on Mars (Sec. IV-C), and (ii) challenging HDR conditions on newly collected datasets in the JPL Mars Yard and at the entrance of the Wells Cave, emulating the entry into lava tubes (Sec. IV-D).
Dataset USLAM* [6] USLAM [6] EVIO [18] KLT-VIO [15] HASTE-VIO EKLT-VIO (ours)
MPE MYE MPE MYE MPE MYE MPE MYE MPE MYE MPE MYE
Boxes 6DOF 0.30 0.04 0.68 0.03 4.13 0.92 0.97 0.05 2.03 0.03 0.84 0.09
Boxes Translation 0.27 0.02 1.12 2.62 3.18 0.67 0.33 0.08 2.55 0.46 0.48 0.25
Dynamic 6DOF 0.19 0.10 0.76 0.09 3.38 1.20 0.78 0.03 0.52 0.06 0.79 0.06
Dynamic Translation 0.18 0.15 0.63 0.22 1.06 0.25 0.55 0.06 1.32 0.06 0.40 0.04
HDR Boxes 0.37 0.03 1.01 0.31 3.22 0.15 0.42 0.02 1.75 0.09 0.46 0.06
HDR Poster 0.31 0.05 1.48 0.09 1.41 0.13 0.77 0.03 0.57 0.02 0.65 0.04
Poster 6DOF 0.28 0.07 0.59 0.03 5.79 1.84 0.69 0.02 1.50 0.03 0.35 0.02
Poster Translation 0.12 0.04 0.24 0.02 1.59 0.38 0.16 0.02 1.34 0.02 0.35 0.03
Shapes 6DOF 0.10 0.04 1.07 0.03 2.52 0.61 1.80 0.03 2.35 0.02 0.60 0.03
Shapes Translation 0.26 0.06 1.36 0.01 4.56 2.60 1.38 0.02 1.09 0.02 0.51 0.03
Average 0.24 0.06 0.89 0.34 3.08 0.88 0.79 0.04 1.50 0.08 0.54 0.07
*Per-sequence hyperparameter tuning and correct IMU bias initialization.
TABLE II: Pose estimate accuracy comparison on the Event-Camera Dataset [21] in terms of mean position error (MPE) in
% and mean yaw error (MYE) in deg/m. Grayed-out results with (*) by USLAM [6] were achieved through per-sequence
parameter tuning and correct IMU bias initialization, while results in black used a single parameter set, tuned on all sequences
simultaneously, and were initialized with an IMU bias of zero.
TABLE III: Mean position and yaw error (MPE and MYE) in % and deg/m on rotation-only sequences.
mean position error (MPE) in % of the total trajectory length and mean yaw error (MYE) in deg/m in Tab. II.
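For reference, both metrics can be computed as sketched below, assuming estimated and ground-truth trajectories that are already aligned and sampled at matched timestamps; MYE is computed as the mean absolute yaw error divided by trajectory length, one plausible reading of deg/m, and the benchmark's exact alignment protocol is not reproduced.

```python
import numpy as np

def mpe_mye(gt_pos, est_pos, gt_yaw, est_yaw):
    """Mean position error (% of trajectory length) and mean yaw
    error (deg per meter traveled) for aligned trajectories.

    gt_pos, est_pos: (N, 3) positions at matched timestamps.
    gt_yaw, est_yaw: (N,) yaw angles in degrees.
    """
    # Total ground-truth trajectory length in meters.
    length = np.sum(np.linalg.norm(np.diff(gt_pos, axis=0), axis=1))
    pos_err = np.linalg.norm(gt_pos - est_pos, axis=1)
    mpe = 100.0 * np.mean(pos_err) / length
    # Wrap yaw differences into [-180, 180) before averaging.
    yaw_err = (est_yaw - gt_yaw + 180.0) % 360.0 - 180.0
    mye = np.mean(np.abs(yaw_err)) / length
    return mpe, mye
```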
In [6], USLAM uses different parameters for each sequence and correct IMU bias initialization, resulting in the gray columns in Tab. II. We mark this method as USLAM*. However, on Mars, VIO systems should perform robustly in unknown environments, making parameter tuning and bias initialization infeasible. For this reason, we retune the parameters of USLAM to perform best on all sequences simultaneously, resulting in the black values in Tab. II. All other methods were tuned in the same way. Comparing USLAM* with USLAM shows that IMU bias initialization and per-sequence hyperparameter tuning are clearly important to achieve a low tracking error, reducing the error from 0.89% to 0.24%. Our EKLT-VIO, on the other hand, achieves an average error of 0.54% without bias initialization, 39% lower than USLAM. This improvement indicates that EKLT-VIO is simultaneously more robust to zero IMU bias initialization and less dependent on per-sequence hyperparameter tuning.

In terms of position error, EKLT-VIO outperforms all other methods on 5 out of 10 sequences. With an average MPE of 0.54%, EKLT-VIO shows a 32% lower MPE than the runner-up KLT-VIO with 0.79%. Finally, with a 3.08% MPE, EVIO [18] is outperformed by EKLT-VIO by 82%.

C. Rotation-only sequences

As a next step, we show the suitability of EKLT-VIO in a Mars mission-like scenario. To do this, we evaluate all methods on the rotation-only sequences of the Event-Camera Dataset, which are challenging for optimization-based backends such as USLAM [6]. Similar to the hover-like conditions expected during Mars missions, these sequences contain only little translation compared to the average scene depth, which poses a challenge for keyframe generation and triangulation.

We adopt the same evaluation protocol as before and report results for all methods in Tab. III. We observed during this experiment that USLAM did not initialize on these sequences, since it could never detect sufficient translation to insert a new keyframe; it is thus marked as unfeasible. Frame-based KLT-VIO tracks well for the first 30 s, but diverges in the second part, where rapid shaking motion causes motion blur on the frames and high feature displacements, both of which significantly impact the accuracy of the KLT frontend. This leads to a diverging state estimate. By contrast, the event-based methods EKLT-VIO and HASTE-VIO track robustly, because their event-based frontends are unaffected by motion blur. EKLT-VIO, however, is the only method to converge on all sequences and yields a consistently lower tracking error than all compared methods. In summary, EKLT-VIO leverages the advantages of event-based frontends for robust high-speed tracking and the advantages of a filter-based backend to fuse small translational motions. This shows that EKLT-VIO is most suitable in these conditions.

D. Mars-mission Scenario: Wells Cave and JPL Mars Yard

Finally, we show the capabilities of EKLT-VIO in Mars-like exploration scenarios by comparing it to the image-based methods KLT-VIO [16], ORB-SLAM3 [41], OpenVINS [42], VINS-Mono [3], and ROVIO [40] on sequences recorded at the JPL Mars Yard (Fig. 4 (a)) and Wells Cave Nature
(a) Mars Yard preview (b) Overexposed image (c) Recons. from events (d) Trajectories
(e) Wells Cave preview (f) Underexposed image (g) Recons. from events (h) Trajectories
Fig. 4: In the Mars Yard (a) we test HDR conditions, which cause severe oversaturation artefacts in standard images (b). In the Wells Cave (e), instead, we study the low-light scenarios encountered in lava tubes, which cause undersaturation (f). HDR images reconstructed from events [39] (c, g) do not suffer from these artefacts and are used by our method. As a result, we outperform the existing frame-based approaches KLT-VIO [15] and ROVIO [40] on both trajectories.
(a) Realtime factor (b) Realtime factor vs. Event rate [Me/s] (c) Computational pie chart (EKLT, feature management, visual update)
Fig. 6: Real-time factor (RTF) (a) for EKLT-VIO (orange), HASTE-VIO (green) and KLT-VIO (blue) on Poster 6DOF. The
RTF per tracked feature (b) increases with the event rate. Our method can process 89’000 events per second when tracking
45 features. As seen in (c), EKLT-VIO spends most of its computation time tracking features.
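For context, a real-time factor of this kind can be measured as sketched below; this is a generic timing harness with our own naming, not the paper's profiling code.

```python
import time

def realtime_factor(process_fn, events, duration_s, n_features):
    """Return the real-time factor (RTF) and throughput.

    RTF = wall-clock processing time / sequence duration; RTF <= 1
    means the pipeline keeps up in real time. process_fn consumes
    the full event stream of a sequence once.
    """
    start = time.monotonic()
    process_fn(events)                    # run the pipeline once
    elapsed = time.monotonic() - start
    rtf = elapsed / duration_s
    events_per_s = len(events) / elapsed  # processing throughput
    rtf_per_feature = rtf / n_features    # as plotted in Fig. 6(b)
    return rtf, events_per_s, rtf_per_feature
```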