GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for
3D Object Detection

Xiaotian Li1∗  Baojie Fan1  Jiandong Tian 2  Huijie Fan2
1Nanjing University of Posts and Telecommunications
2Shenyang Institute of Automation Chinese Academy of Science
{xiaotianli981, jobfbj}@gmail.com{tianjd, fanhuijie}@sia.cn
Equal Contribution.Corresponding Author.
Abstract

Recent years have witnessed the remarkable progress of 3D multi-modality object detection methods based on the Bird’s-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose a novel multi-modality 3D objection detection method, named GAFusion, with LiDAR-guided global interaction and adaptive fusion. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth information. In the following, LiDAR-guided adaptive fusion transformer (LGAFT) is developed to adaptively enhance the interaction of different modal BEV features from a global perspective. Meanwhile, additional downsampling with sparse height compression and multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of different modal features. Finally, a temporal fusion module is introduced to aggregate features from previous frames. GAFusion achieves state-of-the-art 3D object detection results with 73.6%percent\%% mAP and 74.9%percent\%% NDS on the nuScenes test set.

1 Introduction

3D object detection is a crucial task in autonomous driving. To cope with the complex road scenarios, multiple sensors (LiDARs or cameras) are usually employed for scene understanding. LiDAR can generate accurate but sparse 3D point clouds, which contains precise spatial information. Images have rich semantic and texture information, but lack depth information. Therefore, a natural operation is to extensively fuse LiDAR and camera to leverage the complementarity of multi-modality information, which can enable the autonomous driving system to achieve higher accuracy and robustness.

Refer to caption
Figure 1: Comparison between BEVFusion and the proposed GAFusion. (a) In BEVFusion, the camera stream and the LiDAR stream separately generate BEV features, which are then concatenated together. (b) In GAFusion, the camera modality BEV features are generated by multiple guidance from the LiDAR stream, and the receptive fields are enhanced by MSDPT. The BEV features are fused by LGAFT. “VT” is view transformer.

Recently, fusing LiDAR and camera has achieved some progress. Early methods [4, 14, 45, 35] achieve LiDAR-camera fusion by projecting 3D LiDAR point clouds (or region proposals) onto 2D images. But these methods overlook the information gap between the two modalities. Recent works [22, 26, 1, 40, 39, 42, 13, 3] adopt different query generation strategies or create a unified Bird’s-Eye-View (BEV) [28] intermediate feature to fuse multi-modality features. For instance, TransFusion [1] applies a two-stage pipeline to fuse the camera and LiDAR features, but its performance relies on the query initialization strategy. BEVFusion [22, 26] explores a unified representation for BEV features through view transformation, which not only preserves the spatial information of sparse LiDAR point clouds, but also lifts the 2D images to the 3D features, effectively maintaining the consistency between the two modalities. However, the camera modality still struggles with geometric perception information, which limits the complementarity between LiDAR and camera. As shown in Fig. 1(a), there is no interaction between both modalities.

To tackle the above challenges, we propose an effective 3D multi-modality object detection method, named GAFusion. Within it, a LiDAR guidance module is developed, which consists of sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG). SDG combines the sparse depth generated by LiDAR point clouds and the camera features to produce depth-aware features, enhancing the sensitivity of camera features to depth information. Inspired by the occupancy task [12], LOG guides a 3D feature volume generated by view transformation with occupancy features, and focuses on the targets in the 3D feature volume, thus providing more valuable information for fusion. Then, we construct a multi-scale dual-path transformer (MSDPT) module to improve the interactions around the objects and expand the receptive fields of the 3D feature volume. With the above designs, the camera modality has sufficient semantic features and more accurate depth distribution. In the following, to obtain abundant features in the LiDAR modality, we perform additional downsampling on the LiDAR point clouds and use sparse depth compression to aggregate features from different scales. This operation can provide larger receptive fields with less computation and memory consumption. Moreover, a LiDAR-guided adaptive fusion transformer (LGAFT) module is proposed to effectively fuse BEV features generated by LiDAR point clouds and images. In this module, the LiDAR BEV features adaptively guide the camera BEV features to strengthen the cross-modality interaction from global scope.

All of the above operations are evaluated on single-frame raw data. In order to further explore the target correlation and motion consistency among multiple successive frames, a temporal fusion module is designed. To be specific, we store the BEV features of different frames in memory buffer, which is used to fuse the features of the previous frame and the current frame.

Our contributions are summarized as follows:

1) We propose GAFusion, a novel 3D object detection method that leverages LiDAR guidance to compensate for depth distribution of the camera features, and provides sufficient spatial information for the camera features.

2) We design LiDAR-guided adaptive fusion transformer (LGAFT), which aims to enhance the global features interaction between the two modalities in an adaptive way, facilitating the fusion of semantic and geometric features.

3) We conduct extensive experiments on the nuScenes dataset to verify the effectiveness of our GAFusion. The experiments show that without using any augmentation strategies, our model achieves the state-of-the-art performances of 72.1%percent\%% mAP and 73.5%percent\%% NDS.

2 Related Work

Refer to caption
Figure 2: The overall architecture of GAFusion. The multi-view images and point clouds are fed into the corresponding backbone networks to obtain multi-scale LiDAR features and camera features. For LiDAR guidance, we propose sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to guide the 2D camera features by adopting the raw point clouds and LiDAR BEV features, respectively. In addition, we use multi-scale dual-path transformer (MSDPT) to enlarge the receptive fields. Then, LiDAR-guided adaptive fusion transformer (LGAFT) further fuses the two modalities’ BEV features. A temporal fusion module is introduced to aggregate the previous frame’s BEV features, and finally feeds these BEV features into an encoder and a detection head.

2.1 Single-modality 3D Object Detection

Single-modality 3D object detection, mainly including LiDAR-based 3D object detection and camera-based 3D object detection, has achieved remarkable progress in recent years.

LiDAR-based 3D object detection aims to predict 3D object bounding boxes using the point clouds captured from LiDAR. Existing methods [31, 48, 15, 33, 34, 18] either directly predict on point clouds, or convert point clouds into voxels or pillars. PointNet [31] is the first framework that processes point clouds in an end-to-end manner, by taking unordered point cloud sets as direct inputs and preserving the spatial structure of point clouds. VoxelNet [48] discretizes point clouds into voxels, and uses dense convolution to obtain BEV features.

Camera-based 3D object detection, which can be divided into two categories: image-view-based and BEV-based. DETR3D [38] and PETR [23] introduce transformer into the framework, wherein the former aggregates 2D features into 3D Query, and the latter embeds coordinate information into 2D features. They both use transformer to implicitly transform the image features to 3D space. BEVDet [11] and BEVDepth [16] predict the depth distribution to lift the image features to a 3D frustum meshgrid. The semantic or spatial information provided by a single-modality is still limited, despite the impressive performance achieved by the aforementioned detection tasks.

2.2 Multi-modality 3D Object Detection

Multi-sensor fusion has gained great attention in 3D detection due to its superior performance. Previous works [4, 14, 45, 35] fuse 3D point cloud features and 2D image features by projecting the former onto the latter. MV3D [4] associates 3D proposals with 2D RoI features and converts 3D representation into 2D pseudo images, enabling the network to leverage 2D convolutions for geometric refinement. PointPainting [36] enriches point clouds with semantic labels from images. However, the above methods fail to fully exploit the dense semantic information in images. Recently, BEVFusion [22, 26] are proposed to fuse LiDAR features and camera features in BEV space and apply a lift-splat-shoot (LSS) [30] operation to project image features, resulting in semantic-rich features. Before fusing BEV features from two modalities, LiDAR point clouds and images do not interact at all. From another perspective, CMT [40] proposes a novel end-to-end transformer-based 3D object detection framework, which implicitly encodes 3D point clouds into multi-modality tokens. Inspired by the above works, we propose the global interaction and adaptive BEV fusion that achieves significant performance improvement while maintaining the simplicity of the framework.

2.3 Occupancy Task

Recently, 3D occupancy prediction (Occ) [12] has been proposed as a novel 3D detection task. Based on FB-BEV [21], FB-OCC [20] emphasizes the importance of model scale and pre-training. OccDepth [29] leverages the implicit depth information from depth images (or RGBD images) to help recover the 3D geometry. OccFormer [47] utilizes a dual-path transformer for efficient long-range encoding of 3D voxel features generated by cameras. VoxFromer [17] adopts a two-stage design and generates a set of sparse visible and occupied voxel queries from depth estimation. OpenOccupancy [37] is the first omnidirectional semantic occupancy perception benchmark. We notice that 3D occupancy prediction can provide more fine-grained and comprehensive 3D perception capabilities.

2.4 Temporal Fusion

Temporal fusion adopts multiple frames of images or point clouds to improve the performance of 3D object detection, as it can enhance the perception system’s understanding and prediction of dynamic scenes. 3D-VID [46] employs a bidirectional recurrent neural network (Bi-RNN) [32] to model the temporal sequences of multiple point clouds, capturing the motion information and state changes of the targets. BEVDet4D [10] fuses BEV features from different time sequences by coordinate alignment. BEVformer [19] is a transformer-based 3D object detection model that uses BEV to represent the scene and exploits multiple frames of images for spatial-temporal information fusion. After all, temporal fusion is an effective technique that enhances the continuity and relevance among different frames, and it can utilize the information from multiple frames to enrich the feature representation of each frame.

3 Method

The overall architecture of GAFusion is illustrated in Fig. 2. We feed the LiDAR point clouds and the corresponding multi-view images into the backbone to extract dual-stream features. The LiDAR stream first uses additional downsampling and sparse depth compression to obtain BEV features(Sec. 3.1). The design of LiDAR guidance, which includes sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG), is detailed in Sec. 3.2. After LiDAR guidance, we adopt multi-scale dual-path transformer (MSDPT) module to enlarge the receptive fields of camera features (Sec. 3.3). Then, the proposed LiDAR-guided adaptive fusion transformer (LGAFT) module is utilized to fuse different modalities of the BEV features (Sec. 3.4). We also introduce the temporal fusion module to appropriately fuse the information from the previous frame (Sec. 3.5).

3.1 LiDAR and Camera Features Extraction

In the high-level feature extraction stage, we adopt a dual-stream approach to process the LiDAR point clouds and the multi-view images separately.

For the LiDAR stream, a common method [22, 26, 1] is to use 3D sparse convolution [41] to extract single-scale features from the voxelized point clouds, which has a weak feature representation with limited receptive fields. Therefore, we use additional downsampling layers to compensate for this deficiency. The common sparse convolution features have strides of 1,2,4,8, and the output sparse features are named F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, F2subscript𝐹2F_{2}italic_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, F3subscript𝐹3F_{3}italic_F start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, F4subscript𝐹4F_{4}italic_F start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT respectively. We adopt two additional downsampling layers with strides of 16,32 to obtain the F5subscript𝐹5F_{5}italic_F start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT, F6subscript𝐹6F_{6}italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT features. Finally, to effectively combine different scales of F4subscript𝐹4F_{4}italic_F start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT, F5subscript𝐹5F_{5}italic_F start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT, F6subscript𝐹6F_{6}italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT features and maintain geometric and positional information, we use sparse depth compression to process different scales of the features. Specifically, we first align the spatial resolutions of F5subscript𝐹5F_{5}italic_F start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT, F6subscript𝐹6F_{6}italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT with F4subscript𝐹4F_{4}italic_F start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT. For stage i𝑖iitalic_i, Fisubscript𝐹𝑖F_{i}italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a set of individual features. pPi𝑝subscript𝑃𝑖p\in P_{i}italic_p ∈ italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a position in 3D space, with the coordinate (xpsubscript𝑥𝑝x_{p}italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT, ypsubscript𝑦𝑝y_{p}italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT, zpsubscript𝑧𝑝z_{p}italic_z start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT). In addition, we design a BEV grid of size (xpsubscript𝑥𝑝x_{p}italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT,ypsubscript𝑦𝑝y_{p}italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT) that only contains Pcsubscript𝑃𝑐P_{c}italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, which aggregates the sparse features of different scales at the same height, and forms a rich BEV feature. The whole process is shown in Fig. 3. Sparse features Fcsubscript𝐹𝑐F_{c}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and their positions Pcsubscript𝑃𝑐P_{c}italic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT are obtained as follows:

Fc=F4(F5F6),subscript𝐹𝑐subscript𝐹4subscript𝐹5subscript𝐹6\displaystyle{F}_{c}=F_{4}\cup(F_{5}\cup F_{6}),\qquaditalic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = italic_F start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ∪ ( italic_F start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT ∪ italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT ) , (1)
P6={(xp×22,yp×22,zp×22)|pP6}superscriptsubscript𝑃6conditional-setsubscript𝑥𝑝superscript22subscript𝑦𝑝superscript22subscript𝑧𝑝superscript22𝑝subscript𝑃6\displaystyle P_{6}^{\prime}=\{(x_{p}\times 2^{2},\,y_{p}\times 2^{2},\,z_{p}% \times 2^{2})\,|\,p\in P_{6}\}italic_P start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = { ( italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × 2 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × 2 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × 2 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) | italic_p ∈ italic_P start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT }
P5={(xp×21,yp×21,zp×21)|pP5}superscriptsubscript𝑃5conditional-setsubscript𝑥𝑝superscript21subscript𝑦𝑝superscript21subscript𝑧𝑝superscript21𝑝subscript𝑃5\displaystyle P_{5}^{\prime}=\{(x_{p}\times 2^{1},\,y_{p}\times 2^{1},\,z_{p}% \times 2^{1})\,|\,p\in P_{5}\}italic_P start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = { ( italic_x start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × 2 start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × 2 start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × 2 start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) | italic_p ∈ italic_P start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT }
Pc=P4(P5P6).subscript𝑃𝑐subscript𝑃4superscriptsubscript𝑃5superscriptsubscript𝑃6\displaystyle{P}_{c}=P_{4}\cup(P_{5}^{\prime}\cup P_{6}^{\prime}).\qquaditalic_P start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = italic_P start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ∪ ( italic_P start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∪ italic_P start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) .

For the camera stream, Following the previous works [22, 26], we input multi-view images into backbone to obtain 2D image features FcNc×C×H×Wsubscript𝐹𝑐superscriptsubscript𝑁𝑐𝐶𝐻𝑊F_{c}\in\mathbb{R}^{N_{c}\times C\times H\times W}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT × italic_C × italic_H × italic_W end_POSTSUPERSCRIPT with sufficient semantic information, where Ncsubscript𝑁𝑐N_{c}italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, C𝐶Citalic_C, H𝐻Hitalic_H, W𝑊Witalic_W denote the number of cameras, feature size, image height and image width respectively.

3.2 LiDAR Guidance

Refer to caption
Figure 3: Additional downsampling and sparse height compression. This operation enlarges the receptive fields of the features and reduces the computational cost.

To integrate the camera high-level features into a unified BEV space, a view transformation is required, which first needs to project the 2D image features into 3D space. During this process, it is often difficult to estimate the depth distribution accurately, resulting in the loss of a lot of useful information in the BEV feature generated by the camera stream. To obtain a reliable depth distribution, our proposed LiDAR guidance consists of two parts: sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG). They can help the image features better capture accurate geometric and depth information.

Sparse Depth Guidance. As shown in Fig. 4, SDG first projects each point of the input LiDAR point clouds into multi-view images, and obtains sparse multi-view depth maps. Then, they are fed into a shared encoder to extract depth features, which are concatenated with image features to form the depth-aware camera features. They are used as the input of view transformation, and finally voxel pooling [9] is employed to generate the image 3D feature volume, which is denoted as Fcsuperscriptsubscript𝐹𝑐F_{c}^{\prime}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, FcC×Z×H×Wsuperscriptsubscript𝐹𝑐superscript𝐶𝑍𝐻𝑊F_{c}^{\prime}\in\mathbb{R}^{C\times Z\times H\times W}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_Z × italic_H × italic_W end_POSTSUPERSCRIPT. SDG can effectively incorporate LiDAR depth information and generate more accurate and reliable depth.

Refer to caption
Figure 4: The architecture of sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG). These two modules guide the 2D camera features to generate 3D features that contain sufficient semantic information and accurate depth information.

LiDAR Occupancy Guidance. Due to the sparsity and measurement noises of LiDAR point clouds, the depth information of some pixels is inaccurate. Therefore, LOG is proposed to address the aforementioned drawbacks, as shown in Fig. 4. Specifically, we first map LiDAR BEV features to 3D space to obtain the 3D features, and then attach an occupancy prediction head that estimates occupancy states to obtain the LiDAR 3D occupancy voxel, denoted as OL1×Z×H×Wsubscript𝑂𝐿superscript1𝑍𝐻𝑊O_{L}\in\mathbb{R}^{1\times Z\times H\times W}italic_O start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_Z × italic_H × italic_W end_POSTSUPERSCRIPT. It is worth noting that the resolution of OLsubscript𝑂𝐿O_{L}italic_O start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT is the same as that of Fcsuperscriptsubscript𝐹𝑐F_{c}^{\prime}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. The LiDAR 3D occupancy voxel is then multiplied by Fcsuperscriptsubscript𝐹𝑐F_{c}^{\prime}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT to obtain the LiDAR occupancy-guided image 3D feature volume Fc′′C×Z×H×Wsuperscriptsubscript𝐹𝑐′′superscript𝐶𝑍𝐻𝑊F_{c}^{\prime\prime}\in\mathbb{R}^{C\times Z\times H\times W}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_Z × italic_H × italic_W end_POSTSUPERSCRIPT using the following equation:

Fc′′=Mul(Fc,OL)superscriptsubscript𝐹𝑐′′𝑀𝑢𝑙superscriptsubscript𝐹𝑐subscript𝑂𝐿\displaystyle{F}_{c}^{\prime\prime}=Mul(F_{c}^{\prime},\,O_{L})italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT = italic_M italic_u italic_l ( italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_O start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ) (2)

Where Mul𝑀𝑢𝑙Mulitalic_M italic_u italic_l denotes element-wise multiplication with broadcasting operation. With the above designs, the 2D camera features contain sufficient semantic information and accurate depth information, which provide an excellent reference for subsequent module interactions.

3.3 Multi-Scale Dual-Path Transformer

To effectively aggregate semantic information and enlarge the receptive fields, we improved a multi-scale dual-path transformer (MSDPT), a module inspired by OccFormer [47]. Dual-path transformer (DPT) consists of a local path and a global path, which uses 3D convolution to perform downsampling to obtain features of different scales. The detailed structures of DPT are shown in Fig. 5. The local path is mainly used to extract fine-grained semantic structures. Since the height direction has less variation in 3D object detection, the local path only slices and processes the 3D feature volume extracted from the multi-view images in parallel along the horizontal direction. The global path attempts to acquire the semantic layout of the scene accurately. It first obtains BEV features by average pooling along the height dimension, and then interacts with the basic information of the BEV features. To improve computational efficiency, they both use windowed self-attention [25], and share weights. Finally, the 3D feature volume from the local path merges the sufficient semantic features from the global path. The dual-path outputs are FlocalC×X×Y×Zsubscript𝐹𝑙𝑜𝑐𝑎𝑙superscript𝐶𝑋𝑌𝑍F_{local}\in\mathbb{R}^{C\times X\times Y\times Z}italic_F start_POSTSUBSCRIPT italic_l italic_o italic_c italic_a italic_l end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_X × italic_Y × italic_Z end_POSTSUPERSCRIPT and FglobalC×X×Ysubscript𝐹𝑔𝑙𝑜𝑏𝑎𝑙superscript𝐶𝑋𝑌F_{global}\in\mathbb{R}^{C\times X\times Y}italic_F start_POSTSUBSCRIPT italic_g italic_l italic_o italic_b italic_a italic_l end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_X × italic_Y end_POSTSUPERSCRIPT, the combined output Foutsubscript𝐹𝑜𝑢𝑡F_{out}italic_F start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT is computed as:

Fout=Flocal+σ(WHFlocal)unsqueeze(Fglobal,1)subscript𝐹outsubscript𝐹local𝜎subscript𝑊𝐻subscript𝐹localunsqueezesubscript𝐹global1\displaystyle{F}_{\text{out}}={F}_{\text{local}}+\sigma({W_{H}}{F}_{\text{% local}})\cdot\text{unsqueeze}({F}_{\text{global}},-1)italic_F start_POSTSUBSCRIPT out end_POSTSUBSCRIPT = italic_F start_POSTSUBSCRIPT local end_POSTSUBSCRIPT + italic_σ ( italic_W start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT local end_POSTSUBSCRIPT ) ⋅ unsqueeze ( italic_F start_POSTSUBSCRIPT global end_POSTSUBSCRIPT , - 1 ) (3)

where WHsubscript𝑊𝐻W_{H}italic_W start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT refers to the aggregation weights along the height dimension generated by the FFN, σ()𝜎\sigma(\cdot)italic_σ ( ⋅ ) is the sigmoid function, and “unsqueeze” expands the global 2D features along the height.

Refer to caption
Figure 5: The schema of dual-path transformer (DPT), which effectively aggregates semantic information and expands the receptive fields of the camera features.

3.4 LiDAR-Guided Adaptive Fusion Transformer

Recent works [22, 26] simply concatenate different modalities of BEV features to obtain a shared BEV representation, which does not consider the information interaction and global spatial relevance among different modalities. To this end, LGAFT is developed to adaptively enhance the interaction of LiDAR BEV features FLBsubscript𝐹𝐿𝐵F_{LB}italic_F start_POSTSUBSCRIPT italic_L italic_B end_POSTSUBSCRIPT and camera BEV features FCBsubscript𝐹𝐶𝐵F_{CB}italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT from a global perspective. The detailed architecture is illustrated in Fig. 6. We use 1×1111\times 11 × 1 convolution to expand FLBsubscript𝐹𝐿𝐵F_{LB}italic_F start_POSTSUBSCRIPT italic_L italic_B end_POSTSUBSCRIPT and FCBsubscript𝐹𝐶𝐵F_{CB}italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT to appropriate channels, and concatenate the expand BEV features FLBsuperscriptsubscript𝐹𝐿𝐵F_{LB}^{\prime}italic_F start_POSTSUBSCRIPT italic_L italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and FCBsuperscriptsubscript𝐹𝐶𝐵F_{CB}^{\prime}italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT to obtain feature weights WFsubscript𝑊𝐹W_{F}italic_W start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT from a sigmoid function. Then, we adopt WFsubscript𝑊𝐹W_{F}italic_W start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT to fuse the LiDAR and camera BEV features adaptively, and the fused features are denoted as Fasubscript𝐹𝑎F_{a}italic_F start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT. The weights WFsubscript𝑊𝐹W_{F}italic_W start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT can be expressed as:

FLB=conv1×1(FLB)superscriptsubscript𝐹𝐿𝐵subscriptconv11subscript𝐹𝐿𝐵\displaystyle F_{LB}^{\prime}=\textrm{conv}_{1\times 1}(F_{LB})italic_F start_POSTSUBSCRIPT italic_L italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = conv start_POSTSUBSCRIPT 1 × 1 end_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT italic_L italic_B end_POSTSUBSCRIPT ) (4)
FCB=conv1×1(FCB)superscriptsubscript𝐹𝐶𝐵subscriptconv11subscript𝐹𝐶𝐵\displaystyle F_{CB}^{\prime}=\textrm{conv}_{1\times 1}(F_{CB})italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = conv start_POSTSUBSCRIPT 1 × 1 end_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT )
WF=σ(Concat(FLB,FCB))subscript𝑊𝐹𝜎Concatsuperscriptsubscript𝐹𝐿𝐵superscriptsubscript𝐹𝐶𝐵\displaystyle W_{F}=\sigma(\textrm{Concat}(F_{LB}^{\prime},F_{CB}^{\prime}))italic_W start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT = italic_σ ( Concat ( italic_F start_POSTSUBSCRIPT italic_L italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )
Refer to caption
Figure 6: The overview of LiDAR-guided adaptive fusion transformer (LGAFT). LGAFT adaptively enhances the interaction between LiDAR and camera BEV features from a global perspective.

Where “Concat” denotes the concatenate operation. To reduce the computation cost, we do not use the multi-head attention module in transformer structure. specifically, we adopt Fasubscript𝐹𝑎F_{a}italic_F start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT as the query of the cross-attention module. The adaptive camera features are regarded as the keys and values to avoid the gradient explosion convergence problem. Therefore, the final fused features FBEVsubscript𝐹𝐵𝐸𝑉F_{BEV}italic_F start_POSTSUBSCRIPT italic_B italic_E italic_V end_POSTSUBSCRIPT can be presented as:

Q=Concat((1WF)FLB,WF(FCB+P))WQK=WF(FCB+P)WKV=(FCB+P)WVFBEV=MLP(LN(Softmax(QKTC)V))𝑄Concat1subscript𝑊𝐹superscriptsubscript𝐹𝐿𝐵subscript𝑊𝐹superscriptsubscript𝐹𝐶𝐵𝑃subscriptW𝑄𝐾subscript𝑊𝐹superscriptsubscript𝐹𝐶𝐵𝑃subscriptW𝐾𝑉superscriptsubscript𝐹𝐶𝐵𝑃subscriptW𝑉subscript𝐹𝐵𝐸𝑉𝑀𝐿𝑃𝐿𝑁𝑆𝑜𝑓𝑡𝑚𝑎𝑥𝑄superscript𝐾𝑇𝐶𝑉\begin{split}&Q=\textrm{Concat}((1-W_{F}){F_{LB}^{\prime}},W_{F}{(F_{CB}^{% \prime}+P)})\textbf{W}_{Q}\\ &K=W_{F}{(F_{CB}^{\prime}+P)}\textbf{W}_{K}\\ &V={(F_{CB}^{\prime}+P)}\textbf{W}_{V}\\ &F_{BEV}={MLP}\big{(}{LN}({{Softmax}(\frac{QK^{T}}{\sqrt{C}})V})\big{)}\\ \end{split}start_ROW start_CELL end_CELL start_CELL italic_Q = Concat ( ( 1 - italic_W start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ) italic_F start_POSTSUBSCRIPT italic_L italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_W start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_P ) ) W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_K = italic_W start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_P ) W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_V = ( italic_F start_POSTSUBSCRIPT italic_C italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_P ) W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_F start_POSTSUBSCRIPT italic_B italic_E italic_V end_POSTSUBSCRIPT = italic_M italic_L italic_P ( italic_L italic_N ( italic_S italic_o italic_f italic_t italic_m italic_a italic_x ( divide start_ARG italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_C end_ARG end_ARG ) italic_V ) ) end_CELL end_ROW (5)

Where Q𝑄Qitalic_Q, K𝐾Kitalic_K, and V𝑉Vitalic_V denote the query, key, and value. WQsubscript𝑊𝑄W_{Q}italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT, WKsubscript𝑊𝐾W_{K}italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT and WVsubscript𝑊𝑉W_{V}italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT are learnable parameters, P𝑃Pitalic_P stands for the learnable position embedding, LN𝐿𝑁LNitalic_L italic_N means layer normalization and MLP𝑀𝐿𝑃MLPitalic_M italic_L italic_P is the multi-layer perception block.

3.5 Temporal Fusion Module

Temporal information is crucial for the visual system to understand the surrounding environment. Temporal information can better help detect the motion states of the objects and occluded objects. We follow the fusion scheme of BEVDet4D [10] and store the BEV features of historical frames in a memory buffer, and fuse the BEV features of the previous frame at each time. The detailed operation can be found in [10]. Finally, we feed the fused BEV features into the BEV encoder and detection head to obtain the final detection results.

Method Modality mAP\uparrow NDS\uparrow Car Truck C.V. Trailer Bus Barrier Motor. Bike Ped. T.C.
PointPillars [15] L 30.5 45.3 68.4 23.0 4.1 23.4 28.2 38.9 27.4 1.1 59.7 30.8
CenterPoint [44] L 60.3 67.3 85.2 53.5 20.0 56.0 63.6 71.1 59.5 30.7 84.6 78.4
TransFusion-L [1] L 65.5 70.2 86.2 56.7 28.2 58.8 66.3 78.2 68.3 44.2 86.1 82.0
LargeKernel3D [5] L 65.3 70.5 85.9 55.3 26.8 60.2 66.2 74.3 72.5 46.6 85.6 80.0
FocalFormer3D [6] L 68.7 72.6 87.8 59.4 37.8 65.7 73.0 77.8 77.4 52.4 90.0 83.4
PointPainting [36] LC 46.4 58.1 77.9 35.8 15.8 37.3 36.2 60.2 41.5 24.1 73.3 62.4
3D-CVF [45] LC 52.7 62.3 83.0 45.0 15.9 49.6 48.8 65.9 51.2 30.4 74.2 62.9
MVP [43] LC 66.4 70.5 86.8 58.5 26.1 57.3 67.4 74.8 70.0 49.3 89.1 85.0
TransFusion [1] LC 68.9 71.7 87.1 60.0 33.1 60.8 68.3 78.1 73.6 52.9 88.4 86.7
AutoAlignV2 [7] LC 68.4 72.4 87.0 59.0 33.1 59.3 69.3 - 72.9 52.1 87.6 -
BEVFusion [26] LC 70.2 72.9 88.6 60.1 39.3 63.8 69.8 80.0 74.1 51.0 89.2 85.2
BEVFusion [22] LC 71.3 73.3 88.1 60.9 34.4 62.1 69.3 78.2 72.2 52.2 89.2 86.7
CMT [40] LC 70.4 73.0 87.2 61.5 37.5 62.8 72.4 86.9 79.4 58.3 86.9 83.1
DeepInteraction [42] LC 70.8 73.4 87.9 60.2 37.5 63.8 70.8 80.4 75.4 54.5 91.7 87.2
FocalFormer3D [6] LC 71.6 73.9 88.5 61.4 35.9 66.4 71.7 79.3 80.3 57.1 89.7 85.3
MSMDFusion [13] LC 71.5 74.0 88.4 61.0 35.2 66.2 71.4 80.7 76.9 58.3 90.6 88.1
GAFusion(ours) LC 73.6 74.9 89.4 65.3 42.4 65.8 73.7 79.2 80.8 60.2 92.3 87.0
Table 1: Comparison on the nuScenes test set. The models in the table are without ensemble or test-time augmentation. “L” is LiDAR, “C” is camera.

4 Experiments

4.1 Dataset and Metrics

Similar to previous works [22, 26, 1], we conduct extensive synthetic experiments on the nuScenes dataset. The nuScenes [2] dataset is a large-scale autonomous driving benchmark, which includes 1000 scenes with images from 6 cameras with surrounding views, points from 5 Radars and 1 LiDAR. The scenes are officially split into 700/150/150 scenes for training/validation/testing. Each scene lasts for about 20 seconds, where key frames are annotated at 2Hz. Each frame of point cloud data corresponds to 6 RGB images with 360° horizontal FOV.

For the 3D detection task, we adopt the nuScenes Detection Score (NDS) and mean Average Precision (mAP) to evaluate the performance of the proposed model. In addition, the evaluation metrics of nuScenes also include five True Positive (TP) metrics, namely mean Average Translation Error (mATE), mean Average Scale Error (mASE), mean Average Orientation Error (mAOE), mean Average Velocity Error (mAVE) and mean Average Attribute Error (mAAE), which assess the performance of the model from different perspectives. NDS is the weighted average of mAP and five TP metrics.

Refer to caption
Figure 7: Visualization results of BEVFusion and GAFusion on the nuScenes validation set. The red circles and boxes show the detection ability of GAFusion for small and occluded objects.

Methods Modality mAP\uparrow NDS\uparrow BEVFormer [19] C 41.6 51.7 PETRv2 [24] C 45.6 35.0 CenterPoint [44] L 60.3 67.3 TransFusion-L [1] L 65.5 70.2 TransFusion [1] LC 67.5 71.3 BEVFusion [26] LC 68.5 71.4 CMT [40] LC 67.9 70.8 MSMDFusion [13] LC 69.3 72.1 DeepInteration [42] LC 69.9 72.6 SparseFusion [39] LC 70.5 72.8 GAFusion(ours) LC 72.1 73.5

Table 2: Comparison on the nuScenes val set. The models in the table are without ensemble or test-time augmentation.

4.2 Implementation Details

The developed model is implemented based on the MMDetection3D [8] framework. For the LiDAR stream, we utilize the additional downsampling operation on top of VoxelNet [48] as our backbone. For the camera stream, we adopt Swin-T [25] and FPN as the image backbone, and use the pre-trained model of Swin-T. Similar to most models, the image resolution is 448×{\times}×800, and the voxel size is (0.075m, 0.075m, 0.2m). The whole training process follows the previous work [22, 26]. Firstly, a LiDAR detector is trained as 3D backbone for 20 epochs. Then, we freeze the pre-trained LiDAR components and jointly train for another 6 epochs according to the proposed framework. During the training stage, we use AdamW𝐴𝑑𝑎𝑚𝑊AdamWitalic_A italic_d italic_a italic_m italic_W [27] optimizer with an initial learning rate of 5×{\times}×10-5 and a weight decay of 10-2. GAFusion is trained on two 3090 GPUs with batch size of 4. In the inference stage, we do not use test-time augmentation (TTA) or multi-model ensemble.

4.3 Results and Comparison

As shown in Table 2 and Table 1, we report the results of GAFusion on the nuScenes validation and test sets, and compare them with other state-of-the-art models. The results show that, on the test set, GAFusion surpasses all the existing methods with 73.6%percent\%% mAP and 74.9%percent\%% NDS, such as MSMDFusion [13] and CMT [40]. It also achieves excellent performance on the validation set. In addition, we also provide the visualization results of GAFusion and BEVFusion to demonstrate the superiority of the proposed method, and they can be seen in Fig. 7. This is attributed to better guidance mechanisms, larger receptive fields and a more suitable fusion method.

4.4 Ablation Studies

To demonstrate the effectiveness and rationality of GAFusion, we conduct comprehensive ablation studies for each of the proposed components.

Additional downsampling and sparse height compression. To prove the validity and generalization of this module, we separately insert the developed module into TransFusion [1] and BEVFusion [26], as shown in Table 3. We do not use any augmentation strategies or multi-model ensemble during testing. The results illustrate that it can significantly improve the performance of different models. It enhances 1.0%percent\%% mAP and 0.6%percent\%% NDS in TransFusion, and 0.8%percent\%% mAP and 0.5%percent\%% NDS in BEVFusion, which indicates that it can aggregate multi-scale information.

Backbone + Sparse2Dense TransFusion BEVFusion mAP\uparrow NDS\uparrow mAP\uparrow NDS\uparrow Voxelnet + HC 68.9 71.6 70.2 72.9 VoxelNet, AD + SHC 69.9 72.4 71.0 73.4

Table 3: Performance and generalization of additional downsampling (AD) and sparse height compression (SHC) on other common models. NDS/mAP comparison on nuScenes test set. “HC” is height compression.

Impacts of LiDAR guidance. To demonstrate that the contributions of LiDAR guidance indeed improve the model performance, we introduce SDG and LOG into BEVFusion [26]. Table 4 presents the impacts of different combinations of the guidance modules in BEVFusion. We observe that the model performance brings about 1.4%percent\%% mAP and 0.8%percent\%% NDS gain with both SDG and LOG. When neither SDG nor LOG module is used, the model scores drop significantly. It can be attributed to the lack of guidance in the camera stream, and results in unreliable depth information. Moreover, it realizes 0.8%percent\%% mAP and 0.4%percent\%% NDS with SDG alone and 1.0%percent\%% mAP and 0.6%percent\%% NDS with LOG alone. The interaction effects of LOG are more remarkable, so we conjecture that directly interacting among 3D features can provide sufficient located information. SDG and LOG play their respective roles: the former integrates sparse depth information into 2D image features, and the latter guides depth information in 3D feature volume, which enables the camera stream to obtain rich geometric information.

Refer to caption
Figure 8: Receptive fields of the preliminary fused BEV features from different modalities. The colored dots indicate effective receptive fields.

SDG LOG mAP\uparrow NDS\uparrow (1) 68.52 71.38 (2) 69.33 71.76 (3) 69.49 72.04 (4) 69.93 72.24

Table 4: Ablation study of LiDAR guidance on nuScenes val set. (1) represents the performance of the original BEVFusion model.

BEV features fusion strategy. We explore the impacts of different fusion methods, including addition, concatenation, LiDAR-guide fusion transformer (LGFT) and LGAFT. As shown in Table 5, LGFT achieves a noticeable improvement over addition and concatenation, with about 0.7%percent\%% mAP and 0.4%percent\%% NDS. LGAFT further enhances 0.16%percent\%% mAP and 0.11%percent\%% NDS against LGFT due to the addition of adaptive mechanism. It presents that enhancing the interaction between LiDAR and camera BEV features from a global scope and the adaptive mechanism can sufficiently improve global spatial relevance.

Effects of MSDPT. We illustrate the related results to prove that MSDPT can effectively enlarge the receptive fields of camera features and aggregate semantic information. In Table 6, (1)-(5) are using output features (Fc′′superscriptsubscript𝐹𝑐′′{F}_{c}^{\prime\prime}italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT) to fuse the features of different scales. Without MSDPT, the model performance drops by about 0.5%percent\%% mAP and 0.4%percent\%% NDS. Different scales of features also affect the model accuracy, which is due to the fact that the multi-scale operation can enlarge the receptive fields of camera features. However, redundant scales also cause too much computation and the performance enhancement is not obvious. Therefore, we select 3 scales to combine the different scale features for the balance of performance and computation.

Larger receptive fields. As shown in Fig. 8, (a) and (b) illustrate the effective receptive fields of the fused features from the camera and LiDAR BEV features by BEVfusion [26] and GAFusion, respectively. We observe that GAFusion achieves larger effective receptive fields than BEVFusion. This is attributed to the additional downsampling and MSDPT modules, which indicate that multi-scale features can provide more contextual information. Besides, the global and local interaction of LGAFT contributes to enlarging the feature receptive fields to some extent.

BEV Fusion Tem mAP\uparrow NDS\uparrow (1) ADD. 70.85 72.82 71.17 72.93 \hdashline(2) Cat. 70.92 72.88 71.21 72.98 \hdashline(3) LGFT 71.63 73.32 71.91 73.41 \hdashline(4) LGAFT 71.79 73.43 72.08 73.53

Table 5: Ablation study of BEV fusion strategy and temporal fusion on nuScenes val set. “Tem” is temporal fusion.

C1 C2 C3 C4 mAP\uparrow NDS\uparrow (1) 71.60 73.11 (2) 71.92 73.39 (3) 72.01 73.45 (4) 72.08 73.53 (5) 72.07 73.54

Table 6: Ablation study of MSDPT on nuScenes val set. C1-C4 denote the number of 3D convolution layers (1-4) applied to the 3D feature volume, respectively.

Temporal fusion. In Table 5, the temporal fusion improves about 0.3%percent\%% mAP and 0.1%percent\%% NDS. We integrate two frames of the BEV features, and the approach enables partial features alignment between adjacent frames, which leads to a marginal performance improvement. For multiple frames, it can attain higher enhancement, which is our future work.

5 Conclusion

We propose GAFusion, a more effective 3D object detection method in BEV representation, which is equipped with excellent guidance and fusion mechanisms. Additional downsampling and MSDPT are developed to enlarge the receptive fields of different modal features. Then SDG and LOG are employed to transform the 2D camera features into 3D features with sufficient located and spatial information. Afterward, we propose LGAFT to facilitate the fusion of LiDAR and camera BEV features. Finally, a temporal fusion module is adopted to aggregate features from different frames. Extensive experiments demonstrate the effectiveness and generality of our developed modules and GAFusion achieves state-of-the-art performances on the nuScenes dataset. We hope that the proposed components of GAFusion could provide more insights for subsequent research in this field.

Acknowledgments This project is developed based on the following open-sourced projects:MMDetection3D [8], TransFusion [1], BEVFusion [26], BEVFusion [22], OccFormer [47]. Thanks for their excellent work.

References

  • Bai et al. [2022] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1090–1099, 2022.
  • Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
  • Cai et al. [2023] Hongxiang Cai, Zeyuan Zhang, Zhenyu Zhou, Ziyin Li, Wenbo Ding, and Jiuhua Zhao. Bevfusion4d: Learning lidar-camera fusion under bird’s-eye-view via cross-modality guidance and temporal aggregation. arXiv preprint arXiv:2303.17099, 2023.
  • Chen et al. [2017] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
  • Chen et al. [2023a] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Largekernel3d: Scaling up kernels in 3d sparse cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13488–13498, 2023a.
  • Chen et al. [2023b] Yilun Chen, Zhiding Yu, Yukang Chen, Shiyi Lan, Anima Anandkumar, Jiaya Jia, and Jose M Alvarez. Focalformer3d: Focusing on hard instance for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8394–8405, 2023b.
  • Chen et al. [2022] Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinhong Jiang, and Feng Zhao. Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection. arXiv preprint arXiv:2207.10316, 2022.
  • Contributors [2020] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020.
  • Deng et al. [2021] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1201–1209, 2021.
  • Huang and Huang [2022] Junjie Huang and Guan Huang. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054, 2022.
  • Huang et al. [2021] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  • Huang et al. [2023] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9223–9232, 2023.
  • Jiao et al. [2023] Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21643–21652, 2023.
  • Ku et al. [2018] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.
  • Lang et al. [2019] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
  • Li et al. [2023a] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1477–1485, 2023a.
  • Li et al. [2023b] Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9087–9098, 2023b.
  • Li et al. [2021] Zhichao Li, Feng Wang, and Naiyan Wang. Lidar r-cnn: An efficient and universal 3d object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7546–7555, 2021.
  • Li et al. [2022] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, pages 1–18. Springer, 2022.
  • Li et al. [2023c] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023c.
  • Li et al. [2023d] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose M Alvarez. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6919–6928, 2023d.
  • Liang et al. [2022] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems, 35:10421–10434, 2022.
  • Liu et al. [2022] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII, pages 531–548. Springer, 2022.
  • Liu et al. [2023a] Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Aqi Gao, Tiancai Wang, and Xiangyu Zhang. Petrv2: A unified framework for 3d perception from multi-camera images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3262–3272, 2023a.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Liu et al. [2023b] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In 2023 IEEE international conference on robotics and automation (ICRA), pages 2774–2781. IEEE, 2023b.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Ma et al. [2022] Yuexin Ma, Tai Wang, Xuyang Bai, Huitong Yang, Yuenan Hou, Yaming Wang, Yu Qiao, Ruigang Yang, Dinesh Manocha, and Xinge Zhu. Vision-centric bev perception: A survey. arXiv preprint arXiv:2208.02797, 2022.
  • Miao et al. [2023] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
  • Philion and Fidler [2020] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
  • Qi et al. [2017] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  • Schuster and Paliwal [1997] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Shi et al. [2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In CVPR, pages 10529–10538, 2020.
  • Shi et al. [2022] Shaoshuai Shi, Li Jiang, Jiajun Deng, Zhe Wang, Chaoxu Guo, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. IJCV, pages 1–21, 2022.
  • Sindagi et al. [2019] Vishwanath A Sindagi, Yin Zhou, and Oncel Tuzel. Mvx-net: Multimodal voxelnet for 3d object detection. In 2019 International Conference on Robotics and Automation (ICRA), pages 7276–7282. IEEE, 2019.
  • Vora et al. [2020] Sourabh Vora, Alex H Lang, Bassam Helou, and Oscar Beijbom. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4604–4612, 2020.
  • Wang et al. [2023] Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17850–17859, 2023.
  • Wang et al. [2022] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
  • Xie et al. [2023] Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17591–17602, 2023.
  • Yan et al. [2023] Junjie Yan, Yingfei Liu, Jianjian Sun, Fan Jia, Shuailin Li, Tiancai Wang, and Xiangyu Zhang. Cross modal transformer: Towards fast and robust 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18268–18278, 2023.
  • Yan et al. [2018] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  • Yang et al. [2022] Zeyu Yang, Jiaqi Chen, Zhenwei Miao, Wei Li, Xiatian Zhu, and Li Zhang. Deepinteraction: 3d object detection via modality interaction. In NeurIPS, 2022.
  • Yin et al. [2021a] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Multimodal virtual point 3d detection. Advances in Neural Information Processing Systems, 34:16494–16507, 2021a.
  • Yin et al. [2021b] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021b.
  • Yoo et al. [2020] Jin Hyeok Yoo, Yeocheol Kim, Ji Song Kim, and J. Choi. 3D-CVF: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In ECCV, 2020.
  • Zhai et al. [2022] Zhenyu Zhai, Qiantong Wang, Zongxu Pan, Zhentong Gao, and Wenlong Hu. Muti-frame point cloud feature fusion based on attention mechanisms for 3d object detection. Sensors, 22(19):7473, 2022.
  • Zhang et al. [2023] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.
  • Zhou and Tuzel [2018] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018.