Integrated Intention Prediction and Decision-Making with Spectrum Attention Net and Proximal Policy Optimization

Xiao Zhou, Chengzhen Meng, Wenru Liu, Zengqi Peng, Ming Liu, and Jun Ma

This work was supported in part by the National Natural Science Foundation of China under Grant 62303390; and in part by the Guangzhou-HKUST(GZ) Joint Funding Scheme under Grant 2024A03J0618. (Corresponding Author: Jun Ma.)

Xiao Zhou, Chengzhen Meng, Wenru Liu, Zengqi Peng, and Ming Liu are with the Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Jun Ma is with the Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China, and also with the Division of Emerging Interdisciplinary Areas, The Hong Kong University of Science and Technology, Hong Kong SAR, China. (e-mail: [email protected]).

Abstract

For autonomous driving in highly dynamic environments, it is essential to predict the future behaviors of surrounding vehicles (SVs) and make safe and effective decisions. However, modeling the inherent coupling effect between the prediction and decision-making modules has been a long-standing challenge, especially when there is a need to maintain appropriate computational efficiency. To tackle these problems, we propose a novel integrated intention prediction and decision-making approach, which explicitly models the coupling relationship and achieves efficient computation. Specifically, a spectrum attention net is designed to predict the intentions of SVs by capturing the trends of each frequency component over time and their interrelations. Fast computation of the intention prediction module is attained because the predicted intentions are not decoded to trajectories during execution. Furthermore, the proximal policy optimization (PPO) algorithm is employed to address the non-stationary problem in the framework through a modest policy update enabled by a clipping mechanism within its objective function. On the basis of these developments, the intention prediction and decision-making modules are integrated through joint learning. Experiments are conducted in representative traffic scenarios, and the results reveal that the proposed integrated framework demonstrates superior performance over several deep reinforcement learning (DRL) baselines in terms of success rate, efficiency, and safety in driving tasks.

I INTRODUCTION

Advancements in computer technology and artificial intelligence have led to significant developments in the field of autonomous driving technology. Nevertheless, the realization of autonomous driving systems remains a formidable challenge, with system design standing as a principal barrier. The architectures of contemporary autonomous driving systems fall into two major categories: end-to-end and hierarchical systems. While there have been notable advancements in end-to-end autonomous driving systems based on machine learning [1], these systems continue to grapple with issues of poor interpretability, high data demands, limited generalization capabilities, and insufficient robustness. Alternatively, hierarchical autonomous driving systems [2, 3], which involve the process of perception, prediction, decision-making, planning, and control have exhibited considerable enhancements in terms of interpretability and robustness. However, for such systems, tasks including prediction and decision-making are often treated as relatively distinct modules, which ignores the inherent coupling relationship between them. This mutual effect comes from the influence of the prediction module on the decision-making module; and also, due to the interaction between vehicles, the decision-making module will in turn affect the behavior of the surrounding vehicles (SVs), thereby affecting the next prediction step. Since the coupling between prediction and decision-making could be overlooked, this separation can lead to deteriorated performance of autonomous vehicles (AVs) in complex situations. In response to this shortcoming, several studies have proposed frameworks that integrate the prediction and decision-making modules [4, 5], leveraging predicted trajectories or intention to inform the decision-making modules in autonomous driving. 
In integrated frameworks such as [6], the forecasting and imitation nets are trained jointly, which is more efficient compared to training the network individually. Despite these efforts, the design of an integrated framework remains a significant challenge.

In general, the output of the decision-making module can be obtained through approaches such as rule-based methods [7], optimization-based methods [8], and learning-based methods [9]. Rule-based and optimization-based methods have obtained commendable results owing to their strong interpretability and related attributes. Nevertheless, due to the limited adaptability of these methods, they generally deliver poor performance in complex real-world environments. On the other hand, learning-based methods have demonstrated significant and efficacious progress, with a particularly promising direction in reinforcement learning (RL). A deep reinforcement learning (DRL) method presented in [10] adjusts the clipping parameter at different stages of training using proximal policy optimization (PPO), so that the vehicle could quickly search for an approximate optimal policy or its neighborhood with a large parameter and then converge to the optimal policy with a smaller one. The close interplay between prediction and decision-making implies that the prediction module exerts a considerable influence on decision-making module. Therefore, it is crucial to incorporate a well-developed prediction module in the framework to ensure that the decisions made by the system are safe and reliable.

Mainstream prediction methods typically focus on intention or trajectory prediction. Within the integrated framework, trajectory prediction is more commonly utilized and it can be divided into two main types: optimization-based and learning-based methods. Exemplified by [11], optimization-based methods are highly efficient in capturing long-term movements, but replicating this success in complex urban road environments proves challenging due to the intricate road layouts and variable traffic conditions. In contrast, learning-based methods [12, 13] are more adept at handling prediction tasks in such complex settings, benefiting from the strengths of neural networks in processing large amounts of trajectory data and understanding urban road conditions. Learning-based methods forecast future trajectories by analyzing historical trajectory data, resulting in encouraging outcomes. The performance can be improved when transforming trajectory representation from the time domain to the frequency domain, as the frequency domain reflects different characteristics such as energy distribution. Relevant records have shown that these features in the frequency domain are critical to the prediction process. A novel approach for predicting state sequences through Fourier transform (FT) is proposed by [14], which extracts hidden structural information from the frequency domain, thereby enhancing sample efficiency. A network called SpectrumNet is introduced in [15] to predict pedestrian trajectories, which is capable of breaking down information across various time scales through FT and representing it in the frequency domain. In [16], the future trajectory is predicted from the history trajectory, which is transformed from the time domain to the frequency domain using FT. The low-frequency and high-frequency components reflect the long-term objective (LTO) and the short-term dynamic (STD) intention, respectively. 
However, the FT used in [14, 15, 16] fails to account for changes in frequency over time, hindering the dynamic estimation of trends in trajectory changes and affecting the performance of prediction in real-world scenarios. Given that the goal of trajectory prediction is to obtain precise future trajectories, it demands considerable computational resources, leading to difficulties in prediction within complex dynamic environments. In addition to trajectory prediction, intention prediction has also been the subject of extensive research. For example, an intention prediction approach developed by [17] utilizes Gaussian mixed hidden Markov models to forecast the short-term intention of vehicles based on their historical trajectories. However, this method concentrates exclusively on past trajectory data to predict short-term driving intention and does not take into account the LTO intention. The same issue can also be found in the majority of intention prediction approaches, such as [18]. The absence of consideration for LTO intention challenges the assurance of long-term decision-making effectiveness, which can compromise robustness in complex driving scenarios applied within integrated frameworks.

To address the aforementioned issues, we propose an integrated intention prediction and decision-making approach for autonomous driving, which leverages the proposed spectrum attention net to predict the intention of SVs in the frequency domain and utilizes the PPO algorithm to generate the decisions. The main contributions of this paper are as follows:

• An integrated intention prediction and decision-making framework is proposed to explicitly model the strong coupling effect between intention prediction and decision-making, where we use a joint learning mechanism to learn the coupling and ensure the effectiveness of network parameter updates.

• A spectrum attention net is proposed to predict the intention of the SVs from the spectrum of their trajectories obtained by the short-time Fourier transform (STFT). With the predicted intention of SVs (which is the combination of LTO and STD intention), the AV can make safer and more effective decisions.

• The effectiveness of the proposed integrated framework is demonstrated by comparisons with existing methods in various driving scenarios, and the results indicate that the proposed approach outperforms other baseline methods in these driving tasks.

We organize the remainder of the article as follows: Section II introduces the preliminaries of the Markov decision process (MDP) and STFT, and also gives the problem statement. The details of the proposed integrated framework are given in Section III, including the design of the spectrum attention net, the PPO algorithm, and the joint learning mechanism in the framework. Based on these findings, we conduct experiments in four different scenarios described in Section IV. Finally, we summarize the conclusion and future work in Section V.

II PRELIMINARIES

II-A Markov Decision Process

An MDP is used to describe decision-making problems with the Markov property. In an MDP, the agent selects the optimal decision based on the current state and possible actions to maximize long-term rewards. The MDP consists of five elements, which can be represented by the tuple $\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the state transition probability function that represents the probability of transitioning from the current state to a new state after taking an action, $\mathcal{R}$ is the reward function that represents the immediate reward obtained after taking action $a$ in state $s$, and $\gamma\in(0,1)$ is the discount factor used to discount the value of future rewards.

II-B Short-Time Fourier Transform

FT is pivotal in applications that need to transform signals between the time and frequency domains. An important variant is the STFT [19], which addresses the limitation of the FT in analyzing non-stationary signals where frequency components vary over time. The primary purpose of the STFT is to determine the frequency and phase content of local sections of a signal as it changes over time. This is achieved by multiplying the signal by a window function which is shifted along the duration of the signal. For a signal $x[n]$, where $n$ represents the time step, the STFT transforms it into the frequency domain by applying FT to segments of the signal, defined by the window function. The STFT can be expressed as:

$$X(m,\omega)=\sum_{n=-\infty}^{\infty}x[n]\cdot w[n-m]\cdot e^{-j\omega n} \tag{1}$$

where $w[n-m]$ denotes the window function centered around time $m$, $m$ is the index of the time frame, and $\omega$ corresponds to the frequency.
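
As a minimal illustration of (1), the following Python sketch evaluates the STFT of a finite signal on a discrete grid. The periodic Hann window, the hop size, the frequency grid $\omega_k = 2\pi k/L$, and the convention that the window starts at frame index $m$ (rather than being centered on it) are assumptions made for this example and are not specified in the text.

```python
import cmath
import math


def hann(length):
    """Periodic Hann window of the given length (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def stft(x, window, hop):
    """Evaluate X(m, w_k) = sum_n x[n] * w[n - m] * exp(-j * w_k * n).

    Frames start at m = 0, hop, 2*hop, ... (assumed framing, no padding),
    and w_k = 2*pi*k / len(window) for k = 0, ..., len(window) - 1.
    Returns a list of frames; each frame is a list of complex bins.
    """
    win_len = len(window)
    frames = []
    for m in range(0, len(x) - win_len + 1, hop):
        bins = []
        for k in range(win_len):
            omega = 2.0 * math.pi * k / win_len
            acc = 0j
            # The window is non-zero only for n in [m, m + win_len).
            for n in range(m, m + win_len):
                acc += x[n] * window[n - m] * cmath.exp(-1j * omega * n)
            bins.append(acc)
        frames.append(bins)
    return frames
```

For a 16-sample cosine with two cycles per 8 samples, a window of length 8, and a hop of 4, this sketch yields three frames, and the magnitude in each frame peaks at bin $k=2$.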

On the other hand, when the frequency domain signal obtained from STFT needs to be converted to the time domain, the inverse short-time Fourier transform (ISTFT) can be adopted. Given the STFT representation $X(m,\omega)$ of a signal, the corresponding signal $x(n)$ in the time domain can be reconstructed using the formula:

$$x(n)=\sum_{m}\sum_{\omega}X(m,\omega)\cdot w(n-m)\cdot e^{j\omega n} \tag{2}$$

II-C Problem Statement

The objective of this research is to develop a framework for effectively navigating an AV in various traffic environments. We focus on four key scenarios: a straight roadway, a four-way intersection, a two-way intersection, and a roundabout.

By modeling this problem as an MDP, our goal is to find an optimal policy $\pi_{\theta}$, guided by parameters $\theta$, that maximizes expected cumulative discounted rewards, formulated as:

$$\begin{array}{cl}\max_{\pi_{\theta}} & \mathbb{E}\left[\sum_{t=1}^{T}\gamma^{t}r_{t}\left(\mathbf{S}_{t},a_{t}\right)\right]\\ \text{s.t.} & a_{t}\in\mathcal{A},\ \mathbf{S}_{t}\in\mathcal{S}\end{array} \tag{3}$$
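
As a minimal illustration of the objective in (3), the following Python sketch computes the discounted return of one episode, with rewards indexed from $t=1$ as in (3), and a Monte Carlo estimate of the expectation. The function names and the use of a plain sample average over episodes are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted return sum_{t=1}^{T} gamma**t * r_t, with rewards[0] being r_1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def estimate_objective(episodes, gamma):
    """Sample-average estimate of the expectation over a set of episodes.

    `episodes` is a list of reward sequences, one per rollout (assumed format).
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)
```

For example, with $\gamma=0.5$ the reward sequence $(1, 1)$ gives a return of $0.5 + 0.25 = 0.75$.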

Specifically, the state space $\mathcal{S}$ and action space $\mathcal{A}$ in the problem are defined as:

State space $\mathcal{S}$: We define the state at any given time as a matrix,

$$\mathbf{S}_{t}=\left[\left[\mathbf{s}_{t}^{0}\right]^{T}\ \left[\mathbf{s}_{t}^{1}\right]^{T}\ \ldots\ \left[\mathbf{s}_{t}^{N}\right]^{T}\right]^{T} \tag{4}$$

where $N$ is the number of SVs. The state of the AV is captured in the first row of $\mathbf{S}_{t}$ as $\mathbf{s}_{t}^{0}$, and the other rows of $\mathbf{S}_{t}$, i.e., $\mathbf{s}_{t}^{i}\ (i=1,2,\ldots,N)$, represent vectors of the kinematic features of the SVs. Each vehicle's state is characterized by the following kinematic features:

$$\mathbf{s}_{t}^{i}=\left[\begin{array}{cccccc}x_{t}^{i} & y_{t}^{i} & v_{x,t}^{i} & v_{y,t}^{i} & \sin\psi_{t}^{i} & \cos\psi_{t}^{i}\end{array}\right] \tag{5}$$

which contains the information about the position, velocity, and orientation of vehicle $i$ at time step $t$.
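
As a minimal illustration of (4) and (5), the following Python sketch assembles the state matrix from raw kinematic observations. The `VehicleObservation` container and the function names are hypothetical and are introduced only for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleObservation:
    """Hypothetical container for the raw kinematics of one vehicle."""
    x: float
    y: float
    vx: float
    vy: float
    heading: float  # psi, in radians


def vehicle_state(obs):
    """Feature vector of (5): [x, y, v_x, v_y, sin(psi), cos(psi)]."""
    return [obs.x, obs.y, obs.vx, obs.vy,
            math.sin(obs.heading), math.cos(obs.heading)]


def state_matrix(ego, surrounding):
    """State matrix of (4): row 0 is the AV, rows 1..N are the SVs."""
    return [vehicle_state(ego)] + [vehicle_state(sv) for sv in surrounding]
```

Encoding the heading as a sine and cosine pair, as in (5), avoids the discontinuity of the raw angle at $\pm\pi$.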

Action space $\mathcal{A}$: The agent has access to a set of five discrete actions,

$$\mathcal{A}=\left\{A^{0},A^{1},A^{2},A^{3},A^{4}\right\} \tag{6}$$

including lateral maneuvers such as changing lanes to the left or right ($A^{0}$, $A^{2}$), maintaining current motion ($A^{1}$), and longitudinal adjustments like decelerating or accelerating ($A^{3}$, $A^{4}$).
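
As a minimal illustration of (6), the discrete action set can be encoded as an enumeration. The member names are hypothetical, and the assignment of left/right to $A^{0}$/$A^{2}$ and of decelerating/accelerating to $A^{3}$/$A^{4}$ follows the order in which they are listed above, which is an assumption.

```python
from enum import IntEnum


class Action(IntEnum):
    """Five discrete actions of (6); the names are illustrative only."""
    LANE_LEFT = 0   # A^0: change lane to the left (assumed ordering)
    KEEP = 1        # A^1: maintain current motion
    LANE_RIGHT = 2  # A^2: change lane to the right (assumed ordering)
    DECELERATE = 3  # A^3: decelerate (assumed ordering)
    ACCELERATE = 4  # A^4: accelerate (assumed ordering)


def is_lateral(action):
    """True for the lane-change actions A^0 and A^2."""
    return action in (Action.LANE_LEFT, Action.LANE_RIGHT)
```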

III METHODOLOGY

In this section, we present the proposed framework for integrated intention prediction and decision-making, as shown in Fig. 1, where the AV utilizes the intention prediction obtained from the spectrum attention net to guide its decisions. We first introduce the details of the proposed spectrum attention net, which is used to predict the intention of SVs. Then, we illustrate the advantages of the PPO algorithm in processing complex and non-stationary inputs as a decision-making module in the integrated framework. Finally, we give the details of the joint learning process of the intention prediction module and decision-making module in the integrated framework.

Figure 1: Overall architecture of the proposed integrated intention prediction and decision-making framework, exemplified through a four-way intersection scenario. For an observable SV $i$, its segmented historical trajectory is first transformed into a spectrogram by STFT and then used to predict its intention with the spectrum attention net. The decision-making module receives the predicted intention of SVs and makes a decision based on the predicted intention and direct observation. Note that the trajectory decoder is only executed in the joint training process to improve the real-time capabilities of the integrated framework.

III-A Intention Prediction Module with Spectrum Attention Net

The behavior of SVs is affected by both LTO and STD intention, which can be reflected by the low-frequency component and high-frequency component in the spectrum, respectively. Therefore, we propose a spectrum attention net to predict the intention of SVs from the frequency spectrum, where the spectral information can characterize the intention of SVs.

At time step $t$, the AV observes current environmental states, which include information about both itself and observable SVs. The state information of these observable SVs is recorded in the trajectory buffer with a horizon of $H$ to maintain a consistent input length for the spectrum attention net. Specifically, for each observable vehicle $i$ in the scenario, its state information $\mathbf{s}^{i}_{t}$ over the time interval $[t-H+1,t]$ is incorporated into the trajectory buffer $T_{ego,i}$, thereby forming a segmented historical trajectory for vehicle $i$. By utilizing the segmented historical trajectory as time sequence input, we derive the features of multiple frequency components evolving over time through the transformation of time sequence input into the frequency domain using STFT, expressed as follows:

$$X_{s}(m,\omega)=\sum_{n=t-H+1}^{t}\mathbf{s}^{i}_{n}\cdot w[n-m]\cdot e^{-j\omega n} \tag{7}$$

where $X_{s}(m,\omega)\in\mathbb{R}^{F\times\Omega\times M}$ denotes the trend of the frequency component at $\omega$ across time frame $m$. Note that $\Omega$ is the number of frequency components, $M$ is the number of time frames, and $F$ denotes the dimension of kinematic features. Indeed, STFT can be regarded as a feature extraction function that derives intent representation from a segmented historical trajectory. The fixed time-frequency resolution of STFT ensures uniform representation of each time frame, which is crucial for the attention mechanism to effectively discern and correlate different temporal segments, thereby facilitating accurate intention predictions.
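
As a minimal illustration of the trajectory buffer with horizon $H$, the following Python sketch keeps the most recent $H$ state vectors of one vehicle and exposes them as $F$ time series, one per kinematic feature, ready for a per-feature STFT. The class name, the helper `spectrogram_shape`, and the framing convention (full-band bins, no padding) are assumptions made for this example.

```python
from collections import deque


class TrajectoryBuffer:
    """Keeps the last H state vectors of one observed vehicle."""

    def __init__(self, horizon):
        self.horizon = horizon
        # A deque with maxlen discards the oldest entry automatically.
        self._states = deque(maxlen=horizon)

    def push(self, state):
        """Append the state vector observed at the current time step."""
        self._states.append(list(state))

    def is_full(self):
        """True once H states are available, i.e. the input length is fixed."""
        return len(self._states) == self.horizon

    def feature_series(self):
        """Transpose the buffer into F time series of equal length."""
        return [list(series) for series in zip(*self._states)]


def spectrogram_shape(num_features, horizon, window_length, hop):
    """(F, Omega, M) under the assumed framing: Omega = window_length bins,
    M = (H - window_length) // hop + 1 frames."""
    num_frames = (horizon - window_length) // hop + 1
    return (num_features, window_length, num_frames)
```

With $F=6$ features, $H=16$, a window length of 8, and a hop of 4, this convention gives a spectrogram of shape $(6, 8, 3)$.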

The behavior of SVs involves weighing different levels of intention, and these intentions are interrelated. The frequency components at different time frames within the spectrogram $X_{s}(m,\omega)$ depict the distribution of intention across time. Therefore, to thoroughly explore intention patterns across diverse combinations of frequency components, we introduce a multi-head spectrum attention net in this study to forecast the intention of SVs. Initially, the spectrogram $X_{s}(m,\omega)$ is embedded with a linear projection and then sent to the spectral attention module. Through the transformations executed by networks $L_{q}$, $L_{k}$, and $L_{v}$, all embedded vectors are converted into query $Q\in\mathbb{R}^{F\times\Omega\times M}$, key $K\in\mathbb{R}^{F\times\Omega\times M}$, and value $V\in\mathbb{R}^{F\times\Omega\times M}$, respectively. The attention scores are computed by multiplying the query matrix $Q$ with the transpose of $K$, followed by scaling with the inverse square root of dimension $M$ and normalization via a softmax function across the frequency dimension. This process yields a spectral attention matrix, which is utilized to evaluate the correlation among different frequency components, thereby facilitating the prediction of the intention of vehicle $i$. The resulting prediction vector is subsequently fed into the linear projection network, with each head of spectral attention generating outputs as represented below:

$$P_{t}^{i}=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{8}$$

where $d_{k}$ is the scaling dimension, which equals $M$ here. Finally, the outputs from all heads are combined and decoded by the intention decoder, ultimately serving as the predicted intention of the SV.
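
As a minimal illustration of (8), the following Python sketch computes scaled dot-product attention for a single head and a single feature channel, where each row of $Q$, $K$, and $V$ corresponds to one frequency component. Real-valued inputs (for instance, spectrogram magnitudes after embedding) and the omission of the learned projections $L_{q}$, $L_{k}$, $L_{v}$ are simplifying assumptions of this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spectral_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one head.

    Q, K, V: lists of rows, one row per frequency component; each row of
    Q and K has d_k entries. Returns (output, attention_weights).
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        # Similarity of this frequency component to every other one.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights
```

Each row of the returned weight matrix sums to one and indicates how strongly one frequency component attends to the others.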

Remark 1

The behavior of SVs is affected by multiple levels of intention at the same time, including LTO and STD intention. For a spectrogram $X_{s}(m,\omega)$ that delineates frequency domain variations over time, the attention mechanism can capture the trend of each frequency component over time and their interrelations. Therefore, reasonable predictions about the spectrum distribution in a future time horizon can be obtained by the spectrum attention net and reflect the intention of SVs.

III-B Decision-Making Module with PPO

One of the main challenges in the decision-making module of the integrated framework is the non-stationary problem caused by joint learning. In the joint learning process, the intention prediction in the observation input of the decision-making algorithm keeps changing; therefore, the decision-making policy needs to be adjusted accordingly. As the spectrum attention net in the intention prediction module updates, corresponding adjustments of the policy in the decision-making module become necessary. However, unchecked policy updates can lead to catastrophic consequences in this scenario. Significant updates to the policy of the decision-making module based on the current intention prediction module may lead to excellent performance in the current environment, but the performance could become suboptimal or even poor after the intention prediction module updates its network parameters. To address this concern, we adopt the PPO algorithm in the integrated framework.

The PPO algorithm is an online reinforcement learning framework designed to solve sequential decision-making problems. Notably, the PPO algorithm employs a clipping mechanism within its objective function to ensure modest policy updates. This feature becomes especially crucial when handling complex and non-stationary multimodal inputs from the intention prediction module. Specifically, the objective function of PPO designed to mitigate the risk of destabilizing updates is given by:

$$L^{\text{CLIP}}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\left(r_{t}(\theta)\hat{A}_{t},\ \text{clip}\left(r_{t}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_{t}\right)\right] \tag{9}$$

where $r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}$ represents the probability ratio of selecting action $a_{t}$ in state $s_{t}$ under the new policy $\pi_{\theta}$ compared to the old policy $\pi_{\theta_{old}}$. Additionally, $\hat{A}_{t}$ denotes the estimated advantage function at timestep $t$, and $\epsilon$ serves as a hyperparameter defining the clipping boundary.

Essentially, the clipped surrogate objective in PPO provides a theoretically constrained network update routine to tackle the non-stationary problem associated with the intention prediction module. This mechanism keeps each policy update step within the clipping boundary, which is indispensable for the algorithm’s convergence and stability in a non-stationary environment.
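To make the clipping in (9) concrete, the following minimal Python sketch evaluates the clipped surrogate for a small batch. The function names, the use of plain lists, and the default $\epsilon = 0.2$ are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def probability_ratio(logp_new, logp_old):
    """r_t(theta): ratio of new to old action probability, from log-probabilities."""
    return math.exp(logp_new - logp_old)


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), as in (9).

    epsilon=0.2 is an assumed default, not a value reported in the paper.
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal in length")
    total = 0.0
    for r, adv in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)
```

With a positive advantage and $\epsilon = 0.2$, a ratio of 1.5 contributes only 1.2 times the advantage, so the objective gives no incentive to push the ratio further above $1+\epsilon$. With a negative advantage the unclipped term is kept, so a harmful update is still fully penalized.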

III-C Joint Learning Process

The integrated intention prediction and decision-making framework proposed in this study combines the spectrum attention net with the PPO algorithm. The intention prediction and decision-making modules are trained jointly, as joint training outperforms the standalone training mechanism [6].

For the joint learning process in the integrated framework, the agent first explores the environment and collects experiences based on the current network parameters. In detail, the spectrum attention net predicts the intentions of SVs, and the PPO algorithm makes decisions based on the observation and the predicted intentions. The collected experiences then serve as training data for both the spectrum attention net and the PPO algorithm. Specifically, the predicted intention recorded in the experiences is sent to the trajectory decoder and transformed through ISTFT into trajectory time sequences over the future time horizon $[t+1,t+H]$. Subsequently, the trajectory time sequences are compared with the ground truth recorded in the collected experiences to calculate the prediction loss. Finally, the spectrum attention net is optimized by minimizing the prediction loss with stochastic gradient descent. Meanwhile, the collected experiences are used to optimize the network parameters in the PPO algorithm by maximizing the objective function in (9). To sum up, the learning process in the integrated framework is outlined in Algorithm 1.
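As an illustration of the loss computation described above, the sketch below evaluates the prediction loss of Algorithm 1 for one vehicle, assuming the decoded trajectory and the ground truth are already available as lists of $H$ states. Treating $F$ as the number of features per state is an assumption, and the function name is hypothetical. The trajectory decoder and the ISTFT step are not reproduced here.

```python
import math


def prediction_loss(predicted, ground_truth):
    """L_pred = 1 / (H * F) * sum over the horizon of ||s_hat - s||_2.

    predicted, ground_truth: lists of H states, each a list of F floats.
    F is assumed here to be the per-state feature count.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("both trajectories must be non-empty and cover the same horizon")
    horizon = len(predicted)
    n_features = len(predicted[0])
    total = 0.0
    for s_hat, s in zip(predicted, ground_truth):
        # Euclidean distance between predicted and recorded state at one step.
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(s_hat, s)))
    return total / (horizon * n_features)
```

In a real training loop this quantity would be computed with an automatic-differentiation library so that its gradient can reach the spectrum attention net. The plain-Python version only shows the arithmetic.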

Algorithm 1 Integrated intention prediction and decision-making algorithm
1: Initialize spectrum attention net with parameters $\omega$, policy (actor) network with parameters $\theta$, and value (critic) network with parameters $\phi$.
2: Initialize trajectory buffer $B_{t}$ to capacity $H$, training epochs per collect $E$, trajectory length $T$, batch size $B$.
3: for $k=0,1,2,\dots$ do
4:     Collect set of experiences $\mathcal{D}_{k}=\{\tau_{i}\}$ by running policy $\pi_{\theta_{k}}$ in interaction with the environment.
5:     for time $t=0,1,\ldots,T$ do
6:         for observable vehicle $i=1,2,\ldots,N$ do
7:              Record state of vehicle $i$ in $B_{t}^{i}$, transform the segmented historical trajectory into the frequency domain with STFT
8:              Predict the intention of observable vehicle $P_{t}^{i}$ by spectrum attention net based on $B_{t}^{i}$
9:         end for
10:         Update the predicted intention of observable SVs $P_{t}=\{P_{t}^{1},P_{t}^{2},\ldots,P_{t}^{N}\}$
11:         Select action by running policy $\pi_{\theta_{k}}$
12:         Execute action $a_{t}$ and observe reward $r_{t}$ and new state $\mathbf{S}_{t+1}$
13:         Store transition $\left(\mathbf{S}_{t},P_{t},\mathbf{a}_{t},r_{t},\mathbf{S}_{t+1}\right)$
14:     end for
15:     Compute trajectory target return estimates $\hat{R}_{t}$.
16:     Compute advantage estimates $\hat{A}_{k}$ with value $\hat{V}_{\phi}$.
17:     for $e=0,1,\dots,E-1$ do
18:         for minibatch $b\in\mathcal{D}_{k}$ do
19:              Decode the predicted intention $P_{t}$ with the trajectory decoder and transform it into the trajectory time sequence by ISTFT
20:              Update the spectrum attention net by minimizing the prediction loss with SGD
21:              $L_{pred}=\frac{1}{H\cdot F}\sum_{\tau=t+1}^{t+H}\left\|\hat{\mathbf{s}}_{i}^{\tau}-\mathbf{s}_{i}^{\tau}\right\|_{2}$
22:              Update the policy by maximizing the PPO-Clip objective as in (9) with SGD
23:         end for
24:         Fit the value by regression on mean-squared error with SGD:
25:         $L^{\text{VF}}(\phi)=\frac{1}{B\cdot T}\sum_{\tau\in b}\sum_{t=0}^{T-1}(V_{\phi}(s_{t})-\hat{R}_{t})^{2}$
26:     end for
27: end for
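Steps 15, 16, and 25 of Algorithm 1 can be sketched in a few lines of Python. This section does not specify the return and advantage estimators, so the sketch assumes Monte Carlo discounted returns and the simple estimate $\hat{A}_{t}=\hat{R}_{t}-V_{\phi}(s_{t})$. A flat list stands in for the $B\cdot T$ samples of a minibatch, and all function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """R_hat_t = r_t + gamma * R_hat_{t+1}, computed backwards over one trajectory.

    gamma=0.99 matches the discount factor listed in Table I.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages_from_returns(returns, values):
    """Assumed estimator: A_hat_t = R_hat_t - V(s_t). The paper may use another one."""
    return [ret - v for ret, v in zip(returns, values)]


def value_loss(values, returns):
    """Mean-squared error between value predictions and return targets."""
    if len(values) != len(returns) or not values:
        raise ValueError("values and returns must be non-empty and equal in length")
    return sum((v - ret) ** 2 for v, ret in zip(values, returns)) / len(values)
```

The backward recursion avoids recomputing the discounted sum for every timestep, which keeps the cost linear in the trajectory length $T$.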
Remark 2

The prediction and decision-making modules of autonomous driving systems are highly coupled in the conventional setting. Therefore, in this work, we explicitly model the coupling relationship through joint learning of the prediction and decision-making modules. Specifically, in the proposed integrated framework, the spectrum attention net and the decision-making algorithm are trained on identical data and update their parameters simultaneously. It is worth noting that, unlike end-to-end methods that optimize all networks with the decision-making objective function, we optimize the spectrum attention net through the prediction loss, thus ensuring the effectiveness of network parameter updates.

Figure 2: Learning curves of different methods in four representative scenarios: (a) Straight Road, (b) Intersection-v0, (c) Intersection-v1, and (d) Roundabout. The training curves are smoothed by the Savitzky-Golay filter.

IV EXPERIMENTS

In this section, we test the proposed integrated framework in four common autonomous driving scenarios and compare the results with other classic DRL algorithms. The simulations are conducted in the OpenAI Gym environment Highway_Env [20], the behaviors of SVs are characterized by the intelligent driver model (IDM) [21], and the baseline DRL methods are implemented based on Stable-Baselines3 [22].

TABLE I: Hyperparameters
| Hyperparameter | DQN | A2C | PPO | Ours |
|---|---|---|---|---|
| Discount Factor | 0.99 | 0.99 | 0.99 | 0.99 |
| Learning Rate | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
| Batch Size | 64 | 64 | 64 | 64 |
| Exploration Rate | 0.1 | - | - | - |
| Target Network Update Frequency | 5000 | - | - | - |
| Entropy Coefficient | - | 0.0 | 0.0 | 0.0 |
| Value Function Coefficient | - | 0.5 | 0.5 | 0.5 |
| Number of Steps | - | 1024 | 1024 | 1024 |

IV-A Experimental Setup

Straight Road: In this scenario, the AV drives on a four-lane straight road filled with SVs, with the task of safely driving 600 meters without collisions within 1 minute. The reward function is defined as $r=r_{\text{Rel}}+r_{\text{R}}+r_{\text{col}}+r_{\text{LC}}+r_{\text{speed}}$, where $r_{\text{Rel}}$ and $r_{\text{R}}$ represent the rewards for completing the task and for driving on the right side of the road, respectively. $r_{\text{col}}$ and $r_{\text{LC}}$ are the penalties for collision and lane change, the latter of which reduces unnecessary lane-changing behavior. $r_{\text{speed}}$ rewards high-speed driving within the speed limit and penalizes exceeding it.

Intersection-v0: This scenario is a four-way intersection, with the task of safely and quickly driving toward the designated lane. The reward function is defined as $r=r_{\text{Rel}}+r_{\text{col}}\left(v,N\right)+r_{\text{OfR}}+r_{\text{speed}}$, where $r_{\text{OfR}}$ is the penalty for driving off the road, and $\left(v,N\right)$ denotes the speed of the AV and the number of SVs in every time interval, which reflects the collision rate.

Intersection-v1: It is similar to Intersection-v0, except that there are two lanes in one direction, and vehicles are allowed to change lanes. Therefore, a lane-changing penalty is added to the reward function, which is written as $r=r_{\text{Rel}}+r_{\text{col}}\left(v,N\right)+r_{\text{OfR}}+r_{\text{LC}}+r_{\text{speed}}$.

Roundabout: In this scenario, there is a roundabout with two lanes, with the task of passing through the roundabout as soon as possible without collision. To prevent vehicles from circling, a timeout penalty $r_{\text{TO}}$ is added to the reward function, which can be expressed as $r=r_{\text{Rel}}+r_{\text{col}}\left(v,N\right)+r_{\text{OfR}}+r_{\text{LC}}+r_{\text{TO}}+r_{\text{speed}}$.

To validate the effectiveness of the proposed method, we evaluate its performance in the above scenarios and compare it with three baseline DRL methods. The number of SVs is 8. In each simulation, we randomize the states of all agents to prevent the policy network from memorizing actions. The baseline methods are as follows:

  • A2C performs gradient ascent on the parameter $\theta$ by $\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau}\left[\sum_{t=0}^{T-1}\nabla_{\theta}\log\pi_{\theta}\left(a_{t}\mid s_{t}\right)G_{t}\right]$ to directly maximize the reward [23].

  • DQN updates the Q function using the SGD algorithm by minimizing the loss $\mathcal{L}=1/N\sum_{j=0}^{N-1}\left(Q\left(s_{j},a_{j}\right)-y_{j}\right)^{2}$ [24].

  • PPO updates the policy indirectly by maximizing a surrogate objective function as in (9), which provides a conservative estimate of the potential change in $J(\pi_{\theta})$ resulting from the update [25]. This method serves as the ablation baseline, since it corresponds to the proposed framework without the intention prediction module.
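For comparison with the policy-gradient baselines, the DQN loss in the list above can be sketched as follows. The text does not define the target $y_{j}$, so the sketch assumes the standard one-step target $y_{j}=r_{j}+\gamma\max_{a'}Q_{\text{target}}(s_{j+1},a')$, with no bootstrapping at terminal states. The function names and the list-based inputs are hypothetical.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Assumed one-step targets y_j = r_j + gamma * max_a' Q_target(s', a').

    next_q_values[j] holds the target-network Q-values of every action in the
    next state; terminal transitions (dones[j] is True) do not bootstrap.
    """
    targets = []
    for r, next_q, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(r + bootstrap)
    return targets


def dqn_loss(q_taken, targets):
    """L = 1/N * sum_j (Q(s_j, a_j) - y_j)^2 over a batch of N transitions."""
    if len(q_taken) != len(targets) or not q_taken:
        raise ValueError("q_taken and targets must be non-empty and equal in length")
    return sum((q - y) ** 2 for q, y in zip(q_taken, targets)) / len(q_taken)
```

Unlike the clipped objective in (9), this loss places no explicit bound on how far a single update can move the greedy policy.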

The hyperparameters for network training of the four algorithms are listed in Table I, and all networks use Adam as the optimizer. The experiments in the aforementioned scenarios are conducted on a Windows 11 system with an Intel Core i7-14700KF processor and an NVIDIA RTX 4070 Ti GPU.

IV-B Performance Demonstration

Figure 3: Illustrations of the AV’s behavior attained by the proposed integrated intention prediction and decision-making framework in four scenarios. Four representative snapshots for each scenario during the performance evaluation are presented: Straight Road ((a)-(d), time steps $t=60$, $100$, $115$, $140$), Intersection-v0 ((e)-(h), $t=15$, $35$, $65$, $115$), Intersection-v1 ((i)-(l), $t=20$, $40$, $65$, $90$), and Roundabout ((m)-(p), $t=20$, $60$, $110$, $150$). The green car and blue cars represent the AV and SVs under normal driving conditions, respectively. The color of the trajectory reflects the speed of the vehicle, with the transition from purple to yellow indicating an increase in vehicle speed.

To verify the effectiveness of the proposed algorithm, we implement our framework and the baseline methods in the aforementioned scenarios and obtain the learning curves shown in Fig. 2. It can be observed that the training results of the proposed algorithm are significantly better than those of all baseline methods. This could be attributed to the effectiveness of the intention prediction module within the proposed integrated framework. The decision-making process of SVs takes into account both STD and LTO intentions, where the LTO intention indicates the willingness of SVs to complete their tasks and the STD intention reflects short-term obstacle avoidance. The AV predicts the intentions of SVs and makes decisions accordingly in quick succession, resulting in greater rewards than the baseline methods. Compared with the PPO ablation without intention prediction, the proposed method achieves a higher reward in all scenarios, although the improvement in the roundabout scenario is only marginal. The reason is that our framework has some difficulty in taking spatio-spectrum information into account, which significantly affects the performance in the roundabout scenario due to its complex road structure. Compared with the other two baseline methods, the improvement is even greater.

To further evaluate the performance of the integrated intention prediction and decision-making framework, we evaluate the trained policy in the test environment and set three metrics to assess the effectiveness of each algorithm: (1) we define the success rate as the ratio of successful task completions in 100 trials, which reflects the ability of the policy trained by each algorithm to complete the task; (2) we define the efficiency of the trained policy as $\left(t_{\text{max}}-t\right)/\left(t_{\text{max}}-t_{\text{min}}\right)$, where $t$ is the task completion time and $t_{\text{max}}$ and $t_{\text{min}}$ are the time thresholds for task completion; (3) we define the safety as the proportion of the 100 trials in which the AV does not collide with SVs or the road boundary. The results are shown in Table II. It can be observed that the proposed framework achieves the highest success rate and safety in all four scenarios. Its efficiency is the highest in the Straight Road and Roundabout scenarios, ties the best baseline in Intersection-v0, and is one percentage point below A2C in Intersection-v1. This can be attributed to predicting the LTO and STD intentions of SVs, which enables the agent to strike a balance between safety and efficiency.
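The three metrics can be computed from per-trial records as in the following sketch. The record layout (dictionaries with `success` and `collided` keys) and the function names are hypothetical. How per-trial efficiency values are aggregated into a single number is not stated, so only the per-trial formula is shown.

```python
def success_rate(trials):
    """Percentage of trials in which the task was completed."""
    return 100.0 * sum(1 for tr in trials if tr["success"]) / len(trials)


def safety_rate(trials):
    """Percentage of trials without a collision with SVs or the road boundary."""
    return 100.0 * sum(1 for tr in trials if not tr["collided"]) / len(trials)


def efficiency(t, t_min, t_max):
    """(t_max - t) / (t_max - t_min) for a task completed at time t."""
    if t_max <= t_min:
        raise ValueError("t_max must be greater than t_min")
    return (t_max - t) / (t_max - t_min)


# Hypothetical records for four trials, used only to illustrate the calls.
example_trials = [
    {"success": True, "collided": False},
    {"success": True, "collided": False},
    {"success": False, "collided": True},
    {"success": True, "collided": False},
]
example_success = success_rate(example_trials)
example_safety = safety_rate(example_trials)
```

Under this definition, a trial that finishes at $t_{\text{min}}$ has efficiency 1 and a trial that finishes at $t_{\text{max}}$ has efficiency 0.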

TABLE II: Comparison of success rate, efficiency, and safety among different methods in various driving scenarios.
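| Methods | Straight Road succ.(%) | Straight Road eff.(%) | Straight Road safety(%) | Intersection-v0 succ.(%) | Intersection-v0 eff.(%) | Intersection-v0 safety(%) | Intersection-v1 succ.(%) | Intersection-v1 eff.(%) | Intersection-v1 safety(%) | Roundabout succ.(%) | Roundabout eff.(%) | Roundabout safety(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A2C | 65 | 74 | 79 | 70 | 92 | 72 | 42 | 91 | 46 | 67 | 54 | 81 |
| PPO | 75 | 70 | 81 | 68 | 88 | 71 | 67 | 81 | 68 | 91 | 82 | 94 |
| DQN | 61 | 67 | 77 | 68 | 84 | 73 | 66 | 71 | 73 | 61 | 88 | 65 |
| Ours | 93 | 86 | 95 | 88 | 92 | 91 | 87 | 90 | 89 | 93 | 92 | 96 |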

Fig. 3 shows the interactive behavior of the AV in the four scenarios. As a representative example, in the Intersection-v1 scenario, the goal of the AV is to navigate the intersection safely and efficiently. In the left-turn task, the AV notices the intention of a slow-moving vehicle that is making a left turn, and therefore changes lanes before a conflict occurs to avoid a collision. When the AV detects a lane conflict with the planned route of the same vehicle in front, it opts to change to a safer but farther lane. This behavior demonstrates the trade-off between efficiency and safety considered by the AV, which selects the safer decision after evaluation. Here, we only visualize the AV’s trajectory in the representative snapshots.

V CONCLUSIONS

In this paper, we propose a novel integrated intention prediction and decision-making framework for autonomous driving systems from the perspective of the frequency domain. A spectrum attention net is designed to capture the trend of each frequency component over time and their interrelations, so that the intentions of SVs can be predicted. In addition, the PPO algorithm tackles the non-stationary problem in the integrated framework by employing a clipping mechanism within its objective function. Experiments in four representative traffic scenarios are conducted to verify the effectiveness of the proposed framework. The results show that the proposed framework outperforms the baselines in terms of the success rate of completing driving tasks, efficiency, and safety. Future work involves taking advantage of the frequency domain representation in dealing with uncertainty in autonomous driving tasks.

References

  • [1] S. R. Jaladi, Z. Chen, N. R. Malayanur, R. M. Macherla, and B. Li, “End-to-End Training and Testing Gamification Framework to Learn Human Highway Driving,” in IEEE 25th Conf. Intell. Transport. Syst. Proc., pp. 4296–4301, 2022.
  • [2] K. Shu, H. Yu, X. Chen, S. Li, L. Chen, Q. Wang, L. Li, and D. Cao, “Autonomous Driving at Intersections: A Behavior-Oriented Critical-Turning-Point Approach for Decision Making,” IEEE-ASME Trans. Mechatron., vol. 27, no. 1, pp. 234–244, 2022.
  • [3] Y. Jiao, M. Miao, Z. Yin, C. Lei, X. Zhu, X. Zhao, L. Nie, and B. Tao, “A Hierarchical Hybrid Learning Framework for Multi-Agent Trajectory Prediction,” IEEE Trans. Intell. Transp. Syst., pp. 1–11, 2024.
  • [4] M. Kloock, P. Scheffe, S. Marquardt, J. Maczijewski, B. Alrifaee, and S. Kowalewski, “Distributed model predictive intersection control of multiple vehicles,” in IEEE 22nd Conf. Intell. Transport. Syst. Proc., pp. 1735–1740, IEEE, 2019.
  • [5] X. Tang, K. Yang, H. Wang, J. Wu, Y. Qin, W. Yu, and D. Cao, “Prediction-Uncertainty-Aware Decision-Making for Autonomous Vehicles,” IEEE Trans. Intell. Veh., vol. 7, no. 4, pp. 849–862, 2022.
  • [6] Z. Huang, H. Liu, J. Wu, and C. Lv, “Differentiable Integrated Motion Prediction and Planning with Learnable Cost Function for Autonomous Driving,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–15, 2023.
  • [7] J. Ding, L. Li, H. Peng, and Y. Zhang, “A Rule-Based Cooperative Merging Strategy for Connected and Automated Vehicles,” IEEE Trans. Intell. Transp. Syst., vol. 21, no. 8, pp. 3436–3446, 2020.
  • [8] J. Ma, Z. Cheng, X. Zhang, M. Tomizuka, and T. H. Lee, “Alternating Direction Method of Multipliers for Constrained Iterative LQR in Autonomous Driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 23031–23042, 2022.
  • [9] Z. Peng, X. Zhou, L. Zheng, Y. Wang, and J. Ma, “Reward-Driven Automated Curriculum Learning for Interaction-Aware Self-Driving at Unsignalized Intersections,” arXiv preprint arXiv:2403.13674, 2024.
  • [10] Z. Peng, X. Zhou, Y. Wang, L. Zheng, M. Liu, and J. Ma, “Curriculum Proximal Policy Optimization with Stage-Decaying Clipping for Self-Driving at Unsignalized Intersections,” in Proc. 26th Int. IEEE Conf. Intell. Transp. Syst., 2023.
  • [11] J. Leu, Y. Wang, M. Tomizuka, and S. Di Cairano, “Autonomous Vehicle Parking in Dynamic Environments: An Integrated System with Prediction and Motion Planning,” in Int. Conf. Robot. Autom., pp. 10890–10897, 2022.
  • [12] E. Zhang, R. Zhang, and N. Masoud, “Predictive Trajectory Planning for Autonomous Vehicles at Intersections Using Reinforcement Learning,” Transp. Res. Pt. C-Emerg. Technol., vol. 149, 2023.
  • [13] Y. Chen, B. Ivanovic, and M. Pavone, “ScePT: Scene-Consistent, Policy-Based Trajectory Predictions for Planning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 17082–17091, 2022.
  • [14] M. Ye, Y. Kuang, J. Wang, Y. Rui, W. Zhou, H. Li, and F. Wu, “State Sequences Prediction via Fourier Transform for Representation Learning,” in Adv. Neural Inf. Proces. Syst., vol. 36, pp. 67565–67588, 2023.
  • [15] S. Liu, Y. Zhu, P. Yao, T. Mao, and Z. Wang, “SpectrumNet: Spectrum-Based Trajectory Encode Neural Network for Pedestrian Trajectory Prediction,” in IEEE Int. Conf. Acoust., Speech Signal Process., Proc., pp. 7075–7079, 2024.
  • [16] C. Wong, B. Xia, Z. Hong, Q. Peng, W. Yuan, Q. Cao, Y. Yang, and X. You, “View Vertically: A Hierarchical Network for Trajectory Prediction via Fourier Spectrums,” in Lecture Notes in Computer Science, ch. 39, pp. 682–700, 2022.
  • [17] Y. Luo, J. Zhang, S. Wang, F. Lv, J. Zhang, and H. Gao, “A Driving Intention Prediction Method for Mixed Traffic Scenarios,” in IEEE 7th Int. Conf. Intell. Transp. Eng., pp. 296–301, 2022.
  • [18] J. Karlsson and J. Tumova, “Intention-Aware Motion Planning with Road Rules,” in IEEE 16th Int. Conf. Autom. Sci. Eng., pp. 526–532, 2020.
  • [19] J. B. Allen and L. R. Rabiner, “A Unified Approach to Short-Time Fourier Analysis and Synthesis,” Proc. IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
  • [20] E. Leurent, “An Environment for Autonomous Driving Decision-Making.” https://github.com/eleurent/highway-env, 2018.
  • [21] M. Treiber, A. Hennecke, and D. Helbing, “Congested Traffic States in Empirical Observations and Microscopic Simulations,” Phys. Rev. E, vol. 62, no. 2, p. 1805, 2000.
  • [22] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable Reinforcement Learning Implementations,” J. of Mach. Learn. Res., vol. 22, no. 268, pp. 1–8, 2021.
  • [23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” in Int. Conf. on Mach. Learn. Proc., pp. 1928–1937, 2016.
  • [24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv preprint arXiv:1707.06347, 2017.