Integrated Intention Prediction and Decision-Making with Spectrum Attention Net and Proximal Policy Optimization

Xiao Zhou, Chengzhen Meng, Wenru Liu, Zengqi Peng, Ming Liu, and Jun Ma This work was supported in part by the National Natural Science Foundation of China under Grant 62303390; and in part by the Guangzhou-HKUST(GZ) Joint Funding Scheme under Grant 2024A03J0618. (Corresponding Author: Jun Ma.)Xiao Zhou, Chengzhen Meng, Wenru Liu, Zengqi Peng, and Ming Liu are with the Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).Jun Ma is with the Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China, and also with the Division of Emerging Interdisciplinary Areas, The Hong Kong University of Science and Technology, Hong Kong SAR, China. (e-mail: [email protected]).

Abstract

For autonomous driving in highly dynamic environments, it is anticipated to predict the future behaviors of surrounding vehicles (SVs) and make safe and effective decisions. However, modeling the inherent coupling effect between the prediction and decision-making modules has been a long-standing challenge, especially when there is a need to maintain appropriate computational efficiency. To tackle these problems, we propose a novel integrated intention prediction and decision-making approach, which explicitly models the coupling relationship and achieves efficient computation. Specifically, a spectrum attention net is designed to predict the intentions of SVs by capturing the trends of each frequency component over time and their interrelations. Fast computation of the intention prediction module is attained as the predicted intentions are not decoded to trajectories in the executing process. Furthermore, the proximal policy optimization (PPO) algorithm is employed to address the non-stationary problem in the framework through a modest policy update enabled by a clipping mechanism within its objective function. On the basis of these developments, the intention prediction and decision-making modules are integrated through joint learning. Experiments are conducted in representative traffic scenarios, and the results reveal that the proposed integrated framework demonstrates superior performance over several deep reinforcement learning (DRL) baselines in terms of success rate, efficiency, and safety in driving tasks.

I INTRODUCTION

Advancements in computer technology and artificial intelligence have led to significant developments in the field of autonomous driving technology. Nevertheless, the realization of autonomous driving systems remains a formidable challenge, with system design standing as a principal barrier. The architectures of contemporary autonomous driving systems fall into two major categories: end-to-end and hierarchical systems. While there have been notable advancements in end-to-end autonomous driving systems based on machine learning [1], these systems continue to grapple with issues of poor interpretability, high data demands, limited generalization capabilities, and insufficient robustness. Alternatively, hierarchical autonomous driving systems [2, 3], which involve the process of perception, prediction, decision-making, planning, and control have exhibited considerable enhancements in terms of interpretability and robustness. However, for such systems, tasks including prediction and decision-making are often treated as relatively distinct modules, which ignores the inherent coupling relationship between them. This mutual effect comes from the influence of the prediction module on the decision-making module; and also, due to the interaction between vehicles, the decision-making module will in turn affect the behavior of the surrounding vehicles (SVs), thereby affecting the next prediction step. Since the coupling between prediction and decision-making could be overlooked, this separation can lead to deteriorated performance of autonomous vehicles (AVs) in complex situations. In response to this shortcoming, several studies have proposed frameworks that integrate the prediction and decision-making modules [4, 5], leveraging predicted trajectories or intention to inform the decision-making modules in autonomous driving. In integrated frameworks such as [6], the forecasting and imitation nets are trained jointly, which is more efficient compared to training the network individually. Despite these efforts, the design of an integrated framework remains a significant challenge.

In general, the output of the decision-making module can be obtained through approaches such as rule-based methods [7], optimization-based methods [8], and learning-based methods [9]. Rule-based and optimization-based methods have obtained commendable results owing to their strong interpretability and related attributes. Nevertheless, due to the limited adaptability of these methods, they generally deliver poor performance in complex real-world environments. On the other hand, learning-based methods have demonstrated significant and efficacious progress, with a particularly promising direction in reinforcement learning (RL). A deep reinforcement learning (DRL) method presented in [10] adjusts the clipping parameter at different stages of training using proximal policy optimization (PPO), so that the vehicle could quickly search for an approximate optimal policy or its neighborhood with a large parameter and then converge to the optimal policy with a smaller one. The close interplay between prediction and decision-making implies that the prediction module exerts a considerable influence on decision-making module. Therefore, it is crucial to incorporate a well-developed prediction module in the framework to ensure that the decisions made by the system are safe and reliable.

Mainstream prediction methods typically focus on intention or trajectory prediction. Within the integrated framework, trajectory prediction is more commonly utilized and it can be divided into two main types: optimization-based and learning-based methods. Exemplified by [11], optimization-based methods are highly efficient in capturing long-term movements, but replicating this success in complex urban road environments proves challenging due to the intricate road layouts and variable traffic conditions. In contrast, learning-based methods [12, 13] are more adept at handling prediction tasks in such complex settings, benefiting from the strengths of neural networks in processing large amounts of trajectory data and understanding urban road conditions. Learning-based methods forecast future trajectories by analyzing historical trajectory data, resulting in encouraging outcomes. The performance can be improved when transforming trajectory representation from the time domain to the frequency domain, as the frequency domain reflects different characteristics such as energy distribution. Relevant records have shown that these features in the frequency domain are critical to the prediction process. A novel approach for predicting state sequences through Fourier transform (FT) is proposed by [14], which extracts hidden structural information from the frequency domain, thereby enhancing sample efficiency. A network called SpectrumNet is introduced in [15] to predict pedestrian trajectories, which is capable of breaking down information across various time scales through FT and representing it in the frequency domain. In [16], the future trajectory is predicted from the history trajectory, which is transformed from the time domain to the frequency domain using FT. The low-frequency and high-frequency components reflect the long-term objective (LTO) and the short-term dynamic (STD) intention, respectively. However, the FT used in [14, 15, 16] fails to account for changes in frequency over time, hindering the dynamic estimation of trends in trajectory changes and affecting the performance of prediction in real-world scenarios. Given that the goal of trajectory prediction is to obtain precise future trajectories, it demands considerable computational resources, leading to difficulties in prediction within complex dynamic environments. In addition to trajectory prediction, intention prediction has also been the subject of extensive research. For example, an intention prediction approach developed by [17] utilizes Gaussian mixed hidden Markov models to forecast the short-term intention of vehicles based on their historical trajectories. However, this method concentrates exclusively on past trajectory data to predict short-term driving intention and does not take into account the LTO intention. The same issue can also be found in the majority of intention prediction approaches, such as [18]. The absence of consideration for LTO intention challenges the assurance of long-term decision-making effectiveness, which can compromise robustness in complex driving scenarios applied within integrated frameworks.

To address the aforementioned issues, we propose an integrated intention prediction and decision-making approach for autonomous driving, which leverages the proposed spectrum attention net to predict the intention of SVs at the frequency domain and utilizes the PPO algorithm to generate the decisions. The main contributions of this paper are as follows: An integrated intention prediction and decision-making framework is proposed to explicitly model the strong coupling effect between intention prediction and decision-making, where we use a joint learning mechanism to learn the coupling and ensure the effectiveness of network parameter updates. Specifically, we propose the spectrum attention net to predict the intention of the SVs with the spectrum of trajectory obtained by short-time Fourier transform (STFT). With the predicted intention of SVs (which is the combination of LTO and STD intention), the AV can make safer and more effective decisions. The effectiveness of the proposed integrated framework is demonstrated by comparisons with existing methods in various driving scenarios, and the results indicate that the proposed approach outperforms other baseline methods in these driving tasks.

We organize the remainder of the article as follows: Section II introduces the preliminaries of the Markov decision process (MDP) and STFT, and also gives the problem statement. The details of the proposed integrated framework are given in Section III, including the design of the spectrum attention net, the PPO algorithm, and the joint learning mechanism in the framework. Based on these findings, we conduct experiments in four different scenarios described in Section IV. Finally, we summarize the conclusion and future work in Section V.

II Preliminaries

II-A Markov Decision Process

MDP is used to describe decision-making problems with the Markov property. In an MDP, the agent selects the optimal decision based on the current state and possible actions to maximize long-term rewards. The MDP consists of five elements, which can be represented by the tuple $\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$ , where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the state transition probability function that represents the probability of transitioning from a past state to a new state after taking the action. $\mathcal{R}$ is the reward function that represents the immediate reward obtained after taking action $a$ in state $s$ , and $\gamma\in(0,1)$ is the discount factor used to discount the value of future rewards.

II-B Short-Time Fourier Transform

FT is pivotal in applications that need to transform signals between the time and frequency domains. An important variant is the STFT [19], which addresses the limitation of the FT in analyzing non-stationary signals where frequency components vary over time. The primary purpose of the STFT is to determine the frequency and phase content of local sections of a signal as it changes over time. This is achieved by multiplying the signal by a window function which is shifted along the duration of the signal. For a signal $x[n]$ where $n$ represents the time step, the STFT transforms it into the frequency domain by applying FT to segments of the signal, defined by the window function. The STFT can be expressed as:

\begin{split}X(m,\omega)=\sum_{n=-\infty}^{\infty}x[n]\cdot w[n-m]\cdot e^{-j% \omega n}\end{split}

(1)

where $w[n-m]$ denotes the window function centered around time $m$ , $m$ is the index of time frame, and $\omega$ corresponds to the frequency.

On the other hand, when the frequency domain signal obtained from STFT needs to be converted to the time domain, the inverse short-time Fourier transform (ISTFT) can be adopted. Given the STFT representation $X(m,\omega)$ of a signal, the corresponding signal $x(n)$ at time domain can be reconstructed using the formula:

x(n)=\sum_{m}\sum_{\omega}X(m,\omega)\cdot w(n-m)\cdot e^{j\omega n}

(2)

II-C Problem Statement

The objective of this research is to develop a framework for effectively navigating AV in various traffic environments. We will focus on four key scenarios: a straight roadway, a four-way intersection, a two-way intersection, and a roundabout.

By modeling this problem as an MDP, our goal is to find an optimal policy $\pi_{\theta}$ , guided by parameters $\theta$ , that maximizes expected cumulative discounted rewards, formulated as:

\begin{array}[]{cl}\max_{\pi_{\theta}}&\mathbb{E}\left[\sum_{t=1}^{T}\gamma^{t% }r_{t}\left(\mathbf{S}_{t},a_{t}\right)\right]\\ \text{ s.t. }&a_{t}\in\mathcal{A},\mathbf{S}_{t}\in\mathcal{S}\end{array}

(3)

Specifically, the state space $\mathcal{S}$ and action space $\mathcal{A}$ in the problem is defined as:

State space $\mathcal{S}$ : We define the state at any given time as a matrix,

\begin{split}\mathbf{S}_{t}=\left[\ \left[\mathbf{s}_{t}^{0}\right]^{T}\ \ % \left[\mathbf{s}_{t}^{1}\right]^{T}\ ...\ \left[\mathbf{s}_{t}^{N}\ \right]^{T% }\right]^{T}\end{split}

(4)

where $N$ is the number of vehicles. The state of the AV is captured in the first row of $\mathbf{S}_{t}$ as $\mathbf{s}_{t}^{0}$ , and other rows of $\mathbf{S}_{t}$ , i.e., $\mathbf{s}_{t}^{i}\ (i=1,2,...,N)$ , represent vectors of the kinematic features of the SVs. Each vehicle’s state is characterized by the following kinematic features:

\begin{split}\mathbf{s}_{t}^{i}={\left[\begin{array}[]{l l l l l l l}{x_{t}^{i% }}&{y_{t}^{i}}&{v_{x,t}^{i}}&{v_{y,t}^{i}}&{\sin\psi_{t}^{i}}&{\cos\psi_{t}^{i% }}\end{array}\right]}\end{split}

(5)

which contains the information about the position, velocity, and orientation of vehicle $i$ at time step $t$ .

Action space $\mathcal{A}$ : The agent has access to a set of five discrete actions,

\begin{split}\mathcal{A}=\left\{A^{0},A^{1},A^{2},A^{3},A^{4}\right\}\end{split}

(6)

including lateral maneuvers such as changing lanes to the left or right ( $A^{0}$ , $A^{2}$ ), maintaining current motion ( $A^{1}$ ), and longitudinal adjustments like decelerating or accelerating ( $A^{3}$ , $A^{4}$ ).

III METHODOLOGY

In this section, we present an inaugural framework for integrated intention prediction and decision-making as shown in Fig. 1, where the AV utilizes the intention prediction obtained from the spectrum attention net to guide its decisions. We first introduce the details of the proposed spectrum attention net, which is used to predict the intention of SVs. Then, we illustrate the advantages of the PPO algorithm in processing complex and non-stationary inputs as a decision-making module in the integrated framework. Finally, we give the details about the joint learning process of the intention prediction module and decision-making module in the integrated framework.

Refer to caption — Figure 1: Overall architecture of the proposed integrated intention prediction and decision-making framework is exemplified through a four-way intersection scenario. For observable SV $i$ , its segmented historical trajectory is transformed into a spectrogram by STFT first, then used to predict its intention with the spectrum attention net. The decision-making module receives the predicted intention of SVs and makes a decision based on predicted intention and direct observation. Note that the trajectory decoder is only executed in the joint training process to improve the real-time capabilities of the integrated framework.

III-A Intention Prediction Module with Spectrum Attention Net

The behavior of SVs is affected by both LTO and STD intention, which can be reflected by the low-frequency component and high-frequency component in the spectrum, respectively. Therefore, we propose a spectrum attention net to predict the intention of SVs from the frequency spectrum, where the spectral information can characterize the intention of SVs.

At time step $t$ , the AV observes current environmental states, which include information about both itself and observable SVs. The state information of these observable SVs is recorded in the trajectory buffer with a horizon of $H$ to maintain a consistent input length for the spectrum attention net. Specifically, for each observable vehicle $i$ in the scenario, its state information $\mathbf{s}^{i}_{t}$ over the time interval $[t-H+1,t]$ is incorporated into the trajectory buffer $T_{ego,i}$ , thereby forming a segmented historical trajectory for vehicle $i$ . By utilizing the segmented historical trajectory as time sequence input, we derive the features of multiple frequency components evolving over time through the transformation of time sequence input into the frequency domain using STFT, expressed as follows:

\begin{split}X_{s}(m,\omega)=\sum_{n=t-H+1}^{t}\mathbf{s}^{i}_{t}\cdot w[t-m]% \cdot e^{-j\omega t}\end{split}

(7)

where $X_{s}(m,\omega)\in\mathbb{R}^{F\times\Omega\times M}$ denotes the trend of the frequency component at $\omega$ across time frame $m$ . Note that $\Omega$ is the number of frequency components, $M$ is the number of time frames, and $F$ denotes the dimension of kinematic features. Indeed, STFT can be regarded as a feature extraction function that derives intent representation from a segmented historical trajectory. The fixed time-frequency resolution of STFT ensures uniform representation of each time frame, which is crucial for the attention mechanism to effectively discern and correlate different temporal segments, thereby facilitating accurate intention predictions.

The behavior of SVs involves weighing different levels of intention, and the intention is interrelated. The frequency components at different time frames within the spectrogram $X_{s}(m,\omega)$ depict the distribution of intention across time. Therefore, to thoroughly research intention patterns across diverse combinations of frequency components, we introduce a multi-head spectrum attention net in this study to forecast the intention of SVs. Initially, the spectrogram $X_{s}(m,\omega)$ is embedded with a linear projection and then sent to the spectral attention module. Through the transformations executed by networks $L_{q}$ , $L_{k}$ , and $L_{v}$ , all embedded vectors are converted into query $Q\in\mathbb{R}^{F\times\Omega\times M}$ , key $K\in\mathbb{R}^{F\times\Omega\times M}$ , and value $V\in\mathbb{R}^{F\times\Omega\times M}$ , respectively. The attention scores are computed by multiplying query matrix $Q$ with the transpose of $K$ , followed by scaling with the inverse square root of dimension $M$ and normalization via a softmax function across the frequency dimension. This process yields a spectral attention matrix, which is utilized to evaluate the correlation among different frequency components, thereby facilitating the prediction about the intention of vehicle $i$ . The resulting prediction vector is subsequently fed into the linear projection network, with each head of spectral attention generating outputs as represented below:

P_{t}^{i}=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V

(8)

Finally, the outputs from all heads are combined and decoded by the intention decoder, ultimately serving as the predicted intention of SV.

Remark 1

The behavior of SVs is affected by multiple levels of intention at the same time, including LTO and STD intention. For a spectrogram $X_{s}(m,\omega)$ that delineates frequency domain variations over time, the attention mechanism can capture the trend of each frequency component over time and their interrelations. Therefore, reasonable predictions about the spectrum distribution in a future time horizon can be obtained by spectrum attention net and reflect the intention of SVs.

III-B Decision-Making Module with PPO

One of the main challenges in the decision-making module of the integrated framework is the non-stationary problem led by joint learning. In the joint learning process, the intention prediction in the observation input of the decision-making algorithm keeps changing, therefore the decision-making policy needs to be adjusted accordingly. As the spectrum attention net in the intention prediction module updates, corresponding adjustments of the policy in the decision-making module become necessary. However, unchecked policy updates can lead to catastrophic consequences in this scenario. Significant updates on the policy of the decision-making module based on current intent prediction module may lead to excellent performance in current environment, but the performance could be suboptimal or even terrible after the intent prediction module updates its network parameters. To address this concern, we adopt the PPO algorithm in the integrated framework.

The PPO algorithm is an online reinforcement learning framework designed to solve sequential decision-making problems. Notably, the PPO algorithm employs a clipping mechanism within its objective function to ensure modest policy updates. This feature becomes especially crucial when handling complex and non-stationary multimodal inputs from the intention prediction module. Specifically, the objective function of PPO designed to mitigate the risk of destabilizing updates is given by:

L^{\text{CLIP}}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\left(r_{t}(\theta)\hat{% A}_{t},\text{clip}\left(r_{t}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_{t}% \right)\right]

(9)

where $r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}$ represents the probability ratio of selecting action $a_{t}$ in state $s_{t}$ under the new policy $\pi_{\theta}$ compared to the old policy $\pi_{\theta_{old}}$ . Additionally, $\hat{A}_{t}$ denotes the estimated advantage function at timestep $t$ , and $\epsilon$ serves as a hyperparameter defining the clipping boundary.

Essentially, the clipped surrogate objective in PPO provides a theoretically constrained network update routine to tackle the non-stationary problem associated with the intention prediction module. This mechanism ensures a stable policy update step within the clipping boundary, which is indispensable for the algorithm’s convergence and stationarity in a non-stationary environment.

III-C Joint Learning Process

The integrated intention prediction and decision-making framework is proposed in this study utilizing the spectrum attention net and PPO algorithm. The joint training of the intention prediction and decision-making modules is employed, as it outperforms the standalone training mechanism [6].

For the joint learning process in the integrated framework, the agent needs to explore the environment and collect experience based on the current network parameters first. In detail, the spectrum attention net predicts SVs’ intention and the PPO algorithm makes decisions based on the observation and predicted intention to explore the environment and collect experiences. Then, the collected experiences serve as training data for the spectrum attention net and the PPO algorithm in the integrated framework. Specifically, the predicted intention recorded in the experiences is sent to the trajectory decoder and transformed to trajectory time sequences in the future time horizon $[t+1,t+H]$ through ISTFT transformation. Subsequently, the trajectory time sequences are compared with the ground truth recorded in the collected experience to calculate the prediction loss. Finally, the spectrum attention net is optimized by minimizing the prediction loss with stochastic gradient descent. Meanwhile, the collected experiences contribute to optimizing the network parameters in the PPO algorithm by maximizing the objective function as in (9). To sum up, the learning process in the integrated framework is outlined in Algorithm 1.

Algorithm 1 Integrated intention prediction and decision-making algorithm

1:Initialize spectrum attention net with parameters

\omega

, policy (actor) network with parameters

\theta

and value (critic) network with parameters

\phi

2:Initialize trajectory buffer

B_{t}

to capacity

H

, training epochs per collect

E

, trajectory length

T

, batch size

B

3:for

k=0,1,2,\dots

4: Collect set of experiences

\mathcal{D}_{k}=\{\tau_{i}\}

by running policy

\pi_{\theta_{k}}

interaction with environment.

5: for time

t=0,1,\ldots,T

6: for observable vehicle

i=1,2,\ldots,N

7: Record state of vehicle

i

B_{t}^{i}

, transform the segmented historical trajectory into the frequency domain with STFT

8: Predict the intention of observable vehicle

P_{t}^{i}

by spectrum attention net based on

B_{t}^{i}

9: end for

10: Update the predicted intention of observable SVs

P_{t}=\{P_{t}^{1},P_{t}^{2},\ldots,P_{t}^{N}\}

11: Select action by running policy

\pi_{\theta_{k}}

12: Execute action

a_{t}

and observe reward

r_{t}

and new state

\mathbf{S}_{t+1}

13: Store transition

\left(\mathbf{S}_{t},P_{t},\mathbf{a}_{t},r_{t},\mathbf{S}_{t+1}\right)

14: end for

15: Compute trajectory target return estimates

\hat{R}_{t}

16: Compute advantage estimates

\hat{A}_{k}

with value

\hat{V}_{\phi}

17: for

e=0,1,\dots,E-1

18: for minibatch

b\in\mathcal{D}_{k}

19: Decode the predicted intention

P_{t}

with the trajectory decoder and transform it into the trajectory time sequence by ISTFT

20: Update the spectrum attention net by minimizing the prediction loss with SGD

21:

L_{pred}=\frac{1}{H\cdot F}\sum_{\tau=t+1}^{t+H}\left\|\hat{\mathbf{s}}_{i}^{% \tau}-\mathbf{s}_{i}^{\tau}\right\|_{2}

22: Update the policy by maximizing the PPO-Clip objective as in (9) with SGD

23: end for

24: Fit the value by regression on mean-squared error with SGD:

25:

L^{\text{VF}}(\phi)=\frac{1}{B\cdot T}\sum_{\tau\in b}\sum_{t=0}^{T-1}(V_{\phi% }(s_{t})-\hat{R}_{t})^{2}

26: end for

27:end for

Remark 2

The prediction and decision-making modules of autonomous driving systems are highly coupled in the conventional setting. Therefore, in this work, we explicitly model the coupling relationship through joint learning of the prediction and the decision-making modules. Specifically, in the proposed integrated framework, the spectrum attention net and the decision-making algorithm use the identical data for training and then update parameters simultaneously. It is worth noting that unlike end-to-end methods which optimize all networks with decision-making objective function, we optimize the spectrum attention net through the prediction loss, thus ensuring the effectiveness of network parameter updates.

IV EXPERIMENTS

In this section, we test the proposed integrated framework in the four common autonomous driving scenarios and compare the results with other classic DRL algorithms. The simulations are conducted in an OpenAI Gym environment $Highway\_Env$ [20], and the behaviors of SVs are characterized by the intelligent driver model (IDM) [21]. and the baseline DRL methods are implemented based on Stable-baselines3 [22].

TABLE I: Hyperparameters

Hyperparameter	DQN	A2C	PPO	Ours
Discount Factor	$0.99$	$0.99$	$0.99$	$0.99$
Learning Rate	$0.0005$	$0.0005$	$0.0005$	$0.0005$
Batch Size	$64$	$64$	$64$	$64$
Exploration Rate	$0.1$	-	-	-
Target Network Update Frequency	$5000$	-	-	-
Entropy Coefficient	-	$0.0$	$0.0$	$0.0$
Value Function Coefficient	-	$0.5$	$0.5$	$0.5$
Number of Steps	-	$1024$	$1024$	$1024$

IV-A Experimental Setup

Straight Road: In this scenario, the AV drives on a four-lane straight road filled with SVs, with the task of safely driving 600 meters without collisions within 1 minute. The reward function is defined as $r=r_{\text{Rel}}+r_{\text{R}}+r_{\text{col}}+r_{\text{LC}}+r_{\text{speed}}$ , where $r_{\text{Rel}}$ and $r_{\text{R}}$ represent the rewards of completing the task and driving on the right side of the road. $r_{\text{col}}$ and $r_{\text{LC}}$ are the penalties of collision and lane change, which can reduce unnecessary lane-changing behavior. $r_{\text{speed}}$ will reward high-speed driving behavior within the speed limitation while penalizing for exceeding the speed limitation.

Intersection-v0: This scenario is a four-way intersection, with the task of safely and quickly driving toward the designated lane. The reward function is settled as $r=r_{\text{Rel}}+r_{\text{col}}\left(v,N\right)+r_{\text{OfR}}+r_{\text{speed}}$ , where $r_{\text{OfR}}$ is the penalty of out of the road and $\left(v,N\right)$ is the speed of the AV and number of SVs in every time interval, which reflects the collision rate.

Intersection-v1: It is similar to Intersection-v0, except that there are two lanes in one direction, and vehicles are allowed to change lanes. Therefore, a lane-changing penalty is added to the reward function, which is written as $r=r_{\text{Rel}}+r_{\text{col}}\left(v,N\right)+r_{\text{OfR}}+r_{\text{LC}}+r% _{\text{speed}}$ .

Roundabout: In this scenario, there is a roundabout with two lanes, with the task of passing through the roundabout as soon as possible without collision. To prevent vehicles from circling, a penalty of timeout $r_{\text{TO}}$ is added in the reward function, which can be expressed as $r=r_{\text{Rel}}+r_{\text{col}}\left(v,N\right)+r_{\text{OfR}}+r_{\text{LC}}+r% _{\text{TO}}+r_{\text{speed}}$

To validate the effectiveness of the proposed method, we evaluate its performance in the above scenarios and compare it with other three baseline DRL methods. The number of SVs is 8. In each simulation, we randomize the states of all agents to prevent the policy network from memorizing actions. We compare the proposed framework with the following baseline methods:

•

A2C performs gradient ascent on the parameter $\theta$ by $\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau}\left[\sum_{t=0}^{T-1}\nabla_{\theta% }\log\pi_{\theta}\left(a_{t}\mid s_{t}\right)G_{t}\right]$ to directly maximize the reward [23].
•

DQN updates the Q function using the SGD algorithm by minimizing the loss $\mathcal{L}=1/N\sum_{j=0}^{N-1}\left(Q\left(s_{j},a_{j}\right)-y_{j}\right)^{2}$ [24].
•

PPO updates indirectly by maximizing a surrogate objective function as in (9), which provides a cautious estimate of the potential change in $J(\pi_{\theta})$ resulting from the update [25]. This method is used as an ablation study.

The hyperparameters for network training of the four algorithms are listed in Table I, and all networks use Adam as the optimizer. The experiments with the aforementioned scenarios are conducted on the Windows 11 system environment with an Intel Core i7-14700KF processor and an NVIDIA RTX 4070 Ti GPU.

IV-B Performance Demonstration

To verify the effectiveness of the proposed algorithm, we implement our framework and other baseline methods in the aforementioned scenarios and obtain the learning curves shown in Fig. 2. It can be observed that the training results of the proposed algorithm are significantly better than those of all baseline methods. This could be attributed to the effectiveness of the intention prediction module within the proposed integrated framework. The decision-making process of SVs takes into account both STD and LTO intention, where LTO intention indicates the willingness to complete tasks of SVs and STD reflects short-term obstacle avoidance intention. The AV predicts the intention of SVs, thereby making decisions accordingly in quick succession, resulting in greater rewards than other baseline methods. Compared with the ablation study of the PPO method without these intention predictions, the method we proposed achieves a reward increase in all scenarios, but just a little improvement in the roundabout scenario. The reason is that our framework has some difficulties in taking spatio-spectrum into account, which significantly affects the performance in the roundabout scenario, due to its complex road structure. When compared with two other baseline methods, the improvement is even greater.

To further evaluate the performance of the integrated intention prediction and decision-making framework, we evaluate the trained policy in the test environment and set three metrics to assess the effectiveness of each algorithm: (1) we define the success rate as the ratio of successful task completions in 100 trials to reflect the ability of different algorithm-trained policies to complete the task; (2) we define the efficiency of the trained policy as $\left({t_{\text{max}}-t}\right)/\left({t_{\text{max }}-t_{\text{min }}}\right)$ , where $t_{\text{max}}$ and $t_{\text{min}}$ are the time thresholds for task completion; (3) we define the safety as the proportion of the AV that does not collide with SVs or the road boundary in 100 trials. The results are shown in Table II. It can be observed that the framework we proposed achieves the best performance in all four scenarios, which can be attributed to predicting the LTO and STD intention of SVs, enabling the agent to strike a balance between safety and efficiency, thus yielding better results.

TABLE II: Comparison of success rate, efficiency, and safety among different methods in various driving scenarios.

Methods	Straight Road			Intersection-v0			Intersection-v1			Roundabout
Methods	succ.(%)	eff.(%)	safety(%)	succ.(%)	eff.(%)	safety(%)	succ.(%)	eff.(%)	safety(%)	succ.(%)	eff.(%)	safety(%)
A2C	65	74	79	70	92	72	42	91	46	67	54	81
PPO	75	70	81	68	88	71	67	81	68	91	82	94
DQN	61	67	77	68	84	73	66	71	73	61	88	65
Ours	93	86	95	88	92	91	87	90	89	93	92	96

Fig. 3 shows the interactive behavior of the AV in the four scenarios. As a representative example, in the scenario of Intersection-v1, the goal of the AV is to safely and efficiently navigate the intersection. In the left turn task, the AV notices the intention of a vehicle making a left turn and moving slowly. So the AV chooses to change lanes before a conflict occurs to avoid a collision. When the AV detects a lane conflict with the planned route of the same vehicle in front, it opts to change lanes to a safer but further lane. This behavior demonstrates the trade-off considered by the AV between efficiency and safety, and the AV selects a safer decision after evaluation. Here, we only visualize the AV’s trajectory in the representative snapshots.

V CONCLUSIONS

In this paper, we propose a novel integrated intention prediction and decision-making framework for autonomous driving systems from the perspective of the frequency domain. It is worth noting that a spectrum attention net is designed to capture the trend of each frequency component over time and their interrelations, so that the intention of SVs can be predicted. Besides, the PPO algorithm tackles the non-stationary problem in the integrated framework by employing a clipping mechanism within its objective function. Experiments in four representative traffic scenarios are conducted to verify the effectiveness of the proposed framework. The results show that the proposed framework outperforms baselines across various metrics in terms of the success rate of completing driving tasks, efficiency, and safety. Future work involves taking advantage of frequency domain representation in dealing with uncertainty in autonomous driving tasks.

References

[1] S. R. Jaladi, Z. Chen, N. R. Malayanur, R. M. Macherla, and B. Li, “End-to-End Training and Testing Gamification Framework to Learn Human Highway Driving,” in IEEE 25th Conf. Intell. Transport. Syst. Proc., pp. 4296–4301, 2022.
[2] K. Shu, H. Yu, X. Chen, S. Li, L. Chen, Q. Wang, L. Li, and D. Cao, “Autonomous Driving at Intersections: A Behavior-Oriented Critical-Turning-Point Approach for Decision Making,” IEEE-ASME Trans. Mechatron., vol. 27, no. 1, pp. 234–244, 2022.
[3] Y. Jiao, M. Miao, Z. Yin, C. Lei, X. Zhu, X. Zhao, L. Nie, and B. Tao, “A Hierarchical Hybrid Learning Framework for Multi-Agent Trajectory Prediction,” IEEE Trans. Intell. Transp. Syst., pp. 1–11, 2024.
[4] M. Kloock, P. Scheffe, S. Marquardt, J. Maczijewski, B. Alrifaee, and S. Kowalewski, “Distributed model predictive intersection control of multiple vehicles,” in IEEE 22nd Conf. Intell. Transport. Syst. Proc., pp. 1735–1740, IEEE, 2019.
[5] X. Tang, K. Yang, H. Wang, J. Wu, Y. Qin, W. Yu, and D. Cao, “Prediction-Uncertainty-Aware Decision-Making for Autonomous Vehicles,” IEEE Trans. Intell. Veh., vol. 7, no. 4, pp. 849–862, 2022.
[6] Z. Huang, H. Liu, J. Wu, and C. Lv, “Differentiable Integrated Motion Prediction and Planning with Learnable Cost Function for Autonomous Driving,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–15, 2023.
[7] J. Ding, L. Li, H. Peng, and Y. Zhang, “A Rule-Based Cooperative Merging Strategy for Connected and Automated Vehicles,” IEEE Trans. Intell. Transp. Syst., vol. 21, no. 8, pp. 3436–3446, 2020.
[8] J. Ma, Z. Cheng, X. Zhang, M. Tomizuka, and T. H. Lee, “Alternating direction method of multipliers for constrained iterative LQR in autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 23031–23042, 2022.
[9] Z. Peng, X. Zhou, L. Zheng, Y. Wang, and J. Ma, “Reward-Driven Automated Curriculum Learning for Interaction-Aware Self-Driving at Unsignalized Intersections,” arXiv preprint arXiv:2403.13674, 2024.
[10] Z. Peng, X. Zhou, Y. Wang, L. Zheng, M. Liu, and J. Ma, “Curriculum Proximal Policy Optimization with Stage-Decaying Clipping for Self-Driving at Unsignalized Intersections,” in Proc. 26th Int. IEEE Conf. Intell. Transp. Syst., 2023.
[11] J. Leu, Y. Wang, M. Tomizuka, and S. Di Cairano, “Autonomous Vehicle Parking in Dynamic Environments: An Integrated System with Prediction and Motion Planning,” in Int. Conf. Robot. Autom., pp. 10890–10897, 2022.
[12] E. Zhang, R. Zhang, and N. Masoud, “Predictive Trajectory Planning for Autonomous Vehicles at Intersections Using Reinforcement Learning,” Transp. Res. Pt. C-Emerg. Technol., vol. 149, 2023.
[13] Y. Chen, B. Ivanovic, and M. Pavone, “ScePT: Scene-Consistent, Policy-Based Trajectory Predictions for Planning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 17082–17091, 2022.
[14] M. Ye, Y. Kuang, J. Wang, Y. Rui, W. Zhou, H. Li, and F. Wu, “State Sequences Prediction via Fourier Transform for Representation Learning,” in Adv. Neural Inf. Proces. Syst., vol. 36, pp. 67565–67588, 2023.
[15] S. Liu, Y. Zhu, P. Yao, T. Mao, and Z. Wang, “SpectrumNet: Spectrum-Based Trajectory Encode Neural Network for Pedestrian Trajectory Prediction,” in IEEE Int. Conf. Acoust., Speech Signal Process., Proc., pp. 7075–7079, 2024.
[16] C. Wong, B. Xia, Z. Hong, Q. Peng, W. Yuan, Q. Cao, Y. Yang, and X. You, View Vertically: A Hierarchical Network for Trajectory Prediction via Fourier Spectrums, book section Chapter 39, pp. 682–700. Lecture Notes in Computer Science, 2022.
[17] Y. Luo, J. Zhang, S. Wang, F. Lv, J. Zhang, and H. Gao, “A Driving Intention Prediction Method for Mixed Traffic Scenarios,” in IEEE 7th Int. Conf. Intell. Transp. Eng., pp. 296–301, 2022.
[18] J. Karlsson and J. Tumova, “Intention-Aware Motion Planning with Road Rules,” in IEEE 16th Int. Conf. Autom. Sci. Eng., pp. 526–532, 2020.
[19] J. B. Allen and L. R. Rabiner, “A Unified Approach to Short-Time Fourier Analysis and Synthesis,” Proc. IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
[20] E. Leurent, “An Environment for Autonomous Driving Decision-Making.” https://github.com/eleurent/highway-env, 2018.
[21] M. Treiber, A. Hennecke, and D. Helbing, “Congested Traffic States in Empirical Observations and Microscopic Simulations,” Phys. Rev. E, vol. 62, no. 2, p. 1805, 2000.
[22] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable Reinforcement Learning Implementations,” J. of Mach. Learn. Res., vol. 22, no. 268, pp. 1–8, 2021.
[23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” in Int. Conf. on Mach. Learn.Proc., pp. 1928–1937, 2016.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv:1312.5602, 2013.
[25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv:1707.06347, 2017.