Classic literature reading - RLAD (Reinforcement Learning from Pixels for Autonomous Driving in Urban Environments)

0. Introduction

The application of reinforcement learning to autonomous driving has become increasingly popular. Although ethical concerns mean that few deployed systems actually rely on it, there is already a substantial body of work applying reinforcement learning to autonomous driving. However, we found that these methods generally train the convolutional encoder together with the policy network, and this paradigm can cause the environment representation to be misaligned with the downstream task, which may lead to sub-optimal performance. The article "RLAD: Reinforcement Learning from Pixels for Autonomous Driving in Urban Environments" proposes several techniques to improve the performance of RLfP algorithms in this field, including: i) an image encoder that uses image augmentation and Adaptive Local Signal Mixing (A-LIX) layers; ii) WayConv1D, a waypoint encoder that uses 1D convolutions to exploit the 2D geometric information of the waypoints; iii) an auxiliary loss that increases the importance of traffic lights in the latent representation of the environment. Let's take a closer look below.


1. Main contributions

The main contributions of this paper are summarized as follows:

1) This paper proposes RLAD, the first method to use reinforcement learning (RL) to jointly learn the encoder networks and the driving policy in vision-based urban autonomous driving (AD). The paper also shows that RLAD significantly outperforms all state-of-the-art RLfP methods in this field;

2) This paper introduces an image encoder that uses image augmentation and Adaptive Local Signal Mixing (A-LIX) layers to minimize severe overfitting of the encoder;

3) This paper proposes WayConv1D, a waypoint encoder that uses a 1D convolution with a 2×2 kernel to exploit the 2D geometric information of the waypoints, which significantly improves driving stability;

4) This paper provides a comprehensive analysis of state-of-the-art RLfP methods in vision-based urban autonomous driving (AD), showing that one of the main challenges is traffic-light compliance. To address this limitation, an auxiliary loss specifically targeting traffic-light information is added to the latent representation of the image, thus enhancing its importance.

2. RLAD Overview

RLAD is the first RLfP method applied to urban autonomous driving. Its main purpose is to derive a feature representation from the sensor data that is fully aligned with the driving task, while simultaneously learning a driving policy. The core of RLAD is built on DrQ [11], but with several modifications. First, in addition to image augmentation, we append a regularization layer, called Adaptive Local Signal Mixing (A-LIX) [12], to the end of each convolutional layer of the image encoder (see Section 4 for more details), which significantly improves the stability and efficiency of training. Second, we conducted an extensive study of the hyperparameters and found that some of DrQ's hyperparameters are not suitable for the AD domain. Finally, we use an additional traffic-light classification loss to guide the latent representation of the image ($\overline{i}$) to contain information about traffic lights.

3. Learning environment

The learning environment is defined as a Partially Observable Markov Decision Process (POMDP). The environment is built with the CARLA driving simulator (version 0.9.10.1) [27].

a) State space: $S$ is defined by CARLA and contains the ground-truth information about the world. The agent cannot access the environment state.

b) Observation space: at each step, the state $s_t \in S$ generates an observation $o_t \in O$ that is passed to the agent. An observation is a stack of tensor tuples from the last $K = 3$ time steps. Specifically, $o_t = \{(I, W, V)_k\}_{k=0}^{2}$, where $I$ is a 3×256×256 image, $W$ corresponds to the 2D coordinates, relative to the vehicle, of the next $N = 10$ waypoints provided by CARLA's global planner, and $V$ is a two-dimensional vector containing the current speed and steering of the vehicle.
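
As a concrete illustration of these shapes (our own sketch, not the authors' code; the function and key names are assumptions), an observation could be assembled like this:

```python
import numpy as np

K, N = 3, 10  # number of stacked time steps and number of waypoints

def make_observation(images, waypoints, measurements):
    """Stack the last K frames of each modality into one observation o_t.

    images:       list of K arrays, each of shape (3, 256, 256)
    waypoints:    list of K arrays, each of shape (N, 2)  (vehicle-frame x, y)
    measurements: list of K arrays, each of shape (2,)    (speed, steering)
    """
    I = np.stack(images)        # (K, 3, 256, 256)
    W = np.stack(waypoints)     # (K, N, 2)
    V = np.stack(measurements)  # (K, 2)
    return {"I": I, "W": W, "V": V}
```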

c) Action space: $A$ consists of three continuous actions: throttle, ranging from 0 to 1; brake, ranging from 0 to 1; and steering, ranging from -1 to 1.
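
For reference, the same action space written with the Gym API (a minimal sketch assuming a standard Box space and this action ordering; not taken from the paper):

```python
import numpy as np
from gym import spaces

# Order assumed here: [throttle, brake, steering].
action_space = spaces.Box(
    low=np.array([0.0, 0.0, -1.0], dtype=np.float32),
    high=np.array([1.0, 1.0, 1.0], dtype=np.float32),
    dtype=np.float32,
)
```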

d) Reward function: We use the reward function defined in [28] as it has been shown to accurately guide AD training.

e) Training: We run CARLA at 10 FPS. Similar to [28], at the beginning of each episode the start and target locations are randomly generated and the desired route is computed with the global planner. When the target position is reached, a new random target position is computed. The episode terminates if any of the following conditions is met: a collision, running a red light, the vehicle becoming blocked, or reaching a predefined timeout.

4. Agent architecture (key content)

The architecture of RLAD is shown in Figure 1. In general, our system has three main components: the encoder (Section 4.1), the reinforcement learning algorithm (Section 4.2), and the auxiliary loss (Section 4.3). To simplify longitudinal control and ensure smooth actuation, we reparameterize the throttle and brake commands into a target speed. A PID controller is therefore attached to the end of the actor network, and it generates the corresponding throttle and brake commands to match the predicted target speed.
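
The paper does not detail the PID controller itself; the following is a minimal sketch of the idea, with illustrative (made-up) gains, showing how a predicted target speed could be turned into throttle and brake commands:

```python
class SpeedPID:
    """Longitudinal PID: maps a target speed to throttle/brake commands.

    The gains and time step below are illustrative assumptions, not values from the paper.
    """
    def __init__(self, kp=0.5, ki=0.05, kd=0.1, dt=0.1):  # dt = 0.1 s at 10 FPS
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        throttle = min(max(u, 0.0), 1.0)  # positive control effort -> throttle
        brake = min(max(-u, 0.0), 1.0)    # negative control effort -> brake
        return throttle, brake
```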


Figure 1. Architecture of RLAD. The system receives as input the K consecutive central camera images, the N waypoints computed by the global planner, and the vehicle measurements over the last K steps. Each input is processed independently by a different encoder. The latent representations of the inputs are then concatenated to form the input to the SAC algorithm ($\tilde{h} = \tilde{i}\,\tilde{w}\,\tilde{v}$). The actor network of SAC and the PID controller are responsible for outputting the control commands, while the Q-networks are responsible for outputting the value function. To guide $\tilde{i}$ to contain information about traffic lights, we added an auxiliary branch that performs traffic-light classification. All neural-network elements are represented in proportion. The dashed arrows provide a visual representation of how each loss function affects the system parameters during the backpropagation stage.
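
As a rough sketch of how such an auxiliary branch could be wired (our own simplified reconstruction; the latent size, number of classes, and loss weight are assumptions, not the paper's values):

```python
import torch.nn as nn
import torch.nn.functional as F

class TrafficLightHead(nn.Module):
    """Auxiliary classifier on the image latent; sizes are illustrative assumptions."""
    def __init__(self, latent_dim=256, num_classes=4):  # e.g. red / yellow / green / none
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, i_latent):
        return self.fc(i_latent)

def traffic_light_loss(head, i_latent, labels, weight=1.0):
    """Cross-entropy term added to the RL objective; 'weight' is an assumed coefficient."""
    return weight * F.cross_entropy(head(i_latent), labels)
```

Because the gradients of this loss flow back into the image encoder, traffic-light information is emphasized in the image latent, which is the stated purpose of the auxiliary branch.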

4.1 Encoder:

The encoder is responsible for converting the sensor data ($o_t$) into a low-dimensional feature vector ($\overline{h}_t$) to be processed by the reinforcement learning algorithm.

a) Image encoder: As shown in [11], the size of the image encoder is a key element in RLfP approaches. Because of the weak signal of the RL loss, encoders commonly used in AD methods, such as ResNet50 [29] (~25M parameters) or Inception V3 [30] (~27M parameters), are not feasible. On the other hand, small encoders designed for scenes of lower complexity, such as IMPALA [31] (~0.22M parameters), cannot produce sufficiently accurate representations of the environment, which limits the performance of the driving agent. For urban AD, our results show that the optimal configuration involves a trade-off between large networks that are unsuitable for training with RL and small networks that cannot accurately perceive the environment. The architecture of the proposed image encoder is shown in Table I and contains about 1M parameters. Similar to DrQ and DrQ-v2, we use simple image augmentation to regularize the value function [11], [25]. First, we pad each side of the 256×256 image by repeating the 8 border pixels, and then select a random 256×256 crop. Like [25], we found it useful to apply bilinear interpolation on the cropped image (a minimal code sketch of this augmentation is given after Table I). In addition to image augmentation, we also found that appending an A-LIX layer [12] to the end of each convolutional layer improves the performance of the agent, possibly by preventing the phenomenon of catastrophic self-overfitting (spatially inconsistent feature maps that lead to discontinuous gradients during backpropagation). A-LIX is applied to the features produced by a convolutional layer, $a \in \mathbb{R}^{C \times H \times W}$, by randomly mixing each component $a_{cij}$ with its neighbors belonging to the same feature map. The output of A-LIX therefore has the same dimensions as its input, but the computational graph minimally perturbs the information of each feature $a_{cij}$ while smoothing the discontinuous components of the gradient signal during backpropagation. This technique thus works by forcing the image encoder to produce spatially consistent feature maps, minimizing the effect of the catastrophic self-overfitting phenomenon. The whole process can be summarized as $\overline{i}_t = f_i(\mathrm{aug}([\{I_{t-k}\}_{k=0}^{2}]))$, where $f_i$ is the image encoder, $\mathrm{aug}$ corresponds to the applied data augmentation, and $\overline{i}_t$ corresponds to the latent representation of the three consecutive images ($\{I_{t-k}\}_{k=0}^{2}$).


Table I. Architecture of the proposed image encoder. After each convolutional layer, we apply the ReLU function [32] and an A-LIX regularization layer [12].
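
A minimal sketch of the augmentation described above, modeled on the publicly documented DrQ-v2 random-shift augmentation (replicate padding of 8 pixels followed by a randomly shifted bilinear resampling); this is our own illustration, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomShiftAug(nn.Module):
    """Replicate-pad by `pad` pixels, then resample with a random bilinear shift."""
    def __init__(self, pad=8):
        super().__init__()
        self.pad = pad

    def forward(self, x):                       # x: (B, C, H, W), assumes H == W
        n, c, h, w = x.shape
        x = F.pad(x, [self.pad] * 4, mode="replicate")
        # Build a base sampling grid over the padded image.
        eps = 1.0 / (h + 2 * self.pad)
        arange = torch.linspace(-1.0 + eps, 1.0 - eps, h + 2 * self.pad,
                                device=x.device, dtype=x.dtype)[:h]
        arange = arange.unsqueeze(0).repeat(h, 1).unsqueeze(2)
        base_grid = torch.cat([arange, arange.transpose(1, 0)], dim=2)
        base_grid = base_grid.unsqueeze(0).repeat(n, 1, 1, 1)
        # Random shift of up to `pad` pixels per sample; the bilinear mode of
        # grid_sample provides the sub-pixel resampling of the crop.
        shift = torch.randint(0, 2 * self.pad + 1, size=(n, 1, 1, 2),
                              device=x.device, dtype=x.dtype)
        shift *= 2.0 / (h + 2 * self.pad)
        return F.grid_sample(x, base_grid + shift, mode="bilinear",
                             padding_mode="zeros", align_corners=False)
```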

b) Waypoint encoder: Usually, the waypoint encoder is formed either by using the average direction between the current agent pose and the next $N$ waypoints [9], or by flattening the 2D coordinates of the waypoints into a vector and applying an MLP [33]. In our opinion, both approaches have serious limitations. The former significantly oversimplifies the problem by encoding all waypoint coordinates into a single value; it only works for small values of $N$, because as $N$ increases the waypoints become more dispersed and the average direction is no longer a reliable indicator. The latter works for any value of $N$, but by flattening the 2D waypoint coordinates into a vector it does not exploit the 2D geometric information. To overcome both limitations, we propose WayConv1D, a waypoint encoder that exploits the 2D geometry of the input by applying a 1D convolution with a 2×2 kernel over the 2D coordinates of the next $N$ waypoints. The output of the 1D convolution is then flattened and processed by an MLP. This process can be summarized as $\tilde{w}_t = f_w(W_t)$, where $f_w$ corresponds to WayConv1D and $\tilde{w}_t$ corresponds to the latent representation of the waypoints at the current step ($W_t$). We found that with WayConv1D the agent learns to follow the trajectory more efficiently, without oscillating around the center of the lane, which is a common problem when using RL in the urban AD domain, as documented in previous studies [6], [28].
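
A minimal sketch of WayConv1D as we read the description (only the 2×2 kernel over the waypoint coordinates is taken from the text; the channel width and MLP size are our assumptions):

```python
import torch
import torch.nn as nn

class WayConv1D(nn.Module):
    """Waypoint encoder: 1D convolution over the sequence of (x, y) waypoints."""
    def __init__(self, n_waypoints=10, hidden=64, out_dim=64):
        super().__init__()
        # in_channels = 2 (x, y) with kernel_size = 2 gives an effective 2x2 kernel
        # sliding over consecutive waypoint pairs, preserving their 2D structure.
        self.conv = nn.Conv1d(in_channels=2, out_channels=hidden, kernel_size=2)
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(hidden * (n_waypoints - 1), out_dim),
            nn.ReLU(),
        )

    def forward(self, waypoints):          # waypoints: (B, N, 2)
        x = waypoints.transpose(1, 2)      # (B, 2, N) for Conv1d
        x = torch.relu(self.conv(x))       # (B, hidden, N - 1)
        return self.mlp(x)                 # (B, out_dim) latent representation
```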

c) Vehicle measurement encoder: Similar to [33], we apply a multilayer perceptron (MLP) to the vehicle measurements: $\tilde{v}_t = f_v([\{V_{t-k}\}_{k=0}^{2}])$, where $f_v$ is an MLP and $\tilde{v}_t$ is the latent representation corresponding to the concatenation of the vehicle measurements over three steps ($[\{V_{t-k}\}_{k=0}^{2}]$).
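
For completeness, a small sketch of the measurement encoder and of the concatenation of the three latents into the SAC input, as described in the Figure 1 caption (layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class MeasurementEncoder(nn.Module):
    """MLP over the K = 3 stacked (speed, steering) measurements."""
    def __init__(self, k=3, meas_dim=2, out_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(k * meas_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, v):                  # v: (B, K, meas_dim)
        return self.mlp(v.flatten(1))

def fuse_latents(i_latent, w_latent, v_latent):
    """Concatenate image, waypoint and measurement latents into the SAC input."""
    return torch.cat([i_latent, w_latent, v_latent], dim=-1)
```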

4.2 Reinforcement learning algorithm

…For details, please refer to Guyueju.


Origin blog.csdn.net/lovely_yoshino/article/details/131371576