Trajectory prediction algorithm VectorNet research report

foreword

Traditional behavior prediction methods are rule-based: they generate multiple behavior hypotheses from the constraints imposed by the road structure. More recently, many learning-based prediction methods have been proposed. They offer the benefit of a probabilistic interpretation of the different behavior hypotheses, but they require building a representation that encodes map and trajectory information. Interestingly, although high-definition (HD) maps are highly structured, most current prediction methods render them as color-coded attributes and use convolutional neural networks with limited receptive fields to encode the scene. This raises a question: can a meaningful scene representation be learned directly from the structured HD map?

We propose to learn a unified representation of dynamic traffic participants and the structured scene directly from their vector form (shown in the right panel of Figure 1). The geographic extent of a road feature can be a point, a polygon or a curve. For example, a lane boundary contains multiple control points that form a spline; a crosswalk is a polygon defined by several points; a stop sign is represented by a single point. All of these geographic entities can be approximated as polylines defined by multiple control points. Dynamic traffic participants can likewise be approximated as polylines through their motion trajectories. All these polylines can then be represented as sets of vectors.


A graph neural network is used to combine these sets of vectors. Each vector is treated as a node in the graph; the node features include the start and end positions of the vector and other attributes such as the polyline ID and semantic label. Through the graph neural network, context from the HD map and from the trajectories of other traffic participants is propagated to the node of the target traffic participant. The output node feature of the target participant is then decoded to predict its future trajectory.

In particular, to learn competitive representations with graph neural networks, it is important to constrain the connectivity of the graph based on the spatial and semantic proximity of the nodes. A hierarchical graph architecture is therefore proposed. First, vectors that belong to the same polyline and share the same semantic label are connected and aggregated into polyline features; then all polylines are fully connected to one another to exchange information. The local graphs are implemented with multi-layer perceptrons and the global graph with self-attention, as shown in Figure 2.

Figure 2. The proposed VectorNet framework. The observed trajectories of traffic participants and map features are represented as vector sequences, and then passed into the local graph network to obtain polyline-level features. These features are then fed into a fully connected graph network to model higher-order interactions. Two types of losses are computed: predicting future trajectories of target traffic participants from their corresponding node features, and predicting masked node features in graph networks.

Finally, inspired by the effectiveness of self-supervised learning on sequential language and vision data, an auxiliary graph completion objective is proposed in addition to the behavior prediction objective. Specifically, the node features belonging to either the static scene or the dynamic trajectories are randomly masked, and the model is asked to reconstruct the masked features. Intuitively, this encourages the graph network to better capture the interactions between dynamic traffic participants and the static environment. In summary, the main contributions are:

(1) It is the first work to demonstrate how to directly use vectorized scene context and the dynamics of traffic participants for behavior prediction.
(2) The hierarchical graph network VectorNet and an auxiliary node completion task are proposed.
(3) The method is evaluated on an internal behavior prediction dataset and on the Argoverse dataset. It achieves comparable or better performance than a baseline that renders the scene as bird's-eye-view images, while reducing the model parameters by more than 70% and the computation by an order of magnitude. It also reaches the state of the art on the Argoverse dataset.

In short, the goal is to learn an information-rich context representation directly from structured HD map data together with the dynamic object list produced by perception. The road structure (static environment) and the dynamic vehicles are both expressed as vectors, a GNN built on this unified representation models the interactions between the elements, and trajectory prediction is based on the result. A convolution-based encoder loses precision because the scene has to be rasterized first. A masked-reconstruction objective, similar in spirit to MAE, is used to train and strengthen the representation.

Note: how can structured information such as an HD map be vectorized?
Map splines are sampled at equal spatial intervals and trajectories at equal time intervals, so the mapping between HD map elements and vector sets is one-to-one. Each vector is written as

$$v_i = [d_i^s, d_i^e, a_i, j]$$

where $d_i^s$ and $d_i^e$ are the coordinates of the start and end points of the vector; $a_i$ holds attribute features such as the speed limit or lane type; and $j$ is the index of the polyline $P_j$ that the vector belongs to.

HD Map: as the level of driving automation rises, the requirements on map information become more demanding. HD maps meet this need by providing almost all road information, such as the position, type and color of lane lines, the position and orientation of traffic lights, and road maintenance information.

method

This section introduces the VectorNet method. We first describe how to vectorize the trajectories of dynamic traffic participants and the HD map. Next, a hierarchical network is proposed, which first aggregates local features within each polyline separately and then integrates all trajectory and map features globally. The resulting graph is finally used for behavior prediction.

Representing trajectories and maps

Most HD maps are annotated in the form of splines (such as lane lines), closed shapes (such as intersections) and points (such as traffic lights), together with attribute information such as semantic labels and current state (for example the color of a traffic light or the speed limit of a road). For dynamic traffic participants, the trajectories are directed splines with respect to time. All of these elements can be approximated as sequences of vectors. For map features, a starting point and direction are chosen, key points are sampled uniformly along the spline at the same spatial distance, and adjacent key points are concatenated into vectors. For trajectories, key points are sampled at a fixed time interval (0.1 seconds) and concatenated into vectors in the same way. If the spatial or temporal interval is small enough, the resulting polylines are very close to the original map features and trajectories.
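The sampling step described above can be sketched in plain Python. This is only an illustrative sketch: the function names `resample_polyline` and `to_vectors`, the 2D tuple representation and the 1 m spacing are assumptions made here, not details given in the source.

```python
import math

def resample_polyline(points, spacing):
    """Sample key points along a 2D polyline at equal arc-length spacing."""
    samples = [points[0]]
    dist_to_next = spacing  # distance left until the next key point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already travelled along this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        dist_to_next -= seg_len - pos
    return samples

def to_vectors(key_points):
    """Concatenate adjacent key points into (start, end) vectors."""
    return list(zip(key_points, key_points[1:]))

# Hypothetical lane boundary: 3 m east, then 4 m north (7 m in total).
lane = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
key_points = resample_polyline(lane, spacing=1.0)
vectors = to_vectors(key_points)
```

For a trajectory the same `to_vectors` step applies directly, since the observations already arrive at a fixed time interval.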

The vectorization process is a one-to-one mapping between continuous trajectories, map annotations and the vector set, although the vector set is unordered. This makes it possible to build a graph representation on top of the vector set that can be encoded by a graph neural network. More specifically, each vector $v_i$ belonging to a polyline $P_j$ is treated as a node in the graph, with node features given by:

$$v_i = [d_i^s, d_i^e, a_i, j]$$

Here $d_i^s$ and $d_i^e$ are the start and end coordinates of the vector, expressed either as 2D coordinates (x, y) or as 3D coordinates (x, y, z); $a_i$ holds the attribute features, such as the type of the dynamic traffic participant, the timestamp of the trajectory point, the type of the road feature, or the speed limit of a lane; $j$ is the ID of $P_j$, indicating that $v_i$ belongs to $P_j$.
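As a minimal sketch of this node layout, the helper below flattens each vector into a list `[d_s, d_e, a, j]`. The attribute encoding (a numeric object type followed by a speed limit in m/s) and the helper names are hypothetical choices made for illustration.

```python
def node_feature(start, end, attributes, polyline_id):
    """Flatten one vector into [d_s, d_e, a, j]."""
    return [*start, *end, *attributes, polyline_id]

def polyline_nodes(key_points, attributes, polyline_id):
    """Build one node per pair of adjacent key points of a polyline."""
    return [node_feature(p, q, attributes, polyline_id)
            for p, q in zip(key_points, key_points[1:])]

# Hypothetical attribute encoding: [object_type, speed_limit_mps]
LANE = 1.0
lane_nodes = polyline_nodes(
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [LANE, 13.9], polyline_id=0)
```

For trajectory polylines the attributes would differ per vector (for example the timestamp), so a real implementation would pass one attribute list per vector rather than one per polyline.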

To make the input node features invariant to the position of the target traffic participant, the coordinates of all vectors are expressed relative to the last observed position of that participant. A direction for future work is to use a single coordinate origin shared by all interacting traffic participants, so that their trajectories can be predicted in parallel.
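The re-centering step can be sketched as follows, assuming 2D nodes laid out as `[xs, ys, xe, ye, ...attributes, j]`; the function name `center_on_target` is an illustrative assumption.

```python
def center_on_target(nodes, target_last_position):
    """Shift start and end coordinates so the target's last observed position is the origin."""
    ox, oy = target_last_position
    return [[n[0] - ox, n[1] - oy, n[2] - ox, n[3] - oy, *n[4:]] for n in nodes]

# The target was last observed at (11, 5); attributes and polyline ID are left untouched.
nodes = [[10.0, 5.0, 11.0, 5.0, 1.0, 0]]
centered = center_on_target(nodes, (11.0, 5.0))
```

Because the origin depends on the target, the whole scene has to be re-centered once per predicted participant, which is why the source notes that computation grows linearly with the number of targets.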

Constructing the polyline subgraphs

To exploit the local spatial and semantic information of the nodes, a hierarchical approach is adopted. Subgraphs are first built at the vector level, where all vector nodes belonging to the same polyline are connected to each other. For a polyline $P$ with nodes $\{v_1, v_2, \ldots, v_P\}$, the forward operation of one subgraph layer is defined as:

$$v_i^{(l+1)} = \psi_{rel}\left(g_{enc}(v_i^{(l)}),\ \psi_{agg}\left(\{g_{enc}(v_j^{(l)})\}\right)\right)$$

where $v_i^{(l)}$ is the node feature at the $l$-th layer of the subgraph network. The function genc(.) encodes the individual node features, ψagg(.) aggregates the features of all neighboring nodes, and ψrel(.) is the relational operation between node $v_i$ and its neighbors.

In practice, genc(.) is a multi-layer perceptron (MLP) whose weights are shared across all nodes. Specifically, the MLP consists of a single fully connected layer, followed by layer normalization [3] and then a ReLU activation. ψagg(.) is a max pooling operation and ψrel(.) is a simple concatenation, as shown in Figure 3. Several subgraph layers are stacked, and the weights of genc(.) differ from layer to layer. Finally, with $L_p$ denoting the number of stacked subgraph layers, the polyline feature is computed as:

$$p = \psi_{agg}\left(\{v_i^{(L_p)}\}\right)$$

where ψagg(.) is still max pooling.
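The subgraph computation can be illustrated with a small dependency-free sketch. It follows the description of genc (fully connected layer, layer normalization, ReLU), max pooling and concatenation, but the random weights, the hidden size of 4 and helper names such as `subgraph_layer` and `init_layer` are assumptions for illustration; a real model would use a tensor library and trained weights.

```python
import math
import random

def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def g_enc(x, weight, bias):
    """Fully connected layer -> layer normalization -> ReLU."""
    return [max(0.0, h) for h in layer_norm(linear(x, weight, bias))]

def max_pool(features):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*features)]

def subgraph_layer(nodes, weight, bias):
    encoded = [g_enc(v, weight, bias) for v in nodes]
    pooled = max_pool(encoded)             # aggregation over all nodes of the polyline
    return [e + pooled for e in encoded]   # relation operator: concatenation

def init_layer(in_dim, out_dim, rng):
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def polyline_feature(nodes, layers):
    for weight, bias in layers:
        nodes = subgraph_layer(nodes, weight, bias)
    return max_pool(nodes)                 # final max pooling gives the polyline feature

rng = random.Random(0)
hidden, in_dim, layers = 4, 7, []
for _ in range(3):                         # three stacked layers, each with its own weights
    layers.append(init_layer(in_dim, hidden, rng))
    in_dim = 2 * hidden                    # concatenation doubles the node width
nodes = [[0.0, 0.0, 1.0, 0.0, 1.0, 13.9, 0.0],
         [1.0, 0.0, 2.0, 0.0, 1.0, 13.9, 0.0]]
p = polyline_feature(nodes, layers)
```

Since both the aggregation and the final pooling are maxima over the node set, the polyline feature does not depend on the order in which the vectors are listed.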

The polyline subgraph can be seen as a generalization of PointNet [22]: when ds=de and a is empty, the network has the same inputs and computation flow as PointNet. However, by embedding ordering information into the vectors, restricting the connectivity of the subgraphs according to the polyline grouping, and encoding attributes as node features, the method is particularly well suited to encoding structured map annotations and the trajectories of traffic participants.

Global graphs for high-level interactions

Now consider modeling higher-order interactions on the polyline node features $\{p_1, p_2, \ldots, p_P\}$ through a global interaction graph:

$$\{p_i^{(l+1)}\} = \mathrm{GNN}\left(\{p_i^{(l)}\}, A\right)$$

where $\{p_i^{(l)}\}$ is the set of polyline node features, GNN(.) is a single layer of a graph neural network, and $A$ is the adjacency matrix of the set of polyline nodes.

The adjacency matrix A can be designed heuristically, for example from the spatial distance between nodes [2]. For simplicity, A is assumed to describe a fully connected graph. The graph network is implemented as a self-attention operation:

$$\mathrm{GNN}(P) = \mathrm{softmax}\left(P_Q P_K^T\right) P_V$$

where P is the node feature matrix, and PQ, PK and PV are its linear projections.
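A small sketch of this self-attention layer on a fully connected graph is given below. It uses plain nested lists so that it runs without dependencies; the projection matrices `W_q`, `W_k`, `W_v` are hypothetical stand-ins for the learned linear projections, and no 1/sqrt(d) scaling is applied because the text does not mention one.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                         # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(P, W_q, W_k, W_v):
    """Return softmax(P_Q P_K^T) P_V together with the attention weights."""
    P_q, P_k, P_v = matmul(P, W_q), matmul(P, W_k), matmul(P, W_v)
    scores = matmul(P_q, [list(col) for col in zip(*P_k)])   # P_Q P_K^T
    weights = [softmax(row) for row in scores]
    return matmul(weights, P_v), weights

# Three polyline nodes with 2-dimensional features; identity projections for illustration.
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(P, I, I, I)
```

Nothing in the function depends on the number of rows of `P`, so the same layer handles scenes with any number of polylines.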

The future trajectory is then decoded from the node corresponding to the dynamic traffic participant:

$$v_i^{future} = \psi_{traj}\left(p_i^{(L_t)}\right)$$

where $L_t$ is the number of layers in the graph neural network and ψtraj(.) is the trajectory decoder. For simplicity, a multi-layer perceptron is used as the trajectory decoder. More advanced decoders, such as the anchor-based approach of MultiPath [6] or variational recurrent neural networks [8, 26], could be used to generate diverse trajectories.

The implementation uses a single GNN layer, so that at inference time only the node features of the target traffic participant need to be computed. Multiple GNN layers can also be stacked to model higher-order interactions if desired.

To encourage the global interaction graph to better capture the interactions between different trajectories and map polylines, an auxiliary graph completion task is introduced. During training, the features of a random subset of polyline nodes are masked out, and the model then tries to recover the masked node features:

$$\hat{p}_i = \psi_{node}\left(p_i^{(L_t)}\right)$$

where ψnode(.) is a node feature decoder implemented by a multi-layer perceptron. These node feature decoders are not used during the testing phase.

To recap, each $p_i$ is a node in a fully connected, unordered graph. So that a node can still be identified when its features are masked, an identifier embedding $p_i^{id}$ is computed as the minimum of the start-point coordinates over all vectors belonging to $p_i$. The input node features are then defined as:

$$p_i^{(0)} = [p_i;\ p_i^{id}]$$
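The identifier and masking step can be sketched as below. The text only says that masked features are hidden while the minimum start coordinates remain available; replacing the masked feature with zeros is an assumption made here, as are the helper names.

```python
def polyline_identifier(nodes):
    """Minimum start coordinates over all vectors of one polyline (nodes are [xs, ys, xe, ye, ...])."""
    return [min(n[0] for n in nodes), min(n[1] for n in nodes)]

def masked_input(polyline_feat, identifier, masked):
    """Concatenate the (possibly masked) polyline feature with its identifier."""
    # Assumption: a masked feature is replaced by zeros; the identifier is never masked.
    feat = [0.0] * len(polyline_feat) if masked else list(polyline_feat)
    return feat + identifier

nodes = [[2.0, 5.0, 3.0, 5.0], [3.0, 5.0, 4.0, 4.0], [4.0, 4.0, 5.0, 3.0]]
pid = polyline_identifier(nodes)
visible = masked_input([0.7, 0.1, 0.9], pid, masked=False)
hidden = masked_input([0.7, 0.1, 0.9], pid, masked=True)
```

The node decoder would then be trained to recover `[0.7, 0.1, 0.9]` from the graph output at the position of the masked node.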

The graph completion task is closely related to the highly successful BERT [11] method in natural language processing, which predicts missing tokens from the surrounding context in text data. Here that training objective is generalized to work on unordered graphs. Unlike some recent methods (e.g. [25]) that generalize the objective to unordered image patches with pre-computed visual features, the node features here are optimized jointly in an end-to-end framework.

overall framework

Once the hierarchical graph neural network has been built, the following multi-task training objective is optimized:

$$\mathcal{L} = \mathcal{L}_{traj} + \alpha \mathcal{L}_{node}$$

where Ltraj is the negative Gaussian log-likelihood of the ground-truth future trajectory, Lnode is the Huber loss between the predicted node features and the ground-truth features of the masked nodes, and α = 1.0 is a scalar that balances the two loss terms.
The predicted trajectory is expressed as coordinate offsets per time step, starting from the last observed position. In addition, the coordinate system is rotated according to the heading of the target vehicle at the last observed time step.
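The multi-task objective can be sketched as follows. The source does not state the variance used in the Gaussian likelihood or the Huber threshold, so `sigma=1.0` and `delta=1.0` below are assumptions, and the flat lists of numbers stand in for trajectory offsets and node features.

```python
import math

def gaussian_nll(pred, target, sigma=1.0):
    """Mean negative log-likelihood of target under N(pred, sigma^2), per coordinate."""
    terms = [0.5 * math.log(2 * math.pi * sigma ** 2) + (t - p) ** 2 / (2 * sigma ** 2)
             for p, t in zip(pred, target)]
    return sum(terms) / len(terms)

def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    def h(r):
        r = abs(r)
        return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return sum(h(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(traj_pred, traj_gt, node_pred, node_gt, alpha=1.0):
    """L = L_traj + alpha * L_node."""
    return gaussian_nll(traj_pred, traj_gt) + alpha * huber(node_pred, node_gt)

loss = total_loss([1.0, 2.0], [1.0, 2.0], [0.0], [0.5])
```

With a fixed variance the Gaussian term reduces to a scaled squared error plus a constant, so in this sketch it behaves like an L2 regression loss on the offsets.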

experiment

In this section, we first describe the experimental setup, including the datasets, the metrics and the baselines based on rasterization and convolutional networks. Second, comprehensive ablation studies are carried out on the rasterization baselines and on VectorNet, respectively. Then the computational costs, in terms of computation and parameter counts, are compared and discussed. Finally, the performance is compared with state-of-the-art methods.

experiment settings

1. Dataset

Experiments are conducted on two datasets for vehicle behavior prediction, the Argoverse dataset [7] and the internal behavior prediction dataset.

Argoverse Behavior Prediction [7] is a dataset for predicting vehicle behavior from historical trajectories. Its roughly 333,000 five-second trajectory sequences are split into 211,000 training samples, 41,000 validation samples and 80,000 test samples. The dataset was created by mining interesting and diverse scenarios such as merging and crossing intersections. The trajectories are sampled at 10 Hz; the first 2 seconds are used as the observation and the last 3 seconds as the prediction target. Each sequence contains one "interesting" traffic participant as the subject to be predicted. In addition to vehicle trajectories, each sequence is associated with map information. The future trajectories of the test set are hidden, so unless otherwise stated, the ablation experiments report performance on the validation set.

The internal dataset is a large-scale dataset for behavior prediction. It contains HD map data, detection boxes and tracking information obtained from the sensor system, and manually labeled vehicle trajectories. In total it contains 2.2 million training samples and 0.55 million test samples. Each trajectory is 4 seconds long, of which the first second is used as the observed history and the last 3 seconds as the future trajectory to be predicted. The trajectories are sampled from real-world vehicle behavior, including standing still, going straight, turning, changing lanes and reversing, and roughly preserve the natural distribution of driving scenarios. The HD map includes lane boundaries, stop signs, pedestrian crossings and speed bumps.

Dataset
Two benchmarks for vehicle behavior prediction are used.

  1. Argoverse dataset: each trajectory is 5 s long; the first 2 s are used as the observation and the last 3 s as the label.
  2. In-house behavior prediction dataset: each trajectory is 4 s long; the first 1 s is used as the observation and the last 3 s as the label.

Argoverse Motion Forecasting is a curated collection of 324,557 scenes of 5 seconds each for training and validation. Each scene contains the 2D bird's-eye-view centroid of each tracked object, sampled at 10 Hz (the 3D point cloud can be converted to and from the 2D bird's-eye view).

The historical trajectories of both datasets come from perception models and are therefore noisy. The label trajectories of the Argoverse dataset are also derived from perception, whereas the label trajectories of the in-house behavior prediction dataset are manually annotated.

2. Evaluation indicators

For evaluation, the widely used displacement metrics are adopted: the average displacement error over the entire trajectory, and the displacement error at time t, where t is 1, 2 and 3 seconds. Displacements are measured in meters.

  • ADE (Average Displacement Error): the displacement between prediction and ground truth averaged over all time steps of the trajectory. DE@t (Displacement Error): the displacement of the predicted trajectory at time t = 1.0, 2.0, 3.0 s. Both are in meters.
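The two metrics can be sketched as below, assuming trajectories are lists of (x, y) positions in meters sampled at 10 Hz, where step k (counting from 1) lies k/10 seconds after the last observation. The function names are illustrative.

```python
import math

def displacement_errors(pred, gt):
    """Euclidean distance between prediction and ground truth at every time step."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]

def ade(pred, gt):
    """Average displacement error over the whole trajectory."""
    errors = displacement_errors(pred, gt)
    return sum(errors) / len(errors)

def de_at(pred, gt, t_seconds, hz=10):
    """Displacement error at t_seconds into the future."""
    index = round(t_seconds * hz) - 1   # step k (1-based) is k / hz seconds ahead
    return displacement_errors(pred, gt)[index]

# Toy 3-second example: the prediction drifts sideways by 0.1 m per step.
gt = [(float(k), 0.0) for k in range(1, 31)]
pred = [(float(k), 0.1 * k) for k in range(1, 31)]
errors_at_1_2_3 = [de_at(pred, gt, t) for t in (1.0, 2.0, 3.0)]
average_error = ade(pred, gt)
```

In this toy case the error grows linearly, so DE@3s is about twice the ADE, which shows why the two numbers are reported separately.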

3. Raster map benchmark

N consecutive frames of history are rendered as images, where N is 10 for the internal dataset and 20 for the Argoverse dataset. Each image has size 400×400×3 and contains the map information and the bounding boxes of the detected objects. The 400 pixels correspond to 100 meters in the internal dataset and to 130 meters in the Argoverse dataset. The rendering is centered on the position of the autonomous vehicle in the last observed frame: the self-driving car is placed at pixel coordinates (200, 320) in the internal dataset and at (200, 200) in the Argoverse dataset. All N frames are stacked to form a 400×400×3N image that serves as the model input.
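The pixel geometry implied by these numbers can be sketched as follows. The axis convention (x to the right, y forward, image rows growing downward) and the reading of (200, 320) as (column, row) are assumptions; the source only gives the image size, the metric extent and the ego pixel.

```python
def world_to_pixel(x, y, ego_pixel, meters_per_side, size=400):
    """Map ego-centred meters (x right, y forward) to a pixel (col, row)."""
    pixels_per_meter = size / meters_per_side
    col = ego_pixel[0] + x * pixels_per_meter
    row = ego_pixel[1] - y * pixels_per_meter   # image rows grow downward
    return round(col), round(row)

# Internal dataset: 400 px cover 100 m (0.25 m per pixel), ego at (200, 320).
internal = world_to_pixel(0.0, 10.0, ego_pixel=(200, 320), meters_per_side=100.0)
# Argoverse: 400 px cover 130 m (0.325 m per pixel), ego at (200, 200).
argoverse = world_to_pixel(13.0, 0.0, ego_pixel=(200, 200), meters_per_side=130.0)
```

Under this reading the internal-dataset image sees 80 m ahead of the ego vehicle but only 20 m behind it, while the Argoverse image is centered on the vehicle.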

The raster image benchmark uses a convolutional network to encode raster images, and its structure is roughly consistent with IntentNet [5]. ResNet-18 [14] is used as the backbone of the convolutional network. Unlike IntentNet, LiDAR input is not used.

To obtain vehicle-centric features, the part of the convolutional feature map around the target vehicle is cropped, and average pooling over all spatial positions of the cropped feature map yields a single vehicle feature vector. Empirically, using a deeper ResNet model or rotating the features according to the vehicle's heading does not give better results. The vehicle feature vector is then fed into a fully connected layer to predict the future trajectory coordinates. The model is trained synchronously on 8 GPUs. Adam [17] is used as the optimizer, and the learning rate is decayed by a factor of 0.3 every 5 epochs. The model is trained for 25 epochs with an initial learning rate of 0.001.
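The step schedule described above can be written as a one-line function. Counting epochs from zero is an assumption; the decay factor, step and initial rate come from the text.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.3, step=5):
    """Learning rate for a zero-based epoch index, decayed by `decay` every `step` epochs."""
    return initial_lr * decay ** (epoch // step)

# Learning rates at the start of each 5-epoch stage of the 25-epoch run.
schedule = [learning_rate(e) for e in range(0, 25, 5)]
```

Over the 25 epochs the rate is decayed four times, so the final stage trains at 0.001 × 0.3^4, which is roughly 8.1e-6.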

To test the impact of the convolutional receptive field and of the feature cropping strategy on performance, ablation studies are carried out on the network receptive field, the feature cropping strategy and the input image resolution.

Ablation Study on the Convolutional Network Baseline

  • baseline-ConvNet
    • Ending at the last observed frame of the vehicle, N consecutive history frames are rendered. For the Argoverse dataset 400 pixels represent 130 meters, and for the in-house dataset 400 pixels represent 100 meters. The N frames are stacked to form the 400×400 image input.
      The influence of the receptive field of the convolutional network, the feature cropping strategy and the resolution of the raster image is studied by ablation.
      Receptive field effects. Since behavior prediction often needs to capture road information over a large range, the receptive field of the convolution can have a large impact on prediction quality. Different variants are evaluated to observe how the two key factors of the receptive field (convolution kernel size and feature cropping strategy) affect prediction performance. The results are shown in Table 1. Comparing kernel sizes of 3, 5 and 7 at 400×400 resolution shows that larger kernels slightly improve performance, but they also increase the computational cost substantially. Different cropping methods are also compared, either increasing the crop size or cropping along the vehicle trajectory. Rows 3 to 6 of Table 1 show that a larger crop size significantly improves performance, and that cropping along the trajectory also leads to better performance. This confirms the importance of the receptive field when rasterized images are used as input. It also exposes a limitation: a carefully designed cropping strategy usually comes with a higher computational cost.

Effect of raster image resolution. The resolution of the raster image is further varied to analyze how it affects prediction performance and computational cost, as shown in the first three rows of Table 1. Three resolutions are tested: 400×400 (0.25 meters per pixel), 200×200 (0.5 meters per pixel) and 100×100 (1 meter per pixel). Performance generally improves as the resolution increases. For the Argoverse dataset, however, increasing the resolution from 200×200 to 400×400 leads to a slight drop in performance, which can be explained by the smaller effective receptive field of a fixed 3×3 kernel. The impact of these design choices on computational cost is discussed in the section on model size and computation below.

Table 1. Effect of the receptive field (controlled by kernel size and cropping strategy) and of the rendering resolution on the ConvNet baseline. The displacement error (DE) and the average displacement error (ADE) are reported on the internal dataset and on the Argoverse dataset, respectively.

VectorNet Ablation Study

  • VectorNet
    • Principle: keep the input information as close as possible to that of the ConvNet. The polyline subgraph has three layers, the global graph has one layer, and the MLPs have 64 hidden units. Ablation experiments are carried out on the context information and on the number of layers of the subgraphs and of the global graph.

Influence of input node type. This experiment investigates whether it helps VectorNet to incorporate both map features and the trajectories of other traffic participants. The first three rows of Table 2 correspond to using only the historical trajectory of the target vehicle, adding only map polylines, and adding both map and trajectory polylines. Adding map features clearly and significantly improves trajectory prediction performance.

Effect of node completion loss. The last four rows of Table 2 compare the impact of adding auxiliary node completion tasks. It can be seen that adding this task helps to improve performance, especially in long-term forecasting.

Influence of graph structure. The effect of the depth and width of the graphs on trajectory prediction performance is investigated in Table 3. For the polyline subgraphs, three layers give the best performance, while for the global graph a single layer is sufficient. Making the multi-layer perceptron wider does not improve performance and even hurts it on the Argoverse dataset, probably because its training set is smaller. Figure 4 visualizes some predicted trajectories.

Comparison with convolutional networks. Finally, VectorNet and the best ConvNet model are compared in Table 4. On the internal dataset, VectorNet achieves performance comparable to the best ResNet model while greatly reducing model parameters and computation. On the Argoverse dataset, VectorNet significantly outperforms the best convolutional network, reducing the displacement error at 3 seconds by 12%. The internal dataset contains many stationary vehicles because it follows the natural distribution of driving scenarios. These cases are easily handled by convolutional networks, which are good at capturing local patterns. The Argoverse dataset, in contrast, provides only "interesting" scenes. VectorNet outperforms the best convolutional baseline there, presumably because the hierarchical graph network lets it capture context over a wider range.

Table 2. Ablation studies on VectorNet with different input node types and training strategies. Here "map" refers to the input vectors from the HD map, and "agent" refers to the input vectors from the trajectories of non-target vehicles. When "Node Compl" is enabled, the training objective includes graph node feature completion in addition to trajectory prediction.

Table 3. Ablation studies on the depth and width of the polyline subgraph and of the global graph. The depth of the polyline subgraph has the greatest impact on the displacement error at 3 seconds.

Comparison of model size and computation

This section compares the computation and model size of the ConvNets and VectorNet, and their impact on performance. The results are shown in Table 4. The prediction decoder is not counted in the computation and parameter figures. As the kernel size and the input image size increase, the computation of the convolutional network grows quadratically, and the number of parameters also grows quadratically with the kernel size. For VectorNet, the computation depends on the number of vector nodes and polylines in the scene. In the internal dataset, the map contains on average 17 polylines with 205 vectors, and the dynamic traffic participants on average 59 polylines with 590 vectors. The computation is calculated from these averages. Note that the computation grows linearly with the number of predicted objects, because the vector coordinates have to be re-normalized and the VectorNet features recomputed for each object.

Comparing R18-k3-t-r400 (the best convolutional model) with VectorNet, VectorNet performs significantly better. In terms of computation, for a single traffic participant the convolutional network requires more than 200 times the computation of VectorNet. Even when VectorNet is run once for each of the roughly 30 vehicles in an average scene, its total computation is still much smaller than that of the convolutional network (200 / 30 leaves a factor of more than 6). The parameter count of VectorNet is 29% of that of the convolutional network. Overall, VectorNet significantly improves performance while greatly reducing the computational cost.

Table 4. Comparison of model parameters and computation of ResNet and VectorNet. R18-kM-cN-rS denotes a ResNet-18 model with convolution kernel size M×M, crop size N×N and input resolution S×S.

Table 5. Trajectory prediction performance on the Argoverse test set when the number of sampled trajectories K is set to 1. The results are taken from the Argoverse leaderboard for 2020/03/18.

Simulation and Results Analysis

A vectorized representation of HD maps and dynamic traffic participants is proposed, together with a hierarchical graph neural network in which the first level aggregates information from the vectors within each polyline and the second level models high-order interactions between polylines. Experiments are carried out on a large-scale internal dataset and on the public Argoverse dataset. The results show that VectorNet performs better than the convolutional network approach while greatly reducing the amount of computation. Moreover, VectorNet reaches the state of the art on the Argoverse dataset. The next step is to combine the VectorNet encoder with a multimodal trajectory decoder to generate diverse future trajectories.

Figure 4. (Left) Visualization of prediction results: lane lines are gray, non-target traffic participants are green, ground truth trajectories of target traffic participants are pink, and predicted trajectories are blue. (Right) Visualization of the degree of attention to the road environment and other traffic participants: Bright red colors correspond to higher attention scores. It can be seen that when traffic participants are faced with multiple choices, the attention mechanism can focus on the correct choice.

Extended notes on the ideas

code link

Geometrically, a lane line contains multiple control points, an intersection is a polygon with multiple vertices, and a traffic sign is a point; all of these can be approximated by a polyline with multiple vertices. Similarly, the trajectory of a dynamic object can be approximated by a polyline. Such polylines can be expressed as vectors, and this is the underlying logic of the whole vector representation.

With the vector representation in place, the next step is to build the context, and a graph is the natural way to express it. A set of vectors (one polyline) becomes a node in the global graph (how exactly is this node constructed?). An open question: in scenes where objects dynamically enter or leave, how can nodes be added to and removed from an existing graph before inference?

On how the graph is constructed: the authors found that spatial proximity and semantic similarity matter most when connecting nodes into a graph. Vectors that belong to the same polyline and share the same semantics are fully connected, their attributes are folded into the polyline feature, and the polylines are then fully connected to one another. Similar to MAE, some nodes are randomly masked and the network is asked to estimate them; a network trained in this way learns a better representation of the interactions between nodes and of the context.

How can MultiPath be combined with VectorNet? The key is how to express the predefined anchors in VectorNet. In essence an anchor is made of points, and points can be expressed as vectors.

Constructing the polyline subgraph

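The lower-level subgraph processes all vectors that belong to the same polyline; these vectors are fully connected to one another:

$$v_i^{(l+1)} = \psi_{rel}\left(g_{enc}(v_i^{(l)}),\ \psi_{agg}\left(\{g_{enc}(v_j^{(l)})\}\right)\right)$$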

vi(l) is the feature of node i at layer l; genc is the feature extraction function applied to a single node, agg is the function that aggregates the features of all neighboring nodes, and rel is the relation function between a node and its neighbors.

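In terms of implementation, genc is an MLP, agg is max pooling, and rel is a simple concatenation; the MLP weights are shared by all nodes within a polyline. After the last of the $L_p$ subgraph layers, the polyline feature is obtained as:

$$p = \psi_{agg}\left(\{v_i^{(L_p)}\}\right)$$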

agg is still a maxpooling.


The feature p of the polyline is obtained through the MLP, pooling and concatenation steps.

Global graph for higher-order interactions

The global interaction graph is defined as follows:

$$\{p_i^{(l+1)}\} = \mathrm{GNN}\left(\{p_i^{(l)}\}, A\right)$$

A is the adjacency matrix of the polyline nodes, and the GNN processes the polyline node features of layer l to obtain their interaction features.
A can be designed in more elaborate ways, for example based on distance or with other methods (such as one obtained by online learning); here a fully connected graph is used for simplicity:

$$\mathrm{GNN}(P) = \mathrm{softmax}\left(P_Q P_K^T\right) P_V$$

The GNN is implemented with plain self-attention, which allows the number of nodes to vary; P is the stacked feature matrix of all nodes, and PQ, PK and PV are its query, key and value projections, respectively.
$$v_i^{future} = \psi_{traj}\left(p_i^{(L_t)}\right)$$

The node feature is decoded into the corresponding future trajectory; the decoder is simply an MLP, and a single attention layer is used. More complex designs are of course possible.
insert image description here

Similar to MAE, some polyline nodes are randomly masked and their features are recovered from the output of the global graph; the node-feature decoder (φ_node) is a simple MLP. When a node's features are masked out, the minimum start-point coordinates over its vectors are kept as an identifier for that polyline.

overall framework

insert image description here

Note that the objective has two terms: a Gaussian negative log-likelihood for the trajectory prediction and a Huber loss for the node completion. In addition, the polyline features are L2-normalized before entering the GNN.
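
Written out, the combined objective has the following shape. The names L_traj and L_node and the scalar weight α are notation assumed here for illustration; the text only states that the two terms exist:

```latex
\begin{equation}
  \mathcal{L} = \mathcal{L}_{\mathrm{traj}} + \alpha\, \mathcal{L}_{\mathrm{node}}
\end{equation}
```

Here the first term is the trajectory prediction loss and the second is the loss on the masked node features.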

insert image description here

Observed agent trajectories and map features are represented as sequences of vectors and passed to a local graph network to obtain polyline-level features. These features are then passed to a fully connected graph to model higher-order interactions. Two types of losses are computed: predicting future trajectories from node features corresponding to mobile agents, and predicting node features when their features are masked.

method

  1. Build a vectorized representation of the map and the moving agents (trajectories and lane lines are sampled, and each resulting vector is described by a feature vector).
  2. Use the local graph network to aggregate the features of each polyline (a fully connected subgraph; each polyline is finally condensed into a single feature vector, i.e. one node).
  3. Use the global graph to model the interactions between the polyline nodes (the global graph fully connects all polyline nodes; after one layer of state updates, a decoding network produces the predicted trajectory of the target agent as coordinate displacements).
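
As a minimal sketch of step 1, the snippet below turns an ordered list of sampled points into vectors of the form [start, end, attributes, polyline ID]. The function names, the single-number attribute, and the choice to center coordinates on the last observed point are assumptions made for illustration, not the original implementation:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_to_origin(points: List[Point], origin: Point) -> List[Point]:
    """Shift all points so that `origin` becomes (0, 0)."""
    ox, oy = origin
    return [(x - ox, y - oy) for x, y in points]


def vectorize_polyline(points: List[Point],
                       attributes: List[float],
                       polyline_id: int) -> List[List[float]]:
    """Connect consecutive points into vectors [xs, ys, xe, ye, *attributes, id]."""
    vectors = []
    for start, end in zip(points[:-1], points[1:]):
        vectors.append([*start, *end, *attributes, float(polyline_id)])
    return vectors


# Hypothetical agent trajectory sampled at equal time intervals.
trajectory = [(10.0, 5.0), (11.0, 5.5), (12.0, 6.0)]
centered = normalize_to_origin(trajectory, trajectory[-1])
agent_vectors = vectorize_polyline(centered, attributes=[1.0], polyline_id=0)
```

Three sampled points give two vectors, and every vector of the same polyline carries the same ID so that the subgraph stage can group them.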

1. Polyline Graph

  1. Vectorization
    Map features (lane lines, intersections): choose a starting point and direction, sample key points at equal spatial intervals along the spline, and connect adjacent points into vectors. Motion trajectories: sample key points at equal time intervals and connect them into vectors in the same way.
    A polyline P_j is a set of vectors {v_1, v_2, ..., v_P}.
    Each vector v_i of polyline P_j has the following components: d_i^s and d_i^e are the coordinates of its start and end points; a_i holds attribute features such as object type, timestamp, road type, and speed limit; j is the ID of the polyline:
    insert image description here
  2. Polyline subgraphs
    The nodes on the same polyline form a subgraph, with the following node feature update rule:
    insert image description here

insert image description here
  3. Polyline representation: all node features on the same polyline go through a max pooling operation to aggregate them into one feature:
insert image description here
Note:

    1. The coordinates of the start and end points can be 2D or 3D.
    2. The time step and location at which the target agent was last observed are taken as the origin of time and space.
    3. The polyline subgraph can be seen as a generalization of PointNet: when d^s = d^e and a and l are empty, it reduces to PointNet.
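
The subgraph update and the final pooling described in this section can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: the encoder is a single linear layer with ReLU rather than a full MLP, the weights are random placeholders, and the pooling runs over all nodes of the polyline including the node itself. All function names are hypothetical:

```python
import random
from typing import List

Vector = List[float]


def make_linear(in_dim: int, out_dim: int, seed: int) -> List[Vector]:
    """Random placeholder weights for one linear layer (out_dim x in_dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def g_enc(weights: List[Vector], v: Vector) -> Vector:
    """Node encoder: linear layer + ReLU, with weights shared by all nodes."""
    return [max(0.0, sum(w * x for w, x in zip(row, v))) for row in weights]


def max_pool(features: List[Vector]) -> Vector:
    """Element-wise maximum over a set of feature vectors."""
    return [max(column) for column in zip(*features)]


def subgraph_layer(weights: List[Vector], nodes: List[Vector]) -> List[Vector]:
    """Encode every node, pool over the polyline, then concatenate."""
    encoded = [g_enc(weights, v) for v in nodes]
    pooled = max_pool(encoded)
    return [e + pooled for e in encoded]  # list concatenation


def polyline_feature(nodes: List[Vector], layer_weights: List[List[Vector]]) -> Vector:
    """Stack subgraph layers, then max-pool the nodes into one polyline vector."""
    for weights in layer_weights:
        nodes = subgraph_layer(weights, nodes)
    return max_pool(nodes)


hidden = 4
nodes = [[-2.0, -1.0, -1.0, -0.5, 1.0, 0.0],
         [-1.0, -0.5, 0.0, 0.0, 1.0, 0.0]]
# Each layer doubles the width through concatenation, so layer 2 takes 2 * hidden inputs.
layer_weights = [make_linear(6, hidden, seed=0), make_linear(2 * hidden, hidden, seed=1)]
p = polyline_feature(nodes, layer_weights)
```

Because both the per-layer aggregation and the final pooling are maxima over the node set, the resulting polyline feature does not depend on the order in which the vectors are listed.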

2. Global Graph

  1. Global graph

The polyline nodes {p_1, p_2, ..., p_P} form a global graph with adjacency matrix A; for simplicity, the paper uses a fully connected graph.
insert image description here

The specific calculation of the graph uses the self-attention operation:
insert image description here

P is the node feature matrix, and PQ, PK, and PV are the linear projections of P.

Q, K, and V come from the self-attention mechanism of the Transformer.
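
A minimal sketch of this self-attention step over polyline nodes is given below, using only the standard library. The projection matrices are set to the identity purely for illustration, and the helper names are assumptions; a real model would learn the projections and typically use a tensor library:

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(P: Matrix, W_q: Matrix, W_k: Matrix, W_v: Matrix) -> Matrix:
    """Compute softmax(P_Q P_K^T) P_V for a node feature matrix P."""
    P_q, P_k, P_v = matmul(P, W_q), matmul(P, W_k), matmul(P, W_v)
    scores = matmul(P_q, transpose(P_k))
    weights = [softmax(r) for r in scores]
    return matmul(weights, P_v)


identity = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three polyline nodes, two features each
updated = self_attention(P, identity, identity, identity)
```

Nothing in the function fixes the number of rows of P, which is why the same layer can handle scenes with different numbers of polylines.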

  2. Predicting the future trajectories of moving agents

insert image description here

  3. Auxiliary graph completion task

To make the graph capture the strong interactions between trajectories and lane lines, a subset of the polyline node feature vectors is masked during training, and the model is asked to predict the masked features:
insert image description here
insert image description here
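
The masking step and a Huber loss on the recovered features can be sketched as below. This is an illustration under stated assumptions: masked features are simply zeroed, the identifier that would let the model tell masked nodes apart is left out, and the function names, the mask ratio, and `delta` are hypothetical choices:

```python
import random
from typing import List, Tuple

Vector = List[float]


def mask_nodes(features: List[Vector], mask_ratio: float,
               rng: random.Random) -> Tuple[List[Vector], List[int]]:
    """Zero out a random subset of node features; return masked copy and indices."""
    n_mask = max(1, int(len(features) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(features)), n_mask))
    masked = [[0.0] * len(f) if i in masked_idx else list(f)
              for i, f in enumerate(features)]
    return masked, masked_idx


def huber_loss(pred: Vector, target: Vector, delta: float = 1.0) -> float:
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        total += 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)
    return total / len(pred)


node_features = [[0.2, 0.4], [1.0, 0.0], [0.5, 0.5], [0.3, 0.9]]
masked, masked_idx = mask_nodes(node_features, mask_ratio=0.5, rng=random.Random(0))

# A trained model would predict the masked features; here a stand-in prediction is scored.
example_loss = huber_loss([0.0, 3.0], [0.5, 0.0])
```

During training this loss would be evaluated only on the masked nodes, comparing the model's predicted features with the original ones.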

3. Ablation Experiment

  1. ConvNet ablation: convolution kernel size, crop size, and image resolution.
  2. VectorNet ablation: use of context, the node completion task, and the number of subgraph and global graph layers.

Origin blog.csdn.net/weixin_44077556/article/details/128974302