【WSDM 2018】Predicting Multi-step Citywide Passenger Demands Using Attention-based Neural Networks

Topic


Predicting Multi-step Citywide Passenger Demands Using Attention-based Neural Networks

Paper background

Conference paper at WSDM 2018 (the 11th ACM International Conference on Web Search and Data Mining), from the School of Computer Science, Shanghai Jiao Tong University

Abstract

Existing research focuses on predicting passenger demand at the next time step for selected locations or hotspots. However, we argue that multi-step citywide passenger demand encompasses time-varying demand trends and the global passenger flow status, and is therefore more useful for avoiding supply-demand mismatches and for formulating effective vehicle distribution/scheduling strategies. This paper proposes an end-to-end deep neural network model using an encoder-decoder framework based on convolutional and ConvLSTM units to identify complex features that capture the spatio-temporal influences and the pickup-dropoff interactions on citywide passenger demands. An attention model is embedded to capture the impact of latent citywide mobility patterns. We evaluate the model on taxi and bike-sharing datasets, and the experimental results show that the model outperforms the baselines.

The main point of the paper

This paper predicts demand for the next several time steps (multi-step). The authors argue that multi-step demand prediction is more meaningful: it not only reflects the trend of demand changes but also expresses the overall evolution, thereby avoiding prediction errors caused by temporary, sudden demand fluctuations. First, multi-step passenger demands indicate the changing trend of demand, which helps avoid impulsive vehicle dispatch responses when demand fluctuates temporarily; in contrast, single-step forecasts are short-sighted and more likely to trigger unnecessary vehicle scheduling. Second, a large number of vehicles are scattered throughout the city, and predicting passenger demand across the entire city provides an overview of the global situation, which is more informative for better vehicle allocation. The authors also argue that pickup demands are closely related to dropoff demands, so the two should be combined into two channels of one tensor.

Note: pickup demands refer to trips departing from a location; dropoff demands refer to trips arriving at a location

This paper studies multi-step citywide passenger demand forecasting (pickup and dropoff demand in each region). The key technical challenges are how to handle:

  • the complex spatio-temporal influences on passenger demands
  • the interaction between pickup and dropoff demands

The main idea of the paper

The paper proposes an end-to-end deep neural network to solve the multi-step citywide passenger demand forecasting problem. The citywide pickup and dropoff demands for a time interval are organized into a 3D demand tensor, and the sequence of demand tensors from the previous time intervals is taken as input. The predictive model is an encoder-decoder framework. In the encoding stage, convolutional units extract spatial features from each tensor, effectively capturing the spatial influences and the interaction between pickups and dropoffs. ConvLSTM units then reveal the complex spatio-temporal influences, producing a high-level representation of the input sequence. The decoder behaves inversely to the encoder and outputs the future demand tensors, with an attention mechanism embedded in the process.

Main contributions and innovations of the paper

  • The article defines the citywide multi-step prediction problem for the first time. For this prediction task, an end-to-end deep neural network model is proposed, adopting an encoder-decoder framework based on convolutional and ConvLSTM units, which effectively captures complex spatio-temporal influences and pickup/dropoff interactions

  • An attention model is introduced and integrated into the decoder, improving predictive performance

  • Validation on the New York taxi and bike datasets shows that the method achieves the best prediction performance among the compared methods

The main content of the paper

Data preprocessing

The article uses the TaxiNYC taxi dataset and the CitiBikeNYC bike dataset.

TaxiNYC: five years of records from 2009 to 2013 are used as training data, and the records of 2014 and 2015 serve as the validation and test sets respectively.

CitiBikeNYC (from July 1, 2013 to June 30, 2016): each bike trip record includes the trip duration, start/end station IDs, and start/end timestamps. Data from July 1, 2015 to December 31, 2015 are used as the validation set, data from January 1, 2016 to June 30, 2016 as the test set, and the rest as training data.

First, the whole city is divided into an $m \times n$ grid, and a hyperparameter $\lambda$ controls the grid width.
As pictured, $A$, $B$, and $C$ represent three regions; the red channel represents the pickup demand and the blue channel represents the dropoff demand. Panel (c) shows that the dropoff demand at the previous moment influences the pickup demand at the next moment.


(figure: pickup and dropoff demands of regions A, B, and C over successive time intervals)

Problem definition

The third part of the paper gives three definitions: Grid Map, Pickup/Dropoff Demand Maps, and Multi-step Citywide Demand Prediction. In short, these three definitions do the following:

  • Grid Map: divide the whole city into grids, with a parameter $\lambda$ controlling the grid width, and map all raw records onto the grid

  • Pickup/Dropoff Demand Maps: count the pickup/dropoff demand in each grid cell in each time interval (see the sketch after this list)

  • Multi-step Citywide Demand Prediction: define the problem to be solved
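
As a concrete illustration of the Grid Map and Demand Maps definitions, here is a minimal sketch of binning raw trip records into per-interval demand tensors. The column names, the bounding-box arguments, and the 30-minute interval are illustrative assumptions, not the paper's exact preprocessing (the paper controls the cell width with $\lambda$; here the grid is set by m, n and the bounding box).

```python
import numpy as np
import pandas as pd

def build_demand_tensors(trips: pd.DataFrame, m: int, n: int,
                         lat_min, lat_max, lon_min, lon_max,
                         freq: str = "30min") -> np.ndarray:
    """Bin trip records into a (T, m, n, 2) array of demand tensors.

    Channel 0 = pickup demand, channel 1 = dropoff demand.
    Assumed (hypothetical) columns: pickup_time, dropoff_time,
    pickup_lat, pickup_lon, dropoff_lat, dropoff_lon.
    """
    t0 = trips["pickup_time"].min().floor(freq)
    t1 = trips["dropoff_time"].max().ceil(freq)
    n_steps = len(pd.date_range(t0, t1, freq=freq)) - 1
    demand = np.zeros((n_steps, m, n, 2), dtype=np.int32)

    def to_cell(lat, lon):
        # Map coordinates to grid indices, clipping points on the boundary.
        i = ((lat - lat_min) / (lat_max - lat_min) * m).astype(int).clip(0, m - 1)
        j = ((lon - lon_min) / (lon_max - lon_min) * n).astype(int).clip(0, n - 1)
        return i, j

    for ch, (tcol, latcol, loncol) in enumerate([
            ("pickup_time", "pickup_lat", "pickup_lon"),
            ("dropoff_time", "dropoff_lat", "dropoff_lon")]):
        t_idx = ((trips[tcol] - t0) // pd.Timedelta(freq)).to_numpy().astype(int)
        i, j = to_cell(trips[latcol].to_numpy(), trips[loncol].to_numpy())
        valid = (t_idx >= 0) & (t_idx < n_steps)
        # Count each trip into its (time, row, col, channel) cell.
        np.add.at(demand, (t_idx[valid], i[valid], j[valid], ch), 1)
    return demand
```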

The data preprocessing is essentially the same as in the article "Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction".

Network model

Models used in the paper: Seq2Seq, attention, and ConvLSTM. Using the attention mechanism in Seq2Seq requires a thought vector.

The thought vector is a vector computed from the encoder outputs at each time step and the output at the current decoding step. Using a thought vector extracts information from multiple encoder outputs, including the last encoder step and all other steps, yielding a richer representation.

In this paper, however, the author computes the vector from representative demand tensors (hereafter referred to as $A$).
Traffic forecasting problems in virtually any (personally known) scenario, such as order volume, traffic flow, or passenger flow in a region, exhibit certain regularities. The author's view is that if a task has $K$ kinds of latent patterns, then applying the appropriate pattern to the input while decoding the prediction should improve results. (In the Seq2Seq translation model, attention is applied to the memory-unit outputs, i.e. usually over the encoder outputs; in this paper it is applied to the input.)
Research has found that the distribution of passenger demand has certain spatio-temporal regularities, likely caused by the underlying mobility patterns within the city. For example, a subway station is always in high demand during weekday peak hours and in low demand in the middle of the night. To capture this regularity, the authors perform K-means clustering on the historical demand tensors. The resulting $K$ representative demand tensors are called label tensors. Using an attention model to incorporate these label tensors into next-step demand prediction is a new attempt in this paper.
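
As a minimal sketch of how these label tensors could be computed (assuming the `demand` array from the earlier preprocessing sketch): K-means runs over the flattened historical demand tensors, and the cluster centers serve as the $K$ label tensors $A_k$.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_tensors(demand: np.ndarray, k: int) -> np.ndarray:
    """Cluster historical demand tensors; return the K cluster centers
    reshaped back to (k, m, n, 2) -- the "label tensors" A_k."""
    T, m, n, c = demand.shape
    flat = demand.reshape(T, -1).astype(np.float32)  # one sample per interval
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    return km.cluster_centers_.reshape(k, m, n, c)
```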

Model framework

(figure: overall encoder-decoder model framework)
The paper divides an area into grids by latitude and longitude. The pickup and dropoff demands of all regions of the entire city in one time interval are organized into a 3D tensor $M \in \mathbb{R}^{n \times m \times 2}$; multi-step forecasting takes the demands of multiple past time intervals as input and outputs the demands of multiple future time intervals.
The model uses an encoder-decoder framework: the input sequence of passenger demand tensors is first "encoded" into fixed-dimensional representations, which are then "decoded" to generate the future demand tensors.

Encoder: Encoding Previous Passenger Demands

The encoder mainly includes two parts: CNN and ConvLSTM. The encoder input is the 3D tensors of the previous $N$ time intervals. The encoder stacks two Conv layers and two ConvLSTM layers; the ConvLSTM layers' final hidden state and cell state are used as the initial state of the decoder.
For the input demand tensors $\{M_1, \cdots, M_N\}$, each demand tensor $M_t$ first passes through $L$ convolutional layers, yielding $I^e_{t,L}$; the $N$ resulting feature maps $\{I^e_{t,L}\}_{t=1}^{N}$ are then fed into a multi-layer ConvLSTM network.
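
Below is a hedged PyTorch sketch of this encoder, not the authors' released code. PyTorch has no built-in ConvLSTM, so a minimal ConvLSTM cell is included; channel sizes, kernel sizes, and the single ConvLSTM layer (the paper stacks two) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: the four gates share one convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class Encoder(nn.Module):
    """Per-step CNN, then a ConvLSTM rolled over the input sequence."""
    def __init__(self, hid=32):
        super().__init__()
        self.cnn = nn.Sequential(                 # the L conv layers (here L = 2)
            nn.Conv2d(2, hid, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid, hid, 3, padding=1), nn.ReLU())
        self.rnn = ConvLSTMCell(hid, hid)

    def forward(self, seq):                       # seq: (batch, N, 2, m, n)
        b, N, _, m, n = seq.shape
        h = seq.new_zeros(b, self.rnn.hid_ch, m, n)
        c = torch.zeros_like(h)
        for t in range(N):
            h, c = self.rnn(self.cnn(seq[:, t]), (h, c))
        return h, c                               # initial state for the decoder
```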

Attention

To explain the decoding part more clearly, here is the structure of the entire attention computation:

  • Step 1: use K-means clustering to determine $K$ representative demand tensors $A$ (3D tensors), i.e. the label tensors mentioned above (these $K$ label tensors are the $K$ cluster centers). Each is then passed through a two-layer CNN to extract features, yielding $a$ (also a 3D tensor).
  • Step 2: obtain the weight vector. Each $a_k$ and the decoder's hidden state from the previous step are flattened and fed into a multi-layer neural network that outputs a single value; the $K$ values pass through a softmax to get normalized weights. The $K$ weights are applied to the $K$ tensors $a_k$ to obtain the final weighted $z_t$. (On the weighting: each $a_k$ is a 3D tensor while the weight vector is one-dimensional with $K$ entries, so the $K$ 3D tensors are summed position-wise under the weight vector; the resulting $z_t$ is still a 3D tensor.)
    (figure: structure of the attention module)

First, about the demand tensors $\{A_k\}_{k=1}^{K}$: these are the representative demand tensors mentioned in the idea section. Each is a 3D tensor in which $(A_k)_{ij}$ reflects the pickup and dropoff demands of region $g_{ij}$ under a certain underlying mobility pattern.

My personal understanding: since each traffic dataset may contain $K$ kinds of latent patterns, the $K$ tensors $\{A_k\}_{k=1}^{K}$ are the tensors that most directly reflect these $K$ regularities. According to the paper, $\{A_k\}_{k=1}^{K}$ are the $K$ cluster centers obtained by clustering the historical demand tensors.

How $K$ is chosen: the experiment section of the original paper mentions that an appropriate $K$ should achieve both a low intra-cluster distance (distortion in the figure below) and a high silhouette coefficient (Silhouette in the figure below); based on this principle, the author sets $K$ to 16 and 32 on the two datasets respectively.
(figure: distortion and silhouette coefficient as K varies on the two datasets)
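
A small sketch of this selection procedure with scikit-learn, under the assumption that `flat_demand` holds the flattened historical demand tensors; the candidate $K$ values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(flat_demand: np.ndarray, candidates=(4, 8, 16, 32, 64)):
    """Report distortion (within-cluster SSE) and silhouette for each K."""
    for k in candidates:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat_demand)
        sil = silhouette_score(flat_demand, km.labels_)
        print(f"K={k:3d}  distortion={km.inertia_:.1f}  silhouette={sil:.3f}")
```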
With $\{A_k\}_{k=1}^{K}$ in hand, each is further convolved to get $a_k$; then the decoder's hidden state at the previous moment, $H^d_{t-1}$, is compared with each $a_k$ through a 2-layer neural network to compute the similarities $\alpha_{t1}, \alpha_{t2}, \dots, \alpha_{tK}$, which finally yield the attention vector $z_t$. The concrete formulas are:

$$
\begin{aligned}
h^a_{tk} &= f\left(W^a_h\, \overline{H^d_{t-1}} + W_a\, \overline{a_k} + b^a_h\right), && \forall k \in [1, K] \\
s^a_{tk} &= f\left(W^a_s\, h^a_{tk}\right), && \forall k \in [1, K] \\
\alpha_{tk} &= \frac{\exp\left(s^a_{tk}\right)}{\sum_{k'=1}^{K} \exp\left(s^a_{tk'}\right)}, && \forall k \in [1, K] \\
z_t &= \sum_{k=1}^{K} \alpha_{tk}\, a_k
\end{aligned}
$$

where the overlines denote flattened tensors.
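
A minimal PyTorch sketch of these formulas. Flattening plus concatenation followed by one linear layer plays the role of the two separate projections $W^a_h$ and $W_a$; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Score each label feature a_k against the previous decoder state,
    softmax the K scores, and return the weighted sum z_t."""
    def __init__(self, feat_dim, hid_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(              # the 2-layer scoring network
            nn.Linear(2 * feat_dim, hid_dim), nn.Tanh(),
            nn.Linear(hid_dim, 1))

    def forward(self, a, h_prev):
        # a: (K, C, m, n) label features; h_prev: (batch, C, m, n)
        K = a.shape[0]
        a_flat = a.flatten(1)                  # (K, C*m*n)
        h_flat = h_prev.flatten(1)             # (batch, C*m*n)
        pairs = torch.cat([h_flat.unsqueeze(1).expand(-1, K, -1),
                           a_flat.unsqueeze(0).expand(h_flat.size(0), -1, -1)],
                          dim=-1)              # (batch, K, 2*C*m*n)
        alpha = torch.softmax(self.mlp(pairs).squeeze(-1), dim=-1)  # (batch, K)
        z = torch.einsum("bk,kcmn->bcmn", alpha, a)  # weighted sum of a_k
        return z, alpha
```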

Decoder: Decoding Multi-step Passenger Demands

The decoder predicts the next $B$ time steps, each of which is a 3D tensor. The final hidden state and cell state output by the encoder are used as the decoder's initial state. In addition, the output $z_t$ of the attention part serves as the decoder's input at each step. The formulas for the ConvLSTM part are as follows:
$$
\begin{aligned}
i^d_t &= \sigma\left(W^d_{zi} * z_t + W^d_{hi} * H^d_{t-1} + b^d_i\right) \\
f^d_t &= \sigma\left(W^d_{zf} * z_t + W^d_{hf} * H^d_{t-1} + b^d_f\right) \\
o^d_t &= \sigma\left(W^d_{zo} * z_t + W^d_{ho} * H^d_{t-1} + b^d_o\right) \\
C^d_t &= f^d_t \circ C^d_{t-1} + i^d_t \circ \tanh\left(W^d_{zc} * z_t + W^d_{hc} * H^d_{t-1} + b^d_c\right) \\
H^d_t &= o^d_t \circ \tanh\left(C^d_t\right)
\end{aligned}
$$
From the formulas, the ConvLSTM input at each step consists of the previous hidden state $H^d_{t-1}$ and cell state $C^d_{t-1}$ plus the attention vector $z_t$ at the current moment. After the outputs $H^d_t$ at the multiple time steps are obtained, each passes through multi-layer convolutions (mirroring the CNN in the encoding part) to produce the output of the entire model (each $H$ passes through two Conv layers to get the final demand tensor).
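
A sketch of the decoder loop under the same assumptions, reusing `ConvLSTMCell` and `LabelAttention` from the sketches above; the two output Conv layers map the hidden state back to the two demand channels.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Roll the ConvLSTM forward B steps, feeding the attention vector z_t
    as input at each step, then map hidden states back to demand maps."""
    def __init__(self, hid=32, steps=3):
        super().__init__()
        self.steps = steps
        self.cell = ConvLSTMCell(hid, hid)       # from the encoder sketch
        self.out = nn.Sequential(                # two Conv layers -> 2 channels
            nn.Conv2d(hid, hid, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid, 2, 3, padding=1))

    def forward(self, h, c, attn, a):            # (h, c) from the encoder
        preds = []
        for _ in range(self.steps):
            z, _ = attn(a, h)                    # z_t from LabelAttention
            h, c = self.cell(z, (h, c))
            preds.append(self.out(h))            # future demand tensor
        return torch.stack(preds, dim=1)         # (batch, steps, 2, m, n)
```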
(figure: decoder structure; the red-marked part is the recurrent state flow)
The part marked in red in the decoder figure shows how the two internal states of the recurrent network (hidden and cell) are passed along through the time steps.

Summary

For the demand forecasting problem in the paper: first, the author argues that pickup demands are affected by dropoff demands, so the two should be combined. Second, the author argues that the traffic forecasting problem should be considered from a global-optimization perspective, hence the multi-step formulation. Finally, the author holds that each kind of traffic data may contain multiple latent mobility patterns, and applying these latent patterns to the input during decoding can improve the prediction results.

Dataset description

Taxi TaxiNYC: the New York taxi dataset is structured text recording New York taxi trip information, including the longitude and latitude of the pickup and dropoff points, timestamps, passenger count, fare, etc. The dataset covers four vehicle types: Green, Yellow, For-Hire Vehicle (FHV), and High Volume For-Hire Vehicle (HVFHS).

Note: the data provider has since changed all data files from CSV to Parquet format
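
For instance, a monthly file can be read directly with pandas (the file name below is hypothetical; reading Parquet requires the pyarrow or fastparquet package):

```python
import pandas as pd

# Hypothetical file name; the NYC TLC now publishes monthly Parquet files.
df = pd.read_parquet("yellow_tripdata_2014-01.parquet")
print(df.columns.tolist())   # inspect the available trip fields
```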

New York taxi dataset: collation and data description

Bicycle data CitiBikeNYC: trip duration, start time, stop time, start station ID, start station name, start station latitude, start station longitude, end station ID, end station name, end station latitude, end station longitude, bike ID, year of birth, user type (Customer = 24-hour pass or 3-day pass user; Subscriber = annual member), gender (0 = unknown; 1 = male; 2 = female)
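
A hedged sketch of loading a CitiBike monthly file and renaming its fields to match the preprocessing sketch above; the exact CSV headers vary across years, so the names below are assumptions based on the field list just given.

```python
import pandas as pd

# Assumed headers, following the field list above; they differ by year.
cols = ["starttime", "stoptime",
        "start station latitude", "start station longitude",
        "end station latitude", "end station longitude"]
bike = pd.read_csv("201601-citibike-tripdata.csv", usecols=cols,
                   parse_dates=["starttime", "stoptime"])
bike = bike.rename(columns={
    "starttime": "pickup_time", "stoptime": "dropoff_time",
    "start station latitude": "pickup_lat",
    "start station longitude": "pickup_lon",
    "end station latitude": "dropoff_lat",
    "end station longitude": "dropoff_lon",
})  # now compatible with build_demand_tensors() above
```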

CitiBike System Data official website
Kaggle CitiBike System Data dataset

Other Questions (Reference)

Why use attention here? The general feeling is that it finds the cluster center $a_i$ nearest to $h$ and uses $a_i$ in place of $h$ (albeit in a soft way), but it is not obvious why. Shouldn't the usual attention coefficient $\alpha$ be used for a weighted average of the MLP outputs? Why does this model take the weighted average of $\alpha$ with the cluster centers?

Answer: First, why use attention? Traffic forecasting problems in virtually any (personally known) scenario, such as order volume, traffic flow, or passenger flow in a region, exhibit certain regularities. The author's view in the paper is that if the dataset has $K$ kinds of latent patterns, then applying the appropriate pattern to the input while decoding the prediction should improve results.
What does that mean? Since this type of data usually has regular characteristics (of many kinds, say $K$), if we can find these $K$ patterns and then apply the pattern that matches the current moment's output to that moment, the prediction becomes more accurate. Clustering serves to find the $K$ patterns, and attention serves to pick, from the $K$ patterns, the one that best matches the output at the current moment.

Second: shouldn't the usual attention coefficient $\alpha$ be used for a weighted average of the MLP outputs? Why does this model take the weighted average of $\alpha$ with the cluster centers?
Answer: in ordinary attention, $\alpha$ is also obtained from an MLP, and then $\alpha$ is applied to the MLP's inputs ($a_1, a_2, \dots, a_K$ in the paper; the MLP inputs are CNN-extracted features, so the form is similar). Take a closer look at the network structure in the paper.

Origin blog.csdn.net/qq_44033208/article/details/130081630