2. Methods

Predicting network-wide traffic congestion requires considering traffic information along the time and space dimensions jointly. Let the $x$-axis and $y$-axis of a matrix represent time and space, respectively. The elements of the matrix are values of traffic variables associated with a given time and location. The matrix can be viewed as one channel of an image, in which every pixel takes the corresponding matrix value. The image is therefore $M$ pixels wide and $N$ pixels high, where $M$ and $N$ are the two dimensions of the matrix. A two-step methodology is designed to learn from the matrix and make predictions: first, network traffic is converted to images; second, a CNN is applied to the images for network traffic prediction.

2.1. Converting Network Traffic to Images

A vehicle trajectory recorded by a floating car with a dedicated GPS device provides specific information on vehicle speed and position at a certain time. From the trajectory, the spatiotemporal traffic information on each road segment can be estimated and integrated further into a time-space matrix that serves as a time-space image.

In the time dimension, time usually ranges from the beginning to the end of a day, and time intervals, which are usually 10 s to 5 min, depend on the sampling resolution of the GPS devices. Generally, narrow intervals, for example 10 s, are meaningless for traffic prediction. Thus, if the sampling resolution is high, these data may be aggregated to obtain wider intervals, such as several minutes.
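The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `aggregate_speeds`, the averaging rule, and the sample readings are all assumptions made for the example.

```python
from statistics import mean

def aggregate_speeds(samples, interval_s):
    """Average (timestamp_s, speed) samples into fixed-width time bins.

    Returns a dict mapping bin index -> mean speed; empty bins are omitted.
    """
    bins = {}
    for t, speed in samples:
        bins.setdefault(int(t // interval_s), []).append(speed)
    return {k: mean(v) for k, v in sorted(bins.items())}

# Hypothetical 10 s readings (seconds since midnight, km/h) for one road section.
samples = [(0, 40.0), (10, 42.0), (20, 44.0), (120, 30.0), (130, 34.0)]
two_minute_speeds = aggregate_speeds(samples, interval_s=120)
print(two_minute_speeds)  # {0: 42.0, 1: 32.0}
```

Averaging is only one possible choice; a median or a distance-weighted mean would fit the same interface.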

In the space dimension, the selected trajectory is viewed as a sequence of dots with inner states, including vehicle position, average speed, etc. These dots could simply be ordered and mapped linearly onto the $y$-axis, but the result would be high-dimensional and uninformative, because the dots are redundant and large stretches of the sequence are stable and show little variation. Therefore, to make the $y$-axis both compact and informative, the dots are grouped into sections, each representing a similar traffic state. The sections are then ordered spatially with reference to a predefined start point of a road and mapped onto the $y$-axis.

Finally, a time-space matrix can be constructed from the time and space dimension information. Mathematically, we denote the time-space matrix by:
$M=\begin{bmatrix}m_{11}&m_{12}&\cdots&m_{1N}\\m_{21}&m_{22}&\cdots&m_{2N}\\\vdots&\vdots&\ddots&\vdots\\m_{Q1}&m_{Q2}&\cdots&m_{QN}\end{bmatrix}$

where $N$ is the number of time intervals and $Q$ is the number of road sections; the $i$th column vector of $M$ holds the traffic speeds of the transportation network at time $i$; and pixel $m_{ij}$ is the average traffic speed on section $i$ at time $j$. Matrix $M$ forms a channel of the image. Figure 1 illustrates the relations among the raw averaged floating car speeds, the time-space matrix, and the final image.
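The construction of $M$ can be sketched in a few lines. The example below assumes that each observation has already been assigned to a road section and a time interval (0-based indices here, whereas the text counts from 1); the function name `build_time_space_matrix`, the record layout, and the handling of empty cells are illustrative assumptions.

```python
def build_time_space_matrix(records, num_sections, num_intervals):
    """Build a Q x N matrix of mean speeds from (section, interval, speed) records.

    Cells without observations are left as None, because the text does not
    say how missing values are filled.
    """
    sums = [[0.0] * num_intervals for _ in range(num_sections)]
    counts = [[0] * num_intervals for _ in range(num_sections)]
    for section, interval, speed in records:
        sums[section][interval] += speed
        counts[section][interval] += 1
    return [
        [sums[i][j] / counts[i][j] if counts[i][j] else None
         for j in range(num_intervals)]
        for i in range(num_sections)
    ]

# Hypothetical observations: (section index, interval index, speed in km/h).
records = [(0, 0, 50.0), (0, 0, 54.0), (0, 1, 48.0), (1, 0, 20.0), (1, 1, 25.0)]
M = build_time_space_matrix(records, num_sections=2, num_intervals=2)
print(M)  # [[52.0, 48.0], [20.0, 25.0]]
```

Each row of the result is one road section and each column one time interval, matching the layout of $M$.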

Figure 1. An illustration of the traffic-to-image conversion on a network.

2.2. CNN for Network Traffic Prediction

2.2.1. CNN Characteristics

The CNN has exhibited a significant learning ability in image understanding because of its unique method of extracting critical features from images. Compared to other deep learning architectures, two salient characteristics make the CNN unique: (a) locally-connected layers, in which each output neuron is connected only to nearby input neurons rather than to all input neurons, as in fully-connected layers. These layers can extract features from an image effectively, because every layer attempts to retrieve a different feature relevant to the prediction problem [31]; and (b) a pooling mechanism, which greatly reduces the number of parameters required to train the CNN while guaranteeing that the most important features are preserved.

Sharing these two salient characteristics, the CNN is modified in the following aspects to adapt to the context of transportation. First, the model inputs are different: the input images have only one channel, whose values are the traffic speeds of all roads in a transportation network, and the pixel values range from zero to the maximum traffic speed or speed limit of the network. In contrast, in the image classification problem, the input images commonly have three channels (RGB), and pixel values range from 0 to 255. Despite these differences, the model inputs are normalized in both settings to keep the model weights from making training more difficult. Second, the model outputs are different. In the context of transportation, the model outputs are predicted traffic speeds on all road sections of a transportation network, whereas in the image classification problem, the model outputs are image class labels. Third, the abstract features have different meanings. In the context of transportation, the abstract features extracted by the convolutional and pooling layers are relations among road sections regarding traffic speeds. In the image classification problem, depending on the training objective, the abstract features can be shallow image edges or deep shapes of objects. All of these abstract features are significant for a prediction problem [36]. Fourth, the training objectives differ because the model outputs are distinct. In the context of transportation, the outputs are continuous traffic speeds, so continuous cost functions should be adopted. In the image classification problem, cross-entropy cost functions are usually used.
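The text states that the inputs are normalized but does not give the scheme. One common choice, assumed here purely for illustration, is to divide every speed by the maximum speed (or speed limit) of the network so that pixel values fall in $[0, 1]$; the name `normalize_speeds` is hypothetical.

```python
def normalize_speeds(matrix, max_speed):
    """Scale every speed into [0, 1] by dividing by max_speed (an assumed scheme)."""
    return [[v / max_speed for v in row] for row in matrix]

# Hypothetical speeds in km/h on a network whose maximum speed is 120 km/h.
speeds = [[30.0, 60.0], [0.0, 120.0]]
normalized = normalize_speeds(speeds, max_speed=120.0)
print(normalized)  # [[0.25, 0.5], [0.0, 1.0]]
```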

2.2.2. CNN Characteristics in the Context of Transportation

Figure 2 shows the structure of CNN in the context of transportation with four main parts, that is, model input, traffic feature extraction, prediction, and model output. Each of the parts is explained below.

First, the model input is the image generated from a transportation network with spatiotemporal characteristics. Let the lengths of the input and output time intervals be $F$ and $P$, respectively. The model input can be written as:
$x^i=[m_i,m_{i+1},\dots,m_{i+F-1}],\ i\in[1,N-P-F+1]$
where $i$ is the sample index, $N$ is the number of time intervals, and $m_i$ is a column vector representing the traffic speeds of all roads in a transportation network within one time unit.
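The sample construction can be sketched as a sliding window over the columns of the time-space matrix. Since $F$ is the input length and $P$ the output length, each input is taken to span $F$ consecutive columns; using the next $P$ columns as the target is an assumption, as the target formula is not shown in this section. The name `make_samples` is hypothetical, and indices are 0-based here.

```python
def make_samples(matrix, F, P):
    """Split a Q x N matrix into (input, target) pairs of column windows.

    Sample i uses columns i .. i+F-1 as input and the following P columns
    as target, which yields N - P - F + 1 samples.
    """
    n_cols = len(matrix[0])
    samples = []
    for i in range(n_cols - P - F + 1):
        x = [row[i:i + F] for row in matrix]
        y = [row[i + F:i + F + P] for row in matrix]
        samples.append((x, y))
    return samples

# Toy matrix with Q = 2 sections and N = 5 time intervals.
M = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10]]
samples = make_samples(M, F=3, P=1)
print(len(samples))  # 2
print(samples[0])    # ([[1, 2, 3], [6, 7, 8]], [[4], [9]])
```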

Second, the extraction of traffic features combines convolutional and pooling layers and is the core part of the CNN model. The pooling procedure is denoted by $\mathrm{pool}$, and the depth of the CNN by $L$. Denote the input, output, and parameters of the $l$th layer by $x_l^j$, $o_l^j$, and $(W_l^j,b_l^j)$, respectively, where $j$ is the channel index accounting for the multiple convolutional filters in the convolutional layer. The number of convolutional filters in the $l$th layer is denoted by $c_l$. The output of the first convolutional and pooling layers can be written as:
$o_1^j=\mathrm{pool}(\sigma(W_1^j x_1^j+b_1^j)),\ j\in[1,c_1]$
where $\sigma$ is the activation function, which is discussed in the next section. The output of the $l$th $(l \neq 1,\ l = 1, \dots, L)$ convolutional and pooling layers can be written as:
$o_l^j=\mathrm{pool}(\sigma(\sum_{k=1}^{c_{l-1}}(W_l^j x_l^k+b_l^j))),\ j\in[1,c_l]$

The extraction of traffic features has the following characteristics: (a) convolution and pooling are performed in two dimensions, so this part can learn the spatiotemporal relations among road sections with respect to the prediction task during model training; (b) unlike the layers in Figure 2, which show only four convolutional or pooling filters, layers in real applications are set to have hundreds of filters, which means hundreds of features can be learned by a CNN; and (c) a CNN transforms the model input into deep features through these layers.

In the model prediction, the features learned and output by the traffic feature extraction are concatenated into a dense vector that contains the final, highest-level features of the input transportation network. The dense vector can be written as:
$o_L^{flatten}=\mathrm{flatten}([o_L^1,o_L^2,\dots,o_L^j]),\ j=c_L$
where $L$ is the depth of the CNN and $\mathrm{flatten}$ is the concatenating procedure discussed above.

Finally, the vector is transformed into the model outputs through a fully connected layer. The model output can thus be written as:
$\begin{aligned}\hat{y}&=W_f o_L^{flatten}+b_f\\ &=W_f(\mathrm{flatten}(\mathrm{pool}(\sigma(\sum_{k=1}^{c_{L-1}}(W_L^j x_L^k+b_L^j)))))+b_f\end{aligned}$
where $W_f$ and $b_f$ are the parameters of the fully connected layer, and $\hat{y}$ denotes the predicted network-wide traffic speeds.
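The last two steps, flattening and the fully connected layer, can be sketched as below. The helper names `flatten` and `dense`, the tiny feature maps, and the weight values are all made up for illustration; in a trained model $W_f$ and $b_f$ would be learned.

```python
def flatten(feature_maps):
    """Concatenate a list of 2-D feature maps into one flat vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(W, b, x):
    """Fully connected layer: one weighted sum plus bias per output."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

# Two hypothetical 1 x 2 feature maps from the last pooling layer (c_L = 2).
feature_maps = [[[1.0, 2.0]], [[3.0, 4.0]]]
o_flat = flatten(feature_maps)

# Illustrative parameters mapping 4 features to 2 predicted speeds.
W_f = [[1.0, 0.0, 0.0, 1.0],
       [0.5, 0.5, 0.5, 0.5]]
b_f = [0.0, 1.0]
y_hat = dense(W_f, b_f, o_flat)
print(o_flat)  # [1.0, 2.0, 3.0, 4.0]
print(y_hat)   # [5.0, 6.0]
```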

Figure 2. Deep learning architecture of CNN in the context of transportation.

2.2.3. Convolutional Layers and Pooling Layers of the CNN

Before discussing the individual layers, it should be noted that each layer is activated by an activation function. The benefits of employing an activation function are as follows: (a) the activation function transforms the output to a manageable, scaled data range, which is beneficial to model training; and (b) the composition of activation functions across layers can mimic very complex nonlinear functions, making the CNN powerful enough to handle the complexity of a transportation network. In this study, the ReLU function is applied and defined as follows:
$g_1 (x)=\begin{cases}x,&\text{if}\ x>0\\0,&\text{otherwise}\end{cases}$
Convolutional layers differ from those of a traditional feedforward neural network, in which each input neuron is connected to each output neuron and the network is fully connected (fully-connected layer). The CNN applies convolutional filters over its input layer and obtains local connections, in which only local input neurons are connected to each output neuron (convolutional layer). Hundreds of filters are sometimes applied to the input, and the results are merged in each layer. One filter can extract one traffic feature from the input layer; thus, hundreds of filters can extract hundreds of traffic features. The extracted traffic features are combined further to extract higher-level and more abstract traffic features. This process reflects the compositionality of the CNN, meaning that each filter composes a local path from lower-level to higher-level features. When one convolutional filter $W_l^r$ is applied to the input, the output can be formulated as:
$y_{conv}=\sum_{e=1}^{m}\sum_{f=1}^{n}((W_l^r)_{ef}\,d_{ef})$
where $m$ and $n$ are the two dimensions of the filter, $d_{ef}$ is the data value of the input matrix at position $(e, f)$, $(W_l^r)_{ef}$ is the coefficient of the convolutional filter at position $(e, f)$, and $y_{conv}$ is the output.
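The formula for $y_{conv}$ gives the output at a single filter position. The sketch below slides the filter over the whole input and then applies ReLU; stride 1 and no padding are assumptions, since the text does not specify them, and the input and filter values are invented for the example.

```python
def conv2d_valid(data, kernel):
    """Apply y_conv at every position of the kernel (stride 1, no padding)."""
    m, n = len(kernel), len(kernel[0])
    rows, cols = len(data) - m + 1, len(data[0]) - n + 1
    return [
        [sum(kernel[e][f] * data[r + e][c + f]
             for e in range(m) for f in range(n))
         for c in range(cols)]
        for r in range(rows)
    ]

def relu(x):
    """ReLU activation: x if x > 0, otherwise 0."""
    return x if x > 0 else 0

# Toy 3 x 3 input and 2 x 2 filter.
data = [[1, 2, 0],
        [0, 3, 1],
        [2, 0, 4]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(data, kernel)
activated = [[relu(v) for v in row] for row in feature_map]
print(feature_map)  # [[-2, 1], [0, -1]]
print(activated)    # [[0, 1], [0, 0]]
```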

Pooling layers are designed to downsample and aggregate data because they extract only salient values from a specific region. The pooling layers make the CNN locally invariant, which means that the CNN can extract the same feature from the input despite small feature shifts, rotations, or scale changes [36]. As a result, the pooling layers not only reduce the network scale of the CNN but also identify the most prominent features of the input layers. Taking the maximum operation as an example, the pooling layer can be formulated as:
$y_{pool}=\max(d_{ef}),\ e\in[1,\dots,p],\ f\in[1,\dots,q]$
where $p$ and $q$ are two dimensions of pooling window size, $d_{ef}$ is the data value of the input matrix at positions $e$ and $f$ , and $y_{pool}$ is the pooling output.
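Max pooling over a whole feature map can be sketched as follows. Non-overlapping windows are assumed (stride equal to the window size), which the text does not state; the name `max_pool2d` and the input values are illustrative.

```python
def max_pool2d(data, p, q):
    """Non-overlapping p x q max pooling.

    Trailing rows or columns that do not fill a whole window are dropped.
    """
    rows, cols = len(data) // p, len(data[0]) // q
    return [
        [max(data[r * p + e][c * q + f] for e in range(p) for f in range(q))
         for c in range(cols)]
        for r in range(rows)
    ]

# Toy 4 x 4 input pooled with a 2 x 2 window.
data = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 4, 6]]
pooled = max_pool2d(data, p=2, q=2)
print(pooled)  # [[4, 5], [3, 7]]
```

The 4 x 4 input shrinks to 2 x 2 while the largest value of each window survives, which is the downsampling effect described above.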

2.2.4. CNN Optimization

The predictions of the CNN are traffic speeds on different road sections, and the mean squared error (MSE) is employed to measure the distance between the predictions and the ground-truth traffic speeds. Minimizing the MSE is thus taken as the training goal of the CNN. The MSE can be written as:
$MSE=\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2$
where $\hat{y}_i$ and $y_i$ are the $i$th predicted and ground-truth traffic speeds, and $n$ is the number of predicted values.
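The MSE computation is short enough to write out directly. The speeds below are invented values used only to show the arithmetic.

```python
def mse(predicted, actual):
    """Mean squared error between two equally long sequences of speeds."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predicted and ground-truth speeds (km/h) on three road sections.
y_hat = [50.0, 42.0, 30.0]
y = [48.0, 45.0, 30.0]
error = mse(y_hat, y)
print(error)  # (4 + 9 + 0) / 3, about 4.33
```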
Let the set of model parameters be $\Theta=(W_l^j,b_l^j,W_f,b_f)$. The optimal values of $\Theta$ can be determined with the standard backpropagation algorithm, as in other studies on CNNs [31], [36]:
$\begin{aligned}\Theta&=\arg\min_{\Theta}\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)^2\\ &=\arg\min_{\Theta}\frac{1}{n}\Vert W_f o_L^{flatten}+b_f-y\Vert^2\\ &=\arg\min_{\Theta}\frac{1}{n}\Vert W_f(\mathrm{flatten}(\mathrm{pool}(\sigma(\sum_{k=1}^{c_{L-1}}(W_L^j x_L^k+b_L^j)))))+b_f-y\Vert^2\end{aligned}$
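As one worked step of backpropagation, the gradients of the objective with respect to the fully connected layer follow from the second line of the derivation. This is a sketch for a single sample that treats $o_L^{flatten}$ as fixed. The final line is a plain gradient-descent update, assumed here for illustration; the learning rate $\eta$ is a symbol that does not appear in the original text:
$\begin{aligned}\frac{\partial}{\partial W_f}\frac{1}{n}\Vert W_f o_L^{flatten}+b_f-y\Vert^2&=\frac{2}{n}(\hat{y}-y)(o_L^{flatten})^{T}\\ \frac{\partial}{\partial b_f}\frac{1}{n}\Vert W_f o_L^{flatten}+b_f-y\Vert^2&=\frac{2}{n}(\hat{y}-y)\\ W_f&\leftarrow W_f-\eta\,\frac{2}{n}(\hat{y}-y)(o_L^{flatten})^{T}\end{aligned}$
Gradients for the earlier convolutional layers are obtained in the same way by applying the chain rule through $\mathrm{flatten}$, $\mathrm{pool}$, and $\sigma$.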

References

31. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012.

36. LeCun, Y.; Bengio, Y. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1998; Volume 3361, pp. 255–258.

[Paper reading] Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction (2)