From 1D Convolution, Causal Convolution, and Dilated Convolution to the Temporal Convolutional Network (TCN)

Source: AINLPer WeChat public account (daily practical content!)
Editor: ShuYini
Proofreading: ShuYini
Time: 2022-09-30

Introduction

Convolutional neural networks (CNNs), although often associated with image classification tasks, can also be adapted for sequence modeling and prediction. In this article, we explore in detail the fundamental building blocks that Temporal Convolutional Networks (TCNs) consist of and how they fit together to form a powerful predictive model. This article's description of Temporal Convolutional Networks (TCN) is based on the following paper: https://arxiv.org/pdf/1803.01271.pdf


Background introduction

Until recently, the topic of sequence modeling in the context of deep learning was largely associated with recurrent neural network architectures such as LSTMs and GRUs. S. Bai et al. show that this way of thinking is outdated and that convolutional networks should be considered one of the primary candidates for modeling sequence data. They demonstrate that convolutional networks can achieve better performance than RNNs on many tasks while avoiding common drawbacks of recurrent models, such as the exploding/vanishing gradient problem or a lack of memory retention. In addition, using a convolutional network can improve performance because it allows the outputs to be computed in parallel. Their proposed architecture is called the Temporal Convolutional Network (TCN) and is explained in the following sections.

Convolutional model

TCN stands for Temporal Convolutional Network, which consists of dilated, causal 1D convolutional layers with the same input and output length. The following sections explain what these terms actually mean.

1D Convolutional Network

A 1D convolutional network takes a 3D tensor as input and outputs a 3D tensor. Our TCN implementation takes input tensors of shape (batch_size, input_length, input_size) and produces output tensors of shape (batch_size, input_length, output_size). Since every layer in a TCN has the same input and output length, only the third dimension of the input and output tensors differs. In the univariate case, both input_size and output_size are equal to 1. In the more general multivariate case, input_size and output_size may differ, since we may not want to predict every component of the input sequence.
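As a minimal illustration of these shapes (assuming PyTorch, which the article does not prescribe; the dimensions are made up), note that nn.Conv1d expects channels before the length dimension, so tensors in the (batch_size, input_length, channels) convention used here have to be transposed:

```python
import torch
import torch.nn as nn

batch_size, input_length, input_size, output_size = 8, 32, 3, 1

# A batch of multivariate series in the (batch_size, input_length, input_size) convention.
x = torch.randn(batch_size, input_length, input_size)

# nn.Conv1d expects (batch, channels, length), so we transpose before and after.
# Symmetric padding is used here only to keep the length; causal padding is discussed below.
conv = nn.Conv1d(in_channels=input_size, out_channels=output_size,
                 kernel_size=3, padding=1)
y = conv(x.transpose(1, 2)).transpose(1, 2)

print(y.shape)  # torch.Size([8, 32, 1]) == (batch_size, input_length, output_size)
```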

 A single 1D convolutional layer takes an input tensor of shape (batch_size, input_length, nr_input_channels) and outputs a tensor of shape (batch_size, input_length, nr_output_channels). To understand how a single layer transforms its input into its output, let's look at one element in the batch (the same process happens for each element in the batch). Let's start with the simplest case, where both nr_input_channels and nr_output_channels are equal to 1. In this case we are looking at 1D input and output tensors. The figure below shows how one element of the output tensor is computed.

You can see that to compute one element of the output, we look at a sequence of consecutive elements of length kernel_size in the input. In the example above, we chose a kernel_size of 3. To obtain the output element, we take the dot product of that subsequence of the input with a kernel vector of the same length containing the learned weights. To get the next element of the output, the same procedure is applied, but the window of size kernel_size is shifted one element to the right (the stride is always set to 1 for this prediction model). Note that the same set of kernel weights is used to compute every output element of a convolutional layer. The figure below shows two consecutive output elements and their respective input subsequences.

To keep the visualization simple, the dot product with the kernel vector is no longer shown, but it is performed for every output element using the same kernel weights.
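To make the computation concrete, here is a small sketch of the sliding dot product (the input values and kernel weights are invented for illustration):

```python
import torch

x = torch.tensor([1.0, 2.0, 0.0, -1.0, 3.0])     # input sequence
kernel = torch.tensor([0.5, -1.0, 2.0])          # learned weights, kernel_size = 3
kernel_size = kernel.shape[0]

# Each output element is the dot product of one input window with the kernel
# (stride 1, no padding yet, so the output is shorter than the input).
out = torch.stack([torch.dot(x[i:i + kernel_size], kernel)
                   for i in range(len(x) - kernel_size + 1)])
print(out)  # tensor([-1.5000, -1.0000,  7.0000])
```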

To ensure that the output sequence has the same length as the input sequence, some zero padding is applied. This means adding extra zero-valued entries to the beginning or end of the input tensor to ensure the output has the desired length. How exactly this is done is explained later.

Now let's look at the case where we have multiple input channels, i.e. nr_input_channels is greater than 1. In this case, the above process is repeated for each input channel, but each time with a different kernel. This yields nr_input_channels intermediate output vectors and kernel_size * nr_input_channels kernel weights. Then, all intermediate output vectors are summed to obtain the final output vector. In a sense, this is equivalent to a 2D convolution with an input tensor of shape (input_length, nr_input_channels) and a kernel of shape (kernel_size, nr_input_channels), as shown in the figure below. It is still one-dimensional because the window moves along a single axis only, but we have a 2D convolution at each step because we are using a 2D kernel matrix.

For this example, we chose nr_input_channels equal to 2. Instead of a 1D kernel sliding over a 1D input sequence, we now have a (kernel_size, nr_input_channels) kernel matrix sliding along an input series of length input_length and width nr_input_channels.
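A short sketch of this summation over input channels (assuming PyTorch; the random input and kernel values are just for illustration):

```python
import torch
import torch.nn.functional as F

nr_input_channels, input_length, kernel_size = 2, 6, 3

x = torch.randn(1, nr_input_channels, input_length)      # (batch, channels, length)
weight = torch.randn(1, nr_input_channels, kernel_size)  # a single output channel

# Built-in 1D convolution over both input channels at once.
out = F.conv1d(x, weight)

# The same result obtained manually: one 1D convolution per input channel, then summed.
manual = sum(F.conv1d(x[:, c:c + 1, :], weight[:, c:c + 1, :])
             for c in range(nr_input_channels))

print(torch.allclose(out, manual))  # True
```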

If both nr_input_channels and nr_output_channels are greater than 1, the above process is simply repeated for each output channel with a different kernel matrix. The output vectors are then stacked on top of each other to form an output tensor of shape (input_length, nr_output_channels). In this case, the number of kernel weights equals kernel_size * nr_input_channels * nr_output_channels.

The two variables nr_input_channels and nr_output_channels depend on the position of the layer in the network: the first layer uses nr_input_channels = input_size, the last layer uses nr_output_channels = output_size, and all other layers use an intermediate channel count given by num_filters.

Causal convolution

For a causal convolutional layer, for every $i \in \{0, \ldots, \text{input\_length} - 1\}$, the $i$-th element of the output sequence may depend only on the elements of the input sequence with indices $\{0, \ldots, i\}$. In other words, an element of the output sequence can only depend on elements at or before its own position in the input sequence. As mentioned before, to ensure that the output tensor has the same length as the input tensor, we need to apply zero padding. If we apply zero padding only on the left side of the input tensor, causal convolution is guaranteed. To see this, consider the rightmost output element. Since there is no padding on the right side of the input sequence, its rightmost dependency is the last element of the input. Now consider the second-to-last element of the output sequence. Its kernel window is shifted one position to the left compared to that of the last output element, which means its rightmost dependency in the input sequence is the second-to-last element of the input sequence. By induction, for every element of the output sequence, its rightmost dependency in the input sequence has the same index as the element itself. The figure below shows an example with an input_length of 4 and a kernel_size of 3.

We can see that with a left zero padding of 2 entries we can achieve the same output length while respecting the causality rule. In fact, in the absence of dilation, the number of zero-padding entries required to maintain the input length is always equal to kernel_size - 1.
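A minimal sketch of such a causal convolution (assuming PyTorch; CausalConv1d is a hypothetical helper, not an API of the paper or of any particular library): the input is padded on the left with kernel_size - 1 zeros, so each output element depends only on current and earlier inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that pads only on the left, preserving causality and length."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.left_padding = kernel_size - 1
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):                       # x: (batch, channels, length)
        x = F.pad(x, (self.left_padding, 0))    # zero-pad the last dimension on the left only
        return self.conv(x)

layer = CausalConv1d(1, 1, kernel_size=3)
x = torch.randn(1, 1, 4)                        # input_length = 4, as in the figure
print(layer(x).shape)                           # torch.Size([1, 1, 4]): length preserved
```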

Dilated convolution

A desirable property of a predictive model is that the value of a particular entry in the output depends on all previous entries in the input, i.e. all entries with an index less than or equal to its own. This is achieved when the receptive field (i.e. the set of input entries that affect a particular output entry) has size input_length. We also call this "full history coverage". As we saw earlier, one traditional convolutional layer makes an entry in the output depend on the kernel_size entries of the input whose indices are less than or equal to its own. For example, if kernel_size is 3, the 5th element of the output will depend on elements 3, 4, and 5 of the input. This range expands when we stack multiple layers on top of each other. In the figure below, we can see that by stacking two layers with kernel_size 3, we get a receptive field of size 5.

More generally, a 1D convolutional network with $n$ layers and kernel size $k$ has a receptive field of size $r = 1 + n \cdot (k - 1)$.
To know how many layers are needed for full coverage, we can set the receptive field size equal to the input length $l$ and solve for the number of layers $n$ (rounding up if the result is not an integer): $n = \left\lceil \frac{l - 1}{k - 1} \right\rceil$.
This means that, given a fixed kernel_size, the number of layers required for complete coverage is linear in the length of the input tensor, which leads to networks that become very deep very quickly, resulting in models with a very large number of parameters that take longer to train. Moreover, as the number of layers grows, the vanishing gradient problem becomes more likely. One way to increase the size of the receptive field while keeping the number of layers relatively small is to introduce dilation into the convolutional network.
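Before moving on to dilation, here is a quick numeric check of this linear growth (a hypothetical helper using the formula above):

```python
import math

def layers_for_full_coverage(input_length: int, kernel_size: int) -> int:
    """Smallest n such that 1 + n * (k - 1) >= input_length (no dilation)."""
    return math.ceil((input_length - 1) / (kernel_size - 1))

for l in (16, 128, 1024):
    print(l, layers_for_full_coverage(l, kernel_size=3))
# 16 8
# 128 64
# 1024 512  -> the required depth grows linearly with the input length
```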

Dilation in the context of a convolutional layer refers to the distance between the elements of the input sequence that are used to compute one entry of the output sequence. Thus, a traditional convolutional layer can be seen as a 1-dilated layer, since the input elements used for one output value are adjacent. The figure below shows an example of a 2-dilated layer with an input_length of 4 and a kernel_size of 3.

Compared to the 1-dilated case, the receptive field of this layer spreads over a length of 5 instead of 3. More generally, the receptive field of a d-dilated layer with kernel size $k$ spreads over a length of $1 + d \cdot (k - 1)$. If $d$ is fixed, the number of layers required for full receptive field coverage is still linear in the length of the input tensor.

This problem can be solved by increasing the value of $d$ exponentially as we move up through the layers. To do this, we choose a constant dilation_base integer $b$, which lets us compute the dilation $d$ of a particular layer as a function of the number of layers $i$ below it, namely $d = b^{i}$. The figure below shows a network with input_length 10, kernel_size 3, and dilation_base 2, which results in 3 dilated convolutional layers for full coverage.

Here, we only show the influence of the input on the last output value. Likewise, only the zero-padded entries required for the last output value are shown. Clearly, the last output value depends on the entire input. In fact, given these hyperparameters, an input_length of up to 15 could be used while maintaining full receptive field coverage. In general, each additional layer adds a width of $d \cdot (k - 1)$ to the current receptive field, where $d = b^{i}$ and $i$ denotes the number of layers below the new layer. Therefore, the receptive field width $w$ of a TCN with dilation base $b$, kernel size $k$, and $n$ layers is given by

$$w = 1 + \sum_{i=0}^{n-1}(k-1)\cdot b^{i} = 1 + (k-1)\cdot\frac{b^{n}-1}{b-1}$$

However, depending on the values of $b$ and $k$, this receptive field may have "holes". Consider the following network with a dilation_base of 3 and a kernel_size of 2:

The receptive field does cover a range that is larger than the input size (namely 15). However, the receptive field has holes; that is, there are entries in the input sequence (shown in red above) that the output value does not depend on. To fix these "holes", we either need to increase the kernel size to 3 or decrease the dilation base to 2. In general, for a receptive field without holes, the kernel size $k$ must be at least as large as the dilation base $b$.

Taking these observations into account, we can compute how many layers our network needs for full history coverage. Given a kernel size $k$, a dilation base $b$ with $k \geq b$, and an input length $l$, the following inequality must hold for full history coverage:

$$1 + (k-1)\cdot\frac{b^{n}-1}{b-1} \geqslant l$$

Solving for $n$, the minimum number of layers required is

$$n = \left\lceil \log_{b}\left(\frac{(l-1)\cdot(b-1)}{(k-1)}+1\right) \right\rceil$$

We can see that the number of layers is now logarithmic rather than linear in the input length. This is a significant improvement that is achieved without sacrificing receptive field coverage.
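Both formulas can be checked with a couple of hypothetical helper functions (the names are invented here):

```python
import math

def receptive_field(kernel_size: int, dilation_base: int, num_layers: int) -> int:
    """w = 1 + sum_{i=0}^{n-1} (k - 1) * b**i, i.e. dilation b**i in layer i."""
    return 1 + sum((kernel_size - 1) * dilation_base ** i for i in range(num_layers))

def min_layers(input_length: int, kernel_size: int, dilation_base: int) -> int:
    """Smallest n such that the receptive field covers the whole input (assumes k >= b)."""
    l, k, b = input_length, kernel_size, dilation_base
    return math.ceil(math.log((l - 1) * (b - 1) / (k - 1) + 1, b))

print(min_layers(10, kernel_size=3, dilation_base=2))      # 3, as in the example above
print(receptive_field(3, 2, 3))                            # 15 >= 10: full coverage
print(min_layers(1024, kernel_size=3, dilation_base=2))    # 10: logarithmic growth
```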

The only thing that still needs to be specified is the number of zero-padding entries required by each layer. Given a dilation base $b$, a kernel size $k$, and $i$ layers below the current layer, the number $p$ of zero-padding entries required by the current layer is

$$p = b^{i}\cdot(k-1)$$
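Putting the dilation and padding formulas together, a dilated causal convolutional layer might be sketched as follows (assuming PyTorch; this is an illustration with made-up class and parameter names, not the reference implementation from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Causal 1D convolution with dilation d = b**i for the layer with i layers below it."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation_base, layer_index):
        super().__init__()
        dilation = dilation_base ** layer_index
        self.left_padding = dilation * (kernel_size - 1)    # p = b**i * (k - 1)
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                                   # x: (batch, channels, length)
        return self.conv(F.pad(x, (self.left_padding, 0)))

x = torch.randn(1, 1, 10)                                   # input_length = 10
layer = DilatedCausalConv1d(1, 1, kernel_size=3, dilation_base=2, layer_index=1)
print(layer(x).shape)                                       # torch.Size([1, 1, 10])
```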

Basic TCN model

 Given input_length, kernel_size, dilation_base, and the minimum number of layers required for full history coverage, a basic TCN network looks like this:

Forecasting

So far we have only talked about an "input sequence" and an "output sequence", but not about how they relate to each other. In the context of forecasting, we want to predict the next entries of a time series into the future. To train our TCN to make forecasts, the training set consists of (input sequence, target sequence) pairs of equally sized subsequences of the given time series. A target sequence is the sequence shifted forward by output_length time steps relative to its respective input sequence. This means that a target sequence of length input_length contains the last (input_length - output_length) elements of its respective input sequence as its first elements, and the output_length elements that come after the last entry of the input sequence as its last elements. In the context of forecasting, this means that the maximum forecast horizon that can be predicted with such a model is equal to output_length. Using a sliding window approach, many overlapping pairs of input and target sequences can be created from one time series.
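A sketch of how such (input sequence, target sequence) pairs could be built with a sliding window (a hypothetical helper; the toy series is made up):

```python
import torch

def sliding_window_pairs(series: torch.Tensor, input_length: int, output_length: int):
    """Return (input, target) pairs where each target is its input shifted
    forward by output_length time steps."""
    inputs, targets = [], []
    for start in range(len(series) - input_length - output_length + 1):
        inputs.append(series[start : start + input_length])
        targets.append(series[start + output_length : start + output_length + input_length])
    return torch.stack(inputs), torch.stack(targets)

series = torch.arange(10, dtype=torch.float32)          # toy time series 0..9
x, y = sliding_window_pairs(series, input_length=5, output_length=2)
print(x[0], y[0])   # tensor([0., 1., 2., 3., 4.]) tensor([2., 3., 4., 5., 6.])
```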

TCN model upgrade

S. Bai et al. proposed some additions to the basic TCN architecture (namely residual connections, regularization, and activation functions) to improve its performance. The following description of the resulting temporal convolutional network is based on the same paper: https://arxiv.org/pdf/1803.01271.pdf

Residual block

The biggest modification made to the basic TCN model introduced above is to change the basic building block of the model from a simple 1D causal convolutional layer to a residual block consisting of two such layers with the same dilation factor, plus a residual connection.

 Let's consider a layer from the base model with a dilation factor d of 2 and a kernel size k of 3 and see how this translates into the residual block of the improved model. First the following figure:

Then it becomes the following figure:

The outputs of the two convolutional layers are added to the input of the residual block to produce the input of the next block. For all inner blocks of the network, i.e. all but the first and last, the input and output channel widths are the same, namely num_filters. Since the first convolutional layer of the first residual block and the second convolutional layer of the last residual block may have different input and output channel widths, the width of the residual tensor may need to be adjusted, which is done with a 1×1 convolution.

This change affects the calculation of the minimum number of layers required for full coverage. Now we have to consider how many residual blocks are needed to achieve full receptive field coverage. Adding one residual block to a TCN increases the receptive field width twice as much as adding one basic causal layer, since it contains two such layers. Therefore, the total receptive field size $r$ of a TCN with dilation base $b$, kernel size $k$ with $k \geq b$, and $n$ residual blocks can be calculated as

$$r = 1 + \sum_{i=0}^{n-1}2\cdot(k-1)\cdot b^{i} = 1 + 2\cdot(k-1)\cdot\frac{b^{n}-1}{b-1}$$

This leads to the minimum number of residual blocks $n$ for full history coverage of input_length $l$:

$$n = \left\lceil \log_{b}\left(\frac{(l-1)\cdot(b-1)}{(k-1)\cdot 2}+1\right) \right\rceil$$
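The same calculation, with each residual block contributing two dilated causal layers, as a hypothetical helper:

```python
import math

def min_residual_blocks(input_length: int, kernel_size: int, dilation_base: int) -> int:
    """Smallest n such that 1 + 2*(k-1)*(b**n - 1)/(b - 1) >= input_length (assumes k >= b)."""
    l, k, b = input_length, kernel_size, dilation_base
    return math.ceil(math.log((l - 1) * (b - 1) / (2 * (k - 1)) + 1, b))

print(min_residual_blocks(100, kernel_size=3, dilation_base=2))  # 5
```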

Activation, Normalization, Regularization

 In order to make TCN more than just an overly complex linear regression model, activation functions need to be added on top of the convolutional layers to introduce nonlinearities. ReLU activations are added to the residual block after the two convolutional layers.
 In order to normalize the input to the hidden layer (which counteracts problems like exploding gradients), weight normalization is applied to each convolutional layer.
 To prevent overfitting, regularization is introduced via dropout after each convolutional layer of each residual block. The figure below shows the final residual block.

The asterisk on the second ReLU unit in the diagram above indicates that it is present in every block except the last, since we want the final output to be able to take on negative values as well (this differs from the architecture outlined in the paper).
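A sketch of such a residual block (assuming PyTorch, with weight normalization, ReLU, and dropout arranged as described above and a 1×1 convolution when channel widths differ; this is an illustration, not the paper's reference code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm

class ResidualBlock(nn.Module):
    """Two dilated causal convolutions with weight norm, ReLU, dropout and a residual connection."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation,
                 dropout=0.2, last=False):
        super().__init__()
        self.left_padding = dilation * (kernel_size - 1)
        self.conv1 = weight_norm(nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation))
        self.conv2 = weight_norm(nn.Conv1d(out_channels, out_channels, kernel_size, dilation=dilation))
        self.dropout = nn.Dropout(dropout)
        # 1x1 convolution to match the channel width of the residual if necessary.
        self.resample = (nn.Conv1d(in_channels, out_channels, 1)
                         if in_channels != out_channels else None)
        self.last = last    # the last block drops its second ReLU (see the asterisk above)

    def forward(self, x):                                   # x: (batch, channels, length)
        y = self.dropout(F.relu(self.conv1(F.pad(x, (self.left_padding, 0)))))
        y = self.conv2(F.pad(y, (self.left_padding, 0)))
        if not self.last:
            y = F.relu(y)
        y = self.dropout(y)
        res = x if self.resample is None else self.resample(x)
        return y + res
```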

Final model

The figure below shows our final TCN model, where l equals input_length, k equals kernel_size, b equals dilation_base, k ≥ b, and n is the minimum number of residual blocks required for full history coverage, which can be computed from the other parameters as shown above.
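As a closing sketch, the residual blocks could be stacked into a complete TCN with exponentially increasing dilation roughly as follows (assuming PyTorch and the hypothetical ResidualBlock sketch from the previous section; num_blocks would be chosen with the formula above):

```python
import torch.nn as nn

class TCN(nn.Module):
    """Stack of residual blocks with exponentially increasing dilation."""
    def __init__(self, input_size, output_size, num_blocks, num_filters,
                 kernel_size=3, dilation_base=2, dropout=0.2):
        super().__init__()
        blocks = []
        for i in range(num_blocks):
            in_ch = input_size if i == 0 else num_filters
            out_ch = output_size if i == num_blocks - 1 else num_filters
            blocks.append(ResidualBlock(in_ch, out_ch, kernel_size,
                                        dilation=dilation_base ** i,
                                        dropout=dropout,
                                        last=(i == num_blocks - 1)))
        self.network = nn.Sequential(*blocks)

    def forward(self, x):                       # x: (batch_size, input_length, input_size)
        return self.network(x.transpose(1, 2)).transpose(1, 2)
```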
