Using LSTM and CNN for Time Series Classification with TensorFlow

Hirofumi Hara

Github project address


Traditional image classification also relied on manual feature engineering. With the advent of deep learning, however, convolutional neural networks handle computer vision tasks remarkably well. Processing images with a CNN requires no manual feature engineering: the network combines basic features layer by layer into progressively more abstract, higher-level features to solve the task.

In this article, we discuss how to classify time series data with deep learning methods. Our case study is the Human Activity Recognition (HAR) dataset from the UCI repository, which contains both the raw time series and a preprocessed version with 561 engineered features. We compare a machine learning approach built on feature engineering with two deep learning methods (a convolutional neural network and a recurrent neural network), and the experiments show that the deep learning methods surpass the traditional feature-engineering approach.

The author implements and trains the models in TensorFlow. Only part of the code is shown in the article; the complete code is available on GitHub.

Convolutional Neural Network (CNN)

The first step is to load the data into a NumPy array with shape (batch_size, seq_len, n_channels), where batch_size is the number of samples the model uses in each SGD iteration, seq_len is the length of the time series (128 here), and n_channels is the number of measurement channels. In this case there are 9 channels: 3 different acceleration measurements along each of the 3 axes. There are six activity labels, i.e., each sample belongs to LAYING, STANDING, SITTING, WALKING_DOWNSTAIRS, WALKING_UPSTAIRS, or WALKING.
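
As a concrete illustration (this loading code is not from the original article), here is a minimal sketch of reading the raw signals into such an array, assuming the standard "Inertial Signals" layout of the UCI HAR dataset:

# Load the nine raw signal channels into an array of shape (N, 128, 9).
# The file names below follow the standard UCI HAR "Inertial Signals" layout;
# adjust the root path to your local copy of the dataset.
import numpy as np

SIGNALS = ["body_acc_x", "body_acc_y", "body_acc_z",
           "body_gyro_x", "body_gyro_y", "body_gyro_z",
           "total_acc_x", "total_acc_y", "total_acc_z"]

def load_signals(split="train", root="UCI HAR Dataset"):
    # Each file holds one channel with shape (N, 128); stack them on the last axis.
    channels = [np.loadtxt(f"{root}/{split}/Inertial Signals/{name}_{split}.txt")
                for name in SIGNALS]
    X = np.stack(channels, axis=-1)                              # (N, 128, 9)
    y = np.loadtxt(f"{root}/{split}/y_{split}.txt").astype(int)  # integer labels 1..6
    return X, y

X_train, y_train = load_signals("train")
X_test, y_test = load_signals("test")
print(X_train.shape)  # (7352, 128, 9) for the standard train split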

Below, we first build the computation graph, where we prepare the input data using placeholders:

graph = tf.Graph()
with graph.as_default():
    inputs_ = tf.placeholder(tf.float32, [None, seq_len, n_channels],
                             name='inputs')
    labels_ = tf.placeholder(tf.float32, [None, n_classes], name='labels')
    keep_prob_ = tf.placeholder(tf.float32, name='keep')
    learning_rate_ = tf.placeholder(tf.float32, name='learning_rate')

Here inputs_ is the input tensor fed into the computation graph; setting its first dimension to None allows the placeholder to accommodate any batch size. labels_ holds the one-hot encoded labels to predict, keep_prob_ is the keep probability for dropout regularization, and learning_rate_ is the learning rate for the Adam optimizer.
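
Since labels_ expects one-hot vectors, the integer activity labels (1 to 6 in the dataset) have to be converted before being fed in. A minimal sketch, reusing the arrays from the loading example above:

# Convert integer labels 1..6 into one-hot rows of shape (N, n_classes).
import numpy as np

def one_hot(y, n_classes=6):
    return np.eye(n_classes)[y - 1]

y_train_oh = one_hot(y_train)  # e.g. label 3 -> [0., 0., 1., 0., 0., 0.]
y_test_oh = one_hot(y_test)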

We build the convolutional layers with 1D kernels that slide over the sequence, whereas images typically use 2D kernels. In sequence tasks these kernels act as learned filters. As in many CNN architectures, the deeper the layer, the larger the number of filters. Each convolution is followed by a pooling layer that reduces the length of the sequence. Below is a simple CNN architecture we can use.

The convolutional layers shown in the figure above can be implemented with the following code:

with graph.as_default():
    # (batch, 128, 9) -> (batch, 32, 18)
    conv1 = tf.layers.conv1d(inputs=inputs_, filters=18, kernel_size=2, strides=1,
                             padding='same', activation=tf.nn.relu)
    max_pool_1 = tf.layers.max_pooling1d(inputs=conv1, pool_size=4, strides=4, padding='same')

    # (batch, 32, 18) -> (batch, 8, 36)
    conv2 = tf.layers.conv1d(inputs=max_pool_1, filters=36, kernel_size=2, strides=1,
                             padding='same', activation=tf.nn.relu)
    max_pool_2 = tf.layers.max_pooling1d(inputs=conv2, pool_size=4, strides=4, padding='same')

    # (batch, 8, 36) -> (batch, 2, 72)
    conv3 = tf.layers.conv1d(inputs=max_pool_2, filters=72, kernel_size=2, strides=1,
                             padding='same', activation=tf.nn.relu)
    max_pool_3 = tf.layers.max_pooling1d(inputs=conv3, pool_size=4, strides=4, padding='same')

Once we reach the last layer, we need to flatten the tensor and feed it into a classifier with the appropriate number of units (144 in the figure above). The classifier then outputs logits, which are used for two things:

  • Computing the softmax cross-entropy, the standard loss function for multi-class problems.
  • Predicting the class label from the maximum probability, as well as computing the accuracy.

The implementation of this step is shown below:

with graph.as_default():
    # Flatten and add dropout
    flat = tf.reshape(max_pool_3, (-1, 2*72))
    flat = tf.nn.dropout(flat, keep_prob=keep_prob_)

    # Predictions
    logits = tf.layers.dense(flat, n_classes)

    # Cost function and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=labels_))
    optimizer = tf.train.AdamOptimizer(learning_rate_).minimize(cost)

    # Accuracy
    correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')

The rest of the implementation is fairly standard; readers can find the complete code and procedure on GitHub. Having built the computation graph, we feed batches of training data into it and use a validation set to evaluate the training. Finally, the trained model is evaluated on the test set. In this experiment we use batch_size = 600, learning_rate = 0.001, and keep_prob = 0.5. After 500 epochs we obtain a test accuracy of 98%. The figure below shows how training and validation accuracy evolve over the epochs:
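
The training loop itself is only sketched here (it is not the author's exact code): get_batches is a hypothetical helper that yields fixed-size batches, and X_val / y_val_oh are assumed validation arrays prepared in the same way as the training data.

def get_batches(X, y, batch_size=600):
    # Yield consecutive fixed-size batches, dropping the last incomplete one.
    n = (len(X) // batch_size) * batch_size
    for i in range(0, n, batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(500):
        for x_batch, y_batch in get_batches(X_train, y_train_oh, 600):
            feed = {inputs_: x_batch, labels_: y_batch,
                    keep_prob_: 0.5, learning_rate_: 0.001}
            sess.run(optimizer, feed_dict=feed)
        # Evaluate on the validation set with dropout disabled (keep_prob = 1.0).
        val_acc = sess.run(accuracy, feed_dict={inputs_: X_val, labels_: y_val_oh,
                                                keep_prob_: 1.0})
        print("Epoch {:d}, validation accuracy {:.3f}".format(epoch, val_acc))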

Long Short-Term Memory Network (LSTM)

LSTMs are very popular for text data and have achieved remarkable results in sentiment analysis, machine translation, and text generation. Since this problem also involves classifying similar sequences, an LSTM is a good fit.

Below is a neural network architecture that can be used for this problem:

To feed the data into the network, we split the array into 128 chunks (each chunk of the sequence goes into one LSTM cell), each with shape (batch_size, n_channels). A single dense layer then transforms these inputs and feeds them into the LSTM cells; each LSTM cell has dimension lstm_size, which is generally chosen to be larger than the number of channels. This is much like the embedding layer in text applications, where words from a given vocabulary are embedded as vectors. We then need to choose the number of LSTM layers (lstm_layers), which we set to 2 here.

For this implementation, the placeholders can be set up exactly as before. The following code snippet implements the LSTM layers:

with graph.as_default():
    # Construct the LSTM inputs and LSTM cells
    lstm_in = tf.transpose(inputs_, [1, 0, 2])       # reshape into (seq_len, N, channels)
    lstm_in = tf.reshape(lstm_in, [-1, n_channels])  # now (seq_len*N, n_channels)

    # To cells
    lstm_in = tf.layers.dense(lstm_in, lstm_size, activation=None)

    # Open up the tensor into a list of seq_len pieces
    lstm_in = tf.split(lstm_in, seq_len, 0)

    # Add LSTM layers (a fresh cell per layer, so variables are not reused)
    def lstm_cell():
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob_)

    cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(lstm_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)

This code snippet contains an important technical detail. We first reshape the array from (batch_size, seq_len, n_channels) to (seq_len, batch_size, n_channels), so that tf.split can split the data properly (along the 0th axis) into a list of (batch_size, lstm_size) arrays, one per time step. The rest is a standard LSTM setup: building the layers and the initial state.
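
To make the shapes concrete, here is a tiny NumPy-only check of the same transpose/reshape/split logic, using made-up sizes (batch_size = 4, seq_len = 128, n_channels = 9, lstm_size = 27) and a zero matrix standing in for the dense layer:

import numpy as np

x = np.zeros((4, 128, 9))            # (batch_size, seq_len, n_channels)
x = np.transpose(x, (1, 0, 2))       # (seq_len, batch_size, n_channels)
x = x.reshape(-1, 9)                 # (seq_len * batch_size, n_channels)
x = x @ np.zeros((9, 27))            # dense-layer stand-in -> (seq_len * batch_size, lstm_size)
pieces = np.split(x, 128, axis=0)    # list of seq_len arrays
print(len(pieces), pieces[0].shape)  # 128 (4, 27), i.e. (batch_size, lstm_size) per step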

The next step is to implement the forward pass and the cost function of the network. One important technical point is that we add gradient clipping, which prevents exploding gradients during backpropagation and improves training.

Here is the code that defines the forward pass and the cost function:

with graph.as_default():
    outputs, final_state = tf.contrib.rnn.static_rnn(cell, lstm_in, dtype=tf.float32,
                                                     initial_state=initial_state)

    # We only need the last output tensor to pass into a classifier
    logits = tf.layers.dense(outputs[-1], n_classes, name='logits')

    # Cost function and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels_))

    # Gradient clipping
    train_op = tf.train.AdamOptimizer(learning_rate_)
    gradients = train_op.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients]
    optimizer = train_op.apply_gradients(capped_gradients)

    # Accuracy
    correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')

Note that we only use the last element of the output sequence from the top LSTM layer, because we want to predict just one class probability per sequence. The rest is similar to the CNN training above: we simply feed batches into the computation graph. With the hyperparameters lstm_size = 27, lstm_layers = 2, batch_size = 600, learning_rate = 0.0005, and keep_prob = 0.5, we obtain roughly 95% accuracy on the test set. This is somewhat worse than the CNN result but still very good. Other hyperparameter choices may yield better results; readers can grab the source code from GitHub and experiment further.
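
As with the CNN, the full training loop is not reproduced in the article. A minimal sketch under the same assumptions (the hypothetical get_batches helper from the CNN section) could look like the following; note that because initial_state was created with a fixed batch_size, every batch fed to the graph must contain exactly that many samples.

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(500):
        for x_batch, y_batch in get_batches(X_train, y_train_oh, batch_size):
            feed = {inputs_: x_batch, labels_: y_batch,
                    keep_prob_: 0.5, learning_rate_: 0.0005}
            sess.run(optimizer, feed_dict=feed)
        # The fixed zero_state also constrains evaluation to batch_size samples.
        val_acc = sess.run(accuracy, feed_dict={inputs_: X_val[:batch_size],
                                                labels_: y_val_oh[:batch_size],
                                                keep_prob_: 1.0})
        print("Epoch {:d}, validation accuracy {:.3f}".format(epoch, val_acc))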

Comparison with Traditional Methods

The author has also tested several machine learning methods on the dataset with 561 engineered features; the best-performing one was gradient-boosted trees, which reach an accuracy of about 96%. Although the CNN and LSTM architectures achieve roughly the same accuracy as gradient-boosted trees built on feature engineering, they require far less manual work.
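
For reference, here is a hedged sketch of such a baseline on the 561-feature version of the data. The write-up linked below uses xgboost in R; scikit-learn's GradientBoostingClassifier stands in here, so the hyperparameters are illustrative and the exact accuracy will differ.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

root = "UCI HAR Dataset"
X_train_feat = np.loadtxt(f"{root}/train/X_train.txt")         # (7352, 561) engineered features
y_train = np.loadtxt(f"{root}/train/y_train.txt").astype(int)
X_test_feat = np.loadtxt(f"{root}/test/X_test.txt")            # (2947, 561)
y_test = np.loadtxt(f"{root}/test/y_test.txt").astype(int)

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
clf.fit(X_train_feat, y_train)
print("Test accuracy:", clf.score(X_test_feat, y_test))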


Classical machine learning methods for the HAR task: github.com/bhimmetoglu/

Gradient-boosted trees: rpubs.com/burakh/har_xg

Conclusion

In this article, we experimented with classifying time series data using a CNN and an LSTM. Both methods perform very well and, most importantly, they learn distinctive features layer by layer during training, with no need for costly feature engineering.

The sequences used here are fairly short, only 128 steps. Readers may wonder whether training becomes much harder when the sequences grow longer (even beyond 1000 steps). In fact, as the sketch below illustrates, we can combine CNNs and LSTMs to handle such long-sequence tasks better. Overall, deep learning methods hold a clear advantage over traditional approaches.
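
As a rough illustration of that idea (a sketch, not code from the original project), a few 1D convolution and pooling layers can first shrink a long sequence, after which an LSTM runs over the much shorter feature sequence; the sizes below assume a hypothetical 1000-step input.

seq_len, n_channels, n_classes, lstm_size = 1000, 9, 6, 27

graph = tf.Graph()
with graph.as_default():
    inputs_ = tf.placeholder(tf.float32, [None, seq_len, n_channels], name='inputs')
    labels_ = tf.placeholder(tf.float32, [None, n_classes], name='labels')
    keep_prob_ = tf.placeholder(tf.float32, name='keep')

    # CNN front end: two pool-by-4 stages shrink 1000 steps to roughly 63.
    conv = tf.layers.conv1d(inputs_, filters=32, kernel_size=5, padding='same',
                            activation=tf.nn.relu)
    pool = tf.layers.max_pooling1d(conv, pool_size=4, strides=4, padding='same')
    conv = tf.layers.conv1d(pool, filters=64, kernel_size=5, padding='same',
                            activation=tf.nn.relu)
    pool = tf.layers.max_pooling1d(conv, pool_size=4, strides=4, padding='same')

    # LSTM back end over the shortened feature sequence; as before, only the
    # last output feeds the classifier.
    cell = tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(lstm_size),
                                         output_keep_prob=keep_prob_)
    outputs, _ = tf.nn.dynamic_rnn(cell, pool, dtype=tf.float32)
    logits = tf.layers.dense(outputs[:, -1], n_classes)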

