return_sequences, return_state, and TimeDistributed in Keras

Original article: "What are return_sequences and return_state for in Keras?" by 异尘, on Zhihu

Preface

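CNNs and RNNs, the two guardians of deep learning, have driven its sweep through Computer Vision, NLP, and other fields in recent years. Compared with CNNs, RNNs are arguably the more remarkable of the two: their pioneering recurrent structure gives a model "memory", which brings us a step closer to "AI". Transformer architectures have recently been overtaking RNNs, but I still think RNNs are well worth learning and developing.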

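Today, taking LSTM as the example, we look at one specific question about RNNs. The Keras LSTM implementation has two parameters, return_sequences and return_state. What do they actually mean, and in which scenarios are they used?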

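PS: Keras is my favorite deep learning framework. Its API design is remarkably clever and elegant, and François Chollet truly is a master among masters. Compared with traditional TensorFlow and PyTorch, the Keras API is the real "Deep Learning for Human". I am also glad to see TensorFlow 2.0 treating tf.keras as a first-class citizen. I will share my views on these frameworks in a dedicated article later.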

Introduction to LSTM

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

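LSTM was introduced to address problems such as "vanishing gradients" that plain RNNs run into in practice. We skip the internals here and focus on the inputs and outputs of a single LSTM cell.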

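As the figure above shows, a single LSTM cell actually has two outputs: h(t) and c(t). h(t) is called the hidden state and c(t) the cell state. I think this naming is unfortunate: anyone familiar with fully connected networks is likely to confuse h(t) with a hidden layer. In fact h(t) is the LSTM's real output, while c(t) is its internal "hidden" state.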

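Now unroll the LSTM over time. At each timestep t, the cell takes an input x(t), updates c(t), and outputs h(t). At the next timestep, x(t+1), h(t), and c(t) are fed into the cell, which outputs h(t+1) and c(t+1), as shown in the figure below.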
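The recurrence described above can be sketched in plain NumPy. This is a minimal illustration of the standard LSTM equations, not the Keras implementation: the weight names W, U, b, the gate ordering, and the random initialization are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep: consumes x(t), h(t-1), c(t-1); returns h(t), c(t).

    W: (features, 4*units), U: (units, 4*units), b: (4*units,).
    The four blocks are taken as input, forget, candidate, and output
    (this ordering is an assumption of the sketch).
    """
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:, 0 * units:1 * units])   # input gate
    f = sigmoid(z[:, 1 * units:2 * units])   # forget gate
    g = np.tanh(z[:, 2 * units:3 * units])   # candidate cell values
    o = sigmoid(z[:, 3 * units:4 * units])   # output gate
    c_t = f * c_prev + i * g                 # update the cell state
    h_t = o * np.tanh(c_t)                   # hidden state, i.e. the cell's output
    return h_t, c_t

def run_lstm(x, W, U, b):
    """Unroll the cell over an input of shape (batch, timesteps, features)."""
    batch, timesteps, _ = x.shape
    units = U.shape[0]
    h = np.zeros((batch, units))
    c = np.zeros((batch, units))
    all_h = []
    for t in range(timesteps):
        h, c = lstm_step(x[:, t, :], h, c, W, U, b)
        all_h.append(h)
    # every h(t), plus the final h and c
    return np.stack(all_h, axis=1), h, c

rng = np.random.default_rng(0)
features, units = 1, 1
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

x = np.array([0.1, 0.2, 0.3]).reshape(1, 3, 1)
seq_h, last_h, last_c = run_lstm(x, W, U, b)
print(seq_h.shape, last_h.shape, last_c.shape)  # (1, 3, 1) (1, 1) (1, 1)
```

The three return values of run_lstm correspond to the three things an LSTM layer can hand back: h(t) at every timestep, the final h(t), and the final c(t).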

So return_sequences and return_state in Keras are all about h(t) and c(t).

Return Sequences

Next, some hands-on code to see exactly what these two parameters do.

Experiment 1

In this experiment return_sequences and return_state are left at their default of False. The input has shape (1, 3, 1): a batch of 1 sample, 3 timesteps, and 1 feature.

from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))

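The output is (the exact numbers vary from run to run, because the LSTM weights are randomly initialized):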

[[-0.0953151]]

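This is the hidden state the LSTM returns after consuming the 3 timesteps, i.e. the h(t) described above. Because only the value at the last timestep is output, the result holds a single number (an array of shape (1, 1)).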

Experiment 2

Now add the argument return_sequences=True:

lstm1 = LSTM(1, return_sequences=True)(inputs1)

The output is:

[[[-0.02243521]
  [-0.06210149]
  [-0.11457888]]]

The output is now an array whose length equals the number of timesteps: the network returns h(t) for every timestep.

To summarize, return_sequences decides whether the LSTM outputs only the h(t) of the last timestep or the h(t) of every timestep.

With return_sequences=True the layer returns a 3D tensor of shape (samples, time_steps, output_dim); with return_sequences=False it returns a 2D tensor of shape (samples, output_dim).

For example, with an input of shape (N, 2, 8) and output_dim=32, return_sequences=True returns (N, 2, 32), while return_sequences=False returns (N, 32), which is the last output of the output sequence.
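These shapes can be checked directly. The sketch below builds both variants on the same input; the batch size N = 4 is a hypothetical value chosen for illustration, and the input data is random.

```python
import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

N = 4  # hypothetical batch size
inputs = Input(shape=(2, 8))                          # 2 timesteps, 8 features
last_only = LSTM(32)(inputs)                          # return_sequences=False (default)
all_steps = LSTM(32, return_sequences=True)(inputs)   # one h(t) per timestep
model = Model(inputs=inputs, outputs=[last_only, all_steps])

data = np.random.rand(N, 2, 8).astype("float32")
out_last, out_all = model.predict(data, verbose=0)
print(out_last.shape)  # (4, 32)
print(out_all.shape)   # (4, 2, 32)
```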

In practice this matters a great deal, because it determines whether the network is many-to-one or many-to-many.

Return State

Experiment 3

Next we try return_state:

lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

The output is:

[array([[ 0.10951342]], dtype=float32),
 array([[ 0.10951342]], dtype=float32),
 array([[ 0.24143776]], dtype=float32)]

Note that the output is a list, whose elements are:
- the hidden state of the last timestep
- the hidden state of the last timestep (same as above)
- the cell state of the last timestep (the c(t) described above)

So return_state controls whether the LSTM also returns its final states, and in particular c(t).
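One common use of these returned states is to seed a second LSTM with them, as in an encoder-decoder model. The sketch below is only an illustration of that pattern, not something taken from the original experiments: the layer size, the input lengths, and the variable names are all assumptions.

```python
import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

units = 4  # hypothetical size; encoder and decoder must match to share states

# Encoder: keep only the final h and c, discard the regular output.
enc_inputs = Input(shape=(3, 1))
_, enc_h, enc_c = LSTM(units, return_state=True)(enc_inputs)

# Decoder: starts from the encoder's final states instead of zeros.
dec_inputs = Input(shape=(5, 1))
dec_outputs = LSTM(units, return_sequences=True)(
    dec_inputs, initial_state=[enc_h, enc_c]
)

model = Model(inputs=[enc_inputs, dec_inputs], outputs=dec_outputs)
out = model.predict(
    [np.random.rand(1, 3, 1), np.random.rand(1, 5, 1)], verbose=0
)
print(out.shape)  # (1, 5, 4)
```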

Experiment 4

Finally, we turn on both return_sequences and return_state.

lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

The output is:

[array([[[-0.02145359],
         [-0.0540871 ],
         [-0.09228823]]], dtype=float32),
 array([[-0.09228823]], dtype=float32),
 array([[-0.19803026]], dtype=float32)]

The list has the same meaning as in Experiment 3, except that the first element now holds h(t) for every timestep, so it is an array whose length equals the number of timesteps.
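In the output above, the last entry of the sequence matches the second element of the list. A small check along these lines confirms that the final timestep of the returned sequence is the same value as state_h; the variable names are chosen for this sketch, and the weights are random, so only shapes and equality are checked.

```python
import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

inputs1 = Input(shape=(3, 1))
seq, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[seq, state_h, state_c])

data = np.array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
seq_out, h_out, c_out = model.predict(data, verbose=0)

print(seq_out.shape, h_out.shape, c_out.shape)  # (1, 3, 1) (1, 1) (1, 1)
# The last timestep of the sequence equals the returned hidden state.
print(np.allclose(seq_out[:, -1, :], h_out))    # True
```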

Time Distributed

Finally, a few words on TimeDistributed in Keras. It is used very often with RNNs but is a rather hard concept to grasp. The original author explains:

TimeDistributedDense applies a same Dense (fully-connected) operation to every timestep of a 3D tensor.

Its main use is in many-to-many models, for example an input of shape (1, 5, 1) and an output of shape (1, 5, 1):

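from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
length = 5  # number of timesteps
model = Sequential()
model.add(LSTM(3, input_shape=(length, 1), return_sequences=True))
model.add(TimeDistributed(Dense(1)))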

As explained above, return_sequences=True makes the LSTM output the hidden state of every timestep, with shape (1, 5, 3).

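To turn this (1, 5, 3) 3D tensor into a (1, 5, 1) result without TimeDistributed, we would need a separate Dense layer for each timestep's output (5 of them here). With TimeDistributed, one and the same Dense layer is applied to every timestep, which makes the network more compact and uses fewer parameters.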
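The weight sharing can be illustrated without Keras. The NumPy sketch below uses random stand-in values for the LSTM output and for the Dense kernel and bias (all hypothetical), and compares applying one set of Dense(1) weights per timestep against a single product over the last axis.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, units = 5, 3

# Stand-in for the LSTM output with return_sequences=True, shape (1, 5, 3).
h_seq = rng.normal(size=(1, timesteps, units))

# One shared Dense(1): a (3, 1) kernel plus one bias = 4 parameters.
kernel = rng.normal(size=(units, 1))
bias = rng.normal(size=(1,))

# Apply the same weights to each timestep separately...
per_step = np.stack(
    [h_seq[:, t, :] @ kernel + bias for t in range(timesteps)], axis=1
)
# ...which equals a single matrix product over the last axis.
all_at_once = h_seq @ kernel + bias

print(per_step.shape)                      # (1, 5, 1)
print(np.allclose(per_step, all_at_once))  # True

shared_params = kernel.size + bias.size
separate_params = timesteps * shared_params  # one independent Dense(1) per timestep
print(shared_params, separate_params)        # 4 20
```

Sharing one layer keeps the parameter count independent of the sequence length, whereas independent per-timestep layers grow linearly with it.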

In the many-to-one case, return_sequences=False, the LSTM outputs the hidden state of the last timestep, with shape (1, 3). A plain Dense layer, without TimeDistributed, is then enough to map (1, 3) to (1, 1).
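A minimal sketch of this many-to-one variant, assuming 5 timesteps and random input data (both chosen only for illustration):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, LSTM, Dense

length = 5  # hypothetical number of timesteps
model = Sequential()
model.add(Input(shape=(length, 1)))
model.add(LSTM(3))   # return_sequences=False: output shape (batch, 3)
model.add(Dense(1))  # plain Dense maps (batch, 3) to (batch, 1)

data = np.random.rand(1, length, 1)
out = model.predict(data, verbose=0)
print(out.shape)  # (1, 1)
```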

Summary

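Using a few hands-on code examples, this article explained what the two common parameters of the Keras LSTM API, return_sequences and return_state, do and how they work. TensorFlow and PyTorch have counterparts, so I hope this helps deepen your understanding of RNNs.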