2. Classifying Movie Reviews as Positive or Negative with TensorFlow (Text Classification)

This post uses TensorFlow for text classification: deciding whether a movie review is positive or negative. The IMDB dataset contains 50,000 reviews, of which 25,000 are used for training and 25,000 for testing. Both sets are balanced, i.e. they contain equal numbers of positive and negative examples.

I. Downloading the IMDB Dataset

The IMDB dataset has already been preprocessed: each review's word sequence has been converted into a sequence of integers, where each integer stands for a specific word in a dictionary. The download code is shown below; the data is saved under /root/.keras/datasets with the filename imdb.npz.

The code is as follows:

import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

imdb=keras.datasets.imdb
(train_data,train_labels),(test_data,test_labels)=imdb.load_data(num_words=10000)

The argument num_words=10000 keeps only the 10,000 most frequently occurring words in the training data; rarer words are discarded.
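As a quick check of what this means in practice, the snippet below (my own addition) looks at the largest word index that survives the filtering; it should not exceed 9999, since out-of-vocabulary words are replaced by the reserved "unknown" index.

# With num_words=10000, the largest surviving word index should be 9999;
# rarer words are mapped to the reserved "unknown" index instead.
print(max(max(sequence) for sequence in train_data))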

II. Exploring the Data

Below we look at one sample. It is an array of integers, each representing one word of the review. The label is 0 or 1: 0 means a negative review, 1 means a positive review.

Code:

print("Training entries:{},labels:{}".format(len(train_data),len(train_labels)))
print(train_data.shape)
print(test_labels.shape)
print(train_data[0])

Result:

Training entries:25000,labels:25000
(25000,)
(25000,)
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


From the above we can see that the data contains 25,000 samples, and each sample is a list of integers giving the dictionary indices of the words in that review.

Different samples, i.e. different reviews, have different lengths, as the following code shows:

print(len(train_data[0]),len(train_data[1]))

Result:

218 189

The first training sample has length 218 and the second has length 189.

2.1 Converting the Integers Back to Words

The code is as follows:

# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(train_data[0]))

Result:

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

III. Preparing the Data

The reviews must be converted into tensors before they can be fed into the neural network. There are a couple of ways to do this:

Option 1: multi-hot encode each review. Since we kept only the top 10,000 words, a review such as [3, 5] becomes a 10,000-element vector that is 1 at positions 3 and 5 and 0 everywhere else. This requires a matrix of size number-of-reviews × 10,000 (a sketch of this encoding follows the two options).

Option 2: pad every review to the same length and create a tensor of size number-of-reviews × max-length.

We use the second option.
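For reference, here is a rough sketch of option 1 (not used in the rest of this post); the function name multi_hot_encode is just an illustrative choice:

import numpy as np

def multi_hot_encode(sequences, dimension=10000):
    # Build a (num_reviews, 10000) matrix of zeros, then set to 1 the columns
    # corresponding to the word indices that appear in each review.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

# e.g. multi_hot_encode([[3, 5]]) is 1.0 only at columns 3 and 5.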

Reviews shorter than 256 words are padded with 0s so that every review has exactly 256 entries.

Code:

train_data=keras.preprocessing.sequence.pad_sequences(train_data,
                                                      value=word_index['<PAD>'],
                                                      padding='post',
                                                      maxlen=256)
test_data=keras.preprocessing.sequence.pad_sequences(test_data,
                                                     value=word_index['<PAD>'],
                                                     padding='post',
                                                     maxlen=256)
print(len(train_data[0]),len(train_data[1]),len(test_data[1]))
print(train_data[0])

Result:

256 256 256
[   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]

IV. Building the Model

The code is as follows:

vocab_size=10000
model=keras.Sequential()
model.add(keras.layers.Embedding(vocab_size,16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16,activation=tf.nn.relu))
model.add(keras.layers.Dense(1,activation=tf.nn.sigmoid))

model.summary()

Output:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0

 

4.1 Hidden Units

To be honest, I don't fully understand this model's structure myself, especially the first layer; I read several blog posts about it and still couldn't quite figure it out...
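Still, a rough way to get a feel for the first two layers is to push a dummy batch through them and look at the output shape. The snippet below is just such a sketch (the dummy batch and the small "probe" model are my own additions, not part of the original model):

import numpy as np
from tensorflow import keras

# A toy batch of 2 "reviews", each padded to length 256 with arbitrary word indices.
dummy_batch = np.random.randint(0, 10000, size=(2, 256))

# Rebuild only the first two layers so we can inspect what they produce.
probe = keras.Sequential([
    keras.layers.Embedding(10000, 16),        # each word index -> a learned 16-dim vector
    keras.layers.GlobalAveragePooling1D(),    # average the 256 word vectors into one 16-dim vector
])

print(probe.predict(dummy_batch).shape)       # (2, 16): one fixed-length vector per review

In other words, the Embedding layer turns each review from 256 integers into 256 learned 16-dimensional vectors, and the pooling layer averages them into a single fixed-length vector that the Dense layers can then classify.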

4.2 Loss Function and Optimizer

We use binary_crossentropy as the loss function. It is not the only possible choice; mean_squared_error would also work, but binary_crossentropy handles probability outputs better, so we use it here. Why use this cross-entropy loss instead of plain MSE?

Here is a brief explanation (the full derivation is rather long, so I worked it out on scratch paper).
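As a rough numerical illustration of the main point (a toy sketch with made-up logits, not a full derivation): with a sigmoid output, the gradient of MSE with respect to the logit almost vanishes when the prediction is confidently wrong, while the cross-entropy gradient stays large, so training does not stall.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                   # true label: positive review
for z in [-6.0, -2.0, 0.0, 2.0]:          # logits from "confidently wrong" to "right"
    p = sigmoid(z)
    grad_mse = 2 * (p - y) * p * (1 - p)  # d/dz of (p - y)^2: tiny when the sigmoid saturates
    grad_bce = p - y                      # d/dz of binary cross-entropy: proportional to the error
    print(z, p, grad_mse, grad_bce)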

In general, sigmoid is used for binary classification and softmax for multi-class classification, with the corresponding loss functions binary_crossentropy and categorical_crossentropy. Note that when building a model, the sigmoid activation can appear in the output layer or in hidden layers, whereas softmax is only used in the last layer (and is usually treated as a layer of its own rather than just an activation function). Softmax is also special in that it maps multiple inputs to multiple outputs, while the other activations map a single input to a single output.

The code is as follows:

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

V. Creating a Validation Set

If you have taken a machine learning course, you should already know what a validation set is for; I won't discuss it here, see my earlier posts.

The code is as follows:

x_val=train_data[:10000]
partial_x_train=train_data[10000:]

y_val=train_labels[:10000]
partial_y_train=train_labels[10000:]
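A quick sanity check on the split sizes (15,000 examples left for training, 10,000 held out for validation):

print(len(partial_x_train), len(x_val))   # 15000 10000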

VI. Training the Model

The code is as follows (the output is omitted: it is simply 40 epochs of training and quite long). At the end of each epoch, the training loss and accuracy and the validation loss and accuracy are printed.

history=model.fit(partial_x_train,
                  partial_y_train,
                  epochs=40,
                  batch_size=512,
                  validation_data=(x_val,y_val),
                  verbose=1
                  )

VII. Evaluating the Model

The code is as follows (it computes the loss and accuracy on the test set; evaluate returns the loss first and the accuracy second):

results=model.evaluate(test_data,test_labels)
print(results)

Result:

[0.30745334438323974, 0.8748]
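To use the trained model on new data, you can also call predict yourself; the sketch below (my own addition) turns the sigmoid probabilities into 0/1 labels by thresholding at 0.5:

probs = model.predict(test_data)                         # one probability per review, shape (25000, 1)
predicted_labels = (probs > 0.5).astype(int).reshape(-1)
print(probs[0], predicted_labels[0], test_labels[0])     # probability, predicted label, true label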

VIII. Plotting Accuracy and Loss over Time

We create two plots: the first shows the loss against the number of epochs, and the second shows the accuracy against the number of epochs.

The history object returned by fit holds a dictionary recording the training process: the loss and accuracy on the training set and on the validation set at each epoch. The training values are plotted as blue dots and the validation values as a solid blue line.

The code is as follows:

history_dict=history.history
print(history_dict.keys())

acc=history_dict['acc']
val_acc=history_dict['val_acc']
loss=history_dict['loss']
val_loss=history_dict['val_loss']

epochs=range(1,len(acc)+1)

plt.plot(epochs,loss,'bo',label='Training loss')

plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')

plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

plt.clf()
acc_values=history_dict['acc']
val_acc_values=history_dict['val_acc']

plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Results:

In the two plots below you can see that the training loss keeps decreasing and the training accuracy keeps increasing, while the validation curves level off after roughly 20 epochs. This is overfitting: the model fits the training set very well but does not generalize as well to data it has never seen. A simple fix is to stop training after about 20 epochs.

Figure 1: training and validation loss.

Figure 2: training and validation accuracy.
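One simple remedy, assuming roughly 20 epochs is indeed enough, is to retrain with fewer epochs or to stop automatically once the validation loss stops improving, for example with Keras's EarlyStopping callback (a sketch, not part of the original run):

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

model.fit(partial_x_train,
          partial_y_train,
          epochs=40,
          batch_size=512,
          validation_data=(x_val, y_val),
          callbacks=[early_stop],    # stops once val_loss has not improved for 2 epochs
          verbose=1)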


Reposted from blog.csdn.net/m0_37393514/article/details/81025333