Keras Deep Learning Notes 2: Classifying Movie Reviews (Binary Classification)


Binary classification is one of the most widely applied kinds of machine-learning problem. Below we work through an example of classifying movie reviews as positive or negative based on the text content of the reviews.

I. Loading the IMDB dataset

The following code downloads the IMDB dataset (the first run downloads roughly 80 MB of data):

from keras.datasets import imdb
(train_data,train_labels), (test_data,test_labels) = imdb.load_data(num_words=10000)
train_data[0]  # each review is a list of word indices
 >>[1,
...,
 973,
 1622]
train_labels[0]  # labels are 0 or 1: 1 stands for positive, 0 for negative
 >>1
max([max(sequence) for sequence in train_data])  # no index exceeds 9999, since we kept only the top 10,000 words
 >>9999

num_words=10000 means we keep only the 10,000 most frequently occurring words in the training data; low-frequency words are discarded. This keeps the resulting vector data at a manageable size and easy to work with.

We can decode a review back into English words; the following code decodes the first review. Note the i - 3 offset: indices 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown", so the word indices are offset by 3.

word_index = imdb.get_word_index()  # maps words to integer indices
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])  # reverse mapping: indices to words
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]])
print(decoded_review)
>>? this film was just brilliant casting location scenery story direction everyone's 
>really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's 
>that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't 
>you think the whole story was so lovely because it was true 
>and was someone's life after all that was shared with us all

II. Preparing the data

Encode the integer sequences into a binary matrix (multi-hot encoding): each review becomes a 10,000-dimensional vector with 1s at the indices of the words it contains.

import numpy as np

def vectorize_sequence(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))  # all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set results[i, j] to 1 for every word index j in this review
    return results

x_train = vectorize_sequence(train_data)
x_test = vectorize_sequence(test_data)

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
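
As a quick sanity check, each sample is now a 10,000-dimensional multi-hot vector (the shapes below assume the standard IMDB split of 25,000 training and 25,000 test reviews):

print(x_train.shape)  # (25000, 10000)
print(x_test.shape)   # (25000, 10000)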

III. Building the network

1. Model definition

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
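
Optionally, model.summary() prints each layer's output shape and parameter count; for this architecture the parameters work out as follows:

model.summary()
# Dense, 16 units: 10000*16 + 16 = 160,016 parameters
# Dense, 16 units:    16*16 + 16 =     272 parameters
# Dense,  1 unit:     16*1  +  1 =      17 parameters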

2. Compiling the model

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])

Key point: model.compile()
optimizer is the optimizer: here we use the rmsprop optimizer;
loss is the loss function: here we use the commonly used binary cross-entropy loss;
metrics is the list of metrics to monitor: here we only care about accuracy.
The code above passes the optimizer, loss function and metric as strings, which works because rmsprop, binary_crossentropy and accuracy are built into Keras. We can also configure the parameters of the optimizer ourselves, or pass a custom loss function or metric, as shown in sections 2.1 and 2.2.

2.1 Configuring the optimizer

Instead of a string, pass an optimizer instance brought in with import; here we use one of Keras's built-in optimizers:

from keras import optimizers

# lr sets the learning rate (newer Keras versions spell it learning_rate)
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

2.2 Using custom loss functions and metrics

As above, these can be imported and passed as function objects rather than strings:

from keras import losses
from keras import metrics

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
             loss=losses.binary_crossentropy,
             metrics=[metrics.binary_accuracy])
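
For a truly custom metric, any function that takes tensors (y_true, y_pred) and returns a tensor will work. Below is a minimal sketch (the name my_binary_accuracy is hypothetical; it mirrors Keras's built-in binary_accuracy using the backend API):

from keras import backend as K

def my_binary_accuracy(y_true, y_pred):
    # fraction of rounded predictions that match the true labels
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='binary_crossentropy',
              metrics=[my_binary_accuracy])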

IV. Training and validating the model

1. Setting aside a validation set

To monitor the model's accuracy on data it has never seen during training, set aside the first 10,000 samples of the training data as a validation set:

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

2. Training the model

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['acc'])
history = model.fit(partial_x_train,
         partial_y_train,
         epochs=20,
         batch_size=512,
         validation_data=(x_val, y_val))

Note!!!
model.fit() returns a History object. This object has a member history, a dictionary containing data about everything that happened during training. You can use this dictionary to plot the training curves; the key names follow the metric names passed to compile, hence 'acc' and 'val_acc' here.

history_dict = history.history
history_dict.keys()
>>dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])

3. Plotting the training and validation loss with matplotlib

import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')  # 'bo' means blue dots
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')  # 'b' means a solid blue line
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
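
The training and validation accuracy can be plotted the same way (a short sketch reusing history_dict; the keys are 'acc' and 'val_acc' because compile was given metrics=['acc']):

plt.clf()  # clear the previous figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()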

4. Evaluating performance on the test set

model.evaluate() returns the loss first, followed by the metrics, so results here is [test loss, test accuracy]:

results = model.evaluate(x_test, y_test)
print(results)

5. Generating predictions on new data with the trained network

model.predict(x_test)
>>array([[4.2676926e-05],
       [1.0000000e+00],
       [9.9798399e-01],
       ...,
       [4.2319298e-06],
       [6.9633126e-04],
       [9.3176681e-01]], dtype=float32)

As you can see, the network is highly confident about some samples (probabilities approaching 0.99 or 1) but much less certain about others (around 0.4 or 0.6).
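
Since the sigmoid output is the probability that a review is positive, turning it into a class label just takes a threshold (a minimal sketch, assuming the usual 0.5 cut-off):

predictions = model.predict(x_test)
predicted_labels = (predictions > 0.5).astype('int32')  # 1 = positive, 0 = negative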


V. Complete code listing

from keras.datasets import imdb
import numpy as np
from keras import models
from keras import layers
import matplotlib.pyplot as plt

# I. Load the IMDB dataset
(train_data,train_labels), (test_data,test_labels) = imdb.load_data(num_words=10000)

# II. Prepare the data: vectorize the training/test data and labels
def vectorize_sequence(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequence(train_data)
x_test = vectorize_sequence(test_data)

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# III. Build the network
# 1. Define the model
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
# 2. Compile the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])


# IV. Train and validate the model
# 1. Train the model
history = model.fit(partial_x_train,
         partial_y_train,
         epochs=20,
         batch_size=512,
         validation_data=(x_val, y_val))
# 2. Plot the loss curves
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

# 3. Evaluate on the test set
results = model.evaluate(x_test, y_test)
print(results)

# 4. Print the predictions
print(model.predict(x_test))

VI. Results: overfitting

[Figures: training output and the training/validation loss curves from the run above]

As the figures show, the training loss decreases with every epoch, but the validation loss reaches its best value around epoch 4 and then starts to rise. In plain terms, the model keeps getting better on the training data, yet that does not mean it gets better on data it has never seen. In deep-learning terminology this is overfitting: after epoch 4 the model over-optimizes on the training data, and what it learns becomes specific to the training set and fails to generalize to data outside it.
To prevent this, you can stop training after 3 or 4 epochs, or use other techniques to reduce overfitting, as sketched below.
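
Following the simple fix above, here is a sketch that retrains a fresh copy of the model for only 4 epochs (stopping before overfitting sets in) and then evaluates it on the test set; Keras's EarlyStopping callback is another common way to automate this:

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)  # train on the full training set
results = model.evaluate(x_test, y_test)
print(results)  # [test loss, test accuracy]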


Reposted from blog.csdn.net/qq_40076022/article/details/109224933