The IMDB dataset contains 50,000 movie reviews, split evenly into a training set and a test set. In IMDB, each word in a review is mapped to an integer greater than 0, representing the rank of that word by frequency in the dataset. This experiment only uses Keras for a simple neural-network exercise and does not attempt to address overfitting. The code is short, so it is pasted directly below.
from keras.datasets import imdb
import numpy as np
from keras import models
from keras import layers
import matplotlib.pyplot as plt

# Load the data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Vectorize the sequences (multi-hot encoding)
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# word_index is a dict whose keys are words and whose values are integer codes
word_index = imdb.get_word_index()
# Invert word_index so that keys are codes and values are words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Indices 0, 1, 2 are reserved for padding, start-of-sequence and unknown, so offset by 3
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

# Vectorize the labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# Split off a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# Model definition
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model; history.history is a dict recording the per-epoch metrics,
# with keys such as 'accuracy', 'val_accuracy', 'loss', 'val_loss'
# (older Keras versions use 'acc'/'val_acc'); inspect them via history.history.keys()
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))

# Plot training loss and validation loss
acc_values = history.history['accuracy']
val_acc_values = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc_values) + 1)

# 'bo' means blue dots
plt.plot(epochs, loss, 'bo', label='training loss')
plt.plot(epochs, val_loss, 'b', label='validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Plot training and validation accuracy
plt.clf()
plt.plot(epochs, acc_values, 'ro', label='Training accuracy')
plt.plot(epochs, val_acc_values, 'r', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Evaluate on the test set
test_result = model.evaluate(x_test, y_test)
print(test_result)
The results are shown below:
The final test-set accuracy is 0.84388. As the plots show, this simple three-layer fully connected network overfits severely.
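Since the validation loss bottoms out after only a few epochs and then climbs, a common remedy is to stop training around that minimum. Below is a minimal pure-Python sketch of that patience-based early-stopping rule (the function name `best_epoch` and the sample loss curve are illustrative, not taken from the experiment above); in Keras itself the same idea is available ready-made as the `keras.callbacks.EarlyStopping` callback.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss,
    stopping the scan once the loss has failed to improve for
    `patience` consecutive epochs (patience-based early stopping)."""
    best = float('inf')
    best_idx = 0
    waited = 0
    for i, v in enumerate(val_losses, start=1):
        if v < best:
            best, best_idx, waited = v, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx

# A typical overfitting curve: validation loss falls, then rises again
curve = [0.52, 0.38, 0.31, 0.29, 0.33, 0.40, 0.47]
print(best_epoch(curve))  # → 4
```

Retraining the model from scratch for only that many epochs (4 here) typically trades a little training accuracy for noticeably better test accuracy.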
Several points can be drawn from this experiment:
1. Stacks of fully connected (Dense) layers can solve many problems; before a neural network is ever trained, data preprocessing accounts for most of the work.
2. In a binary classification problem where the output is a scalar, the loss function is usually binary_crossentropy.
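To make point 2 concrete, binary_crossentropy measures how far the sigmoid output p is from the true label y via -[y·log(p) + (1-y)·log(1-p)], averaged over the batch. A small NumPy sketch of that formula (hand-written here for illustration; Keras computes the same quantity internally):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; eps guards against log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])  # sigmoid outputs in (0, 1)
print(round(binary_crossentropy(y_true, y_pred), 4))  # → 0.1976
```

Note that a totally uninformative prediction of p = 0.5 gives a loss of ln 2 ≈ 0.693, which is why the training loss in the first epoch starts near that value.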