Deep Learning from Scratch, Neural Networks (5): single-label multiclass classification on the Reuters dataset, with categorical and integer label encodings

The Reuters dataset consists of short newswires divided among 46 topics; each sample is one newswire labeled with one topic.
The preprocessing is essentially the same as for IMDB.

Download Data

The first execution downloads the dataset; we then print the number of training and test samples.

(train_data,train_labels),(test_data,test_labels) = reuters.load_data(num_words=10000)

print(len(train_data))
print(len(test_data))

The output shows 8982 training samples and 2246 test samples.
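If you want to inspect a sample, you can decode it back to words, the same trick used for IMDB. A minimal sketch: reuters.get_word_index() maps words to indices, and the indices in train_data are offset by 3 because 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown.

# Decode the first training newswire back to words
word_index = reuters.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
print(decoded_newswire)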

Convert the data to vectors

Each sample is again a list of word indices. Therefore, just as with the IMDB reviews, each newswire is converted into a 10,000-dimensional vector in which the positions of the words that appear are set to 1.

The labels are vectorized with categorical encoding, better known as one-hot encoding. The idea is the same as for the data: each label becomes an all-zero 46-dimensional vector in which only the element at the label's index is 1. The code uses the built-in method to_categorical provided by Keras.

import numpy as np
from keras.utils.np_utils import to_categorical

# Convert each sequence to a 10,000-dimensional vector:
# positions at the word indices are 1, all others are 0
def vectorize_sequences(sequences,dimension=10000):
    results = np.zeros((len(sequences),dimension))
    for i, sequence in enumerate(sequences):
        results[i,sequence] = 1.
    return results

# Vectorize the data
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Vectorize the labels using categorical (one-hot) encoding
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
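For reference, to_categorical is equivalent to this manual one-hot encoder, a minimal sketch of the same idea (the name to_one_hot is just for illustration):

# Manual equivalent of keras.utils.to_categorical
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

# one_hot_train_labels = to_one_hot(train_labels)  # same result as to_categorical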

Build the network

Previously, the movie-review network had only two output classes; now there are 46, so the dimensionality of the output space is much larger.
The 16 units used in the previous network are therefore not enough to distinguish 46 classes: a layer with too few dimensions can become an information bottleneck, permanently dropping information that later layers need.
So 64 units are used here, still with two hidden layers and one output layer (an experiment demonstrating the bottleneck is sketched at the end of this post).

from keras import models
from keras import layers

# Model definition
model = models.Sequential()
model.add(layers.Dense(64,activation='relu',input_shape=(10000,)))
model.add(layers.Dense(64,activation='relu'))
model.add(layers.Dense(46,activation='softmax'))

The last layer outputs a 46-dimensional vector through the softmax activation, so the network outputs a probability distribution over the 46 categories; the 46 probabilities sum to 1.
The loss function is categorical_crossentropy (categorical crossentropy). It measures the distance between two probability distributions: here, the distribution output by the network and the true distribution of the labels. Training minimizes the distance between these two distributions.
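For intuition, here is a minimal numpy sketch of that distance for a single sample. The probabilities below are made up for illustration, not network output:

# Hypothetical one-hot label (category 3) and a softmax-like output
y_true = np.zeros(46)
y_true[3] = 1.
y_pred = np.full(46, 0.01)
y_pred[3] = 0.55            # 45 * 0.01 + 0.55 = 1.0, a valid distribution

# Categorical crossentropy: -sum(y_true * log(y_pred))
loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # about 0.598, i.e. -log(0.55)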

# Optimizer, loss function, and metrics
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Validation set

1,000 samples are set aside from the training data to serve as a validation set.

# Hold out 1,000 samples for validation
x_val = x_train[:1000]            # validation inputs
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]           # validation labels
partial_y_train = one_hot_train_labels[1000:]

Train the model

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val,y_val))

We feed in the training inputs partial_x_train and the training labels partial_y_train.
The network makes 20 passes (epochs) over all the training data, fetching 512 samples per batch.
validation_data is the validation set held out in the previous step.
The returned history records all the metrics collected during training.
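Before plotting, it is worth checking which keys fit() actually recorded, since the accuracy key name varies across Keras versions ('acc' in older releases, 'accuracy' in newer ones):

print(history.history.keys())
# e.g. dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])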

Use matplotlib to plot the trend of these metrics.
First, the loss curve:

import matplotlib.pyplot as plt

loss_values = history.history['loss']
val_loss_values = history.history['val_loss']

epochs = range(1,len(loss_values)+1)

plt.plot(epochs,loss_values,'bo',label='Training loss') # 'bo' plots blue dots
plt.plot(epochs,val_loss_values,'b',label='Validation loss') # 'b' plots a solid blue line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Next, the accuracy curve:

plt.clf() # Clear the current figure

acc_values = history.history['accuracy']
val_acc_values  = history.history['val_accuracy']

plt.plot(epochs,acc_values,'bo',label='Training acc') # 'bo' plots blue dots
plt.plot(epochs,val_acc_values,'b',label='Validation acc') # 'b' plots a solid blue line
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

(Figure: training and validation loss)
(Figure: training and validation accuracy)
The plots show that the network begins to overfit after about the eighth epoch.
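Rather than eyeballing the plot, you can read the best epoch off the recorded history (a small sketch; epochs are 1-indexed here):

# Epoch with the lowest validation loss
best_epoch = np.argmin(history.history['val_loss']) + 1
print(best_epoch)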

Set epochs to 9 and retrain

Then evaluate the model on the test set.

# Train
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=9,
                    batch_size=512,
                    validation_data=(x_val,y_val))
                    
results = model.evaluate(x_test,one_hot_test_labels)
print(results)

The model reaches about 80% accuracy on the test set.
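For context, a purely random classifier gets roughly 19% accuracy on this dataset, so 80% is a substantial improvement. A quick sketch to check that baseline:

import copy

# Shuffle a copy of the test labels and measure agreement with the originals
test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
hits = np.array(test_labels) == np.array(test_labels_copy)
print(float(np.sum(hits)) / len(test_labels))  # roughly 0.19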

Predict results on new data

prediction = model.predict(x_test)

print(prediction[0].shape)      # dimension of each prediction vector
print(np.sum(prediction[0]))    # sum of the elements in the first vector
print(np.argmax(prediction[0])) # class with the highest probability in the first vector

Each vector in the result is 46-dimensional, giving the probability that the newswire belongs to each of the 46 categories, and its elements sum to 1. For the first newswire, category 3 has the highest probability.
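As a sanity check, the test accuracy can be recomputed from these predictions by taking the argmax of each vector and comparing it with the integer test labels:

# Per-sample predicted class vs. ground truth
predicted_classes = np.argmax(prediction, axis=1)
print(np.mean(predicted_classes == np.array(test_labels)))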

Change the label encoding

The labels can instead be encoded as an integer tensor.
With the categorical encoding used above, the loss function was categorical_crossentropy.
After switching to integer labels, you should use sparse_categorical_crossentropy instead, and modify the other related code accordingly.

# Encode the labels as an integer tensor
y_train = np.array(train_labels)
y_test = np.array(test_labels)

# Loss function for integer labels
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

y_val = y_train[:1000]            # validation labels
partial_y_train = y_train[1000:]

results = model.evaluate(x_test,y_test)


Complete code

from keras.datasets import reuters
import numpy as np
from keras.utils.np_utils import to_categorical
from keras import models
from keras import layers
import matplotlib.pyplot as plt

(train_data,train_labels),(test_data,test_labels) = reuters.load_data(num_words=10000)
print(len(train_data))
print(len(test_data))

# Convert each sequence to a 10,000-dimensional vector: positions at the word indices are 1, all others 0
def vectorize_sequences(sequences,dimension=10000):
    results = np.zeros((len(sequences),dimension))
    for i, sequence in enumerate(sequences):
        results[i,sequence] = 1.
    return results

# Vectorize the data
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Vectorize the labels using categorical (one-hot) encoding
# one_hot_train_labels = to_categorical(train_labels)
# one_hot_test_labels = to_categorical(test_labels)

# Encode the labels as an integer tensor
y_train = np.array(train_labels)
y_test = np.array(test_labels)

# Model definition
model = models.Sequential()
model.add(layers.Dense(64,activation='relu',input_shape=(10000,)))
model.add(layers.Dense(64,activation='relu'))
model.add(layers.Dense(46,activation='softmax'))

# Optimizer, loss function, and metrics

# Loss function for one-hot (categorical) labels
# model.compile(optimizer='rmsprop',
#               loss='categorical_crossentropy',
#               metrics=['accuracy'])

# Loss function for integer labels
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Hold out 1,000 samples for validation
x_val = x_train[:1000]            # validation inputs
partial_x_train = x_train[1000:]

# y_val = one_hot_train_labels[:1000]           # validation labels
# partial_y_train = one_hot_train_labels[1000:]

y_val = y_train[:1000]            # validation labels
partial_y_train = y_train[1000:]

# Train
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=9,
                    batch_size=512,
                    validation_data=(x_val,y_val))


# loss_values = history.history['loss']
# val_loss_values = history.history['val_loss']
#
# epochs = range(1,len(loss_values)+1)
#
# plt.plot(epochs,loss_values,'bo',label='Training loss') # 'bo' plots blue dots
# plt.plot(epochs,val_loss_values,'b',label='Validation loss') # 'b' plots a solid blue line
# plt.title('Training and validation loss')
# plt.xlabel('Epochs')
# plt.ylabel('Loss')
# plt.legend()
# plt.show()
#
# plt.clf() # Clear the current figure
#
# acc_values = history.history['accuracy']
# val_acc_values  = history.history['val_accuracy']
#
# plt.plot(epochs,acc_values,'bo',label='Training acc') # 'bo' plots blue dots
# plt.plot(epochs,val_acc_values,'b',label='Validation acc') # 'b' plots a solid blue line
# plt.title('Training and validation accuracy')
# plt.xlabel('Epochs')
# plt.ylabel('Accuracy')
# plt.legend()
# plt.show()

# results = model.evaluate(x_test,one_hot_test_labels)
results = model.evaluate(x_test,y_test)
print(results)

prediction = model.predict(x_test)

print(prediction[0].shape)      # dimension of each prediction vector
print(np.sum(prediction[0]))    # sum of the elements in the first vector
print(np.argmax(prediction[0])) # class with the highest probability in the first vector
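Finally, to see the information bottleneck mentioned in the network-building section, you can rerun the training with a much smaller middle layer. This is a sketch under the integer-label setup above; in the book's version of this experiment, validation accuracy drops to roughly 71%.

# Same architecture, but a 4-unit middle layer acting as an information bottleneck
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))   # the bottleneck layer
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(partial_x_train, partial_y_train,
          epochs=9, batch_size=128,
          validation_data=(x_val, y_val))
print(model.evaluate(x_test, y_test))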

Origin: blog.csdn.net/wwb1990/article/details/104894838