NLP Beginner Series 2 - Text Classification (Building LSTM Models with Keras)

0. Introduction

This article walks through building single-layer, multi-layer, unidirectional, and bidirectional LSTM models with Keras to classify news data. It is aimed mainly at NLP beginners, so everything is written in as much detail as possible. Feedback and suggestions are of course welcome.
The data is a subset of THUCNews, the Tsinghua University news dataset.
Download link: https://pan.baidu.com/s/1U_Ypqiu8Cq4IAWdqiDLR_g (extraction code: 9766). THUCNews homepage: http://thuctc.thunlp.org/
The train, val, and test sets contain 50,000, 5,000, and 10,000 news items respectively, and each item is labeled with one of 10 categories.
First, the reference links:
Reference Link 1
Reference Link 2
Reference Link 3
Reference Link 4
Reference Link 5
The main reference for this article is https://zhuanlan.zhihu.com/p/39884984, which describes the model construction and training process in detail. On that basis, I added my own understanding, more detailed comments, and additional model structures.

1. Environment and Dependencies

This article mainly uses the following modules:

Module          Version
keras           2.4.3
scikit-learn    0.19.1
scipy           1.0.0
seaborn         0.8.1
numpy           1.14.0

Note that if the numpy version is too high, sklearn may fail to import.
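If you want to confirm the versions in your own environment, here is a quick check (just a convenience sketch, not part of the original post):

import numpy, sklearn, keras
# print the installed versions of the three core packages
print(numpy.__version__, sklearn.__version__, keras.__version__)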

2. Word segmentation processing

After getting the data, the first step is simple preprocessing: segmenting the news text into words. Several tools are available; here we use jieba, the most common Chinese word segmentation tool.

import jieba

train_df['cutword'] = ''                                    # add a new column named cutword to the dataframe
for row in train_df.iterrows():                             # iterate over every row
    cut_res = ' '.join(jieba.cut(row[1]['text']))           # segment the text with jieba
    #print(cut_res)
    train_df.loc[row[0], 'cutword'] = cut_res               # write the segmentation result back into the dataframe
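The same result can also be obtained with a single apply call, which is a bit more idiomatic pandas (an equivalent sketch of the loop above, using the same 'text' column):

train_df['cutword'] = train_df['text'].apply(lambda t: ' '.join(jieba.cut(t)))   # segment every row in one pass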

The processed dataframe should look like this:
[Figure: word segmentation results]
The val and test sets need to be processed in the same way.
Alternatively, if you find segmentation too troublesome, you can directly use the pre-segmented csv files from reference link 1; a download link is given in the original post.

3. Model Construction

3.1 Unidirectional single-layer LSTM

This part is essentially a reproduction of the content in https://zhuanlan.zhihu.com/p/39884984. If the original author minds, please remind me to modify or delete it.

The first is to import the required packages:

import pandas as pd
import numpy as np
from sklearn import metrics                                      # model evaluation metrics
from sklearn.preprocessing import LabelEncoder,OneHotEncoder     # for encoding the dataset labels
from keras.models import Model                                   # functional (general) model API
from keras import Sequential                                     # Sequential model API
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding    # layers used in the models
from keras.optimizers import RMSprop                             # optimizer
from keras.preprocessing.text import Tokenizer                   # vocabulary building
from keras.preprocessing import sequence                         # mainly for padding sequences
from keras.callbacks import EarlyStopping                        # early stopping during training
import seaborn as sns
import matplotlib.pyplot as plt                                  # needed later for the confusion-matrix heatmap

Out of laziness, I did not configure a Chinese font for matplotlib, so pinyin labels are used directly in the plots later.
Import the data:

train_df = pd.read_csv('/root/news/cnews_train.csv')
val_df = pd.read_csv('/root/news/cnews_val.csv')
test_df = pd.read_csv('/root/news/cnews_test.csv')
test_df.head()

Encode the data labels:

# Encode the labels
# LabelEncoder turns the text labels into integer ids, here 0-9
train_y = train_df.label
val_y = val_df.label
test_y = test_df.label
LabelE = LabelEncoder()
train_y = LabelE.fit_transform(train_y).reshape(-1,1)
val_y = LabelE.transform(val_y).reshape(-1,1)
test_y = LabelE.transform(test_y).reshape(-1,1)

# One-hot encode the labels
# i.e. turn the integer ids from above into one-hot vectors
OneHotE = OneHotEncoder()
train_y = OneHotE.fit_transform(train_y).toarray()
val_y = OneHotE.transform(val_y).toarray()
test_y = OneHotE.transform(test_y).toarray()
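As a quick sanity check (not in the original post), the labels should now be one-hot vectors with 10 columns:

print(train_y.shape)   # expected: (50000, 10), i.e. 50,000 training samples and 10 classes
print(train_y[0])      # a single one-hot row, e.g. [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]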

Next, number the words in the text according to word frequency: more frequent words get smaller ids.

max_words = 5000                        # maximum vocabulary size
max_len = 600                           # maximum length of each news vector
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(train_df.cutword)

Use tok.word_index.items() to view the ids assigned to the 10 most frequent words:

for ii, item in enumerate(tok.word_index.items()):
    if ii < 10:
        print(item)
    else:
        break

Output:
('我们', 1)
('一个', 2)
('中国', 3)
('可以', 4)
('基金', 5)
('没有', 6)
('自己', 7)
('他们', 8)
('市场', 9)
('这个', 10)

Use tok.word_counts.items() to view the counts of the first 10 words in the vocabulary:

for ii, item in enumerate(tok.word_counts.items()):
    if ii < 10:
        print(item)
    else:
        break

Output:
('马晓旭', 2)
('意外', 1641)
('受伤', 1948)
('国奥', 148)
('警惕', 385)
('无奈', 1161)
('大雨', 77)
('格外', 529)
('青睐', 1092)
('殷家', 1)

At this point every word is represented by an id, so each news item can be converted into a vector. To make sure all inputs to the model have the same dimension, a padding operation is needed, i.e. all news items are padded (or truncated) to the same length, max_len=600.

train_seq = tok.texts_to_sequences(train_df.cutword)
val_seq = tok.texts_to_sequences(val_df.cutword)
test_seq = tok.texts_to_sequences(test_df.cutword)

# pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq,maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq,maxlen=max_len)
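A quick shape check (a small sketch, not in the original post) confirms that every set is now a fixed-size matrix:

print(train_seq_mat.shape)   # expected: (50000, 600)
print(val_seq_mat.shape)     # expected: (5000, 600)
print(test_seq_mat.shape)    # expected: (10000, 600)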

With the preparation done, we can build the LSTM model with Keras. Keras offers two main ways to define a model: the functional (general) Model API and the Sequential API. The referenced article uses the functional Model; since this article targets beginners, the Sequential version is also given below. I won't go into the differences between the two here; you can look them up yourself.
First, the functional model:

inputs = Input(name='inputs',shape=[max_len])
## Embedding(vocabulary size, embedding dimension, input length of each news item)
layer = Embedding(max_words+1,128,input_length=max_len)(inputs)     # embedding layer; 128 is the embedding dimension
layer = LSTM(128)(layer)                                            # LSTM layer; the previous layer's output dimension is 128
layer = Dense(128,activation="relu",name="FC1")(layer)              # fully connected layer
layer = Dropout(0.5)(layer)
layer = Dense(10,activation="softmax",name="FC2")(layer)
model = Model(inputs=inputs,outputs=layer)                          # build the model
model.summary()
model.compile(loss="categorical_crossentropy",optimizer=RMSprop(),metrics=["accuracy"])    # loss, optimizer, evaluation metric

Model summary:
[Figure: model summary]
At this point the data preparation and model construction are done, and training can begin. To save time, an early stopping mechanism is used.

model_fit = model.fit(train_seq_mat,train_y,batch_size=128,epochs=10,
                      validation_data=(val_seq_mat,val_y),
                      callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)]     # stop training when val_loss stops improving
                     )
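If you want to visualize the training process, the loss curves can be plotted from the returned History object (a small optional sketch, not part of the original post):

import matplotlib.pyplot as plt

plt.plot(model_fit.history['loss'], label='train loss')       # training loss per epoch
plt.plot(model_fit.history['val_loss'], label='val loss')     # validation loss per epoch
plt.xlabel('epoch')
plt.legend()
plt.show()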

In my run, the validation loss stopped improving after only two epochs, so early stopping was triggered. The loss and accuracy are as follows:
[Figure: training loss and accuracy]
Then predict on the test set and look at the precision and recall of the model:

# predict on the test set
test_pre = model.predict(test_seq_mat)
# compute the confusion matrix
confm = metrics.confusion_matrix(np.argmax(test_pre,axis=1),np.argmax(test_y,axis=1))
print(metrics.classification_report(np.argmax(test_pre,axis=1),np.argmax(test_y,axis=1)))
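If you also want a single overall accuracy number (a small addition, not in the original post), sklearn provides it directly:

acc = metrics.accuracy_score(np.argmax(test_y,axis=1), np.argmax(test_pre,axis=1))   # fraction of test items classified correctly
print('Test accuracy: %.4f' % acc)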

[Figure: model evaluation]
Visualize the confusion matrix with a heatmap:

Labname = ["tiyu","yule","jiaju","fangchan","jiaoyu","shishang","shizheng","youxi","keji","caijing"]  # no Chinese font configured, so pinyin is used instead
plt.figure(figsize=(8,8))
sns.heatmap(confm.T, square=True, annot=True,
            fmt='d', cbar=False,linewidths=.8,
            cmap="YlGnBu")
plt.xlabel('Predicted label',size = 14)     # columns of confm.T are the predicted classes
plt.ylabel('True label',size = 14)          # rows of confm.T are the true classes
plt.xticks(np.arange(10)+0.5,Labname,size = 12)
plt.yticks(np.arange(10)+0.3,Labname,size = 12)
plt.show()

[Figure: confusion matrix heatmap]
Overall the results are quite good, and basically consistent with the original blogger's results.
Next, build the same LSTM with the Sequential API. Here is the code:

from keras import Sequential
model = Sequential()                                              # define the model as a Sequential model
model.add(Embedding(max_words+1, 128, input_length=max_len))      # add an embedding layer
model.add(LSTM(128))                                              # add an LSTM layer
model.add(Dense(128,activation='relu',name='FC1'))                # add a fully connected layer
model.add(Dropout(0.5))
model.add(Dense(10,activation='softmax',name='FC2'))
model.compile(loss='categorical_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])

model.summary()

Comparing the summaries shows that this model is exactly the same as the previous one.
[Figure: model summary]
Train it in the same way. Here the number of epochs is simply fixed at 2.

model_fit = model.fit(train_seq_mat,train_y,batch_size=128,epochs=2,
                      validation_data=(val_seq_mat,val_y),
                      callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)]
                     )

It turns out that the result after two training epochs is not as good as the previous model's result after two epochs. Perhaps the model should not have stopped training at this point, but as the number of epochs increases, the two models should end up performing the same.

3.2 Unidirectional multi-layer LSTM

Defining a multi-layer LSTM is similar to the single-layer case; the only thing to watch is the output of the previous LSTM layer.
Here, taking the Sequential model as an example, a two-layer LSTM model is constructed.

from keras import Sequential
model = Sequential()
model.add(Embedding(max_words+1, 128, input_length=max_len))
model.add(LSTM(128,return_sequences=True,name='LSTM1'))         # add the first LSTM layer
model.add(LSTM(128,name='LSTM2'))                               # add the second LSTM layer
model.add(Dense(128,activation='relu',name='FC1'))
model.add(Dropout(0.5))
model.add(Dense(10,activation='softmax',name='FC2'))
model.compile(loss='categorical_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])

model.summary()

Pay attention to how the two LSTM layers are added. The second LSTM layer is added exactly like the earlier LSTM layers, but if the first LSTM layer does not set return_sequences=True, you will get a dimension error when the second LSTM layer is added. This is controlled by LSTM's return_sequences parameter, which decides whether the layer returns the hidden state at every time step. In Keras it defaults to False, meaning only the hidden state at the last time step is returned.
Without the per-time-step hidden states, the input to the second LSTM layer is a 2-D tensor of shape (batch, 128) instead of the 3-D tensor (batch, time steps, 128) it expects, so an error is raised.
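Here is a minimal sketch (not from the original post) illustrating the two cases, with shapes assuming the same max_words=5000, max_len=600 and embedding dimension 128 as above:

from keras import Sequential
from keras.layers import Embedding, LSTM

m1 = Sequential()
m1.add(Embedding(5001, 128, input_length=600))
m1.add(LSTM(128, return_sequences=True))    # returns one hidden state per time step
print(m1.output_shape)                      # expected: (None, 600, 128), valid input for another LSTM layer

m2 = Sequential()
m2.add(Embedding(5001, 128, input_length=600))
m2.add(LSTM(128))                           # default return_sequences=False: only the last hidden state
print(m2.output_shape)                      # expected: (None, 128), a 2-D tensor, so a second LSTM layer would fail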
Then look at the summary of the model:
[Figure: model summary]
I then trained for two epochs; due to time constraints I did not train any longer.
[Figure: training output]
Training has become slower, and judging from the first two epochs, adding another LSTM layer did not improve the model's accuracy. I don't know what the effect would be with further training.

3.3 Bidirectional LSTM

To build a bidirectional model with Keras, you only need to import Bidirectional and wrap the LSTM layer with it.

from keras import Sequential
from keras.layers import Bidirectional
model = Sequential()
model.add(Embedding(max_words+1, 128, input_length=max_len))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(128,activation='relu',name='FC1'))
model.add(Dropout(0.5))
model.add(Dense(10,activation='softmax',name='FC2'))
model.compile(loss='categorical_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])

model.summary()
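One detail worth noting (not spelled out in the original post): by default the Bidirectional wrapper concatenates the forward and backward outputs (merge_mode='concat'), so the layer after it receives 256 features instead of 128. A quick check, assuming the model defined above:

print(model.layers[1].output_shape)   # expected: (None, 256), forward 128 + backward 128 concatenated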

Then train for 2 epochs in the same way.
[Figure: training output]
The results do not appear to change much, and training takes longer than in the unidirectional single-layer case.

4. Save and apply the model

4.1 Save the model

Besides the model itself, the trained tokenizer also needs to be saved. The code below shows saving and loading respectively.

import pickle
# saving
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

Then save the model as an h5 file, and load it back:

from keras.models import load_model
# save the model
model.save('LSTM.h5')

del model
# load the model
model = load_model('LSTM.h5')
# If a model object named model already exists in the workspace, delete it first and then load.
# del model

4.2 Application of the model

After the model is trained, it can be used to classify news. To see the classification results on the validation set directly:

val_seq = tok.texts_to_sequences(val_df.cutword)
# pad every sequence to the same length
val_seq_mat = sequence.pad_sequences(val_seq,maxlen=max_len)
# predict on the validation set
pre = model.predict(val_seq_mat)

pre is an array in which each row contains ten elements; the index of the largest element is the predicted category.
Used this way it is not very convenient, though. If you want to look up the predicted category for a specific news id in the validation set, a simple function does the job:

labels = ['体育', '娱乐', '家居', '房产', '教育', '时尚', '时政', '游戏', '科技', '财经']   # category id to name mapping, same order as the label encoding

def pre_res(id):
    '''
    Look up the predicted category of the news item with the given id.
    '''
    loc = np.argmax(pre[id])
    return labels[loc]

Let's test it:

pre_res(4998)

The result is 财经 (finance), which is the correct classification.
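To classify a brand-new piece of raw news text, the same preprocessing pipeline must be applied first: jieba segmentation, conversion with the saved tokenizer, and padding to max_len. Here is a minimal sketch, assuming tok, model and max_len from above are available (the helper name classify_text is my own, not from the original post):

import jieba
import numpy as np
from keras.preprocessing import sequence

def classify_text(raw_text):
    '''Segment, tokenize, pad and predict a single raw news string.'''
    cut = ' '.join(jieba.cut(raw_text))                      # jieba segmentation, same as in training
    seq = tok.texts_to_sequences([cut])                      # map words to ids with the fitted tokenizer
    seq_mat = sequence.pad_sequences(seq, maxlen=max_len)    # pad to the length used in training
    pred = model.predict(seq_mat)
    return labels[np.argmax(pred[0])]                        # reuse the labels list defined for pre_res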

5. Conclusion

That wraps up this introduction to building LSTMs with Keras. Whether I write a TensorFlow or PyTorch version next depends on time and mood. See you next time.

Original post: blog.csdn.net/weixin_44826203/article/details/107536669