[DataWhale Learning Record 15-04] Zero-Based Introduction to NLP: News Text Classification Contest, Part 04: Text Classification Based on Deep Learning

4 Text classification based on deep learning

4.1 Learning objectives

  1. Learn the use and basic principles of FastText
  2. Learn to use the validation set for parameter tuning

4.2 Text Representation Part 2

4.2.1 Defects of the existing text representation method

In the previous section, we introduced several text representation methods:

  1. One-hot
  2. Bag of Words
  3. N-gram
  4. TF-IDF

We also practiced each of them with sklearn. However, all of these methods have drawbacks: the resulting vectors are high-dimensional and take a long time to train on, and they rely only on counting statistics without modeling the relationships between words.

Unlike these representation methods, deep learning can also represent text, mapping it into a low-dimensional dense space. Typical examples are FastText, Word2Vec, and BERT. This chapter introduces FastText; Word2Vec and BERT are covered later.
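
To make the dimensionality problem concrete, here is a minimal sketch in plain Python. The toy corpus and the helper name `bag_of_words` are made up for illustration. It shows that a bag-of-words vector has one slot per vocabulary word, so its length grows with the vocabulary while most entries stay zero.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "stocks rise as markets rally",
    "team wins the final match",
    "markets fall as stocks slide",
]

# Vocabulary: one dimension per distinct word in the corpus.
vocab = sorted({word for doc in corpus for word in doc.split()})

def bag_of_words(doc):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vector = bag_of_words(corpus[0])
nonzero = sum(1 for value in vector if value != 0)
print(len(vocab), len(vector), nonzero)
```

With a real news corpus the vocabulary can reach many thousands of words, whereas a dense embedding keeps a fixed, much smaller dimension regardless of vocabulary size.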

4.2.2 FastText

FastText is a typical deep learning word vector representation method. It is very simple: an Embedding layer maps each word into a dense space, the vectors of all words in the sentence are averaged in that embedding space, and the averaged vector is used for classification.
So FastText is a three-layer neural network: an input layer, a hidden layer, and an output layer.
[Figure: FastText network structure implemented using Keras]
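
The three layers described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy vocabulary, the dimensions, and the random weights are invented, nothing is trained, and the sketch leaves out n-gram features and every optimization the real library uses.

```python
import math
import random

random.seed(0)

vocab = {"stocks": 0, "rise": 1, "team": 2, "wins": 3}  # toy vocabulary
embed_dim, num_classes = 3, 2

# Randomly initialised parameters; a real model would learn these by training.
embedding = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in vocab]
weights = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)]
           for _ in range(num_classes)]

def forward(tokens):
    """Embed known tokens, average them, then apply a linear layer and softmax."""
    vectors = [embedding[vocab[t]] for t in tokens if t in vocab]
    if not vectors:
        raise ValueError("no known tokens in input")
    # Hidden layer: element-wise mean of the word vectors.
    hidden = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Output layer: one score per class, turned into probabilities by softmax.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward(["stocks", "rise"])
print(probs)
```

Because the hidden layer is just an average, word order is lost; this is one reason word n-gram features (the wordNgrams parameter used later) can help.
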
FastText is superior to TF-IDF in text classification tasks in two ways:

  1. FastText builds a document vector by combining the word embeddings, so similar sentences tend to be classified into the same category.
  2. The embedding space learned by FastText has relatively low dimensionality, so it can be trained quickly.

To study FastText in more depth, see the paper:
Bag of Tricks for Efficient Text Classification

4.3 Text classification based on FastText

FastText trains quickly on a CPU. The recommended way to use it is the official open source version: https://github.com/facebookresearch/fastText/tree/master/python

  • pip install
    pip install fasttext
  • Source installation
    git clone https://github.com/facebookresearch/fastText.git
    cd fastText
    sudo pip install .

Classification model

import pandas as pd
from sklearn.metrics import f1_score

# Convert to the format FastText requires
train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=15000)
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
train_df[['text','label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')

import fasttext
model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,
                                  verbose=2, minCount=1, epoch=25, loss="hs")

val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
# 0.82

With this small amount of data, the score is 0.82. As the training set grows, FastText's score keeps improving; with 50,000 (5w) training samples, the validation score reaches about 0.89-0.90.
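
The data preparation and prediction parsing in the code above can be isolated into two small helpers. This is a minimal sketch, assuming the tab-separated layout written by that code (text first, then the label with the `__label__` prefix). The helper names and the sample text are hypothetical.

```python
LABEL_PREFIX = "__label__"

def to_fasttext_line(text, label):
    """Build one tab-separated training line: text, then the prefixed label."""
    return f"{text}\t{LABEL_PREFIX}{label}"

def strip_label_prefix(predicted):
    """Recover the raw label string from a prediction such as '__label__3'."""
    # Same approach as the code above; it assumes labels contain no '__'.
    return predicted.split("__")[-1]

line = to_fasttext_line("57 44 66 56", 3)  # invented sample text
print(repr(line))
print(strip_label_prefix("__label__3"))
```

Stripping the prefix matters because the predictions are compared with the original label column, which has no prefix.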

4.4 How to use the validation set to adjust parameters

When using TF-IDF and FastText, some model parameters must be chosen. These parameters affect the model's accuracy to some extent, so how should we choose them?

  • Read the documentation to understand roughly what each parameter means and which parameters increase the complexity of the model.
  • Check the model's accuracy on the validation set to find out whether the model is overfitting or underfitting.

Here we use 10-fold cross-validation: each fold uses 9/10 of the data for training and the remaining 1/10 as the validation set to test the model's performance. Note that each fold must be divided so that its label distribution is consistent with that of the entire data set.

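# Group sample indices by label; all_labels holds every sample's label
# and total is the number of samples.
label2id = {}
for i in range(total):
    label = str(all_labels[i])
    if label not in label2id:
        label2id[label] = [i]
    else:
        label2id[label].append(i)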
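
The grouping above is only the first step of the split. One possible way to finish it, sketched under assumptions: shuffle each label's indices and deal them out round-robin so that every fold receives a near-equal share of each label. The function name `stratified_folds`, the seed, and the toy labels are hypothetical, and this is not necessarily how the original tutorial completes the division.

```python
import random

def stratified_folds(all_labels, fold_num=10, seed=0):
    """Split sample indices into fold_num folds with similar label distributions."""
    # Group sample indices by label, as in the snippet above.
    label2id = {}
    for i, label in enumerate(all_labels):
        label2id.setdefault(str(label), []).append(i)

    rng = random.Random(seed)
    folds = [[] for _ in range(fold_num)]
    for indices in label2id.values():
        rng.shuffle(indices)
        # Deal each label's indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % fold_num].append(index)
    return folds

# Toy labels: 60 samples of class 0, 30 of class 1, 10 of class 2.
toy_labels = [0] * 60 + [1] * 30 + [2] * 10
folds = stratified_folds(toy_labels)
print([len(fold) for fold in folds])
```

When a label's count is not divisible by the number of folds, this simple dealing gives the extra samples to the earliest folds; scikit-learn's StratifiedKFold is a ready-made alternative.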

The 10-fold split gives us 10 parts with the same label distribution, indexed 0 to 9. Using one part as the validation set and the remaining parts as the training set each time yields 10 different splits of the data. Without loss of generality, we choose the last one for the remaining experiments: the part with index 9 is the validation set and the parts with indexes 0-8 form the training set. We then adjust the hyperparameters based on the validation results to improve the model's performance.
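
The procedure described here, holding out one fold and tuning on it, can be sketched as follows. The helpers `split_by_fold` and `grid_search` are hypothetical names, and `fake_score` is only a stand-in so the sketch runs without training anything. In practice the scoring function would train a model on the training indices with the given parameters (for example with fasttext.train_supervised) and return the validation macro F1.

```python
from itertools import product

def split_by_fold(folds, val_index):
    """Use one fold for validation and concatenate the others for training."""
    val = list(folds[val_index])
    train = [i for k, fold in enumerate(folds) if k != val_index for i in fold]
    return train, val

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best (score, params) pair."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Ten toy folds of three indices each.
folds = [list(range(k * 3, k * 3 + 3)) for k in range(10)]
train_idx, val_idx = split_by_fold(folds, val_index=9)

def fake_score(params):
    """Stand-in scorer, not a real metric: peaks at lr=0.5, epoch=25."""
    return -abs(params["lr"] - 0.5) - abs(params["epoch"] - 25) / 100

best = grid_search({"lr": [0.1, 0.5, 1.0], "epoch": [5, 25]}, fake_score)
print(len(train_idx), val_idx, best)
```

Because every candidate is compared on the same validation fold, the score of the winning setting is slightly optimistic; a final check on untouched data gives a fairer estimate.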

4.5 Summary of this chapter

This chapter introduced the principle and basic use of FastText with a hands-on example, and then showed how to divide the data set for 10-fold cross-validation.

4.6 Homework

1. Read the FastText document and try to modify the parameters to get a better score.


- Load the word vector model:
model = fasttext.load_model("model.bin", encoding="utf-8")
- List of model attributes (these names follow an older fasttext Python wrapper and may differ in the version installed in 4.3, so check the documentation of your installed version):
print(model.model_name)      # model name
print(model.words)           # vocabulary list
print(model.dim)             # word vector dimension
print(model.ws)              # context window size
print(model.epoch)           # number of epochs
print(model.min_count)       # minimum word frequency
print(model.word_ngrams)     # n-gram setting
print(model.loss_name)       # loss function name
print(model.minn)            # minimum character n-gram length
print(model.maxn)            # maximum character n-gram length
print(model.lr_update_rate)  # learning rate update rate
print(model.t)               # sampling threshold
print(model.encoding)        # model encoding
print(model[word])           # word vector of word

When the amount of data reaches 200,000 (20w) samples, the score is 0.909.

2. Adjust the hyperparameters based on the results of the validation set to make the model performance better.

We can also change the number of training epochs (epoch), the learning rate (lr), and the word n-gram setting (wordNgrams) to improve accuracy.
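
To see what the word n-gram setting adds, here is a minimal sketch of the extra features produced with a value of 2. The helper `word_ngrams` and the sample sentence are invented for illustration. It only lists the n-grams as strings; the library itself is understood to hash them into a fixed number of buckets rather than store them like this.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-word sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the team wins the match".split()
bigrams = word_ngrams(tokens, 2)
features = tokens + bigrams  # unigrams plus bigrams
print(features)
```

Bigrams restore some local word order that plain averaging of word vectors discards, at the cost of more features to train.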

After simply adjusting the parameters, the result is 0.914.

In fact, hierarchical softmax performs slightly worse than full softmax, because hierarchical softmax is an approximation of the full softmax. It lets us build models efficiently on large data sets, but usually at the cost of a few percentage points of accuracy. Due to equipment and time constraints, the comparison is not demonstrated here.
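
The efficiency side of this trade-off can be illustrated by counting work. Full softmax scores every label, while hierarchical softmax makes one binary decision per tree level on the path to a label; the FastText paper describes a Huffman tree built from label frequencies. The sketch below only computes Huffman leaf depths for invented label counts. It does not implement the probability computation, and `huffman_depths` is a hypothetical helper.

```python
import heapq
from itertools import count

def huffman_depths(label_counts):
    """Return the depth of each label's leaf in a Huffman tree built from counts."""
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(c, next(tiebreak), [label]) for label, c in label_counts.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in label_counts}
    while len(heap) > 1:
        count_a, _, labels_a = heapq.heappop(heap)
        count_b, _, labels_b = heapq.heappop(heap)
        merged = labels_a + labels_b
        # Every label under the merged node moves one level away from the root.
        for label in merged:
            depths[label] += 1
        heapq.heappush(heap, (count_a + count_b, next(tiebreak), merged))
    return depths

# Toy label frequencies, invented for illustration.
label_counts = {"A": 50, "B": 25, "C": 15, "D": 10}
depths = huffman_depths(label_counts)
total = sum(label_counts.values())
average_depth = sum(label_counts[l] * depths[l] for l in label_counts) / total
print(depths, average_depth, len(label_counts))
```

Frequent labels sit near the root, so the average number of binary decisions per sample is smaller than the number of labels; the gap widens as the label set grows.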


Origin blog.csdn.net/qq_40463117/article/details/107611117