Tianchi NLP Competition - News Text Classification (4): Text Classification Based on Deep Learning, Part 1 - FastText


Series of articles
Tianchi NLP Competition - News Text Classification (1): Understanding the Competition Problem
Tianchi NLP Competition - News Text Classification (2): Data Reading and Data Analysis
Tianchi NLP Competition - News Text Classification (3): Text Classification Based on Machine Learning
Tianchi NLP Competition - News Text Classification (4): Text Classification Based on Deep Learning, Part 1 - FastText


4. Text Classification Based on Deep Learning, Part 1: FastText

4.1 Text Representation Method: FastText

In the previous chapter, we introduced several text representation methods: One-hot, Bag of Words, N-gram, and TF-IDF.

These representation methods share two shortcomings: the resulting vectors are very high-dimensional and take a long time to train on, and they treat words as independent counts without modeling the relationships between them.

Unlike these methods, deep learning can also be used for text representation, and it maps text into a low-dimensional space. Typical examples include FastText, Word2Vec, and BERT. This chapter introduces FastText; Word2Vec and BERT will be covered later.

FastText

You can refer to: https://blog.csdn.net/feilong_csdn/article/details/88655927

FastText is a typical deep-learning-based word vector representation method, and it is very simple: words are mapped into a dense space through an Embedding layer, the embeddings of all words in a sentence are averaged, and the averaged vector is used to perform the classification.

FastText is therefore a three-layer neural network: an input layer, a hidden layer, and an output layer.

[Figure: FastText model architecture]

The following figure shows the FastText network structure implemented in Keras:

[Figure: FastText network structure implemented in Keras]
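For reference, here is a minimal Keras sketch of this three-layer structure. The vocabulary size, embedding dimension, and document length are illustrative assumptions, not values taken from the competition code:

from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 7000      # assumed vocabulary size (illustrative)
EMBEDDING_DIM = 100    # assumed embedding dimension (illustrative)
MAX_LEN = 500          # assumed padded document length (illustrative)
NUM_CLASSES = 14       # the competition has 14 news categories

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),                     # input layer: token ids of a padded document
    layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),       # map each word into a dense space
    layers.GlobalAveragePooling1D(),                   # hidden layer: average all word embeddings
    layers.Dense(NUM_CLASSES, activation='softmax'),   # output layer: class probabilities
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()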

FastText is superior to TF-IDF in text classification tasks:

  • FastText classifies documents with a document vector obtained by averaging the word embeddings, so similar sentences are grouped into the same category
  • The Embedding space learned by FastText has a relatively low dimensionality, so it can be trained quickly

4.2 Text classification based on FastText

Install fastText with pip:

pip install fasttext

Note: a small error was reported during installation, but it had no effect and caused no problems.

import pandas as pd
from sklearn.metrics import f1_score

# Convert the data to the format required by fastText: prefix labels with '__label__'
train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=15000)
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
train_df[['text','label_ft']].iloc[:-5000].to_csv('train.csv', index=False, header=False, sep='\t')

import fasttext
# lr: learning rate, wordNgrams: use word bigrams, minCount: keep all words,
# epoch: number of training passes, loss="hs": hierarchical softmax
model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2, 
                                  verbose=2, minCount=1, epoch=25, loss="hs")

# Evaluate on the last 5,000 rows held out as a validation set;
# model.predict returns ([labels], [probabilities]), so strip the '__label__' prefix
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
# 0.82
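As a side note, the fasttext library can also evaluate a file written in the same '__label__' format via model.test, which returns (number of samples, precision@1, recall@1). A minimal sketch, assuming the held-out rows are written out the same way as train.csv (the file name valid.csv is illustrative):

# Write the held-out rows in the same fastText format (file name is illustrative)
train_df[['text','label_ft']].iloc[-5000:].to_csv('valid.csv', index=False, header=False, sep='\t')

n_samples, precision_at_1, recall_at_1 = model.test('valid.csv')
print(n_samples, precision_at_1, recall_at_1)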

4.3 Tuning with a Validation Set

Both TF-IDF and FastText have model parameters that must be chosen, and these parameters affect the model's accuracy to some extent. So how should we choose them?

  • Read the documentation to understand roughly what each parameter means and which ones increase the model's complexity
  • Check the model's accuracy on a validation set to see whether the model is overfitting or underfitting (see the sketch below)
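For example, a simple loop over a few candidate settings using the same 10,000/5,000 split as in section 4.2 can be used to compare validation F1 scores. This is a minimal sketch; the candidate values are illustrative, not tuned results:

# Illustrative grid over two fastText hyperparameters; the values are assumptions
for word_ngrams in [1, 2, 3]:
    for lr in [0.1, 0.5, 1.0]:
        model = fasttext.train_supervised('train.csv', lr=lr, wordNgrams=word_ngrams,
                                          verbose=2, minCount=1, epoch=25, loss="hs")
        val_pred = [model.predict(x)[0][0].split('__')[-1]
                    for x in train_df.iloc[-5000:]['text']]
        score = f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro')
        print(f'wordNgrams={word_ngrams}, lr={lr}, validation F1={score:.4f}')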


Here we use 10-fold cross-validation: each fold uses 9/10 of the data for training and the remaining 1/10 as a validation set to evaluate the model. Note that each fold must be split so that its label distribution matches the distribution of the full dataset.

# Group the sample indices by label so that each fold can be built with the same
# label distribution as the full dataset (total is the number of samples,
# all_labels is the list of their labels)
label2id = {}
for i in range(total):
    label = str(all_labels[i])
    if label not in label2id:
        label2id[label] = [i]
    else:
        label2id[label].append(i)

Through this 10-fold division we obtain 10 subsets of the data with the same label distribution, indexed 0 to 9. Each time, one subset is used as the validation set and the remaining nine as the training set, giving 10 different splits of the data. Without loss of generality, we choose the last split for the remaining experiments, i.e. the subset with index 9 as the validation set and the subsets with indexes 0-8 as the training set, and then tune the hyperparameters based on the validation results to improve the model's performance.
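One way to turn label2id into these 10 folds is a round-robin assignment per label, which keeps each fold's label distribution close to that of the full dataset. This is a minimal sketch of that idea, not necessarily the author's exact code:

# Distribute the indices of each label round-robin across 10 folds so that every
# fold keeps roughly the same label distribution as the full dataset
num_folds = 10
folds = [[] for _ in range(num_folds)]
for label, indices in label2id.items():
    for j, idx in enumerate(indices):
        folds[j % num_folds].append(idx)

# Fold 9 is the validation set, folds 0-8 form the training set
val_idx = folds[9]
train_idx = [idx for k in range(9) for idx in folds[k]]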

With more training data, the online (leaderboard) score reaches 0.91:

import pandas as pd
from sklearn.metrics import f1_score

%%time
# Convert the data to the fastText format, now using the full 200,000 rows
train_df = pd.read_csv('./train_set.csv', sep='\t', nrows=200000)
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
train_df[['text','label_ft']].iloc[:-50000].to_csv('train.csv', index=False, header=False, sep='\t')

train_df.head()

train_df.info()

%%time
import fasttext
model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2, 
                                  verbose=2, minCount=1, epoch=25, loss="hs")


%%time
# Evaluate on the last 50,000 rows held out as a validation set
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-50000:]['text']]
print(f1_score(train_df['label'].values[-50000:].astype(str), val_pred, average='macro'))
# 0.912

test_df = pd.read_csv('./test_a.csv', sep='\t')
test_df

%%time
# Predict on the test set and strip the '__label__' prefix
test_pred = [model.predict(x)[0][0].split('__')[-1] for x in test_df['text']]

# Build the submission file with a single 'label' column
test = pd.DataFrame(columns=['label'], data=test_pred)

test.to_csv('fasttext_submit.csv', index=None)
# Online score: 0.91

[Figure: preview of the submission DataFrame test]


Source: blog.csdn.net/bosszhao20190517/article/details/107614327