NLP news text classification-Task4

Task4: Deep-learning-based text classification with fastText

1. Text representation method

Defects of existing text representation methods:
In the previous chapter, we introduced several text representation methods:
One-hot, Bag of Words, N-gram, TF-IDF.
All of these methods have problems to some degree: the resulting vectors are high-dimensional and take a long time to train on, and they only count occurrences without considering the relationships between words.
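
To make these two defects concrete, here is a minimal sketch in plain Python. The three-sentence corpus and the helper names (`bag_of_words`, `one_hot`, `dot`) are made up for illustration and do not come from the competition data. It shows that a bag-of-words vector has one dimension per vocabulary word, and that the one-hot vectors of two different words are always orthogonal, so no similarity between them is captured.

```python
# Toy corpus, invented for illustration only.
corpus = [
    "stocks rise as markets rally",
    "markets fall as stocks drop",
    "team wins the football match",
]

# Vocabulary: one dimension per distinct word.
vocab = sorted({word for doc in corpus for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(text):
    """Count how often each vocabulary word occurs in text."""
    vec = [0] * len(vocab)
    for word in text.split():
        if word in index:
            vec[index[word]] += 1
    return vec

def one_hot(word):
    """One-hot vector: 1 at the word's index, 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

doc_vec = bag_of_words(corpus[0])
# The vector length equals the vocabulary size, so it grows with the corpus.
print(len(vocab), len(doc_vec))
# Two different words never share a dimension, so their dot product is 0.
print(dot(one_hot("rise"), one_hot("rally")))
```

With a real news corpus the vocabulary can easily reach tens of thousands of words, which is where the dimensionality problem comes from.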

Unlike these representation methods, deep learning can also be used to represent text, mapping it into a low-dimensional space. Typical examples include FastText, Word2Vec and BERT.
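
As a rough sketch of what "mapping to a low-dimensional space" means, the snippet below looks up a small dense vector for each word and averages them into one fixed-size text vector, which is the general idea behind fastText's text representation. Everything here is an assumption for illustration: the embedding values are random placeholders rather than learned weights, `DIM = 4` is chosen only to keep the output readable, and `text_vector` is a hypothetical helper, not a fastText function.

```python
import random

random.seed(0)
DIM = 4  # embedding size; chosen small for illustration

# Hypothetical embedding table. In a trained model these values are learned;
# here they are random placeholders.
vocab = ["stocks", "markets", "rally", "football", "match"]
embeddings = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

def text_vector(text):
    """Average the embeddings of the known words in text."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    if not vectors:
        return [0.0] * DIM  # no known word: fall back to a zero vector
    return [sum(col) / len(vectors) for col in zip(*vectors)]

vec = text_vector("stocks rally as markets rally")
# The text vector has DIM entries no matter how large the vocabulary is.
print(len(vec))
```

The length of the text vector depends only on the embedding size, not on the number of distinct words, which is what keeps the representation compact.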

This time, fastText is mainly used for text classification. The following articles about fastText and other deep learning text representation models are quite good:
fasttext principle and practice
fasttext source code analysis

2. Use fasttext for classification & parameter explanation

import pandas as pd
from sklearn.metrics import f1_score
import fasttext

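# Convert the data to the format fastText expects: one sample per line,
# with the label written as __label__<label>
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# Use the first 10000 rows for training; the last 5000 are held out for validation
train_df[['text','label_ft']].iloc[:-5000].to_csv('train.csv', index=False, header=False, sep='\t')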

# Parameters of train_supervised
model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,
                                  verbose=2, minCount=1, epoch=25, loss="hs")
# lr: learning rate
# wordNgrams: maximum length of word n-grams
# minCount: minimum number of occurrences for a word to be kept
# loss: loss function (ns, hs, softmax)

# Predict a label for each held-out sample and strip the __label__ prefix
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
# Compute the macro-averaged F1 score
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
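
The score printed above is the macro-averaged F1. As a hedged illustration of what that means, the sketch below recomputes it by hand on a tiny made-up example: F1 is computed per class and then averaged without weighting by class size. The function name `macro_f1` and the toy labels are assumptions for illustration; in practice `sklearn.metrics.f1_score(..., average='macro')` is the call to use.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, invented for illustration.
y_true = ["0", "0", "1", "1", "2", "2"]
y_pred = ["0", "1", "1", "1", "2", "0"]
# Per-class F1 is 0.5, 0.8 and 2/3, so the macro average is about 0.656.
print(macro_f1(y_true, y_pred))
```

Because every class counts equally, a rare class that is predicted badly pulls the macro F1 down as much as a frequent one would.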

Training result display: (image not included)
The following article explains each fastText parameter in more detail, which should be enough for calling the API:
fasttext parameter description
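
As one illustration of the `wordNgrams` parameter, the following sketch lists the word sequences that are considered when `wordNgrams=2`: every single word plus every pair of adjacent words. The helper `word_ngrams` is hypothetical and written only for this explanation. fastText itself hashes n-grams into a fixed number of buckets rather than storing them as strings, so this shows which features exist, not how the library stores them.

```python
def word_ngrams(text, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n."""
    tokens = text.split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("markets rally today", max_n=2))
# ['markets', 'rally', 'today', 'markets rally', 'rally today']
```

Adding bigrams lets the model pick up some local word order, at the cost of more features to train.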

3. Training example
Here is a small example I found that is well suited for practice:
Python3 uses fastText for text classification and news classification

Origin blog.csdn.net/DZZ18803835618/article/details/107597911