[Datawhale] AI Summer Camp Phase III - Text Classification Notes Based on Paper Abstracts (Part 1)

During the summer I took part in Datawhale's third AI summer camp, where I studied NLP through competitions. This post shares and records my learning process and notes from the camp.

Background

The NLP summer camp is organized around a competition leaderboard, and this NLP competition's topic is text classification based on paper abstracts. Literature databases in the medical field contain a wealth of disease diagnosis and treatment information; efficiently extracting key information from massive literature to support diagnosis and treatment recommendations is of great significance to clinicians and researchers. Our main task is therefore to build a high-precision model for text classification of paper abstracts: by analyzing an abstract, classify the paper into one of two categories, medical papers and non-medical papers.

Competition task

The machine judges whether the paper belongs to the literature in the medical field by understanding the abstract and other information of the paper.
Task example:
Input:
Paper information in the following format:
Inflammatory Breast Cancer: What to Know About This Unique, Aggressive Breast Cancer.
[Arjun Menta, Tamer M Fouad, Anthony Lucci, Huong Le-Petross, Michael C Stauder, Wendy A Woodward , Naoto T Ueno, Bora Lim],
Inflammatory breast cancer (IBC) is a rare form of breast cancer that accounts for only 2% to 4% of all breast cancer cases. Despite its low incidence, IBC contributes to 7% to 10% of breast cancer caused mortality. Despite ongoing international efforts to formulate better diagnosis, treatment, and research, the survival of patients with IBC has not been significantly improved, and there are no therapeutic agents that specifically target IBC to date. The authors present a comprehensive overview that aims to assess the present and new management strategies of IBC.,
Breast changes; Clinical trials; Inflammatory breast cancer; Trimodality care.
Output:
Yes (1)

Competition data set

The training set and test set data are CSV format files, and the fields are title, author, abstract, and keywords.

Dataset download link: https://aistudio.baidu.com/datasetdetail/231041

Evaluation metric

This competition uses the F1 score as its evaluation metric; the higher the score, the better the result.

Problem-solving ideas

Literature domain classification offers two practical approaches for the text classification task: one uses traditional feature extraction (such as TF-IDF or BOW) combined with machine learning models, and the other builds on a pre-trained BERT model.
The idea of the feature extraction + machine learning approach is as follows:

  1. Data preprocessing: First, preprocess the text data, including text cleaning (such as removing special characters and punctuation marks), word segmentation and other operations. Common NLP toolkits such as NLTK or spaCy can be used to assist in preprocessing.
  2. Feature extraction: Convert text to vector representations using TF-IDF (Term Frequency-Inverse Document Frequency) or BOW (Bag of Words) methods. TF-IDF can calculate the importance of words in the text, while BOW simply counts the number of occurrences of each word in the text. Feature extraction can be achieved using TfidfVectorizer or CountVectorizer of the scikit-learn library.
  3. Build a training set and a test set: split the preprocessed text data into a training set and a test set to ensure that the samples in the data set are evenly distributed.
  4. Choose a machine learning model: Choose a suitable machine learning model according to the actual situation, such as naive Bayesian, support vector machine (SVM), random forest, etc. These models perform well on text classification tasks. The corresponding classifiers in the scikit-learn library can be used for model training and evaluation.
  5. Model training and evaluation: train the selected machine learning model on the training set, then evaluate it on the test set. Metrics can include accuracy, precision, recall, and F1.
  6. Parameter tuning: if the model's performance is unsatisfactory, try adjusting the feature-extraction parameters (such as the term-frequency threshold or vocabulary size) or the model's hyperparameters to obtain better performance.
    Our baseline uses the machine learning approach. When solving machine learning problems, we generally follow this process:
    [Figure: typical machine learning workflow]
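To make steps 1 to 6 concrete, here is a minimal end-to-end sketch built around a scikit-learn Pipeline. This is a sketch only: the file path and column names are illustrative, and the baseline below spells each step out separately.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")             # step 1: load (preprocessed) data; illustrative path
texts, labels = df["text"], df["label"]   # illustrative column names

# Steps 2 and 4 in one object: vectorize the text, then classify
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Steps 3 and 5: 5-fold cross-validated F1 as a quick local check
print(cross_val_score(pipe, texts, labels, cv=5, scoring="f1").mean())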

Task 1: Machine Learning Method Baseline

In this baseline we use scikit-learn's LogisticRegression, the logistic regression model.

1. Import module

Import the modules we need for this Baseline code

# Import required libraries
# pandas for reading tabular data
import pandas as pd

# Bag-of-words (BOW) features; CountVectorizer can be replaced with TfidfVectorizer (TF-IDF),
# updating the rest of the code accordingly; in my tests the latter scores higher
from sklearn.feature_extraction.text import CountVectorizer

# Logistic regression model
from sklearn.linear_model import LogisticRegression

# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

2. Feature extraction

Feature extraction is an important step in machine learning tasks. Each dimension of the training data is called a feature. For example, if we want to predict the price of a used car from its brand, price, and mileage, those three variables are the task's three features. Feature extraction is the process of creating a new feature subset from the original feature set of the training data. The extracted subset generally has no more features than the original set, but it represents the training data better and can therefore yield better predictions. For NLP and CV tasks, we usually need to turn text and images into numerical vector features that a computer can process; the feature_extraction package in the sklearn library is commonly used for this.

In NLP tasks, feature extraction generally converts natural-language text into a numerical vector representation. Common methods include extraction based on TF-IDF (term frequency-inverse document frequency) and extraction based on BOW (bag of words); both are implemented in the sklearn.feature_extraction package.

2.1 Extraction based on TF-IDF

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF (term frequency) is the ratio of the number of times a word appears in a document to the total number of words in that document. IDF (inverse document frequency) is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the word. For example, take the corpus {"the weather is good today", "the mood is bad today", "the weather is bad tomorrow"}, with each sentence as one document. Then the TF and IDF of "today" are:

$$TF(\text{today} \mid \text{doc}_1) = \frac{\text{occurrences of the word in doc}_1}{\text{total words in doc}_1} = \frac{1}{3}$$

$$TF(\text{today} \mid \text{doc}_2) = \frac{\text{occurrences of the word in doc}_2}{\text{total words in doc}_2} = \frac{1}{4}$$

$$TF(\text{today} \mid \text{doc}_3) = 0$$

$$IDF(\text{today}) = \log\frac{\text{total number of corpus documents}}{\text{number of documents containing the word}} = \log\frac{3}{2}$$

The final TF-IDF value of each word is its TF multiplied by its IDF. After computing the TF-IDF value of every word, the original text is replaced by the resulting numerical vector, which completes TF-IDF-based text feature extraction.
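As a quick sanity check of these formulas, here is a minimal sketch running TfidfVectorizer on the toy corpus. Note that sklearn smooths the IDF term and L2-normalizes each row by default, so the exact numbers will differ from the hand computation above.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the weather is good today",
    "the mood is bad today",
    "the weather is bad tomorrow",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix of shape (3, vocabulary size)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf.toarray().round(3))            # one row of TF-IDF weights per document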

We can use the TfidfVectorizer class in sklearn.feature_extraction.text to simply implement document feature extraction based on TF-IDF:

# Import the class
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume the data has been read locally into a DataFrame and given basic preprocessing; data is the processed DataFrame
# Instantiate a TfidfVectorizer and fit it with the fit method
vector = TfidfVectorizer().fit(data["text"])

# After fitting, call transform to obtain the extracted features
train_vector = vector.transform(data["text"])
2.2 Extraction based on BOW

BOW (bag of words) is a commonly used text representation. Its basic idea is to ignore a text's word order, grammar, and syntax, treating it simply as a collection of words, each independent of the others. To put it simply, each document is viewed as a bag (it holds words, which is where "bag of words" comes from), and classification is based on which words the bag contains. Concretely, the bag-of-words model maintains a vocabulary that maps each word to a numerical vector. The simplest such mapping is one-hot encoding: assuming the vocabulary contains the four words "today", "weather", "very", and "bad", one-hot encoding maps them to:

$$\text{today} \rightarrow (1, 0, 0, 0)$$
$$\text{weather} \rightarrow (0, 1, 0, 0)$$
$$\text{very} \rightarrow (0, 0, 1, 0)$$
$$\text{bad} \rightarrow (0, 0, 0, 1)$$

Using the bag-of-words model, the sentence "the weather is very bad today" is then encoded as:

$$BOW(\text{sentence}) = \text{Embedding}(\text{today}) + \text{Embedding}(\text{weather}) + \text{Embedding}(\text{very}) + \text{Embedding}(\text{bad}) = (1, 1, 1, 1)$$
We generally use CountVectorizer in sklearn.feature_extraction.text to implement frequency-based BOW feature extraction of documents. Its usage is the same as TfidfVectorizer's:

# Import the class
from sklearn.feature_extraction.text import CountVectorizer

# Assume the data has been read locally into a DataFrame and given basic preprocessing; data is the processed DataFrame
# Instantiate a CountVectorizer and fit it with the fit method
vector = CountVectorizer().fit(data["text"])

# After fitting, call transform to obtain the extracted features
train_vector = vector.transform(data["text"])
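After fitting, the vectorizer exposes its learned vocabulary, which is handy for checking what features the model will actually see. An illustrative addition, reusing the vector and train_vector names above:

print(len(vector.vocabulary_))               # vocabulary size
print(vector.get_feature_names_out()[:10])   # first few feature names
print(train_vector.shape)                    # (number of documents, vocabulary size), stored sparse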
2.3 Stop words

Stop words are an important tool in natural language processing, usually used to improve the quality of text features or reduce their dimensionality.

When representing text with TF-IDF or BOW, a recurring problem is that some words provide no valuable information for the specific NLP task and can be ignored. This is also common in everyday language. Taking this task as an example: we want medical terms to stand out during feature extraction, while generic, frequently occurring words should be played down or ignored entirely so they do not interfere with feature extraction.

For example, recall the sentence used when explaining the BOW model:

$$BOW(\text{sentence}) = \text{Embedding}(\text{today}) + \text{Embedding}(\text{weather}) + \text{Embedding}(\text{very}) + \text{Embedding}(\text{bad}) = (1, 1, 1, 1)$$
To classify the sentiment of this sentence, we need to highlight its emotional features; that is, we want the word "bad" to carry more weight after BOW encoding.

But without stop words, a sentence such as "the weather is very good today" would be encoded almost identically to the sentence above, which would seriously mislead the model's judgment.

So how do stop words solve this? Ideally, we treat every word except the sentiment-bearing ones as a stop word, i.e. exclude them from the encoding and keep only the sentiment words. The decision then reduces to whether a word like "good" or "bad" appears in the sentence, which makes the sentiment obvious.
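A minimal sketch of this idea on the toy sentences, with a stop-word list chosen by hand for illustration:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the weather is very bad today", "the weather is very good today"]

# Without stop words, the two encodings are nearly identical
plain = CountVectorizer().fit(sentences)
print(plain.transform(sentences).toarray())

# Treating everything except the sentiment words as stop words isolates the signal
stop = ["the", "weather", "is", "very", "today"]
filtered = CountVectorizer(stop_words=stop).fit(sentences)
print(filtered.get_feature_names_out())         # ['bad' 'good']
print(filtered.transform(sentences).toarray())  # [[1 0], [0 1]]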

For this task, everyday words (for example "or", "again", "and") are unlikely to help the model classify. Below is a stop-word file I collected that filters out such unhelpful words.

stop.txt file link: Link: https://pan.baidu.com/s/1mQ50_gsKZHWERHzfiDnheg?pwd=qzuc Extraction code: qzuc

The file can be read as shown below:

stops =[i.strip() for i in open(r'stop.txt',encoding='utf-8').readlines()] 

After reading the file, pass the stops list to CountVectorizer() via its stop_words parameter:

vector = CountVectorizer(stop_words=stops).fit(train['text'])

3. Divide the dataset

In machine learning tasks we generally work with three datasets: a training set, a validation set, and a prediction set. The training set is the data the model is fitted on, i.e. the input we give the model up front. The validation set is a portion we split off to check the model's performance and select the best hyperparameter combination. The prediction set is the data on which the model's final performance is measured; in this competition, the provided test.csv is the prediction set, and our task is to build a model that predicts it as accurately as possible. However, predictions on it are rationed: each participant may submit only three times a day. Since machine learning models usually have many hyperparameters, selecting the best combination requires validating the model many more times than that, i.e. letting the trained model predict on held-out data to find the configuration with the highest accuracy.

Therefore we generally split the training set provided by the organizers, train.csv, into a training set and a validation set: the training split is used to fit the model, and the validation split is used to compare the effects of different parameters and models. Once the best model and parameters are found, we predict on the provided prediction set for the final result.

There are many ways to split a dataset; the basic principle is identically distributed sampling, i.e. the validation and training splits should follow the same distribution, otherwise validation is unreliable (ideally the final prediction set follows the same distribution as well). Here we use a simple hold-out split: for a dataset with T samples in total, we randomly sample 10%~20% of them (0.1T~0.2T samples) as the validation set and take the rest as the training set. (In full cross-validation, the held-out fold is rotated over the data so that every sample is validated on once.) To learn more about splitting methods, see this blog: https://blog.csdn.net/hcxddd/article/details/119698879. The train_test_split function in sklearn.model_selection splits a dataset conveniently:

The baseline does not split off a validation set; you can split one yourself to observe accuracy during training:

from sklearn.model_selection import train_test_split

# Split the dataset into training and validation sets at the given ratio
train_data, eval_data = train_test_split(data, test_size=0.2)
# data is the full dataset, e.g. a DataFrame
# test_size is the validation fraction; 0.2 holds out 20% of the samples as the validation set
# (passing stratify=data['label'] additionally preserves the label distribution in both splits)

4. Choose a machine learning model

We can choose a variety of machine learning models to fit the training data. Different business scenarios and different training data often have different optimal models. Common models include linear models, logistic regression, decision trees, support vector machines, ensemble models, neural networks, and more. Students who want to learn more about various machine learning models are recommended to study "Watermelon Book" or "Statistical Learning Methods".
Sklearn wraps a wide variety of machine learning models, organized into packages by model category. A few commonly used ones:

  • sklearn.linear_model: linear model, such as linear regression, logistic regression, ridge regression, etc.
  • sklearn.tree: tree model, generally a decision tree
  • sklearn.neighbors: nearest neighbor model, common such as K nearest neighbor algorithm
  • sklearn.svm: Support Vector Machines
  • sklearn.ensemble: ensemble models, such as AdaBoost, GBDT, etc.
    In this case we use the simple but well-fitting logistic regression model as the baseline model. Here is a brief introduction to its principle.

The logistic regression model (Logistic Regression) is essentially a linear classifier. Through the logistic (sigmoid) function it maps a data point's features to a probability in (0, 1), the probability that the sample is a positive example, and compares that probability against a threshold (typically 0.5) to decide which class the data belongs to. The mathematical expression of logistic regression is:

$$f(z) = \frac{1}{1 + e^{-z}}$$
$$z = w^T x + w_0$$

The logistic regression model is simple, parallelizable, and interpretable, and it often achieves good results, making it a fairly versatile model.
We can use the logistic regression model implemented in sklearn.linear_model.LogisticRegression:

# Instantiate the model
model = LogisticRegression()
# Hyperparameters can be set at initialization; defaults are used here (see the official docs for details)

# Train the model (sklearn's LogisticRegression has no batch_size or epoch setting; tune e.g. C or max_iter instead)
# train_vector here is the feature-extracted training data
model.fit(train_vector, train['label'])

# Predict test-set labels; test_vector is likewise the feature-extracted test data
test['label'] = model.predict(test_vector)

In fact, the various machine learning models provided by sklearn are wrapped in classes with a similar interface, and most are used the same way as above: instantiate a model object, fit it on the training data with the fit function, and predict on the test data with the predict function.

5. Data Exploration

Exploratory data analysis means getting to know the dataset: understanding the relationships among variables and between variables and the target, by plotting, tabulating, fitting equations, and computing summary statistics, thereby exploring the structure and regularities of the data. It is a very important step in machine learning that helps with later feature engineering and model building.
In this baseline practice, we use pandas to read and explore the data.

5.1 Read data using pandas

Here we use the pd.read_csv() method to read the competition data. Its argument is the path of the data to read, and it returns a DataFrame:

import pandas as pd
train = pd.read_csv('./基于论文摘要的文本分类与关键词抽取挑战赛公开数据/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('./基于论文摘要的文本分类与关键词抽取挑战赛公开数据/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')

Pandas provides methods for quickly inspecting characteristics of the data locally. View the length of the text with DataFrame.apply(len).describe():

print(train['text'].apply(len).describe())
count     6000.000000
mean      1620.251500
std        496.956005
min        286.000000
25%       1351.750000
50%       1598.500000
75%       1885.000000
max      10967.000000
Name: text, dtype: float64

The output shows that the average text length is around 1620. Check the label counts with DataFrame.value_counts():

print(train["label"].value_counts())
label
0    3079
1    2921
Name: count, dtype: int64

The output shows that the 0 and 1 labels are distributed fairly evenly, so we need not worry about a skewed label distribution biasing the model, which helps ensure its generalization ability.

6. Data cleaning

Data and features determine the upper limit of machine learning, while models and algorithms only approach that limit. As the saying goes: garbage in, garbage out. After analyzing the data and before feature engineering, an essential step is cleaning the data.

The role of data cleaning is to use relevant techniques such as mathematical statistics, data mining or predefined cleaning rules to transform dirty data into data that meets data quality requirements. It mainly includes missing value processing, outlier value processing, data bucketing, feature normalization/standardization and other processes.

At the same time, since the table has many columns, we combine the important ones into a new column used for training.

# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('') + ' ' +  train['author'].fillna('') + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('') + ' ' +  test['author'].fillna('') + ' ' + test['abstract'].fillna('')

In the hands-on sessions, some students had questions about fillna(''). The fillna() method in pandas fills NA/NaN values with a specified value. If a row in the dataset is missing its title, author, or abstract, fillna() ensures the string concatenation above does not raise errors.
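A tiny illustration of what fillna('') accomplishes, on toy data:

import pandas as pd

df = pd.DataFrame({"title": ["Paper A", None], "abstract": [None, "Some abstract"]})
df["title"] = df["title"].fillna("")         # NaN becomes the empty string
df["abstract"] = df["abstract"].fillna("")
print(df["title"] + " " + df["abstract"])    # concatenation now works without NaN propagating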

7. Feature engineering

Feature engineering is the process of converting raw data into training data for the model in order to obtain better features. It can improve model performance and sometimes achieves good results even with simple models.
Here we choose to use BOW to convert text to vector representation:

# Feature engineering
vector = CountVectorizer().fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])

8. Model Training and Validation

Whether feature engineering or data cleaning, everything serves the final model, and building and tuning the model determine the final result. The choice of model sets the ceiling of the result; how close we get to that ceiling depends on how the model is tuned.

The modeling process requires a basic understanding of common linear and nonlinear models. Once the model is built, we also need some methods and skills for validating its performance.

# Model training
model = LogisticRegression()

# Train the model; tuning hyperparameters such as C or max_iter may improve results
model.fit(train_vector, train['label'])

9. Result output

The submission must match the format of the sample submission:

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
# Write out the Task 1 predictions
test[['uuid', 'Keywords', 'label']].to_csv('submit_task1.csv', index=None)

The complete code is as follows:

# pandas for reading tabular data
import pandas as pd

# Bag-of-words (BOW) features; CountVectorizer can be replaced with TfidfVectorizer (TF-IDF),
# updating the rest of the code accordingly; in my tests the latter scores higher
from sklearn.feature_extraction.text import CountVectorizer

# Logistic regression model
from sklearn.linear_model import LogisticRegression

# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)


# Read the dataset
train = pd.read_csv('./基于论文摘要的文本分类与关键词抽取挑战赛公开数据/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('./基于论文摘要的文本分类与关键词抽取挑战赛公开数据/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('') + ' ' +  train['author'].fillna('') + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('') + ' ' +  test['author'].fillna('') + ' ' + test['abstract'].fillna('')

vector = CountVectorizer().fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Instantiate the model
model = LogisticRegression()

# Train the model; tuning hyperparameters such as C or max_iter may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
# Task 1 does not involve keyword extraction, but the submission requires the Keywords column, so we fill it with the title
test['Keywords'] = test['title'].fillna('')
# Write out the Task 1 predictions
test[['uuid', 'Keywords', 'label']].to_csv('submit_task1.csv', index=None)

2. Task practice

2.1 Replacing the model

To start, I only swapped in classic models such as SVM, KNN, decision tree, random forest, and naive Bayes, without changing anything else, first to find the base model with the best classification performance and then to try tuning it.
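Instead of editing the script once per model, the candidates can also be compared in a single loop. A sketch, reusing the train_vector and train['label'] variables from the baseline and scoring with 5-fold cross-validated F1:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

candidates = {
    "svm": SVC(kernel="linear", C=1),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=5),
    "rf": RandomForestClassifier(n_estimators=100),
    "nb": MultinomialNB(),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, train_vector, train["label"], cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.4f} (std {scores.std():.4f})")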

  • SVM
# SVM
# pandas for reading tabular data
import pandas as pd

# Bag-of-words (BOW) features; CountVectorizer can be replaced with TfidfVectorizer (TF-IDF),
# updating the rest of the code accordingly; in my tests the latter scores higher
from sklearn.feature_extraction.text import CountVectorizer

# Import the SVC model
from sklearn.svm import SVC
# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

# Set the penalty coefficient
svc = SVC(kernel='linear', C=1)

# Read the dataset
train = pd.read_csv('/home/aistudio/data/data231041/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('/home/aistudio/data/data231041/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('')  + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('')  + ' ' + test['abstract'].fillna('')

# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
vector = CountVectorizer(stop_words=stops).fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Instantiate the model
model = svc

# Train the model; tuning hyperparameters such as C may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
print(test)
test[['uuid','Keywords','label']].to_csv('submit_task_svm.csv', index=None)
  • KNN
# KNN
# pandas for reading tabular data
import pandas as pd

# Bag-of-words (BOW) features; CountVectorizer can be replaced with TfidfVectorizer (TF-IDF),
# updating the rest of the code accordingly; in my tests the latter scores higher
from sklearn.feature_extraction.text import CountVectorizer

# Import the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)



# Read the dataset
train = pd.read_csv('/home/aistudio/data/data231041/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('/home/aistudio/data/data231041/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('')  + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('')  + ' ' + test['abstract'].fillna('')

# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
vector = CountVectorizer(stop_words=stops).fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Instantiate the model
# model = KNeighborsClassifier(n_neighbors=3) # k=3
model = KNeighborsClassifier(n_neighbors=5) # k=5

# Train the model; tuning n_neighbors may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
test[['uuid','Keywords','label']].to_csv('submit_task_knn5.csv', index=None)
  • Decision tree
# Decision tree
# pandas for reading tabular data
import pandas as pd

# Bag-of-words (BOW) features; CountVectorizer can be replaced with TfidfVectorizer (TF-IDF),
# updating the rest of the code accordingly; in my tests the latter scores higher
from sklearn.feature_extraction.text import CountVectorizer

# Import the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)



# Read the dataset
train = pd.read_csv('/home/aistudio/data/data231041/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('/home/aistudio/data/data231041/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('')  + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('')  + ' ' + test['abstract'].fillna('')

# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
vector = CountVectorizer(stop_words=stops).fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Instantiate the model
model = DecisionTreeClassifier(max_depth=5)


# Train the model; tuning max_depth may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
test[['uuid','Keywords','label']].to_csv('submit_task_tree5.csv', index=None)
  • Random forest
# Random forest
# pandas for reading tabular data
import pandas as pd
# TF-IDF features (CountVectorizer swapped out for TfidfVectorizer; in my tests the latter scores higher)
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Import the random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

# Read the dataset
train = pd.read_csv('/home/aistudio/data/data231041/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('/home/aistudio/data/data231041/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')

# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('')  + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('')  + ' ' + test['abstract'].fillna('')

# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
vector = TfidfVectorizer(stop_words=stops, ngram_range=(1, 2), max_features=1000).fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])

# Instantiate the model
model = RandomForestClassifier(n_estimators=100)  # 100 trees in the forest

# Train the model; tuning n_estimators may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
test[['uuid','Keywords','label']].to_csv('submit_task_random_forest.csv', index=None)
  • Naive Bayes
# Naive Bayes

# pandas for reading tabular data
import pandas as pd

# TF-IDF features (in my tests TfidfVectorizer scores higher than CountVectorizer)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes classifier

# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)


# Read the dataset
train = pd.read_csv('/home/aistudio/data/data231041/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('/home/aistudio/data/data231041/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('')  + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('')  + ' ' + test['abstract'].fillna('')

# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
vector = TfidfVectorizer(stop_words=stops).fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Instantiate the model

model = MultinomialNB()  # Switch to the multinomial naive Bayes classifier

# Train the model; tuning hyperparameters may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
test[['uuid','Keywords','label']].to_csv('submit_task_MultinomialNB.csv', index=None)

2.2 Replacing the feature extraction model

# KNN
# pandas for reading tabular data
import pandas as pd
# TF-IDF features (CountVectorizer swapped out for TfidfVectorizer; in my tests the latter scores higher)
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Import the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)



# Read the dataset
train = pd.read_csv('/home/aistudio/data/data231041/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('/home/aistudio/data/data231041/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('')  + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('')  + ' ' + test['abstract'].fillna('')

# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
vector = TfidfVectorizer(stop_words=stops).fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Instantiate the model
# model = KNeighborsClassifier(n_neighbors=3) # k=3
model = KNeighborsClassifier(n_neighbors=5) # k=5

# Train the model; tuning n_neighbors may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
test[['uuid','Keywords','label']].to_csv('submit_task_knn5-1.csv', index=None)

After replacing the bag-of-words (BOW) features with TF-IDF (term frequency-inverse document frequency), the accuracy indeed improved.

2.3 F1-score evaluation

Three submissions a day is really not enough for model tuning. I even registered an alternate account, but together that still only allows six a day. So I re-split the training set at an 8:2 ratio, measured accuracy and F1-score locally, and submitted only once the results looked good. Let's print the accuracy and F1-score:

# KNN with a local train/validation split
# pandas for reading tabular data
import pandas as pd
# TF-IDF features (CountVectorizer swapped out for TfidfVectorizer; in my tests the latter scores higher)
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Import the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("/home/aistudio/data/data231041/train.csv")
data['text'] = data['title'].fillna('')  + ' ' + data['abstract'].fillna('')+ ' ' + data['Keywords'].fillna('')
X = data.text
y = data.label

# Split the training data 8:2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
# Fit the vectorizer on the training split only
vector = TfidfVectorizer(stop_words=stops).fit(X_train)
X_train_vector = vector.transform(X_train)
X_test_vector = vector.transform(X_test)

# Train the model
n_neighbors = 5
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
knn.fit(X_train_vector, y_train)
y_pred = knn.predict(X_test_vector)
# Inspect the scores
print("y_pred", y_pred)
print("y_test", y_test)
print("score on train set", knn.score(X_train_vector, y_train))
print("score on test set", knn.score(X_test_vector, y_test))
print("accuracy score", accuracy_score(y_test, y_pred))
print("f1-score: ", f1_score(y_test, y_pred))


2.4 Data Augmentation

# Naive Bayes with data augmentation

# pandas for reading tabular data
import pandas as pd

# TF-IDF features (in my tests TfidfVectorizer scores higher than CountVectorizer)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes classifier

# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)


# Read the dataset
train = pd.read_csv('/home/aistudio/data/data231041/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

# Data augmentation: append extra keywords from a separate file
train_x = pd.read_csv('/home/aistudio/data/data231041/keywords.csv')
train_x['Keywords'] = train_x['Keywords'].fillna('')
train['Keywords'] = train['Keywords'] + " " + train_x['Keywords']


test = pd.read_csv('/home/aistudio/data/data231041/testB.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('') + ' ' +  train['author'].fillna('') + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('') + ' ' +  test['author'].fillna('') + ' ' + test['abstract'].fillna('')

# Stop words, to remove noisy tokens
stops = [i.strip() for i in open(r'/home/aistudio/stop.txt', encoding='utf-8').readlines()]
vector = TfidfVectorizer(stop_words=stops).fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Instantiate the model

model = MultinomialNB()  # Switch to the multinomial naive Bayes classifier

# Train the model; tuning hyperparameters may improve results
model.fit(train_vector, train['label'])

# Predict labels for the test set
test['label'] = model.predict(test_vector)
test['Keywords'] = test['title'].fillna('')
test[['uuid','Keywords','label']].to_csv('submit_task_MultinomialNB1.csv', index=None)

After data augmentation the accuracy improved. Here I found that the information in the author column is still useful, although with it the naive Bayes accuracy became slightly lower; the final score reached 0.83425.


2.5 Submission results

| Model | F1-score | Evaluation |
| --- | --- | --- |
| LogisticRegression | 0.67116 | The logistic regression model did worst. Perhaps I didn't tune it, but I suspect it would still trail the other models after tuning. |
| SVM | 0.6778 | I had always thought SVM was the ceiling of classical ML classifiers, but the result was unimpressive, so I moved on quickly. I'm not sure why it scored so low; perhaps it simply doesn't suit this text classification task. |
| KNN | 0.67538-0.72763 | KNN is the model I use most, so I tried it right away. K=5 worked best; anything above or below 5 did worse. |
| Random forest | 0.75322 | Better than the others. With the number of trees set to 100 and the feature-extraction parameters tweaked, it reached 0.75; increasing the tree count (e.g. to 200) lowered the score. |
| MultinomialNB | 0.82041 | The multinomial naive Bayes model, with no tuning at all, reached 0.82041 just by swapping it in. A very good result. |

3. Deep learning method

3.1 Problem-solving ideas

The approach using a pre-trained BERT model for modeling is as follows:

  1. Data preprocessing: First, preprocess the text data, including text cleaning (such as removing special characters and punctuation marks), word segmentation and other operations. Common NLP toolkits such as NLTK or spaCy can be used to assist in preprocessing.
  2. Build the dataloader and dataset required for training. When defining the Dataset class you need to implement three methods: __init__, __getitem__, and __len__. __init__ initializes the class, __getitem__ returns the content and label for a given index, and __len__ returns the length of the data.
  3. Construct the DataLoader, in which sentences are encoded, padded, and assembled into batches (see the collate sketch after this list).
  4. Define the prediction model. We use a pre-trained BERT model for this binary text classification task, taking the [CLS] vector from BERT's output to perform the classification. [CLS] stands for classification; it can be understood as a token reserved for downstream classification tasks.

The [CLS] vector is mainly used for tasks like single-text classification: BERT inserts a [CLS] symbol in front of the text and uses the output vector corresponding to this symbol as the semantic representation of the whole text for classification. The intuition is that this symbol, having no obvious semantic content of its own, fuses the semantic information of the words in the text more "fairly" than any actual word would.
The model design therefore takes the [CLS] vector of the encoded text and passes it through a binary classification prediction head to get the final result.

# Binary classification head: maps the 768-dim [CLS] vector to a probability
# (a head ending in Sigmoid is typically paired with a binary cross-entropy loss)
self.predictor = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

# Take the [CLS] vector (position 0 of the last hidden state) and classify it
outputs = self.bert(**src).last_hidden_state[:, 0, :]
return self.predictor(outputs)
  5. Model training and evaluation: train the model on the training set, then evaluate it on the held-out set. Metrics can include accuracy, precision, recall, and F1.
  6. Parameter tuning: if the model's performance is unsatisfactory, try adjusting the preprocessing or the model's hyperparameters to obtain better performance.
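For step 3, here is a minimal sketch of the encode-pad-batch collate function. It assumes a Hugging Face tokenizer and a Dataset that yields (text, label) pairs, such as the MyDataset defined in section 3.6.4; the names (including train_dataset and the checkpoint) are illustrative, not the baseline's exact code.

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

def collate_fn(batch):
    # batch is a list of (text, label) pairs produced by the Dataset
    texts, labels = zip(*batch)
    # Encode and pad the whole batch in one call; truncate overly long abstracts
    src = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    return src, torch.tensor(labels, dtype=torch.float)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)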

In this advanced practice we use the deep learning method, generally following this process:
[Figure: deep learning workflow]
In the advanced baseline we use the BERT model, introduced below:

3.2 Introduction to BERT

BERT is a classic pre-trained deep learning model. In 2018, the Google team's paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" introduced BERT (Bidirectional Encoder Representations from Transformers), which set off a huge wave in natural language processing. The model achieved state-of-the-art results on eleven natural language processing evaluation tasks, including GLUE and MultiNLI, a milestone achievement. Since BERT's launch, the pre-training + fine-tuning paradigm has become the mainstream of natural language processing tasks, marking significant progress across tasks and establishing the dominance of pre-trained models. The release of ChatGPT last year has since shifted the research paradigm toward large language models + prompt engineering, but even today BERT remains one of the most commonly used and important pre-trained models in natural language processing.

This advanced baseline uses BERT as the more advanced model, to show how to deploy and apply BERT to complete this competition's tasks. Here is a brief introduction to BERT's principles and ideas.

3.3 Pre-training + fine-tuning paradigm

The field of natural language processing keeps developing and changing, and models and algorithms that achieve optimal results on various tasks emerge in an endless stream. The earliest paradigm was text representation + machine learning, as demonstrated in the basic baseline: represent natural-language text as numerical vectors, then build a statistical machine learning model for the downstream task. With the development of deep learning, neural word vectors took the stage around 2013, neural networks gradually became the core method of NLP, and the field's core research paradigm shifted to deep learning.

The deep learning approach handles downstream tasks end to end through multi-layer neural networks, folding text representation, feature engineering, and predictive modeling into a single deep network. This removes manual feature construction and significantly improves natural language processing performance. Neural word vectors are its core component: vector representations of text produced by a neural network, carrying deep semantics at suitable dimensionality, which later work can often use directly in place of traditional text representations. Word2Vec is a typical application.

However, Word2Vec produces static word vectors: each word has one fixed vector representation, so problems such as polysemy and complex contextual features cannot be solved. In 2018, the ELMo model opened the era of dynamic word vectors and pre-trained models. ELMo is based on a bidirectional LSTM architecture, pre-trained on its training data with a language-model objective and then fine-tuned for downstream tasks. Its superior performance marked the birth of the pre-training + fine-tuning paradigm.

The so-called pre-training + fine-tuning paradigm means pre-training on massive text data and then fine-tuning for specific downstream tasks. Pre-training is generally based on language modeling: given the preceding words, predict the next word. A language model can be trained on any text data without manual labeling, so it is easy to train on massive data, and through such pre-training the model can learn deep regularities of natural language. Fine-tuning then trains the model on a smaller amount of manually labeled data for designated downstream tasks, such as text classification or text generation, giving the model the ability to perform those tasks.
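Formally, the language-modeling objective factorizes the probability of a text into next-word predictions, which is why it needs no manual labels:

$$P(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$$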

The pre-training + fine-tuning paradigm alleviates the high cost of labeled data to some extent and significantly improves model performance. However, the bidirectional LSTM architecture used by ELMo has inherent flaws: long-range dependencies are hard to capture and parallelism is poor. ELMo also still uses word vectors as feature inputs, so it could not cement the paradigm's mainstream status. In 2017, the introduction of the Transformer brought the field an important new member, the Attention architecture. Building on it, in 2018 OpenAI's GPT model combined the Transformer with the pre-training + fine-tuning paradigm proposed alongside ELMo, further raising the ceiling on many natural language processing tasks. ChatGPT, which exploded in popularity in 2023, is built on the GPT line of models.

From static encodings, to static word vectors computed by neural networks, to pre-training + fine-tuning on a bidirectional LSTM, and then to Transformer-based pre-training + fine-tuning, pre-trained models gradually became the mainstream of natural language processing. But it was BERT, proposed afterwards, that truly established the central position of the pre-training + fine-tuning paradigm. BERT can be seen as a synthesis of ELMo and GPT: it adopts pre-training + fine-tuning and the Transformer architecture while discarding the inherently flawed LSTM, and to remedy GPT's limitation of capturing only one-directional sentence relationships it proposes the MLM pre-training task, which captures deep bidirectional semantic relationships, pushing pre-trained models to a new peak.

3.4 Transformer and Attention

The success of BERT, and of today's LLMs, is inseparable from the Attention mechanism and the Transformer architecture built on it. Here is a brief introduction to both.

Before the Attention mechanism was proposed, deep learning had two main basic architectures: the convolutional neural network (CNN) and the recurrent neural network (RNN). CNNs excel in computer vision, while RNNs and their variant LSTM stand out in NLP. But the RNN architecture has two natural defects: ① its sequential computation limits parallelism, so RNN-based models are costly in compute time even when their parameter counts are not especially large; ② RNNs struggle to capture long-range correlations, since the farther apart two inputs are in the sequence, the harder their relationship is to capture, and because the whole sequence must be read into memory and processed in order, sequence length is limited as well.

In response to these two problems, Vaswani et al. published "Attention Is All You Need" in 2017, creatively proposing the Attention mechanism and abandoning the RNN architecture entirely. The Attention mechanism originated in computer vision; its core idea is that when we look at a picture we often do not need to see all of it clearly, only the key parts. Likewise, in natural language processing, focusing on one or a few key tokens often yields more efficient and higher-quality computation.

The core of the Attention mechanism is to compute the correlation between a Query and the Keys and use it to take a weighted sum of the Values, thereby fitting the correlation between each word in the sequence and every other word. The rough computation process is shown in the figure:
[Figure: Attention computation]
Specifically, an input sequence is mapped through different parameter matrices into three matrices Q, K, and V. Q comes from the sentence (or phrase) against which attention is computed, V is the sentence being attended over, and K holds the key corresponding to each word of that sentence. Taking the dot product of Q and K yields the attention distribution over the sentence V (i.e. which parts matter more and which less); weighting and summing V by this distribution gives the sequence's output after attention, in which the parts more relevant to Q receive higher weight.
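In the notation of the original paper, scaled dot-product attention is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the keys; dividing by $\sqrt{d_k}$ keeps the dot products from growing too large before the softmax.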

The Transformer builds an Encoder-Decoder structure on the Attention mechanism, mainly suited to Seq2Seq (sequence-to-sequence) tasks, in which both the input and the output are natural-language sequences. Its overall structure is as follows:
[Figure: Transformer architecture]

The Transformer consists of an Encoder and a Decoder, plus a softmax classifier and an embedding layer on each side. In the figure above, the left box is the Encoder and the right box is the Decoder.

Since it targets Seq2Seq tasks, the Transformer's training corpus consists of sentence pairs; concrete subtasks can be machine translation, reading comprehension, machine dialogue, and so on. The original paper trained an English-German machine translation task. During training, each sentence pair is split into input text and output text: the input enters the Encoder from the left through an embedding layer, and the output enters the Decoder from the right through another. The Encoder's main job is to encode the input and pass it to the Decoder; the Decoder then computes from the history of the output text together with the Encoder's output, and the result goes through a linear layer and a softmax classifier to produce the final output. The logic is as follows:
[Figure: Transformer data flow]
The Transformer as a whole is a topic worth exploring on its own, so I won't elaborate here. If you are interested, read the original paper "Attention Is All You Need" (https://arxiv.org/pdf/1706.03762.pdf) and this PyTorch-based walkthrough of the Transformer source code: https://github.com/datawhalechina/thorough-pytorch/blob/main/source/%E7%AC%AC%E5%8D%81%E7%AB%A0/Transformer%20%E8%A7%A3%E8%AF%BB.md

3.5 Pre-training tasks

BERT's model architecture directly uses the Transformer's Encoder as its overall architecture. Its core idea is to propose two new pre-training tasks, MLM (Masked Language Model) and NSP (Next Sentence Prediction), in place of the traditional LM (language model).

The MLM task is the basis for BERT's deep fitting of bidirectional semantic features. Put simply, MLM covers a certain proportion of the input tokens, replacing them with the [MASK] token, and asks the model to predict and restore the masked words from their context, essentially a cloze task. Because the model must predict each [MASK] from the context around it, it fully fits bidirectional semantic information.

For example, with the original input "I like you" and a masking ratio of 30%, the masked input might be "I [MASK] you", and the model's task is to predict that the word behind the [MASK] label is "like". (In BERT's actual pre-training, 15% of tokens are selected for masking.)

The NSP task is the pre-training task BERT uses for sentence-level natural language processing. BERT fully adopts the pre-training + fine-tuning paradigm, so the pre-trained model must be able to solve diverse downstream tasks. MLM works well for token-level tasks (such as named entity recognition and relation extraction), but for sentence-level tasks (such as sentence-pair classification and reading comprehension) the gap between the pre-training and downstream patterns is larger, so MLM alone does not achieve very good results. NSP assembles the input as sentence pairs: half are genuinely consecutive sentences, labeled IsNext, and half are randomly paired, labeled NotNext. The model must predict from the input pair whether the two sentences are consecutive, i.e. predict the pair's label.

For example, the input sentence pairs might be (I like you; Because you are so good) and (I like you; Today is a nice day). The model should predict the IsNext label for the first pair and the NotNext label for the second.
With these two pre-training tasks, BERT can fit deep semantics on large amounts of unlabeled text during the pre-training stage, yielding strong prediction results. BERT also pursues close alignment between pre-training and fine-tuning: since the Transformer architecture supports many kinds of natural language processing tasks well, fine-tuning BERT only requires adding a softmax classification layer on top of the pre-trained model. One more detail worth mentioning: because the [MASK] token never appears in real downstream tasks, the masking strategy is adjusted slightly. Of the tokens selected for masking, only 80% are actually replaced with [MASK], while 10% are replaced with a random word and 10% are left as the original word.

3.6 Implementation process

3.6.1 Importing modules

Import the modules we need for this baseline code:

# Import the required dependencies
import os
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
# Used to load the BERT tokenizer
from transformers import AutoTokenizer
# Used to load the BERT model
from transformers import BertModel
from pathlib import Path
3.6.2 Set global configuration
batch_size = 16
# Maximum length of the text
text_max_length = 128
# Total number of training epochs (an arbitrary choice)
epochs = 100
# Learning rate
lr = 3e-5
# Fraction of the training set held out as the validation set
validation_ratio = 0.1
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Print the loss every this many steps
log_per_step = 50

# Location of the dataset; create the directory if it does not exist
dataset_dir = Path("./基于论文摘要的文本分类与关键词抽取挑战赛公开数据")
os.makedirs(dataset_dir, exist_ok=True)

# Model checkpoint directory; create it if it does not exist
model_dir = Path("./model/bert_checkpoints")
os.makedirs(model_dir, exist_ok=True)

print("Device:", device)
3.6.3 Data collection and preparation

Download the data from the competition home page, read in the dataset, and preprocess it (data augmentation could also be considered here).

# Read the dataset and preprocess it

pd_train_data = pd.read_csv('./基于论文摘要的文本分类与关键词抽取挑战赛公开数据/train.csv')
pd_train_data['title'] = pd_train_data['title'].fillna('')
pd_train_data['abstract'] = pd_train_data['abstract'].fillna('')

test_data = pd.read_csv('./基于论文摘要的文本分类与关键词抽取挑战赛公开数据/testB.csv')
test_data['title'] = test_data['title'].fillna('')
test_data['abstract'] = test_data['abstract'].fillna('')
# Concatenate title, author, abstract (plus keywords for the training set) into a single text field
pd_train_data['text'] = pd_train_data['title'].fillna('') + ' ' +  pd_train_data['author'].fillna('') + ' ' + pd_train_data['abstract'].fillna('')+ ' ' + pd_train_data['Keywords'].fillna('')
test_data['text'] = test_data['title'].fillna('') + ' ' +  test_data['author'].fillna('') + ' ' + test_data['abstract'].fillna('')
# Randomly sample a validation set from the training set
validation_data = pd_train_data.sample(frac=validation_ratio)
train_data = pd_train_data[~pd_train_data.index.isin(validation_data.index)]
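A side note on the split above: pd_train_data.sample does not preserve the label ratio between the two classes. If you want a stratified split instead, scikit-learn offers one; a minimal sketch using the same variables (not part of the original baseline):

from sklearn.model_selection import train_test_split

# Stratified alternative to the random sample above: keeps the label ratio
# identical in the training and validation splits.
train_data, validation_data = train_test_split(
    pd_train_data, test_size=validation_ratio, stratify=pd_train_data['label'])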
3.6.4 Build the dataloader and dataset required for training
  • Define the dataset
    A custom dataset in PyTorch must inherit from torch.utils.data.Dataset and override two methods: __getitem__(self, index) and __len__(self). The former returns a sample for a given index, which is how the DataLoader fetches data; the latter returns the length of the dataset, which the DataLoader's sampler uses to determine the range of indices.
    In this tutorial, we take in our training data in the __init__ method and override __getitem__(self, index) so that it returns the text and label of the row at the given index, i.e. the data we train on and the result we hope to obtain. In the __len__ method we return the total length of the dataset, which can be obtained by simply calling len() on the DataFrame.
# Build the Dataset
class MyDataset(Dataset):

    def __init__(self, mode='train'):
        super(MyDataset, self).__init__()
        self.mode = mode
        # Select the corresponding data
        if mode == 'train':
            self.dataset = train_data
        elif mode == 'validation':
            self.dataset = validation_data
        elif mode == 'test':
            # In test mode, return the text and the uuid. Using the uuid as the
            # target makes it easy to write out the results later.
            self.dataset = test_data
        else:
            raise Exception("Unknown mode {}".format(mode))

    def __getitem__(self, index):
        # Take the row at the given index
        data = self.dataset.iloc[index]
        # Take its text content
        text = data['text']
        # Return the appropriate target depending on the mode
        if self.mode == 'test':
            # In test mode, use the uuid as the target
            label = data['uuid']
        else:
            label = data['label']
        # Return the text and the label
        return text, label

    def __len__(self):
        return len(self.dataset)

train_dataset = MyDataset('train')
validation_dataset = MyDataset('validation')
train_dataset.__getitem__(0)

  • Construct the DataLoader
    Next we use a DataLoader to load the training data and the training targets. Note that the loaded data must be tensors, so below we define a collate_fn that assembles each batch and vectorizes the text: the BERT tokenizer converts the text content into tensors, and we vectorize the label values directly with torch.LongTensor().
# Load the tokenizer matching the BERT pre-trained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Next, construct our DataLoader.
# We need to define a collate_fn, which encodes the sentences, pads them, and assembles the batch:
def collate_fn(batch):
    """
    Convert a batch of text sentences to tensors and assemble the batch.
    :param batch: one batch of sentences, e.g. [('text', target), ('text', target), ...]
    :return: the processed result, e.g.:
             src: {'input_ids': tensor([[ 101, ..., 102, 0, 0, ...], ...]), 'attention_mask': tensor([[1, ..., 1, 0, ...], ...])}
             target: [1, 1, 0, ...]
    """
    text, label = zip(*batch)
    text, label = list(text), list(label)

    # src is fed straight to BERT, so the tokenizer output can be used as-is
    # padding='max_length' pads sequences that are too short
    # truncation=True truncates sequences that are too long
    src = tokenizer(text, padding='max_length', max_length=text_max_length, return_tensors='pt', truncation=True)

    return src, torch.LongTensor(label)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
inputs, targets = next(iter(train_loader))
print("inputs:", inputs)
print("targets:", targets)
3.6.5 Defining the model

To define a model in PyTorch, you need to inherit from the nn.Module class and define at least two methods: __init__, which initializes the model structure, and forward, which performs the forward pass.

# Define the prediction model: the BERT model plus a final prediction head
class MyModel(nn.Module):

    def __init__(self):
        super(MyModel, self).__init__()

        # Load the BERT model (mirror='tuna' selects a Chinese download mirror;
        # drop this argument if your transformers version no longer supports it)
        self.bert = BertModel.from_pretrained('bert-base-uncased', mirror='tuna')

        # Final prediction head
        self.predictor = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, src):
        """
        :param src: the tokenized text data
        """

        # Unpack src straight into BERT; the tokenizer and model belong to the
        # same checkpoint, so the keys match.
        # Use the encoder output at the leading [CLS] position as input to the prediction head
        outputs = self.bert(**src).last_hidden_state[:, 0, :]

        # Use the prediction head to make the final prediction
        return self.predictor(outputs)

model = MyModel()
model = model.to(device)
3.6.6 Define the loss function and optimizer
# Define the loss function and the optimizer. Binary cross entropy is used here:
criteria = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
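A side note on this choice: because the model already ends in nn.Sigmoid(), nn.BCELoss is the matching loss. If you removed the Sigmoid from the predictor, an equivalent and often more numerically stable setup (not used in this baseline) would be:

# Alternative sketch: drop nn.Sigmoid() from the predictor and let the
# loss apply the sigmoid internally.
criteria = nn.BCEWithLogitsLoss()
# Inference-time thresholding then becomes:
# preds = (torch.sigmoid(outputs) >= 0.5)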
3.6.7 Define the validation function
# Since inputs is a dictionary, define a helper that moves its tensors to the device
def to_device(dict_tensors):
    result_tensors = {}
    for key, value in dict_tensors.items():
        result_tensors[key] = value.to(device)
    return result_tensors
# Define a validation routine that returns the accuracy and loss on the validation set
def validate():
    model.eval()
    total_loss = 0.
    total_correct = 0
    # Disable gradient tracking during evaluation
    with torch.no_grad():
        for inputs, targets in validation_loader:
            inputs, targets = to_device(inputs), targets.to(device)
            outputs = model(inputs)
            loss = criteria(outputs.view(-1), targets.float())
            total_loss += float(loss)

            correct_num = (((outputs >= 0.5).float() * 1).flatten() == targets).sum()
            total_correct += correct_num

    return total_correct / len(validation_dataset), total_loss / len(validation_dataset)
3.6.8 Model training and evaluation
# First set the model to training mode
model.train()

# Clear the CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# A few variables to help print the loss
total_loss = 0.
# Step counter
step = 0

# Best accuracy observed on the validation set
best_accuracy = 0

# Start training
for epoch in range(epochs):
    model.train()
    for i, (inputs, targets) in enumerate(train_loader):
        # Take the training data from the batch
        inputs, targets = to_device(inputs), targets.to(device)
        # Forward pass through the model
        outputs = model(inputs)
        # Compute the loss
        loss = criteria(outputs.view(-1), targets.float())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += float(loss)
        step += 1

        if step % log_per_step == 0:
            print("Epoch {}/{}, Step: {}/{}, total loss:{:.4f}".format(epoch+1, epochs, i, len(train_loader), total_loss))
            total_loss = 0

        del inputs, targets

    # After each epoch, evaluate on the validation set
    accuracy, validation_loss = validate()
    print("Epoch {}, accuracy: {:.4f}, validation loss: {:.4f}".format(epoch+1, accuracy, validation_loss))
    # Save a checkpoint for this epoch
    torch.save(model, model_dir / f"model_{epoch}.pt")

    # Keep the best model so far
    if accuracy > best_accuracy:
        torch.save(model, model_dir / f"model_best.pt")
        best_accuracy = accuracy

# Load the best model and run prediction on the test set
model = torch.load(model_dir / f"model_best.pt")
model = model.eval()
test_dataset = MyDataset('test')
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
3.6.9 Result output
results = []
# Run prediction without tracking gradients
with torch.no_grad():
    for inputs, ids in test_loader:
        # inputs is a BatchEncoding, which supports .to(device) directly
        outputs = model(inputs.to(device))
        # Threshold the sigmoid outputs at 0.5 to get 0/1 labels
        outputs = (outputs >= 0.5).int().flatten().tolist()
        ids = ids.tolist()
        results = results + [(id, result) for result, id in zip(outputs, ids)]
test_label = [pair[1] for pair in results]
test_data['label'] = test_label
# The submission format requires a Keywords column; this baseline simply fills it with the title
test_data['Keywords'] = test_data['title'].fillna('')
test_data[['uuid', 'Keywords', 'label']].to_csv('submit_task1.csv', index=None)

This article is only a record of my learning. Reference: AI Summer Camp Phase III - Text Classification and Keyword Extraction Challenge Tutorial Based on Paper Abstracts.
