Team study notes Task 4: Classification of papers

Task 4: Classification of papers

@Datawhale
“Stay hungry Stay young”

Task description

  • Learning topic: paper classification (a data modeling task); use existing data to build a model, then classify new papers;
  • Learning content: use the title of the paper to complete the category classification;
  • Learning outcomes: learn the basic methods of text classification, such as TF-IDF;

In the original papers, the category is filled in by the author. In this task, the title and abstract of the paper can be used to complete the category classification. The tutorial mainly provides two ideas: use TF-IDF with a machine learning classifier, or use the FastText deep learning tool to quickly build a classifier. Here I choose the machine learning approach to complete the text classification.

Preprocessing

First, concatenate the title and abstract to form the text used for classification; a minimal sketch of this step is shown below.
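
A minimal sketch, assuming the DataFrame has 'title' and 'abstract' columns (the column names are my assumption, not confirmed by the original post):

# Assumed column names: 'title' and 'abstract'
data['text'] = data['title'] + ' ' + data['abstract']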

Then split the category field of the original papers:

data['categories'] = data['categories'].apply(lambda x: x.split(' '))
data['categories_big'] = data['categories'].apply(lambda x: [xx.split('.')[0] for xx in x])
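
For example, a category string such as 'cs.CL math.CO' (an illustrative value) is first split into ['cs.CL', 'math.CO'] and then reduced to the top-level labels ['cs', 'math'].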

Then encode the categories; here the MultiLabelBinarizer encoder is imported from sklearn.

Training

from sklearn.preprocessing import MultiLabelBinarizer
# Encode the multi-label categories into a 0/1 indicator matrix
mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_big'].iloc[:])
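
As a quick sanity check (illustrative; the actual output depends on the data):

print(mlb.classes_)      # the learned top-level categories, e.g. 'cs', 'math', ...
print(data_label.shape)  # (number of papers, number of categories)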

The next step is to extract features with TF-IDF, available in sklearn as TfidfVectorizer. Its principle: the importance of a word is proportional to how often it appears in the document (term frequency, TF) and inversely proportional to how often it appears across the corpus (inverse document frequency, IDF).
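
For reference, sklearn's TfidfVectorizer uses a smoothed IDF by default (smooth_idf=True):

tfidf(t, d) = tf(t, d) * idf(t),   where idf(t) = ln((1 + n) / (1 + df(t))) + 1

Here n is the number of documents and df(t) is the number of documents containing term t; each document vector is then L2-normalized.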

First import the TF-IDF vectorizer, limiting the vocabulary to at most 4000 terms:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
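
The result is a SciPy sparse matrix; a quick illustrative check:

print(data_tfidf.shape)             # (number of papers, 4000)
print(len(vectorizer.vocabulary_))  # at most 4000 terms are kept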

Now I have obtained a sparse matrix of text features. Since this is a multi-label classification problem, sklearn's multi-label wrapper is used for encapsulation. Following the usual machine learning workflow, the data are split into a training set and a validation set, with 20% held out for testing:

# Split into training and validation sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,
                                                    test_size=0.2, random_state=1)

# Build the multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
# Use a multinomial naive Bayes classifier as the base estimator
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)
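
Once trained, predictions can be mapped back to readable labels via the same mlb; a hypothetical example (the input text below is made up):

new_text = ['a survey of deep learning methods for text classification']
pred = clf.predict(vectorizer.transform(new_text))
print(mlb.inverse_transform(pred))  # e.g. [('cs',)]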

Validation results

Finally, evaluate the model on the validation set and print the report:

from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(x_test)))

The output is the classification report with per-class precision, recall, and F1 (shown as a screenshot in the original post).
This task mainly covers some usages of sklearn, many of which I have used before. Recently I want to review basic Python syntax, and after finishing probability theory this year, I want to work through the machine learning theory in the watermelon book (Zhou Zhihua's Machine Learning) from the beginning.

Origin blog.csdn.net/weixin_45717055/article/details/112987938