Paper Analysis Task 4: Classification of Papers (to be added)

Task description

Study topic: paper classification (a data modeling task). Build a model on the existing data and use it to classify new papers.
Learning content: complete the category classification using the paper title.

Knowledge points and ideas involved in text classification

Idea 1: TF-IDF + machine learning classifier
Extract features from the text directly with TF-IDF, then classify with a machine learning classifier such as SVM, logistic regression (LR), or XGBoost. A minimal sketch follows below.
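
A minimal sketch of this idea (an illustration that is not part of the original post: it uses logistic regression wrapped one-vs-rest for the multi-label targets, and tiny made-up data in place of the real titles and labels):

# Sketch: TF-IDF features + logistic regression, one-vs-rest for multi-label targets
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

titles = ["riemannian geometry of surfaces", "a neural network approach to image classification"]
labels = np.array([[1, 0], [0, 1]])   # binarized multi-label targets (e.g. from MultiLabelBinarizer)

tfidf_vec = TfidfVectorizer(max_features=4000)
X = tfidf_vec.fit_transform(titles)

lr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
lr_clf.fit(X, labels)
print(lr_clf.predict(tfidf_vec.transform(["a new paper title about graph theory"])))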

Idea 2: FastText
FastText is an entry-level word-vector approach. With the FastText tool released by Facebook, you can quickly build a classifier, as sketched below.
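
A minimal sketch of the FastText idea using the fasttext Python package (an assumption; the original post does not show FastText code, and the training file written here is only illustrative). fastText expects one example per line, with the category prefixed by __label__:

# Sketch: supervised fastText text classifier
import fasttext

# write a tiny training file in fastText format: "__label__<category> <text>" per line
with open("train.txt", "w") as f:
    f.write("__label__math this paper studies the riemannian geometry of surfaces\n")
    f.write("__label__cs a neural network approach to image classification\n")

model = fasttext.train_supervised(input="train.txt", epoch=10, lr=0.5, wordNgrams=2)
print(model.predict("a new paper title about convex optimization"))   # top predicted label + probability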

Idea 3: Word2Vec + deep learning classifier
Word2Vec is a more advanced word vector, and classification is completed by building a deep learning classifier on top of it. The network structure can be TextCNN, TextRNN, or BiLSTM; a sketch of the word-vector step follows.
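
A minimal sketch of the word-vector half of this idea using gensim (an assumption, not from the original post; gensim 4.x uses vector_size, while older versions call it size). The trained vectors would then initialize the embedding layer of a TextCNN / TextRNN / BiLSTM classifier:

# Sketch: train Word2Vec vectors on tokenized paper texts (gensim >= 4.0 assumed)
from gensim.models import Word2Vec

sentences = [["riemannian", "geometry", "of", "surfaces"],
             ["graph", "neural", "networks", "for", "classification"]]
w2v = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)

print(w2v.wv["graph"].shape)   # a 100-dimensional vector that could seed a Keras Embedding layer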

Idea 4: BERT word vectors
BERT provides contextual word vectors with powerful modeling and learning capabilities; a feature-extraction sketch follows.
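
A minimal sketch of extracting BERT features with the Hugging Face transformers library (an assumption; the original post does not include BERT code, and bert-base-uncased is just one possible checkpoint). The [CLS] representation could then feed a small multi-label classification head:

# Sketch: obtain a BERT feature vector for a paper title (requires transformers and torch)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a new paper title about convex optimization",
                   return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = bert(**inputs)
cls_vector = outputs.last_hidden_state[:, 0, :]   # [CLS] token embedding, shape (1, 768)
print(cls_vector.shape)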

Data preprocessing: classify papers by their title and abstract

Import packages

# Import the required packages
import json                       # read the data (our data is in JSON format)
import pandas as pd               # data processing and analysis
import matplotlib.pyplot as plt   # plotting

Import Data

data = []
with open("E:/datawhale数据分析/arxiv-metadata-oai-2019.json", 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        # keep only the fields we need
        d = {'title': d['title'], 'categories': d['categories'], 'abstract': d['abstract']}
        data.append(d)

        # use only part of the data
        if idx > 200000:
            break

data = pd.DataFrame(data)  # convert the list to a DataFrame so pandas can be used for analysis
data.shape

# Classification uses both the title and the abstract, so merge them into one text field
data['text'] = data['title'] + data['abstract']
data['text'][0]

data['text'] = data['text'].apply(lambda x: x.replace('\n', ' '))
data['text'] = data['text'].apply(lambda x: x.lower())
data['text'][0]
data = data.drop(['title', 'abstract'], axis=1)

# A paper can belong to several categories, so the category string needs to be split
data['categories'][7]
# out: 'math.DG'
# Multiple categories, each possibly containing a subcategory
data['categories'] = data['categories'].apply(lambda x: x.split(' '))
data['categories'][7]
# out: ['math.DG']
# Keep only the main category, without the subcategory
data['categories_big'] = data['categories'].apply(lambda x: [xx.split('.')[0] for xx in x])
data['categories_big'][7]
# out: ['math']

Use sklearn.preprocessing.MultiLabelBinarizer to encode the categories

# Encode the categories; a paper can have several, so multi-label encoding is needed
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_big'].iloc[:])
data_label[0]
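
To interpret the binary vectors in data_label (a short addition, not shown in the original post), the fitted encoder exposes classes_ and inverse_transform:

# Sketch: inspect the label encoding produced by MultiLabelBinarizer
print(mlb.classes_)                            # array of category names, one per column
print(mlb.inverse_transform(data_label[:1]))   # recover the category names of the first paper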

Idea 1: use TF-IDF
References: TF-IDF introduction and code; TfidfVectorizer documentation

# Idea 1: use TF-IDF to extract features (limited to at most 4000 words) and vectorize the text
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])

# This is multi-label classification, so wrap the base classifier with sklearn's multi-label tools
# Split into training and validation sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,
                                                    test_size=0.2, random_state=1)

# Build the multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(x_test)))
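
To classify a brand-new paper with this pipeline (a sketch that is not in the original post; the example text is made up), transform its text with the already-fitted vectorizer and map the prediction back to category names:

# Sketch: predict categories for a new paper (title + abstract as one string)
new_text = ["a new paper about the riemannian geometry of surfaces"]
new_tfidf = vectorizer.transform(new_text)   # reuse the fitted TF-IDF vectorizer
pred = clf.predict(new_tfidf)                # binary label matrix of shape (1, n_classes)
print(mlb.inverse_transform(pred))           # category names, e.g. [('math',)]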

Idea 2: use a deep learning model, embedding the words and then training.
First split the data set on the raw text, then encode the texts as integer sequences and truncate/pad them.
Finally, define the model and complete the training.
References: Keras Tokenizer documentation; Embedding layer usage

# First split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['text'].iloc[:], data_label,
                                                    test_size=0.2, random_state=1)

# Encode the data set and truncate/pad the sequences
# parameters
max_features = 500
max_len = 150
embed_size = 100
batch_size = 128
epochs = 5

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokens = Tokenizer(num_words=max_features)
tokens.fit_on_texts(list(x_train) + list(x_test))

x_sub_train = tokens.texts_to_sequences(x_train)
x_sub_test = tokens.texts_to_sequences(x_test)

x_sub_train = sequence.pad_sequences(x_sub_train, maxlen=max_len)
x_sub_test = sequence.pad_sequences(x_sub_test, maxlen=max_len)
# Define the model and train it
# Bi-GRU + Conv1D model
# Keras layers:
from keras.layers import Dense, Input, LSTM, Bidirectional, Activation, Conv1D, GRU
from keras.layers import Dropout, Embedding, GlobalMaxPooling1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, concatenate, SpatialDropout1D
# Keras callback functions:
from keras.callbacks import Callback
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras.models import Model
from keras.optimizers import Adam

sequence_input = Input(shape=(max_len, ))
# Embedding layer (trainable=False keeps the randomly initialized embeddings fixed)
x = Embedding(max_features, embed_size, trainable=False)(sequence_input)
x = SpatialDropout1D(0.2)(x)
x = Bidirectional(GRU(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool])
preds = Dense(34, activation="sigmoid")(x)   # one sigmoid output per encoded category label

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=['accuracy'])
model.fit(x_sub_train, y_train, batch_size=batch_size, epochs=epochs)
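
To check the trained model on the held-out split (a sketch, not part of the original post), evaluate the test sequences and threshold the sigmoid outputs at 0.5 to recover the predicted category names:

# Sketch: evaluate on the test split and decode the predictions
loss, acc = model.evaluate(x_sub_test, y_test, batch_size=batch_size)
pred = (model.predict(x_sub_test) > 0.5).astype(int)   # threshold the per-label sigmoid outputs
print(mlb.inverse_transform(pred[:1]))                 # category names of the first test paper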

Original post: blog.csdn.net/qq_43720646/article/details/113007031