Tianchi NLP Competition - News Text Classification (3): Text Classification Based on Machine Learning


Series of articles
Tianchi NLP Competition - News Text Classification (1): Understanding the Competition Task
Tianchi NLP Competition - News Text Classification (2): Data Reading and Data Analysis
Tianchi NLP Competition - News Text Classification (3): Text Classification Based on Machine Learning


3. Text classification based on machine learning

3.1 Machine learning model

  1. Machine learning can solve certain problems, but it should not be expected to be omnipotent;
  2. There are many kinds of machine learning algorithms; which one to choose depends on what the specific problem requires;
  3. Each machine learning algorithm has its own preferences (biases), so each specific problem needs its own analysis;


For the text classification problem here, a classic machine learning route can be used: TF-IDF features + a sklearn machine learning model.

3.2 Text representation method

In natural language processing, text has variable length. Methods that represent text as numbers or vectors that a computer can operate on are generally called word embedding methods. Word embedding transforms variable-length text into a fixed-length space, which is the first step of text classification.

One-hot
The one-hot here is consistent with its use in data mining tasks: each word (or character) is represented by a discrete vector. Specifically, each character/word is first mapped to an index, and the vector is then filled in according to that index.

Examples of one-hot representation methods are as follows:

Sentence 1: 我爱北京天安门 (I love Beijing Tiananmen)
Sentence 2: 我喜欢上海 (I like Shanghai)

First, index every character that appears in the sentences, i.e. assign each character a number:

{'我': 1, '爱': 2, '北': 3, '京': 4, '天': 5, '安': 6, '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11}


A total of 11 characters are included here, so each character can be converted into an 11-dimensional sparse vector:

我 (I): [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
爱 (love): [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...
海 (sea): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
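
As a minimal sketch of this idea (plain Python, not part of the original tutorial; indices here start at 0 rather than 1), the one-hot vectors can be built by hand:

# character-level tokens of the two example sentences
sentences = [list("我爱北京天安门"), list("我喜欢上海")]

# build the vocabulary: character -> index (0-based)
vocab = {}
for sent in sentences:
    for ch in sent:
        if ch not in vocab:
            vocab[ch] = len(vocab)

def one_hot(ch):
    vec = [0] * len(vocab)
    vec[vocab[ch]] = 1
    return vec

print(len(vocab))      # 11 distinct characters
print(one_hot('我'))   # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot('海'))   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]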

Bag of Words
Bag of Words, also known as Count Vectors: each character/word of a document is represented by its number of occurrences.

Sentence 1: 我爱北京天安门
Sentence 2: 我喜欢上海

Directly count the number of occurrences of each character and use the counts as the values:

Sentence 1: 我爱北京天安门
is converted to [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Sentence 2: 我喜欢上海
is converted to [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

CountVectorizer can be used directly in sklearn to achieve this step:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()
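
With the default tokenizer, the learned vocabulary for this corpus is ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], and the last line above returns a 4 x 9 count matrix (one row per document, one column per vocabulary word):

# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]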

N-gram
N-gram is similar to Count Vectors, except that adjacent characters/words are first joined into new tokens and then counted.

If N is 2, then sentence 1 and sentence 2 become:

Sentence 1: 我爱 爱北 北京 京天 天安 安门
Sentence 2: 我喜 喜欢 欢上 上海
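
A small sketch (not from the original post): in sklearn the same idea is exposed through the ngram_range parameter of CountVectorizer/TfidfVectorizer; with analyzer='char' it joins adjacent characters rather than words:

from sklearn.feature_extraction.text import CountVectorizer

# character bigrams, e.g. '我爱', '爱北', '北京', ...
bigram_vec = CountVectorizer(analyzer='char', ngram_range=(2, 2))
X = bigram_vec.fit_transform(['我爱北京天安门', '我喜欢上海'])
print(bigram_vec.get_feature_names())  # all character bigrams from both sentences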

TF-IDF
The TF-IDF score consists of two parts: the first part is the term frequency (TF), and the second part is the inverse document frequency (IDF). The inverse document frequency of a term is obtained by dividing the total number of documents in the corpus by the number of documents containing the term and then taking the logarithm.

$TF(t) = \frac{\text{number of times the term appears in the current document}}{\text{total number of terms in the current document}}$

$IDF(t) = \log_e \frac{\text{total number of documents}}{\text{number of documents containing the term} + 1}$

The more common a term is, the larger the denominator, and the smaller the inverse document frequency, approaching 0. The denominator is incremented by 1 to prevent it from being 0 (i.e. the case where no document contains the term). log means taking the logarithm of the resulting value.

$TF\text{-}IDF = TF * IDF$

It can be seen that TF-IDF is proportional to how often a term appears in a document and inversely proportional to how many documents in the corpus contain the term. Therefore, the algorithm for automatically extracting keywords is straightforward: compute the TF-IDF value of every term in a document, sort the terms in descending order, and take the top few.
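
A minimal sketch of these formulas in plain Python (illustrative only; note that library implementations such as sklearn's TfidfVectorizer use slightly different smoothing):

import math

def tf(term, doc):
    # term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency with the "+1" in the denominator described above
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (n_containing + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [list("我爱北京天安门"), list("我喜欢上海")]
print(tf_idf('我', docs[0], docs))  # '我' appears in both documents, so its score is low
print(tf_idf('海', docs[1], docs))  # '海' appears in only one document, so its score is higher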

Reference: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

3.3 Text classification based on machine learning

F1 score

The F1 score is a metric used in statistics to measure the accuracy of a binary classification model. It takes both the precision and the recall of the model into account and can be regarded as the harmonic mean of the two; its maximum value is 1 and its minimum value is 0.

The F1 score is also called the balanced F score and is defined as the harmonic mean of precision and recall:

$F_1 = 2 * \frac{precision * recall}{precision + recall}$

More generally, the $F_\beta$ score is defined as

$F_\beta = (1+\beta^2) * \frac{precision * recall}{(\beta^2 * precision) + recall}$

Besides the $F_1$ score, the $F_{0.5}$ score and the $F_2$ score are also widely used in statistics. In the $F_{0.5}$ score, precision is given more weight than recall, while in the $F_2$ score, recall is given more weight than precision.
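
A small sketch (with made-up labels, not from the original post) of computing these metrics with sklearn:

from sklearn.metrics import f1_score, fbeta_score

y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]

# macro-averaged F1: compute F1 for each class, then take the unweighted mean
print(f1_score(y_true, y_pred, average='macro'))

# F0.5 emphasizes precision, F2 emphasizes recall
print(fbeta_score(y_true, y_pred, beta=0.5, average='macro'))
print(fbeta_score(y_true, y_pred, beta=2, average='macro'))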

CountVectorizer explained:

Reference: https://blog.csdn.net/weixin_38278334/article/details/82320307

CountVectorizer converts the words in the texts into a word-frequency matrix through the fit_transform function; the matrix element a[i][j] is the number of times word j appears in the i-th text. The vocabulary built from all texts can be inspected with get_feature_names(), and the word-frequency matrix can be viewed with toarray().

from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird'] # “dog cat fish” 为输入列表元素,即代表一个文章的字符串
cv = CountVectorizer()#创建词袋数据结构
cv_fit=cv.fit_transform(texts)
#上述代码等价于下面两行
#cv.fit(texts)
#cv_fit=cv.transform(texts)

print(cv.get_feature_names())    #['bird', 'cat', 'dog', 'fish'] 列表形式呈现文章生成的词典

print(cv.vocabulary_	)              # {‘dog’:2,'cat':1,'fish':3,'bird':0} 字典形式呈现,key:词,value:词频

print(cv_fit)
# (0,3) 1   第0个列表元素,**词典中索引为3的元素**, 词频
#(0,1)1
#(0,2)1
#(1,1)2
#(1,2)1
#(2,0)1
#(2,3)1
#(3,0)1

print(cv_fit.toarray()) #.toarray() 是将结果转化为稀疏矩阵矩阵的表示方式;
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  #每个词在所有文档中的词频
#[2 3 2 2]

For RidgeClassifier , you can refer to:
https://blog.csdn.net/LOLUN9/article/details/106012418/

Next, we compare the performance of different text representation methods by building a local validation set and computing the F1 score.

Count Vectors + RidgeClassifier

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=15000)

# Hyperparameter max_features: defaults to None; if set to an int, terms are ranked by their frequency across the corpus and only the top max_features terms are kept as the vocabulary
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# 0.74

TF-IDF + RidgeClassifier

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=15000)

tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# 0.87
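
As a possible next step (a sketch, not part of the original post; max_features=5000 is just an illustrative value), the TF-IDF hyperparameters can be evaluated with cross-validation instead of a single hold-out split:

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=15000)

# max_features and ngram_range are the main knobs worth tuning
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
train_test = tfidf.fit_transform(train_df['text'])

clf = RidgeClassifier()
# 5-fold cross-validated macro F1 gives a more stable estimate than a single split
scores = cross_val_score(clf, train_test, train_df['label'].values, cv=5, scoring='f1_macro')
print(scores.mean())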


Original post: https://blog.csdn.net/bosszhao20190517/article/details/107583806