Text classification based on machine learning

Machine learning model

Machine learning is the study of computer algorithms that improve automatically through experience. Training a model from historical data corresponds to the way humans distill experience into rules; using the trained model to predict on new data corresponds to applying those rules to new problems.

Machine learning has many branches. A learner should first understand how machine learning algorithms are categorized, and then study one algorithm in depth. Because there are so many branches and details, getting absorbed in the details from the start makes it hard to see the overall picture.

If you are a beginner in machine learning, you should know the following things:

  • Machine learning can solve certain problems, but do not expect it to be omnipotent;
  • There are many kinds of machine learning algorithms; choose according to what the specific problem requires;
  • Each machine learning algorithm has its own preferences (inductive biases), so each problem needs to be analyzed on its own terms.

Text representation methods, Part 1

During the training of a machine learning algorithm, suppose we are given N samples, each with M features; together they form an N × M sample matrix, on which training and prediction are carried out. Similarly, in computer vision the pixels of an image can be regarded as features, and each image can be viewed as a feature map of shape height × width × 3, a three-dimensional matrix that is fed into the computer for calculation.
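
As a concrete illustration of these shapes, a minimal numpy sketch (the array contents here are random placeholders):

import numpy as np

# N x M sample matrix: N samples (rows), M features (columns)
X = np.random.rand(100, 20)
print(X.shape)    # (100, 20)

# an RGB image as a height x width x 3 tensor
img = np.random.rand(224, 224, 3)
print(img.shape)  # (224, 224, 3)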

In the field of natural language, however, the above approach is not feasible, because text is of variable length. Methods that represent text as numbers or vectors a computer can operate on are generally called word embedding methods. Word embedding transforms variable-length text into a fixed-length space, and it is the first step in text classification.

One-hot

One-hot here is the same operation as in data mining tasks: each character/word is represented by a discrete vector. Concretely, each character/word is encoded as an index, and values are assigned according to that index.

Examples of one-hot representation methods are as follows:

Sentence 1: 我 爱 北 京 天 安 门 ("I love Beijing Tiananmen")
Sentence 2: 我 喜 欢 上 海 ("I like Shanghai")

First, index every character across all sentences, i.e., assign a number to each character:

{
    '我': 1, '爱': 2, '北': 3, '京': 4, '天': 5,
    '安': 6, '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11
}

A total of 11 characters are included here, so each character can be converted into an 11-dimensional sparse vector:

我:[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
爱:[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...
海:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
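
A minimal sketch of this indexing-and-assignment step (hand-rolled for illustration; the variable names are my own):

sentences = ['我爱北京天安门', '我喜欢上海']

# build the index: one integer per distinct character, starting from 1
word2idx = {}
for sentence in sentences:
    for ch in sentence:
        if ch not in word2idx:
            word2idx[ch] = len(word2idx) + 1

def one_hot(ch):
    # a sparse vector with a single 1 at the character's index
    vec = [0] * len(word2idx)
    vec[word2idx[ch] - 1] = 1
    return vec

print(one_hot('我'))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot('海'))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
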
Bag of Words

Bag of Words, also known as Count Vectors, represents each document by the number of occurrences of each character/word.

Simply count how many times each character occurs and assign the counts as values:

Sentence 1: 我 爱 北 京 天 安 门
converted to [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Sentence 2: 我 喜 欢 上 海
converted to [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

In sklearn, CountVectorizer can be used directly to implement this step:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# learn the vocabulary and count the occurrences of each term per document
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).toarray())
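
For reference, the snippet above yields the following vocabulary and count matrix (this is the standard example from the sklearn documentation; get_feature_names_out() is the sklearn >= 1.0 name, older versions expose get_feature_names()):

print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
# count matrix, one row per document:
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]
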
N-gram

N-gram is similar to Count Vectors, except that adjacent characters are combined to form new terms, which are then counted.

If N is 2, sentence 1 and sentence 2 become:

Sentence 1: 我爱 爱北 北京 京天 天安 安门
Sentence 2: 我喜 喜欢 欢上 上海
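
A minimal sketch of how the bigrams above can be generated (a hand-rolled helper for illustration; in sklearn, CountVectorizer(analyzer='char', ngram_range=(2, 2)) produces character n-gram counts directly):

def char_ngrams(text, n=2):
    # slide a window of size n over the characters and join each window
    return [''.join(text[i:i + n]) for i in range(len(text) - n + 1)]

print(char_ngrams('我爱北京天安门'))  # ['我爱', '爱北', '北京', '京天', '天安', '安门']
print(char_ngrams('我喜欢上海'))      # ['我喜', '喜欢', '欢上', '上海']
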
TF-IDF

The TF-IDF score consists of two parts: the first is term frequency (TF), and the second is inverse document frequency (IDF). The inverse document frequency is computed by dividing the total number of documents in the corpus by the number of documents that contain the term, and then taking the logarithm.

TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
IDF(t) = log_e(total number of documents / number of documents containing term t)
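
A minimal sketch of the two formulas in Python, using the two example sentences from above as a toy corpus (the helper names are my own):

import math

def tf(term, doc):
    # term frequency: occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log_e(total docs / docs containing the term)
    return math.log(len(docs) / sum(1 for doc in docs if term in doc))

docs = [['我', '爱', '北', '京', '天', '安', '门'],
        ['我', '喜', '欢', '上', '海']]

print(tf('我', docs[0]))                    # 1/7 ≈ 0.1429
print(idf('我', docs))                      # log(2/2) = 0.0, '我' appears in every document
print(tf('爱', docs[0]) * idf('爱', docs))  # (1/7) * log(2/1) ≈ 0.0990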

Text classification based on machine learning

Next, we compare the accuracy of different text representation methods by building a local validation set and computing the macro F1 score.

Count Vectors + RidgeClassifier
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# read the first 15,000 rows of the training set
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)

# bag-of-words features, keeping the 3,000 most frequent terms
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

# train on the first 10,000 samples, validate on the remaining 5,000
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

TF-IDF + RidgeClassifier
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# read the first 15,000 rows of the training set
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)

# TF-IDF features over unigrams, bigrams and trigrams, capped at 3,000 terms
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

# train on the first 10,000 samples, validate on the remaining 5,000
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
