[DataWhale Learning Record 15-03] Beginner's Introduction to NLP: News Text Classification Competition, Task 03: Text Classification Based on Machine Learning

3 Task3 text classification based on machine learning

3.1 Learning objectives

  1. Learn the principle and use of TF-IDF
  2. Use sklearn's machine learning models to complete text classification

3.2 Text classification method Part1

In natural language processing, text has variable length. Methods that represent text as numbers or vectors a computer can operate on are generally called word embedding methods. Word embedding maps variable-length text into a fixed-length space and is the first step in text classification.

3.2.1 One-hot

Each character is represented by a discrete vector (the same operation as one-hot encoding in data mining tasks). Specifically, each character/word is mapped to an index, and the vector entry at that index is set to 1 while all other entries are 0.

Examples of representation methods are as follows:

Sentence 1: 我 爱 北 京 天 安 门 (I love Beijing Tiananmen)
Sentence 2: 我 喜 欢 上 海 (I like Shanghai)

First, index the words of all sentences, that is, determine a number for each word:

{
'我': 1, '爱': 2, '北': 3, '京': 4, '天': 5,
'安': 6, '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11
}

There are 11 distinct characters in total, so each character can be converted into an 11-dimensional sparse vector:

我:[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
爱:[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...
海:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
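
The indexing and encoding steps above can be sketched in a few lines of plain Python. This is only an illustration: the names `vocab` and `one_hot` are made up for this example, and the positions here are 0-based, so they are one less than the indices in the dictionary above.

```python
# Build a character-to-position vocabulary in order of first appearance,
# then turn each character into a one-hot vector.
sentences = ["我爱北京天安门", "我喜欢上海"]

vocab = {}
for sentence in sentences:
    for ch in sentence:
        if ch not in vocab:
            vocab[ch] = len(vocab)  # 0-based position in the vector


def one_hot(ch):
    """Return a vector with a single 1 at the character's position."""
    vec = [0] * len(vocab)
    vec[vocab[ch]] = 1
    return vec


print(len(vocab))     # 11
print(one_hot("我"))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("海"))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```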

3.2.2 Bag of Words

This is the bag-of-words representation, also known as Count Vectors. Each document is represented by the number of times each character/word occurs in it.

Sentence 1: 我 爱 北 京 天 安 门
Sentence 2: 我 喜 欢 上 海

Directly count the number of occurrences of each word and assign values:

Sentence 1: 我 爱 北 京 天 安 门
is converted to [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
Sentence 2: 我 喜 欢 上 海
is converted to [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
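
As a rough sketch of how these count vectors come about, the counting can be done by hand with `collections.Counter`. The helper name `count_vector` is hypothetical and the vocabulary order is simply order of first appearance, which is an assumption chosen to match the example.

```python
from collections import Counter

sentences = ["我爱北京天安门", "我喜欢上海"]

# Vocabulary in order of first appearance across all sentences
vocab = []
for sentence in sentences:
    for ch in sentence:
        if ch not in vocab:
            vocab.append(ch)


def count_vector(sentence):
    """Count how often each vocabulary character occurs in the sentence."""
    counts = Counter(sentence)
    return [counts[ch] for ch in vocab]  # Counter returns 0 for missing keys


bow = [count_vector(s) for s in sentences]
print(bow[0])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(bow[1])  # [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Unlike one-hot vectors, an entry can exceed 1 when a character repeats within a sentence.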

CountVectorizer can be used directly in sklearn to achieve this step:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()

3.2.3 N-gram

N-gram is similar to Count Vectors, except that adjacent words are combined into new words and counted.

If the value of N is 2, then sentence 1 and sentence 2 become:

Sentence 1: 我爱 爱北 北京 京天 天安 安门
Sentence 2: 我喜 喜欢 欢上 上海
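
The sliding-window idea behind this can be sketched as follows. The function name `char_ngrams` is hypothetical, and the sketch works on characters, which is an assumption that matches the example sentences.

```python
def char_ngrams(sentence, n=2):
    """Return all runs of n adjacent characters, in order."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


print(char_ngrams("我爱北京天安门"))  # ['我爱', '爱北', '北京', '京天', '天安', '安门']
print(char_ngrams("我喜欢上海"))      # ['我喜', '喜欢', '欢上', '上海']
```

The resulting n-grams are then counted in the same way as single characters in the bag-of-words representation.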

3.2.4 TF-IDF

The TF-IDF score has two parts: the term frequency (TF) and the inverse document frequency (IDF). The inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents containing the term. The TF-IDF score of a term is the product of the two.

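TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
IDF(t) = log_e(total number of documents / number of documents containing term t)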
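
The two formulas can be checked on the earlier example sentences with a small sketch. The function names are hypothetical, documents are assumed to be lists of characters, and the term is assumed to occur in at least one document. Note that sklearn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this textbook version.

```python
import math


def tf(term, doc):
    """Occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (number of documents / documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


docs = [list("我爱北京天安门"), list("我喜欢上海")]

# '我' appears in every document, so its IDF (and TF-IDF) is 0
print(tf_idf("我", docs[0], docs))
# '爱' appears in only one document, so it gets a positive score
print(tf_idf("爱", docs[0], docs))
```

A character that appears in every document carries no discriminating information, which is why its score drops to zero.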

3.3 Text classification based on machine learning

Next, we compare different text representation methods by holding out a validation set locally and computing the F1 score on it.

3.3.1 Count Vectors + RidgeClassifier

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
# Load the first 15,000 rows of the tab-separated training set
train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=15000)
# Count features, keeping the 3,000 most frequent tokens
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])
# Train on the first 10,000 rows, validate on the remaining 5,000
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# 0.74
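
To make the `average='macro'` argument concrete, here is a minimal hand-written sketch of macro-averaged F1: compute F1 for each class separately, then take the unweighted mean. The function name `macro_f1` and the toy labels are made up for illustration; in practice `sklearn.metrics.f1_score` should be used.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when undefined
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_f1(y_true, y_pred))  # per-class F1: 0.5, 0.8, 2/3
```

Because every class counts equally, rare classes affect the macro score as much as frequent ones.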

3.3.2 TF-IDF + RidgeClassifier

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
# Load the first 15,000 rows of the tab-separated training set
train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=15000)
# TF-IDF features over unigrams, bigrams and trigrams, keeping the 3,000 most frequent
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])
# Train on the first 10,000 rows, validate on the remaining 5,000
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# 0.87

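3.4 Summary of this chapter

This chapter introduced text classification based on machine learning and compared two text representations with the same RidgeClassifier. On the 15,000-row sample, TF-IDF reached a macro F1 of about 0.87, versus 0.74 for Count Vectors.

3.5 Homework in this chapter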

  1. Try to change the parameters of TF-IDF and verify the accuracy.

  2. Try to use other machine learning models to complete training and verification.


Origin blog.csdn.net/qq_40463117/article/details/107557981