NLP News Text Classification - Task 3

Task 3: Text Classification Based on Machine Learning

1. Text representation methods

This is where word2vec comes in: introductions to word2vec usually start from one-hot encoding, so one-hot is covered first below. For an article that explains the essence of text representation (mainly word2vec) in detail, see:
Understand the essence of word2vec in seconds

1.1 One-hot
One-hot here is consistent with the operation in the data mining task: each character/word is represented by a discrete vector. Specifically, each character/word is encoded as an index, and the vector is assigned a value according to that index.
An example of one-hot representation is as follows:
Sentence 1: 我爱北京天安门 (I love Beijing Tiananmen)
Sentence 2: 我喜欢上海 (I like Shanghai)
First, index all the characters in the sentences, i.e., assign each character a unique number:

{ '我': 1, '爱': 2, '北': 3, '京': 4, '天': 5, '安': 6, '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11 }

There are 11 characters in total, so each character can be converted into an 11-dimensional sparse vector:



我 (I):    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
爱 (love): [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...
海 (sea):  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
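
To make the mapping concrete, here is a minimal sketch in plain Python that builds these vectors from the index table above (the vocab dict and one_hot helper are illustrative, not part of any library):

vocab = {'我': 1, '爱': 2, '北': 3, '京': 4, '天': 5, '安': 6,
         '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11}

def one_hot(char, vocab):
    # Return an 11-dimensional one-hot vector for a single character;
    # the indices in vocab start at 1, so subtract 1 for the position.
    vec = [0] * len(vocab)
    vec[vocab[char] - 1] = 1
    return vec

print(one_hot('我', vocab))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot('海', vocab))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]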

1.2 Bag of Words
Bag of Words, also known as Count Vectors, represents each document by the number of times each character/word occurs in it.
Sentence 1: 我爱北京天安门 (I love Beijing Tiananmen)
Sentence 2: 我喜欢上海 (I like Shanghai)
Counting the occurrences of each character directly and using the counts as values:
Sentence 1: 我爱北京天安门
is converted to [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
Sentence 2: 我喜欢上海
is converted to [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
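
The same counts can be produced with scikit-learn's CountVectorizer, which section 2 below also uses. A small sketch on the toy corpus (analyzer='char' makes it split on single characters; get_feature_names_out requires scikit-learn 1.0+, older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['我爱北京天安门', '我喜欢上海']
vectorizer = CountVectorizer(analyzer='char')  # count single characters
counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())      # learned character vocabulary
print(counts.toarray())                        # one count vector per sentence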

1.3 N-gram
N-gram is similar to Count Vectors, except that adjacent characters are joined into new tokens, which are then counted.
If N is 2, sentence 1 and sentence 2 become:
Sentence 1: 我爱 爱北 北京 京天 天安 安门
Sentence 2: 我喜 喜欢 欢上 上海
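
CountVectorizer can produce these bigram features as well via its ngram_range parameter; a minimal sketch on the same toy corpus:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['我爱北京天安门', '我喜欢上海']
bigram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))  # character bigrams only
bigram_counts = bigram_vectorizer.fit_transform(corpus)
print(bigram_vectorizer.get_feature_names_out())
# each adjacent character pair (我爱, 爱北, 北京, ...) becomes one feature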

1.4 TF-IDF
A TF-IDF score consists of two parts: term frequency (TF) and inverse document frequency (IDF). The IDF is computed by dividing the total number of documents in the corpus by the number of documents containing the term, and then taking the logarithm.

TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
IDF(t) = log_e(total number of documents / number of documents containing term t)
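
A quick worked example of these two formulas with hypothetical numbers (a 100-word document containing the term 5 times, in a corpus of 1000 documents of which 10 contain the term):

import math

tf = 5 / 100               # TF(t) = 5 / 100 = 0.05
idf = math.log(1000 / 10)  # IDF(t) = log_e(100) ≈ 4.605
print(tf * idf)            # TF-IDF score ≈ 0.230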

2. Text classification based on machine learning

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer  # Count Vectors + RidgeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF + RidgeClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Load the data
train_df = pd.read_csv('datalab/72510/train_set.csv', sep='\t', nrows=15000)

# Method 1: Count Vectors + RidgeClassifier
vectorizer = CountVectorizer(max_features=3000)  # keep only the 3000 most frequent tokens
train_test = vectorizer.fit_transform(train_df['text'])  # count the occurrences of each token
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])  # fit on the first 10,000 samples
val_pred = clf.predict(train_test[10000:])  # predict on the remaining 5,000 samples
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))  # macro F1 score
# Output: 0.654418775812

# Method 2: TF-IDF + RidgeClassifier
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)  # features are 1- to 3-grams
train_test = tfidf.fit_transform(train_df['text'])
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# Output: 0.87193721737

Origin blog.csdn.net/DZZ18803835618/article/details/107570891