[NLP] Tianchi Learning Competition: News Text Classification with Machine Learning

Table of contents

1. Read data

Check sentence length

2. Visualization

2.1. Word count distribution of news

2.2. News text category statistics

3. Data analysis

3.1. Count the number of occurrences of each character

3.2. Count the number of times different characters appear in sentences

4. Text feature extraction

4.1. CountVectorizer + RidgeClassifier

4.2. TF-IDF + RidgeClassifier

4.3. MultinomialNB + CountVectorizer

4.4. MultinomialNB + TF-IDF

4.5. Drawing


1. Read data 

import pandas as pd
import seaborn as sns

# nrows=100 would read only the first 100 rows; here the full file is loaded
train_df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t')
print(train_df.head())
   label                                               text
0      2  2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1     11  4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2      3  7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3      2  7159 948 4866 2109 5520 2490 211 3956 5520 549...
4      3  3646 3055 3055 2490 4659 6065 3370 5814 2465 5...

Check sentence length 

# Sentence length analysis: each text is a space-separated sequence of character ids
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())
# mean length ≈ 907.2
count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text_len, dtype: float64
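Because the length distribution is heavily right-skewed (mean ≈ 907 vs. median 676, max 57921), a fixed input length, if one is ever needed later, is better taken from a high percentile than from the maximum. A minimal sketch:

# 95% of the news items are no longer than this many characters
print(train_df['text_len'].quantile(0.95))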

2. Visualization 

2.1. Word count distribution of news

As the histogram shows, most news texts contain fewer than 5,000 characters, and very few exceed 10,000.

import matplotlib.pyplot as plt
_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")
plt.show()
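Because of the long tail, most bars are barely visible at linear scale; a log-scaled y-axis variant of the same plot:

_ = plt.hist(train_df['text_len'], bins=200)
plt.yscale('log')  # log scale makes the sparse tail visible
plt.xlabel('Text char count')
plt.title("Histogram of char count (log y)")
plt.show()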

2.2. News text category statistics

train_df['label'].value_counts().plot(kind = 'bar')
plt.title('News class count')
plt.xlabel('category')
plt.show()

The mapping between category names and label ids in the dataset is as follows: {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Current Affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home': 8, 'Games': 9, 'Real Estate': 10, 'Fashion': 11, 'Lottery': 12, 'Horoscope': 13}

  • As the figure shows, news in the technology, stocks, and sports categories account for the largest proportions; the relabeled chart sketched below makes this easier to read.
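The numeric ticks on the bar chart can be replaced with the category names from the mapping above; a small sketch (label_map is simply that dict given a name):

label_map = {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3,
             'Current Affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7,
             'Home': 8, 'Games': 9, 'Real Estate': 10, 'Fashion': 11,
             'Lottery': 12, 'Horoscope': 13}
id_to_name = {v: k for k, v in label_map.items()}  # invert: label id -> name

counts = train_df['label'].value_counts()
counts.index = counts.index.map(id_to_name)  # show names instead of ids
counts.plot(kind='bar')
plt.title('News class count')
plt.show()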

3. Data analysis

3.1. Count the number of occurrences of each character

  • Counter() is a class in the collections module. It counts the occurrences of each distinct element in an iterable such as a list or a string, and behaves like a dictionary mapping {element: count}; a toy demonstration follows.
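A toy demonstration of Counter, separate from the competition data:

from collections import Counter

# Counter returns a dict-like mapping of element -> count
toy = Counter('2967 339 2967 339 2967'.split(' '))
print(toy)                 # Counter({'2967': 3, '339': 2})
print(toy.most_common(1))  # [('2967', 3)]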
# Count the number of occurrences of each character

from collections import Counter

# Join all texts into one space-separated string
all_lines = ' '.join(list(train_df['text']))

# Count the tokens obtained by splitting on spaces;
# Counter behaves like a dict: key = character id, value = count
word_count = Counter(all_lines.split(' '))

# Sort the (character, count) pairs by count in descending order
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

# Number of distinct characters
print('len(word_count) : ', len(word_count))
# Most frequent character and its count
print('word_count[0]:', word_count[0])
# Least frequent character and its count
print('word_count[-1]:', word_count[-1])

'''
len(word_count) :  6869
word_count[0]: ('3750', 7482224)
word_count[-1]: ('3133', 1)'''

# The training set contains 6869 distinct characters in total; character 3750
# appears most often and character 3133 least often.

3.2. Count the number of times different characters appear in sentences

  • set() deduplicates the tokens of each text; note that converting back with list() does not guarantee any particular order.
  • For each news item, split the text on spaces, deduplicate the tokens with set(), and rejoin them with spaces. Counting tokens across these deduplicated texts then gives each character's document frequency (the number of news items it appears in).
  • Characters that appear in almost every news item are likely punctuation. Characters 3750, 900, and 648 each appear in roughly 96–99% of the 200,000 news items (verified after the code below), so they are probably punctuation marks.
# Count, for each character, the number of news items it appears in

train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(' '))

# Sort by document frequency in descending order
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

# Print the three characters with the highest document frequency
print('word_count[0]:', word_count[0])
print('word_count[1]:', word_count[1])
print('word_count[2]:', word_count[2])

'''word_count[0]: ('3750', 197997)
word_count[1]: ('900', 197653)
word_count[2]: ('648', 191975)'''
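The coverage figures quoted above follow directly from these document frequencies, since each character was counted at most once per news item; a quick check:

# Coverage rate = document frequency / total number of news items
n_docs = len(train_df)  # 200000
for char, doc_freq in word_count[:3]:
    print(char, round(doc_freq / n_docs, 4))
# 3750 -> 0.99, 900 -> 0.9883, 648 -> 0.9599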

4. Text feature extraction

4.1. CountVectorizer + RidgeClassifier

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 3))
train_text = vectorizer.fit_transform(train_df['text'])
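A quick sanity check of the resulting matrix (assuming scikit-learn ≥ 1.0, where get_feature_names_out is available; older versions use get_feature_names):

print(train_text.shape)                        # (200000, 3000)
print(vectorizer.get_feature_names_out()[:5])  # a few 1- to 3-gram features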

# CountVectorizer + RidgeClassifier (self-contained version)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split



df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)
# Bag-of-words representation of the text (the feature set):
# max_features=3000 keeps the 3000 most frequent terms in the corpus;
# ngram_range=(1, 3) counts unigrams, bigrams, and trigrams
vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 3))
train_text = vectorizer.fit_transform(df['text'])  # fit on df (15000 rows), not the full train_df, so the rows match df.label

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)


# Fit a ridge classifier on the training split (features and labels)
clf = RidgeClassifier()
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)
f1_score_cv = f1_score(y_val, val_pred, average='macro')
print(f1_score_cv)
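Beyond the single macro-F1 number, per-class scores help spot weak categories; a quick look at the validation split above:

from sklearn.metrics import classification_report

# Per-class precision / recall / F1 on the validation split
print(classification_report(y_val, val_pred))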

4.2. TF-IDF + RidgeClassifier

# TF-IDF + RidgeClassifier
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split  # needed when run standalone


df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)

# TF-IDF features, assigned to train_text so the split below uses them
train_text = TfidfVectorizer(ngram_range=(1, 3), max_features=3000).fit_transform(df.text)

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)


clf = RidgeClassifier()
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)
f1_score_tfidf = f1_score(y_val, val_pred, average='macro')
print(f1_score_tfidf)
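For intuition on what TF-IDF changes relative to raw counts, a toy example on made-up documents (not the competition data):

from sklearn.feature_extraction.text import TfidfVectorizer

# '3750' appears in every document, so its idf, and hence
# its TF-IDF weight, is the lowest
docs = ['3750 648 900', '3750 3750 2465', '3750 648 5778']
tfidf = TfidfVectorizer()
m = tfidf.fit_transform(docs)
print(dict(zip(tfidf.get_feature_names_out(), m.toarray()[0].round(2))))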

4.3. MultinomialNB + CountVectorizer

from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)
# Same bag-of-words features as in 4.1:
# max_features=3000 keeps the 3000 most frequent terms;
# ngram_range=(1, 3) counts unigrams, bigrams, and trigrams
vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 3))
train_text = vectorizer.fit_transform(df['text'])  # fit on df so the rows match df.label

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

clf = MultinomialNB()
clf.fit(X_train, y_train)

val_pre_CountVec_NBC = clf.predict(X_val)
score_f1_CountVec_NBC = f1_score(y_val, val_pre_CountVec_NBC, average='macro')

print('CountVec + MultinomialNB : %.4f' % score_f1_CountVec_NBC)
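MultinomialNB's Laplace-smoothing parameter alpha (default 1.0) can noticeably shift the macro F1; a small sweep sketch with illustrative, untuned values:

# Illustrative (untuned) sweep of the smoothing parameter
for alpha in (0.01, 0.1, 1.0):
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X_train, y_train)
    print(alpha, f1_score(y_val, clf.predict(X_val), average='macro'))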

4.4. MultinomialNB + TF-IDF

df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)

# TF-IDF features, assigned to train_text so the split below uses them
train_text = TfidfVectorizer(ngram_range=(1, 3), max_features=3000).fit_transform(df.text)

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

clf = MultinomialNB()
clf.fit(X_train, y_train)

val_pre_tfidf_NBC = clf.predict(X_val)
score_f1_tfidf_NBC = f1_score(y_val, val_pre_tfidf_NBC, average='macro')

print('TF-IDF + MultinomialNB : %.4f' % score_f1_tfidf_NBC)

4.5. Drawing

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

scores = [f1_score_cv, f1_score_tfidf, score_f1_CountVec_NBC, score_f1_tfidf_NBC]
x_ticks = np.arange(4)
x_ticks_label = ['CountVec_RidgeClassifier', 'tfidf_RidgeClassifier', 'CountVec_NBC', 'tfidf_NBC']
plt.plot(x_ticks, scores)
plt.xticks(x_ticks, x_ticks_label, fontsize=8)  # set tick labels and font size
plt.ylabel('F1_score')
plt.show()
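Since the x-axis is categorical, a bar chart may read more naturally than a line plot; an alternative sketch of the same comparison:

plt.bar(x_ticks, scores)
plt.xticks(x_ticks, x_ticks_label, fontsize=8, rotation=15)
plt.ylabel('F1_score')
plt.show()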

Origin: blog.csdn.net/m0_51933492/article/details/127119673