Datawhale Zero-Based Introduction to NLP - News Text Classification: Task02


Task01 analyzed the competition problem; this task performs data reading and data analysis, both carried out with the Pandas library.

1 Data Reading

Given the format of the competition data, train_set.csv can be read with read_csv:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the full dataset
train_df = pd.read_csv('./data/data45216/train_set.csv', sep='\t')
train_df.shape

# Read only part of the dataset
train_df = pd.read_csv('./data/data45216/train_set.csv', sep='\t', nrows=100)
train_df.shape

Parameters: sep='\t' sets the column separator to a tab; nrows=100 reads only the first 100 rows.

Pandas can also read data in SQL, Excel, table, HTML, JSON and other formats.
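As a small illustration of another reader (the data here is made up for demonstration), a DataFrame can be serialized to JSON and read back with read_json:

```python
import io
import pandas as pd

# Toy round trip: serialize a DataFrame to JSON, then read it back
df = pd.DataFrame({'label': [0, 1], 'text': ['57 44 66', '21 33']})
json_str = df.to_json()
loaded = pd.read_json(io.StringIO(json_str))
print(loaded.shape)  # (2, 2)
```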

2 Data Analysis

2.1 Calculate the length of the news text

In the competition data, the characters of each sentence are separated by spaces, so the length of each text can be obtained by counting the number of space-separated tokens.

train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

The output shows that the average text length is 907 characters, the shortest text contains 2 characters, and the longest contains 57921:

View the histogram of sentence length:

_ = plt.hist(train_df['text_len'],bins=50)
plt.xlabel('Text char count')
plt.title('Histogram of char count')

Output result:

2.2 View the category distribution of the question data

View the distribution of each news category by drawing a histogram.

train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel('category')

The output shows that classes 0, 1 and 2 account for most of the news, while class 13 has the fewest samples. The category-to-label mapping is: {'Technology': 0, 'Stock': 1, 'Sports': 2, 'Entertainment': 3, 'Current Affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home Furnishing': 8, 'Game': 9, 'Property': 10, 'Fashion': 11, 'Lottery': 12, 'Constellation': 13}.
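The mapping above can be stored as a dict to attach readable names to the numeric labels (a small illustrative snippet; the mapping itself comes from the competition description):

```python
# Invert the competition's category mapping: numeric label -> readable name
label2name = {0: 'Technology', 1: 'Stock', 2: 'Sports', 3: 'Entertainment',
              4: 'Current Affairs', 5: 'Society', 6: 'Education', 7: 'Finance',
              8: 'Home Furnishing', 9: 'Game', 10: 'Property', 11: 'Fashion',
              12: 'Lottery', 13: 'Constellation'}

# With train_df loaded as above, one could then do:
# train_df['label_name'] = train_df['label'].map(label2name)
print(label2name[0])  # Technology
```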

2.3 Character distribution

Count how often each character occurs: concatenate all texts into one string, split it into characters, and count each one. The statistics show that characters 3750, 900 and 648 appear most frequently, so they can be inferred to be punctuation marks.

from collections import Counter

# Join all texts into a single string
all_lines = ' '.join(list(train_df['text']))
print(len(all_lines))
# Count the occurrences of each character
word_count = Counter(all_lines.split(" "))
# Sort by frequency in descending order
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
print(len(word_count))
print(word_count[0])
print(word_count[-1])

Next, deduplicate the characters within each text first (so each character is counted at most once per document), then concatenate and count. This gives, for each character, the number of documents it appears in:

train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(' '))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
print(len(word_count))
print(word_count[0])
print(word_count[-1])

Analysis conclusions:

1. Each news text contains on average more than 900 characters, and some texts are much longer, so truncation may be needed;

2. The news categories are unevenly distributed, which will affect the accuracy of the model.
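The truncation mentioned in point 1 can be sketched as below (the data and the cut-off length are toy values for illustration; a real cut-off would be chosen from the length distribution):

```python
import pandas as pd

# Toy data: space-separated character sequences, as in the competition
df = pd.DataFrame({'text': ['1 2 3 4 5 6', '7 8']})
max_len = 4  # illustrative cut-off
# Keep only the first max_len characters of each text
df['text_cut'] = df['text'].apply(lambda x: ' '.join(x.split(' ')[:max_len]))
print(df['text_cut'].tolist())  # ['1 2 3 4', '7 8']
```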

3 Homework

(1) Assuming that the characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article contain on average?

Method 1: using a for loop

flaglist1 = []
flaglist2 = []
flaglist3 = []
for i in range(train_df['text'].shape[0]):
    tokens = train_df['text'].loc[i].split(' ')
    flaglist1.append(tokens.count('3750'))
    flaglist2.append(tokens.count('900'))
    flaglist3.append(tokens.count('648'))
flaglist = list(map(lambda x: x[0] + x[1] + x[2], zip(flaglist1, flaglist2, flaglist3)))
train_df['flag_freq'] = flaglist
train_df['flag_freq'].mean()

Method 2: using Counter

print(len(train_df['text']))
# The 'text' column is already space-separated, so it can be used directly
strlist1 = []
strlist2 = []
strlist3 = []
for i in range(train_df['text'].shape[0]):
    # Count the occurrences of each character in this text
    word_count = Counter(train_df['text'].loc[i].split(' '))
    strlist1.append(word_count['3750'])
    strlist2.append(word_count['900'])
    strlist3.append(word_count['648'])

flaglist = list(map(lambda x: x[0] + x[1] + x[2], zip(strlist1, strlist2, strlist3)))
train_df['flag_freq'] = flaglist
train_df['flag_freq'].mean()
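The same average can also be computed with a single apply per row (a compact alternative sketch; the toy data below is made up to show the idea):

```python
import pandas as pd

# Toy data: two texts with the inferred punctuation characters mixed in
df = pd.DataFrame({'text': ['12 3750 56 900 12 648', '99 3750 88']})
seps = {'3750', '900', '648'}
# Count how many separator characters each text contains
df['sent_count'] = df['text'].apply(lambda x: sum(t in seps for t in x.split(' ')))
print(df['sent_count'].mean())  # 2.0
```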

(2) Count the most frequent character in each news category

Method 1: grouping with groupby

groupdata = train_df.groupby('label')
print(groupdata.size())

# The most frequent character in each news category
max_freq = []
for i in range(len(groupdata.size())):
    df = groupdata.get_group(i)['text']
    all_lines = ' '.join(list(df))
    word_count = Counter(all_lines.split(' '))
    # Drop the inferred punctuation marks
    del word_count['3750']
    del word_count['900']
    del word_count['648']
    word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
    print(word_count[0][0])
    max_freq.append(word_count[0][0])

Method 2: using Pandas categorical data

train_df['new_label'] = pd.cut(train_df['label'], bins=range(-1, 14),
                               labels=[str(i) for i in range(14)])
train_df.set_index('new_label').sort_index(ascending=False).head()

max_freq = []
for i in range(14):
    df = train_df[train_df['new_label'] == str(i)]['text']
    all_lines = ' '.join(list(df))
    word_count = Counter(all_lines.split(' '))
    # Drop the inferred punctuation marks
    del word_count['3750']
    del word_count['900']
    del word_count['648']
    word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
    print(word_count[0][0])
    max_freq.append(word_count[0][0])

Further thinking: how can the class-imbalance problem be addressed?
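One common remedy is to weight classes inversely to their frequency when training. A minimal sketch with toy labels (the formula n / (k * n_c) matches the "balanced" weighting used by common ML libraries):

```python
import pandas as pd

# Toy imbalanced labels: class 0 dominates, class 2 is rare
labels = pd.Series([0, 0, 0, 1, 1, 2])
counts = labels.value_counts()
n, k = len(labels), len(counts)
# Weight each class inversely to its frequency
class_weight = {c: n / (k * cnt) for c, cnt in counts.items()}
print(class_weight)  # rare class 2 gets the largest weight, 2.0
```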



Origin blog.csdn.net/qq_28409193/article/details/107506625