NLP Practice (News Text Classification): Data Reading and Data Analysis

Reading the data

Although the competition data is text and each news item has a variable length, it is stored in CSV format. We can therefore use Pandas directly to read it.

import pandas as pd

train_df = pd.read_csv('train_set.csv', sep='\t', nrows=100)
# read_csv takes three arguments here: the file path, the field separator,
# and the number of rows to read (the training set is large, so we read
# only a small part first).

After reading the data with Pandas, we usually want to check that it was read correctly, which is what the head() function in Pandas is for.

This also explains why only 5 rows of data are displayed: it is the default behavior of the head() function.

DataFrame.head(n=5)
    Return the first n rows.

    Parameters:
        n : int, default 5
            Number of rows to select.

    Returns:
        obj_head : type of caller
            The first n rows of the caller object.

As shown, head() returns 5 rows by default.

In the data above, the first column is the news category and the second column is the news text, encoded as character indices.
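As a quick sketch of head(), here is the same pattern on a made-up mini DataFrame in the same label/text format as train_set.csv (the labels and character indices below are invented for illustration only):

```python
import pandas as pd

# Hypothetical mini sample in the same two-column format as train_set.csv;
# the character indices here are made up for illustration.
sample = pd.DataFrame({
    'label': [2, 11, 3, 2, 3, 9],
    'text': ['57 44 66', '4464 486', '7399 3300',
             '7010 5310', '2000 648', '3750 900']
})

first_rows = sample.head()   # default n=5, so only 5 of the 6 rows are shown
print(first_rows)
print(len(first_rows))
```

Passing an explicit n, e.g. sample.head(2), returns just the first two rows instead.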

Data analysis

Having read the data set, we can also analyze it. Although heavy data analysis is not essential for unstructured data, some patterns can still be discovered this way.

For this step, we read the entire training set. Through data analysis we hope to answer the following questions:

  • What is the length of the news text in the data?
  • What is the category distribution of the data, and which categories are more numerous?
  • What is the distribution of characters in the data?

Here, we first analyze the sentence length.

Sentence length analysis

In the competition data, the characters of each sentence are separated by spaces, so the length of each sentence can be obtained by simply counting the tokens. The statistics are computed as follows:

%matplotlib inline
import matplotlib.pyplot as plt

train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())


From these statistics we can see that the texts in this competition are relatively long: each sentence contains 907 characters on average, the shortest sentence has length 2, and the longest has length 57921.

The following code draws a histogram of sentence lengths; most sentences are shorter than 2000 characters.

_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")


News category distribution

Next, we can look at the category distribution of the data set by counting the number of samples in each news category:

train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")


The labels in the data set correspond to categories as follows: {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Current affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home Furnishing': 8, 'Game': 9, 'Property': 10, 'Fashion': 11, 'Lottery': 12, 'Constellation': 13}

The statistics show that the category distribution of the competition data is quite uneven: technology news is the most common in the training set, followed by stock news, while constellation news is the rarest.
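The degree of imbalance can be put into a single number by comparing the largest and smallest class counts. A minimal sketch, using a made-up label column in place of the real train_set.csv labels:

```python
import pandas as pd

# Hypothetical label column; the real counts come from train_set.csv.
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 2, 2, 13])

counts = labels.value_counts()        # samples per category, descending
imbalance = counts.max() / counts.min()
print(counts)
print(imbalance)
```

On the real data the same ratio (technology vs. constellation news) is far larger, which is why imbalance handling matters later.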

Character distribution statistics

Next, we can count how many times each character occurs. First, all sentences in the training set are concatenated and split into characters, and the occurrences of each character are counted:

from collections import Counter

all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)


From the statistics we can see that the training set contains 6869 distinct characters in total; character 3750 occurs most often and character 3133 occurs least often.
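These figures fall out directly from the sorted counter: its length is the vocabulary size and its first entry is the most frequent character. A sketch on a toy stand-in for the training texts (the real numbers above come from the full corpus):

```python
from collections import Counter

# Toy stand-in for the training texts; the real figures (6869 characters,
# top character 3750) come from the full corpus.
texts = ['3750 648 900 3750', '3750 900 2000', '648 3750']

all_lines = ' '.join(texts)
word_count = sorted(Counter(all_lines.split(' ')).items(),
                    key=lambda d: d[1], reverse=True)

vocab_size = len(word_count)       # number of distinct characters
most_common = word_count[0]        # (character, count) with the highest count
print(vocab_size, most_common)
```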

We can also try to infer which characters are punctuation from how they appear across sentences. The following code counts, for each character, the number of sentences it appears in. The coverage rate of characters 3750, 900 and 648 across the 200,000 news items is close to 99%, so they are very likely punctuation marks.

train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

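The coverage rate itself (fraction of documents a character appears in) can be computed the same way. A sketch on made-up toy documents, where `doc_freq` and `coverage` are names introduced here for illustration:

```python
from collections import Counter

# Toy documents; on the real data the same computation shows characters
# 3750, 900 and 648 covering close to 99% of the 200,000 news items.
texts = ['3750 648 900', '3750 900 2000', '648 3750 3750']

# Count each character at most once per document (document frequency)
doc_freq = Counter()
for t in texts:
    doc_freq.update(set(t.split(' ')))

coverage = {ch: cnt / len(texts) for ch, cnt in doc_freq.items()}
print(coverage['3750'])   # appears in every toy document
```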

Count the most frequent characters

Here we write a function that prints the n most frequent characters for a given news category:

def TopWord(type_, n):
    # Join all texts belonging to one category, then count characters.
    all_lines = " ".join([train_df["text"][i] for i in range(len(train_df["text"]))
                          if train_df["label"][i] == type_])
    word_count = Counter(all_lines.split(" "))
    word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
    for i in range(n):
        print(word_count[i])

TopWord(2, 3)  # top 3 characters in category 2 (sports)

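The same per-category count can also be written more idiomatically with boolean indexing and Counter.most_common. This is a sketch on a made-up mini data set, and `top_words` is a hypothetical helper name:

```python
from collections import Counter
import pandas as pd

# Hypothetical mini training set standing in for train_set.csv.
df = pd.DataFrame({
    'label': [2, 2, 0],
    'text': ['3750 648 3750', '3750 900', '57 44']
})

def top_words(df, type_, n):
    """Return the n most frequent characters for one news category."""
    joined = ' '.join(df.loc[df['label'] == type_, 'text'])
    return Counter(joined.split(' ')).most_common(n)

print(top_words(df, 2, 2))
```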

Data analysis conclusion

Through the above analysis, we can draw the following conclusions:

  1. Each news item in the competition data contains about 1000 characters on average, and some news texts are much longer;
  2. The category distribution is uneven: technology news has nearly 40,000 samples while constellation news has fewer than 1,000;
  3. The data set contains roughly 7000-8000 distinct characters in total.

From this analysis we can also conclude:

  1. Since each news item is long on average, the texts may need to be truncated;
  2. Because the categories are imbalanced, model accuracy may be seriously affected.

Summary

This post covered data reading and data analysis: the Pandas library was used to read the data, the competition data was analyzed, and the news sentence lengths, category distribution and character distribution were visualized.


Origin blog.csdn.net/weixin_45696161/article/details/107503859