Tianchi NLP Competition - News Text Classification (2): Data Reading and Data Analysis


Series of articles
Tianchi NLP Competition - News Text Classification (1): Understanding the Competition Problem
Tianchi NLP Competition - News Text Classification (2): Data Reading and Data Analysis


2. Data reading and data analysis

2.1 Data reading

Although the competition data is text and each news article has a variable length, it is stored in CSV format, so Pandas can be used directly to read it.

import pandas as pd
train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=100)

Here, the read_csv call consists of three parts:

  • The path of the file to read; change it to your own local path (relative or absolute paths both work);
  • The separator sep, the character that separates the columns, which should be set to '\t';
  • nrows, the number of rows to read on this call; it takes a numeric value (since the dataset is relatively large, it is recommended to set it to 100 first).
train_df.head()

[Figure: output of train_df.head(), showing the first rows of the training set]
The first column is the news category label, and the second column is the news text, given as space-separated anonymized character indices.
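For the data analysis in the next section, the entire training set is needed rather than the 100-row sample. A minimal sketch, assuming the same local path as above (dropping nrows reads all rows):

import pandas as pd

# Read the full training set (about 200,000 rows); omit nrows to load everything.
train_df = pd.read_csv('../input/train_set.csv', sep='\t')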

2.2 Data analysis

For this step we read the entire training set (as in the full read shown above). Through data analysis we hope to answer the following questions:

  • How long are the news texts in the competition data?
  • What is the category distribution of the data, and which categories are more numerous?
  • How are the characters distributed in the data?

2.2.1 Sentence length analysis

%%time
%pylab inline
# Each text is a sequence of space-separated character indices, so the
# length in characters equals the number of space-separated tokens.
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

Populating the interactive namespace from numpy and matplotlib
count 200000.000000
mean 907.207110
std 996.029036
min 2.000000
25% 374.000000
50% 676.000000
75% 1131.000000
max 57921.000000
Name: text_len, dtype: float64
Wall time: 10.1 s

From these statistics we can see that the text in this competition is relatively long: each article contains 907 characters on average, the shortest has 2 characters, and the longest has 57,921.
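To quantify this, one quick check is the share of articles exceeding a candidate cutoff (a sketch using the text_len column computed above; the 2000-character cutoff is an assumption taken from the histogram below):

# Fraction of articles longer than 2000 characters.
print((train_df['text_len'] > 2000).mean())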

The figure below shows a histogram of the sentence lengths; most articles are within 2,000 characters.

_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")

[Figure: histogram of character counts per article]

2.2.2 News category distribution

Next, we can compute the category distribution of the dataset, counting the number of samples in each news category.

train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")

[Figure: bar chart of the number of samples per news category]

The labels in the dataset correspond to the following categories: {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Current Affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home Furnishing': 8, 'Game': 9, 'Property': 10, 'Fashion': 11, 'Lottery': 12, 'Constellation': 13}

The statistics show that the category distribution of the competition data is quite uneven. In the training set, technology news is the most frequent, followed by stock news, while constellation news is the rarest.
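For readability, the counts can also be printed with category names. A small sketch, assuming the label mapping above (label_names is a helper introduced here, not part of the competition code):

# Map numeric labels back to category names and print the sorted counts.
label_names = {0: 'Technology', 1: 'Stocks', 2: 'Sports', 3: 'Entertainment',
               4: 'Current Affairs', 5: 'Society', 6: 'Education', 7: 'Finance',
               8: 'Home Furnishing', 9: 'Game', 10: 'Property', 11: 'Fashion',
               12: 'Lottery', 13: 'Constellation'}
for label_id, n in train_df['label'].value_counts().items():
    print(f"{label_names[label_id]}: {n}")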

2.2.3 Character distribution statistics

Next, we can count how many times each character occurs. First, concatenate all sentences in the training set, split the result into characters, and count the occurrences of each one.

On the full training set, a total of 6,869 distinct characters appear; character 3750 occurs most often and character 3133 occurs least often. (The output below was produced on the 100-row sample read earlier, which is why it shows only 2,405 distinct characters and a different rarest character.)

from collections import Counter
# Concatenate all texts, split into characters, and count occurrences.
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
# Sort (character, count) pairs by count, most frequent first.
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

print(len(word_count))    # number of distinct characters
print(word_count[0])      # most frequent character
print(word_count[-1])     # least frequent character

2405
('3750', 3702)
('5034', 1)

We can also use per-article character occurrence to infer which characters are punctuation marks. The code below counts, for each character, the number of articles it appears in. Characters 3750, 900, and 648 appear in close to 99% of the 200,000 news articles, so they are very likely punctuation. (Again, the output below comes from the 100-row sample, where each count is out of 100 articles.)

from collections import Counter
# Keep each character at most once per article, so the counts become
# the number of articles that contain the character.
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

print(word_count[0])
print(word_count[1])
print(word_count[2])

('900', 99)
('3750', 99)
('648', 96)
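To turn these article counts into coverage rates, a small sketch (assuming word_count from the snippet above):

# Share of articles that contain each of the top-3 characters.
n_docs = len(train_df)
for char, cnt in word_count[:3]:
    print(f"character {char}: appears in {cnt / n_docs:.1%} of articles")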

2.3 Conclusion of data analysis

Through the above analysis, we can draw the following conclusions:

  1. Each news article in the competition data contains about 1,000 characters on average, and some articles are much longer;
  2. The category distribution of the data is uneven: technology news has close to 40,000 samples, while constellation news has fewer than 1,000;
  3. The corpus contains roughly 7,000-8,000 distinct characters in total (6,869 of them appear in the training set).

From this analysis we can also anticipate the following:

  1. Because the average article is long, the text may need to be truncated before modeling (see the sketch below);
  2. Because the categories are imbalanced, model accuracy may be seriously affected unless this is handled.
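A minimal truncation sketch for conclusion 1, assuming a cutoff of 2000 characters suggested by the length histogram (the exact value is a modeling choice, not given by the competition):

# Keep only the first MAX_LEN characters of each article.
MAX_LEN = 2000  # assumption: cutoff read off the length histogram
train_df['text_trunc'] = train_df['text'].apply(lambda x: ' '.join(x.split(' ')[:MAX_LEN]))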

Source: blog.csdn.net/bosszhao20190517/article/details/107520201