[DataWhale Learning Record 15-02] Beginner's Introduction to NLP: News Text Classification Contest, Part 02: Data Reading and Data Analysis

2 Task2 data reading and data analysis

2.1 Goal:

  1. Learn to use pandas to read contest data
  2. Analyze the distribution patterns of the contest data

2.2 Data reading

Note: Although the contest data is text and each news item has a different length, it is still stored in CSV format, so it can be read directly with pandas.

import pandas as pd
train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=100)

The read_csv call here takes three arguments:

  1. The file path, which needs to be changed to your local path; either an absolute or a relative path works;
  2. The separator sep, which is the character that separates the columns; setting it to \t is fine;
  3. The number of rows to read, nrows, a numeric value that limits how many rows are read this time (because the data set is relatively large, it is recommended to set it to 100 first).

Note that the figures reported in the rest of this section (for example, characters that occur in close to 200,000 texts) appear to come from the full training set, so reproducing them requires dropping nrows.
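
As a minimal sketch of how sep and nrows behave, the following reads a tiny in-memory sample instead of the real file. The sample rows, the column order and the name sample_tsv are made up for illustration; only the label and text column names come from the code in this post.

```python
import io

import pandas as pd

# Hypothetical sample in an assumed layout: a header row, then "label<TAB>text",
# where text is a run of space-separated character ids.
sample_tsv = "label\ttext\n2\t57 44 66 56\n11\t4464 486 6352\n3\t7346 4068\n"

# sep="\t" splits the columns on tabs; nrows=2 stops after two data rows.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", nrows=2)

print(sample_df.shape)             # rows read, columns found
print(sample_df.columns.tolist())  # column names taken from the header row
```

Only two of the three data rows are loaded, which is why a small nrows is handy for a quick first look at a large file.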

2.3 Data analysis

Questions to answer about the contest data:

  1. How long are the news texts?
  2. What is the distribution of categories, and which categories are the most common?
  3. What is the character distribution?

2.3.1 Sentence length analysis

In the contest data, the characters of each text are separated by spaces, so the length of each text can be obtained by directly counting its tokens. The statistics are as follows:

%matplotlib inline
import matplotlib.pyplot as plt
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

The output statistics show that each news text contains about 907 characters on average; the shortest text has 2 characters and the longest has 57921.
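
To make the length computation concrete, here is a small sketch on a toy DataFrame. The three texts and the name toy_df are invented for illustration and are not taken from the contest data.

```python
import pandas as pd

# Toy stand-in for train_df; the character ids are made up.
toy_df = pd.DataFrame({"text": ["57 44 66", "4464 486", "7346 4068 5074 3747"]})

# Splitting on a single space yields one token per character id.
toy_df["text_len"] = toy_df["text"].apply(lambda x: len(x.split(" ")))

# describe() reports count, mean, std, min, quartiles and max in one call.
stats = toy_df["text_len"].describe()
print(stats[["mean", "min", "max"]])
```

The mean, min and max rows of describe() correspond to the three figures quoted above.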

The code below plots the text lengths as a histogram; most texts turn out to be shorter than 2000 characters.

_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")
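
The claim read off the histogram can also be checked numerically. The sketch below assumes a Series of lengths like train_df['text_len']; the five values used here are hypothetical.

```python
import pandas as pd

# Hypothetical lengths standing in for train_df['text_len'].
text_len = pd.Series([120, 850, 1900, 2400, 57921])

threshold = 2000
# The mean of a boolean Series is the share of True values.
share_below = (text_len < threshold).mean()
print(f"{share_below:.0%} of texts are shorter than {threshold} characters")

# A quantile gives the length cutoff that covers a chosen share of the texts.
median_len = text_len.quantile(0.5)
print(median_len)
```

A quantile such as 0.95 could serve as a starting point when choosing a maximum text length, although the right value depends on the model.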


2.3.2 News categories

Next, compute the distribution of categories in the data set by counting the number of samples in each news category.

train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")

The statistics show that the categories in the contest data set are distributed rather unevenly. In the training set, technology news is the most common, followed by stock news, while horoscope news is the least common.
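
Besides the bar chart, the imbalance can be expressed as numbers. This is a rough sketch on made-up labels; the label ids and the name imbalance_ratio are illustrative assumptions.

```python
import pandas as pd

# Hypothetical labels standing in for train_df['label'].
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 2])

counts = labels.value_counts()                # samples per class
shares = labels.value_counts(normalize=True)  # fraction of the data per class

# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(imbalance_ratio)
```

A ratio far above 1 signals that a plain accuracy score may hide poor results on the rare classes.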

2.3.3 Character distribution statistics

In this step, the number of occurrences of each character is counted. First, all texts in the training set are concatenated and then split into characters, and the occurrences of each character are counted.

from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)
print(len(word_count))
# 6869
print(word_count[0])
# ('3750', 7482224)
print(word_count[-1])
# ('3133', 1)

The output shows that the training set contains 6869 distinct characters in total; character 3750 appears most often and character 3133 least often.
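
The same counting can be tried on a few toy strings. This sketch uses Counter.most_common, which returns the pairs already sorted by count, as an alternative to the explicit sorted call; the texts are invented for illustration.

```python
from collections import Counter

# Toy texts; the ids mimic the contest format but the content is made up.
toy_texts = ["3750 648 900 3750", "900 3750 12", "3750 648"]

# Concatenate all texts, split into tokens, and count each token.
word_count = Counter(" ".join(toy_texts).split(" "))

# most_common() returns (token, count) pairs in descending order of count.
ranked = word_count.most_common()

print(len(ranked))  # number of distinct tokens
print(ranked[0])    # most frequent token
print(ranked[-1])   # least frequent token
```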

The punctuation marks can also be inferred from the number of texts in which each character appears.

train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:int(d[1]), reverse = True)
print(word_count[0])
# ('3750', 197997)
print(word_count[1])
# ('900', 197653)
print(word_count[2])
# ('648', 191975)

The code above counts, for each character, the number of news texts it appears in. Characters 3750, 900 and 648 each appear in roughly 96% to 99% of the 200,000 news texts, so they are probably punctuation marks.
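
The coverage figure is simply the number of texts containing a character divided by the total number of texts. A minimal sketch on toy data, with invented texts and the illustrative names doc_freq and coverage:

```python
from collections import Counter

# Toy texts standing in for train_df['text'].
toy_texts = ["3750 648 900 3750", "900 3750 12", "3750 648", "12 77"]

# Count each token at most once per text (document frequency).
doc_freq = Counter()
for text in toy_texts:
    doc_freq.update(set(text.split(" ")))

# Share of texts in which each token appears.
coverage = {token: n / len(toy_texts) for token, n in doc_freq.items()}

for token, share in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(token, share)
```

Characters whose coverage is close to 1 occur in almost every text, which is the reasoning used above to guess that they are punctuation.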

2.3.4 Conclusion of data analysis

  1. The analysis above shows that each news item in the contest data contains close to 1000 characters on average (about 907 in the training set), and some news items are much longer;
  2. The news categories are unevenly distributed: technology news has close to 40,000 samples, while horoscope news has fewer than 1,000;
  3. The contest data contains roughly 7000 to 8000 distinct characters in total.

Through data analysis, we can also draw the following conclusions:

  1. The average number of characters per news item is large, so the texts may need to be truncated;
  2. The class imbalance may seriously affect the accuracy of the model.
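
Both conclusions can be illustrated with a short sketch. The helpers truncate and inverse_frequency_weights are hypothetical names introduced here, and inverse-frequency weighting is only one common way to respond to imbalance; the post itself does not prescribe a method.

```python
from collections import Counter


def truncate(text, max_len):
    """Keep only the first max_len space-separated tokens of a text."""
    return " ".join(text.split(" ")[:max_len])


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


short_text = truncate("57 44 66 56 2", 3)
weights = inverse_frequency_weights([0, 0, 0, 1])

print(short_text)
print(weights)
```

Rare classes receive larger weights, so a loss function that uses them would penalize mistakes on those classes more heavily.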

2.3.5 Summary of this chapter

This chapter read the contest data and analyzed the length, categories and characters of the news texts, with visualizations.

2.4 Homework

  1. Assuming that characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article contain on average?
    Result: each news article consists of about 80 sentences on average.
  2. Count the characters that appear most frequently in each type of news.
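import re
from collections import Counter

# Keep each character at most once per text, so the counts below are per-text frequencies.
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines_in_a_class = []
for i in range(14):
    line = ' '.join(train_df[train_df['label'] == i]['text_unique'])
    # \b stops the pattern from removing parts of longer ids such as 9000.
    all_lines_in_a_class.append(re.sub(r'\b(?:3750|900|648)\b', '', line))
for i, line in enumerate(all_lines_in_a_class):
    line = filter(lambda x: x, line.split(' '))
    word_count = Counter(line)
    word_count = sorted(word_count.items(), key=lambda d: int(d[1]), reverse=True)
    # A class can be empty when only a few rows were read (e.g. nrows=100).
    print(i, ':', word_count[0] if word_count else None)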

The output lists the most frequent character for each category.
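
For the first homework question, one possible way to count sentences is sketched here. It assumes that a sentence is any non-empty run of characters between the presumed punctuation ids; this definition and the names PUNCT and count_sentences are assumptions, and the code that produced the figure of 80 is not shown in this post.

```python
# Presumed punctuation ids, taken from the assumption stated in the homework.
PUNCT = {"3750", "900", "648"}


def count_sentences(text):
    """Count non-empty runs of tokens separated by presumed punctuation ids."""
    sentences = 0
    current = 0  # tokens seen since the last punctuation mark
    for token in text.split(" "):
        if token in PUNCT:
            if current:
                sentences += 1
            current = 0
        else:
            current += 1
    if current:  # trailing run without closing punctuation
        sentences += 1
    return sentences


# Toy texts, invented for illustration.
toy_texts = ["1 2 3750 3 900 4 5", "7 648 648 8 3750"]
per_text = [count_sentences(t) for t in toy_texts]
average = sum(per_text) / len(per_text)

print(per_text, average)
```

On the real data the same function could be applied with train_df['text'].apply(count_sentences).mean().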


Origin blog.csdn.net/qq_40463117/article/details/107489719