Data reading and data analysis
Read data
Although the data for this task is text, with each news article having a variable length, it is still stored in CSV format. Therefore, Pandas can be used directly to read the data.
import pandas as pd
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=100)
# read_csv takes three arguments here: the file path, the field separator,
# and the number of rows to read (the training set is large, so we read only a small portion first).
After reading the data with Pandas, we usually want to check that it was read correctly, which is what the head() function in Pandas is for.
Only 5 rows are displayed here, which follows from the signature of the head function:
DataFrame.head(n=5)
Return the first n rows.
Parameters: n : int, default 5
Number of rows to select.
Returns: obj_head : type of caller
The first n rows of the caller object.
As can be seen, the head function returns the first 5 rows by default.
In the data above, the first column is the news category label and the second column is the news text.
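The behavior of head() can be checked quickly on a small hand-made DataFrame (a sketch; the columns mirror the competition file, but the values here are made up):

```python
import pandas as pd

# A toy frame with the same columns as train_set.csv (values are made up).
toy = pd.DataFrame({
    'label': [0, 1, 2, 3, 4, 5, 6],
    'text': ['57 44'] * 7,
})

print(toy.head())   # default: first 5 rows
print(toy.head(2))  # explicit n: first 2 rows
```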
Data analysis
After reading the data set, we can also analyze it. Although unstructured data does not call for extensive data analysis, some patterns can still be discovered this way.
In this step, we read the entire training set. Through data analysis, we hope to answer the following questions:
- What is the length of the news text in the data?
- What is the category distribution of the data, and which categories are more numerous?
- What is the distribution of characters in the data?
Here, we first analyze the sentence length.
Sentence length analysis
In this task's data, the characters in each sentence are separated by spaces, so the sentence length can be obtained by counting the tokens directly:
%matplotlib inline
import matplotlib.pyplot as plt
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())
The statistics show that the text in this competition is relatively long: each sentence consists of 907 characters on average, the shortest sentence has 2 characters, and the longest has 57921.
The following code draws a histogram of sentence lengths. It can be seen that most sentences are shorter than 2000 characters.
_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")
News category distribution
Next, we can compute the category distribution of the data set, i.e. count the number of samples for each news category.
train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")
The labels in the data set correspond to the following categories: {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Current affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home Furnishing': 8, 'Game': 9, 'Property': 10, 'Fashion': 11, 'Lottery': 12, 'Constellation': 13}
The statistics show that the category distribution of the competition data is quite uneven: in the training set, technology news is the most common, followed by stock news, while constellation news is the rarest.
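A common response to such imbalance is to weight classes inversely to their frequency during training. A minimal sketch with made-up labels (the real counts would come from train_df['label'].value_counts()):

```python
from collections import Counter

# Made-up labels standing in for train_df['label']: class 0 dominates.
labels = [0] * 8 + [1] * 2

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)
# "Balanced" weights in the style of scikit-learn: n_samples / (n_classes * count_c),
# so rare classes receive proportionally larger weights.
weights = {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}
print(weights)
```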
Character distribution statistics
Next, we can count how often each character occurs. First, concatenate all sentences in the training set, split the result into characters, and count each character's occurrences:
from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)
The statistics show that the training set contains 6869 distinct characters in total; character 3750 appears the most often and character 3133 the least.
From the per-sentence appearance of characters, punctuation can also be inferred. The following code counts how many articles each character appears in: characters 3750, 900, and 648 each cover close to 99% of the 200,000 news articles, so they are very likely punctuation marks.
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)  # Counter values are already ints
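The coverage mentioned above can be expressed as the fraction of documents a character appears in. A sketch on a three-document toy corpus (the real corpus would be train_df['text']):

```python
from collections import Counter

# Toy corpus of three "documents" of space-separated character ids (made up).
docs = ['3750 648 900', '3750 12 900', '3750 7']

# For each character, count the documents it appears in at least once.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split(' ')))

# Coverage = fraction of documents containing the character;
# characters near 1.0 are punctuation candidates.
coverage = {ch: cnt / len(docs) for ch, cnt in doc_freq.items()}
print(sorted(coverage.items(), key=lambda d: d[1], reverse=True))
```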
Count the most frequent characters
Here we write a function that prints the n most frequent characters for news of a given category:
def TopWord(type_, n):
    # Concatenate the text of all news whose label equals type_.
    all_lines = " ".join([train_df["text"][i] for i in range(len(train_df["text"])) if train_df["label"][i] == type_])
    word_count = Counter(all_lines.split(" "))
    word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
    for i in range(n):
        print(word_count[i])

TopWord(2, 3)  # top 3 characters in sports news (label 2)
Data analysis conclusions
Through the above analysis, we can draw the following conclusions:
- The average number of characters contained in each news item in the question is 1000, and some news characters are longer;
- The distribution of news categories in the competition questions is uneven, the sample size of technology news is close to 4w, and the sample size of constellation news is less than 1k;
- The question includes 7000-8000 characters in total;
Through data analysis, we can also draw the following conclusions:
- Since each news item is long on average, the text may need to be truncated;
- The category imbalance may seriously affect the accuracy of the model.
To sum up
This section mainly covered data reading and data analysis: the Pandas library was used to read the competition data, and the news sentence lengths, category distribution, and character distribution were analyzed and visualized.