(Data Analysis) Online Course Comment Analysis

This article uses crawled data to briefly analyze the course information and comments of four online course platforms: Chinese University MOOC (icourse), imooc, Tencent Classroom (keqq), and NetEase Cloud Classroom (study163). It also summarizes the overall process of data analysis. If you spot any errors, please point them out.


1. Data capture

Acquiring the data set is the first step in data analysis. The main ways to obtain data are: using ready-made data, writing your own crawler to crawl the data, or using an existing crawler tool to crawl the required content. The results can be saved to a database or locally as files.

In this article, I use ready-made data for the analysis.

If you want to crawl the data yourself, the overall workflow is roughly: decide what to crawl, analyze the main page, obtain the sub-pages, parse the sub-pages, and save the data. Today's websites almost all have basic anti-crawling measures, so when writing a crawler you should develop corresponding counter-strategies for the target site, dealing with request headers, IP proxies, cookie restrictions, CAPTCHAs, and so on. You should be able to handle these common anti-crawling mechanisms in your own crawler.
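As a minimal illustration of the request-header point, here is a hedged sketch using the requests module; the URL and header values are placeholders, not an actual target site.

import requests

# Hypothetical target URL; replace with the page you actually want to crawl.
url = 'https://example.com/course/list'

# A browser-like User-Agent is the most basic counter-measure;
# some sites also check Referer or require cookies.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://example.com/',
}

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()
print(resp.text[:200])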

Once the crawler can fetch the content we need, the next step is to improve its speed and stability. When the requests module sends a request to a page, the whole program blocks, and the code after the request cannot run. So when we need to request many pages, we can use asynchronous coroutines to spend that blocked time on other tasks. Since the requests module does not support coroutines, we use the aiohttp module to send the requests and asyncio to drive the asynchronous crawler.
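A minimal sketch of this asyncio + aiohttp pattern follows; the URL list is a placeholder.

import asyncio
import aiohttp

async def fetch(session, url):
    # While this request is waiting for its response, the event loop
    # can switch to the requests for the other URLs.
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# Placeholder URLs; replace with the pages you actually need.
urls = ['https://example.com/page/{}'.format(i) for i in range(1, 4)]
pages = asyncio.run(main(urls))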
To improve stability, you need some stable IP proxies to prevent your IP from being blocked while the crawler runs. It is recommended to crawl some free proxy sites yourself, test the proxies in code, and save the usable ones in a database so they can be fetched directly when needed. If you are not yet familiar with using IP proxies in crawlers, you may wish to read this article on using proxy IPs with requests and aiohttp (asynchronous crawlers).
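A sketch of how a crawled free proxy might be tested before being saved; the proxy address is a placeholder, and usable ones would then be written to your database.

import requests

def is_proxy_alive(proxy: str) -> bool:
    """Return True if the proxy can fetch a test page within 5 seconds."""
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    try:
        resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# Hypothetical proxy crawled from a free proxy site.
print(is_proxy_alive('127.0.0.1:8888'))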

Since I have never used a ready-made crawler tool, I won't cover that approach here.


2. Data cleaning

Once the data is obtained, we need to clean it to pave the way for the subsequent analysis. If the cleaning is not done well, it will inevitably affect the later results.
The following sections cover unifying the data format, handling null values, deduplicating data, and cleaning comments.


2.1 Unified data format

Because each platform's crawled content differs, the data types differ as well. We extract the fields we need according to the later analysis, and merge the course and comment information once the data types and formats are unified.

For example, with course ratings, imooc stores the score as a string such as '9 points', while the analysis only needs the number 9. Each platform also uses a different rating scale, so all scores are unified to a 5-point system here.

# Extract the numeric part of the rating with a regex
df['评论分数'] = df['评论分数'].str.extract(r'(\d+)', expand=False)
# Convert to float and rescale the 10-point score to a 5-point scale
df['评论分数'] = df['评论分数'].astype(float) / 2

Rename the columns:

df.rename(columns={'课程名': 'course_name', '学习人数': 'total_stu'}, inplace=True)

2.2 Null value processing

For null values (NaN) in the data: if a row has little effect on the subsequent analysis, we can simply delete that row.

# Drop rows whose comment column is null (this does not catch empty strings)
df = df.dropna(subset=['comment'])

If you also want to delete empty strings, the method above alone will not work. You can first convert the empty strings to np.nan, and then delete them with dropna.

import numpy as np

# Convert empty/whitespace-only strings to np.nan for the deletion step below
df['course_id'] = df['course_id'].replace(to_replace=r'^\s*$', value=np.nan, regex=True)
# Drop the null values in course_id and reset the index
df = df.dropna(subset=['course_id'])
df.reset_index(drop=True, inplace=True)

2.3 Data deduplication

Due to crawling errors, some records are partially duplicated. These duplicates need to be deleted, keeping only one copy of each.

# course_id is unique, so it can serve as the reference: if several rows
# share the same course_id, keep only the first occurrence.
df.drop_duplicates(subset=['course_id'], keep='first', inplace=True)
# Reset the index
df.reset_index(drop=True, inplace=True)

2.4 Comment cleaning

First, remove duplicate comments from a single user. If the same user posted identical comments on the same course at different times, we can assume the user did not comment thoughtfully, so only the first comment is kept to ensure the validity of the data for later analysis.

df.drop_duplicates(subset=['user_id', 'comment'], keep='first', inplace=True)
df.reset_index(drop=True, inplace=True)

Remove line breaks (\n) and carriage returns (\r) in comments.

df['comment'] = df['comment'].str.replace(r'\r|\n', '', regex=True)

Remove spaces at the beginning and end of comments

df['comment'] = df['comment'].str.strip()

Comments that are purely digits (such as '111', '123456', '666') have no practical meaning and cannot express an evaluation of anything, so they should be deleted. First replace them with empty strings via a regex, then delete the empty strings in one pass.

df['comment'] = df['comment'].str.replace(r'^[0-9]*$', '', regex=True)

Comments consisting of a single repeated character (such as 'aaaa', '!!!') likewise have no practical meaning.

df['comment'] = df['comment'].str.replace(r'^(.)\1*$', '', regex=True)

Some comments contain timestamps (such as '2020/11/20 20:00:00 check-in'); the dates and times are replaced with empty strings via regex matching so that they do not affect the later word segmentation of the comments.

df['comment'] = df['comment'].str.replace(r'\d+/\d+/\d+ \d+:\d+:\d+', '', regex=True)

Mechanical compression

(1) The idea of mechanical compression.
Because the quality of the comments is uneven, many of them carry no real meaning, and simple text deduplication cannot delete meaningless comments in bulk. Therefore, after the simple deduplication above, mechanical compression is used to clean the text again, targeting comments such as "very good, very good" and "good, good, good".
Such consecutively repeated comments are difficult to delete in the earlier cleaning steps, but if they are left alone, then after word segmentation during sentiment analysis the counted positive words will deviate greatly from the actual sentiment, which will strongly distort the later statistics.

(2) The structure targeted by mechanical compression
Looking at ordinary comments, people usually only add meaningless repeated text at the beginning and end, such as 'why is the course so expensive, why is the course so expensive?' or 'really very good, good'. When repeated words appear in the middle, they are mostly idioms or modifiers, such as 'The teacher's lecture flows on and on, like a river!'. Compressing such words may change the original meaning of the sentence, so only repeated words at the beginning and end are compressed away.

(3) The process and rules of mechanical compression
Removing consecutive repetition can be done by maintaining two lists of stored characters: read the characters one by one and, depending on the situation, place each character in list 1 or list 2, or trigger a compression judgment. If the judgment finds a repetition (that is, the meaningful character parts of list 1 and list 2 are exactly the same), the repetition is removed. Following the rules given in the book Python Data Analysis and Mining in Action, seven rules are specified.
Rule 1: If the character read is the same as the first character in list 1, and list 2 is empty, place the character in list 2.


Rule 2: If the character read is the same as the first character in list 1, and list 2 is not empty, trigger the compression judgment. If the result is a repetition, remove it and clear list 2.

Rule 3: If the character read is the same as the first character in list 1, and list 2 is not empty, trigger the compression judgment. If there is no repetition, clear both lists and place the character read in the first position of list 1.

Rule 4: If the character read is not the same as the first character in list 1, trigger the compression judgment. If there is a repetition and both lists contain more than 2 characters, remove the repetition, clear both lists, and place the character in list 1.
Rule 5: If the character read is not the same as the first character in list 1, trigger the compression judgment. If there is no repetition and list 2 is empty, continue placing characters in list 1.
Rule 6: If the character read is not the same as the first character in list 1, trigger the compression judgment. If there is no repetition and list 2 is not empty, continue placing characters in list 2.
Rule 7: After all characters have been read, trigger the compression judgment, compare list 1 and list 2, and delete the repetition if there is one.
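Implementing the full two-list bookkeeping is somewhat verbose, so here is a simplified sketch of the same head-and-tail compression idea using back-referencing regular expressions; note that it does not reproduce the length threshold in Rule 4.

import re

def compress_repeats(text: str) -> str:
    """Collapse a segment repeated consecutively at the start or end
    of a comment, e.g. '非常好非常好非常好' -> '非常好'."""
    # ^(.+?)\1+ matches a shortest segment repeated 2+ times at the head;
    # (.+?)\1+$ does the same at the tail. Repetitions in the middle are
    # left alone, since they are often idioms or intentional emphasis.
    text = re.sub(r'^(.+?)\1+', r'\1', text)
    text = re.sub(r'(.+?)\1+$', r'\1', text)
    return text

print(compress_repeats('非常好非常好非常好'))  # -> 非常好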

(4) Mechanical compression effect display

[Figure: sample comments before and after mechanical compression]

3. Data analysis and visualization

3.1 Course Rating Analysis

First, analyze and visualize the course ratings on each platform.
Since only two of the platforms' data sets contain course ratings, only those two platforms are analyzed.
From the merged course-information CSV, select 500 course ratings per platform and count the number of courses in each rating interval.

df = pd.read_csv(r'merge_course.csv', usecols=['platform', 'rating'])
df = df.loc[df['platform'] == 'imooc'].head(500)
print(len(df[df['rating'] >= 4.75]))
print(len(df[(df['rating'] < 4.75) & (df['rating'] >=4.5)]))
print(len(df[(df['rating'] < 4.5)]))

The data is then visualized with the matplotlib library, producing the figure below.

The distribution of the rating intervals on the two platforms can be seen clearly from the figure. Overall, the courses on the study163 platform are rated higher than those on imooc, with courses rated below 4.5 accounting for only about 1%.


3.2 User nickname format

The user names are taken from the comment data and deduplicated to guarantee uniqueness. Since user names on the keqq platform are hidden, it is not included in the statistics.

import re

df = pd.read_csv(r'C:/Users/pc/Desktop/new_merge_comment.csv',
                 low_memory=False, usecols=['platform', 'user_name'])
df = df.loc[df['platform']=='mooc163', :]
df.dropna(inplace=True)
df.drop_duplicates(keep='first', inplace=True)
df.reset_index(drop=True, inplace=True)
df = df.head(30000)
chinese_name = []

for name in df['user_name'].values:
    if re.findall(r'[\u4e00-\u9fff]', name):
        chinese_name.append(name)

pure_chinese_name = []
for name in chinese_name:
    if name == ''.join(re.findall(r'[\u4e00-\u9fff]', name)):
        pure_chinese_name.append(name)
print(30000 - len(chinese_name))   # nicknames containing no Chinese characters
print(len(pure_chinese_name))      # pure-Chinese nicknames

After getting the username format of each platform, you can visualize it. The result is shown below.

According to the figure, on the imooc platform the numbers of users in the three nickname formats are roughly equal, while on the icourse and study163 platforms pure-character nicknames clearly outnumber pure-Chinese ones.


3.3 Average length of reviews on each platform

The length of the course reviews on each platform was counted earlier; now we only need to take their averages.

df = pd.read_csv(r'Chinese_comment.csv', low_memory=False,
				 usecols=['platform', 'average_length'])
print(df.groupby('platform').describe().reset_index(drop=None))

The visualization results are as follows.

Among the four platforms, Tencent Classroom (keqq) has the highest average comment length, Chinese University MOOC (icourse) is similar to NetEase Cloud Classroom (study163), and imooc has the lowest. The average comment length reflects, to some extent, how active each platform's users are.


3.4 High-frequency words in comments on each platform

Extract positive and negative high-frequency words from each platform.
Use a dataframe to read the contents of the CSV file, then read each comment from the dataframe, group the rows by course with the groupby() function, and merge each course's comments together.

# Concatenate all comments of each course into one string
course_comment = df.groupby('course_id')['comment'].apply(''.join)

Create a new dataframe new_df to hold each course and its extracted positive/negative high-frequency words. Loop over each course's merged 'comment' text, segment it with jieba in precise mode, and filter the resulting word list against an imported Chinese stop-word list to remove useless tokens. Then import positive.txt and negative.txt as the positive and negative sentiment dictionaries respectively. For each course, collect the positive/negative words appearing in the segmented list, and use the Counter class to obtain the five most frequent positive/negative words.

up_list = Counter(positive_list).most_common(5)
down_list = Counter(negative_list).most_common(5)
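A sketch of the per-course loop described above, assuming stopwords.txt, positive.txt, and negative.txt each contain one word per line, and course_comment is the Series built earlier:

import jieba
import pandas as pd
from collections import Counter

# Assumed dictionary files, one word per line.
stopwords = set(open('stopwords.txt', encoding='utf-8').read().split())
positive_words = set(open('positive.txt', encoding='utf-8').read().split())
negative_words = set(open('negative.txt', encoding='utf-8').read().split())

rows = []
for course_id, text in course_comment.items():
    # Precise-mode segmentation, then stop-word filtering.
    words = [w for w in jieba.lcut(text) if w not in stopwords]
    positive_list = [w for w in words if w in positive_words]
    negative_list = [w for w in words if w in negative_words]
    rows.append({'course_id': course_id,
                 'up_words': Counter(positive_list).most_common(5),
                 'down_words': Counter(negative_list).most_common(5)})

new_df = pd.DataFrame(rows)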

In this way the obtained high-frequency words and course ids are saved into new_df, looping until the comments of all courses have been processed.
So far, we have obtained the high-frequency words of the comments for every course on each platform. Next, we only need to aggregate the high-frequency words per platform and display them as word clouds. Here we show the positive and negative high-frequency words of the imooc platform.
The visualization results are as follows.

Positive high-frequency words in course reviews

Negative high-frequency words in course reviews

From the high-frequency words in these two word clouds, it is not hard to see that users care most about the difficulty of the course, the teacher's teaching level, the overall structure of the course, and the arrangement of after-class exercises: the positive words include many like 'easy to understand', 'detailed', 'basic', and 'clear', while the negative words include many like 'complex', 'tough going', and 'nonsense'.


3.5 The relationship between the number of reviews and the course rating

Here we randomly select 200 courses each from imooc and NetEase Cloud Classroom (study163) and count their course ratings and numbers of reviews.

df = pd.read_csv('merge_course.csv', low_memory=False, 
				usecols=['platform', 'course_name', 'course_id', 'rating'])
df = df.loc[df['platform'] == 'imooc'].reset_index(drop=True).head(200)
df_comment = pd.read_csv('merge_comment.csv', low_memory=False, 
				usecols=['platform', 'course_id', 'comment'])
df_comment = df_comment.loc[df_comment['platform'] == 'mooc163']
series = df_comment.groupby('course_id')['comment'].count()

df_comment = pd.DataFrame()
df_comment['course_id'] = series.index
df_comment['count'] = series.values
df_comment['course_id'] = df_comment['course_id'].str.extract(r'^(\d+)', expand=False)
new_df = pd.merge(df, df_comment, on='course_id')

rating_list = new_df['rating'].values.tolist()    # course ratings
count_list = new_df['count'].values.tolist()      # comment counts per course

After getting rating_list and count_list, visualize them with a scatter chart. The results are shown below.


Imooc

NetEase Cloud Classroom (study163)

Comparing the two platforms, it is not hard to see that when the number of comments is large, course ratings on imooc are basically stable between 4.8 and 5.0, while on study163 they are basically stable between 4.9 and 5.0. The most-commented courses on the two platforms score 4.8 and 4.9 respectively, so the more ratings a course receives, the harder it is to reach a full score. Meanwhile, courses with lower ratings on both platforms generally have few comments, which shows that when the number of ratings is small, low scores from individual users have a larger impact on the course rating.



That is the whole content of this article. Some of the code is incomplete and only conveys the ideas; for the matplotlib visualizations you can refer to the official documentation, so the visualization code is not included in the article.


Reference materials:
Python Data Analysis and Mining in Action (《Python数据分析与挖掘实战》)
pandas official documentation
matplotlib official documentation
