"Bilibili" user behavior data analysis based on python data mining algorithm



Preface

  After years of development, online video has become one of the main applications on the Internet. Online videos today are characterized by large volume, rapid release, and broad influence. Bilibili Barrage Video Network (Bilibili for short, also known as Station B) is currently the leading barrage (danmaku) video website in China. A survey of videos across all platforms shows that user-created videos account for as much as 85% of the videos on Station B. For video creators, how to analyze and study hot videos within the vast and complex body of data on the Internet has become a difficult research problem. The data in this article was collected from Station B in August 2020. It mainly covers hot video data from the life section, from which a large number of hot words, comments and other data are selected for analysis and finally visualized. This makes it possible to understand the overall trend of online public opinion during this period, grasp users' psychological attitudes, strengthen interactive feedback from the audience, and stimulate users' interest in exploring Bilibili's culture.
Keywords: Bilibili; user behavior analysis; hot videos

1. Module design

   The structure of this platform is shown in Figure 2:

Figure 2 Platform structure diagram

3.1 Data crawling module

  When mining data with Python, user data is collected mainly through a crawler program followed by data preprocessing. The crawler relies on the aid, the ID attached to each video when a user uploads it, and uses requests to fetch the corresponding Station B URLs, thereby collecting the relevant data. Data preprocessing operates on the basic data crawled during video collection and consists of three steps:
(1) Data cleaning mainly uses Python regular expressions to extract the target data from the large amount of collected text.
(2) Data conversion mainly uses a loading method (such as json.loads) to convert the strings collected in the source data into dictionaries according to the corresponding rules and order.
(3) Data deduplication uses the unique method to return an array or list without duplicate elements.
After preprocessing, the data is saved to a CSV file.
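
The three preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the raw text, the `window.__DATA__` marker and the field names `aid`, `title` and `view` are hypothetical, a set-based check stands in for the unique method, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io
import json
import re

# Hypothetical raw page fragment; real pages and field names may differ.
RAW = '''
<script>window.__DATA__={"aid": 1001, "title": "demo A", "view": 12000};</script>
<script>window.__DATA__={"aid": 1002, "title": "demo B", "view": 300};</script>
<script>window.__DATA__={"aid": 1001, "title": "demo A", "view": 12000};</script>
'''

# (1) Cleaning: a regular expression pulls the JSON payload out of the raw text.
PAYLOAD = re.compile(r"window\.__DATA__=(\{.*?\});")


def preprocess(raw):
    """Extract, convert and deduplicate the records found in raw text."""
    records = []
    seen = set()
    for payload in PAYLOAD.findall(raw):
        # (2) Conversion: json.loads turns the string into a dictionary.
        record = json.loads(payload)
        # (3) Deduplication: keep only the first record for each aid.
        if record["aid"] in seen:
            continue
        seen.add(record["aid"])
        records.append(record)
    return records


def to_csv(records, fieldnames):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = preprocess(RAW)
csv_text = to_csv(records, ["aid", "title", "view"])
```

To write a real file, the same `csv.DictWriter` could be pointed at a file opened with `newline=""` instead of the in-memory buffer.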

3.2 Data mining and analysis module

  Data mining analyzes and summarizes the existing data with the designed algorithms and performs sentiment analysis according to the characteristics of the data. The snownlp library is used to implement the basic sentiment analysis: it computes a score for each barrage comment, and the sentiment tendency is judged from that score. The score is exposed as the sentiments value, which ranges from 0 to 1: the closer it is to 1, the more positive the text, and the closer it is to 0, the more negative. The resulting scores serve as the basic data for sentiment analysis.
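
A minimal sketch of this scoring step is given below. `SnowNLP(text).sentiments` is the snownlp call that returns the score; the 0.5 cut-off between positive and negative, the helper names and the pluggable `scorer` parameter are assumptions made for illustration, not details taken from the article.

```python
def snownlp_score(text):
    """Score one comment with snownlp (imported lazily; requires snownlp)."""
    from snownlp import SnowNLP
    return SnowNLP(text).sentiments  # float between 0 and 1


def label(score, threshold=0.5):
    """Map a score to a label. The 0.5 threshold is an assumption."""
    return "positive" if score >= threshold else "negative"


def sentiment_shares(comments, scorer=snownlp_score):
    """Return the percentage of positive and negative comments."""
    counts = {"positive": 0, "negative": 0}
    for text in comments:
        counts[label(scorer(text))] += 1
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100 * n / total, 2) for name, n in counts.items()}
```

Calling `sentiment_shares(barrage_list)` with a list of barrage strings would use snownlp directly; passing another `scorer` makes it possible to try the aggregation without the library installed.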

3.3 Data visualization module

  The data visualization module mainly uses pie charts, word clouds, line charts and similar charts to visualize the data. It uses the matplotlib library, among other tools, to further study the characteristics of the data and to present its underlying meaning through charts. The module produces play volume proportion charts for each interval, hot word statistics charts, line charts of video play volume at different times of the week, and sentiment proportion charts.

2. Development environment

  Almost every beginner writing Python crawlers comes into contact with two libraries, requests and BeautifulSoup. Both are common basic libraries, but they serve completely different purposes. requests is mainly used to obtain the source code of a web page by sending a request for its URL to the server; BeautifulSoup is mainly used to read and parse that source, including but not limited to HTML and XML, and to extract the important information. Together, the two libraries simulate the process of a person visiting a web page, reading it, and copying and pasting the relevant information, and they can capture data quickly in batches. The process is shown in Figure 1.

Figure 1 Data acquisition and analysis flow chart

3. Data preprocessing

Delete null values and duplicate values and replace None values with 0. Keep only the Chinese characters in the title and split the title into short words; process the tags in the same way. Set the rounding precision, then calculate the "triple" (like, coin, collection) rates and the other interaction rates:
like rate = likes / play volume × 100%
coin rate = coins / play volume × 100%
collection rate = collections / play volume × 100%
forwarding rate = forwards / play volume × 100%
barrage rate = barrages / play volume × 100%
comment rate = comments / play volume × 100%
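
The rate calculation can be sketched as below. The field names (`play`, `like`, `coin`, `collect`, `share`, `danmaku`, `comment`) are hypothetical, and returning 0.0 for a video with no plays is an assumption made so the division is always defined.

```python
# Hypothetical field names for the six interaction counts.
FIELDS = ("like", "coin", "collect", "share", "danmaku", "comment")


def rate(count, plays, digits=2):
    """Return count / plays as a percentage, treating None as 0."""
    count = count or 0
    plays = plays or 0
    if plays == 0:
        return 0.0  # assumption: no plays means a rate of 0
    return round(count * 100 / plays, digits)


def interaction_rates(video, digits=2):
    """Compute every interaction rate for one video record."""
    plays = video.get("play")
    return {
        f"{field}_rate": rate(video.get(field), plays, digits)
        for field in FIELDS
    }


example = {"play": 20000, "like": 1500, "coin": 300, "collect": None,
           "share": 50, "danmaku": 120, "comment": 80}
rates = interaction_rates(example)
```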

4.2 Implementation of each functional module

4.2.1 Data analysis and visualization of hot videos

First check the processed video data information, as shown in Figure 3:

Figure 3 Video data information
There are a total of 88,350 UP owners. Counting the number of videos in each play volume interval gives 213,115 videos in the [0, 9999] interval, accounting for 93.86% of the sample; 10,731 in the [10000, 99999] interval, accounting for 4.73%; 2,436 in the [100000, 499999] interval, accounting for 1.07%; 464 in the [500000, 999999] interval, accounting for 0.20%; and 320 in the [1000000, ∞) interval, accounting for 0.14%. These proportions are drawn as a pie chart, as shown in Figure 4:

Figure 4 Playback volume proportion chart
If only videos with more than 10,000 plays are considered, the number of videos in each play volume interval is as follows: 10,731 in the [10000, 99999] interval, accounting for 76.92% of this subset; 2,436 in the [100000, 499999] interval, accounting for 17.46%; 464 in the [500000, 999999] interval, accounting for 3.33%; and 320 in the [1000000, ∞) interval, accounting for 2.29%. These proportions are drawn as a pie chart, as shown in Figure 5:

Figure 5 Play volume proportion chart (over 10,000 plays)
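
As a rough illustration of how the proportions for the subset above 10,000 plays follow from the counts, the sketch below assigns a play count to its interval and converts the per-interval counts into percentages. The function names are hypothetical; the interval bounds and counts are the ones quoted in the text.

```python
from bisect import bisect_right

# Lower bounds of the play volume intervals for videos above 10,000 plays.
BOUNDS = [10_000, 100_000, 500_000, 1_000_000]
LABELS = ["[10000,99999]", "[100000,499999]",
          "[500000,999999]", "[1000000,∞)"]


def interval_label(plays):
    """Return the interval label for a play count, or None below 10,000."""
    index = bisect_right(BOUNDS, plays) - 1
    return LABELS[index] if index >= 0 else None


def shares(counts):
    """Convert per-interval counts into percentages of their total."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


counts = dict(zip(LABELS, [10731, 2436, 464, 320]))
percentages = shares(counts)
```

With matplotlib available, these values could be passed to `plt.pie(list(percentages.values()), labels=list(percentages), autopct="%.2f%%")` to draw a chart of this kind.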
The twenty UP owners with the most plays are counted and displayed. The results are shown in Figure 6:

Figure 6 Ranking of playback volume
The specific data of the top 20 ranked by playback volume is displayed. The results are shown in Figure 7:

Figure 7 Specific data display
Group the videos by UP owner and sort the total play volume of each UP owner in August. The sorting results are shown in Figure 8:

Figure 8 Display of the total play volume of each UP in August
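
The grouping and sorting step can be sketched with the standard library as follows. The record layout (`up` for the UP owner, `play` for the play volume) is hypothetical, and a missing play volume is treated as 0 in line with the preprocessing rule.

```python
from collections import defaultdict


def total_plays_by_up(videos):
    """Sum play volume per UP owner and sort in descending order."""
    totals = defaultdict(int)
    for video in videos:
        totals[video["up"]] += video["play"] or 0
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def top_n(videos, n=20):
    """Return the n UP owners with the highest total play volume."""
    return total_plays_by_up(videos)[:n]


sample = [
    {"up": "A", "play": 100},
    {"up": "B", "play": 900},
    {"up": "A", "play": 300},
    {"up": "C", "play": 500},
    {"up": "C", "play": None},
]
ranking = total_plays_by_up(sample)
```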

The number of videos with more than 10,000 views released in each time period of the week is summarized. The results are shown in Figure 13:

Figure 13 View volume statistics (video play volume is greater than 10,000)
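
One possible way to produce such a summary is to count releases per weekday and hour, as in the sketch below. The `pubdate` and `play` field names, the timestamp format and the hourly granularity are assumptions; the article does not state how its time periods are defined.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def releases_by_weekday_hour(videos, min_plays=10_000):
    """Count videos above min_plays by (weekday, hour) of release."""
    counts = Counter()
    for video in videos:
        if video["play"] <= min_plays:
            continue  # keep only videos with more than min_plays plays
        released = datetime.strptime(video["pubdate"], "%Y-%m-%d %H:%M:%S")
        counts[(WEEKDAYS[released.weekday()], released.hour)] += 1
    return counts


sample = [
    {"pubdate": "2020-08-03 18:30:00", "play": 20000},
    {"pubdate": "2020-08-03 18:45:00", "play": 15000},
    {"pubdate": "2020-08-08 09:00:00", "play": 50000},
    {"pubdate": "2020-08-04 12:00:00", "play": 500},
]
period_counts = releases_by_weekday_hour(sample)
```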
A word cloud is drawn to display the hot words of the video "topics", as shown in Figure 14:

Figure 14 Hot words of the topic
Use a word cloud to display the hot words of the "topic" for videos with more than 10,000 views, as shown in Figure 15:

Figure 15 Topic hot words (play volume greater than 10,000)
Use a word cloud to display the hot words of the "topic" for videos with more than 100,000 views. The results are shown in Figure 16:

Figure 16 Topic hot words (play volume greater than 100,000)
Use a word cloud to display the hot words of the video "topic" with more than 1 million views. The results are shown in Figure 17:

Figure 17 Topic hot words (play volume greater than 1,000,000)

4. Conclusion

  This article analyzes the preset modules one by one, and the basic modules have all been implemented. It visually analyzes how hot words, likes, coins, collections, comments and other data of popular videos affect video play volume.
  This article only selects videos from the funny section of Bilibili as the research object, so the data sample mainly covers a single type of video; because of this, the results are not influenced by videos on other topics. The actual users of Station B are mostly people born in the 1990s, and this specific age group makes the user attributes relatively distinctive, unlike those of corporate video platforms. In future in-depth research, data on multiple topics could be collected first, and multi-platform surveys could then be conducted, so that greater sample diversity makes the conclusions more reliable.

Table of contents

Chapter 1 Introduction
1.1 Background and significance of the topic
1.2 Research purpose and significance
1.3 Current research status at home and abroad
Chapter 2 Key Technologies
2.1 Crawler Technology
2.2 Python
Chapter 3 Module Design
3.1 Data Crawling Module
3.2 Data Mining and Analysis Module
3.3 Data Visualization Module
Chapter 4 Data Mining and Analysis
4.1 Sample Selection and Data Sources
4.1.1 Data crawling
4.1.2 Data preprocessing
4.2 Implementation of each functional module
4.2.1 Data analysis and visualization of hot videos
4.2.2 Video barrage data
Chapter 5 Summary
References
Acknowledgments




Origin blog.csdn.net/QQ2743785109/article/details/133799981