An R data analysis report on Bilibili

1. Research background and questions raised

In early January 2023, the earnings report Bilibili released on the Hong Kong Stock Exchange showed that, starting from the second quarter of 2022, the company counts active users across both its mobile apps and PC clients. As of September 30, 2022, average monthly active users reached 333 million, up 26.9 million from the previous quarter and a record high; over the same period, average daily active users were 90.3 million, a year-on-year increase of 25% and 6.8 million more than in the second quarter. Unlike other long-video platforms, Bilibili has not yet hit a bottleneck in user growth, even after repeatedly breaking into the mainstream. At the same time, incomes vary widely across uploaders ("UP masters"), as does the playback revenue different uploaders earn on Bilibili. It is therefore worth using a clustering method to group the uploaders.

2. Research objectives and significance

This analysis clusters 80 sports uploaders using five columns of data: views, coins, likes, favorites and shares. For each resulting cluster, the average views, coins, likes, favorites and shares of its videos are reported separately. The results may be of some reference value for people who want to become Bilibili uploaders.

3. Research approach

In the data analysis, the relevant features are first described, and an elbow plot is then drawn to determine the optimal number of clusters. K-means clustering is the main method used; for the resulting clusters, density plots are drawn showing how average views, coins, likes, favorites and shares are distributed in each cluster. Finally, the clustered results are saved.

4. Data collection

The data are basic records of 80 Bilibili sports uploaders, mainly the Fitness_up.csv table (baseinfo_tag, uploader base information) obtained on February 26, 2022, together with the per-video statistics table CombineBiliUpVideo.csv used in the script below; redundant columns can simply be dropped.

Clustering is performed on the five filtered columns: views, coins, likes, favorites and shares.

The final data set is a data frame of 28,501 rows and 6 columns.

5. Research content

A descriptive statistical analysis is performed on the compiled data set, mainly computing, for each uploader, the number of videos published and the average views, coins, likes, favorites and shares.


A descriptive statistical analysis is then performed on the summarised table, mainly computing the mean, standard deviation and median of each column.

There are 80 uploaders in total. On average, each uploader has published 356 videos, with average views of 426,445, average coins of 7,583, average likes of 19,676, average favorites of 16,775 and average shares of 2,492.6.
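The core of this step, condensed from the full script in the Code section below: the video records are grouped by uploader id, the videos are counted, each metric is averaged, and the summary table is passed to psych::describe().

# per-uploader summary (excerpt; see the Code section for the full version)
res <- data %>%
  group_by(baseinfo_id) %>%
  summarise(视频数量 = n(),
            平均播放量 = mean(播放量, na.rm = TRUE))   # likewise for 投币, 点赞, 收藏, 分享
psych::describe(res)[2:5]   # n, mean, sd and median of each column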

Based on the averages of these indicators for the 80 uploaders, an elbow plot is drawn to determine the optimal number of clusters.

The within-group sum of squares is plotted against the number of extracted clusters. It drops very quickly from one cluster to three and only very slowly after that, so a three-cluster solution is recommended.
[Figure: elbow plot of within-group sum of squares versus number of clusters]
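A minimal sketch of the computation behind this plot (the wssplot() helper in the Code section below does the equivalent), assuming res is the per-uploader summary table built there:

# total within-group sum of squares for k = 1..15 cluster solutions
X <- na.omit(res[-1])                      # drop the uploader id column
wss <- sapply(1:15, function(k) {
  set.seed(1234)
  kmeans(X, centers = k)$tot.withinss
})
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")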

Next, k-means clustering is applied with the number of clusters set to 3 to divide the uploaders, and the average views, coins, likes, favorites and shares are then computed within each cluster.

Judging from the results, uploaders in cluster 1 publish relatively few videos, with views, coins, likes, favorites and shares at a medium level; uploaders in cluster 2 publish more videos, but all of these metrics are at a low level; uploaders in cluster 3 also publish relatively few videos, with views, coins, likes, favorites and shares at a high level.
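This reading can be checked directly against the fitted model; a minimal sketch, assuming km is the kmeans object produced in the Code section below:

km$size                # number of uploaders in each cluster
round(km$centers, 1)   # cluster means of video count, views, coins, likes, favorites, shares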

Based on the cluster assignments, density plots are drawn separately for each variable.

Code

library(tidyverse)
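# Fitness_up.csv: uploader base information (id and name); CombineBiliUpVideo.csv: per-video statistics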
up <- read_csv("Fitness_up.csv") %>% select(baseinfo_id,baseinfo_name)
Video <- read_csv("CombineBiliUpVideo.csv")
##  keep the five columns used for clustering: 播放量 (views), 投币 (coins), 点赞 (likes), 收藏 (favorites), 分享 (shares), plus up_name for the merge

df <- Video  %>% select("播放量","投币","点赞","收藏","分享","up_name")

df$播放量 <- as.numeric(df$播放量)   # make sure the views column is numeric
data <- merge(df, up, by.x = "up_name", by.y = "baseinfo_name")   # attach uploader ids
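# per uploader: number of videos and the average of each engagement metric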
res <- data %>%
    group_by(baseinfo_id) %>% 
    summarise(视频数量 = n(),
              平均播放量 = mean(播放量,na.rm = TRUE),
              平均投币 = mean(投币,na.rm = TRUE),
              平均点赞 = mean(点赞,na.rm = TRUE),
              平均收藏 = mean(收藏,na.rm = TRUE),
              平均分享 = mean(分享,na.rm = TRUE))
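# descriptive statistics (n, mean, sd, median) of the per-uploader summary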
library(psych)
result <- describe(res)[2:5]
result

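# elbow plot: total within-group sum of squares for 1..nc cluster solutions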
wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}

wssplot(na.omit(res[-1]))
r <- na.omit(res)
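# k-means with 3 clusters; note the partition depends on the random initial centers unless a seed is set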
km <- kmeans(r[-1],3)
r$cluster  <- paste0("cluster",km$cluster)
r %>%
    group_by(cluster)%>%
    summarise(平均视频数量 = mean(视频数量),
              平均播放量 = mean(平均播放量),
              平均投币 = mean(平均投币),
              平均点赞 = mean(平均点赞),
              平均收藏 = mean(平均收藏),
              平均分享 = mean(平均分享))
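# density plot of each metric, coloured by cluster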
r %>% ggplot(aes(视频数量, fill = cluster)) + geom_density() + theme(legend.position = "top")
r %>% ggplot(aes(平均播放量, fill = cluster)) + geom_density() + theme(legend.position = "top")
r %>% ggplot(aes(平均投币, fill = cluster)) + geom_density() + theme(legend.position = "top")
r %>% ggplot(aes(平均点赞, fill = cluster)) + geom_density() + theme(legend.position = "top")
r %>% ggplot(aes(平均收藏, fill = cluster)) + geom_density() + theme(legend.position = "top")
r %>% ggplot(aes(平均分享, fill = cluster)) + geom_density() + theme(legend.position = "top")
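# save the cluster assignments to a CSV file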
write.csv(r,"聚类结果.csv",row.names = FALSE)

