[Python] YouTube trending analysis: which factors are associated with video popularity?

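##1. Problem overview and data source ##
- YouTube Trends
YouTube Trends is YouTube's list of recommended popular videos. It is updated daily but is not personalized: everyone in a given country sees the same list.
- Dataset
The data comes from Kaggle: the daily trending video lists for five countries (US, GB, DE, FR, CA) from 2017/11/14 to 2018/06/14.
- Questions to explore
What I find interesting about this dataset is that it is multidimensional: time, space (country), different YouTubers/topics, content categories, and metrics (views, likes, dislikes, comments). One can explore the relationships among views, likes, dislikes, and comments; the popularity trends of hot YouTubers and topics (possibly combined with Google Trends, to see how closely YouTube trends match web-wide search trends and how far they lead or lag); and how trending content differs between countries (geographic differences).
## 2. Data preprocessing ##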
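import pandas as pd
import json
# plotting libraries used by the charts later in the post
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter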

### Load the CSV files, merge them, and add a country label ###
df=pd.read_csv('CAvideos.csv')
df=df.assign(country='CA')
list_cou=['DE','FR','GB','US']
for name in list_cou:
    temp=pd.read_csv(name+'videos.csv')
    temp=temp.assign(country=name)
    df=pd.concat([df,temp])
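
As a side note, the same merge can be written as a single `pd.concat` call. This is only a sketch: the helper name `combine_countries` and the tiny frames are made up for illustration, and `ignore_index=True` is an optional choice that gives the result a fresh 0..n-1 index instead of repeating each file's row numbers.

```python
import pandas as pd

def combine_countries(frames):
    """Stack per-country frames into one, tagging each row with its country code."""
    tagged = [frame.assign(country=code) for code, frame in frames.items()]
    # ignore_index=True renumbers the rows 0..n-1 across all countries
    return pd.concat(tagged, ignore_index=True)

# Tiny made-up frames standing in for the real CSV files
demo = combine_countries({
    "CA": pd.DataFrame({"video_id": ["a", "b"], "views": [10, 20]}),
    "US": pd.DataFrame({"video_id": ["c"], "views": [30]}),
})
print(demo)
```

With the real files, the dictionary could be built as `{c: pd.read_csv(c + 'videos.csv') for c in ['CA', 'DE', 'FR', 'GB', 'US']}`.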

### Parse the date formats ###
df['trending_date'] = pd.to_datetime(df['trending_date'], format='%y.%d.%m')  
df.trending_date = df.trending_date.dt.date   
df['publish_time'] = pd.to_datetime(df['publish_time'], format='%Y-%m-%dT%H:%M:%S.%fZ')
df=df.assign(publish_date=df['publish_time'].dt.date)
df['publish_time'] = df['publish_time'].dt.time
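
The two format strings are easy to misread, so here is a small standard-library sketch of what they parse. The raw strings are illustrative values reconstructed from the formats above and the first row of the output shown later, not copied from the CSV. Note that `%y.%d.%m` means two-digit year, then day, then month.

```python
from datetime import datetime

def parse_trending(raw):
    # '%y.%d.%m' reads a two-digit year, then the day, then the month
    return datetime.strptime(raw, "%y.%d.%m").date()

def parse_publish(raw):
    # ISO-like timestamp with fractional seconds and a literal trailing 'Z'
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")

trending = parse_trending("17.14.11")
published = parse_publish("2017-11-10T17:00:03.000Z")
print(trending, published.date(), published.time())
```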
Category names are stored separately in JSON files. They are read and attached as follows:
### Load the category names ###
for name in list_cou:
    id_to_category = {}
    file=name+'_category_id.json'
    with open(file, 'r') as f:
        data=json.load(f)
        for category in data['items']:
            id_to_category[category['id']] = category['snippet']['title']
    print(id_to_category)
### In practice the category id-to-name dictionary is the same for every country, so the last one loaded is used for all rows
df['category_id'] = df['category_id'].astype(str)
df.insert(4, 'category', df['category_id'].map(id_to_category))
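
The lookup only works because `category_id` is cast to `str` first: the ids in the JSON are strings. A minimal standard-library sketch, assuming a JSON snippet shaped like the one the loop above reads (`items` → `id`, `snippet.title`). The three id/title pairs are taken from the output below; the inline JSON itself is made up for illustration.

```python
import json

# Hypothetical excerpt with the same shape as the *_category_id.json files
raw = """
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "23", "snippet": {"title": "Comedy"}},
  {"id": "24", "snippet": {"title": "Entertainment"}}
]}
"""
data = json.loads(raw)
id_to_category = {item["id"]: item["snippet"]["title"] for item in data["items"]}

category_ids = [10, 23, 24, 10]  # integer ids, as read from the CSV
names = [id_to_category.get(str(cid)) for cid in category_ids]
# Without the str() cast nothing matches, because the dictionary keys are strings
unmatched = [id_to_category.get(cid) for cid in category_ids]
print(names, unmatched)
```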
The DataFrame after cleaning:
df.head()
Out[31]: 
      video_id trending_date  \
0  n1WpP7iowLc    2017-11-14   
1  0dBIkQ4Mz1M    2017-11-14   
2  5qpjK5DgCt4    2017-11-14   
3  d380meD0W0M    2017-11-14   
4  2Vv-BfVoq4g    2017-11-14   

                                               title channel_title  \
0         Eminem - Walk On Water (Audio) ft. Beyoncé    EminemVEVO   
1                      PLUSH - Bad Unboxing Fan Mail     iDubbbzTV   
2  Racist Superman | Rudy Mancuso, King Bach & Le...  Rudy Mancuso   
3                           I Dare You: GOING BALD!?      nigahiga   
4        Ed Sheeran - Perfect (Official Music Video)    Ed Sheeran   

        category category_id publish_time  \
0          Music          10     17:00:03   
1         Comedy          23     17:00:00   
2         Comedy          23     19:05:24   
3  Entertainment          24     18:01:41   
4          Music          10     11:04:14   

                                                tags     views    likes  \
0  Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In...  17158579   787425   
1  plush|"bad unboxing"|"unboxing"|"fan mail"|"id...   1014651   127794   
2  racist superman|"rudy"|"mancuso"|"king"|"bach"...   3191434   146035   
3  ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"...   2095828   132239   
4  edsheeran|"ed sheeran"|"acoustic"|"live"|"cove...  33523622  1634130   

   dislikes  comment_count                                  thumbnail_link  \
0     43420         125882  https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg   
1      1688          13030  https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg   
2      5339           8181  https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg   
3      1989          17518  https://i.ytimg.com/vi/d380meD0W0M/default.jpg   
4     21082          85067  https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg   

   comments_disabled  ratings_disabled  video_error_or_removed  \
0              False             False                   False   
1              False             False                   False   
2              False             False                   False   
3              False             False                   False   
4              False             False                   False   

                                         description country publish_date  
0  Eminem's new track Walk on Water ft. Beyoncé i...      CA   2017-11-10  
1  STill got a lot of packages. Probably will las...      CA   2017-11-13  
2  WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...      CA   2017-11-12  
3  I know it's been a while since we did this sho...      CA   2017-11-12  
4  ?: https://ad.gt/yt-perfect\n?: https://atlant...      CA   2017-11-09  
Each country contributes a roughly equal share of the data:
df.country.value_counts()
Out[9]: 
US    40949
CA    40881
DE    40840
FR    40724
GB    38916
Name: country, dtype: int64
##3. YouTube trends EDA ##
**- Popular categories: Music, Entertainment, and People & Blogs are the most popular**
plt.hist(list(df.category), bins=32, density=True, alpha=0.5, histtype='bar', color='steelblue', edgecolor='blue',rwidth=1,align='mid')
plt.xticks(rotation=90)
plt.title('popular categories')
plt.ylabel('density in trending videos')
plt.show()
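
To read exact shares rather than estimate them from the histogram, one option is `value_counts(normalize=True)`. This is a sketch on a handful of made-up category labels, not the real data.

```python
import pandas as pd

# Made-up labels standing in for df['category']
categories = pd.Series(
    ["Music", "Entertainment", "Music", "Comedy", "Entertainment", "Music"]
)
# normalize=True returns fractions that sum to 1, sorted from most to least common
share = categories.value_counts(normalize=True)
print(share)
```

On the real frame this would be `df['category'].value_counts(normalize=True)`, and `share.plot.bar()` would draw it as one bar per category.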

- Average time from publication to trending, by category: the time taken is not proportional to how popular the category is

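# average time between publication and trending for each category
# both columns hold plain date objects, so convert back to datetime before subtracting
delta=pd.to_datetime(df.trending_date)-pd.to_datetime(df.publish_date)
# convert timedelta to a whole number of days (.dt.days; a timedelta Series has no .dt.day)
df=df.assign(days=delta.dt.days)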
# average days taken to become trending
avg_days=df.groupby(['category'])['days'].mean()
avg_days=avg_days.sort_values()
barlist=plt.bar(avg_days.index,avg_days)
plt.xticks(rotation=90)
plt.ylabel('days taken from publish to trending')
barlist[6].set_color('orange')
barlist[13].set_color('orange')
barlist[16].set_color('orange')
plt.show()

The three most popular categories are highlighted in orange. The time taken is not proportional to how popular the category is.
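
The day count behind this chart can be checked on a small frame. The two rows below reuse the trending and publish dates of the first two videos in the `df.head()` output. The sketch relies on `.dt.days`, the accessor pandas provides for timedelta Series.

```python
import pandas as pd

demo = pd.DataFrame({
    "trending_date": ["2017-11-14", "2017-11-14"],
    "publish_date": ["2017-11-10", "2017-11-13"],
})
# Subtracting two datetime Series yields a timedelta Series
delta = pd.to_datetime(demo["trending_date"]) - pd.to_datetime(demo["publish_date"])
# .dt.days extracts the whole-day component as integers
demo["days"] = delta.dt.days
print(demo)
```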

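- Relationships among views, comments, likes, and dislikes
Computing on the full dataset is slow, and because there is plenty of data, sampling is sufficient. Take a 20% sample and inspect the relationships with a seaborn pairplot: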

sample=df.loc[df['comments_disabled']==False,['views','likes','dislikes','comment_count','country']].sample(frac=0.2)
p_resp=sns.pairplot(sample, hue='country')
out:

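Discussion: the three circled panels carry some interesting information. In the red box, likes vs. views are broadly proportional. In the black and blue boxes, dislikes usually grow slowly relative to views and likes, but there are clearly also a fair number of cases where dislikes grow comparatively fast, and the two dislike growth patterns are sharply distinct. In the red and black boxes, trending videos in the GB list receive fewer likes and dislikes than US trending videos with the same number of views.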

Let us take a closer look at views, likes, and dislikes:

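  • For the vast majority of videos, likes are far easier to earn than dislikes
sns.scatterplot(x='likes',y='dislikes',data=sample,hue='country')
# reference line likes = dislikes; two endpoints are enough to draw it
top=sample.dislikes.max()
plt.plot([0,top],[0,top],label='likes=dislikes',color='k',linestyle='--')
plt.legend()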
likes vs dislikes
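
The same comparison can be put into numbers. A minimal sketch, using only the five rows visible in the `df.head()` output above, so the result illustrates the calculation and is not a statistic for the whole dataset:

```python
import pandas as pd

# likes / dislikes of the five videos shown in df.head()
demo = pd.DataFrame({
    "likes":    [787425, 127794, 146035, 132239, 1634130],
    "dislikes": [43420, 1688, 5339, 1989, 21082],
})
# Fraction of videos that sit below the likes = dislikes line
more_likes = (demo["likes"] > demo["dislikes"]).mean()
# Dislikes received per like, rounded for readability
ratio = (demo["dislikes"] / demo["likes"]).round(3)
print(more_likes)
print(ratio)
```

Applied to `sample`, the first expression would give the share of sampled videos with more likes than dislikes.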
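  • Comparing the per-country linear regressions of likes/dislikes on views, GB trending videos do get the smallest reaction for the same number of views, clearly lower than the other countries.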
sns.lmplot(x='views',y='dislikes',data=sample,hue='country',scatter_kws= {'alpha': 0.3})
plt.xlim(0,300*1e6)
plt.ylim(0,0.6*1e6)
def scient(y, position):
    return str(y/1e6)
formatter = FuncFormatter(scient)
plt.gca().yaxis.set_major_formatter(formatter)
plt.gca().xaxis.set_major_formatter(formatter)
plt.xlabel('views(1e6)')
plt.ylabel('dislikes(1e6)')
plt.title('views vs. dislikes')
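
`sns.lmplot` draws the fitted lines but does not report their slopes. One way to compare countries numerically is a per-country least-squares fit with `np.polyfit`. The sketch below runs on made-up numbers chosen so the slopes are easy to verify. They are not results from the dataset, and `fit_slope` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up data: US gains 2 dislikes per 1000 views, GB gains 1 per 1000 views
demo = pd.DataFrame({
    "country":  ["US", "US", "US", "GB", "GB", "GB"],
    "views":    [1e6, 2e6, 3e6, 1e6, 2e6, 3e6],
    "dislikes": [2000, 4000, 6000, 1000, 2000, 3000],
})

def fit_slope(group):
    # Degree-1 polyfit returns (slope, intercept) of the least-squares line
    slope, intercept = np.polyfit(group["views"], group["dislikes"], deg=1)
    return slope

slopes = {country: fit_slope(group) for country, group in demo.groupby("country")}
print(slopes)
```

Running the same loop over `sample.groupby('country')` would give one slope per country to set beside the plot.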

(To be continued...)


Reposted from blog.csdn.net/qq_33874620/article/details/81840996