Crawling Bilibili's programmer ups: what kinds of programmer videos get popular?

Foreword

I'm a rookie programmer who browses Bilibili every morning and every night before bed. Yesterday I saw a post from a programmer up I follow, and it left me with a lot of feelings, so I decided to make this piece of content.
[Image: my Bilibili video version of this post]
So what kinds of videos from programmer ups actually end up being liked by the public?

If you'd rather not read text, click the picture to watch my Bilibili video version. It used to be possible to embed the video directly, but for some reason the iframe no longer works.

Getting to work

1. First, collect the data

Search Bilibili for "程序猿" (programmer), filter to videos, and sort by danmaku count (more danmaku roughly means more real viewers), then collect the link to each video.
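The percent-encoded blob in the search URL used below is just the UTF-8 percent-encoding of the keyword 程序猿; a quick sanity check:

from urllib.parse import quote

# '%E7%A8%8B%E5%BA%8F%E7%8C%BF' in the search URL is the encoded keyword
print(quote("程序猿"))  # -> %E7%A8%8B%E5%BA%8F%E7%8C%BF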
That gives us the link to each video's detail page. From each detail page we use XPath to pull the title, the up's name, the up's homepage, and other information. (Scraping the play count and like count straight off the page produced garbage values in the saved results, so those two numbers come from the API instead.)
The API only needs the video's av id to return the corresponding JSON result, so those numbers are easy to obtain; everything is finally saved to a CSV file.
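As a quick illustration, here is roughly how a single lookup works (the aid value is just a placeholder; the field names match what the crawler below reads):

import requests

# hypothetical av id, purely for illustration
resp = requests.get("https://api.bilibili.com/x/web-interface/view?aid=170001",
                    headers={"User-Agent": "Mozilla/5.0"})
stat = resp.json()["data"]["stat"]
print(stat["view"], stat["like"])  # play count and like count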

Crawler code:

import requests
from lxml import etree
import time
import pandas as pd
import re
import json

#https://search.bilibili.com/video?keyword=%E7%A8%8B%E5%BA%8F%E7%8C%BF&order=dm&duration=0&tids_1=36&tids_2=122&page=32

def get_html(url,header):
    # fetch a page and return its raw HTML text
    html=requests.get(url,headers=header).text
    return html

def get_all_page(n):
    # collect the detail-page link of every video on the first n search result pages
    urls=[]
    for i in range(1,n+1):
        url=f"https://search.bilibili.com/video?keyword=%E7%A8%8B%E5%BA%8F%E7%8C%BF&order=dm&duration=0&tids_1=36&tids_2=122&page={i}"
        html=get_html(url,headers)
        selector = etree.HTML(html)
        li_list=selector.xpath("//ul[@class='video-list clearfix']")
        for li in li_list:
            # relative path (.//) so only items inside this <ul> are matched
            urls.extend(li.xpath(".//li[@class='video-item matrix']/a/@href"))
    return urls

def get_information(urls,avid):
    space_url=[]     # up homepage links
    name_list=[]     # up names
    views_list=[]    # play counts
    dz_list=[]       # like counts
    video_names=[]   # video titles
    count=0
    for url in urls:
        count+=1
        if count%10==0:
            # pause every ten requests to go easy on the server
            time.sleep(1)
        url=url.replace('//','https://')
        print("正在爬取:",url)
        html=get_html(url,headers1)
        selector = etree.HTML(html)
        space_url.append(selector.xpath("//div[@class='name']/a[1]/@href")[0])
        name_list.append(selector.xpath("//div[@class='name']/a[1]/text()")[0])
        video_names.append(selector.xpath("//h1/@title")[0])
        # scraping play/like counts off this page gave bad values, so the API below is used instead:
        # views_list.append(selector.xpath("//div[@class='video-data']/span[1]/text()")[0])
        # dz_list.append(selector.xpath("//div[@class='ops']/span[1]/text()")[0])

    for id in avid:
        # the view API returns a video's stats as JSON given its av id
        base_url="https://api.bilibili.com/x/web-interface/view?aid="
        html=get_html(base_url+id,headers2)
        res=json.loads(html)
        video_info = res['data']
        views_list.append(video_info["stat"]["view"])
        dz_list.append(video_info["stat"]["like"])
    return space_url,name_list,views_list,dz_list,video_names


def save(n):
    urls=get_all_page(n)
    avid=[]
    for i in urls:
        # the av id is the first run of digits in the detail-page URL
        avid.append(re.findall(r"\d+",i)[0])
    space_url,name_list,views_list,dz_list,video_names=get_information(urls,avid)
    data=pd.DataFrame({"空间链接":space_url,"up主":name_list,"视频名":video_names,"视频播放次数":views_list,"视频点赞数":dz_list})
    data.to_csv('./B站程序猿up主视频信息.csv',encoding='utf8')
    print("所有数据爬取完毕")



if __name__ == '__main__':
    headers = {
        'Host': 'search.bilibili.com',
        'Referer': 'https://www.bilibili.com/',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    headers1 = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        'Host': 'www.bilibili.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive'
    }


    headers2={
        'Host': 'api.bilibili.com',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    n=int(input("请输入想要爬取的页数:"))
    save(n)

I crawled ten pages, 200 rows of data in total.

2. Data analysis

First the basic imports, loading the data, and renaming the columns.
Then some pre-cleaning: checking for outliers and missing values. Bilibili's data is fairly well-behaved, though, so there were no anomalies or missing values.
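The notebook screenshots are omitted here, so below is a minimal sketch of this step; it assumes the column names written by the crawler above, renamed to match the analysis code that follows:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('./B站程序猿up主视频信息.csv')
# rename 'up主' to 'up主名' so the analysis code below finds the column
data = data.rename(columns={'up主': 'up主名'})
print(data.isnull().sum())   # check for missing values
print(data.describe())       # eyeball the numeric columns for outliers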
Then a groupby analysis: which ups have the most total likes, and which have the most total plays.
[Charts: top ups by total likes and by total plays]
This part is very simple to implement with groupby, so only the analysis and plotting code for likes is shown; a sketch of the play-count version follows the code below.

# find the ups with the most total video likes
most_dz=data.groupby(by=data['up主名'],as_index=False)['视频点赞数'].sum()
most_dz.columns=['up主名','视频点赞总数']
most_dz.head()

# sort by total likes in descending order
most_dz=most_dz.sort_values(by=['视频点赞总数'],ascending=False)
most_dz.head(10)

# visualize the top 20 ups by total likes
plt.figure(figsize=(13,10))
sns.barplot(x=most_dz['up主名'][:20], y=most_dz['视频点赞总数'][:20])
plt.title('程序猿up主视频点赞总数前20', fontsize=22)
plt.grid()
plt.xticks(rotation=90)
plt.show()
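The play-count version follows the same pattern; a minimal sketch under the same column-name assumptions (most_bf is a name I made up here):

# find the ups with the most total plays, same groupby pattern as above
most_bf=data.groupby(by=data['up主名'],as_index=False)['视频播放次数'].sum()
most_bf.columns=['up主名','视频播放总数']
most_bf=most_bf.sort_values(by=['视频播放总数'],ascending=False)

# visualize the top 20 ups by total plays
plt.figure(figsize=(13,10))
sns.barplot(x=most_bf['up主名'][:20], y=most_bf['视频播放总数'][:20])
plt.title('程序猿up主视频播放总数前20', fontsize=22)
plt.grid()
plt.xticks(rotation=90)
plt.show()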

3. What types of videos come out on top

Looking at the charts, I noticed that the "Confused Teacher" up makes relatively few videos that involve actual programming; most are about power-user tricks for everyday computer use. It seems that if I ever want to farm a few more likes, I should post less programming content and more videos about everyday computer use. After all, programming tutorials mostly get buried in viewers' favorites folders to gather dust; my own favorites folder holds plenty of them.
[Screenshot: my favorites folder; one video shows 7.37 million+ plays]
Wait, a single video with 7.37 million+ plays? That is seriously impressive. It also shows that plenty of people are still willing to learn programming on Bilibili, and Python-related videos in particular seem to be on fire.

4. Summary

To sum up:

  1. On Bilibili, tutorial videos stand a very good chance of being dropped into a favorites folder to gather dust, so programmer ups should build videos around the audience's actual preferences. Python tutorial series, for example, are very popular (Python has a good ecosystem and is easy to learn), and so are videos teaching power-user tricks for everyday computer use; both attract viewers relatively easily.

  2. The fact that the "Program Sheep" up did not crack the top 20 makes this problem even clearer: his video quality is very high, but the audience is too narrow, because almost all of his videos are about the programmer's craft and are not easy for beginners. Programmers mostly find them full of insight, while outsiders come away baffled, as if they had been read scripture. Ha, but for those of us who are programmers, this up is an absolute treasure: precisely because he chose to cover the content most useful to programmers, giving up some broader appeal, we get to learn more and more solid material.

  3. Now consider programmer ups with a different starting point, such as "Rice Daily". His play counts and likes are so high because his videos don't cater only to programmers: he uses programmer in-jokes to bring us joy while also looking after other audiences, starting from less esoteric computer knowledge so viewers get to understand more about programmer life. On top of that, each video is very short, in step with the current short-video trend.

  4. It is not hard to spot familiar ups on the list, such as "Unconventional" and "Technical Fat": these are ups who share their knowledge, ideas, and feelings with us. There are also some ups who did not make the list but whom I want to mention: teacher "Seven Meters", Cai Cai, Wang Zhe, the .NET great Anduin from long, long ago, and several others. Some of them may never have gotten very popular, and various circumstances may have led some to stop updating, but I believe they all made their videos with real care. Teacher "Seven Meters" and Wang Zhe make Go-language videos; Cai Cai covers data analysis and machine learning, mostly as short Python knowledge clips; and there was the up who used to walk through algorithm pitfalls every day. It is ups like these who keep adding to what I know, and they are a big reason I love browsing Bilibili. I don't know why the updates stopped for so long, but I still hope he comes back as soon as possible with more bite-sized Python tips.

Finally

From the analysis above, we can see that the most popular videos share a few attractive traits:
1. A wide audience
2. Funny and easy to watch
3. Mostly short videos
4. Videos about popular, easy-to-learn technologies (Python, for example) also do very well
But we still need the ups who publish meaningful tutorials and substantial videos; those are what help us improve the most.

Of course, this analysis only scratches the surface. We could dig much deeper: crawl more of the popular ups' videos along with their partitions, profiles, danmaku, and video lengths, run sentiment analysis on the text, and look at the data from more angles to find the factors with the biggest influence.
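For example, a minimal sketch of scoring danmaku sentiment, assuming the danmaku text has already been crawled into a list (SnowNLP is one common library for Chinese sentiment; the sample lines here are made up):

from snownlp import SnowNLP

# made-up danmaku for illustration; in practice these would be crawled per video
danmaku = ["up主讲得真好", "完全听不懂,告辞", "python yyds"]

for text in danmaku:
    # sentiments is a score in [0, 1]; closer to 1 means more positive
    print(text, SnowNLP(text).sentiments)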

Note that this data and these charts do not reflect the full, true picture: I only crawled the first ten pages of results, and the search returns some content unrelated to programmers, which adds noise to the data. The analysis and summary above are drawn from that data plus the ups I personally follow, so they cannot be taken as entirely accurate, but they do put forward some ideas and suggestions.


Source: blog.csdn.net/shelgi/article/details/104509693