A simple Bilibili tag crawler and word-cloud visualization

1. Basic introduction

Bilibili has ridden the wave of new online media in recent years; its creation incentive plan keeps attracting new people to the platform, who in turn produce a large number of works. But as an uploader, how do you choose a suitable topic to create about? One way is to crawl and analyze some of the site's data to find out which content users liked most over a given period. With this in mind, I looked over the Bilibili site and noticed that it has a leaderboard, so I wondered whether I could crawl the leaderboard's video links, collect each video's like count and similar figures, use those numbers to assign weights to the videos' tags, and then visualize the result. Since I have only just started learning Python and web crawling, the code still has some shortcomings, but the basic idea can be realized.

2. Tools used

Environment: Python 3.7
Third-party libraries: selenium (Bilibili pages are rendered dynamically, so selenium is used to retrieve the dynamically rendered content), BeautifulSoup, pytagcloud, pygame

3. Preliminary preparation

selenium: I use Chrome through selenium.webdriver, so a ChromeDriver matching the locally installed version of Chrome must be downloaded; it can be obtained from a mirror site.
For details see https://www.cnblogs.com/lfri/p/10542797.html , which I found very thorough.
BeautifulSoup: install directly with pip: pip3 install beautifulsoup4
pytagcloud: install with: pip3 install pytagcloud
Note that pygame must also be installed, because pytagcloud imports it. The first argument to pytagcloud's make_tags() function should be a dict_items object, which you can get from a dictionary's items() method (Python 3).
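As a minimal sketch of that hand-off, a dictionary of tag weights converts to the (word, count) pairs that make_tags() expects; the tag names and weights below are made-up placeholder values:

```python
# Placeholder tag weights; items() yields a dict_items view of (tag, weight) pairs.
weights = {'game': 350000.0, 'animation': 120000.0, 'music': 80000.0}
counts = weights.items()
print(list(counts))
# The actual pytagcloud call (not run here) would then be:
# from pytagcloud import make_tags
# tags = make_tags(counts, maxsize=120)
```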
pytagcloud Chinese font setup:
1. Open the fonts folder inside the pytagcloud package and copy the .ttf file of the desired font into it.
2. Open the fonts.json file in fonts and add an entry for the new font. Note that the fontname later passed to create_tag_image() must match that entry's "name" value. Without a Chinese font file, a Chinese word cloud cannot be generated!
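As a sketch, after copying a DengXian .ttf into the fonts folder, the added fonts.json entry might look like the following; the "Deng"/"Deng.ttf" names are illustrative, and the stock fonts.json already contains entries of the same shape to copy from:

```json
[
    {
        "name": "Deng",
        "ttf": "Deng.ttf"
    }
]
```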

4. Code implementation

(1) Crawl the Bilibili leaderboard

from selenium import webdriver
from bs4 import BeautifulSoup
import time

class LinkCatch:
    'Returns a list of video URLs'
    def __init__(self, url: str):
        self._url = url  # URL of the leaderboard page
        self._brower = webdriver.Chrome()
        self._html = ''
        self._arrLink = []

    def __catch(self):
        print('LinkCatch.catch() is running...')
        self._brower.get(self._url)
        time.sleep(3)  # wait, so we do not scrape the page before it finishes loading
        self._html = self._brower.page_source

    def __dataAnalyze(self):
        print('LinkCatch.dataAnalyze() is running...')
        soup = BeautifulSoup(self._html, 'lxml')
        temparr = soup.find_all(class_='rank-item')
        for item in temparr:
            self._arrLink.append(item.a['href'])  # store the video URL

    def run(self):
        # fetch and return the list of URLs
        self.__catch()
        self.__dataAnalyze()
        # self._brower.close()
        return self._arrLink

(2) Obtain each video's tags and numeric data such as like counts

class DataCatch:
    'Returns a dict mapping tags to the weighted total of coin, like, and similar counts'
    __keyword = ['share', 'collect', 'coin', 'like']
    __timeout = 3  # sleep time; a longer wait means fewer videos whose data fails to load

    def __init__(self, linkArr):
        # takes the list of video URLs
        self._linkArr = linkArr
        self._likeRate = 1
        self._coinRate = 1
        self._shareRate = 1
        self._collectRate = 1
        self._brower = webdriver.Chrome()
        self.__defaultList = []

    def setRate(self, share: float, collect: float, coin: float, like: float):
        # set the relative weight of each metric
        print('DataCatch.setRate() is running...')
        self._likeRate = like
        self._coinRate = coin
        self._shareRate = share
        self._collectRate = collect

    def __dataDeal(self, arr):
        # return the sum of each metric times its weight; weights default to 1:1:1:1
        print('DataCatch.dataDeal() is running...')
        temparr = []
        for item in arr:  # handle strings containing '万' (ten thousand)
            if not item.isdigit():
                temp = float(item.split('万')[0]) * 10000
            else:
                temp = float(item)
            temparr.append(temp)
        total = (temparr[0] * self._shareRate + temparr[1] * self._collectRate
                 + temparr[2] * self._coinRate + temparr[3] * self._likeRate)
        return total

    def __synthesis(self, tagArr, numberArr):
        # combine one video's tags with its weight: takes the tag list and the
        # metric list, returns a dict mapping each tag to the video's weight
        print('DataCatch.synthesis() is running...')
        N = self.__dataDeal(numberArr)
        dic = {}
        for item in tagArr:
            dic.update({item: N})
        return dic

    def __catchdata(self):
        # fetch all required information for every video URL
        print('DataCatch.catchdata() is running...')
        dic = {}
        for link in self._linkArr:
            tagArr = []
            numberArr = []
            self._brower.get(link)
            time.sleep(self.__timeout)
            for name in self.__keyword:
                numberArr.append(self._brower.find_element_by_class_name(name).text)
            if not numberArr[0].split('万')[0].split('.')[0].isdigit():
                # detect pages that had not finished rendering; such links go into
                # defaultList (no retry is implemented here)
                print("Link " + link + " wasn't fetched!")
                self.__defaultList.append(link)
                continue
            soup = BeautifulSoup(self._brower.page_source, 'lxml')
            for tag in soup.find_all(name='li', class_='tag'):
                tagArr.append(tag.a.text)
            tempdic = self.__synthesis(tagArr, numberArr)
            for key in tempdic.keys():
                # if the tag already exists in the dict, add this video's weight to
                # the stored weight; otherwise insert the tag with its weight
                if key in dic:
                    dic.update({key: dic.get(key) + tempdic.get(key)})
                else:
                    dic.update({key: tempdic.get(key)})
        return dic

    def printDefaultList(self):
        i = 1
        for item in self.__defaultList:
            print(i, item)
            i = i + 1

    def getDefaultList(self):
        # intended for re-fetching, though the retry itself is not implemented, sorry
        return self.__defaultList

    def run(self):
        # return the dict of tags and their weights for the visualization step
        print('DataCatch.run() is running...')
        result = self.__catchdata()
        self.printDefaultList()  # print any links that failed during the crawl
        return result
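The '万' conversion inside __dataDeal above can be isolated into a small standalone helper for clarity. This is just a sketch of the same string-to-number logic; parse_count is my name for it, not part of the original code (a count such as '12.3万' means 12.3 × 10,000):

```python
def parse_count(text: str) -> float:
    # Convert a Bilibili count string to a number.
    # '万' means ten thousand, so '12.3万' becomes 123000.0.
    if not text.isdigit():
        return float(text.split('万')[0]) * 10000
    return float(text)

print(parse_count('12.3万'))
print(parse_count('4567'))
```

Like the original code, this assumes any non-digit string contains '万'; a truly malformed string such as 'abc' would still raise a ValueError.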

(3) Visualize

if __name__ == '__main__':
    from pytagcloud import create_tag_image, make_tags
    url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'  # URL of the Bilibili leaderboard
    linkArr = LinkCatch(url).run()
    dic = DataCatch(linkArr).run()
    count = dic.items()
    tags = make_tags(count, maxsize=120)  # maxsize sets the largest font size
    create_tag_image(tags, 'D://TEST//B站爬虫测试.png', size=(1000, 1000), fontname='Deng')  # 'Deng' is the DengXian font, which you must add to pytagcloud's font library yourself
   

(4) The finished product

[Screenshot of the generated word-cloud image]

Origin blog.csdn.net/YmgmY/article/details/106110676