Python crawler study notes (1): Crawling basic information of Bilibili videos

1. Background

        Recently I needed to crawl video data from Bilibili for a data analysis task, but the code I wrote two years ago no longer ran, or ran with so many problems that it was unusable. After some tinkering I concluded that the cause was (probably, I'm not completely sure!) a change in Bilibili's page markup. Since that old code had been copied together from here and there, I could no longer follow what it was doing (I honestly don't know how it ever ran), so I decided to start over and write it myself.

2. Part 1: Using Selenium to obtain BV_ID

        For a Bilibili video, knowing its BV number is like knowing its ID-card number: once you have it, obtaining the rest of the video's information is not difficult. For example, the video at https://www.bilibili.com/video/BV1n24y1D75V has the BV number BV1n24y1D75V. So the first step in this article is to collect the "ID cards" of the Bilibili videos we want to crawl, the BV_ID.

        This is the first and most critical step, and it relies on the selenium library. For a detailed introduction to selenium, other blogs cover it well; I will not go into detail here.

Selenium library installation:

pip install selenium

The specific usage is as follows:

        Import the necessary libraries

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

Define a selenium crawler

        Only the search keyword is passed in as a parameter; everything else the function derives from it.

def spider_bvid(keyword):
    """
    利用selenium获取搜索结果的bvid,供给后续程序使用
    :param keyword: 搜索关键词
    :return: 生成去重的output_filename = f'{keyword}BV号.csv'
    """
    

        First, decide on the file that will hold the crawl results. Because the amount of data is large, I chose a CSV file; when you run it, use whatever format you prefer.

        Then configure a headless (interface-less) browser, set the window size, and disable GPU acceleration, to reduce the browser's memory usage and keep it from crashing.

    # 保存的文件名
    input_filename = f'{keyword}BV号.csv'

    # 启动爬虫
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=options)  # 设置无界面爬虫
    browser.set_window_size(1400, 900)  # 设置全屏,注意把窗口设置太小的话可能导致有些button无法点击
    browser.get('https://bilibili.com')
    # 刷新一下,防止搜索button被登录弹框遮住
    browser.refresh()
    print("============成功进入B站首页!!!===========")
    

        Use element locating to find the search box and search button on the Bilibili homepage, type in the keyword we want to search for, and click the search button.

    input = browser.find_element(By.CLASS_NAME, 'nav-search-input')
    button = browser.find_element(By.CLASS_NAME, 'nav-search-btn')

    # 输入关键词并点击搜索
    input.send_keys(keyword)
    button.click()
    print(f'==========成功搜索{keyword}相关内容==========')

    

        After landing on the search results page, the original plan was: 1. locate the page-number element and the "next page" button at the bottom of the page using CSS selectors, XPath, or another element-locating method; 2. read the text of the last page-number button to get the total number of pages, then simulate clicking "next page" in a loop, so that every results page gets crawled.

        However, after Bilibili changed its page code, the page displays 34 pages while inspecting the page source shows 42 pages (the maximum), so the pagination elements can no longer be located consistently. When I ran the program myself, the last-page button could be located for the first keyword searched, but the same element could not be located for the next keyword in the loop.
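        For reference, the abandoned route would have looked roughly like the sketch below. It is only a sketch: the XPath is the one left in the commented-out block of the complete code at the end of this post, and it may well no longer match the current page, which is exactly why I gave up on it.

    # Rough sketch of the abandoned idea; the XPath comes from the commented-out
    # block in the complete code below and may no longer match Bilibili's markup.
    last_page_xpath = '//*[@id="i_cecream"]/div/div[2]/div[2]/div/div/div/div[4]/div/div/button[9]'
    try:
        total_btn = browser.find_element(By.XPATH, last_page_xpath)
        total_page = int(total_btn.text)  # the number printed on the last page button
    except Exception:
        total_page = 42  # fall back to Bilibili's 42-page maximum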

        So I changed the approach: Bilibili search results never show everything anyway; at most a bit over a thousand results are returned, which corresponds to a maximum of 42 pages. So whenever a manual search shows many result pages, we can simply set the maximum page count to 42.

        Since this beginner could not locate the new "next page" button, I instead loop over page numbers and build the search URL for each page, which achieves the same effect as clicking "next page". The side effect of this approach is that some videos are crawled repeatedly, so the results need to be deduplicated at the end.

        It is also worth mentioning that if you do want to locate those elements, there is an author on this site, "Eagle of the Pampas", who wrote a blog post that uses selenium alone to scroll through the search results and crawl the first- and second-level comments of every video. The code is open source, and the element locating and resumable crawling are done very well; it is worth studying if you have time.

    # 设置窗口
    all_h = browser.window_handles
    browser.switch_to.window(all_h[1])

    # B站最多显示42页
    total_page = 42
    # 同样由于B站网页代码的更改,通过找到并点击下一页的方式个人暂不能实现
    #(对,不会分析那个破网页!!!)

    for i in range(0, total_page):
        # url 需要根据不同关键词进行调整内容!!!
        # 这里的url需要自己先搜索一下然后复制网址进来
        url = (f"https://search.bilibili.com/all?keyword={keyword}"
               f"&from_source=webtop_search&spm_id_from='你自己的'&search_source='你自己的'&page={i}")

        print(f"===========正在尝试获取第{i + 1}页网页内容===========")
        print(f"===========本次的url为:{url}===========")
        browser.get(url)
        # 这里请求访问网页的时间也比较久(可能因为我是macos),所以是否需要等待因设备而异
        # 取消刷新并长时间休眠爬虫以避免爬取太快导致爬虫抓取到js动态加载源码
        # browser.refresh()
        print('正在等待页面加载:3')
        time.sleep(1)
        print('正在等待页面加载:2')
        time.sleep(1)
        print('正在等待页面加载:1')
        time.sleep(1)

        

        Once we can load every results page, we can parse each page directly. Since all we need here is the BV number, there is no point scraping data that is much easier to obtain later through the API.

        Here we use bs4 to parse the page, locate the href inside each video card, and get the URL of each video's detail page; splitting that URL gives us the BV number we need.
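        To make the splitting step concrete, here is what it does to a typical card link (the BV id is the example reused later in this post, and the exact href form is an assumption, which is why the code keeps a commented-out print(split_url_data) for checking). A list comprehension is used below purely to show the end result of dropping the empty pieces.

href = '//www.bilibili.com/video/BV1n24y1D75V/'   # assumed form of a result-card link
split_url_data = href.split('/')
# ['', '', 'www.bilibili.com', 'video', 'BV1n24y1D75V', '']
split_url_data = [element for element in split_url_data if element != '']
# ['www.bilibili.com', 'video', 'BV1n24y1D75V']  ->  index 2 is the BV id
bvid = split_url_data[2]
print(bvid)  # BV1n24y1D75V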

        # 直接分析网页
        html = browser.page_source
        # print("网页源码" + html) 用于判断是否获取成功
        soup = BeautifulSoup(html, 'lxml')
        infos = soup.find_all(class_='bili-video-card')
        bv_id_list = []
        for info in infos:
            # 只定位视频链接
            href = info.find('a').get('href')
            # 拆分
            split_url_data = href.split('/')
            # 利用循环删除拆分出现的空白
            for element in split_url_data:
                if element == '':
                    split_url_data.remove(element)
            # 打印检验内容
            # print(split_url_data)
            # 获取bvid
            bvid = split_url_data[2]

            # 利用if语句直接去重
            if bvid not in bv_id_list:
                bv_id_list.append(bvid)
        for bvid_index in range(0, len(bv_id_list)):
            # 写入 input_filename
            write_to_csv_bvid(input_filename, bv_id_list[bvid_index])
        # 输出提示进度
        print('写入文件成功')
        print("===========成功获取第" + str(i + 1) + "次===========")
        time.sleep(1)

    # 退出爬虫
    browser.quit()

    # 打印信息显示是否成功
    print(f'==========爬取完成。退出爬虫==========')

        After the files are written (and later merged and deduplicated in the final step), we have the BV numbers we need. Next we can use them to crawl each video's basic information.

3. Part 2: Accessing the API with requests

        The requests library needs no introduction; almost every crawler project uses it, so I won't go into detail. The other reason for using requests is that Bilibili's open API endpoints make it easy to get exactly the information we want.

        The bilibili-api library could also have been used. The reason it is not used in this article is that fetching video details with it requires asynchronous requests, and calling an async function directly inside a loop produces all kinds of errors that a beginner simply cannot fix and that are hard to understand even after digging through documentation (yes, I didn't understand them), for example aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected, broken pipe, and so on.

        So a fairly clumsy method is used here instead: splice the BV number into the URL of the video-data API endpoint, request it, convert the response to JSON, and then read the values straight out of the resulting dictionary.

        If you want to know more about Bilibili's API endpoints, see this collection on GitHub:

https://github.com/SocialSisterYi/bilibili-API-collect

        The API interface URL for video information here is:

# Example: https://api.bilibili.com/x/web-interface/view?bvid=BV1n24y1D75V
api_url = f'https://api.bilibili.com/x/web-interface/view?bvid={bv_id}'
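        Before looking at the full function, here is the smallest possible version of the request, just as a sketch. It assumes the endpoint returns a JSON object with a code field (0 on success) and a data object, as described in the GitHub documentation above; fill in your own headers as explained further down.

import json
import requests

bv_id = 'BV1n24y1D75V'  # the example BV id from above
api_url = f'https://api.bilibili.com/x/web-interface/view?bvid={bv_id}'
# replace the two header values with your own, as explained below
resp = requests.get(api_url, headers={'User-Agent': '你的', 'Cookie': '你的'})
resp_json = json.loads(resp.text)
if resp_json.get('code') == 0:  # 0 means the request succeeded
    print(resp_json['data']['title'])  # e.g. print the video title
else:
    print('request failed:', resp_json.get('message'))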

A note up front: when using requests, pay attention to the request interval. If your device reaches the URL very quickly, lengthen the interval a little; if it is slow, you can shorten it, but the value passed to time.sleep() should be at least about 1.5 s, otherwise you will start seeing errors like the one below, or your IP will get blocked by Bilibili. (If your IP really is blocked, just change the network environment; for example, if you are on home Wi-Fi, switching to your phone's hotspot solves it.)

requests.exceptions.SSLError: HTTPSConnectionPool(host='api.bilibili.com', port=443)
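        If that error does show up now and then, one simple hedge is to wrap the request in a small retry helper that pauses between attempts. This is only a sketch of the idea and not part of the original script; the 3 attempts and 2-second wait are arbitrary choices.

import time
import requests

def get_with_retry(url, headers, attempts=3, wait=2.0):
    """Request url, retrying a few times on connection/SSL errors."""
    for attempt in range(attempts):
        try:
            return requests.get(url, headers=headers, timeout=10)
        except requests.exceptions.RequestException as e:  # includes SSLError
            print(f'request failed ({e}), retry {attempt + 1}/{attempts}')
            time.sleep(wait)
    return None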

        The specific calling code is:

        First we need to build our own request headers. User-Agent identifies our browser (Chrome, Firefox, and so on); Cookie is the session data the site stores and mainly uses to identify you. Both values can be captured from the browser. Taking Chrome as an example: after logging in to Bilibili, open any video, press F12 (Windows) or Option+Command+I (macOS), go to the Network tab, and look for a request whose name starts with total?list; in its request headers you can find the two values fairly easily.

        Next, after passing in the BV number and splicing it into a usable URL, you can open that URL in a browser first to check it yourself and make sure it is valid. Then use the json library to parse the returned page source as JSON; the result is basically a dictionary, which is easy to work with.

def get_video_info(bv_id):
    headers = {
        'User-Agent': "你的",
        'Cookie': "你的"}

    api_url = f'https://api.bilibili.com/x/web-interface/view?bvid={bv_id}'
    # 打印本次要获取的bvid,用于错误时确认
    print(f"正在进行爬取uid为:{bv_id}的UP主的粉丝数量与作品总数")
    print(f"==========本次获取数据的视频BV号为:{bv_id}==========")
    print(f"url为:{api_url}")
    # https://api.bilibili.com/x/web-interface/view?BV1n24y1D75V
    video_info = requests.get(url=api_url, headers=headers)
    video_info_json = json.loads(video_info.text)
    

        After getting the source code of the web page, the values ​​we need are stored in the "data" tag, and we can call it directly according to our needs. I have created a new dictionary here to store values, so you don’t have to go through so much trouble. For the relevant values ​​and the Chinese meanings corresponding to the English names, you can refer to the introduction of this Zhihu column, or you can view it in the github document above.

    # 创建存放的字典
    info_dict = {}
    # 信息解读
    # https://zhuanlan.zhihu.com/p/618885790
    # 视频bvid,即bv号
    bvid = video_info_json['data']['bvid']
    info_dict['bvid'] = bvid
    # 视频aid,即av号
    aid = video_info_json['data']['aid']
    info_dict['aid'] = aid
    # 视频cid,用于获取弹幕信息
    cid = video_info_json['data']['cid']
    info_dict['cid'] = cid
    # 作者id
    mid = video_info_json['data']['owner']['mid']
    info_dict['mid'] = mid
    # up主昵称
    name = video_info_json['data']['owner']['name']
    info_dict['name'] = name
    # 视频标题
    title = video_info_json['data']['title']
    info_dict['title'] = title
    # 视频标签
    tname = video_info_json['data']['tname']
    info_dict['tname'] = tname
    # 视频发布时间戳
    pubdate = video_info_json['data']['pubdate']
    # 转化时间戳
    pub_datatime = datetime.fromtimestamp(pubdate)
    # 整体格式
    pub_datatime_strf = pub_datatime.strftime('%Y-%m-%d %H:%M:%S')
    # 日期
    date = re.search(r"(\d{4}-\d{1,2}-\d{1,2})", pub_datatime_strf)
    info_dict['pub_date'] = date.group()
    # 时间
    pub_time = re.search(r"(\d{1,2}:\d{1,2}:\d{1,2})", pub_datatime_strf)
    info_dict['pub_time'] = pub_time.group()
    # 视频创建时间戳
    # ctime = info['ctime']
    # 视频简介
    desc = video_info_json['data']['desc']
    info_dict['desc'] = desc
    # 视频播放量
    view = video_info_json['data']['stat']['view']
    info_dict['view'] = view
    # 点赞数
    like = video_info_json['data']['stat']['like']
    info_dict['like'] = like
    # 投币数
    coin = video_info_json['data']['stat']['coin']
    info_dict['coin'] = coin
    # 收藏数
    favorite = video_info_json['data']['stat']['favorite']
    info_dict['favorite'] = favorite
    # 分享数
    share = video_info_json['data']['stat']['share']
    info_dict['share'] = share
    # 评论数
    reply = video_info_json['data']['stat']['reply']
    info_dict['reply'] = reply
    # 视频弹幕数量
    danmaku = video_info_json['data']['stat']['danmaku']
    info_dict['danmaku'] = danmaku

    print(f'=========={bv_id} 的视频基本信息已成功获取==========')

    # 发布作品时的动态
    # dynamic = info['dynamic']
    print('正在等待,以防访问过于频繁\n')
    time.sleep(3)

    return info_dict

        The function returns the dictionary holding our data, which can be used directly later.
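        As a quick sanity check, you can call it once by hand; the BV id below is the example used earlier, and the keys are the ones filled into info_dict above.

info = get_video_info('BV1n24y1D75V')
print(info['title'], info['view'], info['like'])  # title, play count, likes
print(info['pub_date'], info['pub_time'])         # publication date and time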

        The overall idea for getting the uploader (UP主) information is the same, so I won't go over it again and will just paste the code:

        

def get_user_info(uid):
    """
    通过uid(即mid)获取UP主的粉丝总数和作品总数
    :param uid: mid
    :return:user_info_dict
    """
    # 定义空字典用于存放数据
    # 粉丝数 follower
    # 作品总数 archive
    user_info_dict = {}
    # 首先写入请求头
    # 设置用户代理 User_Agent及Cookies
    headers = {
        'User-Agent': "",
        'Cookie': ""}

    # 将传入的的uid组成up主主页的api_url
    # A Example: https://api.bilibili.com/x/web-interface/card?mid=1177893348
    api_url = f'https://api.bilibili.com/x/web-interface/card?mid={uid}'
    # https://api.bilibili.com/x/web-interface/view?BV1n24y1D75V
    # 打印次数,数据量大,便于查看进程
    print(f"正在进行爬取uid为:{uid}的UP主的粉丝数量与作品总数")

    # 打印本次要获取的uid,用于错误时确认
    print(f"==========本次获取数据的up主的uid为:{uid}==========")
    print(f"url为{api_url}")

    # 利用requests进行访问,并返回需要的封装信息
    up_info = requests.get(url=api_url, headers=headers)

    # 不知道会不会被封ip,保险起见
    # time.sleep(2)

    # 将数据转化为json格式
    up_info_json = json.loads(up_info.text)

    # 利用json定位相关数据
    fans_number = up_info_json['data']['card']['fans']
    user_info_dict['follower'] = fans_number
    archive_count = up_info_json['data']['archive_count']
    user_info_dict['archive'] = archive_count

    print(f'==========uid为 {uid} 的UP主基本信息已成功获取==========\n')

    # 等待
    print('正在等待,以防访问过于频繁\n')
    time.sleep(1.5)
    return user_info_dict
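        Usage is the same as above; the mid comes from the video information we already fetched (the value below is just the example mid from the comment in the code).

user_info = get_user_info(uid=1177893348)  # example mid from the comment above
print(user_info['follower'], user_info['archive'])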

4. Part 3: The final call

        With the functions above written, all that is left is to create a main entry point and call them.

if __name__ == '__main__':

    # 针对不同内容修改搜索关键词!!!!
    keywords = ["1", "2"]
    for keyword in keywords:
        # 自动爬取多个主题时须注意上面的最大页数定位问题
        # 爬取后生成去重了的len(keywords)个f'{keyword}BV号.csv'文件
        spider_bvid(keyword)
    for keyword in keywords:
        # 拼接成文件名
        csv_to_merge = f'{keyword}BV号.csv'
        # 合并后生成未去重的文件
        merge_csv(input_filename=csv_to_merge, output_filename='BV号合并.csv')
    

    # 遍历读取bv_id
    filename = 'BV号合并.csv'
    # 打开文件并去重
    open_csv = pd.read_csv(filename)
    open_csv = open_csv.drop_duplicates(subset='BV号')
    bv_id_list = np.array(open_csv['BV号'])
    """
    # 第一次调用,若读取csv进行爬取时,意外中断
    # 则更改为读取txt文本,将已爬取第bvid删除,以达到断点续爬的目的
    for bvid in bv_id_list:
        with open("bv_id_list.txt", 'a') as f:
            f.write(bvid+'\n')
    """
    with open("bv_id_list.txt", 'r') as f:
        bv_id_list = f.readlines()
    # 循环写入内容
    for i in range(0, len(bv_id_list)):
        bv_id = bv_id_list[i]
        print(f'正在进行第{i+1}次爬取\n')
        # 获取视频所有的基本信息
        video_info = get_video_info(bv_id)
        bvid = video_info['bvid']
        aid = video_info['aid']
        cid = video_info['cid']
        mid = video_info['mid']
        name = video_info['name']
        title = video_info['title']
        tname = video_info['tname']
        pub_date = video_info['pub_date']
        pub_time = video_info['pub_time']
        desc = video_info['desc']
        view = video_info['view']
        like = video_info['like']
        coin = video_info['coin']
        favorite = video_info['favorite']
        share = video_info['share']
        reply = video_info['reply']
        danmaku = video_info['danmaku']

        # 传播效果计算公式
        Communication_Index = math.log(
            0.5 * int(view) + 0.3 * (int(like) + int(coin) + int(favorite)) + 0.2 * (int(reply) + int(danmaku)))
        # 获取作者的相关信息
        user_info = get_user_info(uid=mid)
        follower = user_info['follower']
        archive = user_info['archive']
        write_to_csv(filename='视频基本信息.csv', bvid=bvid, aid=aid, cid=cid, mid=mid, name=name, follower=follower,
                     archive=archive, title=title, tname=tname, pub_date=pub_date, pub_time=pub_time, desc=desc,
                     view=view, like=like, coin=coin, favorite=favorite, share=share, reply=reply, danmaku=danmaku,
                     communication_index=Communication_Index)
        print(f'==========第{i+1}个BV号:{bv_id}的相关数据已写入csv文件中==========')
        print('==================================================\n')

5. Complete code

        Some things to note:

        1. The write_to_csv() function is adapted from another author's code. The else branch of its retry loop is rarely reached, but the rest is genuinely handy; read through it and adapt it to your own needs.

        2. The communication-effect formula is quoted from a published paper; please cite it if you use it, since academic misconduct is a serious matter. The formula as implemented in the code is written out right after this list.

        Reference: Chen Qiang, Zhang Yangyi, Ma Xiaoyue, et al. Factors influencing the information dissemination effect of government Bilibili accounts and empirical research [J]. Library and Information Service, 2020, 64(22): 126-134.

        3. Some parts of the code are solved in a rather clumsy way. If you have optimized them, please share in the comments so we can learn from each other.
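        For clarity, the communication-effect index computed in the main loop (math.log is the natural logarithm) is:

$$\text{Communication\_Index} = \ln\bigl(0.5\,\text{view} + 0.3\,(\text{like} + \text{coin} + \text{favorite}) + 0.2\,(\text{reply} + \text{danmaku})\bigr)$$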

# -*- coding: utf-8 -*-
"""
@ Project : pythonProject
@ File : spider bilibi.py
@ IDE : PyCharm
@ Author : Avi-OvO-CreapDiem
@ Date : 2023/9/2 08:49
@ Purpose : 
"""

import re
import os
import csv
import time
import math
import json
import requests
import numpy as np
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options


def merge_csv(input_filename, output_filename):
    """
    读取csv文件内容,并写入新的文件
    :param input_filename: 传入的文件名称
    :param output_filename: 写入的新文件的名称
    :return: 向新文件中写入input_filename中的内容
    """

    # 读取文件
    csv_data_read = pd.read_csv(input_filename)
    # 获取文件总行数
    number_of_row = (len(csv_data_read))
    # 循环该csv文件中的所有行,并写入信息
    for i in range(0, number_of_row):
        row_info = csv_data_read.values[i]
        # 输出查看内容
        # print(row_info)
        # 具体内容
        row_content = row_info[0]
        # 写入
        write_to_csv_bvid(output_filename, row_content)
        # 退出循环
    # 打印进度
    print(f'成功向{output_filename}中写入了{input_filename}的全部信息')


def write_to_csv_bvid(input_filename, bvid):
    """
    写入新的csv文件,若没有则创建,须根据不同程序进行修改
    :param input_filename: 写入的文件名称
    :param bvid: BV号
    :return: 生成写入的input_filename文件
    """
    # OS 判断路径是否存在
    file_exists = os.path.isfile(input_filename)
    # 设置最大尝试次数
    max_retries = 50
    retries = 0

    while retries < max_retries:
        try:
            with open(input_filename, mode='a', encoding='utf-8', newline='') as csvfile:
                fieldnames = ['BV号']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

                if not file_exists:
                    writer.writeheader()

                writer.writerow({
                    'BV号': bvid
                })
                # print('写入文件成功')
            break  # 如果成功写入,跳出循环
        except PermissionError as e:
            retries += 1
            print(f"将爬取到的数据写入csv时,遇到权限错误Permission denied,文件可能被占用或无写入权限: {e}")
            print(f"等待3s后重试,将会重试50次... (尝试 {retries}/{max_retries})")
            time.sleep(3)  # 等待3秒后重试
    else:
        print("将爬取到的数据写入csv时遇到权限错误,且已达到最大重试次数50次,退出程序")


def spider_bvid(keyword):
    """
    利用selenium获取搜索结果的bvid,供给后续程序使用
    :param keyword: 搜索关键词
    :return: 生成去重的output_filename = f'{keyword}BV号.csv'
    """
    # 保存的文件名
    input_filename = f'{keyword}BV号.csv'

    # 启动爬虫
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=options)  # 设置无界面爬虫
    browser.set_window_size(1400, 900)  # 设置全屏,注意把窗口设置太小的话可能导致有些button无法点击
    browser.get('https://bilibili.com')
    # 刷新一下,防止搜索button被登录弹框遮住
    browser.refresh()
    print("============成功进入B站首页!!!===========")
    input = browser.find_element(By.CLASS_NAME, 'nav-search-input')
    button = browser.find_element(By.CLASS_NAME, 'nav-search-btn')

    # 输入关键词并点击搜索
    input.send_keys(keyword)
    button.click()
    print(f'==========成功搜索{keyword}相关内容==========')

    # 设置窗口
    all_h = browser.window_handles
    browser.switch_to.window(all_h[1])
    """
    # 这里可以通过xpath或者其他方法找到B站搜索结果页最下方的页码数值
    # 但B站网页代码更改后,显示为34页,网页内容检查后显示为42页(至多)
    # 由于我们的搜索结果很多,肯定超出B站最大显示的42页,故而直接设置最大页数为42
    # 找到最后一个页码所在位置,并获取值
    # total_btn = browser.find_element(By.XPATH,"//*[@id="i_cecream"]/div/div[2]/div[2]/div/div/div/div[4]/div/div/button[9]"")
    # //*[@id="i_cecream"]/div/div[2]/div[2]/div/div/div/div[4]/div/div/button[9]
    # total = int(total_btn)
    # print(f'==========成功搜索! 总页数: {total}==========')
    """

    # B站最多显示42页
    total_page = 42
    # 同样由于B站网页代码的更改,通过找到并点击下一页的方式个人暂不能实现(对,不会分析那个破网页!!!)
    # 因此这里利用总页数进行循环访问来实现自动翻页的效果

    for i in range(0, total_page):
        # url 需要根据不同关键词进行调整内容!!!
        url = (f"https://search.bilibili.com/all?keyword={keyword}"
               f"&from_source=webtop_search&spm_id_from=333.1007&search_source=5&page={i}")

        print(f"===========正在尝试获取第{i + 1}页网页内容===========")
        print(f"===========本次的url为:{url}===========")
        browser.get(url)
        # 这里请求访问网页的时间也比较久(可能因为我是macos),所以是否需要等待因设备而异
        # 取消刷新并长时间休眠爬虫以避免爬取太快导致爬虫抓取到js动态加载源码
        # browser.refresh()
        print('正在等待页面加载:3')
        time.sleep(1)
        print('正在等待页面加载:2')
        time.sleep(1)
        print('正在等待页面加载:1')
        time.sleep(1)

        # 直接分析网页
        html = browser.page_source
        # print("网页源码" + html) 用于判断是否获取成功
        soup = BeautifulSoup(html, 'lxml')
        infos = soup.find_all(class_='bili-video-card')
        bv_id_list = []
        for info in infos:
            # 只定位视频链接
            href = info.find('a').get('href')
            # 拆分
            split_url_data = href.split('/')
            # 利用循环删除拆分出现的空白
            for element in split_url_data:
                if element == '':
                    split_url_data.remove(element)
            # 打印检验内容
            # print(split_url_data)
            # 获取bvid
            bvid = split_url_data[2]

            # 利用if语句直接去重
            if bvid not in bv_id_list:
                bv_id_list.append(bvid)
        for bvid_index in range(0, len(bv_id_list)):
            # 写入 input_filename
            write_to_csv_bvid(input_filename, bv_id_list[bvid_index])
        # 输出提示进度
        print('写入文件成功')
        print("===========成功获取第" + str(i + 1) + "次===========")
        time.sleep(1)

    # 退出爬虫
    browser.quit()

    # 打印信息显示是否成功
    print(f'==========爬取完成。退出爬虫==========')


def write_to_csv(filename, bvid, aid, cid, mid, name, follower, archive, title, tname, pub_date, pub_time, desc,
                 view, like, coin, favorite, share, reply, danmaku, communication_index):
    """
    向csv文件中写入B站视频相关的基本信息,未按路径找到文件,则新建文件
    :param filename: 写入数据的文件名
    :param bvid: BV号
    :param aid: AV号
    :param cid: 用于获取弹幕文本的
    :param mid: UP主的ID
    :param name: UP主名称
    :param follower: UP主粉丝数
    :param archive: UP主作品总数
    :param title: 标题
    :param tname: tag名称
    :param pub_date: 发布日期
    :param pub_time: 发布时间
    :param desc: 视频简介
    :param view: 播放量
    :param like: 点赞数
    :param coin: 投币数
    :param favorite: 收藏数
    :param share: 分享数
    :param reply: 评论数
    :param danmaku: 弹幕数
    :param communication_index: 传播效果公式的值
    :return:
    """
    file_exists = os.path.isfile(filename)
    max_retries = 50
    retries = 0

    while retries < max_retries:
        try:
            with open(filename, mode='a', encoding='utf-8', newline='') as csvfile:
                fieldnames = ['BV号', 'AV号', 'CID', 'UP主ID', 'UP主名称', 'UP主粉丝数', '作品总数', '视频标题',
                              '视频分类标签',
                              '发布日期', '发布时间', '视频简介', '播放量', '点赞数', '投币数', '收藏数', '分享数',
                              '评论数',
                              '弹幕数', '传播效果指数']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

                if not file_exists:
                    writer.writeheader()

                writer.writerow({
                    'BV号': bvid, 'AV号': aid, 'CID': cid, 'UP主ID': mid, 'UP主名称': name, 'UP主粉丝数': follower,
                    '作品总数': archive, '视频标题': title, '视频分类标签': tname, '发布日期': pub_date,
                    '发布时间': pub_time,
                    '视频简介': desc, '播放量': view, '点赞数': like, '投币数': coin, '收藏数': favorite,
                    '分享数': share,
                    '评论数': reply, '弹幕数': danmaku, '传播效果指数': communication_index
                })
            break  # 如果成功写入,跳出循环
        except PermissionError as e:
            retries += 1
            print(f"将爬取到的数据写入csv时,遇到权限错误Permission denied,文件可能被占用或无写入权限: {e}")
            print(f"等待3s后重试,将会重试50次... (尝试 {retries}/{max_retries})")
    else:
        print("将爬取到的数据写入csv时遇到权限错误,且已达到最大重试次数50次,退出程序")


def get_user_info(uid):
    """
    通过uid(即mid)获取UP主的粉丝总数和作品总数
    :param uid: mid
    :return:user_info_dict
    """
    # 定义空字典用于存放数据
    # 粉丝数 follower
    # 作品总数 archive
    user_info_dict = {}
    # 首先写入请求头
    # 设置用户代理 User_Agent及Cookies
    headers = {
        'User-Agent': "",
        'Cookie': ""}

    # 将传入的的uid组成up主主页的api_url
    # A Example: https://api.bilibili.com/x/web-interface/card?mid=1177893348
    api_url = f'https://api.bilibili.com/x/web-interface/card?mid={uid}'
    # https://api.bilibili.com/x/web-interface/view?BV1n24y1D75V
    # 打印次数,数据量大,便于查看进程
    print(f"正在进行爬取uid为:{uid}的UP主的粉丝数量与作品总数")

    # 打印本次要获取的uid,用于错误时确认
    print(f"==========本次获取数据的up主的uid为:{uid}==========")
    print(f"url为{api_url}")

    # 利用requests进行访问,并返回需要的封装信息
    up_info = requests.get(url=api_url, headers=headers)

    # 不知道会不会被封ip,保险起见
    # time.sleep(2)

    # 将数据转化为json格式
    up_info_json = json.loads(up_info.text)

    # 利用json定位相关数据
    fans_number = up_info_json['data']['card']['fans']
    user_info_dict['follower'] = fans_number
    archive_count = up_info_json['data']['archive_count']
    user_info_dict['archive'] = archive_count

    print(f'==========uid为 {uid} 的UP主基本信息已成功获取==========\n')

    # 等待
    print('正在等待,以防访问过于频繁\n')
    time.sleep(1.5)
    return user_info_dict


def get_video_info(bv_id):
    headers = {
        'User-Agent': "",
        'Cookie': ""}

    api_url = f'https://api.bilibili.com/x/web-interface/view?bvid={bv_id}'
    # 打印本次要获取的bvid,用于错误时确认
    print(f"正在进行爬取uid为:{bv_id}的UP主的粉丝数量与作品总数")
    print(f"==========本次获取数据的视频BV号为:{bv_id}==========")
    print(f"url为:{api_url}")
    # https://api.bilibili.com/x/web-interface/view?BV1n24y1D75V
    video_info = requests.get(url=api_url, headers=headers)
    video_info_json = json.loads(video_info.text)
    # 创建存放的字典
    info_dict = {}
    # 信息解读
    # https://zhuanlan.zhihu.com/p/618885790
    # 视频bvid,即bv号
    bvid = video_info_json['data']['bvid']
    info_dict['bvid'] = bvid
    # 视频aid,即av号
    aid = video_info_json['data']['aid']
    info_dict['aid'] = aid
    # 视频cid,用于获取弹幕信息
    cid = video_info_json['data']['cid']
    info_dict['cid'] = cid
    # 作者id
    mid = video_info_json['data']['owner']['mid']
    info_dict['mid'] = mid
    # up主昵称
    name = video_info_json['data']['owner']['name']
    info_dict['name'] = name
    # 视频标题
    title = video_info_json['data']['title']
    info_dict['title'] = title
    # 视频标签
    tname = video_info_json['data']['tname']
    info_dict['tname'] = tname
    # 视频发布时间戳
    pubdate = video_info_json['data']['pubdate']
    # 转化时间戳
    pub_datatime = datetime.fromtimestamp(pubdate)
    # 整体格式
    pub_datatime_strf = pub_datatime.strftime('%Y-%m-%d %H:%M:%S')
    # 日期
    date = re.search(r"(\d{4}-\d{1,2}-\d{1,2})", pub_datatime_strf)
    info_dict['pub_date'] = date.group()
    # 时间
    pub_time = re.search(r"(\d{1,2}:\d{1,2}:\d{1,2})", pub_datatime_strf)
    info_dict['pub_time'] = pub_time.group()
    # 视频创建时间戳
    # ctime = info['ctime']
    # 视频简介
    desc = video_info_json['data']['desc']
    info_dict['desc'] = desc
    # 视频播放量
    view = video_info_json['data']['stat']['view']
    info_dict['view'] = view
    # 点赞数
    like = video_info_json['data']['stat']['like']
    info_dict['like'] = like
    # 投币数
    coin = video_info_json['data']['stat']['coin']
    info_dict['coin'] = coin
    # 收藏数
    favorite = video_info_json['data']['stat']['favorite']
    info_dict['favorite'] = favorite
    # 分享数
    share = video_info_json['data']['stat']['share']
    info_dict['share'] = share
    # 评论数
    reply = video_info_json['data']['stat']['reply']
    info_dict['reply'] = reply
    # 视频弹幕数量
    danmaku = video_info_json['data']['stat']['danmaku']
    info_dict['danmaku'] = danmaku

    print(f'=========={bv_id} 的视频基本信息已成功获取==========')

    # 发布作品时的动态
    # dynamic = info['dynamic']
    print('正在等待,以防访问过于频繁\n')
    time.sleep(1.5)

    return info_dict


if __name__ == '__main__':

    # 针对不同内容修改搜索关键词!!!!
    keywords = ["1", "2"]
    for keyword in keywords:
        # 自动爬取多个主题时须注意上面的最大页数定位问题
        # 爬取后生成去重了的len(keywords)个f'{keyword}BV号.csv'文件
        spider_bvid(keyword)
    for keyword in keywords:
        # 拼接成文件名
        csv_to_merge = f'{keyword}BV号.csv'
        # 合并后生成未去重的文件
        merge_csv(input_filename=csv_to_merge, output_filename='BV号合并.csv')
    

    # 遍历读取bv_id
    filename = 'BV号合并.csv'
    # 打开文件并去重
    open_csv = pd.read_csv(filename)
    open_csv = open_csv.drop_duplicates(subset='BV号')
    bv_id_list = np.array(open_csv['BV号'])

    """
    # 第一次调用,若读取csv进行爬取时,意外中断
    # 则更改为读取txt文本,将已爬取的bvid删除,以达到断点续爬的目的
    for bvid in bv_id_list:
        with open("bv_id_list.txt", 'a') as f:
            f.write(bvid + '\n')
    with open("bv_id_list.txt", 'r') as f:
        bv_id_list = [line.strip() for line in f.readlines()]
    """

    
    # 循环写入内容
    for i in range(0, len(bv_id_list)):
        bv_id = bv_id_list[i]
        print(f'正在进行第{i+1}次爬取\n')
        # 获取视频所有的基本信息
        video_info = get_video_info(bv_id)
        bvid = video_info['bvid']
        aid = video_info['aid']
        cid = video_info['cid']
        mid = video_info['mid']
        name = video_info['name']
        title = video_info['title']
        tname = video_info['tname']
        pub_date = video_info['pub_date']
        pub_time = video_info['pub_time']
        desc = video_info['desc']
        view = video_info['view']
        like = video_info['like']
        coin = video_info['coin']
        favorite = video_info['favorite']
        share = video_info['share']
        reply = video_info['reply']
        danmaku = video_info['danmaku']

        # 传播效果计算公式
        Communication_Index = math.log(
            0.5 * int(view) + 0.3 * (int(like) + int(coin) + int(favorite)) + 0.2 * (int(reply) + int(danmaku)))
        # 获取作者的相关信息
        user_info = get_user_info(uid=mid)
        follower = user_info['follower']
        archive = user_info['archive']
        write_to_csv(filename='视频基本信息.csv', bvid=bvid, aid=aid, cid=cid, mid=mid, name=name, follower=follower,
                     archive=archive, title=title, tname=tname, pub_date=pub_date, pub_time=pub_time, desc=desc,
                     view=view, like=like, coin=coin, favorite=favorite, share=share, reply=reply, danmaku=danmaku,
                     communication_index=Communication_Index)
        print(f'==========第{i+1}个BV号:{bv_id}的相关数据已写入csv文件中==========')
        print('==================================================\n')

Original post: blog.csdn.net/Smile_to_destiny/article/details/132642215