Crawling Bilibili danmaku (bullet comments)

Some time ago I crawled the danmaku (bullet comments) on Bilibili; this post records the experience.

Preparation

HTML parsing

In the browser's network panel, find the response that carries the danmaku; it holds a little over 1,000 real-time danmaku.
Analyzing the request parameters:
The only parameter is oid, the key that identifies the video.
So the next step is to crawl the oids (I wanted the danmaku of every video from one uploader, so I needed all of that uploader's oids).
Each video URL has a fixed format: https://www.bilibili.com/video/av + vid
Grab every vid from the uploader's main page, then crawl each video page and extract its oid with XPath.
At this point, all the oids have been collected.
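
The two URL formats used in this post can be sketched as small helpers. This is only an illustration: the helper names are made up, and the extractor swaps the XPath step for a regular expression that assumes the oid appears in the page source as "cid":<digits>, which the post does not confirm.

```python
import re

# URL formats taken from this post
VIDEO_URL = "https://www.bilibili.com/video/av{vid}"
DANMAKU_URL = "https://api.bilibili.com/x/v1/dm/list.so?oid={oid}"

# Assumption (not from the post): the oid shows up in the page source
# as "cid":<digits>. Adjust the pattern to whatever the page really contains.
CID_PATTERN = re.compile(r'"cid":\s*(\d+)')


def video_url(vid):
    """Build the page URL of a video from its vid."""
    return VIDEO_URL.format(vid=vid)


def danmaku_url(oid):
    """Build the danmaku endpoint URL for one oid."""
    return DANMAKU_URL.format(oid=oid)


def extract_oids(page_html):
    """Return the distinct oids found in a page, in order of appearance."""
    seen = []
    for oid in CID_PATTERN.findall(page_html):
        if oid not in seen:
            seen.append(oid)
    return seen
```

For example, danmaku_url("144896116") gives the test URL used in the next section.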

Next comes the code that fetches the danmaku and analyzes the data

First, a small test: fetch the danmaku of a single video:

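import json

import requests
from lxml import etree

# Danmaku endpoint for a single video, identified by its oid
url = "https://api.bilibili.com/x/v1/dm/list.so?oid=144896116"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"
}
# query_list={"oid":"144896116"}
response = requests.get(url, headers=headers)
html_str = response.content
html = etree.HTML(html_str)
# Each danmaku is the text of one <d> element
d_list = html.xpath('//d')
content_list = []

# Collect the text of every <d> element, skipping empty ones
a = [d.text for d in d_list if d.text]
items = {}
items["danmu"] = a
content_list.append(items)
# Write one JSON object per line
with open("blbl.txt", "w", encoding="utf-8") as f:
    for content in content_list:
        f.write(json.dumps(content, ensure_ascii=False))
        f.write("\n")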

Sure enough, the fetch succeeds.
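
The parsing step can also be tried offline. Below is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml. The sample document is invented for illustration: only the <d> element name comes from the code above, while the root element name, the p attribute value, and the comment texts are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample response; the real one is returned by the danmaku endpoint.
SAMPLE_XML = """<i>
  <d p="hypothetical-attributes">first comment</d>
  <d p="hypothetical-attributes">second comment</d>
  <d p="hypothetical-attributes"></d>
</i>"""


def parse_danmaku(xml_text):
    """Return the text of every non-empty <d> element in the document."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter("d") if d.text]


if __name__ == "__main__":
    for line in parse_danmaku(SAMPLE_XML):
        print(line)
```

Reading the text per element keeps one list entry per danmaku, which matches the list stored under the "danmu" key.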

Next, fetch the danmaku of all the videos, parse them, and store them in the database

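import io
import json
import sys

import pymysql

con = pymysql.connect(
    host='127.0.0.1',
    port=3306,
    user='root',
    password='root',
    db='test',
    charset='utf8mb4'
)
cur = con.cursor()
# Smoke test: insert one row to confirm the connection and table work
cur.execute("insert into danmu(dm) value('测试')")
con.commit()
# Re-wrap stdout so Chinese text prints on a GB18030 console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

# Substrings to strip from each danmaku. The list itself is not shown
# in this post, so it is left empty here; fill it in as needed.
dirty_stuff = []

with open("danmu.json", "r", encoding="utf-8") as f:
    danmu_str = json.load(f)

# print(danmu_str["dm"][0]["danmu"])
for video in danmu_str["dm"]:  # one entry per video
    for item in video["danmu"]:
        for stuff in dirty_stuff:
            item = item.replace(stuff, "")
        print(item)
        # Assumption: item_list is the danmaku split into characters
        item_list = list(item)
        length = len(item)
        # If the first character recurs later, keep only that character
        for l in range(1, length - 1):
            if item_list[0] == item_list[l]:
                item = item_list[0]
        # Parameterized query, so quotes in a danmaku cannot break the SQL
        cur.execute("insert into danmu(dm) value(%s)", (item,))
    con.commit()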
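
The cleaning step can be pulled out into small functions that are easy to test without a database. This is a sketch, not the post's exact logic: collapse_repeats is a stricter variant that shortens a danmaku only when every character is the same, and both function names are made up for illustration.

```python
def clean_danmaku(item, dirty_stuff=()):
    """Remove every unwanted substring from one danmaku."""
    for stuff in dirty_stuff:
        item = item.replace(stuff, "")
    return item


def collapse_repeats(item):
    """Collapse a danmaku made of one repeated character to that character.

    Stricter than the loop above: text such as "abca" is left untouched,
    because its characters are not all identical.
    """
    if item and len(set(item)) == 1:
        return item[0]
    return item


if __name__ == "__main__":
    raw = "啊啊啊啊\n"
    print(collapse_repeats(clean_danmaku(raw, ["\n"])))
```

Collapsing repeated-character spam this way makes variants of the same reaction count as one entry in the ranking.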


Analyzing the data: ranking the danmaku by frequency:


select count(*) as count,dm as danmu
from danmu
group by dm
order by count desc;

Top results:
I grabbed about 1 million danmaku from King Hanqing's videos, and the top of the ranking was not what I expected: things like "ah", "0", "f", "ah ah" ...
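
The ranking query can be tried without a MySQL server. The sketch below runs the same GROUP BY query against an in-memory SQLite database from Python's standard library; the sample comments are invented, and rank_danmaku is a made-up helper name.

```python
import sqlite3


def rank_danmaku(comments):
    """Return (count, danmaku) rows, most frequent first."""
    con = sqlite3.connect(":memory:")
    con.execute("create table danmu (dm text)")
    # Parameterized inserts, one row per comment
    con.executemany("insert into danmu(dm) values (?)", [(c,) for c in comments])
    rows = con.execute(
        "select count(*) as count, dm as danmu "
        "from danmu group by dm order by count desc"
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    sample = ["ah", "f", "ah", "ah", "0", "f"]  # invented sample data
    for count, danmu in rank_danmaku(sample):
        print(count, danmu)
```

Rows with equal counts come back in no guaranteed order; add a second sort key such as dm if a stable ranking matters.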

Origin blog.csdn.net/lzl980111/article/details/104228861