Using "Zhui Xu" as a practical case: how to crawl iQiyi video bullet comments (danmaku) with Python

1. Introduction


iQiyi's exclusive drama "Zhui Xu" has been very popular recently, and I have been following it myself. With the tools at hand, I wanted to crawl its bullet comments to see how the show is doing and what viewers are saying about it.

So that beginners can learn the whole technique of crawling iQiyi bullet comments with Python, this article walks through the crawl in detail; the data analysis follows in a later article.

2. Analyze the data packet

1. Find the packet

Press F12 in the browser to open the developer tools, switch to the Network tab, play the video, and look for a request URL of this type:


https://cmts.iqiyi.com/bullet/54/00/7973227714515400_60_2_5f3b2e24.br

2. Analyze the bullet comment links

The useful part of this URL is 54/00/7973227714515400.

iQiyi's bullet comment address has the following form:

https://cmts.iqiyi.com/bullet/{parameter 1}_300_{parameter 2}.z

Parameter 1 is: 54/00/7973227714515400
Parameter 2 is: the segment number 1, 2, 3, ...

iQiyi loads a new batch of bullet comments every 5 minutes. Each episode runs about 46 minutes, and 46 divided by 5, rounded up, is 10, so parameter 2 runs from 1 to 10.

The bullet comment links are therefore:


https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_1.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_2.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_3.z
......
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_10.z
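
The list above can also be generated rather than typed out. The following is a minimal sketch of that arithmetic; the names BASE, EPISODE_MINUTES, SEGMENT_MINUTES and bullet_urls are made up for illustration, and the episode length is the approximate 46 minutes mentioned above.

```python
import math

# Parameter 1 from the captured request, appended to the host (values taken from the article).
BASE = "https://cmts.iqiyi.com/bullet/54/00/7973227714515400"
EPISODE_MINUTES = 46   # approximate episode length
SEGMENT_MINUTES = 5    # a new batch of bullet comments is loaded every 5 minutes


def bullet_urls(base=BASE, minutes=EPISODE_MINUTES, step=SEGMENT_MINUTES):
    """Build one link per 5-minute segment: ceil(46 / 5) = 10 links."""
    segments = math.ceil(minutes / step)
    return [f"{base}_300_{i}.z" for i in range(1, segments + 1)]


urls = bullet_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

For an episode of a different length, only the minutes argument needs to change.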

3. Decode the binary data packet

The file downloaded from each bullet comment link has the suffix .z and contains compressed binary data, so it has to be decompressed before it can be read.


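import zlib

def zipdecode(bulletold):
    """Decompress the zlib-compressed binary content and decode it to text."""
    # wbits = 15 + 32 lets zlib detect a zlib or gzip header automatically
    decode = zlib.decompress(bytearray(bulletold), 15 + 32).decode('utf-8')
    return decode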
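
The decoder can be checked without touching the network by compressing a small sample locally and feeding it back in. This is only a sketch: the sample string is invented, and the real files may differ in structure.

```python
import gzip
import zlib


def zipdecode(bulletold):
    """Same decoder as above: decompress, then decode as UTF-8."""
    return zlib.decompress(bytearray(bulletold), 15 + 32).decode("utf-8")


# Invented sample text standing in for a downloaded segment.
sample = "<danmu><entry><content>弹幕</content></entry></danmu>"

# With wbits = 15 + 32, both zlib-wrapped and gzip-wrapped data are accepted.
from_zlib = zipdecode(zlib.compress(sample.encode("utf-8")))
from_gzip = zipdecode(gzip.compress(sample.encode("utf-8")))

print(from_zlib == sample, from_gzip == sample)
```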

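After decoding, save the data in xml format:


# Write each decoded segment to its own xml file (a plain-text file, much like txt) so the data can be read back later.
# x is the segment number and xml is the decoded text of that segment.
# 'w' is used instead of 'a+' so that a rerun does not append a second document to the same file.
with open('./lyc/zx' + str(x) + '.xml', 'w', encoding='utf-8') as f:
    f.write(xml)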
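
The article does not show the download step itself, so here is one way the three pieces (build the link, decode, save) might fit together. It is a sketch under assumptions: fetch_bytes and save_segments are hypothetical helpers, the endpoint may have changed or may require extra request headers, and the fetch function is passed in as a parameter so the logic can be exercised without network access.

```python
import os
import urllib.request
import zlib


def fetch_bytes(url):
    """Download the raw bytes behind url (assumes a plain GET is sufficient)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def save_segments(base, out_dir, segments=range(1, 11), fetch=fetch_bytes):
    """Download, decompress and save each segment as out_dir/zx<N>.xml."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for x in segments:
        raw = fetch(f"{base}_300_{x}.z")
        text = zlib.decompress(raw, 15 + 32).decode("utf-8")
        path = os.path.join(out_dir, f"zx{x}.xml")
        # 'w' mode: a rerun overwrites the file instead of appending to it
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths
```

A call such as save_segments("https://cmts.iqiyi.com/bullet/54/00/7973227714515400", "./lyc") would then produce zx1.xml through zx10.xml, matching the file names used in the rest of the article.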

 

3. Parse the xml

1. Extract data

 

Looking at the xml file, the fields we need to extract are: 1. the user id (uid), 2. the comment text (content), and 3. the number of likes on the comment (likeCount).


# Read the bullet comment data from an xml file
import xml.dom.minidom

def xml_parse(file_name):
    DOMTree = xml.dom.minidom.parse(file_name)
    collection = DOMTree.documentElement
    # Get every entry element in the document
    entrys = collection.getElementsByTagName("entry")
    result = []
    for entry in entrys:
        uid = entry.getElementsByTagName('uid')[0]
        content = entry.getElementsByTagName('content')[0]
        likeCount = entry.getElementsByTagName('likeCount')[0]
        print(uid.childNodes[0].data)
        print(content.childNodes[0].data)
        print(likeCount.childNodes[0].data)
        # Collect the three fields so the caller can use them
        result.append((uid.childNodes[0].data, content.childNodes[0].data, likeCount.childNodes[0].data))
    return result
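
To try the extraction without a downloaded file, the same minidom calls can be run on an in-memory string. The sample below is hypothetical: it only assumes entry elements that contain uid, content and likeCount, as the function above does, and the real files may nest these fields differently. The helper node_text is also an addition; it returns an empty string where indexing childNodes[0] would fail on an empty element.

```python
from xml.dom.minidom import parseString

# Hypothetical sample; the root element name and the values are invented.
SAMPLE_XML = (
    "<danmu>"
    "<entry><uid>1001</uid><content>Great episode</content><likeCount>12</likeCount></entry>"
    "<entry><uid>1002</uid><content></content><likeCount>0</likeCount></entry>"
    "</danmu>"
)


def node_text(parent, tag):
    """Return the text of the first <tag> under parent, or '' if it is missing or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or not nodes[0].childNodes:
        return ""
    return nodes[0].childNodes[0].data


def parse_entries(xml_text):
    """Return a list of (uid, content, likeCount) tuples, one per entry element."""
    root = parseString(xml_text).documentElement
    return [
        (node_text(e, "uid"), node_text(e, "content"), node_text(e, "likeCount"))
        for e in root.getElementsByTagName("entry")
    ]


rows = parse_entries(SAMPLE_XML)
for row in rows:
    print(row)
```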

4. Save the data

1. Work before saving


import xlwt
# Create a workbook and set its encoding
workbook = xlwt.Workbook(encoding = 'utf-8')
# Create a worksheet
worksheet = workbook.add_sheet('sheet1')

# Write the header row to excel
# The arguments are row, column, value
worksheet.write(0,0, label='uid')
worksheet.write(0,1, label='content')
worksheet.write(0,2, label='likeCount')

Import the xlwt library (which writes Excel .xls files, not csv) and define the header row (uid, content, likeCount).

2. Write data


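# xml_parse from section 3, extended to write each entry into the worksheet
count = 1  # next free row; row 0 holds the header

def xml_parse(file_name):
    global count
    DOMTree = xml.dom.minidom.parse(file_name)
    entrys = DOMTree.documentElement.getElementsByTagName("entry")
    for entry in entrys:
        uid = entry.getElementsByTagName('uid')[0]
        content = entry.getElementsByTagName('content')[0]
        likeCount = entry.getElementsByTagName('likeCount')[0]
        print(uid.childNodes[0].data)
        print(content.childNodes[0].data)
        print(likeCount.childNodes[0].data)
        # Write to excel; the arguments are row, column, value
        worksheet.write(count, 0, label=str(uid.childNodes[0].data))
        worksheet.write(count, 1, label=str(content.childNodes[0].data))
        worksheet.write(count, 2, label=str(likeCount.childNodes[0].data))
        count = count + 1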

Finally, parse all ten xml files and save the workbook as 弹幕数据集.xls:


for x in range(1, 11):
    xml_parse("./lyc/zx" + str(x) + ".xml")

# Save the workbook
workbook.save('弹幕数据集.xls')
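
If xlwt is not installed, the same three columns can be written with the csv module from the standard library. This is an alternative to the article's method, not a description of it; write_csv and the sample rows are hypothetical.

```python
import csv


def write_csv(rows, path):
    """Write (uid, content, likeCount) rows to path with a header row."""
    # utf-8-sig adds a BOM so that Excel detects the encoding of Chinese text
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["uid", "content", "likeCount"])
        writer.writerows(rows)


# Invented rows standing in for parsed bullet comments.
sample_rows = [("1001", "Great episode", "12"), ("1002", "弹幕", "0")]
```

Calling write_csv(sample_rows, "danmu.csv") produces a file that opens in Excel or loads directly into pandas for the later analysis.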

5. Summary

1. Using the drama "Zhui Xu" as a real case, we crawled iQiyi bullet comments with Python step by step.
2. We parsed data in xml format with Python.
3. We wrote the data to an Excel file.
Related materials for this article:
https://github.com/bigtigeryo/iqiyidanmu

Origin blog.csdn.net/pythonxuexi123/article/details/114457021