"Zuo Son" is very popular recently? Python crawling video barrage

Preface

Recently, iQiyi's exclusive drama "Zuo Son" has been very popular, and I have been following it too. Using the tools at hand, I want to crawl its barrage (the bullet comments that scroll across the video) to see how the show is doing and what viewers are saying.

So that beginners can thoroughly learn how to crawl iQiyi barrage with Python, this article walks through the crawl in detail; analyzing the data comes after that.


Analyze the packet

1. Find the packet

Press F12 in the browser to open the developer tools, then look through the network requests for a url of this type:

https://cmts.iqiyi.com/bullet/54/00/7973227714515400_60_2_5f3b2e24.br

 

2. Analyze barrage links

In that url, the part we need is /54/00/7973227714515400.

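iQiyi's barrage address has the following form:

https://cmts.iqiyi.com/bullet/parameter1_300_parameter2.z

Parameter 1 is: 54/00/7973227714515400

Parameter 2 is: the segment number 1, 2, 3, ...

iQiyi loads a new batch of barrage every 5 minutes. Each episode is about 46 minutes long, and 46 divided by 5, rounded up, is 10, so each episode has 10 segments.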

So the link to the barrage is as follows:

https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_1.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_2.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_3.z
......
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_10.z
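
The segment arithmetic and the url pattern above can be sketched in a few lines. This is only an illustration, not the author's code: `build_barrage_urls` and its parameter names are hypothetical, and the fixed 300 is simply copied from the urls listed above.

```python
import math

BASE = "https://cmts.iqiyi.com/bullet/"

def build_barrage_urls(path_id, episode_minutes, segment_minutes=5):
    """Return one barrage url per segment of the episode (hypothetical helper)."""
    # Round up so the last, partial segment is still requested
    segments = math.ceil(episode_minutes / segment_minutes)
    return [f"{BASE}{path_id}_300_{n}.z" for n in range(1, segments + 1)]

urls = build_barrage_urls("54/00/7973227714515400", 46)
print(len(urls))
print(urls[0])
print(urls[-1])
```

For another episode, only the path id and the episode length would need to change.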

 

3. Decode the binary data packet

The barrage is downloaded as files with a .z suffix. They hold zlib-compressed binary content, so they must be decoded before use.

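import zlib

def zipdecode(bulletold):
    'Decode zlib-compressed binary content into text'
    # wbits = 15 + 32 makes zlib auto-detect a zlib or gzip header
    decode = zlib.decompress(bytearray(bulletold), 15 + 32).decode('utf-8')
    return decode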
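
The decoder can be checked without touching the network by compressing a small string and decoding it again. The sketch below does that. The sample xml is made up for the test, and `fetch_segment` is a hypothetical download helper that is defined but not called, since the live service may require extra request headers or may have changed.

```python
import urllib.request
import zlib

def fetch_segment(url):
    """Download one .z segment and return its raw bytes (assumption: a plain GET works)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def zipdecode(bulletold):
    """Decompress zlib- or gzip-wrapped bytes and decode them as UTF-8 text."""
    return zlib.decompress(bytes(bulletold), 15 + 32).decode("utf-8")

# Round trip on made-up data: compress, then decode with the function above
sample_xml = "<danmu><entry><content>好看</content></entry></danmu>"
compressed = zlib.compress(sample_xml.encode("utf-8"))
decoded = zipdecode(compressed)
print(decoded == sample_xml)
```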

After decoding, save each segment as an xml file:

# Write each decoded segment to its own xml file (similar to a txt file) so the data is easy to read back later
# x is the segment number (1 to 10) and xml is the text returned by zipdecode
# Mode 'w' rather than 'a+', so re-running does not append a second document to the same file
with open('./lyc/zx' + str(x) + '.xml', 'w', encoding='utf-8') as f:
    f.write(xml)


Parse xml

1. Extract data


Looking at the xml file, we need to extract three fields from each entry: 1. the user id (uid), 2. the comment text (content), 3. the comment's like count (likeCount).
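
To make the three fields concrete, here is a small sketch that parses them from an inline string. The sample document is hypothetical: it only mirrors the tag names used in this article (entry, uid, content, likeCount), and real files contain more fields and nesting. `first_text` and `parse_entries` are illustrative helper names.

```python
import xml.dom.minidom

# Made-up sample that only mirrors the tag names used in the article
SAMPLE = """<danmu>
  <entry><uid>1001</uid><content>太好看了</content><likeCount>12</likeCount></entry>
  <entry><uid>1002</uid><content>哈哈哈</content><likeCount>0</likeCount></entry>
</danmu>"""

def first_text(entry, tag):
    """Return the text of the first <tag> under entry, or '' if missing or empty."""
    nodes = entry.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data

def parse_entries(xml_text):
    """Return a list of (uid, content, likeCount) tuples from an xml string."""
    dom = xml.dom.minidom.parseString(xml_text)
    rows = []
    for entry in dom.getElementsByTagName("entry"):
        rows.append((first_text(entry, "uid"),
                     first_text(entry, "content"),
                     int(first_text(entry, "likeCount") or 0)))
    return rows

rows = parse_entries(SAMPLE)
print(rows)
```

Guarding against a missing or empty tag avoids an IndexError on entries that lack one of the fields.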

# Read the barrage data from an xml file
import xml.dom.minidom

def xml_parse(file_name):
    DOMTree = xml.dom.minidom.parse(file_name)
    collection = DOMTree.documentElement
    # Get every entry element in the document
    entrys = collection.getElementsByTagName("entry")
    result = []
    for entry in entrys:
        uid = entry.getElementsByTagName('uid')[0]
        content = entry.getElementsByTagName('content')[0]
        likeCount = entry.getElementsByTagName('likeCount')[0]
        # Collect the text of each field so the caller can use it
        result.append((uid.childNodes[0].data,
                       content.childNodes[0].data,
                       likeCount.childNodes[0].data))
    return result


Save data

1. Work before saving

import xlwt
# Create a workbook and set its encoding
workbook = xlwt.Workbook(encoding = 'utf-8')
# Create a worksheet
worksheet = workbook.add_sheet('sheet1')


# Write the header row to excel
# Arguments are: row, column, value
worksheet.write(0,0, label='uid')
worksheet.write(0,1, label='content')
worksheet.write(0,2, label='likeCount')

 

This imports the xlwt library (which writes Excel .xls files, not csv) and defines the header row (uid, content, likeCount).

 

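2. Write data

Extend xml_parse so that it writes each entry to the worksheet instead of printing it. A module-level row counter keeps the rows from all files on one sheet:

count = 1  # next free row; row 0 holds the header

def xml_parse(file_name):
    global count
    DOMTree = xml.dom.minidom.parse(file_name)
    entrys = DOMTree.documentElement.getElementsByTagName("entry")
    for entry in entrys:
        uid = entry.getElementsByTagName('uid')[0]
        content = entry.getElementsByTagName('content')[0]
        likeCount = entry.getElementsByTagName('likeCount')[0]
        # Write to excel; arguments are: row, column, value
        worksheet.write(count, 0, label=str(uid.childNodes[0].data))
        worksheet.write(count, 1, label=str(content.childNodes[0].data))
        worksheet.write(count, 2, label=str(likeCount.childNodes[0].data))
        count = count + 1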

Finally, run xml_parse over the ten xml files and save the workbook as 弹幕数据集-李运辰.xls (barrage data set - Li Yunchen):

for x in range(1,11):
    xml_parse("./lyc/zx" + str(x) + ".xml")


# Save
workbook.save('弹幕数据集-李运辰.xls')
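
If xlwt is not installed, the same three columns could be written with the standard csv module instead. This is a swapped-in alternative, not the method used in this article. `save_rows_csv`, the sample rows and the output file name are all hypothetical.

```python
import csv
import os
import tempfile

def save_rows_csv(rows, path):
    """Write a header plus one line per (uid, content, likeCount) row."""
    # utf-8-sig adds a BOM, which helps Excel detect the encoding of Chinese text
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["uid", "content", "likeCount"])
        writer.writerows(rows)

# Made-up rows in the same shape as the parsed barrage entries
rows = [("1001", "太好看了", 12), ("1002", "哈哈哈", 0)]
out_path = os.path.join(tempfile.mkdtemp(), "barrage.csv")
save_rows_csv(rows, out_path)

# Read the file back to confirm what was written
with open(out_path, newline="", encoding="utf-8-sig") as f:
    saved = list(csv.reader(f))
print(saved)
```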

Origin blog.csdn.net/m0_48405781/article/details/114704473