Preface
iQiyi's exclusive drama "Zui Son" has been very popular recently, and the author has been following it closely. So why not put some technology to work: crawl the show's bullet-screen (danmu) comments and analyze how the show is doing and what viewers are saying!
To help beginners thoroughly learn how to crawl iQiyi danmu with Python, this article first walks through the crawling process in detail, then analyzes the data.
Analyze the packet
1. Find the packet
Press F12 in the browser to open the developer tools, play an episode, and watch the Network panel for URLs of this type:
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_60_2_5f3b2e24.br
2. Analyze the danmu links
The part that matters here is /54/00/7973227714515400.
iQiyi's danmu endpoint follows this pattern:
https://cmts.iqiyi.com/bullet/parameter1_300_parameter2.z
Parameter 1 is the video path, e.g. /54/00/7973227714515400.
Parameter 2 is the segment number: 1, 2, 3, ...
iQiyi loads a new batch of danmu every 5 minutes (300 seconds, hence the _300_ in the URL). Each episode runs about 46 minutes, and 46 divided by 5, rounded up, is 10 segments.
So the danmu links for one episode are:
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_1.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_2.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_3.z
......
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_10.z
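Putting the two parameters together, the segment URLs above can be generated with a short helper (a sketch; the path and segment count are the values derived for this particular episode):

```python
# Build the danmu segment URLs for one episode.
# path and segments are the values worked out above.
def bullet_urls(path, segments):
    base = 'https://cmts.iqiyi.com/bullet'
    return [f'{base}{path}_300_{i}.z' for i in range(1, segments + 1)]

urls = bullet_urls('/54/00/7973227714515400', 10)
print(urls[0])   # first 5-minute segment
print(urls[-1])  # last segment
```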
3. Decode the binary data packet
Each downloaded danmu package is a .z-suffixed zlib-compressed file, so it has to be decoded first:
import zlib

def zipdecode(bulletold):
    '''Decode zlib-compressed binary content into text'''
    decode = zlib.decompress(bytearray(bulletold), 15 + 32).decode('utf-8')
    return decode
After decoding, save the data as an XML file:
# Write the decoded content into its own .xml file (essentially a text file),
# so the data is easy to read back later
with open('./lyc/zx' + str(x) + '.xml', 'a+', encoding='utf-8') as f:
    f.write(xml)
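The download step itself is not shown above; a minimal sketch of fetching and decoding one segment might look like the following (using the standard-library urllib; the User-Agent header is an assumption, since sites often reject requests without one):

```python
import zlib
from urllib.request import Request, urlopen

def zipdecode(bulletold):
    """Decode zlib-compressed binary content into text."""
    return zlib.decompress(bytearray(bulletold), 15 + 32).decode('utf-8')

def fetch_segment(url):
    """Download one .z danmu segment and return its decoded XML text."""
    # User-Agent is an assumed header, not something the original article sets
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req) as resp:
        return zipdecode(resp.read())
```

The wbits value 15 + 32 tells zlib to auto-detect a zlib or gzip header, which is why the same decoder works on the downloaded bytes.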
Parse xml
1. Extract data
Looking at the xml file, we need to extract three fields: 1. the user id (uid), 2. the comment text (content), and 3. the comment's like count (likeCount).
# Read the danmu data from the xml file
import xml.dom.minidom

def xml_parse(file_name):
    DOMTree = xml.dom.minidom.parse(file_name)
    collection = DOMTree.documentElement
    # Get all <entry> nodes in the document
    entrys = collection.getElementsByTagName("entry")
    result = []
    for entry in entrys:
        uid = entry.getElementsByTagName('uid')[0]
        content = entry.getElementsByTagName('content')[0]
        likeCount = entry.getElementsByTagName('likeCount')[0]
        result.append((uid.childNodes[0].data,
                       content.childNodes[0].data,
                       likeCount.childNodes[0].data))
    return result
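The extraction logic can be exercised without a real download on a small hand-written sample (the XML below is made up, modeled on the three fields named above):

```python
import xml.dom.minidom

# A hypothetical two-entry sample in the same shape as the danmu XML
sample = (
    '<danmu>'
    '<entry><uid>1001</uid><content>great scene</content><likeCount>5</likeCount></entry>'
    '<entry><uid>1002</uid><content>666</content><likeCount>2</likeCount></entry>'
    '</danmu>'
)

dom = xml.dom.minidom.parseString(sample)
rows = []
for entry in dom.documentElement.getElementsByTagName('entry'):
    rows.append((
        entry.getElementsByTagName('uid')[0].childNodes[0].data,
        entry.getElementsByTagName('content')[0].childNodes[0].data,
        entry.getElementsByTagName('likeCount')[0].childNodes[0].data,
    ))
```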
Save data
1. Prepare the workbook
import xlwt

# Create a workbook and set the encoding
workbook = xlwt.Workbook(encoding='utf-8')
# Create a worksheet
workbook_sheet = workbook.add_sheet('sheet1')
worksheet = workbook_sheet
# Write the header row to Excel
# Arguments are: row, column, value
worksheet.write(0, 0, label='uid')
worksheet.write(0, 1, label='content')
worksheet.write(0, 2, label='likeCount')
Import the xlwt library (which writes Excel .xls files) and write the header row (uid, content, likeCount).
2. Write data
count = 1  # row 0 holds the header, so data starts at row 1
for entry in entrys:
    uid = entry.getElementsByTagName('uid')[0]
    content = entry.getElementsByTagName('content')[0]
    likeCount = entry.getElementsByTagName('likeCount')[0]
    # Write one row to Excel
    # Arguments are: row, column, value
    worksheet.write(count, 0, label=str(uid.childNodes[0].data))
    worksheet.write(count, 1, label=str(content.childNodes[0].data))
    worksheet.write(count, 2, label=str(likeCount.childNodes[0].data))
    count = count + 1
Finally, the result is saved as 弹幕数据集-李运辰.xls ("danmu dataset - Li Yunchen"):
for x in range(1, 11):
    l = xml_parse("./lyc/zx" + str(x) + ".xml")
# Save the workbook
workbook.save('弹幕数据集-李运辰.xls')
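With the comments collected, the analysis promised in the preface can start simply, for example by counting the most-repeated danmu texts (a sketch; the sample list is hypothetical and stands in for the content column scraped above):

```python
from collections import Counter

def top_comments(comments, n=3):
    """Return the n most frequent comment strings with their counts."""
    return Counter(comments).most_common(n)

# Hypothetical sample standing in for the scraped content column
sample = ['哈哈哈', '666', '哈哈哈', '前方高能', '666', '哈哈哈']
top = top_comments(sample)
```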