Python crawler in practice: scraping iQiyi bullet comments for The Hidden Corner (2020)

Table of Contents

Background

Crawler analysis

Naive analysis

Advanced analysis

The code

Results

Data analysis

Sentiment analysis of the Chinese bullet comments


Background

I watched The Hidden Corner a few days ago and thought the show was really good, so I wanted to see whether everyone else felt the same way. I couldn't find complete crawler code for it anywhere. What is everyone hiding? Shall we go crawling, then? Fine, I'll take the scraps of information that are out there and write one myself.

Crawler analysis

Naive analysis

Open the page and you will see that the bullet comments sit in a fixed DIV, but they are updated in real time, so a naive scrape of the rendered HTML is not going to work.

But giving up that easily is not an option. Press Ctrl+Shift+C to open the developer tools, switch to the Network tab, hit Ctrl+R to refresh, and start inspecting the requests and their responses.

Advanced analysis

Following the partial hints I had, I searched the requests for "bullet" (the English for 弹幕), and sure enough a matching request came into view. Opening a few more and comparing them showed that, conveniently, the URLs really do follow a pattern.

With that preliminary check done, copying the URL into the browser downloads a compressed package, and the package contains the bullet comments. All that remains is to download the bullet-comment packages for every episode! There are two remaining issues:

  • The first attempt, downloading the package and unzipping it directly, does not work: the .z file has to be inflated with zlib instead (see the Stack Overflow solution for details; a sketch follows this list).
  • Since we want to crawl every episode, we naturally need the id of each one. My earlier guess about which field is the series id turned out to be wrong; the real series id is the albumId, which you can find by searching for albumId directly in the developer panel.
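
To make both points concrete before the full script below, here is a minimal sketch that downloads the first bullet-comment package of a single episode and inflates it with zlib instead of treating it as a zip archive (the tvid value here is only a placeholder; real ids come from the albumId API):

import zlib
import requests

tv_id = '1234567890'  # placeholder tvid; see get_TV_Id below for how real ones are obtained
url = 'https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z'.format(tv_id[-4:-2], tv_id[-2:], tv_id, 1)
res = requests.get(url)
if res.status_code == 200:
    # the .z file is a raw zlib stream, so decompress it directly
    xml = zlib.decompress(bytearray(res.content)).decode('utf-8')
    print(xml[:200])  # a peek at the XML containing the bullet comments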

Paste the URL built from the albumId into a new browser window to check (a quick gut check on my part); getting JSON data back basically means victory.

Everything is in place except the east wind, as the saying goes. In a word: do it!

The code

#!/usr/bin/env python

# -*- encoding: utf-8 -*-

'''
@Author  :   {Jack Zhao}

@Time    :   2020/6/29 10:39

@Contact :   {[email protected]}

@Desc    :   Reproduce the bullet-comment crawl for The Hidden Corner
'''
import json
import zlib

import pandas as pd
import requests
from bs4 import BeautifulSoup
from threading import Thread

def get_IQ_data(tv_index,tv_id):
	'''From the analysis above, knowing the tvid is enough to download the corresponding bullet-comment packages'''
	url = 'https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z'
	datas = pd.DataFrame(columns=['uid','contentsId','contents','likeCount'])
	for index in range(1,20):
		# Each episode is split across several consecutively numbered packages; ideally watch the whole episode to count them, but I lazily guessed there are fewer than 20
		# Later analysis showed the player requests a new bullet-comment file every 5 minutes (300 s), so the file count per episode is the video length divided by 300, rounded up; the simple loop below is good enough
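		# For example, a 45-minute episode needs math.ceil(45 * 60 / 300) = 9 packages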
		myUrl = url.format(tv_id[-4:-2],tv_id[-2:],tv_id,index)
		# print(myUrl)
		res = requests.get(myUrl)
		if res.status_code == 200:
			btArr = bytearray(res.content) # the response body is zlib-compressed and has to be decompressed
			xml = zlib.decompress(btArr).decode('utf-8')
			bs = BeautifulSoup(xml,"xml") # parse the XML document (requires the lxml parser)
			data = pd.DataFrame(columns=['uid','contentsId','contents','likeCount'])
			data['uid'] = [i.text for i in bs.findAll('uid')]  # user id
			data['contentsId'] = [i.text for i in bs.findAll('contentId')] # bullet-comment id
			data['contents'] = [i.text for i in bs.findAll('content')] # comment text
			data['likeCount'] = [i.text for i in bs.findAll('likeCount')] # number of likes the comment received
		else:
			break
		datas = pd.concat([datas,data],ignore_index = True) # gather every package of the episode into one DataFrame; without ignore_index the indices would repeat
	datas['tv_name'] = str(tv_index) # extra column identifying the episode
	return datas


def get_TV_Id(aid):
	'''The tvid of each episode is generated from the albumid; search the panel for "album" and open the info request to see it'''
	tv_id_list = []
	for page in range(1,2):
		url = 'https://pcw-api.iqiyi.com/albums/album/avlistinfo?aid=' \
              + aid + '&page='\
              + str(page) + '&size=30'
		res = requests.get(url).text
		res_json = json.loads(res)
		# episode list
		movie_list = res_json['data']['epsodelist']
		for j in movie_list:
			tv_id_list.append(j['tvId'])
		return tv_id_list

if __name__ == '__main__':
	# album id
	my_aid = '252449101'
	my_tv_id_list = get_TV_Id(my_aid) # get the tvid of every episode of the series
	print("Below is the TV_ID list:")
	print(my_tv_id_list)
	data_all = pd.DataFrame(columns=['uid', 'contentsId', 'contents', 'likeCount','tv_name']) # likewise, a master table for the combined output
	for index,i in enumerate(my_tv_id_list):
		#data = get_data('The Hidden Corner episode '+str(index),str(i))
		# next step: this loop could be converted to multithreading with Thread (see the Results section)
		data = pd.DataFrame(get_IQ_data(index, str(i)))  # note: index is 0-based, so tv_name lags the real episode number by one (corrected later in the analysis)
		data.to_csv('./'+str(index+1)+'.csv')
		data_all = pd.concat([data_all, data], ignore_index=True)  # append to the master table
	print("The bullet comments of all 12 episodes have been saved to csv files")
	data_all.to_csv('data_all.csv')
	print('The master table has been merged and saved as csv')





Results

Converting this into a multithreaded crawler is straightforward; see a multithreading tutorial for details.
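
As a rough sketch of what that could look like (my own, reusing the get_IQ_data and get_TV_Id functions defined above; the downloads are I/O-bound, so threads help despite the GIL):

import pandas as pd
from threading import Thread

def crawl_episode(index, tv_id, results):
    # worker: download one episode's bullet comments and store the DataFrame in a shared list
    results[index] = get_IQ_data(index, str(tv_id))

tv_ids = get_TV_Id('252449101')  # same albumid as above
results = [None] * len(tv_ids)
threads = [Thread(target=crawl_episode, args=(i, tv_id, results)) for i, tv_id in enumerate(tv_ids)]
for t in threads:
    t.start()
for t in threads:
    t.join()
data_all = pd.concat(results, ignore_index=True)
data_all.to_csv('data_all.csv')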

The final dataset is roughly 200,000 bullet comments. When I find time I will use Pyecharts for some analysis and review the basics along the way. See you.
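
As a first taste of that follow-up, here is a sketch (assuming pyecharts 1.x and the data_all.csv written above) of a bar chart counting bullet comments per episode:

import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Bar

data_all = pd.read_csv('data_all.csv')
counts = data_all.groupby('tv_name').size()  # number of bullet comments per episode
bar = (
    Bar()
    .add_xaxis([str(ep + 1) for ep in counts.index])  # +1 because tv_name is 0-based in the crawler output
    .add_yaxis("bullet comments", counts.tolist())
    .set_global_opts(title_opts=opts.TitleOpts(title="Bullet comments per episode"))
)
bar.render("barrage_count.html")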

Data analysis

Sentiment analysis of the Chinese bullet comments

# The following was developed in a Notebook
# Review of the basics: load the data into a database
import pymysql
import pandas as pd 
# create the connection
connection = pymysql.connect(host='localhost',port=3306,user='root',password='chuan32',db='zc')
cursor = connection.cursor() # get a cursor
# Create the data table; the table structure is usually built directly in Navicat rather than in code
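# A sketch of the equivalent DDL (the column types here are my assumption, since the real table was built in Navicat):
# CREATE TABLE IF NOT EXISTS iqiyi (
#     uid        VARCHAR(64),
#     contentsId VARCHAR(64),
#     contents   TEXT,
#     likeCount  VARCHAR(16),
#     tv_name    VARCHAR(8)
# ) DEFAULT CHARSET=utf8mb4;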
# Read the data from the csv and write it into the database
content = pd.DataFrame(pd.read_csv('data_all.csv'))
print(content.head(5))
# Because of the off-by-one slip in the crawler, the tv_name column needs +1
content['tv_name'] = content['tv_name']+1 
content.iloc[:,1:] # inspect the content (skipping the csv index column)
uid = content['uid'].astype(str)
contentsId = content['contentsId'].astype(str)
contents = content['contents'].astype(str)
likeCount = content['likeCount'].astype(str)
tv_name = content['tv_name'].astype(str)
content.iloc[:,1:]
# Write into the database
cursor = connection.cursor()
sql = 'insert into iqiyi(uid,contentsId,contents,likeCount,tv_name) values (%s,%s,%s,%s,%s);'
for i in range(len(content)):
    try:
        cursor.execute(sql,[uid[i],contentsId[i],contents[i],likeCount[i],tv_name[i]])  
        connection.commit()
    except Exception as e:
        print(e)
        connection.rollback()
# Close promptly to keep the transaction consistent
cursor.close()
connection.close()
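# A faster alternative (sketch): batch all rows with executemany and commit once,
# instead of committing row by row inside the loop above
# rows = list(zip(uid, contentsId, contents, likeCount, tv_name))
# cursor.executemany(sql, rows)
# connection.commit()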
# Query the database
# The connection above has already been closed, so we need to reconnect here
config = {'host':'localhost','port':3306,'user':'root','password':'chuang199832','db':'zc'}
conn = pymysql.connect(**config)
cursor = conn.cursor()
sql = 'select * from iqiyi;'
cursor.execute(sql)
data = cursor.fetchall()
print(len(data))
print(type(data))
# Get the column names
decs = []
for field in cursor.description:
    decs.append(field[0])
print(decs)
# Convert to a DataFrame
data = pd.DataFrame(data,columns=decs)
data.head(5)
# Sentiment analysis with SnowNLP
from snownlp import SnowNLP
print(data.head(5))
data_copy = data.copy() # keep a copy before modifying the data
senti = []
for index,content in enumerate(data_copy['contents']):
    content = SnowNLP(content) # create the analysis object
    senti.append(content.sentiments)

data_copy['sentiment'] = senti
data_copy
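
As a possible next step (my own sketch, not part of the original notebook), the SnowNLP scores can be averaged per episode with a pandas groupby; scores close to 1 are positive and close to 0 negative:

# average sentiment per episode, using the data_copy DataFrame built above
episode_sentiment = data_copy.groupby('tv_name')['sentiment'].mean()
print(episode_sentiment)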

Origin blog.csdn.net/weixin_40539952/article/details/107018666