Cloud Studio in Practice: Developing a Bilibili Top 100 Popular Video Crawler Application

Cloud Studio has become very popular recently, so I tried it out, and it really is convenient! In this post I use Python to crawl the top 100 videos in each section of Bilibili and visualize the results, as a way to share Cloud Studio with you. Application link: Cloud Studio in Practice: Developing a Bilibili Top 100 Popular Video Crawler Application

1. Introduction to Tencent Cloud Cloud Studio

Click to open a workspace, select a link, and edit the code in it, without worrying about incompatibilities in your local environment. Tencent Cloud Cloud Studio is a cloud-based development environment that helps developers develop and collaborate more efficiently. It provides an integrated development environment (IDE) that can be accessed from anywhere over the Internet, without installing any software locally.

I summarize the advantages of Tencent Cloud Cloud Studio as follows:

  1. Flexibility: Cloud Studio runs on any device with a web browser. This allows developers to access their development environments anytime, anywhere, whether in the office, at home, or on the go.

  2. Resource scalability: Cloud Studio runs in the cloud and can dynamically adjust computing and storage resources as needed. This means developers can flexibly scale resources up or down according to project needs, without being constrained by local hardware.

  3. Collaborative ability: Cloud Studio supports multi-person collaborative development, and multiple developers can work simultaneously in the same development environment. This can improve team collaboration efficiency and reduce code conflicts and merge issues.

  4. Security: Tencent Cloud provides strict security measures to protect user data and development environment. Cloud Studio uses a secure transmission protocol and provides functions such as data encryption and access control to ensure that user codes and data are protected.

  5. Ecosystem integration: Cloud Studio is tightly integrated with other services of Tencent Cloud, such as cloud server, object storage, database, etc. This makes it easy for developers to use these services to build and deploy applications.

In my opinion, Tencent Cloud Cloud Studio provides a flexible, scalable, secure, and collaborative development environment that lets developers build and collaborate on software more efficiently.

2. Bilibili crawler

Bilibili (commonly known in China as "Station B") is a well-known online video sharing platform in China and one of the largest ACG (animation, comics, and games) communities in the world. Centered on ACG content, Bilibili offers users high-quality original videos, danmaku (bullet comments), live streaming, and community features. As a platform loved by young people, it has gathered a large number of ACG content creators and fans, forming a distinctive ACG culture. Through Bilibili, users can enjoy all kinds of excellent video works, take part in interactive live streams, and share hobbies and exchange experiences with like-minded people.

2.1 Crawler code

import requests
import pandas as pd
url_dict = {
	'全站': 'https://api.bilibili.com/x/web-interface/ranking/v2?rid=0&type=all',
	'动画': 'https://api.bilibili.com/x/web-interface/ranking/v2?rid=1&type=all',
	'生活': 'https://api.bilibili.com/x/web-interface/ranking/v2?rid=160&type=all',
	'动物圈': 'https://api.bilibili.com/x/web-interface/ranking/v2?rid=217&type=all',
	'娱乐': 'https://api.bilibili.com/x/web-interface/ranking/v2?rid=5&type=all',
	'影视': 'https://api.bilibili.com/x/web-interface/ranking/v2?rid=181&type=all',
	'原创': 'https://api.bilibili.com/x/web-interface/ranking/v2?rid=0&type=origin',
}
headers = {
	'Accept': 'application/json, text/plain, */*',
	'Origin': 'https://www.bilibili.com',
	'Host': 'api.bilibili.com',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15',
	'Accept-Language': 'zh-cn',
	'Connection': 'keep-alive',
	'Referer': 'https://www.bilibili.com/v/popular/rank/all'
}

for tab_name, url in url_dict.items():  # tab_name: section name; url: ranking API address
	title_list = []         # video titles
	play_cnt_list = []      # play counts
	danmu_cnt_list = []     # danmaku counts
	coin_cnt_list = []      # coin counts
	like_cnt_list = []      # like counts
	dislike_cnt_list = []   # dislike counts
	share_cnt_list = []     # share counts
	favorite_cnt_list = []  # favorite counts
	author_list = []        # uploader names
	score_list = []         # overall scores
	video_url = []          # video URLs
	try:
		r = requests.get(url, headers=headers)
		print(r.status_code)
		# pprint(r.content.decode('utf-8'))
		# r.encoding = 'utf-8'
		# pprint(r.json())
		json_data = r.json()
		list_data = json_data['data']['list']
		for data in list_data:
			title_list.append(data['title'])
			play_cnt_list.append(data['stat']['view'])
			danmu_cnt_list.append(data['stat']['danmaku'])
			coin_cnt_list.append(data['stat']['coin'])
			like_cnt_list.append(data['stat']['like'])
			dislike_cnt_list.append(data['stat']['dislike'])
			share_cnt_list.append(data['stat']['share'])
			favorite_cnt_list.append(data['stat']['favorite'])
			author_list.append(data['owner']['name'])
			score_list.append(data['score'])
			video_url.append('https://www.bilibili.com/video/' + data['bvid'])
			print('*' * 30)
	except Exception as e:
		print("爬取失败:{}".format(str(e)))

	df = pd.DataFrame(
		{'视频标题': title_list,
		 '视频地址': video_url,
		 '作者': author_list,
		 '综合得分': score_list,
		 '播放数': play_cnt_list,
		 '弹幕数': danmu_cnt_list,
		 '投币数': coin_cnt_list,
		 '点赞数': like_cnt_list,
		 '点踩数': dislike_cnt_list,
		 '分享数': share_cnt_list,
		 '收藏数': favorite_cnt_list,
		 })
	df.to_csv('B站TOP100-{}.csv'.format(tab_name), encoding='utf_8_sig')  # utf_8_sig avoids garbled Chinese characters in Excel
	print('写入成功: ' + 'B站TOP100-{}.csv'.format(tab_name))

2.2 Crawler results


What you get is the popular-video data for the whole site plus six sections, each stored in its own CSV file, for seven CSV files in total. Opening the whole-site file, you can see:
(Screenshot of the whole-site CSV file: image-1.png)

Each CSV file stores the video title, address, author, play count, danmaku count, coin count, and other statistics for that section; these data can then be used for further processing, for example by loading them back with pandas as sketched below.
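As a quick sanity check (a minimal sketch; the filename follows the pattern used by the crawler above), one of the files can be loaded back with pandas:

import pandas as pd

# Load the whole-site ranking and peek at a few of its columns.
df = pd.read_csv("B站TOP100-全站.csv")
print(df.shape)  # roughly 100 rows, one per ranked video
print(df[["视频标题", "作者", "播放数"]].head())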

3. Data visualization

3.1 Whole-site analysis pie chart

3.1.1 Whole-site analysis pie chart code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False
# Whole-site pie chart
Total_station = pd.read_csv("B站TOP100-全站.csv")
num_dic = {}
# play_num = Total_station["播放数"]
barrage_num = Total_station["弹幕数"]
coin_num = Total_station["投币数"]
like_num = Total_station["点赞数"]
share_num = Total_station["分享数"]
collection_num = Total_station["收藏数"]
# num_dic["播放数"] = sum(play_num)
num_dic["弹幕数"] = sum(barrage_num)
num_dic["投币数"] = sum(coin_num)
num_dic["点赞数"] = sum(like_num)
num_dic["分享数"] = sum(share_num)
num_dic["收藏数"] = sum(collection_num)
Num = sum(num_dic.values())
# individual values
data = list(num_dic.values())
# data labels
labels = list(num_dic.keys())
# colour of each wedge
colors = ['green', 'orange', 'red', 'purple', 'blue']
# convert each value to a percentage of the total
sizes = [data[0] / Num * 100, data[1] / Num * 100, data[2] / Num * 100, data[3] / Num * 100, data[4] / Num * 100]
# offset of the highlighted wedge
explodes = (0, 0, 0, 0.1, 0)
# set plotting options and draw
plt.pie(sizes, explode=explodes, labels=labels, shadow=True, autopct="%3.1f%%", colors=colors)
# keep the pie circular
plt.axis('equal')
plt.title("主站分析饼状图", fontsize=20)
# save and show
plt.savefig('主站分析饼状图.png')
plt.show()

3.1.2 Whole-site analysis pie chart results

(Figure: whole-site analysis pie chart; image-2.png)

3.2 Vertical bar chart comparing the sections

3.2.1 Vertical bar chart code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False


all_list =['视频标题','视频地址','作者','综合得分','播放数','弹幕数','投币数','点赞数','点踩数','分享数','收藏数']
all_dic = {}
Total_station = pd.read_csv("B站TOP100-全站.csv")
animal = pd.read_csv("B站TOP100-动物圈.csv")
animation = pd.read_csv("B站TOP100-动画.csv")
original = pd.read_csv("B站TOP100-原创.csv")
entertainment = pd.read_csv("B站TOP100-娱乐.csv")
film_television = pd.read_csv("B站TOP100-影视.csv")
life = pd.read_csv("B站TOP100-生活.csv")
# all_dic["全站"] = sum(Total_station["播放数"])
# 垂直各站对比图
all_dic["动物圈"] = sum(animal["播放数"])
all_dic["动画"] = sum(animation["播放数"])
all_dic["原创"] = sum(original["播放数"])
all_dic["娱乐"] = sum(entertainment["播放数"])
all_dic["影视"] = sum(film_television["播放数"])
all_dic["生活"] = sum(life["播放数"])
y1 = list(all_dic.values())
x = np.arange(len(y1))
plt.bar(x=x,height=y1,width=0.4)
a = [0,1,2,3,4,5]
labels = ['动物圈', '动画', '原创', '娱乐','影视','生活']
plt.xticks(a,labels,rotation = 10)
plt.xlabel('不同区名称',fontsize=10)
plt.ylabel('播放总数',fontsize=10)
plt.title("不同区前一百播放总数对比",fontsize=20)
plt.savefig("垂直各站对比图.jpg", dpi=300)
# plt.show()

3.2.2 Vertical bar chart results

(Figure: vertical bar chart comparing the sections; image-3.png)

3.3 Word cloud analysis

3.3.1 Word cloud analysis code

import wordcloud as wc
import jieba
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Total_station = pd.read_csv("B站TOP100-全站.csv")
f = open('temp.txt', mode='w', encoding='utf-8')  # write with an explicit encoding
title = Total_station["视频标题"][:5]  # first five titles
author = Total_station["作者"]         # all authors
for i in title:
    f.write(i)  # write the titles
for i in author:
    f.write(i)  # write the authors
f.close()  # close the file
with open("temp.txt", mode="r", encoding="utf-8") as fp:  # read back with the same encoding
    content = fp.read()  # read the file contents
res = jieba.lcut(content)  # Chinese word segmentation
text = " ".join(res)  # join all the words with spaces
mask = np.array(Image.open("背景.jpg"))  # mask image that shapes the word cloud
word_cloud = wc.WordCloud(font_path="msyh.ttc", mask=mask)  # create the word cloud object
word_cloud.generate(text)  # generate the words
plt.imshow(word_cloud)  # display the word cloud
word_cloud.to_file("词云分析.png")  # save as an image
plt.show()  # show the figure

3.3.2 Word cloud analysis results

(Figure: word cloud of video titles and authors; image-4.png)

4. Code explanation

4.1 Crawler

  • First of all, you need to install the requests and pandas libraries. If you are using an Anaconda environment, they are usually already included and do not need to be installed separately; otherwise, install them yourself. There are plenty of tutorials on CSDN or Bilibili that walk you through it.
  • url_dict = {} defines a dictionary whose keys are the section names and whose values are the corresponding API URLs (the addresses that will be requested).
  • headers is used to disguise the request. If you run the crawler locally (for example in PyCharm) without these headers, the server can easily tell that the request comes from a crawler and reject it. You can think of the headers as a kind of ID card the site recognizes: with them, the request to the target site goes through smoothly.
  • Next is a for loop: url_dict is the dictionary defined above, and url_dict.items() yields all of its (key, value) pairs. Tuple unpacking assigns the section name to tab_name and the API address to url.
  • try-except is used to catch exceptions during crawling, so that a single failure does not crash the program, which makes it more robust.
  • The body of the try block is the core of the crawler: r = requests.get(url, headers=headers) fetches the target API, and json_data = r.json() parses the response into a nested dictionary of keys and values (a sketch of its structure is shown below).
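The screenshot of the response did not survive, so here is a minimal sketch of its structure, reconstructed only from the fields the crawler actually reads (the real response contains additional keys, and all values below are placeholders):

# Sketch of the JSON returned by the ranking API, limited to the fields used above.
json_data = {
    "data": {
        "list": [
            {
                "title": "...",            # video title
                "bvid": "BV1xxxxxxxxx",    # placeholder video id
                "score": 0,                # overall score
                "owner": {"name": "..."},  # uploader
                "stat": {
                    "view": 0, "danmaku": 0, "coin": 0, "like": 0,
                    "dislike": 0, "share": 0, "favorite": 0,
                },
            },
            # ... up to 100 entries
        ]
    }
}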

list_data = json_data['data']['list'] takes the value of the list key inside the data dictionary and returns a list.
A for loop then traverses list_data and appends each field to the corresponding list. The knowledge points involved here are indexing into lists and dictionaries, including nested ones.
df = pd.DataFrame converts the collected lists into a DataFrame, which makes it easy to write a CSV file afterwards.
Finally, df.to_csv writes the data to a CSV file; the utf_8_sig encoding avoids garbled Chinese characters when the file is opened in Excel. A message is then printed to indicate that the file has been written.
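As an optional tweak (a sketch, not part of the original code), adding a timeout and an explicit status check makes network hangs and HTTP error codes surface inside the try/except as well:

# Hypothetical hardened request step; the timeout value is arbitrary.
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()  # raise an exception for 4xx/5xx responses
json_data = r.json()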

4.2 Whole-site analysis pie chart

  • First read the whole-site file with pandas and store the danmaku, coin, like, share, and favorite counts in separate variables.
  • A dictionary maps each metric name to its total; the values are data = list(num_dic.values()) and the labels are labels = list(num_dic.keys()). A colour list is also defined: colors = ['green', 'orange', 'red', 'purple', 'blue'].
  • The data are then converted to percentages of the total, and explodes sets the offset of the highlighted wedge (plt.pie could also normalize the raw totals itself; see the sketch after this list).
  • plt.pie draws the pie chart, taking the data, labels, colours, and other options as arguments.
  • Add a title to the entire picture, and finally save the picture and display it.
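A side note on the code in 3.1.1: plt.pie normalizes its input on its own, so the manual percentage computation is optional; the raw totals could be passed directly, for example:

# Equivalent call using the raw totals; matplotlib scales each wedge by value / sum(values).
plt.pie(data, explode=explodes, labels=labels, shadow=True, autopct="%3.1f%%", colors=colors)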

4.3 Vertical bar chart comparing the sections

  • First read the data of each section, extract its play counts, and sum them as that section's popularity (a more compact way to do this is sketched after this list).
  • The bar chart is drawn with plt.bar, which needs two basic inputs, x and y: x is the list of section names and y is the play-count totals computed above.
  • plt.xlabel, plt.ylabel, and plt.title add the axis labels and the overall title; finally the figure is saved and displayed.
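For reference, the six read_csv calls and sums in 3.2.1 could be collapsed into a loop over the section names used by the crawler; a minimal equivalent sketch:

# Build {section name: total play count} in one pass over the CSV files.
sections = ['动物圈', '动画', '原创', '娱乐', '影视', '生活']
all_dic = {name: pd.read_csv(f"B站TOP100-{name}.csv")["播放数"].sum()
           for name in sections}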

4.4 Word cloud analysis

  • First you need to install the dependencies used by this script; judging from the imports they are wordcloud, jieba, Pillow, numpy, matplotlib, and pandas (for example: pip install wordcloud jieba pillow numpy matplotlib pandas).

  • As before, we read the whole-site data: title = Total_station["视频标题"][:5] takes the first five titles, and author = Total_station["作者"] takes all of the authors.

  • temp.txt is opened for writing (it is created if it does not exist), the titles and authors are written into it with for loops, and the file is closed. It is then reopened for reading; using the same encoding for writing and reading (utf-8 in the code above) avoids decoding errors. (An alternative that skips the temporary file entirely is sketched after this list.)

  • res = jieba.lcut(content) performs Chinese word segmentation with jieba, and all the resulting words are joined with spaces.

  • mask = np.array(Image.open("背景.jpg")) loads the mask image that shapes the word cloud; a WordCloud object is then created, the words are generated, and the cloud is displayed.

  • Finally the image is saved and displayed.
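Incidentally, the temporary file (and its encoding pitfalls) can be skipped entirely by building the text in memory; a minimal sketch that feeds jieba directly:

# Join the first five titles and all author names into one string, then segment it as before.
content = "".join(Total_station["视频标题"][:5]) + "".join(Total_station["作者"])
text = " ".join(jieba.lcut(content))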

5. Summary of Cloud Studio

By using Tencent Cloud Cloud Studio, I successfully developed an application for crawling Bilibili video data and performing visual analysis. This app has the following key features:

  1. Flexibility and convenience: With Cloud Studio, I can access my development environment anytime, anywhere without worrying about device and software limitations. This makes the development process more flexible and convenient.

  2. Data crawling: by calling Bilibili's ranking API, I can obtain the required video data, including the title, play count, like count, and so on. This provides the data basis for the subsequent visual analysis.

  3. Visual analysis: I use Python's data analysis and visualization library to process and analyze the crawled Bilibili video data. By drawing charts and graphs, I can more intuitively display trends in video data, popular content, user preferences, etc.

  4. Regular updates: with Cloud Studio's cloud environment, I can run the application on a schedule, fetch the latest Bilibili video data, and refresh the visualization results, so the application always reflects up-to-date, accurate data.

In general, Tencent Cloud Cloud Studio provided me with an efficient, flexible, and secure development environment, enabling me to successfully develop an application for crawling and visualizing Bilibili videos. This app not only helps me better understand trends and popular content on Bilibili, but also provides valuable data analysis and reference for other users.
