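I suddenly felt like watching a movie, so I took a look at Maoyan Movies. I could not tell which films were worth watching, and casually browsing the listings did not seem very reliable, so I scraped and analyzed the data in bulk. The results were not as good as I had hoped. The full workflow is below; give it a try if you are interested.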
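I. Development environment
- Python 3.8
- requests
- bs4
- lxml (the parser passed to bs4 in the code below)
- matplotlib
- wordcloud
- random (standard library)
- time (standard library)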
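II. Analyzing the site
1. Open the Maoyan Movies site and go to the movie categories, then pick a detailed category as shown in the figure below
2. Inspect the page: open the browser developer console and select a movie element; the console shows the corresponding HTML structure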
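To make the structure concrete, here is a minimal sketch of what one movie card might look like and how its fields can be picked out. The HTML fragment is a hand-written approximation inferred from the class names used by the parsing code later in this post, not a copy of the real page, and the title and values in it are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the assumed layout of one movie card.
SAMPLE_HTML = """
<div class="movie-item-hover">
  <div class="movie-hover-title">
    <span class="name">Sample Movie</span>
    <span class="score">9.1</span>
  </div>
  <div class="movie-hover-title"><span class="hover-tag">类型:</span> 剧情/爱情 </div>
  <div class="movie-hover-title"><span class="hover-tag">主演:</span> Actor A/Actor B </div>
</div>
"""

# html.parser ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
card = soup.find("div", class_="movie-item-hover")
name = card.find("span", class_="name").text
score = card.find("span", class_="score").text
# The detail text sits right after each hover-tag label, as a sibling node.
details = [
    str(title.find("span", class_="hover-tag").next_sibling).strip()
    for title in card.find_all("div", class_="movie-hover-title")[1:]
]
print(name, score, details)
```

Trying the selectors on a small fragment like this is a cheap way to confirm them before crawling real pages.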
III. Code implementation
1. Fetch the page
import random

import requests


def get_html(url):
    """Fetch a page and return its HTML, or "" if the request fails."""
    print("Fetching page: %s" % url)
    # Proxy pool; one entry is picked at random for each request
    proxies = [
        {'http': 'http://202.55.5.209:8090'},
        {'http': 'http://183.247.199.114:30001'},
        {'http': 'http://122.9.101.6:8888'},
    ]
    # Request headers that imitate a browser session
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
        "Cookie": "uuid_n_v=v1; uuid=965711E0E8C611EC8A445B788061A84C64DB045EA13840149498F0104B8AF19A; _csrf=231cfce2d54abbd1bf2609ca76cd22ec894318c8223a9e698eb9b798bc2adbd8; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1654870020; _lx_utm=utm_source=google&utm_medium=organic; _lxsdk_cuid=1814df090c8c8-02bcc6fa43763b-1d525635-13c680-1814df090c8c8; _lxsdk=965711E0E8C611EC8A445B788061A84C64DB045EA13840149498F0104B8AF19A; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1654870030; __mta=142572173.1654870021186.1654870021186.1654870030627.2; _dd_s=logs=1&id=aa00cee7-93d3-4b37-aa97-0eb2f74923bf&created=1654870020029&expire=1654871326477; _lxsdk_s=1814df090c9-86d-7ca-093||6"
    }
    try:
        # The timeout keeps a dead proxy or stalled server from hanging the crawl
        resp = requests.get(url, headers=headers, proxies=random.choice(proxies), timeout=10)
    except requests.RequestException as exc:
        print("Request failed: %s" % exc)
        return ""
    if resp.status_code == 200:
        return resp.text
    return ""
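One detail about the proxy list is worth noting: requests chooses a proxy by the scheme of the URL being requested, so an entry keyed only by 'http' is not applied to an https:// address such as the Maoyan URL used in this post. Below is a small sketch of building entries that cover both schemes. `build_proxy_pool` is a hypothetical helper introduced here, and the addresses are the ones from the code above, which may no longer be reachable or may not support HTTPS tunnelling.

```python
import random

def build_proxy_pool(addresses):
    """Return de-duplicated proxy dicts that cover both http and https URLs."""
    pool = []
    for addr in dict.fromkeys(addresses):  # drops repeats, keeps order
        proxy_url = "http://" + addr
        pool.append({"http": proxy_url, "https": proxy_url})
    return pool

PROXY_POOL = build_proxy_pool([
    "202.55.5.209:8090",
    "183.247.199.114:30001",
    "122.9.101.6:8888",
    "202.55.5.209:8090",  # repeated on purpose to show de-duplication
])

# Pick one entry per request, e.g. requests.get(url, proxies=proxy)
proxy = random.choice(PROXY_POOL)
print(proxy)
```

Dropping repeated addresses also keeps random.choice from favouring one proxy over the others.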
2. Parse the page and save the results to a file
from bs4 import BeautifulSoup as BS4


def extract_html(html):
    """Parse one listing page and append each movie to data.txt."""
    print("Parsing started")
    soup = BS4(html, "lxml")
    hover_list = soup.find_all("div", class_="movie-item-hover")
    for hover in hover_list:
        name = hover.find("span", class_="name").text
        score = hover.find("span", class_="score")
        # Movies without a rating have no score element; "无评分" means "no rating"
        score = "无评分" if score is None else score.text
        i_list = [name, score]
        info_list = hover.find_all("div", class_="movie-hover-title")
        # Skip the first title block; only the later ones are read
        for info in info_list[1:]:
            # The detail text is the node right after the hover-tag label
            i_list.append(str.strip(info.find("span", class_="hover-tag").next_sibling))
        # One line per movie; the explicit encoding keeps Chinese text portable
        with open("data.txt", "a+", encoding="utf-8") as d:
            d.write(str(i_list) + "\n")
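Writing each record as the string form of a list works, but it has to be parsed back later, and a title that contains a comma would confuse a plain split on ",". As an alternative sketch (not what the code above does), the standard csv module handles quoting on both the write and the read side. The file name movies.csv, the helper names, and the sample row are assumptions made for illustration.

```python
import csv
import os
import tempfile

def append_row(path, row):
    """Append one movie record as a CSV line (UTF-8, quoted when needed)."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)

def read_rows(path):
    """Read every record back as a list of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "movies.csv")
    append_row(demo_path, ["Hello, World", "无评分", "剧情/爱情"])
    rows = read_rows(demo_path)
print(rows)
```

The comma inside the title survives the round trip because csv quotes that field automatically.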
3. After working out Maoyan's URL pattern, add automatic paging
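import random
import time


def main():
    print("Task started")
    for i in range(20, 40):
        print("Starting page %d" % i)
        # Each page holds 30 movies, so the offset advances in steps of 30
        url = "https://www.maoyan.com/films?showType=3&offset=%d" % (i * 30)
        html = get_html(url)
        extract_html(html)
        print("Finished page %d" % i)
        # Random pause between pages so requests are not sent back to back
        sleep_time = random.randint(1, 3)
        print("Sleeping for %d seconds" % sleep_time)
        time.sleep(sleep_time)
    print("Task finished")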
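The paging rule is plain arithmetic: the loop assumes 30 movies per page, so page i maps to offset = i * 30. Here is a minimal sketch that isolates the rule in a generator, which makes it easy to check the URLs before sending any request. `page_urls` and `PAGE_SIZE` are names introduced here for illustration.

```python
BASE_URL = "https://www.maoyan.com/films?showType=3&offset=%d"
PAGE_SIZE = 30  # movies per page, as assumed by the loop in main()

def page_urls(first_page, last_page):
    """Yield (page, url) for first_page..last_page-1, mirroring range()."""
    for page in range(first_page, last_page):
        yield page, BASE_URL % (page * PAGE_SIZE)

urls = list(page_urls(20, 23))
for page, url in urls:
    print(page, url)
```

For pages 20 to 22 this gives offsets 600, 630 and 660.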
4. Analyze the data and build a chart and a word cloud
import ast

import matplotlib.pyplot as plt
import wordcloud


def analysis():
    name_list = []
    score_list = []
    words = ""
    with open("data.txt", "r", encoding="utf-8") as f:
        num = 0
        for line in f:
            if num > 1000:
                break
            line = line.strip()
            if not line:
                continue
            # Each line is the str() of a list, so literal_eval restores it
            # (safer than stripping brackets and splitting on commas)
            data_list = ast.literal_eval(line)
            # Use the row index on the x axis instead of the movie name
            # name_list.append(data_list[0])
            name_list.append(str(num))
            # Unrated movies ("无评分") are plotted with a score of 0
            score_list.append(float(data_list[1]) if data_list[1] != "无评分" else 0)
            num = num + 1
            print(data_list[0] + ":" + data_list[2])
            words = words + data_list[2] + ","
    # Genres are separated by "/", so turn them into individual words
    words = words.replace("/", ",")
    # Use a font with Chinese glyphs so text in the chart is not garbled
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
    plt.figure(figsize=(10, 10.5))
    plt.scatter(name_list, score_list, c="red")
    plt.xlabel("Movie")
    plt.ylabel("Score")
    plt.title("Movie scores")
    plt.show()
    # font_path must point to a font file that contains Chinese glyphs
    w = wordcloud.WordCloud(width=1000, height=700, background_color='white', font_path='11.ttf',
                            collocations=False, scale=1.5)
    w.generate(words)
    w.to_file('res.png')
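Two things in this analysis are worth checking numerically. Unrated movies are plotted as 0, which drags down any average, and the word cloud is essentially a picture of how often each word in the third field occurs. The sketch below uses made-up rows (titles, scores, and genres are invented for illustration) to compute the mean over rated movies only and to count the "/"-separated words with collections.Counter.

```python
from collections import Counter

# Invented sample rows with the same leading fields as the lines in data.txt.
rows = [
    ["Movie A", "9.0", "剧情/爱情"],
    ["Movie B", "无评分", "喜剧"],
    ["Movie C", "8.0", "剧情/喜剧"],
]

rated = [float(r[1]) for r in rows if r[1] != "无评分"]
mean_rated = sum(rated) / len(rated)      # ignores unrated movies
mean_with_zeros = sum(rated) / len(rows)  # unrated movies counted as 0

# Split the third field on "/" and count each word.
genre_counts = Counter(g for r in rows for g in r[2].split("/"))
print(mean_rated, round(mean_with_zeros, 2), genre_counts)
```

If the two means differ a lot on the real data, the zero placeholders are probably part of why the scatter plot looks off.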
IV. Results
1. Scatter plot
2. Word cloud
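V. Personal opinion
The ratings on commercial movie sites do not seem very accurate, which is a little disappointing. Which platform do you think is more reliable? Feel free to discuss.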