I suddenly felt like watching a movie, so I analyzed Maoyan Movies. And yet...

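I suddenly felt like watching a movie but had no idea which ones were any good. Browsing Maoyan Movies by hand did not give me a reliable impression of the ratings, so I scraped the listings in bulk and analyzed the data. The results were less convincing than I had hoped. The full process is described below for anyone who wants to try it.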

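I. Development environment

  • Python 3.8
  • requests
  • bs4 (BeautifulSoup; the code below uses the lxml parser, so lxml must be installed as well)
  • matplotlib
  • wordcloud
  • random (standard library)
  • time (standard library)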

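II. Analyzing the site

1. Open the Maoyan Movies site and go to the movie listing page, then pick the detailed category filters as shown in the screenshot below.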

image.png

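2. Inspect the page content: open the browser's developer console and select a movie entry. The console then shows the HTML structure that corresponds to it.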

image.png
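
To make the inspection step concrete, here is a minimal standard-library sketch that pulls the name and score out of a fragment shaped like the markup seen in the console. The class names (movie-item-hover, movie-hover-title, name, score, hover-tag) are the ones the scraping code in this article relies on; the exact nesting of SAMPLE_HTML and the NameScoreParser class are assumptions made for illustration, and the article itself uses bs4 rather than html.parser.

```python
from html.parser import HTMLParser

# Hypothetical fragment: the class names come from the scraping code in this
# article, but the exact nesting on the live site is an assumption.
SAMPLE_HTML = """
<div class="movie-item-hover">
  <div class="movie-hover-title"><span class="name">Example Movie</span><span class="score">9.1</span></div>
  <div class="movie-hover-title"><span class="hover-tag">类型:</span> 剧情/爱情 </div>
</div>
"""


class NameScoreParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="score">."""

    def __init__(self):
        super().__init__()
        self.current = None  # class of the span currently being read, if any
        self.found = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "score"):
            self.current = cls

    def handle_data(self, data):
        # Accumulate text only while inside one of the two target spans
        if self.current:
            self.found[self.current] = self.found.get(self.current, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


parser = NameScoreParser()
parser.feed(SAMPLE_HTML)
print(parser.found)
```

Once the relevant class names are known, the same lookup can be written far more compactly with bs4's find and find_all, which is the approach taken in the rest of this article.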

III. Implementation

1. Fetching a page

import random

import requests


def get_html(url):
    """Fetch a page and return its HTML, or "" if the request fails."""
    print("Fetching page: %s" % url)
    # Proxy pool. requests selects a proxy by the URL scheme, so these
    # 'http' entries are not applied to the https:// URLs requested in main().
    proxies = [
        {'http': 'http://202.55.5.209:8090'},
        {'http': 'http://183.247.199.114:30001'},
        {'http': 'http://122.9.101.6:8888'},
    ]
    # Disguise the request as a browser session. The cookie is tied to one
    # session, so replace it with a value copied from your own browser.
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
        "Cookie": "uuid_n_v=v1; uuid=965711E0E8C611EC8A445B788061A84C64DB045EA13840149498F0104B8AF19A; _csrf=231cfce2d54abbd1bf2609ca76cd22ec894318c8223a9e698eb9b798bc2adbd8; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1654870020; _lx_utm=utm_source=google&utm_medium=organic; _lxsdk_cuid=1814df090c8c8-02bcc6fa43763b-1d525635-13c680-1814df090c8c8; _lxsdk=965711E0E8C611EC8A445B788061A84C64DB045EA13840149498F0104B8AF19A; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1654870030; __mta=142572173.1654870021186.1654870021186.1654870030627.2; _dd_s=logs=1&id=aa00cee7-93d3-4b37-aa97-0eb2f74923bf&created=1654870020029&expire=1654871326477; _lxsdk_s=1814df090c9-86d-7ca-093||6"
    }
    try:
        resp = requests.get(url, headers=headers, proxies=random.choice(proxies), timeout=10)
    except requests.RequestException as exc:
        # Network errors and timeouts are treated like a failed page
        print("Request failed: %s" % exc)
        return ""
    if resp.status_code == 200:
        return resp.text
    return ""
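
One detail about the proxy list is worth spelling out: requests looks up the proxy by the scheme of the requested URL, so a mapping that only has an 'http' key is not used for https:// pages. The following is a minimal sketch of a helper that covers both schemes. The name pick_proxy and the placeholder address are hypothetical and are not part of the original script.

```python
import random


def pick_proxy(pool):
    """Pick one proxy address and map it to both URL schemes.

    `pool` is a list of proxy URLs; the returned dict has the shape that
    requests accepts for its `proxies` argument.
    """
    address = random.choice(pool)
    return {"http": address, "https": address}


# Placeholder address for illustration only; substitute working proxies.
example_pool = ["http://127.0.0.1:8080"]
print(pick_proxy(example_pool))
```

The result could then be passed as proxies=pick_proxy(pool) in the requests.get call, assuming the proxies in the pool actually accept HTTPS traffic.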

2. Parsing the page and saving the results to a file

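from bs4 import BeautifulSoup as BS4


def extract_html(html):
    """Parse one listing page and append each movie's fields to data.txt."""
    print("Parsing started")
    soup = BS4(html, "lxml")
    hover_list = soup.find_all("div", class_="movie-item-hover")
    for hover in hover_list:
        i_list = []
        name = hover.find("span", class_="name").text
        score = hover.find("span", class_="score")
        # Movies without a rating get the placeholder "无评分" ("no rating"),
        # which analysis() later maps to 0
        if score is None:
            score = "无评分"
        else:
            score = score.text
        info_list = hover.find_all("div", class_="movie-hover-title")
        i_list.append(name)
        i_list.append(score)
        # The first title block is skipped; each remaining block carries a
        # "hover-tag" label followed by the value we want to keep
        for info in info_list[1:]:
            i_list.append(str.strip(info.find("span", class_="hover-tag").next_sibling))
        # Append one line per movie; UTF-8 keeps the Chinese text intact
        with open("data.txt", "a+", encoding="utf-8") as d:
            d.write(str(i_list) + "\n")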
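
Writing str(i_list) makes the file awkward to read back, because every line later has to be stripped of brackets and quotes by hand. As an alternative sketch (this is not what the original script does), each record could be stored as one JSON document per line. save_row and load_rows are hypothetical helpers, and data.jsonl is an assumed file name.

```python
import json
import os
import tempfile


def save_row(path, row):
    """Append one record to `path` as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_rows(path):
    """Read every JSON line in `path` back into a list of records."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip with a made-up record in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
save_row(demo_path, ["Example, Movie", "无评分", "剧情/爱情"])
print(load_rows(demo_path))
```

With this layout a movie name that contains a comma survives the round trip, which the split(",") approach cannot guarantee.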

3. Adding automatic pagination, based on the URL pattern observed on Maoyan

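import random
import time


def main():
    print("Task started")
    # The offset parameter advances by 30 for each page
    for i in range(20, 40):
        print("Starting page %d" % i)
        url = "https://www.maoyan.com/films?showType=3&offset=%d" % (i * 30)
        html = get_html(url)
        extract_html(html)
        print("Finished page %d" % i)
        # Pause for a random 1 to 3 seconds between requests
        sleep_time = random.randint(1, 3)
        print("Sleeping for %d seconds" % sleep_time)
        time.sleep(sleep_time)
    print("Task finished")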
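
The pagination rule above (offset equals the page index times 30) can also be isolated in a small helper so the query string is built in one place. This is only a sketch: build_page_url, BASE_URL and PAGE_SIZE are names introduced here, and the page size of 30 is inferred from the offset step in main() rather than confirmed by any site documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.maoyan.com/films"
PAGE_SIZE = 30  # assumed from the offset step used in main()


def build_page_url(page, show_type=3):
    """Return the listing URL for a zero-based page index."""
    query = urlencode({"showType": show_type, "offset": page * PAGE_SIZE})
    return "%s?%s" % (BASE_URL, query)


for page in range(20, 23):
    print(build_page_url(page))
```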

4. Analyzing the data and building the chart and word cloud

import ast

import matplotlib.pyplot as plt
import wordcloud


def analysis():
    name_list = []
    score_list = []
    words = ""
    with open("data.txt", "r", encoding="utf-8") as f:
        num = 0
        for data in f:
            if num > 1000:
                break
            data = data.strip()
            if not data:
                continue
            # Each line is the str() of a list, so parse it back as a Python
            # literal; this keeps names that contain commas in one piece
            data_list = [str(item).replace(" ", "") for item in ast.literal_eval(data)]
            if len(data_list) < 3:
                continue
            # name_list.append(data_list[0])
            # Use the row index on the x axis instead of the movie name
            name_list.append(str(num))
            # Unrated movies ("无评分") are plotted with a score of 0
            score_list.append(float(data_list[1]) if data_list[1] != "无评分" else 0)
            num = num + 1
            print(data_list[0] + ":" + data_list[2])
            words = words + data_list[2] + ","
    words = words.replace("/", ",")
    # Use a font with Chinese glyphs so the chart labels are not garbled
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
    plt.figure(figsize=(10, 10.5))
    plt.scatter(name_list, score_list, c="red")
    plt.xlabel("电影")
    plt.ylabel("分数")
    plt.title("电影评分")
    plt.show()

    # font_path must point to a local font file that supports Chinese
    w = wordcloud.WordCloud(width=1000, height=700, background_color='white', font_path='11.ttf', collocations=False,
                            scale=1.5)
    w.generate(words)
    w.to_file('res.png')
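
The word cloud is essentially a picture of how often each tag occurs. As a rough textual counterpart, the same counts can be computed with collections.Counter. The sketch below assumes that the third stored field holds slash-separated genre tags, as the replace("/", ",") step suggests; count_genres is a hypothetical helper and the sample values are made up.

```python
from collections import Counter


def count_genres(genre_fields):
    """Count genre tokens across fields such as "剧情/爱情"."""
    counter = Counter()
    for field in genre_fields:
        # Treat both "/" and "," as separators, as the word cloud input does
        for token in field.replace(",", "/").split("/"):
            token = token.strip()
            if token:
                counter[token] += 1
    return counter


# Made-up sample values, not real scraped data.
sample = ["剧情/爱情", "剧情", "喜剧/剧情"]
print(count_genres(sample).most_common(2))
```

Printing the most common tags is a quick way to check that the word cloud is emphasizing what the data actually contains.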

IV. Results

1. Scatter plot

image.png

2. Word cloud

image.png

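V. Personal opinion

The ratings on commercial movie sites do not seem very accurate, which is a little disappointing. Which platform do you think is more reliable? Feel free to discuss.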

Reposted from juejin.im/post/7107824882078973989