Crawling popular movies with Python, then presenting them with charts and a "word cloud show"

1. Introduction




Today we will crawl data for several popular movies released on "New Year's Day": ratings, runtime, genre, and viewer comments. The plan is:

Visualize the ratings, runtimes, and genres as charts

Show the reviews of the seven movies as word clouds, each with a different shape

2. Data acquisition

1. Rating data

Page analysis

The page source shows that the target data sits inside the tag <ul class="lists">, so it can be extracted with XPath. Here is the code.

Implementation


import requests
from lxml import etree

headers = {
    'Host': 'movie.douban.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    # Paste the cookie string from your own logged-in browser session here.
    'cookie': '<your Douban cookie>',
}
url = "https://movie.douban.com/cinema/nowplaying/zhanjiang/"
r = requests.get(url, headers=headers)
selector = etree.HTML(r.content)
li_list = selector.xpath('//*[@id="nowplaying"]/div[2]/ul/li')

ratings = {}  # movie title -> rating
for item in li_list:
    name = item.xpath('.//*[@class="stitle"]/a/@title')[0].replace(" ", "").replace("\n", "")
    rate_nodes = item.xpath('.//*[@class="subject-rate"]/text()')
    if not rate_nodes:
        continue  # skip movies that have no rating element
    rate = rate_nodes[0].replace(" ", "").replace("\n", "")
    ratings[name] = float(rate)
    print("movie=" + name)
    print("rating=" + rate)
    print("-------")

 

The movie titles and ratings have now been crawled. They are sorted in descending order and visualized later.
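
The sorting step is not shown in the code above, so here is a minimal sketch of one way to do it. It assumes the crawled data is a mapping from movie title to rating; the sample values and the helper name sort_by_rating are made up for illustration. The two output lists use the names itemNames and datas that the plotting code later expects.

```python
# Hypothetical sample data in the same shape as the crawled title -> rating mapping.
sample_ratings = {"Movie A": 7.1, "Movie B": 8.3, "Movie C": 5.8}


def sort_by_rating(ratings):
    """Return (titles, scores) as two parallel lists, highest rating first."""
    ordered = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
    titles = [title for title, _ in ordered]
    scores = [score for _, score in ordered]
    return titles, scores


itemNames, datas = sort_by_rating(sample_ratings)
print(itemNames)
print(datas)
```

Keeping titles and scores in two parallel lists matches how the bar chart code consumes them: one list for the tick labels and one for the bar heights.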

2. Movie runtime and genre

Page analysis

In the source of each movie's page, the runtime is in the element with property="v:runtime", and each genre is in an element with property="v:genre".

Implementation


### Runtime and genre
# Reuses requests, etree and headers from the rating crawler above.
def getmovietime():
    url = "https://movie.douban.com/cinema/nowplaying/zhanjiang/"
    r = requests.get(url, headers=headers)
    selector = etree.HTML(r.content)
    li_list = selector.xpath('//*[@id="nowplaying"]/div[2]/ul/li')

    for item in li_list:
        title = item.xpath('.//*[@class="stitle"]/a/@title')[0].replace(" ", "").replace("\n", "")
        href = item.xpath('.//*[@class="stitle"]/a/@href')[0].replace(" ", "").replace("\n", "")

        # Open the movie's own page and read the runtime and genre elements.
        r = requests.get(href, headers=headers)
        detail = etree.HTML(r.content)
        times = detail.xpath('//*[@property="v:runtime"]/text()')
        genres = detail.xpath('//*[@property="v:genre"]/text()')

        print(title)
        print(times)
        print(genres)

        print("-------")
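
The function above only prints the raw lists. To chart them later, the runtime has to become a number and the genres have to be counted. The following is a rough sketch of that step, not the author's code: it assumes the runtime text looks like "128分钟", and the sample rows and the helper names parse_minutes and summarize are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical sample rows in the shape printed above:
# (title, runtime strings, genre strings). The values are made up.
sample_rows = [
    ("Movie A", ["128分钟"], ["喜剧", "剧情"]),
    ("Movie B", ["99分钟"], ["喜剧", "动画", "冒险"]),
    ("Movie C", [], ["悬疑"]),  # runtime missing on the page
]


def parse_minutes(times):
    """Return the first integer found in the runtime strings, or None."""
    for text in times:
        match = re.search(r"\d+", text)
        if match:
            return int(match.group())
    return None


def summarize(rows):
    """Build a title -> minutes dict and a genre -> count Counter."""
    runtimes = {}
    genre_counts = Counter()
    for title, times, genres in rows:
        minutes = parse_minutes(times)
        if minutes is not None:
            runtimes[title] = minutes
        genre_counts.update(genres)
    return runtimes, genre_counts


runtimes, genre_counts = summarize(sample_rows)
print(runtimes)
print(genre_counts.most_common())
```

A Counter is convenient here because a movie can carry several genres, and each one should add to its own tally.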

3. Comment data

Page analysis

In the page source, the comment data sits inside the tag <div class="mod-bd" id="comments">. If you are not sure how to analyze the page, see the earlier article [python crawls 44130 user viewing data, analyzes and mines The hidden information between the user and the movie!], which also analyzes Douban movies and covers this in detail.

Let's start crawling the comment data for these seven movies!

Implementation


### Comment data
# Reuses requests, etree and headers from the rating crawler above.
import time

def getmoviecomment():
    url = "https://movie.douban.com/cinema/nowplaying/zhanjiang/"
    r = requests.get(url, headers=headers)
    selector = etree.HTML(r.content)
    li_list = selector.xpath('//*[@id="nowplaying"]/div[2]/ul/li')

    for item in li_list:
        title = item.xpath('.//*[@class="stitle"]/a/@title')[0].replace(" ", "").replace("\n", "")
        href = item.xpath('.//*[@class="stitle"]/a/@href')[0].replace(" ", "").replace("\n", "").replace("/?from=playing_poster", "")
        print("movie=" + title)
        print("link=" + href)
        # Append the comments of each movie to its own text file.
        with open(title + ".txt", "a+", encoding='utf-8') as f:
            # 10 pages of 20 comments each
            for k in range(0, 200, 20):
                url = href + "/comments?start=" + str(k) + "&limit=20&status=P&sort=new_score"
                r = requests.get(url, headers=headers)
                page = etree.HTML(r.content)
                # @class="..." is an exact string comparison, trailing space included.
                comment_list = page.xpath('//*[@class="comment-item "]')
                for comment in comment_list:
                    text = comment.xpath('.//*[@class="short"]/text()')
                    if text:  # skip comments with no text node
                        f.write(str(text[0]) + "\n")

        print("-------")
        time.sleep(4)  # pause between movies

 

 

The comments are saved to text files, one per movie, and are later visualized as word clouds with different shapes.
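
The paging scheme used by the comment crawler (start=0, 20, ..., 180 with limit=20) can be pulled out into a small helper. This is only a sketch: the function name comment_page_urls is hypothetical, the subject id in the usage lines is a placeholder, and the query parameters are simply the ones that appear in the crawler above.

```python
from urllib.parse import urlencode


def comment_page_urls(subject_url, pages=10, page_size=20):
    """Build one comment-list URL per page for a movie page URL."""
    base = subject_url.rstrip("/") + "/comments"
    urls = []
    for start in range(0, pages * page_size, page_size):
        query = urlencode({
            "start": start,
            "limit": page_size,
            "status": "P",
            "sort": "new_score",
        })
        urls.append(base + "?" + query)
    return urls


# The subject id below is a placeholder, not a real movie.
urls = comment_page_urls("https://movie.douban.com/subject/0000000/")
print(len(urls))
print(urls[0])
print(urls[-1])
```

Building the URLs up front keeps the request loop short and makes it easy to change the number of pages in one place.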

3. Data visualization

1. Visualization of rating data


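### Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# itemNames: movie titles; datas: their ratings in the same order
# (both taken from the sorted rating data).
font_size = 10  # font size
fig_size = (13, 10)  # figure size

# Update the font size
mpl.rcParams['font.size'] = font_size
# Update the figure size
mpl.rcParams['figure.figsize'] = fig_size
# Chinese titles need a font with CJK glyphs; 'SimHei' is an assumption,
# use any such font you have installed.
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Bar width
bar_width = 0.35

index = np.arange(len(datas))
# Draw the ratings
rects1 = plt.bar(index, datas, bar_width, color='#0072BC', label='Rating')

# X-axis tick labels, centered under each bar
plt.xticks(index, itemNames)
# Y-axis range
plt.ylim(0, 10)
# Chart title
plt.title('Douban ratings')
# Show the legend below the chart
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.03), fancybox=True, ncol=5)

# Add data labels
def add_labels(rects):
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height, str(height), ha='center', va='bottom')
        # White bar edges, purely for looks
        rect.set_edgecolor('white')

add_labels(rects1)

# Save the chart locally
plt.savefig('豆瓣评分.png')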

Analysis

Among the seven hit movies, "Hello, Li Huanying" has the highest rating (8.3) and "Detective Chinatown 3" has the lowest (5.8).

2. Visualization of runtime and genre

Runtime visualization


##### Runtime visualization
# itemNames: movie titles; datas: runtimes in minutes, in the same order.
itemNames.reverse()
datas.reverse()

# Draw the chart.
fig, ax = plt.subplots()
b = ax.barh(range(len(itemNames)), datas, color='#6699CC')

# Add a data label to the right of each horizontal bar.
for rect in b:
    w = rect.get_width()
    ax.text(w, rect.get_y() + rect.get_height() / 2, '%d' %
            int(w), ha='left', va='center')

# Set the tick labels on the Y axis.
ax.set_yticks(range(len(itemNames)))
ax.set_yticklabels(itemNames)
plt.title('Movie runtime (minutes)', loc='center', fontsize=15,
          fontweight='bold', color='red')

#plt.show()
plt.savefig("电影时长(分钟)")

Analysis

1. The runtime of the movies in the chart is around 120 minutes.
2. The longest movie is "Detective Chinatown 3" (136 minutes), and the shortest is "Bear Infested Wild Continent" (99 minutes).

Genre visualization


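##### 2. Genre visualization
# genre_counts: dict mapping each genre to the number of movies in it
# (assumed to be built from the genre lists crawled earlier).
### Sort in ascending order
sorted_items = sorted(genre_counts.items(), key=lambda kv: (kv[1], kv[0]))
print(sorted_items)

# Walk the sorted list backwards so the most common genre comes first.
itemNames = []
datas = []
for i in range(len(sorted_items) - 1, -1, -1):
    itemNames.append(sorted_items[i][0])
    datas.append(sorted_items[i][1])

x = range(len(itemNames))
plt.plot(x, datas, marker='o', mec='r', mfc='w', label='Genre')
plt.legend()  # show the legend
plt.xticks(x, itemNames, rotation=45)
plt.margins(0)
plt.subplots_adjust(bottom=0.15)
plt.xlabel("Genre")  # X-axis label
plt.ylabel("Count")  # Y-axis label
plt.title("Movie genre statistics")  # title
plt.savefig("电影类型统计.png")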

Analysis

Count the genres of these seven movies (some movies belong to several genres, such as 'action', 'fantasy', and 'adventure').
1. Four of the seven movies are comedies.
2. Science fiction, crime, suspense, and adventure each appear in one movie.

3. Word cloud visualization of comment data

Each of the seven movies gets a different shape for its word cloud, so the drawing code is wrapped in a function.


#### Word cloud code
import os

import jieba
from stylecloud import gen_stylecloud

# Font Awesome icon used as the cloud shape for each movie number.
ICONS = {
    "1": 'fab fa-envira',  # envira is a brand icon, so it takes the fab prefix
    "2": 'fas fa-dragon',
    "3": 'fas fa-dog',
    "4": 'fas fa-cat',
    "5": 'fas fa-dove',
    "6": 'fab fa-qq',
    "7": 'fas fa-cannabis',
}

def jieba_cloud(file_name, icon):
    with open(file_name, 'r', encoding='utf8') as f:
        text = f.read()
    # Strip line breaks, full-width spaces, commas and full stops.
    text = text.replace('\n', "").replace("\u3000", "").replace(",", "").replace(",", "").replace("。", "")
    word_list = jieba.cut(text)
    result = " ".join(word_list)  # join the segmented words with spaces
    # Accept the icon number as int or str; unknown numbers use the default shape.
    icon_name = ICONS.get(str(icon), "")
    picp = os.path.splitext(file_name)[0] + '.png'
    # A Chinese font is required, otherwise the characters render incorrectly.
    if icon_name:
        gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc', output_name=picp)
    else:
        gen_stylecloud(text=result, font_path='simsun.ttc', output_name=picp)

    return picp
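
The cleanup step inside jieba_cloud removes a fixed handful of characters with chained replace calls. As a rough sketch of an alternative, a single regular expression can strip whitespace and a wider set of punctuation in one pass. The helper name clean_comment_text and the exact character set are assumptions made for illustration, not part of the original code.

```python
import re

# Whitespace (including the full-width space \u3000) plus common
# Chinese and ASCII punctuation marks.
_NOISE = re.compile(r"[\s\u3000,。!?、;:,.!?;:]+")


def clean_comment_text(text):
    """Remove whitespace and punctuation so only the words remain."""
    return _NOISE.sub("", text)


sample = "好看,\n推荐。\u3000值得一看!"
print(clean_comment_text(sample))
```

The cleaned string can then be passed to jieba.cut in the same way as before.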

Now generate the word clouds from the review data of the seven movies.

 


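### Word clouds of the comment data
def commentanalysis():
    # The names must match the text files written by getmoviecomment().
    lists = ['刺杀小说家','你好,李焕英','人潮汹涌','侍神令','唐人街探案3','新神榜:哪吒重生','熊出没·狂野大陆']
    for i, name in enumerate(lists, start=1):
        title = name + ".txt"
        jieba_cloud(title, str(i))  # icons are keyed "1" to "7"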

Analysis

Without further ado, let's start the "Word Cloud Show"!

1. Assassinate a novelist

2. Crowd

3. Bear Infested Wild Continent

4. New God List: Nezha Rebirth

5. Detective Chinatown 3

6. Hello, Li Huanying

7. Order of Samurai

4. Summary

1. Crawled data on the movies released on "New Year's Day" from Douban (rating, runtime, genre, comments)
2. Visualized the ratings, runtimes, and genres as charts
3. Showed the reviews of the seven movies as word clouds with different shapes


Origin blog.csdn.net/pythonxuexi123/article/details/114588597