Analyzing and visualizing the word-of-mouth and box office data of seven Spring Festival movies with Python

The following article comes from Python crawler data analysis and mining, author Li Yunchen

Preface

This article crawls the data (rating, duration, genre) and the viewer comments for several popular movies released on Lunar New Year's Day this year.

The ratings, durations, and genres are visualized as charts.

The comments on each of the seven movies are shown as word clouds, each with a different shape.

 

Data collection

1. Rating data

Page analysis

Looking at the page source, the target data sits inside the <ul class="lists"> tag and can be extracted with XPath. The code is shown below.

Implementation

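import time

import requests
from lxml import etree

# Request headers copied from a browser session; substitute your own cookie value.
headers = {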
            'Host':'movie.douban.com',
            'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
            'cookie':'bid=uVCOdCZRTrM; douban-fav-remind=1; __utmz=30149280.1603808051.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __gads=ID=7ca757265e2366c5-22ded2176ac40059:T=1603808052:RT=1603808052:S=ALNI_MYZsGZJ8XXb1oU4zxzpMzGdK61LFA; dbcl2="165593539:LvLaPIrgug0"; push_doumail_num=0; push_noty_num=0; __utmv=30149280.16559; ll="118288"; __yadk_uid=DnUc7ftXIqYlQ8RY6pYmLuNPqYp5SFzc; _vwo_uuid_v2=D7ED984782737D7813CC0049180E68C43|1b36a9232bbbe34ac958167d5bdb9a27; ct=y; ck=ZbYm; __utmc=30149280; __utmc=223695111; __utma=30149280.1867171825.1603588354.1613363321.1613372112.11; __utmt=1; __utmb=30149280.2.10.1613372112; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1613372123%2C%22https%3A%2F%2Fwww.douban.com%2Fmisc%2Fsorry%3Foriginal-url%3Dhttps%253A%252F%252Fmovie.douban.com%252Fsubject%252F34841067%252F%253Ffrom%253Dplaying_poster%22%5D; _pk_ses.100001.4cf6=*; __utma=223695111.788421403.1612839506.1613363340.1613372123.9; __utmb=223695111.0.10.1613372123; __utmz=223695111.1613372123.9.4.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/misc/sorry; _pk_id.100001.4cf6=e2e8bde436a03ad7.1612839506.9.1613372127.1613363387.',
        }
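url = "https://movie.douban.com/cinema/nowplaying/zhanjiang/"
r = requests.get(url, headers=headers)
# r.encoding only affects r.text, so it is not needed when parsing r.content.
selector = etree.HTML(r.content)
li_list = selector.xpath('//*[@id="nowplaying"]/div[2]/ul/li')

# Maps movie title -> rating. The name shadows the built-in dict;
# it is kept because the later snippets refer to it.
dict = {}
for item in li_list:
    name = item.xpath('.//*[@class="stitle"]/a/@title')[0].replace(" ", "").replace("\n", "")
    rate_nodes = item.xpath('.//*[@class="subject-rate"]/text()')
    if not rate_nodes:
        # Guard against entries that have no rating element.
        continue
    rate = rate_nodes[0].replace(" ", "").replace("\n", "")
    dict[name] = float(rate)
    print("Movie=" + name)
    print("Rating=" + rate)
    print("-------")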

 

 

The movie names and ratings have now been crawled. They will be sorted in descending order and visualized later.

2. Movie duration and genre

Page analysis

In the page source, the movie duration is in the element with property="v:runtime", and the movie genres are in the elements with property="v:genre".
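
As a rough illustration of these attribute-based lookups, the sketch below runs the same style of XPath query against a small hand-written fragment. The fragment and its values are made up for the example (only the property names come from the page). It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; ElementTree supports only a subset of XPath and needs well-formed markup, so real Douban pages are still better handled with lxml.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the relevant part of a movie page (illustrative only).
SAMPLE = """
<div id="info">
  <span property="v:genre">喜剧</span>
  <span property="v:genre">悬疑</span>
  <span property="v:runtime" content="136">136分钟</span>
</div>
"""


def extract_runtime_and_genres(markup):
    """Return (runtime_texts, genre_texts) found via the property attribute."""
    root = ET.fromstring(markup.strip())
    runtimes = [el.text for el in root.findall(".//*[@property='v:runtime']")]
    genres = [el.text for el in root.findall(".//*[@property='v:genre']")]
    return runtimes, genres


runtimes, genres = extract_runtime_and_genres(SAMPLE)
print(runtimes)
print(genres)
```

Both lookups return lists, because a page can carry several elements with the same property value (a movie usually has more than one genre).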

Implementation

### Duration and genre
def getmovietime():
    url = "https://movie.douban.com/cinema/nowplaying/zhanjiang/"
    r = requests.get(url, headers=headers)
    selector = etree.HTML(r.content)
    li_list = selector.xpath('//*[@id="nowplaying"]/div[2]/ul/li')

    for item in li_list:
        title = item.xpath('.//*[@class="stitle"]/a/@title')[0].replace(" ", "").replace("\n", "")
        href = item.xpath('.//*[@class="stitle"]/a/@href')[0].replace(" ", "").replace("\n", "")

        # Fetch the movie's detail page and read its runtime and genres.
        detail = requests.get(href, headers=headers)
        detail_selector = etree.HTML(detail.content)
        times = detail_selector.xpath('//*[@property="v:runtime"]/text()')
        genres = detail_selector.xpath('//*[@property="v:genre"]/text()')

        print(title)
        print(times)
        print(genres)
        print("-------")
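
The function above only prints the raw strings. For the charts later on, the durations need to be numbers and the genres need to be counted. One possible way to do that is sketched below; the helper names runtime_to_minutes and count_genres and the sample records are assumptions for illustration, not part of the original code.

```python
import re
from collections import Counter


def runtime_to_minutes(text):
    """Extract the first number from a runtime string such as '136分钟'."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def count_genres(genre_lists):
    """Count how many movies list each genre."""
    counts = Counter()
    for genres in genre_lists:
        counts.update(set(genres))  # a movie counts once per genre
    return counts


# Hypothetical records in the shape printed by getmovietime(): (title, times, genres).
records = [
    ("movie A", ["136分钟"], ["喜剧", "悬疑"]),
    ("movie B", ["99分钟"], ["喜剧", "动画"]),
]

durations = {title: runtime_to_minutes(times[0]) for title, times, _ in records if times}
genre_counts = count_genres(genres for _, _, genres in records)
print(durations)
print(genre_counts.most_common())
```

The resulting mappings (title to minutes, genre to count) have the same shape as the title-to-rating mapping built earlier, so they can be sorted and plotted the same way.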

 

 

3. Comment data

Page analysis

Inspecting the page source shows that the comment data sits in the tag <div class="mod-bd" id="comments">.

(If you are unsure how to analyze the page, see the previous article "Python crawling 44130 user viewing data, analyzing and mining the hidden information between users and movies!", which also analyzes Douban movies and gives a detailed introduction.)

Now let's crawl the comment data for these seven movies!

Implementation

### Comment data
def getmoviecomment():
    url = "https://movie.douban.com/cinema/nowplaying/zhanjiang/"
    r = requests.get(url, headers=headers)
    selector = etree.HTML(r.content)
    li_list = selector.xpath('//*[@id="nowplaying"]/div[2]/ul/li')

    for item in li_list:
        title = item.xpath('.//*[@class="stitle"]/a/@title')[0].replace(" ", "").replace("\n", "")
        href = item.xpath('.//*[@class="stitle"]/a/@href')[0].replace(" ", "").replace("\n", "").replace("/?from=playing_poster", "")
        print("Movie=" + title)
        print("Link=" + href)
        # Append each movie's comments to its own text file.
        with open(title + ".txt", "a+", encoding='utf-8') as f:
            # 10 pages of 20 comments each
            for k in range(0, 200, 20):
                page_url = href + "/comments?start=" + str(k) + "&limit=20&status=P&sort=new_score"
                page = requests.get(page_url, headers=headers)
                page_selector = etree.HTML(page.content)
                # @class is compared as an exact string, so the trailing space is part of the match.
                comment_items = page_selector.xpath('//*[@class="comment-item "]')
                for comment in comment_items:
                    texts = comment.xpath('.//*[@class="short"]/text()')
                    if texts:
                        f.write(str(texts[0]) + "\n")

        print("-------")
        # Pause between movies to go easy on the site.
        time.sleep(4)
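
The comment pages are addressed through the start query parameter, which advances in steps of 20. The sketch below builds the same list of page URLs with urllib.parse.urlencode; the function name comment_page_urls and its default arguments are assumptions chosen to mirror the loop above, and the subject URL is only an example.

```python
from urllib.parse import urlencode


def comment_page_urls(subject_url, total=200, page_size=20):
    """Build the comment-page URLs for one movie, page_size comments per page."""
    base = subject_url.rstrip("/") + "/comments"
    urls = []
    for start in range(0, total, page_size):
        query = urlencode({
            "start": start,
            "limit": page_size,
            "status": "P",
            "sort": "new_score",
        })
        urls.append(base + "?" + query)
    return urls


urls = comment_page_urls("https://movie.douban.com/subject/34841067")
print(len(urls))
print(urls[0])
print(urls[-1])
```

Building the URLs up front makes it easy to check the paging logic without sending any requests.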

 

 

 

The comments are saved to text files, one per movie, and are then visualized with a different shape for each movie.

 

Data visualization

1. Visualization of rating data

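### Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

font_size = 10  # font size
fig_size = (13, 10)  # figure size

# Sort the crawled ratings in descending order.
sorted_ratings = sorted(dict.items(), key=lambda kv: kv[1], reverse=True)
itemNames = [name for name, _ in sorted_ratings]
datas = [rate for _, rate in sorted_ratings]

data = ([datas])

# Update the font size
mpl.rcParams['font.size'] = font_size
# Update the figure size
mpl.rcParams['figure.figsize'] = fig_size
# Chinese tick labels need a CJK-capable font; SimHei is an assumption, use any installed one.
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Bar width
bar_width = 0.35

index = np.arange(len(data[0]))
# Draw the ratings
rects1 = plt.bar(index, data[0], bar_width, color='#0072BC', label='Rating')

# X-axis tick labels, centered under the bars
plt.xticks(index, itemNames)
# Y-axis range
plt.ylim(0, 10)
# Chart title
plt.title('Douban ratings')
# Show the legend below the chart
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.03), fancybox=True, ncol=5)


# Add data labels
def add_labels(rects):
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height, str(height), ha='center', va='bottom')
        # White bar edges, purely cosmetic
        rect.set_edgecolor('white')


add_labels(rects1)

# Save the chart locally
plt.savefig('豆瓣评分.png')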

 

 

Among the seven hits, "Hello, Li Huanying" has the highest rating (8.3) and "Detective Chinatown 3" has the lowest (5.8). This is a bit unexpected, since "Detective Chinatown 3" drew far more buzz than "Hello, Li Huanying".

2. Visualization of duration and genre

Duration data visualization

 

##### Duration visualization
# At this point itemNames is expected to hold the movie titles and datas the durations in minutes.
itemNames.reverse()
datas.reverse()

# Plot.
fig, ax = plt.subplots()
b = ax.barh(range(len(itemNames)), datas, color='#6699CC')

# Add a data label to the right of each horizontal bar.
for rect in b:
    w = rect.get_width()
    ax.text(w, rect.get_y() + rect.get_height() / 2, '%d' %
            int(w), ha='left', va='center')

# Set the tick labels on the Y axis.
ax.set_yticks(range(len(itemNames)))
ax.set_yticklabels(itemNames)
plt.title('Movie duration (minutes)', loc='center', fontsize='15',
          fontweight='bold', color='red')

#plt.show()
plt.savefig("电影时长(分钟).png")

 

 

Most of the movies run for about 120 minutes.

The longest is "Detective Chinatown 3" (136 minutes) and the shortest is "Bear Infested Wild Continent" (99 minutes).

 

Movie genre data visualization

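##### 2. Genre visualization
# dict is expected to map each genre to the number of movies in it.
### Sort in ascending order of count
sorted_items = sorted(dict.items(), key=lambda kv: (kv[1], kv[0]))
print(sorted_items)

# Walk the sorted list backwards so the most common genre comes first.
itemNames = []
datas = []
for i in range(len(sorted_items) - 1, -1, -1):
    itemNames.append(sorted_items[i][0])
    datas.append(sorted_items[i][1])

plt.figure()  # start a new figure so earlier charts are not drawn over
x = range(len(itemNames))
plt.plot(x, datas, marker='o', mec='r', mfc='w', label='Movie genre')
plt.legend()  # show the legend
plt.xticks(x, itemNames, rotation=45)
plt.margins(0)
plt.subplots_adjust(bottom=0.15)
plt.xlabel("Genre")  # X-axis label
plt.ylabel("Count")  # Y-axis label
plt.title("Movie genre statistics")  # title
plt.savefig("电影类型统计.png")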

 

 

Counting the genres of these seven movies (some movies belong to several genres, such as 'action', 'fantasy', and 'adventure') shows that four of the seven are comedies, while science fiction, crime, suspense, and adventure each appear once.

 

3. Word cloud visualization of comment data

Seven different shapes are used for the word clouds, so the drawing code is wrapped in a function.

 

#### Word cloud code
import jieba
from stylecloud import gen_stylecloud

# Font Awesome icon used as the cloud shape for each movie number.
ICONS = {
    "1": 'fab fa-envira',    # leaf (envira is listed under the brand icons, hence fab)
    "2": 'fas fa-dragon',    # dragon
    "3": 'fas fa-dog',       # dog
    "4": 'fas fa-cat',       # cat
    "5": 'fas fa-dove',      # dove
    "6": 'fab fa-qq',        # QQ
    "7": 'fas fa-cannabis',  # leaf
}


def jieba_cloud(file_name, icon):
    with open(file_name, 'r', encoding='utf8') as f:
        text = f.read()
    text = text.replace('\n', "").replace("\u3000", "").replace(",", "").replace("。", "")
    word_list = jieba.cut(text)
    result = " ".join(word_list)  # join the tokens with spaces
    # str() lets the caller pass the icon number as either an int or a string.
    icon_name = ICONS.get(str(icon), "")
    picp = file_name.split('.')[0] + '.png'
    if icon_name:
        gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc', output_name=picp)  # a Chinese font is required, otherwise the text renders incorrectly
    else:
        gen_stylecloud(text=result, font_path='simsun.ttc', output_name=picp)  # a Chinese font is required, otherwise the text renders incorrectly

    return picp
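
A word cloud sizes each word by how often it occurs, so a quick frequency count of the segmented text gives a preview of what the image will emphasize. The sketch below assumes the text has already been segmented and joined with spaces, as result is in jieba_cloud; the function name top_words, the minimum-length filter, and the sample string are assumptions for illustration.

```python
from collections import Counter


def top_words(segmented_text, n=5, min_length=2):
    """Return the n most frequent tokens that are at least min_length characters long."""
    tokens = [tok for tok in segmented_text.split() if len(tok) >= min_length]
    return Counter(tokens).most_common(n)


# Hypothetical space-separated output of " ".join(jieba.cut(text)).
sample = "电影 很 好看 剧情 感人 电影 好看 电影 的 演员"
print(top_words(sample, n=2))
```

Dropping single-character tokens is one simple way to keep particles such as 的 from dominating the count; whether to apply a similar filter before generating the cloud is a judgment call.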

 

Now draw the word clouds for the comments on these seven movies.

 

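### Word clouds for the comment data
def commentanalysis():
    lists = ['刺杀小说家','你好,李焕英','人潮汹涌','侍神令','唐人街探案3','新神榜:哪吒重生','熊出没·狂野大陆']
    for i in range(0, len(lists)):
        title = lists[i] + ".txt"
        # Pass the icon number as a string, which is what jieba_cloud matches on.
        jieba_cloud(title, str(i + 1))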

Without further ado, let's start the "Word Cloud Show"!

 

1. Assassinate a novelist (刺杀小说家)

2. Crowd (人潮汹涌)

3. Bear Infested Wild Continent (熊出没·狂野大陆)

4. New God List: Nezha Rebirth (新神榜:哪吒重生)

5. Detective Chinatown 3 (唐人街探案3)

6. Hello, Li Huanying (你好,李焕英)

7. Order of Samurai (侍神令)

Origin blog.csdn.net/m0_48405781/article/details/113886157