Python data analysis case 30 - Analysis of China's high-grossing movies (the whole pipeline: crawling the data, then cleaning, analysis, and visualization)

Case background

Recently I keep seeing headlines that "Lost in the Stars" has broken yet another box-office milestone, and that "No More Bets" has done the same...

So I wanted to scrape the data on China's highest-grossing movies myself and analyze it.

The data comes from Maoyan: the total box office ranking at piaofang.maoyan.com.

Let's scrape it.


Code

First, the crawler fetches the data.

Data collection

Import the packages:

import requests
import pandas as pd
from bs4 import BeautifulSoup

Set the target URL and request headers:

url = 'https://piaofang.maoyan.com/rankings/year'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.62'}
response = requests.get(url, headers=headers)
response.status_code


A status code of 200 means the page was fetched successfully.

Next, parse the HTML and extract the movie data:

%%time
soup = BeautifulSoup(response.text, 'html.parser')
rank_div = soup.find('div', id='ranks-list')   # the node holding the ranking list
movie_list = []

for ul_tag in rank_div.find_all('ul', class_='row'):
    movie_info = {}
    li_tags = ul_tag.find_all('li')
    movie_info['序号'] = li_tags[0].text
    movie_info['标题'] = li_tags[1].find('p', class_='first-line').text
    movie_info['上映日期'] = li_tags[1].find('p', class_='second-line').text
    movie_info['票房(亿)'] = f'{(float(li_tags[2].text)/10000):.2f}'   # source unit is 万 (10k), convert to 亿 (100M)
    movie_info['平均票价'] = li_tags[3].text
    movie_info['平均人次'] = li_tags[4].text
    movie_list.append(movie_info)

Data acquisition complete! View the list of dictionaries:

movie_list

Looks clean and consistent. Now turn it into a DataFrame and check the first three rows:

movies=pd.DataFrame(movie_list)
movies.head(3)

Now some cleaning. The release-date column contains the word "上映" ("released"), which we strip before converting the column to datetime; the box office, ticket price, and attendance columns are converted to numeric types.

We keep only the top 250 films by box office: Douban has its Top 250, so a "China box office 250" seems fitting.

We also extract year and month from the release date for the later analysis.

# cleaning
movies = movies.set_index('序号').loc[:'250', :]   # keep ranks 1 through 250 (label slice, end inclusive)
movies['上映日期'] = pd.to_datetime(movies['上映日期'].str.replace('上映', ''))
movies[['票房(亿)', '平均票价', '平均人次']] = movies[['票房(亿)', '平均票价', '平均人次']].astype(float)
movies['年份'] = movies['上映日期'].dt.year
movies['月份'] = movies['上映日期'].dt.month
movies.head(2)
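One subtlety worth noting: after `set_index('序号')` the index holds strings, and `.loc[:'250']` is a label-based slice, which, unlike positional slicing, includes the end label. A minimal sketch with made-up values:

```python
import pandas as pd

# toy frame with a string index, mimicking the '序号' (rank) column
df = pd.DataFrame({'票房(亿)': [1.0, 2.0, 3.0, 4.0]}, index=['1', '2', '3', '4'])

sub = df.loc[:'3']   # label-based slice: rows '1' through '3', end label INCLUDED
print(len(sub))      # 3 rows, not 2
```

This is why slicing to '250' keeps exactly 250 rows rather than 249.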

With the data ready, let's start plotting and analyzing!


Drawing analysis

Import the plotting packages:

import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = 'SimHei'   # display Chinese characters
plt.rcParams['axes.unicode_minus'] = False   # render minus signs correctly

Draw a bar chart of the top 20 movies by box office:

top_movies = movies.nlargest(20, '票房(亿)')
plt.figure(figsize=(7, 4),dpi=128)
ax = sns.barplot(x='票房(亿)', y='标题', data=top_movies, orient='h',alpha=0.5)
#plt.xticks(rotation=80, ha='center')

# annotate the value at the end of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_width():.2f}', (p.get_width(), p.get_y() + p.get_height() / 2.),
                va='center', fontsize=8, color='gray', xytext=(5, 0),
                textcoords='offset points')

plt.title('票房前20的电影')
plt.xlabel('票房数量(亿)')
plt.ylabel('电影名称')
plt.tight_layout()
plt.show()

Not bad, quite interesting. You can see the names and box-office totals of the 20 highest-grossing movies in Chinese film history.

Next, analyze the average ticket price and the average attendance per screening:

plt.figure(figsize=(7, 6),dpi=128)
# first subplot: scatter of average ticket price by year
plt.subplot(2, 2, 1)
sns.scatterplot(y='平均票价', x='年份', data=movies,c=movies['年份'],cmap='plasma')
plt.title('平均票价点图')
plt.ylabel('平均票价')
#plt.xticks([])

plt.subplot(2, 2, 2)
sns.boxplot(y='平均票价', data=movies)
plt.title('平均票价箱线图')
plt.xlabel('平均票价')

plt.subplot(2, 2, 3)
sns.scatterplot(y='平均人次', x='年份', data=movies,c=movies['年份'],cmap='plasma')
plt.title('平均人次点图')
plt.ylabel('平均人次')

plt.subplot(2, 2, 4)
sns.boxplot(y='平均人次', data=movies)
plt.title('平均人次箱线图')
plt.xlabel('平均人次')
plt.tight_layout()
plt.show()

Look at the box plots on the right first: both average ticket price and average attendance contain some outliers. The scatter plots against year on the left show the trend in detail: over the years, the average attendance per screening keeps falling while the average ticket price keeps rising. In other words, recent movies are more expensive than older ones, and fewer people watch each screening... This also reflects some of the chaos in our country's film industry, such as inflated ticket prices and "ghost screenings" that pad box-office numbers.
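For readers who prefer to flag those outliers programmatically rather than eyeball the box plots, here is a small sketch using the usual 1.5×IQR rule (the same threshold a box plot's whiskers are based on); the prices below are made up:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

prices = pd.Series([35, 38, 40, 42, 45, 120])   # hypothetical average ticket prices
print(iqr_outliers(prices).tolist())            # -> [120]
```

Applied to `movies['平均票价']` or `movies['平均人次']`, the same function would list the exact films behind the dots outside the whiskers.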

I noticed that before 2000 there was a movie with a particularly high attendance per screening and a very low ticket price. Curious which movie it was, I checked:

movies[movies['年份']<2000]

It turns out to be the nationwide phenomenon "Titanic". Fair enough; it deserves its reputation.

Number of high-grossing movies in different years:

plt.figure(figsize=(7, 3), dpi=128)
year_count = movies['年份'].value_counts().sort_index()
sns.lineplot(x=year_count.index, y=year_count.values, marker='o', lw=1.5, markersize=3)
plt.fill_between(year_count.index, 0, year_count, color='lightblue', alpha=0.8)
plt.title('不同年份高票房电影数量')
plt.xlabel('年份')
plt.ylabel('电影数量')
# label each data point with its value
for x, y in zip(year_count.index, year_count.values):
    plt.text(x, y+0.2, str(y), ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.show()

We can see that China's high-grossing movies began to grow rapidly in 2010 and peaked in 2017, the year the famous "Wolf Warrior 2" was released. The count dipped slightly in 2018 and 2019, then fell off a cliff in 2020. Why? The pandemic, of course.

Analysis of the percentage share of high-grossing movies in different months:

plt.figure(figsize=(4, 4),dpi=128)
month_count = movies['月份'].value_counts(normalize=True).sort_index()
# draw the pie chart
sns.set_palette("Set3")
plt.pie(month_count, labels=month_count.index, autopct='%.1f%%', startangle=140, counterclock=False,
        wedgeprops={'alpha': 0.9})
plt.axis('equal')  # keep the pie a perfect circle
plt.text(-0.3,1.2,'不同月份高票房电影数量',fontsize=8)
plt.tight_layout()
plt.show()

We can see that high-grossing movies are concentrated mainly in February, July, and December.

The reason is simple: the Spring Festival falls in February, the summer holiday in July, and the New Year season in December, and big movies like to open in those windows.


Custom evaluation indicators

All the analysis above used raw box office. But does a high box office really reflect how many people watched a movie? Is it necessarily a good movie that audiences liked?

The data is limited: we cannot remove factors such as marketing, timing, directors, and social mood, but we can control for ticket price to some extent, since a high box office may simply come from expensive tickets. So we use "box office / average ticket price" and then take a weighted sum with "average attendance".

"Box office / average ticket price" approximates the number of people who watched the movie and gets a weight of 70%; average attendance gets a weight of 30%. The two columns are first standardized onto a common scale, and the weighted sum becomes our own evaluation metric:
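On toy numbers (all values hypothetical), the metric works like this; the manual z-score below matches what sklearn's `StandardScaler` computes, since both use the population standard deviation:

```python
import numpy as np

viewers = np.array([10.0, 5.0, 2.0])     # 票房 / 平均票价 for three made-up films
per_show = np.array([50.0, 80.0, 20.0])  # 平均人次 for the same films

def z(x):
    return (x - x.mean()) / x.std()      # population std, like StandardScaler

score = 0.7 * z(viewers) + 0.3 * z(per_show)   # the 70/30 weighted sum
print(score.argmax())                    # film 0 ranks first on the combined metric
```

Standardizing first matters: without it, whichever column has the larger numeric range would dominate the sum regardless of the weights.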


To make standardization easy, we first import the scaler from sklearn, a machine-learning library:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Compute the metric:

movies['我的评价指标'] = movies['票房(亿)'] / movies['平均票价']       # proxy for number of viewers
data1 = scaler.fit_transform(movies[['我的评价指标', '平均人次']])     # standardize both columns
movies['我的评价指标'] = 0.7 * data1[:, 0] + 0.3 * data1[:, 1]         # weighted sum
movies = movies.sort_values(by='我的评价指标', ascending=False)

Plot and take a look:

my_top_movies = movies.nlargest(20, '我的评价指标')
plt.figure(figsize=(7, 4),dpi=128)
ax = sns.barplot(x='我的评价指标', y='标题', data=my_top_movies, orient='h',alpha=0.6,palette='rainbow_r')
#plt.xticks(rotation=80, ha='center')

# annotate the value at the end of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_width():.2f}', (p.get_width(), p.get_y() + p.get_height() / 2.),
                va='center', fontsize=8, color='gray', xytext=(5, 0),
                textcoords='offset points')

plt.title('前20电影')
plt.xlabel('我的评价指标')
plt.ylabel('电影名称')
plt.tight_layout()
plt.show()

Compare this with the earlier box-office top 20, so we can see which films may be overpriced and which may be undervalued.

def get_unique_elements(list1, list2):
    # unique elements of each list, plus their intersection
    set1, set2 = set(list1), set(list2)
    unique_to_list1 = list(set1 - set2)
    unique_to_list2 = list(set2 - set1)
    common_elements = list(set1 & set2)
    return unique_to_list1, common_elements, unique_to_list2
票价过高的电影,确实是好电影,被低估的电影=get_unique_elements(top_movies['标题'].to_list(), my_top_movies['标题'].to_list())

This function returns the elements unique to the first list, the elements common to both lists, and the elements unique to the second list.

If a movie is in the box-office top 20 and also in the top 20 of our metric, it really is a good movie. If it is in the box-office top 20 but not in our metric's top 20, it may be an overpriced "inflated" movie.
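A quick toy run of the same set logic, with hypothetical titles standing in for the real lists:

```python
box_top = ['A', 'B', 'C']   # pretend box-office top list
my_top = ['B', 'C', 'D']    # pretend metric top list

overpriced = sorted(set(box_top) - set(my_top))    # high box office only
solid = sorted(set(box_top) & set(my_top))         # high on both lists
undervalued = sorted(set(my_top) - set(box_top))   # high metric only
print(overpriced, solid, undervalued)              # ['A'] ['B', 'C'] ['D']
```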

print(f'票价过高的电影:{票价过高的电影},\n\n确实是好电影:{确实是好电影},\n\n低估的电影:{被低估的电影}')

Movies with inflated ticket prices: ['The Eight Hundred', 'My People, My Homeland', 'Moon Man', 'The Wandering Earth 2']... emmmm

I don’t know much about these movies, so I can’t comment on them...

Word cloud

Let's finish with a word cloud chart to make things look nicer.

First, define a random color function:

import numpy as np
def randomcolor():
    colorArr = list('0123456789ABCDEF')   # all 16 hex digits
    color = "#" + ''.join([np.random.choice(colorArr) for i in range(6)])
    return color
[randomcolor() for i in range(3)]

Then draw the word cloud. A mask is used here, and the original mask image is shaped like a six-pointed star: ——

from wordcloud import WordCloud
from matplotlib import colors
from imageio.v2 import imread    # for reading the mask image
mask = imread('词云.png')        # the mask defines the cloud's shape

# map 标题 -> 票房 so box office drives the word sizes
word_freq = dict(zip(movies['标题'], movies['票房(亿)']))
color_list=[randomcolor() for i in range(20)]

wordcloud = WordCloud(width=1000, height=500, background_color='white',font_path='simhei.ttf',
                      max_words=50, max_font_size=50,random_state=42,mask = mask,
                          colormap=colors.ListedColormap(color_list)).generate_from_frequencies(word_freq)

plt.figure(figsize=(10, 5),dpi=256)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()



Summary

This walkthrough covered the whole pipeline: crawling the data, cleaning and organizing it, computing a custom metric, and visual analysis. Add a few more charts, some text-analysis angles, and a model or two, and it is roughly the workload of a typical undergraduate thesis.

Origin blog.csdn.net/weixin_46277779/article/details/132620596