Crawling movie data with Python and analyzing it visually

1. Obtain data

1. Technical tools

IDE: VS Code

Send request: requests

Parsing tool: xpath (via lxml)

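import csv

import requests
from lxml import etree

# Base_Url, Headers and spider() are defined elsewhere in the script and are not shown here.
# Fetch one detail page, parse its fields, and append them to the CSV as one row.
def Get_Detail(Details_Url):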
    Detail_Url = Base_Url + Details_Url
    One_Detail = requests.get(url=Detail_Url, headers=Headers)
    One_Detail_Html = One_Detail.content.decode('gbk')
    Detail_Html = etree.HTML(One_Detail_Html)
    Detail_Content = Detail_Html.xpath("//div[@id='Zoom']//text()")
    Video_Name_CN,Video_Name,Video_Address,Video_Type,Video_language,Video_Date,Video_Number,Video_Time,Video_Daoyan,Video_Yanyuan_list = None,None,None,None,None,None,None,None,None,None
    for index, info in enumerate(Detail_Content):
        if info.startswith('◎译  名'):
            Video_Name_CN = info.replace('◎译  名', '').strip()
        if info.startswith('◎片  名'):
            Video_Name = info.replace('◎片  名', '').strip()
        if info.startswith('◎产  地'):
            Video_Address = info.replace('◎产  地', '').strip()
        if info.startswith('◎类  别'):
            Video_Type = info.replace('◎类  别', '').strip()
        if info.startswith('◎语  言'):
            Video_language = info.replace('◎语  言', '').strip()
        if info.startswith('◎上映日期'):
            Video_Date = info.replace('◎上映日期', '').strip()
        if info.startswith('◎豆瓣评分'):
            Video_Number = info.replace('◎豆瓣评分', '').strip()
        if info.startswith('◎片  长'):
            Video_Time = info.replace('◎片  长', '').strip()
        if info.startswith('◎导  演'):
            Video_Daoyan = info.replace('◎导  演', '').strip()
        if info.startswith('◎主  演'):
            Video_Yanyuan_list = []
            Video_Yanyuan = info.replace('◎主  演', '').strip()
            Video_Yanyuan_list.append(Video_Yanyuan)
            for x in range(index + 1, len(Detail_Content)):
                actor = Detail_Content[x].strip()
                if actor.startswith("◎"):
                    break
                Video_Yanyuan_list.append(actor)
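    print(Video_Name_CN,Video_Date,Video_Time)
    try:
        csvwriter.writerow((Video_Name_CN,Video_Name,Video_Address,Video_Type,Video_language,Video_Date,Video_Number,Video_Time,Video_Daoyan,Video_Yanyuan_list))
        f.flush()  # flush after writing so the row reaches disk right away
    except Exception as e:
        # skip rows that cannot be written, but report them instead of failing silently
        print('skipped row:', e)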
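
The prefix-matching loop above can also be written as a table-driven function that is easy to test without network access. The following is only a minimal sketch, assuming the detail page text has already been extracted into a list of strings. `FIELD_PREFIXES`, `ACTOR_PREFIX` and `parse_fields` are hypothetical names that are not part of the original script, and the sample values are made up.

```python
# Hypothetical helper: maps each "◎" prefix on the detail page to a CSV field name.
FIELD_PREFIXES = {
    '◎译  名': 'Video_Name_CN',
    '◎片  名': 'Video_Name',
    '◎产  地': 'Video_Address',
    '◎类  别': 'Video_Type',
    '◎语  言': 'Video_language',
    '◎上映日期': 'Video_Date',
    '◎豆瓣评分': 'Video_Number',
    '◎片  长': 'Video_Time',
    '◎导  演': 'Video_Daoyan',
}
ACTOR_PREFIX = '◎主  演'


def parse_fields(lines):
    """Return a dict of field name -> value; fields that are not found stay None."""
    record = {name: None for name in FIELD_PREFIXES.values()}
    record['Video_Yanyuan_list'] = None
    for index, info in enumerate(lines):
        for prefix, name in FIELD_PREFIXES.items():
            if info.startswith(prefix):
                record[name] = info[len(prefix):].strip()
        if info.startswith(ACTOR_PREFIX):
            # The first actor shares the line with the prefix; the others follow
            # on separate lines until the next "◎" field starts.
            actors = [info[len(ACTOR_PREFIX):].strip()]
            for extra in lines[index + 1:]:
                extra = extra.strip()
                if extra.startswith('◎'):
                    break
                if extra:  # unlike the original loop, empty lines are skipped
                    actors.append(extra)
            record['Video_Yanyuan_list'] = actors
    return record


# Made-up sample lines in the same layout as the detail page text.
sample = [
    '◎译  名' + ' 示例电影',
    '◎片  长' + ' 120分钟',
    ACTOR_PREFIX + ' 演员甲',
    '  演员乙',
    '◎简  介',
]
parsed = parse_fields(sample)
print(parsed['Video_Name_CN'], parsed['Video_Time'], parsed['Video_Yanyuan_list'])
```

Keeping the parsing separate from the request and the CSV write means the field logic can be checked on saved page text before running the full crawl.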

Save data: csv

if __name__ == '__main__':
    with open('movies.csv','a',encoding='utf-8',newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow(('Video_Name_CN','Video_Name','Video_Address','Video_Type','Video_language','Video_Date','Video_Number','Video_Time','Video_Daoyan','Video_Yanyuan_list'))
        spider(117)
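
Because the file is opened in append mode ('a'), the header row above is written again on every run. Here is a minimal sketch of one way to avoid that, assuming the header should appear only when the file is new or empty. `open_movie_csv` is a hypothetical helper that is not in the original script, and the demo writes to a temporary directory rather than to movies.csv.

```python
import csv
import os
import tempfile

FIELDS = ('Video_Name_CN', 'Video_Name', 'Video_Address', 'Video_Type',
          'Video_language', 'Video_Date', 'Video_Number', 'Video_Time',
          'Video_Daoyan', 'Video_Yanyuan_list')


def open_movie_csv(path):
    """Open path for appending and write the header only if the file is empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    f = open(path, 'a', encoding='utf-8', newline='')
    writer = csv.writer(f)
    if is_new:
        writer.writerow(FIELDS)
    return f, writer


# Demo: two "runs" against the same file produce a single header row.
demo_path = os.path.join(tempfile.mkdtemp(), 'movies_demo.csv')
for run in range(2):
    f, writer = open_movie_csv(demo_path)
    writer.writerow(['示例电影'] + [''] * (len(FIELDS) - 1))  # made-up row
    f.close()

with open(demo_path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(len(rows))
```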

2. Crawl target

The target site for this crawl is 阳光电影网 (https://www.ygdy8.net), scraped with requests and xpath. The main goal is to collect movie data from 2016 to 2023.

3. Field information

The crawled fields are the translated title, original title, origin, category, language, release date, Douban rating, runtime, director, and starring cast. They are described below:

Field name            Meaning
Video_Name_CN         translated title
Video_Name            original title
Video_Address         country or region of origin
Video_Type            category
Video_language        language
Video_Date            release date
Video_Number          Douban rating
Video_Time            runtime
Video_Daoyan          director
Video_Yanyuan_list    list of starring actors

2. Data preprocessing

Technical tool: jupyter notebook

1. Load data

First, use pandas to read the movie data collected by the crawler.
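
The loading code was only shown as a screenshot. The following minimal sketch shows what this step might look like, assuming the utf-8 CSV written by the crawler. A made-up in-memory sample with a subset of the columns stands in for movies.csv so that the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up sample in the crawler's column layout (subset of columns).
sample_csv = io.StringIO(
    "Video_Name_CN,Video_Date,Video_Number,Video_Time\n"
    "示例电影A,2019-01-01(中国),7.5/10 from 100 users,120分钟\n"
    "示例电影B,2020-05-01(美国),,95分钟\n"
)

# With the real file this would be: data = pd.read_csv('movies.csv', encoding='utf-8')
data = pd.read_csv(sample_csv)
print(data.shape)
data.head()
```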

2. Outlier processing

The anomalies handled here are missing values and duplicate rows.

First, check how many values are missing in each field of the raw data.
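
The check itself was only shown as a screenshot. A minimal sketch of the usual pandas idiom for it, run on a made-up frame rather than the crawled data:

```python
import pandas as pd

# Made-up sample standing in for the crawled data.
data = pd.DataFrame({
    'Video_Name_CN': ['示例电影A', '示例电影B', None],
    'Video_Number': ['7.5/10', None, None],
})

# Count the missing values in each column.
missing = data.isnull().sum()
print(missing)
```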

The results show that many values are missing. For simplicity, rows with missing values are dropped, and duplicate rows are removed in the same step.
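
A minimal sketch of this deletion step, assuming that any row with a missing field is dropped and that only fully identical rows count as duplicates. The sample frame is made up.

```python
import pandas as pd

# Made-up sample: one duplicate row and one row with a missing title.
data = pd.DataFrame({
    'Video_Name_CN': ['示例电影A', '示例电影A', '示例电影B', None],
    'Video_Time': ['120分钟', '120分钟', '95分钟', '100分钟'],
})

data = data.dropna()           # drop rows with any missing field
data = data.drop_duplicates()  # drop rows identical to an earlier row
print(len(data))
```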

After this step, 1711 records remain.

3. Field processing

The fields in the raw crawled data are messy and contain many separators such as "/" and ",". They are cleaned in a uniform way here, mainly with the pandas apply() function. Because the analysis covers movies from 2016 to 2023, records from other years are deleted.

# Data preprocessing
data['Video_Name_CN'] = data['Video_Name_CN'].apply(lambda x:x.split('/')[0]) # keep the first translated title
data['Video_Name'] = data['Video_Name'].apply(lambda x:x.split('/')[0]) # keep the first original title
data['Video_Address'] = data['Video_Address'].apply(lambda x:x.split('/')[0])  # keep the first country or region
data['Video_Address'] = data['Video_Address'].apply(lambda x:x.split(',')[0].strip())
data['Video_language'] = data['Video_language'].apply(lambda x:x.split('/')[0]) # keep the first language
data['Video_language'] = data['Video_language'].apply(lambda x:x.split(',')[0])
data['Video_Date'] = data['Video_Date'].apply(lambda x:x.split('(')[0].strip()) # drop everything from "(" onward
data['year'] = data['Video_Date'].apply(lambda x:x.split('-')[0]) # release year, kept as a string
data['Video_Number'] = data['Video_Number'].apply(lambda x:x.split('/')[0].strip()) # keep the score before "/"
data['Video_Number'] = pd.to_numeric(data['Video_Number'],errors='coerce')
data['Video_Time'] = data['Video_Time'].apply(lambda x:x.split('分钟')[0]) # strip the "分钟" (minutes) unit
data['Video_Time'] = pd.to_numeric(data['Video_Time'],errors='coerce')
data['Video_Daoyan'] = data['Video_Daoyan'].apply(lambda x:x.split()[0]) # keep the first whitespace-separated token
# drop movies released in 2013, 2014 and 2015, which fall outside the analysis window
data.drop(index=data[data['year'].isin(['2013','2014','2015'])].index,inplace=True)
data.dropna(inplace=True)
data.head()
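
Listing the unwanted years by hand only removes 2013 to 2015. A range-based filter is one alternative. The sketch below assumes that Video_Date starts with a four-digit year. The sample frame is made up, and this is not the author's code.

```python
import pandas as pd

# Made-up sample dates, including one value that is not a date.
data = pd.DataFrame({
    'Video_Date': ['2012-06-01', '2016-03-04', '2023-01-22', 'unknown'],
})

# Extract the leading four-digit year; values that do not match become NaN.
year = pd.to_numeric(data['Video_Date'].str.extract(r'^(\d{4})')[0], errors='coerce')

# Keep only rows whose year falls inside the analysis window (inclusive).
data = data[year.between(2016, 2023)].copy()
data['year'] = year[data.index].astype(int).astype(str)
print(data['year'].tolist())
```

The `year` column is converted back to a string so that it matches the type produced by the preprocessing code above.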

3. Data visualization

1. Import the visualization library

This visualization mainly uses third-party libraries such as matplotlib, seaborn, pyecharts, etc.

import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts.charts import Bar, Line, Pie, WordCloud
from pyecharts import options as opts
from pyecharts.globals import ThemeType
plt.rcParams['font.sans-serif'] = ['SimHei'] # use a font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False   # render the minus sign correctly

2. Analyze the proportion of movies released in each country

# Share of movies released by each country or region (top 10)
df2 = data.groupby('Video_Address').size().sort_values(ascending=False).head(10)
a1 = Pie(init_opts=opts.InitOpts(theme = ThemeType.LIGHT))
a1.add(series_name='Number of movies',
        data_pair=[list(z) for z in zip(df2.index.tolist(),df2.values.tolist())],
        radius='70%',
        )
a1.set_series_opts(tooltip_opts=opts.TooltipOpts(trigger='item'))
a1.render_notebook()

3. Top 5 directors with the highest number of films released

# Top 5 directors by number of released movies
top_directors = data['Video_Daoyan'].value_counts().head()
a2 = Bar(init_opts=opts.InitOpts(theme = ThemeType.DARK))
a2.add_xaxis(top_directors.index.tolist())
a2.add_yaxis('Number of movies',top_directors.values.tolist())
a2.set_series_opts(itemstyle_opts=opts.ItemStyleOpts(color='#B87333'))
a2.set_series_opts(label_opts=opts.LabelOpts(position="top"))
a2.render_notebook()

4. Analyze the top ten countries with the highest average film rating

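# Top ten countries or regions by average movie rating
# select the rating column before mean() so that text columns are not averaged
data.groupby('Video_Address')['Video_Number'].mean().sort_values(ascending=False).head(10).plot(kind='barh')
plt.show()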

5. Analyze which language is the most popular

# Which language is the most popular
from pyecharts.charts import WordCloud
import collections
result_list = []
for i in data['Video_language'].values:
    word_list = str(i).split('/')
    for j in word_list:
        result_list.append(j)
word_counts = collections.Counter(result_list)
# Word frequency: take the 100 most frequent words
word_counts_top = word_counts.most_common(100)
wc = WordCloud()
wc.add('',word_counts_top)
wc.render_notebook()

6. Analyze which type of movie is the most popular

# Which type of movie is the most popular
from pyecharts.charts import WordCloud
import collections
result_list = []
for i in data['Video_Type'].values:
    word_list = str(i).split('/')
    for j in word_list:
        result_list.append(j)
word_counts = collections.Counter(result_list)
# Word frequency: take the 100 most frequent words
word_counts_top = word_counts.most_common(100)
wc = WordCloud()
wc.add('',word_counts_top)
wc.render_notebook()
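
The counting step relies on each Video_Type value holding several categories separated by "/". A minimal sketch of that idea on made-up values, using only the standard library. `type_counts` is a name chosen for this example, and stripping whitespace around each category is an added assumption.

```python
from collections import Counter

# Made-up Video_Type values in the crawler's "a/b/c" format.
types = ['剧情/爱情', '剧情/动作/科幻', '喜剧', '剧情 / 喜剧']

# Split every value on "/" and strip stray whitespace before counting.
type_counts = Counter(
    part.strip()
    for value in types
    for part in str(value).split('/')
    if part.strip()
)
print(type_counts.most_common(2))
```

Without the strip, values such as "剧情 " and "剧情" would be counted as different categories.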

7. Analyze the ratio of various types of movies

# Share of each movie type
# word_counts is the Counter built from Video_Type in the previous step
word_counts_top = word_counts.most_common(10)
a3 = Pie(init_opts=opts.InitOpts(theme = ThemeType.MACARONS))
a3.add(series_name='Type',
        data_pair=word_counts_top,
        rosetype='radius',
        radius='60%',
        )
a3.set_global_opts(title_opts=opts.TitleOpts(title="Share of each movie type",
                        pos_left='center',
                    pos_top=50))
a3.set_series_opts(tooltip_opts=opts.TooltipOpts(trigger='item',formatter='{a} <br/>{b}:{c} ({d}%)'))
a3.render_notebook()

8. Analyze the distribution of movie lengths

# Distribution of movie runtimes
sns.displot(data['Video_Time'],kde=True)
plt.show()
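
A histogram shows the shape only visually. One quick numeric check is to look at summary statistics and skewness alongside the plot. The sketch below uses made-up runtimes, not the crawled data.

```python
import pandas as pd

# Made-up runtimes in minutes, with one very long film.
runtimes = pd.Series([85, 90, 95, 100, 100, 105, 110, 115, 180])

print(runtimes.describe())

# Skewness near 0 suggests a roughly symmetric shape; a clearly positive
# value indicates a long right tail (a few very long films).
skewness = runtimes.skew()
print(skewness)
```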

9. Analyze the relationship between film length and rating

# Relationship between runtime and rating
plt.scatter(data['Video_Time'],data['Video_Number'])
plt.title('Runtime vs. rating',fontsize=15)
plt.xlabel('Runtime (minutes)',fontsize=15)
plt.ylabel('Rating',fontsize=15)
plt.show()
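
A scatter plot gives only a visual impression of the relationship. A minimal sketch of quantifying it with the Pearson correlation coefficient, on made-up numbers rather than the crawled data:

```python
import pandas as pd

# Made-up runtimes (minutes) and ratings.
data = pd.DataFrame({
    'Video_Time': [90, 100, 110, 120, 150],
    'Video_Number': [6.0, 6.5, 6.4, 7.2, 8.0],
})

# Pearson correlation between runtime and rating: +1 is a perfect positive
# linear relationship, 0 means no linear relationship.
r = data['Video_Time'].corr(data['Video_Number'])
print(round(r, 2))
```

On the real data, the sign and size of `r` indicate how strong the relationship between runtime and rating is.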

10. Count the movies produced each year from 2016 to the present

# Count the movies produced each year from 2016 to the present
df1 = data.groupby('year').size()
line = Line()
line.add_xaxis(xaxis_data=df1.index.to_list())
line.add_yaxis('',y_axis=df1.values.tolist(),is_smooth = True)
line.set_global_opts(xaxis_opts=opts.AxisOpts(splitline_opts = opts.SplitLineOpts(is_show=True)))
line.render_notebook()

4. Summary

This experiment uses crawlers to obtain movie data from 2016 to 2023, and draws the following conclusions through visual analysis:

1. The number of movies gradually increased from 2016 to 2019, reached the maximum in 2019, and began to decline rapidly year by year from 2020.

2. The countries with the largest number of released movies are China and the United States.

3. Drama is the most common movie type.

4. Film length is approximately normally distributed, and film length and rating are positively correlated.

Origin blog.csdn.net/BlueSocks152/article/details/131221680