Douban Top 250 movie crawler and data visualization

 

I. Project

1.1  Project blog address

https://www.cnblogs.com/venus-ping/

1.2 Features and characteristics of the completed project

    The crawler scrapes Douban's Top 250 movie list, collects movie-related information, and visualizes the collected data. The data gathered includes each film's release year, country, rating, genre, number of raters, director, and starring actors. The visualizations cover the distribution of producing countries, the proportion of each film genre, the ten directors with the most listed works, and the number of works featuring outstanding actors.

1.3 Technology stack adopted by the project

Software used: Visual Studio Code, JetBrains PyCharm

Technologies used: pyecharts, MongoDB, and Python third-party libraries

1.4 Reference source code address

Python crawler - Douban Top 250 and data visualization   https://www.jianshu.com/p/deaf10d4fd9b

1.5 Team member task allocation

Chen Jiaping: crawl the Top 250 movie information; visualize the top 10 directors by number of works and the proportion of film genres.

Wu Linlin: design and implement the Top 250 movie information crawler; visualize the distribution of producing countries.

Xiao Ruyun: crawl the Top 250 movie information; visualize the number of raters and the number of works by outstanding actors.

II. Requirements analysis

   With ever more film and television works of uneven quality, crawling and analyzing Douban's Top 250 movies makes it easier to choose which film to watch.

III. Functional architecture diagram and main flowcharts

3.1 Function chart

 

 Figure 1: Function chart

3.2 Main functional flowcharts

 

 

 

 

 

 Figure 2: Crawler flow

 

 

 Figure 3: Country information analysis

 

 

 Figure 4: mvtop250 movie information

 

 

 

 

 Figure 5: Top 10 directors

 

 Figure 6: Actor analysis (yanyuan)

 

 Figure 7: Film genres

IV. System module description

   4.1 System module list

 

 

 Figure 8: Program structure

   4.2 Detailed module description (name, function, operation, key source code)

1. mvtop250.py: crawls the Douban Top 250 movie information

1) Build a loop that crawls the list page by page
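The code screenshot for this step is missing from the post. As a minimal sketch, the Douban Top 250 list is paged via its `start` query parameter (offset of the first film on each page), so crawling page by page reduces to generating ten offsets:

```python
# Douban's Top 250 list is split into 10 pages of 25 films; the
# `start` query parameter is the offset of the first film on a page.
BASE_URL = 'https://movie.douban.com/top250?start={}'

def page_urls(pages=10, per_page=25):
    """Yield the URL of every list page, in order."""
    for page in range(pages):
        yield BASE_URL.format(page * per_page)

urls = list(page_urls())
```

Each generated URL would then be handed to the `get_movie_list(url, headers)` function shown below.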

 

 

 

 

2) Establish a MongoDB connection for data storage
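The connection code is also missing from the post. A minimal pymongo sketch, assuming a MongoDB instance on localhost and illustrative database/collection names (`douban` / `top250`):

```python
def get_collection(host='localhost', port=27017):
    """Return the collection that stores the crawled films.
    pymongo is imported lazily so the sketch loads without it;
    the 'douban'/'top250' names are illustrative."""
    from pymongo import MongoClient
    client = MongoClient(host, port)
    return client['douban']['top250']

# Typical use: each crawled film becomes one document, e.g.
# item_info = get_collection()
# item_info.insert_one({'name': name, 'year': year, 'country': country})
```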

 

 

 

 

3) Scrape each film's release year, country, rating, genre, and number of raters

import requests
from bs4 import BeautifulSoup

def get_movie_list(url, headers):
    # send the request and parse the response for easy handling
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')  # the lxml parser is fast and fault-tolerant
    items = soup.select('div.info')
    # loop over every film entry on the page
    for item in items:  # renamed from `list` to avoid shadowing the built-in
        # detail-page link: the href of the first <a> tag
        sing_url = item.select('a')[0].get('href')
        # film title
        name = item.select('div.hd .title')[0].text
        # the credit line also carries year / country / genre
        type_list = item.select('div.bd p')[0].text.strip().split('...')[-1].replace(' ', '').split('/')
        # release year
        year = type_list[0]
        # country
        country = type_list[1]
        # genre
        category = type_list[2]
        # rating
        star = item.select('div.bd .star .rating_num')[0].text.replace(' ', '')
        # quote
        quote = item.select('div.bd .quote')[0].text
        # number of raters, e.g. '1234567人评价'
        people_num = item.select('div.bd .star span:nth-of-type(4)')[0].text.split('人')[0]
        get_detail_movie(sing_url, name, year, country, category, star, quote, people_num, headers)

 

4) Scrape each film's director and actors, and save the data to MongoDB
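The code screenshot for this step is missing. As one possible sketch, the credit line of a Douban entry (e.g. '导演: 宫崎骏 主演: 柊瑠美 / 入野自由') can be split with regular expressions; the function name is illustrative, and the resulting document would then be inserted into MongoDB alongside the other fields:

```python
import re

def parse_credits(info_text):
    """Split a Douban credit line into (director, actors).
    `info_text` looks like '导演: 宫崎骏 主演: 柊瑠美 / 入野自由';
    actors on Douban are separated by '/'."""
    director, actors = '', []
    m = re.search(r'导演:\s*(.+?)(?:\s+主演:|$)', info_text)
    if m:
        director = m.group(1).strip()
    m = re.search(r'主演:\s*(.+)', info_text)
    if m:
        actors = [a.strip() for a in m.group(1).split('/') if a.strip()]
    return director, actors
```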

 

 

 

 

We save the scraped movie information to the database for later analysis. The result is as follows:

 

 

 

 

2. country.py: builds a ranked list of the countries that produced the most Top 250 films

 1) Build the ranked country list

# build the country list
country_list = []
# loop over the stored documents and collect country names
for i in item_info.find():
    if '' not in i['country']:
        for j in i['country']:
            if j != '':
                # append the cleaned country name
                country_list.append(str(j).strip('\xa0'))
# de-duplicate, then pair each country with its occurrence count
country_list1 = list(set(country_list))
append_list = []
for i in country_list1:
    list11 = [str(i), country_list.count(i)]
    append_list.append(list11)
# sort by count, descending, and keep the top entries
list22 = sorted(append_list, key=lambda d: d[1], reverse=True)[:10]

2) Draw the pie chart

from pyecharts import options as opts
from pyecharts.charts import Pie

c = (Pie().add("",
            [list(z) for z in zip((a[0] for a in list22), (a[1] for a in list22))],
            center=["35%", "50%"])
        .set_global_opts(
            title_opts=opts.TitleOpts(title="豆瓣top250电影产源国家数量占比"),
            legend_opts=opts.LegendOpts(pos_bottom='0'))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}")))
c.render('country.html')  # render() writes the chart to an .html file

The result is as follows:

 

 

 

 

 

3. top10导演.py: organizes the scraped information, finds the 10 directors with the most works, and visualizes the result

1) Count the works of every director

 

 

 

 

2) Select the 10 directors with the most works

 

 

 

 

3) Draw the bar chart
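The code screenshots for steps 1) to 3) are missing from the post. A sketch under the assumption that each stored document carries a `director` field; `collections.Counter` handles the tallying, and the chart filename and series label are illustrative:

```python
from collections import Counter

def top_directors(movies, n=10):
    """Count how many Top 250 entries each director has and
    return the n busiest as (name, count) pairs."""
    counts = Counter(m['director'] for m in movies)
    return counts.most_common(n)

def render_bar(pairs, out='top10_director.html'):
    """Draw the top-10 bar chart. pyecharts is imported lazily so
    the counting logic above runs without it."""
    from pyecharts import options as opts
    from pyecharts.charts import Bar
    bar = (Bar()
           .add_xaxis([name for name, _ in pairs])
           .add_yaxis('作品数', [num for _, num in pairs])
           .set_global_opts(title_opts=opts.TitleOpts(title='豆瓣top250作品数前10导演')))
    bar.render(out)
```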

 

 

 

 

4) The result is as follows:

 

 

 

 

 

 

 

4. category.py: proportion of film genres

 1) Classify the scraped films and collect the genres into a list

 

 

 

 

2) Loop over the genre list, incrementing the count for each film of the same genre

 

 

 

 

3) Draw the pie chart
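The code screenshots for steps 1) to 3) are missing. A sketch assuming each document's `category` field holds one or more space-separated genres; the chart filename is illustrative:

```python
from collections import Counter

def category_share(movies):
    """Tally genres across all films; a film may carry several
    space-separated genres in its 'category' field."""
    counts = Counter()
    for m in movies:
        for genre in m['category'].split():
            counts[genre] += 1
    return counts

def render_pie(counts, out='category.html'):
    """Draw the genre pie chart (pyecharts imported lazily)."""
    from pyecharts import options as opts
    from pyecharts.charts import Pie
    pie = (Pie()
           .add('', list(counts.items()))
           .set_global_opts(title_opts=opts.TitleOpts(title='豆瓣top250影片类型比重'))
           .set_series_opts(label_opts=opts.LabelOpts(formatter='{b}: {c}')))
    pie.render(out)
```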

 

 

 

 

The result is as follows:

 

 As the pie chart shows, romance films are the most popular with audiences.

 

 

5. people.py: analysis of the number of raters

1) Analyze the number of raters of each scraped film

 

 

 

 

 

2) Draw the bar chart
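The code screenshots for both steps are missing. A sketch assuming each document has `name` and `people_num` fields (the rater count scraped by get_movie_list); the chart filename is illustrative:

```python
def rating_counts(movies, n=10):
    """Return the n films with the most raters as (name, count)
    pairs, sorted in descending order."""
    pairs = [(m['name'], int(m['people_num'])) for m in movies]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]

def render_people_bar(pairs, out='people.html'):
    """Bar chart of rater counts (pyecharts imported lazily)."""
    from pyecharts import options as opts
    from pyecharts.charts import Bar
    bar = (Bar()
           .add_xaxis([name for name, _ in pairs])
           .add_yaxis('评论人数', [num for _, num in pairs])
           .set_global_opts(title_opts=opts.TitleOpts(title='豆瓣top250影片评论人数')))
    bar.render(out)
```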

 

 

 

 

The result is as follows:

 

 

 

 

 

 

 

 

6. yanyuan.py: analyzes how many works feature each outstanding actor

1) Analyze the scraped actor lists

 

 

 

 

2) Draw the funnel chart
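The code screenshots for both steps are missing. A sketch assuming each document stores its cast as a list in an `actors` field; the series label and filename are illustrative:

```python
from collections import Counter

def top_actors(movies, n=10):
    """Count appearances per actor across all Top 250 films and
    return the n busiest as (name, count) pairs."""
    counts = Counter()
    for m in movies:
        for actor in m.get('actors', []):
            counts[actor] += 1
    return counts.most_common(n)

def render_funnel(pairs, out='yanyuan.html'):
    """Funnel chart of the busiest actors (pyecharts imported lazily)."""
    from pyecharts import options as opts
    from pyecharts.charts import Funnel
    funnel = (Funnel()
              .add('参演作品数', pairs)
              .set_global_opts(title_opts=opts.TitleOpts(title='豆瓣top250优秀演员参演作品数量')))
    funnel.render(out)
```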

 

 

 

 

The result is as follows:

 

 

 

 

 

 

 

 

V. Project summary

 5.1 Features

Different techniques are combined to implement crawling, data storage, and data visualization: MongoDB stores the scraped data, and the pyecharts package renders the charts. After render() generates the individual HTML files, an index file links all of them together.

 5.2 Shortcomings

 1. The amount of scraped data is limited.
 2. With a large data volume, crawling is slow.
 3. The crawler cannot judge the importance of links or the value of a page's data.
 4. MongoDB, rather than MySQL, is used for data storage.


Origin www.cnblogs.com/venus-ping/p/12049579.html