Python crawler in practice (1): crawling and analyzing the Douban Movie Top250

Preface

This article walks through crawling and analyzing the Douban Movie Top250 list. The crawling part mainly uses the re, urllib.request, BeautifulSoup, and lxml libraries, while the analysis part mainly uses pandas and matplotlib. At the end, some crawler-related reference materials are listed for interested readers.

Crawling

A crawler, as I see it, uses a computer in place of a human: it simulates how a person browses and searches, and takes over large amounts of repetitive, tedious work.
There are already plenty of comprehensive crawler tutorials online, so I won't go into detail here. My main learning materials were *Web Scraping with Python: Collecting More Data from the Modern Web* and *Web Crawling with Python*; for everything else there is always Baidu and Google. That said, before studying crawlers I suggest first learning some HTML[1] and regular expressions[2]; it will let you get twice the result with half the effort.

Ideas

The first step is to find the page to crawl, in this case the Douban Movie Top250. Press F12 to view the page source and inspect the elements to locate the information you want. You can also right-click an element and copy its XPath for the crawler to use.
(Screenshot: inspecting an element)
(Screenshot: copying the XPath)
There are three ways to locate content for the crawler (a minimal comparison follows the list):

  1. Locate with regular expressions
  2. Locate with the find function in BeautifulSoup
  3. Locate with XPath in lxml
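
As a quick illustration, here is a minimal sketch of all three approaches extracting the same piece of text (the HTML snippet is made up for demonstration):

import re
from bs4 import BeautifulSoup
from lxml.html import fromstring

html = '<div id="info"><span property="v:genre">剧情</span></div>'  # toy example

# 1. Regular expression (lookbehind/lookahead keep only the text between the tags)
print(re.search('(?<=(<span property="v:genre">)).*(?=(</span>))', html).group())

# 2. BeautifulSoup find
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('span', {'property': 'v:genre'}).get_text())

# 3. lxml XPath
tree = fromstring(html)
print(tree.xpath('//*[@id="info"]/span[1]')[0].text_content())

All three print 剧情 ("drama").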

For more detail, see *Web Crawling with Python*.
Looking at the site, there are 10 pages in total, each listing 25 films, and every page follows the same URL pattern. So we can write an outer loop that downloads each list page and collects all the movie links on it, and an inner loop that downloads each movie page and extracts the required information.
(Screenshot: the fields to crawl)
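
For example, the 10 list-page URLs can be generated like this:

# start = 0, 25, ..., 225; each list page shows 25 films
page_urls = ['https://movie.douban.com/top250?start={}&filter='.format(k * 25) for k in range(10)]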

Finally, the crawled results are cleaned and written out to a CSV file.

Code

# -*- coding: utf-8 -*-
"""
Created on Tue Sep 15 09:35:01 2020

@author: zxw
"""
# Import libraries
import re
import pandas as pd
import time
import urllib.request
from lxml.html import fromstring
from bs4 import BeautifulSoup

# Download a page and return its HTML
def download(url):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36') # pose as a regular browser
    resp = urllib.request.urlopen(request)
    html = resp.read().decode('utf-8')
    return html

# Fields to crawl
name = []
year = []
rate = []
director = []
scriptwriter = []
protagonist = []
genre = []
country = []
language = []
length = []

# Crawl each of the 10 list pages
for k in range(10):
    html = download('https://movie.douban.com/top250?start={}&filter='.format(k*25))
    time.sleep(5)   # pause 5s between requests to avoid being blocked
    # find all movie links on this page, keeping their original order
    links = re.findall(r'https://movie\.douban\.com/subject/[0-9]+/', html)
    movie_list = sorted(set(links), key=links.index)
    for movie in movie_list:
        html = download(movie)
        time.sleep(5)
        tree = fromstring(html)
        soup = BeautifulSoup(html, 'html.parser')
        # locate with regular expressions
        name.append(re.search('(?<=(<span property="v:itemreviewed">)).*(?=(</span>))', html).group())
        year.append(re.search(r'(?<=(<span class="year">\()).*(?=(\)</span>))', html).group())
        rate.append(re.search('(?<=(<strong class="ll rating_num" property="v:average">)).*(?=(</strong>))', html).group())
        # locate with XPath
        director.append(tree.xpath('//*[@id="info"]/span[1]')[0].text_content())
        scriptwriter.append(tree.xpath('//*[@id="info"]/span[2]')[0].text_content())
        protagonist.append(tree.xpath('//*[@id="info"]/span[3]')[0].text_content())
        # locate with find_all
        genres = soup.find_all('span', {'property': 'v:genre'})
        # join the genres with '/'
        temp = []
        for each in genres:
            temp.append(each.get_text())
        genre.append('/'.join(temp))
        # locate with find, then take the sibling text node
        country.append(soup.find(text='制片国家/地区:').parent.next_sibling)
        language.append(soup.find(text='语言:').parent.next_sibling)
        length.append(soup.find('span', {'property': 'v:runtime'}).get_text())

# Convert each list to a DataFrame
name_pd = pd.DataFrame(name)
year_pd = pd.DataFrame(year)
rate_pd = pd.DataFrame(rate)
director_pd = pd.DataFrame(director)
scriptwriter_pd = pd.DataFrame(scriptwriter)
protagonist_pd = pd.DataFrame(protagonist)
genre_pd = pd.DataFrame(genre)
country_pd = pd.DataFrame(country)
language_pd = pd.DataFrame(language)
length_pd = pd.DataFrame(length)
# Concatenate into one table
movie_data = pd.concat([name_pd,year_pd,rate_pd,director_pd,scriptwriter_pd,protagonist_pd,genre_pd,country_pd,language_pd,length_pd],axis=1)
movie_data.columns=['电影','年份','评分','导演','编剧','主演','类型','国家/地区','语言','时长']

# Keep only the Chinese title of each film
f = lambda x: re.split(' ', x)[0]
movie_data['电影'] = movie_data['电影'].apply(f)
# Strip the 4-character leading label such as '导演: ' (x[4:-1] + x[-1] is just x[4:])
g = lambda x: x[4:]
movie_data['导演'] = movie_data['导演'].apply(g)
movie_data['编剧'] = movie_data['编剧'].apply(g)
movie_data['主演'] = movie_data['主演'].apply(g)
movie_data.head()

# Output
outputpath = 'c:/Users/zxw/Desktop/修身/与自己/数据分析/数据分析/爬虫/豆瓣/data/movie.csv' ## change this to your own path
movie_data.to_csv(outputpath,sep=',',index=False,header=True,encoding='utf_8_sig')

The results are shown below; the full data set is available in Annex [3].
(Screenshot: the crawled results)
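
One caveat: the download function above has no error handling, so a single network error or a 403 response will abort the whole crawl. Below is a minimal sketch of a more defensive variant; the retry count and the None fallback are my own assumptions, not part of the original script:

import time
import urllib.request
from urllib.error import URLError

def download_safe(url, retries=3):
    # Like download(), but retries on failure and returns None instead of crashing
    request = urllib.request.Request(url)
    request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
    for attempt in range(retries):
        try:
            resp = urllib.request.urlopen(request)
            return resp.read().decode('utf-8')
        except URLError as e:  # HTTPError is a subclass of URLError
            print('Download failed ({}), attempt {}/{}'.format(e, attempt + 1, retries))
            time.sleep(5)  # back off before retrying
    return None  # callers must check for None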

Analysis

Next, we do some simple analysis of the crawled movie data.

Preliminary preparation

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import re
# Read the data
movie_data = pd.read_csv('c:/Users/zxw/Desktop/修身/与自己/数据分析/数据分析/爬虫/豆瓣/data/movie.csv')
movie_data.head()

Three major years

year_counts = movie_data['年份'].value_counts()  # number of Top250 films per year
plt.figure(figsize=(15, 6.5))
year_counts.sort_index().plot(kind='bar')

(Bar chart: number of Top250 films per year)
1994, 2004, and 2010 are the three years that appear most often in the Douban Movie Top250.

Three directors

f = lambda x: re.split('/', x)  # split the '/'-joined names
director_list = movie_data['导演'].apply(f)
directors = []
for element in director_list:
    for director in element:
        director = director.replace(" ", "")
        directors.append(director)
directors_pd = pd.Series(directors)
directors_pd.value_counts().head(10)

Steven Spielberg 7
Christopher Nolan 7
Hayao Miyazaki 7
Ang Lee 5
Wong Kar-wai 5
David Fincher 4
Hirokazu Kore-eda 4
Richard Linklater 3
James Cameron 3
Jiang Wen 3
dtype: int64

Spielberg, Nolan, and Hayao Miyazaki form the first tier; among Chinese-language directors, Ang Lee and Wong Kar-wai appear on the list the most.

Top screenwriters

scriptwriter_list = movie_data['编剧'].apply(f)
scriptwriters = []
for element in scriptwriter_list:
    for scriptwriter in element:
        scriptwriter = scriptwriter.replace(" ", "")
        scriptwriters.append(scriptwriter)
scriptwriter_pd = pd.Series(scriptwriters)
scriptwriter_pd.value_counts().head(10)

Hayao Miyazaki 9
Christopher Nolan 7
Steve Kloves 5
Wong Kar-wai 5
J.K. Rowling 5
Jonathan Nolan 5
Andrew Niccol 4
Pete Docter 4
Hirokazu Kore-eda 4
James Cameron 3
dtype: int64

Hayao Miyazaki is simply awesome.

Two actors

actor_list = movie_data['主演'].apply(f)
actors = []
for element in actor_list:
    for actor in element:
        actor = actor.replace(" ", "")
        actors.append(actor)
actors_pd = pd.Series(actors)
actors_pd.value_counts().head(10)

Tony Leung Chiu-wai 8
Leslie Cheung 8
Hugo Weaving 7
Maggie Cheung 7
Alan Rickman 7
Gary Oldman 6
Stephen Chow 6
Matt Damon 6
Leonardo DiCaprio 6
Tom Hanks 6
dtype: int64

Tony Leung Chiu-wai and Leslie Cheung appear on the list the most.
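
Incidentally, the same split/strip/count logic is written out three times above. A small helper would remove the duplication (the name top_people is my own, not from the original code):

import re
import pandas as pd

def top_people(column, n=10):
    # Split a '/'-separated name column, strip spaces, and count appearances
    people = []
    for element in column.apply(lambda x: re.split('/', x)):
        for person in element:
            people.append(person.replace(' ', ''))
    return pd.Series(people).value_counts().head(n)

# Usage:
# top_people(movie_data['导演'])
# top_people(movie_data['编剧'])
# top_people(movie_data['主演'])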

The full code is available in Annex [3].

Postscript

A few thoughts

This is my first article. I am currently a senior majoring in applied mathematics, interested in data-science research, and learning almost everything from scratch. After about a week of studying crawlers, I completed this entry-level hands-on project. It surely has many shortcomings, and readers are welcome to offer suggestions.
Going forward, I will post what I learn from time to time: first as a summary for myself, and second in the hope that it helps others.
A journey of a thousand miles begins with a single step. The road ahead is long; I will keep searching, high and low.

Reference


  1. HTML tutorial ↩︎

  2. Regular expressions ↩︎

  3. Data set and code. Extraction code: fim3 ↩︎ ↩︎
