Python crawler tutorial for beginners (1): crawling Douban movie ranking information

Preface

The text and images in this article come from the Internet and are for learning and communication purposes only, not for any commercial use. If you have any questions, please contact us to have them addressed.

Case tutorial videos on Python crawlers, data analysis, website development, and more are free to watch online:

https://space.bilibili.com/523606542

Basic development environment

  • Python 3.6
  • Pycharm

Related modules used

  • requests
  • parsel
  • csv

Install Python and add it to your environment variables, then use pip to install the required modules.
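
If Python and pip are already set up, the two third-party modules used here can be installed in one command (csv ships with the standard library and needs no install):

pip install requests parsel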

Basic idea of a crawler

(Basic workflow: identify the requirements → send the request → get the data → parse the data → save the data.)

1. Identify the requirements

Crawl the Douban Top 250 movie information:

  • movie title
  • director and starring cast
  • year, country, genre
  • rating and number of reviews
  • movie synopsis

2. Send the request

A large number of open-source modules make coding in Python very simple. The first module to understand when writing a crawler is requests.

Request the URL with a GET request, adding a headers request header to simulate a browser; the server then returns a response object.

# Simulate a browser sending a request
import requests
url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response)

Running this prints <Response [200]>. 200 is the status code, indicating that the request was successful.

2xx (success)
3xx (redirect)
4xx (request error)
5xx (server error)

Common status codes

  • 200 - OK. The server successfully returned the page; the client's request succeeded.
  • 302 - Moved temporarily. The server is currently responding from a different location, but the requester should continue to use the original location for future requests.
  • 304 - Not modified (a kind of redirect). The requested page has not changed since the last request; when the server returns this response, it does not return the page content.
  • 401 - Unauthorized. The request requires authentication; the server may return this for pages that require login.
  • 404 - Not found. The server could not find the requested page.
  • 503 - Service unavailable. The server is currently unavailable (overloaded or down for maintenance); this is usually a temporary state.
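
Before parsing a page, it is also worth checking the status code in code. A minimal sketch (an illustration added here, not part of the original tutorial), reusing the same request as above:

import requests

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
if response.status_code == 200:
    print('Request succeeded, safe to parse')
else:
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()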

3. Get the data

import requests
url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response.text)

requests.get(url=url, headers=headers) requests the page and returns a response object.

response.text: the response body as a string (here, the page's HTML).

response.json(): the response body parsed as JSON (note the parentheses: it is a method call).

These two are the most commonly used; there are others, such as response.content for the raw bytes.
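
A short sketch of the difference (an illustration, not original code): the Top 250 page returns HTML, so response.text is the right choice here, while response.json() is for endpoints that return JSON.

# Continuing from the request above:
html = response.text   # the page's HTML as one string
print(html[:200])      # peek at the first 200 characters

# response.json() parses a JSON body; calling it on an HTML page raises an
# error, so it stays commented out here:
# data = response.json()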

4. Parse the data

Common parsing approaches: regular expressions, CSS selectors, XPath...

Commonly used parsing modules: bs4, parsel, lxml...

Here we use parsel. Both in the previous article and throughout this crawler series, parsel is the parsing library I will use; in my opinion it is nicer to work with than bs4.

parsel is a third-party module; you can install it with pip install parsel.

parsel supports CSS-selector, XPath, and regular-expression extraction.
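
As a quick illustration of all three styles (a sketch with made-up markup, not the actual Douban page):

import parsel

html = '<div class="hd"><a href="#"><span>肖申克的救赎</span></a></div>'
selector = parsel.Selector(text=html)

print(selector.css('.hd span::text').get())                     # CSS selector
print(selector.xpath('//div[@class="hd"]//span/text()').get())  # XPath
print(selector.css('.hd').re_first(r'<span>(.*?)</span>'))      # regex on top of a selector

All three lines print the same title.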

Inspecting the page's HTML shows that each movie's information is contained in its own li tag.

import parsel   # parsing library
import pprint   # formatted-output module

# Convert the response.text string into a Selector object
selector = parsel.Selector(response.text)
# Get all the li tags (one per movie)
lis = selector.css('.grid_view li')
# Iterate over the content of each li tag
for li in lis:
    # Movie title: the text of the first span under the a tag inside the hd class;
    # get() returns one match as a string, getall() returns all matches as a list
    title = li.css('.hd a span:nth-child(1)::text').get()
    movie_list = li.css('.bd p:nth-child(1)::text').getall()
    # First line: director / starring
    star = movie_list[0].strip().replace('\xa0\xa0\xa0', '').replace('/...', '')
    # Second line: year / country / genre, e.g. ['1994', '美国', '犯罪 剧情']
    movie_info = movie_list[1].strip().split('\xa0/\xa0')
    movie_time = movie_info[0]      # release year
    movie_country = movie_info[1]   # country
    movie_type = movie_info[2]      # genre
    rating_num = li.css('.rating_num::text').get()           # rating
    people = li.css('.star span:nth-child(4)::text').get()   # number of reviews
    summary = li.css('.inq::text').get()                     # one-line summary
    dit = {
        '电影名字': title,          # movie title
        '参演人员': star,           # director / cast
        '上映时间': movie_time,     # release year
        '拍摄国家': movie_country,  # country
        '电影类型': movie_type,     # genre
        '电影评分': rating_num,     # rating
        '评价人数': people,         # number of reviews
        '电影概述': summary,        # synopsis
    }
    # pprint pretty-prints the dictionary
    pprint.pprint(dit)
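
One caveat worth adding: .get() returns None when a selector matches nothing, and not every entry necessarily has a one-line summary (.inq). If you would rather have an empty string than None, parsel's get() accepts a default; a drop-in replacement for the summary line above:

summary = li.css('.inq::text').get(default='')   # '' instead of None when .inq is absent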

The code above uses the following knowledge points:

  • the parsel parsing module
  • for loops
  • CSS selectors
  • creating dictionaries
  • indexing into lists
  • string methods: split, replace, etc.
  • the pprint formatted-output module

So a solid grasp of the basics is essential; without it, you won't even know why the code is written the way it is.

5. Save the data (data persistence)

The most common way to save data is with open.

Tabular data like this Douban movie information is best saved as a CSV file, which Excel can open directly.

So we need the csv module.

# Use the csv module to save the data to a CSV file (openable in Excel)
f = open('豆瓣电影数据.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['电影名字', '参演人员', '上映时间', '拍摄国家', '电影类型',
                                           '电影评分', '评价人数', '电影概述'])

csv_writer.writeheader()    # write the header row
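
A practical note beyond the original code: a plain utf-8 CSV often displays garbled Chinese when double-clicked open in Excel, because Excel looks for a byte-order mark. Writing with encoding='utf-8-sig' usually fixes this:

f = open('豆瓣电影数据.csv', mode='a', encoding='utf-8-sig', newline='')   # BOM lets Excel detect UTF-8

Each parsed dictionary is then appended with csv_writer.writerow(dit), as the complete code below shows.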

This crawls the data and saves it locally, but so far only one page of it. Real crawls rarely stop at one page, so to crawl multiple pages you need to work out how the URL changes from page to page.

Comparing consecutive pages shows that the start parameter in the URL increases by 25 per page, so a for loop handles the page turning:

for page in range(0, 250, 25):    # start = 0, 25, ..., 225 covers all 10 pages
    url = f'https://movie.douban.com/top250?start={page}&filter='
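
To double-check the pattern, you can print the generated URLs; the start values step through 0, 25, ..., 225:

for page in range(0, 250, 25):
    print(f'https://movie.douban.com/top250?start={page}&filter=')
# https://movie.douban.com/top250?start=0&filter=
# https://movie.douban.com/top250?start=25&filter=
# ...
# https://movie.douban.com/top250?start=225&filter=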

Complete implementation code

""""""
import pprint
import requests
import parsel
import csv
'''
1. Identify the requirements:
    Crawl the Douban Top 250 movie information:
        movie title
        director and starring cast
        year, country, genre
        rating and number of reviews
        movie synopsis
'''
# Use the csv module to save the data to a CSV file (openable in Excel)
f = open('豆瓣电影数据.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['电影名字', '参演人员', '上映时间', '拍摄国家', '电影类型',
                                           '电影评分', '评价人数', '电影概述'])

csv_writer.writeheader()    # write the header row

# Simulate a browser sending a request, one page at a time
for page in range(0, 250, 25):    # start = 0, 25, ..., 225 covers all 10 pages
    url = f'https://movie.douban.com/top250?start={page}&filter='
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    # Convert the response.text string into a Selector object
    selector = parsel.Selector(response.text)
    # Get all the li tags (one per movie)
    lis = selector.css('.grid_view li')
    # Iterate over the content of each li tag
    for li in lis:
        # get() returns one match as a string, getall() returns all matches as a list
        title = li.css('.hd a span:nth-child(1)::text').get()   # movie title
        movie_list = li.css('.bd p:nth-child(1)::text').getall()
        star = movie_list[0].strip().replace('\xa0\xa0\xa0', '').replace('/...', '')   # director / starring
        movie_info = movie_list[1].strip().split('\xa0/\xa0')   # e.g. ['1994', '美国', '犯罪 剧情']
        movie_time = movie_info[0]      # release year
        movie_country = movie_info[1]   # country
        movie_type = movie_info[2]      # genre
        rating_num = li.css('.rating_num::text').get()           # rating
        people = li.css('.star span:nth-child(4)::text').get()   # number of reviews
        summary = li.css('.inq::text').get()                     # one-line summary
        dit = {
            '电影名字': title,
            '参演人员': star,
            '上映时间': movie_time,
            '拍摄国家': movie_country,
            '电影类型': movie_type,
            '电影评分': rating_num,
            '评价人数': people,
            '电影概述': summary,
        }
        pprint.pprint(dit)          # pretty-print to the console
        csv_writer.writerow(dit)    # append one row to the CSV

Result

Running the script prints each movie's dictionary to the console and writes all 250 entries to 豆瓣电影数据.csv.

Source: blog.csdn.net/m0_48405781/article/details/113055904